llama3.2-vision

Llama 3.2-Vision 多模態大型語言模型 (LLM) 系列，是一系列經過指令微調的圖像推理生成模型，具有 11B 和 90B 兩種尺寸（文字 + 圖像輸入 / 文字輸出）。Llama 3.2-Vision 經過指令微調的模型針對視覺辨識、圖像推理、圖像說明和回答關於圖像的通用問題進行了最佳化。這些模型在常見的產業基準測試中，效能超越了許多現有的開源和封閉式多模態模型。

支援語言：對於僅限文字的任務，官方支援英語、德語、法語、義大利語、葡萄牙語、印地語、西班牙語和泰語。Llama 3.2 的訓練語料庫涵蓋比這 8 種支援語言更廣泛的語言。請注意，對於圖像+文字應用，僅支援英語。

使用方法

首先，提取模型

ollama pull llama3.2-vision

Python 函式庫

若要搭配 Ollama Python 函式庫使用 Llama 3.2 Vision

import ollama

response = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'What is in this image?',
        'images': ['image.jpg']
    }]
)

print(response)

JavaScript 函式庫

若要搭配 Ollama JavaScript 函式庫使用 Llama 3.2 Vision

import ollama from 'ollama'

const response = await ollama.chat({
  model: 'llama3.2-vision',
  messages: [{
    role: 'user',
    content: 'What is in this image?',
    images: ['image.jpg']
  }]
})

console.log(response)

cURL

curl https://127.0.0.1:11434/api/chat -d '{
  "model": "llama3.2-vision",
  "messages": [
    {
      "role": "user",
      "content": "what is in this image?",
      "images": ["<base64-encoded image data>"]
    }
  ]
}'

參考文獻

GitHub

HuggingFace

The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.

Supported Languages: For text only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Note for image+text applications, English is the only language supported.

## Usage

First, pull the model:

```bash
ollama pull llama3.2-vision
```

### Python Library

To use Llama 3.2 Vision with the Ollama [Python library](https://github.com/ollama/ollama-python):

```python
import ollama

response = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'What is in this image?',
        'images': ['image.jpg']
    }]
)

print(response)
```

### JavaScript Library

To use Llama 3.2 Vision with the Ollama [JavaScript library](https://github.com/ollama/ollama-js):

```javascript
import ollama from 'ollama'

const response = await ollama.chat({
  model: 'llama3.2-vision',
  messages: [{
    role: 'user',
    content: 'What is in this image?',
    images: ['image.jpg']
  }]
})

console.log(response)
```

### cURL

```shell
curl https://127.0.0.1:11434/api/chat -d '{
  "model": "llama3.2-vision",
  "messages": [
    {
      "role": "user",
      "content": "what is in this image?",
      "images": ["<base64-encoded image data>"]
    }
  ]
}'
```

## References

[GitHub](https://github.com/meta-llama/llama-models)

[HuggingFace](https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf)

貼上、拖曳或點擊以載入圖片 (.png, .jpeg, .jpg, .svg, .gif)