llama3.2-vision

The Llama 3.2-Vision collection of multimodal large language models (LLMs) consists of instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open-source and closed multimodal models on common industry benchmarks.

Supported Languages: For text-only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Note that for image+text applications, English is the only supported language.

## Usage

First, pull the model:

```bash
ollama pull llama3.2-vision
```

### Python Library

To use Llama 3.2 Vision with the Ollama [Python library](https://github.com/ollama/ollama-python):

```python
import ollama

response = ollama.chat(
    model='llama3.2-vision',
    messages=[{
        'role': 'user',
        'content': 'What is in this image?',
        'images': ['image.jpg']
    }]
)

print(response)
```
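
The printed response includes metadata alongside the model's answer. As a minimal sketch, assuming the response follows the Ollama chat response shape (a `message` entry holding `role` and `content`), the reply text could be extracted as shown here. The helper name `extract_reply` and the sample data are hypothetical and only for illustration.

```python
def extract_reply(response):
    """Return the assistant's text from a chat response.

    Assumes the response supports response['message']['content'],
    which is the shape of an Ollama chat response.
    """
    return response['message']['content']


# Illustrative stand-in for a real response; the reply text is made up.
sample_response = {
    'model': 'llama3.2-vision',
    'message': {'role': 'assistant', 'content': 'A cat sitting on a sofa.'},
    'done': True,
}

print(extract_reply(sample_response))
```

With a real call, passing the value returned by `ollama.chat(...)` to the helper should work the same way, since that object can be indexed like a dictionary.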

### JavaScript Library

To use Llama 3.2 Vision with the Ollama [JavaScript library](https://github.com/ollama/ollama-js):

```javascript
import ollama from 'ollama'

const response = await ollama.chat({
  model: 'llama3.2-vision',
  messages: [{
    role: 'user',
    content: 'What is in this image?',
    images: ['image.jpg']
  }]
})

console.log(response)
```

### cURL

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2-vision",
  "messages": [
    {
      "role": "user",
      "content": "what is in this image?",
      "images": ["<base64-encoded image data>"]
    }
  ]
}'
```
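
The `images` field in the request body expects base64-encoded image data. The following is a minimal Python sketch of preparing that body using only the standard library. The helper names `encode_image` and `build_chat_payload` are hypothetical, and `image.jpg` is a placeholder path.

```python
import base64
import json
from pathlib import Path


def encode_image(path):
    """Read an image file and return its contents as a base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode('ascii')


def build_chat_payload(prompt, image_b64, model='llama3.2-vision'):
    """Build a request body with the same fields as the cURL example."""
    return {
        'model': model,
        'messages': [{
            'role': 'user',
            'content': prompt,
            'images': [image_b64],
        }],
    }


# Example usage (assumes image.jpg exists in the working directory):
# payload = build_chat_payload('What is in this image?', encode_image('image.jpg'))
# print(json.dumps(payload))
```

The JSON string produced by `json.dumps(payload)` can then be sent as the request body, for example by saving it to a file and passing it to `curl` with `-d @payload.json`.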

## References

[GitHub](https://github.com/meta-llama/llama-models)

[HuggingFace](https://huggingface.co/collections/meta-llama/llama-32-66f448ffc8c32f949b04c8cf)
