minicpm-v


> Note: this model requires [Ollama 0.3.10](https://github.com/ollama/ollama/releases/tag/v0.3.10) or later.
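The version requirement above can be checked programmatically. The following is a minimal sketch, assuming a local Ollama server on its default port (`http://localhost:11434`) that reports its version at `/api/version`. The helper names `parse_version`, `meets_minimum`, and `installed_version` are hypothetical and introduced only for illustration. Versions are compared as integer tuples because a plain string comparison would rank `0.3.9` above `0.3.10`.

```python
import json
import urllib.request

# Minimum Ollama version stated in the note above.
MIN_VERSION = (0, 3, 10)


def parse_version(text):
    """Turn '0.3.10', 'v0.3.10' or '0.3.10-rc1' into a tuple of ints."""
    # Drop a leading 'v' and any pre-release or build suffix.
    core = text.strip().lstrip("v").split("-")[0].split("+")[0]
    return tuple(int(part) for part in core.split("."))


def meets_minimum(text, minimum=MIN_VERSION):
    """Return True if the version string is at least the required minimum."""
    return parse_version(text) >= minimum


def installed_version(base_url="http://localhost:11434"):
    """Ask a running Ollama server for its version string.

    Assumes the server is reachable at base_url; raises URLError otherwise.
    """
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=5) as resp:
        return json.load(resp)["version"]
```

With a server running, `meets_minimum(installed_version())` should indicate whether the installation is recent enough for this model.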

MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:

* **🔥 Leading Performance**: MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet for single image understanding.

* **🖼️ Multi-Image Understanding and In-context Learning**: MiniCPM-V 2.6 can also perform conversation and reasoning over multiple images. It achieves state-of-the-art performance on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv, and also shows promising in-context learning capability.

* **💪 Strong OCR Capability**: MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. Based on the latest RLAIF-V and VisCPM techniques, it features trustworthy behavior, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports multiple languages, including English, Chinese, German, French, Italian, and Korean.

* **🚀 Superior Efficiency**: In addition to its friendly size, MiniCPM-V 2.6 also shows state-of-the-art token density (i.e., number of pixels encoded into each visual token). It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models. This directly improves the inference speed, first-token latency, memory usage, and power consumption.
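The token-density figure in the last bullet can be worked out from the numbers quoted there. The sketch below uses the 1344x1344 example image and the 640-token count from the text. The function name `token_density` is hypothetical, and the "implied baseline" is only back-calculated from the stated 75% reduction; it is not a measured value for any particular model.

```python
def token_density(width, height, visual_tokens):
    """Pixels encoded per visual token for an image of the given size."""
    return (width * height) / visual_tokens


# Figures quoted in the feature list above.
WIDTH, HEIGHT = 1344, 1344   # example image, about 1.8M pixels
VISUAL_TOKENS = 640          # tokens produced for that image
REDUCTION = 0.75             # "75% fewer than most models"

pixels = WIDTH * HEIGHT
density = token_density(WIDTH, HEIGHT, VISUAL_TOKENS)

# Token count a typical model would need if 640 is 75% fewer.
# Derived from the stated percentage only, not a benchmark result.
implied_baseline_tokens = VISUAL_TOKENS / (1 - REDUCTION)

print(f"pixels: {pixels}")
print(f"pixels per visual token: {density:.1f}")
print(f"implied baseline tokens: {implied_baseline_tokens:.0f}")
```

Under these figures each visual token covers roughly 2,800 pixels, and the stated reduction corresponds to a baseline of about 2,560 tokens for the same image.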

## References

[GitHub](https://github.com/OpenBMB/MiniCPM-V)

[Hugging Face](https://huggingface.co/openbmb/MiniCPM-V-2_6)
