mistral-small:22b-instruct-2409-q3_K_L

我們與外部協力廠商供應商針對一組超過 1k 個專有編碼和通用提示進行了並排評估。評估人員的任務是從 Mistral Small 3 與另一個模型產生的匿名生成結果中選擇他們偏好的模型回應。我們知道，在某些情況下，關於人類判斷的基準與公開可用的基準截然不同，但我們已格外謹慎地驗證了公平的評估。我們確信上述基準是有效的。

指令效能

我們的指令微調模型在程式碼、數學、一般知識和指令遵循基準方面的效能與比其大三倍的開放權重模型以及專有的 GPT4o-mini 模型相比，具有競爭力。

所有基準的效能準確度均透過相同的內部評估管道獲得 - 因此，數字可能與先前報告的效能（Qwen2.5-32B-Instruct、Llama-3.3-70B-Instruct、Gemma-2-27B-IT）略有不同。基於判斷的評估（例如 Wildbench、Arena hard 和 MTBench）基於 gpt-4o-2024-05-13。

客戶正在多個產業中評估 Mistral Small 3，包括

金融服務客戶用於詐欺偵測
醫療保健提供者用於客戶分流
機器人技術、汽車和製造公司用於設備端命令和控制
跨客戶的橫向用例包括虛擬客戶服務以及情緒和回饋分析。

Mistral Small 3 sets a new benchmark in the "small" Large Language Models category below 70B, boasting 24B parameters and achieving state-of-the-art capabilities comparable to larger models.

Mistral Small can be deployed locally and is exceptionally "knowledge-dense", fitting in a single RTX 4090 or a 32GB RAM MacBook once quantized.
Perfect for:

- Fast response conversational agents.
- Low latency function calling.
- Subject matter experts via fine-tuning.
- Local inference for hobbyists and organizations handling sensitive data.

### Key Features
- **Multilingual:** Supports dozens of languages, including English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, and Polish.
- **Agent-Centric:** Offers best-in-class agentic capabilities with native function calling and JSON outputting.
- **Advanced Reasoning:** State-of-the-art conversational and reasoning capabilities.
- **Apache 2.0 License:** Open license allowing usage and modification for both commercial and non-commercial purposes.
- **Context Window:** A 32k context window.
- **System Prompt:** Maintains strong adherence and support for system prompts.
- **Tokenizer:** Utilizes a Tekken tokenizer with a 131k vocabulary size.

### Human Evaluations 
![Human ratings](/assets/library/mistral-small/90f227bd-9751-4fe9-aa23-4d5d89b9d0c6)

We conducted side by side evaluations with an external third-party vendor, on a set of over 1k proprietary coding and generalist prompts. Evaluators were tasked with selecting their preferred model response from anonymized generations produced by Mistral Small 3 vs another model. We are aware that in some cases the benchmarks on human judgement starkly differ from publicly available benchmarks, but have taken extra caution in verifying a fair evaluation. We are confident that the above benchmarks are valid.

### Instruct performance 
Our instruction tuned model performs competitively with open weight models three times its size and with proprietary GPT4o-mini model across Code, Math, General knowledge and Instruction following benchmarks.

![instruct performance](/assets/library/mistral-small/d27f75e4-0dae-4721-bade-2999a1dd4a7b)
![instruct performance](/assets/library/mistral-small/e677ae9e-edfa-47f9-a35c-b1eb9dfb51c8)

![instruct performance](/assets/library/mistral-small/4545bb49-d87f-4731-bfdb-e191dc2c2a9a)

Performance accuracy on all benchmarks were obtained through the same internal evaluation pipeline - as such, numbers may vary slightly from previously reported performance (Qwen2.5-32B-Instruct, Llama-3.3-70B-Instruct, Gemma-2-27B-IT). Judge based evals such as Wildbench, Arena hard and MTBench were based on gpt-4o-2024-05-13.
 
Customers are evaluating Mistral Small 3 across multiple industries, including:

- Financial services customers for fraud detection
- Healthcare providers for customer triaging 
- Robotics, automotive, and manufacturing companies for on-device command and control
- Horizontal use cases across customers include virtual customer service, and sentiment and feedback analysis.

貼上、拖曳或點擊以上傳圖片 (.png, .jpeg, .jpg, .svg, .gif)