opencoder

**OpenCoder** is an open and reproducible family of code LLMs that includes 1.5B and 8B models and supports both English and Chinese. Trained from scratch, OpenCoder is pretrained on 2.5 trillion tokens (90% raw code, 10% code-related web data) and then fine-tuned with supervision on over 4.5M high-quality SFT examples, reaching the performance of top-tier code LLMs. We provide not only model weights and inference code, but also reproducible training data, the complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols. OpenCoder is meant to be an open foundation that researchers can build on to advance code AI.
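To make the pretraining mix concrete, the stated percentages can be turned into absolute token counts. The snippet below is only an arithmetic illustration; the names `TOTAL_TOKENS` and `split_tokens` are made up for this example and do not come from the OpenCoder codebase.

```python
# Arithmetic sketch of the pretraining mix described above.
# All names here are illustrative, not part of any OpenCoder API.

TOTAL_TOKENS = 2_500_000_000_000  # 2.5 trillion tokens, as stated in the text
RAW_CODE_PERCENT = 90             # share of raw code
WEB_DATA_PERCENT = 10             # share of code-related web data


def split_tokens(total: int, percent: int) -> int:
    """Return the number of tokens corresponding to `percent` of `total`."""
    return total * percent // 100


raw_code_tokens = split_tokens(TOTAL_TOKENS, RAW_CODE_PERCENT)
web_data_tokens = split_tokens(TOTAL_TOKENS, WEB_DATA_PERCENT)

print(f"raw code:  {raw_code_tokens / 1e12:.2f} trillion tokens")
print(f"web data:  {web_data_tokens / 1e12:.2f} trillion tokens")
```

Under the stated 90/10 split, this works out to 2.25 trillion tokens of raw code and 0.25 trillion tokens of code-related web data.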

- **Complete Open Source**: OpenCoder ensures full transparency by releasing not only the model weights and forthcoming inference code but also the complete data-cleaning code for training. This release includes high-quality synthetic data, an extensive set of checkpoints, and a dataset of over 4.5 million supervised fine-tuning (SFT) entries, making OpenCoder one of the most comprehensively open-sourced models available.
- **Comprehensive Experimental Analysis**: OpenCoder is rigorously tested through extensive ablation studies on data-cleaning strategies and training processes, including file-level and repository-level deduplication experiments, which thoroughly explore and validate the model’s performance.
- **High-Quality Synthetic Data**: OpenCoder provides a fully developed synthetic data generation process and over 4.5 million SFT data entries, establishing a robust data foundation for model training and evaluation.
- **Exceptional Performance**: OpenCoder achieves high performance across multiple language model benchmarks, positioning it among the leading open-source models for code.
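The file-level and repository-level deduplication mentioned above can be illustrated with a small sketch. This is a generic exact-match approach based on content hashing, written under the assumption that two items count as duplicates only when their contents are byte-identical; it is not OpenCoder's actual pipeline, and the function names (`content_digest`, `dedup_files_exact`, `dedup_repos_exact`) are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Set, Tuple


def content_digest(content: str) -> str:
    """SHA-256 hex digest of a file's text content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def dedup_files_exact(files: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """File-level dedup: keep the first (path, content) pair per distinct content."""
    seen: Set[str] = set()
    kept: List[Tuple[str, str]] = []
    for path, content in files:
        digest = content_digest(content)
        if digest in seen:
            continue  # identical content was already kept under another path
        seen.add(digest)
        kept.append((path, content))
    return kept


def repo_fingerprint(repo_files: Dict[str, str]) -> str:
    """Fingerprint a repository by hashing the sorted digests of its files.

    Assumption for this sketch: file paths are ignored, so two repositories
    with the same file contents under different names count as duplicates.
    """
    digests = sorted(content_digest(c) for c in repo_files.values())
    return hashlib.sha256("\n".join(digests).encode("utf-8")).hexdigest()


def dedup_repos_exact(repos: Dict[str, Dict[str, str]]) -> List[str]:
    """Repository-level dedup: keep the first repository per distinct fingerprint."""
    seen: Set[str] = set()
    kept: List[str] = []
    for name, repo_files in repos.items():
        fingerprint = repo_fingerprint(repo_files)
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(name)
    return kept


files = [("a.py", "print(1)\n"), ("b.py", "print(1)\n"), ("c.py", "print(2)\n")]
kept_paths = [path for path, _ in dedup_files_exact(files)]

repos = {"r1": {"a.py": "x"}, "r2": {"b.py": "x"}, "r3": {"a.py": "y"}}
kept_repos = dedup_repos_exact(repos)
```

The difference between the two levels is the unit being compared: file-level dedup may drop individual files from otherwise distinct repositories, while repository-level dedup keeps or drops each repository as a whole. Production pipelines often add fuzzy matching (for example MinHash) on top of exact hashing; that is outside the scope of this sketch.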

## References

[GitHub](https://github.com/OpenCoder-llm/OpenCoder-llm)

[Paper](https://arxiv.org/pdf/2411.04905)

[Hugging Face](https://huggingface.co/collections/infly/opencoder-672cec44bbb86c39910fb55e)
