150 Tokens/Second Lightning Speed
MiMo-V2-Flash delivers unprecedented inference speed through Multi-Token Prediction (MTP) and self-speculative decoding, generating code and content 2-3x faster than competitors.
MiMo-V2-Flash by Xiaomi is a revolutionary 309B parameter MoE language model delivering 150 tokens/sec inference speed. Ranked #1 on SWE-Bench Verified with 73.4% score. Experience enterprise-grade AI at just 2.5% of Claude's cost.
MiMo-V2-Flash combines groundbreaking architecture with exceptional performance, making it the premier choice for developers, enterprises, and AI enthusiasts worldwide.
With 309 billion total parameters and 15 billion active parameters, MiMo-V2-Flash leverages Mixture-of-Experts (MoE) for optimal efficiency without compromising intelligence.
At just $0.1 per million input tokens and $0.3 per million output tokens, MiMo-V2-Flash delivers enterprise-grade performance at a fraction of proprietary model costs.
Process entire codebases, lengthy documents, and complex multi-turn conversations with MiMo-V2-Flash's massive 256,000 token context window.
MiMo-V2-Flash achieves 73.4% on SWE-Bench Verified, claiming the #1 position among all open-source models and rivaling proprietary solutions.
Fully open-source under MIT license. Download, modify, deploy, and commercialize MiMo-V2-Flash without restrictions or royalties.
Try MiMo-V2-Flash directly through Xiaomi MiMo Studio. Experience the blazing-fast 150 tok/s inference speed and superior code generation capabilities firsthand.
MiMo-V2-Flash dominates across coding, reasoning, and agentic benchmarks, outperforming DeepSeek-V3.2, Gemini 3.0 Pro, and Claude Sonnet 4.5.
| Benchmark | MiMo-V2-Flash | DeepSeek-V3.2 | Gemini 3.0 Pro | Claude Sonnet 4.5 | GPT-5 High |
|---|---|---|---|---|---|
| SWE-Bench Verified | 73.4% #1 Open Source | 73.1% | 76.2% | 77.2% | 74.9% |
| SWE-Bench Multilingual | 71.7% | 70.2% | - | 68.0% | 55.3% |
| LiveCodeBench v6 | 80.6% | 83.3% | 90.7% | 64.0% | 84.5% |
| AIME 2025 | 94.1% | 93.1% | 95.0% | 87.0% | 94.6% |
| GPQA-Diamond | 83.7% | 82.4% | 91.9% | 83.4% | 85.7% |
| MMLU-Pro | 84.9% | 85.0% | 90.1% | 88.2% | 87.5% |
| τ²-Bench (Agent) | 80.3% | 80.3% | 85.4% | 84.7% | 80.2% |
| BrowseComp | 58.3% | 67.6% | 59.2% | 24.1% | 54.9% |
Discover the innovative hybrid architecture that makes MiMo-V2-Flash the fastest and most efficient open-source AI model available.
128-token window × 5 layers per block
Full context × 1 layer per block
0.33B params per block × 3 layers
309B total → 15B active parameters
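To build intuition for why only a fraction of the total parameters are active per token, here is a minimal top-k expert-routing sketch in the spirit of MoE layers in general; the expert count, hidden size, and k below are illustrative placeholders, not MiMo-V2-Flash's actual configuration.
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, k=2):
    # Toy top-k MoE layer: each token is routed to only k experts,
    # so most expert parameters stay inactive for any given token.
    logits = router(x)                                  # [tokens, n_experts]
    weights, idx = torch.topk(F.softmax(logits, dim=-1), k)
    out = torch.zeros_like(x)
    for slot in range(k):
        for e in range(len(experts)):
            mask = idx[:, slot] == e                    # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot:slot + 1] * experts[e](x[mask])
    return out

# Placeholder sizes for illustration only (not MiMo-V2-Flash's real config)
d_model, n_experts = 64, 8
router = torch.nn.Linear(d_model, n_experts)
experts = torch.nn.ModuleList(torch.nn.Linear(d_model, d_model) for _ in range(n_experts))
tokens = torch.randn(4, d_model)
print(moe_forward(tokens, router, experts).shape)       # torch.Size([4, 64])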
MiMo-V2-Flash addresses the quadratic complexity of long-context attention by interleaving Local Sliding Window Attention (SWA) and Global Attention (GA) at a 5:1 ratio. The SWA layers use an aggressive 128-token window with a learnable attention sink bias.
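For intuition, here is a minimal sketch of how a 128-token sliding-window mask differs from a full causal (global) mask; the mask construction is a generic illustration, not MiMo-V2-Flash's actual attention kernel, and the learnable sink bias is omitted.
import torch

def sliding_window_mask(seq_len, window=128):
    # Each query attends only to the most recent `window` keys (itself included),
    # so attention cost grows linearly with sequence length.
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

def global_mask(seq_len):
    # Standard causal mask: every query sees all previous keys (quadratic cost).
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# With the 5:1 interleave, five SWA layers use the local mask
# for every one GA layer that uses the full causal mask.
print(sliding_window_mask(8, window=3).int())
print(global_mask(8).int())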
The native MTP module uses a dense FFN (instead of MoE) and SWA (instead of GA), keeping its parameter count to just 0.33B per block. This enables self-speculative decoding, tripling generation speed.
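Conceptually, self-speculative decoding follows the loop sketched below: a cheap draft head proposes several tokens and the full model verifies them in a single pass, keeping the accepted prefix. The draft and verify callables are hypothetical stand-ins for the MTP module and the main model, not MiMo's real interfaces.
def speculative_decode(prompt, draft, verify, n_draft=4, max_tokens=64):
    # draft(tokens, n)     -> n cheap candidate tokens from the small draft head
    # verify(tokens, cand) -> the prefix of `cand` the full model accepts,
    #                         plus one corrected/extra token from the full model
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        candidates = draft(tokens, n_draft)            # cheap multi-token guess
        accepted, fixup = verify(tokens, candidates)   # one full-model verification pass
        tokens.extend(accepted + [fixup])              # keep only the verified prefix
    return tokens

# Toy stand-ins: the draft head guesses consecutive integers and the verifier
# accepts them all; real models would reject some drafted tokens.
toy_draft = lambda toks, n: [toks[-1] + i + 1 for i in range(n)]
toy_verify = lambda toks, cand: (cand, cand[-1] + 1)
print(speculative_decode([0], toy_draft, toy_verify, max_tokens=12))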
From vibe coding to enterprise agents, MiMo-V2-Flash powers the next generation of AI applications across every domain.
Integrate MiMo-V2-Flash with Cursor, Cline, and Claude Code for lightning-fast code generation, refactoring, and debugging at 150 tok/s.
Build sophisticated AI agents with MiMo-V2-Flash's agentic capabilities, trained with large-scale RL on 100,000+ GitHub issues.
Leverage SWE-Bench #1 performance for automated code review, bug detection, and intelligent refactoring suggestions.
Process large datasets with 256K context window. Generate analysis scripts, visualizations, and reports instantly.
Create complete web applications, landing pages, and components with MiMo-V2-Flash's superior HTML/CSS/JS generation.
Generate high-quality content, documentation, and creative writing with MiMo-V2-Flash's 86.2% Arena-Hard Creative Writing score.
Get started with MiMo-V2-Flash in minutes. Deploy locally with SGLang or access via the free API.
Install the SGLang framework for optimized MiMo-V2-Flash inference
Start the MiMo-V2-Flash server with recommended configuration
Query MiMo-V2-Flash via OpenAI-compatible API endpoints
# Install SGLang for MiMo-V2-Flash deployment
pip install sglang
# Launch MiMo-V2-Flash server
python3 -m sglang.launch_server \
--model-path XiaomiMiMo/MiMo-V2-Flash \
--served-model-name mimo-v2-flash \
--tp-size 8 \
--dp-size 2 \
--enable-dp-attention \
--host 0.0.0.0 \
--port 9001 \
--trust-remote-code \
--context-length 262144 \
--enable-mtp
import openai
# Connect to MiMo-V2-Flash API
client = openai.OpenAI(
base_url="http://localhost:9001/v1",
api_key="mimo-v2-flash"
)
# Send request to MiMo-V2-Flash
response = client.chat.completions.create(
model="mimo-v2-flash",
messages=[{
"role": "user",
"content": "Write a binary search in Python"
}],
max_tokens=4096,
temperature=0.8,
extra_body={"enable_thinking": True}
)
print(response.choices[0].message.content)
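To watch tokens arrive at the advertised 150 tok/s, streaming is usually more instructive than waiting for the full completion; the snippet below reuses the client configured above and standard OpenAI-compatible streaming (assuming the server supports it).
# Stream tokens as they are generated (same client as above)
stream = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{"role": "user", "content": "Explain quicksort in two sentences"}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()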
Experience enterprise-grade AI at a fraction of the cost. MiMo-V2-Flash API is currently FREE with limited usage.
Deploy MiMo-V2-Flash on your own infrastructure
Cloud API with instant access
Access via OpenRouter marketplace
Seamlessly integrate MiMo-V2-Flash with your favorite development tools and frameworks.
Configure MiMo-V2-Flash as your AI coding assistant in Cursor for 150 tok/s code generation.
Use MiMo-V2-Flash with the Cline VS Code extension for autonomous coding tasks.
Day-0 SGLang support with MTP acceleration for optimal MiMo-V2-Flash performance.
Deploy MiMo-V2-Flash in containerized environments with pre-built Docker images.
Scale MiMo-V2-Flash deployments with Kubernetes orchestration and Ray clusters.
Join thousands of developers who have already experienced the power of MiMo-V2-Flash.
MiMo-V2-Flash is insanely fast! The 150 tok/s generation speed completely transformed my coding workflow. It's like having a supercharged AI pair programmer.
Finally an open-source model that rivals Claude for coding tasks. The SWE-Bench scores don't lie - MiMo-V2-Flash handles complex refactoring tasks brilliantly.
The cost savings are unreal. We switched from Claude Sonnet to MiMo-V2-Flash and cut our AI costs by 97%. Performance is just as good, sometimes better.
MiMo-V2-Flash introduces several groundbreaking innovations that set it apart from other open-source language models. Understanding these technical details helps developers maximize the model's potential.
# English System Prompt for MiMo-V2-Flash
"""
You are MiMo, an AI assistant developed by Xiaomi.
Today's date: {date} {week}.
Your knowledge cutoff date is December 2024.
"""
# Chinese System Prompt
"""
ไฝ ๆฏMiMo๏ผไธญๆๅ็งฐไนๆฏMiMo๏ผ๏ผๆฏๅฐ็ฑณๅ
ฌๅธ็ ๅ็AIๆบ่ฝๅฉๆใ
ไปๅคฉ็ๆฅๆ๏ผ{date} {week}๏ผ
ไฝ ็็ฅ่ฏๆชๆญขๆฅๆๆฏ2024ๅนด12ๆใ
"""
Complete API reference for integrating MiMo-V2-Flash into your applications.
Choose the right hardware configuration for your MiMo-V2-Flash deployment needs.
Pipeline parallel with FP8 quantization
Full performance with tensor parallel
Maximum throughput for production
Connect with thousands of developers building with MiMo-V2-Flash.
Common questions about MiMo-V2-Flash deployment, performance, and capabilities.
MiMo-V2-Flash is a state-of-the-art open-source Mixture-of-Experts (MoE) language model developed by Xiaomi's MiMo team, led by Luo Fuli (罗福莉). Released on December 16, 2025, it features 309B total parameters with 15B active parameters, delivering 150 tokens/sec inference speed and achieving #1 on SWE-Bench Verified among open-source models.
MiMo-V2-Flash achieves 150 tokens per second inference speed, making it 2-3x faster than comparable models. This speed is enabled by Multi-Token Prediction (MTP) technology with self-speculative decoding. The lightweight MTP module (0.33B params per block) triples generation speed while maintaining quality.
Yes! MiMo-V2-Flash is released under the MIT License, which means you can use, modify, distribute, and commercialize it without restrictions. The model weights are freely downloadable from Hugging Face, and the API is currently available with limited free usage at just $0.1/M input tokens.
For full performance at 150 tok/s with 256K context, you need 8x H100 80GB GPUs with tensor parallelism. Minimum deployment requires 4x H100 with pipeline parallelism and FP8 quantization, achieving ~50 tok/s with 128K context. Alternatively, use the cloud API for instant access without hardware requirements.
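As a back-of-envelope check on these figures (weights only, ignoring KV cache, activations, and framework overhead), 309B parameters occupy roughly 309 GB at FP8 and about 618 GB at BF16, which is why 4x and 8x H100 80GB are the respective floors:
# Rough weight-memory estimate; real deployments also need KV cache and activation memory
params_b = 309                                       # total parameters, in billions
for name, bytes_per_param in [("FP8", 1), ("BF16", 2)]:
    weight_gb = params_b * bytes_per_param           # ~GB, since 1B params * 1 byte ~ 1 GB
    gpus_needed = -(-weight_gb // 80)                # ceiling divide by 80 GB per H100
    print(f"{name}: ~{weight_gb} GB of weights -> at least {int(gpus_needed)}x H100 80GB")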
MiMo-V2-Flash achieves 73.4% on SWE-Bench Verified, surpassing DeepSeek-V3.2's 73.1% and approaching Claude Sonnet 4.5's 77.2%. On AIME 2025, MiMo scores 94.1% vs DeepSeek's 93.1%. The key advantage is cost: MiMo-V2-Flash costs just 2.5% of Claude Sonnet at $0.1/M input tokens vs $4.0/M.
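The 2.5% figure follows directly from the listed input prices:
# Input-token price comparison from the figures above (USD per 1M tokens)
mimo_input, claude_input = 0.10, 4.00
print(f"MiMo-V2-Flash costs {mimo_input / claude_input:.1%} of Claude Sonnet 4.5 per input token")
# -> MiMo-V2-Flash costs 2.5% of Claude Sonnet 4.5 per input token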
MiMo-V2-Flash has day-0 SGLang support. Install with pip install sglang, then launch: python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-V2-Flash --tp-size 8 --enable-mtp. Enable EAGLE speculative decoding for maximum speed. Full documentation is available on the LMSYS blog.
Hybrid Thinking mode enables MiMo-V2-Flash to show its reasoning process through a reasoning_content field alongside tool_calls. Enable it with "enable_thinking": true in your API request. For multi-turn conversations, persist all reasoning_content in the messages array for consistent context.
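A sketch of a two-turn exchange that persists reasoning_content as described above; field names follow this page, but the exact response shape may vary by server version, so treat this as an assumption-laden example rather than a definitive client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9001/v1", api_key="mimo-v2-flash")

# Turn 1: ask with thinking enabled, then keep reasoning_content in the history
messages = [{"role": "user", "content": "Plan a migration from REST to gRPC"}]
first = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=messages,
    extra_body={"enable_thinking": True},
)
assistant = first.choices[0].message
messages.append({
    "role": "assistant",
    "content": assistant.content,
    # Persist the reasoning so later turns keep a consistent context (assumed field name)
    "reasoning_content": getattr(assistant, "reasoning_content", None),
})

# Turn 2: the follow-up request sees both the earlier answer and its reasoning
messages.append({"role": "user", "content": "Now estimate a rollout timeline"})
second = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=messages,
    extra_body={"enable_thinking": True},
)
print(second.choices[0].message.content)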
Yes! MiMo-V2-Flash is fully compatible with vibe coding tools like Cursor, Cline, and Claude Code. Configure your custom model endpoint to point to the MiMo-V2-Flash API or your local SGLang server. Experience 150 tok/s code generation with SWE-Bench #1 performance at a fraction of the cost.
Comprehensive comparison of MiMo-V2-Flash against leading open-source and proprietary AI models.
| Feature | MiMo-V2-Flash | DeepSeek-V3.2 | Kimi K2 | Claude Sonnet 4.5 | GPT-5 |
|---|---|---|---|---|---|
| Total Parameters | 309B | 671B | 1043B | Unknown | Unknown |
| Active Parameters | 15B | 37B | 32B | Unknown | Unknown |
| Context Window | 256K | 128K | 128K | 200K | 128K |
| Inference Speed | 150 tok/s | ~60 tok/s | ~50 tok/s | ~40 tok/s | ~45 tok/s |
| SWE-Bench Verified | 73.4% | 73.1% | 71.3% | 77.2% | 74.9% |
| Input Cost (per 1M) | $0.10 | $0.60 | $0.55 | $4.00 | $5.00 |
| Open Source | ✓ | ✓ | ✓ | ✗ | ✗ |
| MIT License | ✓ | ✓ | ✗ | ✗ | ✗ |
| MoE Architecture | ✓ | ✓ | ✓ | ✗ | ✗ |
| Multi-Token Prediction | ✓ | ✓ | ✗ | ✗ | ✗ |
Stay updated on the latest developments and upcoming features for MiMo-V2-Flash.
309B parameter model released with 150 tok/s inference, 256K context, and SWE-Bench #1 performance. Full weights available on Hugging Face under MIT license.
Native SGLang integration with MTP acceleration, EAGLE speculative decoding, and optimized inference configurations.
Vision capabilities addition enabling image understanding, chart analysis, and multimodal reasoning tasks.
GGUF, EXL2, and AWQ quantized versions for consumer GPU deployment on RTX 4090 and similar hardware.
Access MiMo-V2-Flash model weights, technical reports, and deployment resources.
Instruction-tuned model with MOPD and Agentic RL post-training.
Pre-trained base model for fine-tuning and research purposes.
Comprehensive technical documentation covering architecture and training.
Open-sourced Multi-Token Prediction weights for research.