New Release - December 2025 | MIT License | Free API

MiMo-V2-Flash: The Fastest Open-Source AI Model

MiMo-V2-Flash by Xiaomi is a revolutionary 309B-parameter MoE language model delivering 150 tokens/sec inference speed. It ranks #1 among open-source models on SWE-Bench Verified with a 73.4% score. Experience enterprise-grade AI at just 2.5% of Claude's cost.

309B
Total Parameters
150
Tokens per Second
73.4%
SWE-Bench Score
256K
Context Window
$0.1
Per Million Input Tokens
MIT
Open Source License

Why Choose MiMo-V2-Flash?

MiMo-V2-Flash combines groundbreaking architecture with exceptional performance, making it the premier choice for developers, enterprises, and AI enthusiasts worldwide.


150 Tokens/Second Lightning Speed

MiMo-V2-Flash delivers unprecedented inference speed through Multi-Token Prediction (MTP) and self-speculative decoding, generating code and content 2-3x faster than competitors.


309B Parameters MoE Architecture

With 309 billion total parameters and 15 billion active parameters, MiMo-V2-Flash leverages Mixture-of-Experts (MoE) for optimal efficiency without compromising intelligence.


2.5% of Claude's Cost

At just $0.1 per million input tokens and $0.3 per million output tokens, MiMo-V2-Flash delivers enterprise-grade performance at a fraction of proprietary model costs.


256K Context Window

Process entire codebases, lengthy documents, and complex multi-turn conversations with MiMo-V2-Flash's massive 256,000 token context window.

๐Ÿ†

SWE-Bench #1 Open Source

MiMo-V2-Flash achieves 73.4% on SWE-Bench Verified, claiming the #1 position among all open-source models and rivaling proprietary solutions.


MIT Open Source License

Fully open-source under MIT license. Download, modify, deploy, and commercialize MiMo-V2-Flash without restrictions or royalties.

Live Demo

Experience MiMo-V2-Flash In Action

Try MiMo-V2-Flash directly through Xiaomi MiMo Studio. Experience the blazing-fast 150 tok/s inference speed and superior code generation capabilities firsthand.

✓ Real-time code generation at 150 tokens/second
✓ Hybrid Thinking mode for complex reasoning
✓ 256K context for entire codebase analysis
✓ Multi-turn agentic workflows support
Launch MiMo-V2-Flash Studio
Write a Python function to merge two sorted arrays efficiently
def merge_sorted_arrays(arr1, arr2):
    result = []
    i = j = 0
    while i < len(arr1) and j < len(arr2):
        if arr1[i] <= arr2[j]:
            result.append(arr1[i])
            i += 1
        else:
            result.append(arr2[j])
            j += 1
    result.extend(arr1[i:])
    result.extend(arr2[j:])
    return result

MiMo-V2-Flash Benchmark Results

MiMo-V2-Flash dominates across coding, reasoning, and agentic benchmarks, outperforming DeepSeek-V3.2, Gemini 3.0 Pro, and Claude Sonnet 4.5.

Benchmark MiMo-V2-Flash DeepSeek-V3.2 Gemini 3.0 Pro Claude Sonnet 4.5 GPT-5 High
SWE-Bench Verified 73.4% (#1 open source) 73.1% 76.2% 77.2% 74.9%
SWE-Bench Multilingual 71.7% 70.2% - 68.0% 55.3%
LiveCodeBench v6 80.6% 83.3% 90.7% 64.0% 84.5%
AIME 2025 94.1% 93.1% 95.0% 87.0% 94.6%
GPQA-Diamond 83.7% 82.4% 91.9% 83.4% 85.7%
MMLU-Pro 84.9% 85.0% 90.1% 88.2% 87.5%
τ²-Bench (Agent) 80.3% 80.3% 85.4% 84.7% 80.2%
BrowseComp 58.3% 67.6% 59.2% 24.1% 54.9%

MiMo-V2-Flash Technical Architecture

Discover the innovative hybrid architecture that makes MiMo-V2-Flash the fastest and most efficient open-source AI model available.

MiMo-V2-Flash Architecture Diagram


Sliding Window Attention (SWA)

128-token window × 5 layers per block


Global Attention (GA)

Full context × 1 layer per block


Multi-Token Prediction (MTP)

0.33B params per block × 3 layers


MoE Expert Routing

309B total → 15B active parameters
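
A quick illustration of the routing idea above: a router scores every expert for each token, but only the top-k experts actually run, which is why only a small fraction of the total parameters are active at once. The expert count, dimensions, and k below are illustrative assumptions, not MiMo-V2-Flash's actual configuration.

python
import numpy as np

def topk_moe_route(hidden, expert_weights, router_weights, k=2):
    """Illustrative top-k MoE routing: only k experts run per token."""
    logits = router_weights @ hidden                   # score every expert
    topk = np.argsort(logits)[-k:]                     # keep the k best experts
    gates = np.exp(logits[topk] - logits[topk].max())
    gates /= gates.sum()                               # softmax over selected experts only
    # Only the selected experts' weights are used, so active params << total params.
    return sum(g * (expert_weights[e] @ hidden) for g, e in zip(gates, topk))

# Toy usage: 64 experts, 2 active per token (illustrative numbers only)
rng = np.random.default_rng(0)
d, num_experts = 16, 64
out = topk_moe_route(
    rng.standard_normal(d),
    [rng.standard_normal((d, d)) for _ in range(num_experts)],
    rng.standard_normal((num_experts, d)),
)
print(out.shape)  # (16,)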

Hybrid Sliding Window Attention

MiMo-V2-Flash addresses the quadratic complexity of long-context attention by interleaving local Sliding Window Attention (SWA) and Global Attention (GA) at a 5:1 ratio. It uses an aggressive 128-token window with a learnable attention sink bias.

5:1 SWA:GA Ratio
128 Window Size
6x KV Cache Reduction
8 Hybrid Blocks
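
To make the 5:1 interleaving concrete, here is a minimal sketch of the layer schedule these figures describe. It is an illustration of the block layout only, not the model's actual implementation; the total layer count is simply what the numbers above imply.

python
# Layer schedule implied by the figures above: per hybrid block,
# 5 sliding-window layers (128-token window) followed by 1 global layer.
SWA_PER_BLOCK = 5
GA_PER_BLOCK = 1
NUM_BLOCKS = 8
WINDOW = 128  # local attention window in tokens

def attention_schedule():
    """Return the per-layer attention type, e.g. ('swa', 128) or ('global', None)."""
    schedule = []
    for _ in range(NUM_BLOCKS):
        schedule += [("swa", WINDOW)] * SWA_PER_BLOCK
        schedule += [("global", None)] * GA_PER_BLOCK
    return schedule

layers = attention_schedule()
print(len(layers))   # 48 layers in this sketch
print(layers[:6])    # one block: 5x SWA then 1x global
# KV-cache intuition: SWA layers only keep the last 128 keys/values,
# which is where the ~6x KV cache reduction quoted above comes from.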

Multi-Token Prediction (MTP)

The native MTP module uses a dense FFN (instead of MoE) and SWA (instead of GA), keeping the parameter count to 0.33B per block. It enables self-speculative decoding, roughly tripling generation speed.

0.33B Params/Block
3x Speed Boost
3 MTP Layers
FP8 Precision
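
The MTP head is what makes self-speculative decoding possible: the lightweight head drafts a few tokens ahead, and the main model verifies the whole draft in one forward pass, keeping the prefix it agrees with. The sketch below shows the generic accept-and-correct loop; draft_next_tokens and verify_tokens are hypothetical placeholders standing in for the MTP head and the full model, not MiMo-V2-Flash APIs.

python
def speculative_decode(prompt_ids, draft_next_tokens, verify_tokens,
                       max_new_tokens=256, k=3):
    """Generic self-speculative decoding loop (illustration only).

    draft_next_tokens(ids, k): k cheap draft tokens from the small MTP head.
    verify_tokens(ids, draft): (accepted, correction) where `accepted` is how many
        draft tokens the full model agrees with and `correction` is its own next
        token after the accepted prefix, all from a single full-model pass.
    """
    ids = list(prompt_ids)
    while len(ids) - len(prompt_ids) < max_new_tokens:
        draft = draft_next_tokens(ids, k)                  # cheap drafting step
        accepted, correction = verify_tokens(ids, draft)   # one full-model forward
        ids += list(draft[:accepted]) + [correction]       # accepted tokens are extra throughput
    return ids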

MiMo-V2-Flash Use Cases

From vibe coding to enterprise agents, MiMo-V2-Flash powers the next generation of AI applications across every domain.


Vibe Coding with Cursor & Cline

Integrate MiMo-V2-Flash with Cursor, Cline, and Claude Code for lightning-fast code generation, refactoring, and debugging at 150 tok/s.


AI Agent Development

Build sophisticated AI agents with MiMo-V2-Flash's agentic capabilities, trained with large-scale RL on 100,000+ GitHub issues.


Code Review & Refactoring

Leverage SWE-Bench #1 performance for automated code review, bug detection, and intelligent refactoring suggestions.


Data Analysis & Visualization

Process large datasets with 256K context window. Generate analysis scripts, visualizations, and reports instantly.


Web Development & Generation

Create complete web applications, landing pages, and components with MiMo-V2-Flash's superior HTML/CSS/JS generation.


Content Creation & Writing

Generate high-quality content, documentation, and creative writing with MiMo-V2-Flash's 86.2% Arena-Hard Creative Writing score.

MiMo-V2-Flash Quick Start Guide

Get started with MiMo-V2-Flash in minutes. Deploy locally with SGLang or access via the free API.

1. Install SGLang
Install the SGLang framework for optimized MiMo-V2-Flash inference.

2. Launch Server
Start the MiMo-V2-Flash server with the recommended configuration.

3. Send Requests
Query MiMo-V2-Flash via OpenAI-compatible API endpoints.

bash
# Install SGLang for MiMo-V2-Flash deployment
pip install sglang

# Launch MiMo-V2-Flash server
python3 -m sglang.launch_server \
    --model-path XiaomiMiMo/MiMo-V2-Flash \
    --served-model-name mimo-v2-flash \
    --tp-size 8 \
    --dp-size 2 \
    --enable-dp-attention \
    --host 0.0.0.0 \
    --port 9001 \
    --trust-remote-code \
    --context-length 262144 \
    --enable-mtp
python
import openai

# Connect to MiMo-V2-Flash API
client = openai.OpenAI(
    base_url="http://localhost:9001/v1",
    api_key="mimo-v2-flash"
)

# Send request to MiMo-V2-Flash
response = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{
        "role": "user",
        "content": "Write a binary search in Python"
    }],
    max_tokens=4096,
    temperature=0.8,
    extra_body={"enable_thinking": True}
)

print(response.choices[0].message.content)

MiMo-V2-Flash Pricing

Experience enterprise-grade AI at a fraction of the cost. MiMo-V2-Flash API is currently FREE with limited usage.

Self-Hosted

Deploy MiMo-V2-Flash on your own infrastructure

$0 / forever
  • ✓ Full 309B model weights download
  • ✓ MIT open source license
  • ✓ Commercial use permitted
  • ✓ SGLang optimized deployment
Download Weights

OpenRouter

Access via OpenRouter marketplace

$0.15 / 1M tokens
  • ✓ Unified API access
  • ✓ Multiple model fallback
  • ✓ Usage-based billing
  • ✓ Enterprise support
Access via OpenRouter

MiMo-V2-Flash vs Competitors: Cost Comparison (per 1M input tokens)

MiMo-V2-Flash
$0.1
DeepSeek V3
$0.6
Claude Sonnet 4.5
$4.0
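
The headline "2.5% of Claude's cost" follows directly from the input-token prices in the chart above:

python
# Input-token prices per 1M tokens, taken from the comparison above
mimo, claude = 0.10, 4.00
print(f"{mimo / claude:.1%}")  # 2.5%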

MiMo-V2-Flash Integrations

Seamlessly integrate MiMo-V2-Flash with your favorite development tools and frameworks.


Cursor IDE

Configure MiMo-V2-Flash as your AI coding assistant in Cursor for 150 tok/s code generation.

Setup Guide

Cline Extension

Use MiMo-V2-Flash with Cline VS Code extension for autonomous coding tasks.

Setup Guide

SGLang

Day-0 SGLang support with MTP acceleration for optimal MiMo-V2-Flash performance.

Documentation

Hugging Face

Download MiMo-V2-Flash weights and integrate with Transformers library.

Model Card
๐Ÿณ

Docker

Deploy MiMo-V2-Flash in containerized environments with pre-built Docker images.

Docker Guide

Kubernetes

Scale MiMo-V2-Flash deployments with Kubernetes orchestration and Ray clusters.

K8s Guide

What Developers Say About MiMo-V2-Flash

Join thousands of developers who have already experienced the power of MiMo-V2-Flash.

MiMo-V2-Flash is insanely fast! The 150 tok/s generation speed completely transformed my coding workflow. It's like having a supercharged AI pair programmer.

JD
James D.
Senior Software Engineer

Finally an open-source model that rivals Claude for coding tasks. The SWE-Bench scores don't lie - MiMo-V2-Flash handles complex refactoring tasks brilliantly.

SK
Sarah K.
AI Engineer @ r/LocalLLaMA

The cost savings are unreal. We switched from Claude Sonnet to MiMo-V2-Flash and cut our AI costs by 97%. Performance is just as good, sometimes better.

ML
Michael L.
CTO, AI Startup
Deep Dive

MiMo-V2-Flash Technical Innovations

MiMo-V2-Flash introduces several groundbreaking innovations that set it apart from other open-source language models. Understanding these technical details helps developers maximize the model's potential.

  • ✓ Sliding Window Attention (SWA): Uses a 128-token window for local attention, reducing KV cache storage by nearly 6x while maintaining long-context performance via learnable attention sink bias.
  • ✓ Multi-Token Prediction (MTP): Native MTP module with 0.33B params per block enables self-speculative decoding, tripling generation speed and reducing GPU idleness during RL training.
  • ✓ FP8 Mixed Precision: Trained on 27T tokens using FP8 mixed precision with native 32K sequence length, supporting up to a 256K context window.
  • ✓ MOPD Post-Training: Multi-Teacher On-Policy Distillation formulates knowledge distillation as RL, providing dense token-level guidance from domain-specific experts.
System Prompt
# English System Prompt for MiMo-V2-Flash
"""
You are MiMo, an AI assistant developed by Xiaomi.

Today's date: {date} {week}. 
Your knowledge cutoff date is December 2024.
"""

# Chinese System Prompt (translation: "You are MiMo (its Chinese name is also MiMo),
# an AI assistant developed by Xiaomi. Today's date: {date} {week}.
# Your knowledge cutoff date is December 2024.")
"""
你是MiMo（中文名称也是MiMo），是小米公司研发的AI智能助手。

今天的日期：{date} {week}，
你的知识截止日期是2024年12月。
"""

MiMo-V2-Flash API Documentation

Complete API reference for integrating MiMo-V2-Flash into your applications.

POST /v1/chat/completions

model (string, required): mimo-v2-flash
messages (array, required): chat messages array
max_tokens (integer): 4096 (default)
temperature (float): 0.8 (math/web), 0.3 (agentic)
top_p (float): 0.95 (recommended)
enable_thinking (boolean): enable Hybrid Thinking mode
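
Putting the recommended values together, a request body for an agentic workload might look like the sketch below; the model name and endpoint come from the Quick Start, and the prompt is only an example.

python
# Example /v1/chat/completions payload using the documented defaults
payload = {
    "model": "mimo-v2-flash",
    "messages": [{"role": "user", "content": "Refactor this function into an iterative version."}],
    "max_tokens": 4096,        # default
    "temperature": 0.3,        # 0.3 recommended for agentic tasks, 0.8 for math/web
    "top_p": 0.95,             # recommended
    "enable_thinking": True,   # Hybrid Thinking mode
}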

MiMo-V2-Flash Hardware Requirements

Choose the right hardware configuration for your MiMo-V2-Flash deployment needs.

Minimum: 4x H100 80GB

Pipeline parallel with FP8 quantization

~50 tok/s, 128K context

Enterprise: 8x H200 141GB

Maximum throughput for production

200+ tok/s, 256K context

Join the MiMo-V2-Flash Community

Connect with thousands of developers building with MiMo-V2-Flash.

Frequently Asked Questions

Common questions about MiMo-V2-Flash deployment, performance, and capabilities.

What is MiMo-V2-Flash?
MiMo-V2-Flash is a state-of-the-art open-source Mixture-of-Experts (MoE) language model developed by Xiaomi's MiMo team, led by Luo Fuli (罗福莉). Released on December 16, 2025, it features 309B total parameters with 15B active parameters, delivering 150 tokens/sec inference speed and achieving #1 on SWE-Bench Verified among open-source models.

How fast is MiMo-V2-Flash?
MiMo-V2-Flash achieves 150 tokens per second inference speed, making it 2-3x faster than comparable models. This speed is enabled by Multi-Token Prediction (MTP) technology with self-speculative decoding. The lightweight MTP module (0.33B params per block) triples generation speed while maintaining quality.

Is MiMo-V2-Flash free and open source?
Yes! MiMo-V2-Flash is released under the MIT License, which means you can use, modify, distribute, and commercialize it without restrictions. The model weights are freely downloadable from Hugging Face, and the API is currently available with limited free usage at just $0.1/M input tokens.

What hardware do I need to run MiMo-V2-Flash?
For full performance at 150 tok/s with 256K context, you need 8x H100 80GB GPUs with tensor parallelism. Minimum deployment requires 4x H100 with pipeline parallelism and FP8 quantization, achieving ~50 tok/s with 128K context. Alternatively, use the cloud API for instant access without hardware requirements.

How does MiMo-V2-Flash compare to DeepSeek and Claude?
MiMo-V2-Flash achieves 73.4% on SWE-Bench Verified, surpassing DeepSeek-V3.2's 73.1% and approaching Claude Sonnet 4.5's 77.2%. On AIME 2025, MiMo scores 94.1% vs DeepSeek's 93.1%. The key advantage is cost: MiMo-V2-Flash costs just 2.5% of Claude Sonnet at $0.1/M input tokens vs $4.0/M.

How do I deploy MiMo-V2-Flash with SGLang?
MiMo-V2-Flash has day-0 SGLang support. Install with pip install sglang, then launch: python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-V2-Flash --tp-size 8 --enable-mtp. Enable EAGLE speculative decoding for maximum speed. Full documentation is available on the LMSYS blog.

What is Hybrid Thinking mode?
Hybrid Thinking mode enables MiMo-V2-Flash to show its reasoning process through a reasoning_content field alongside tool_calls. Enable it with "enable_thinking": true in your API request. For multi-turn conversations, persist all reasoning_content in the messages array for consistent context.
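
A minimal sketch of that pattern with the OpenAI-compatible client from the Quick Start; the reasoning_content handling follows the description above, and exact response shapes may differ in your deployment.

python
import openai

client = openai.OpenAI(base_url="http://localhost:9001/v1", api_key="mimo-v2-flash")

# Keep reasoning_content in the history across turns, as recommended above
messages = [{"role": "user", "content": "Plan the refactor, then list the steps."}]

response = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=messages,
    extra_body={"enable_thinking": True},
)
msg = response.choices[0].message

assistant_turn = {"role": "assistant", "content": msg.content}
reasoning = getattr(msg, "reasoning_content", None)
if reasoning is not None:
    assistant_turn["reasoning_content"] = reasoning  # persist the reasoning trace
messages.append(assistant_turn)

messages.append({"role": "user", "content": "Now apply step 1."})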

Can I use MiMo-V2-Flash with Cursor, Cline, or Claude Code?
Yes! MiMo-V2-Flash is fully compatible with vibe coding tools like Cursor, Cline, and Claude Code. Configure your custom model endpoint to point to the MiMo-V2-Flash API or your local SGLang server. Experience 150 tok/s code generation with SWE-Bench #1 performance at a fraction of the cost.

MiMo-V2-Flash vs Competitors

Comprehensive comparison of MiMo-V2-Flash against leading open-source and proprietary AI models.

Feature MiMo-V2-Flash DeepSeek-V3.2 Kimi K2 Claude Sonnet 4.5 GPT-5
Total Parameters 309B 671B 1043B Unknown Unknown
Active Parameters 15B 37B 32B Unknown Unknown
Context Window 256K 128K 128K 200K 128K
Inference Speed 150 tok/s ~60 tok/s ~50 tok/s ~40 tok/s ~45 tok/s
SWE-Bench Verified 73.4% 73.1% 71.3% 77.2% 74.9%
Input Cost (per 1M) $0.10 $0.60 $0.55 $4.00 $5.00
Open Source ✓ ✓ ✓ ✗ ✗
MIT License ✓ ✗ ✓ ✗ ✗
MoE Architecture ✓ ✓ ✓ ✗ ✓
Multi-Token Prediction ✓ ✗ ✗ ✗ ✗

MiMo-V2-Flash Development Roadmap

Stay updated on the latest developments and upcoming features for MiMo-V2-Flash.

December 16, 2025

MiMo-V2-Flash Official Release

309B parameter model released with 150 tok/s inference, 256K context, and SWE-Bench #1 performance. Full weights available on Hugging Face under MIT license.

December 16, 2025

SGLang Day-0 Support

Native SGLang integration with MTP acceleration, EAGLE speculative decoding, and optimized inference configurations.

Q1 2026

MiMo-V2-Flash Multimodal

Vision capabilities addition enabling image understanding, chart analysis, and multimodal reasoning tasks.

Q2 2026

Quantized Versions

GGUF, EXL2, and AWQ quantized versions for consumer GPU deployment on RTX 4090 and similar hardware.

Download MiMo-V2-Flash

Access MiMo-V2-Flash model weights, technical reports, and deployment resources.


MiMo-V2-Flash (Chat)

v2.0 - December 2025

Instruction-tuned model with MOPD and Agentic RL post-training.

310B params • Safetensors • Download

MiMo-V2-Flash-Base

v2.0 - December 2025

Pre-trained base model for fine-tuning and research purposes.

310B params • Safetensors • Download

Technical Report

PDF Document

Comprehensive technical documentation covering architecture and training.

PDF • 45 pages • View Report

MTP Weights

3-Layer Module

Open-sourced Multi-Token Prediction weights for research.

~1B params Download
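
For scripted downloads, here is a minimal sketch with huggingface_hub, assuming the repo id XiaomiMiMo/MiMo-V2-Flash used in the Quick Start launch command:

python
from huggingface_hub import snapshot_download

# Fetch weights plus config/tokenizer files; pass the returned path to --model-path
local_dir = snapshot_download(
    repo_id="XiaomiMiMo/MiMo-V2-Flash",
    allow_patterns=["*.safetensors", "*.json", "*.txt"],
)
print(local_dir)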