150 Tokens/Second Lightning Speed
MiMo-V2-Flash delivers unprecedented inference speed through Multi-Token Prediction (MTP) and self-speculative decoding, generating code and content 2-3x faster than competitors.
MiMo-V2-Flash by Xiaomi is a revolutionary 309B parameter MoE language model delivering 150 tokens/sec inference speed. Ranked #1 on SWE-Bench Verified with 73.4% score. Experience enterprise-grade AI at just 2.5% of Claude's cost.
MiMo-V2-Flash combines groundbreaking architecture with exceptional performance, making it the premier choice for developers, enterprises, and AI enthusiasts worldwide.
With 309 billion total parameters and 15 billion active parameters, MiMo-V2-Flash leverages Mixture-of-Experts (MoE) for optimal efficiency without compromising intelligence.
At just $0.1 per million input tokens and $0.3 per million output tokens, MiMo-V2-Flash delivers enterprise-grade performance at a fraction of proprietary model costs.
Process entire codebases, lengthy documents, and complex multi-turn conversations with MiMo-V2-Flash's massive 256,000 token context window.
MiMo-V2-Flash achieves 73.4% on SWE-Bench Verified, claiming the #1 position among all open-source models and rivaling proprietary solutions.
Fully open-source under MIT license. Download, modify, deploy, and commercialize MiMo-V2-Flash without restrictions or royalties.
Try MiMo-V2-Flash directly through Xiaomi MiMo Studio. Experience the blazing-fast 150 tok/s inference speed and superior code generation capabilities firsthand.
MiMo-V2-Flash dominates across coding, reasoning, and agentic benchmarks, outperforming DeepSeek-V3.2, Gemini 3.0 Pro, and Claude Sonnet 4.5.
| Benchmark | MiMo-V2-Flash | DeepSeek-V3.2 | Gemini 3.0 Pro | Claude Sonnet 4.5 | GPT-5 High |
|---|---|---|---|---|---|
| SWE-Bench Verified | 73.4% #1 Open Source | 73.1% | 76.2% | 77.2% | 74.9% |
| SWE-Bench Multilingual | 71.7% | 70.2% | - | 68.0% | 55.3% |
| LiveCodeBench v6 | 80.6% | 83.3% | 90.7% | 64.0% | 84.5% |
| AIME 2025 | 94.1% | 93.1% | 95.0% | 87.0% | 94.6% |
| GPQA-Diamond | 83.7% | 82.4% | 91.9% | 83.4% | 85.7% |
| MMLU-Pro | 84.9% | 85.0% | 90.1% | 88.2% | 87.5% |
| τ²-Bench (Agent) | 80.3% | 80.3% | 85.4% | 84.7% | 80.2% |
| BrowseComp | 58.3% | 67.6% | 59.2% | 24.1% | 54.9% |
Discover the innovative hybrid architecture that makes MiMo-V2-Flash the fastest and most efficient open-source AI model available.
128-token window × 5 layers per block
Full context × 1 layer per block
0.33B params per block × 3 layers
309B total → 15B active parameters
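To build intuition for why only a fraction of the total parameters are active per token, here is a minimal top-k expert-routing sketch in the spirit of MoE layers in general; the expert count, hidden size, and k below are illustrative placeholders, not MiMo-V2-Flash's actual configuration.
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, k=2):
    # Toy top-k MoE layer: each token is routed to only k experts,
    # so most expert parameters stay inactive for any given token.
    logits = router(x)                                  # [tokens, n_experts]
    weights, idx = torch.topk(F.softmax(logits, dim=-1), k)
    out = torch.zeros_like(x)
    for slot in range(k):
        for e in range(len(experts)):
            mask = idx[:, slot] == e                    # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot:slot + 1] * experts[e](x[mask])
    return out

# Placeholder sizes for illustration only (not MiMo-V2-Flash's real config)
d_model, n_experts = 64, 8
router = torch.nn.Linear(d_model, n_experts)
experts = torch.nn.ModuleList(torch.nn.Linear(d_model, d_model) for _ in range(n_experts))
tokens = torch.randn(4, d_model)
print(moe_forward(tokens, router, experts).shape)       # torch.Size([4, 64])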
MiMo-V2-Flash addresses the quadratic complexity of long-context attention by interleaving Local Sliding Window Attention (SWA) and Global Attention (GA) at a 5:1 ratio. The SWA layers use an aggressive 128-token window with a learnable attention sink bias.
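For intuition, here is a minimal sketch of how a 128-token sliding-window mask differs from a full causal (global) mask; the mask construction is a generic illustration, not MiMo-V2-Flash's actual attention kernel, and the learnable sink bias is omitted.
import torch

def sliding_window_mask(seq_len, window=128):
    # Each query attends only to the most recent `window` keys (itself included),
    # so attention cost grows linearly with sequence length.
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

def global_mask(seq_len):
    # Standard causal mask: every query sees all previous keys (quadratic cost).
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# With the 5:1 interleave, five SWA layers use the local mask
# for every one GA layer that uses the full causal mask.
print(sliding_window_mask(8, window=3).int())
print(global_mask(8).int())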
The native MTP module uses a dense FFN (instead of MoE) and SWA (instead of GA), keeping its parameter count to just 0.33B per block. This enables self-speculative decoding, tripling generation speed.
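Conceptually, self-speculative decoding follows the loop sketched below: a cheap draft head proposes several tokens and the full model verifies them in a single pass, keeping the accepted prefix. The draft and verify callables are hypothetical stand-ins for the MTP module and the main model, not MiMo's real interfaces.
def speculative_decode(prompt, draft, verify, n_draft=4, max_tokens=64):
    # draft(tokens, n)     -> n cheap candidate tokens from the small draft head
    # verify(tokens, cand) -> the prefix of `cand` the full model accepts,
    #                         plus one corrected/extra token from the full model
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        candidates = draft(tokens, n_draft)            # cheap multi-token guess
        accepted, fixup = verify(tokens, candidates)   # one full-model verification pass
        tokens.extend(accepted + [fixup])              # keep only the verified prefix
    return tokens

# Toy stand-ins: the draft head guesses consecutive integers and the verifier
# accepts them all; real models would reject some drafted tokens.
toy_draft = lambda toks, n: [toks[-1] + i + 1 for i in range(n)]
toy_verify = lambda toks, cand: (cand, cand[-1] + 1)
print(speculative_decode([0], toy_draft, toy_verify, max_tokens=12))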
From vibe coding to enterprise agents, MiMo-V2-Flash powers the next generation of AI applications across every domain.
Integrate MiMo-V2-Flash with Cursor, Cline, and Claude Code for lightning-fast code generation, refactoring, and debugging at 150 tok/s.
Build sophisticated AI agents with MiMo-V2-Flash's agentic capabilities, trained with large-scale RL on 100,000+ GitHub issues.
Leverage SWE-Bench #1 performance for automated code review, bug detection, and intelligent refactoring suggestions.
Process large datasets with 256K context window. Generate analysis scripts, visualizations, and reports instantly.
Create complete web applications, landing pages, and components with MiMo-V2-Flash's superior HTML/CSS/JS generation.
Generate high-quality content, documentation, and creative writing with MiMo-V2-Flash's 86.2% Arena-Hard Creative Writing score.
Get started with MiMo-V2-Flash in minutes. Deploy locally with SGLang or access via the free API.
Install the SGLang framework for optimized MiMo-V2-Flash inference
Start the MiMo-V2-Flash server with recommended configuration
Query MiMo-V2-Flash via OpenAI-compatible API endpoints
# Install SGLang for MiMo-V2-Flash deployment
pip install sglang
# Launch MiMo-V2-Flash server
python3 -m sglang.launch_server \
--model-path XiaomiMiMo/MiMo-V2-Flash \
--served-model-name mimo-v2-flash \
--tp-size 8 \
--dp-size 2 \
--enable-dp-attention \
--host 0.0.0.0 \
--port 9001 \
--trust-remote-code \
--context-length 262144 \
--enable-mtp
import openai
# Connect to MiMo-V2-Flash API
client = openai.OpenAI(
base_url="http://localhost:9001/v1",
api_key="mimo-v2-flash"
)
# Send request to MiMo-V2-Flash
response = client.chat.completions.create(
model="mimo-v2-flash",
messages=[{
"role": "user",
"content": "Write a binary search in Python"
}],
max_tokens=4096,
temperature=0.8,
extra_body={"enable_thinking": True}
)
print(response.choices[0].message.content)
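To watch tokens arrive at the advertised 150 tok/s, streaming is usually more instructive than waiting for the full completion; the snippet below reuses the client configured above and standard OpenAI-compatible streaming (assuming the server supports it).
# Stream tokens as they are generated (same client as above)
stream = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[{"role": "user", "content": "Explain quicksort in two sentences"}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()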
Experience enterprise-grade AI at a fraction of the cost. MiMo-V2-Flash API is currently FREE with limited usage.
Deploy MiMo-V2-Flash on your own infrastructure
Cloud API with instant access
Access via OpenRouter marketplace
Seamlessly integrate MiMo-V2-Flash with your favorite development tools and frameworks.
Configure MiMo-V2-Flash as your AI coding assistant in Cursor for 150 tok/s code generation.
Use MiMo-V2-Flash with the Cline VS Code extension for autonomous coding tasks.
Day-0 SGLang support with MTP acceleration for optimal MiMo-V2-Flash performance.
Deploy MiMo-V2-Flash in containerized environments with pre-built Docker images.
Scale MiMo-V2-Flash deployments with Kubernetes orchestration and Ray clusters.
Join thousands of developers who have already experienced the power of MiMo-V2-Flash.
MiMo-V2-Flash is insanely fast! The 150 tok/s generation speed completely transformed my coding workflow. It's like having a supercharged AI pair programmer.
Finally an open-source model that rivals Claude for coding tasks. The SWE-Bench scores don't lie - MiMo-V2-Flash handles complex refactoring tasks brilliantly.
The cost savings are unreal. We switched from Claude Sonnet to MiMo-V2-Flash and cut our AI costs by 97%. Performance is just as good, sometimes better.
MiMo-V2-Flash introduces several groundbreaking innovations that set it apart from other open-source language models. Understanding these technical details helps developers maximize the model's potential.
# English System Prompt for MiMo-V2-Flash
"""
You are MiMo, an AI assistant developed by Xiaomi.
Today's date: {date} {week}.
Your knowledge cutoff date is December 2024.
"""
# Chinese System Prompt
"""
ไฝ ๆฏMiMo๏ผไธญๆๅ็งฐไนๆฏMiMo๏ผ๏ผๆฏๅฐ็ฑณๅ
ฌๅธ็ ๅ็AIๆบ่ฝๅฉๆใ
ไปๅคฉ็ๆฅๆ๏ผ{date} {week}๏ผ
ไฝ ็็ฅ่ฏๆชๆญขๆฅๆๆฏ2024ๅนด12ๆใ
"""
Complete API reference for integrating MiMo-V2-Flash into your applications.
Choose the right hardware configuration for your MiMo-V2-Flash deployment needs.
Pipeline parallel with FP8 quantization
Full performance with tensor parallel
Maximum throughput for production
Connect with thousands of developers building with MiMo-V2-Flash.
Common questions about MiMo-V2-Flash deployment, performance, and capabilities.
MiMo-V2-Flash is a state-of-the-art open-source Mixture-of-Experts (MoE) language model developed by Xiaomi's MiMo team, led by Luo Fuli (罗福莉). Released on December 16, 2025, it features 309B total parameters with 15B active parameters, delivering 150 tokens/sec inference speed and achieving #1 on SWE-Bench Verified among open-source models.
MiMo-V2-Flash achieves 150 tokens per second inference speed, making it 2-3x faster than comparable models. This speed is enabled by Multi-Token Prediction (MTP) technology with self-speculative decoding. The lightweight MTP module (0.33B params per block) triples generation speed while maintaining quality.
Yes! MiMo-V2-Flash is released under the MIT License, which means you can use, modify, distribute, and commercialize it without restrictions. The model weights are freely downloadable from Hugging Face, and the API is currently available with limited free usage at just $0.1/M input tokens.
For full performance at 150 tok/s with 256K context, you need 8x H100 80GB GPUs with tensor parallelism. Minimum deployment requires 4x H100 with pipeline parallelism and FP8 quantization, achieving ~50 tok/s with 128K context. Alternatively, use the cloud API for instant access without hardware requirements.
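As a back-of-envelope check on these figures (weights only, ignoring KV cache, activations, and framework overhead), 309B parameters occupy roughly 309 GB at FP8 and about 618 GB at BF16, which is why 4x and 8x H100 80GB are the respective floors:
# Rough weight-memory estimate; real deployments also need KV cache and activation memory
params_b = 309                                       # total parameters, in billions
for name, bytes_per_param in [("FP8", 1), ("BF16", 2)]:
    weight_gb = params_b * bytes_per_param           # ~GB, since 1B params * 1 byte ~ 1 GB
    gpus_needed = -(-weight_gb // 80)                # ceiling divide by 80 GB per H100
    print(f"{name}: ~{weight_gb} GB of weights -> at least {int(gpus_needed)}x H100 80GB")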
MiMo-V2-Flash achieves 73.4% on SWE-Bench Verified, surpassing DeepSeek-V3.2's 73.1% and approaching Claude Sonnet 4.5's 77.2%. On AIME 2025, MiMo scores 94.1% vs DeepSeek's 93.1%. The key advantage is cost: MiMo-V2-Flash costs just 2.5% of Claude Sonnet at $0.1/M input tokens vs $4.0/M.
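The 2.5% figure follows directly from the listed input prices:
# Input-token price comparison from the figures above (USD per 1M tokens)
mimo_input, claude_input = 0.10, 4.00
print(f"MiMo-V2-Flash costs {mimo_input / claude_input:.1%} of Claude Sonnet 4.5 per input token")
# -> MiMo-V2-Flash costs 2.5% of Claude Sonnet 4.5 per input token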
MiMo-V2-Flash has day-0 SGLang support. Install with pip install sglang, then launch: python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-V2-Flash --tp-size 8 --enable-mtp. Enable EAGLE speculative decoding for maximum speed. Full documentation is available on the LMSYS blog.
Hybrid Thinking mode enables MiMo-V2-Flash to show its reasoning process through a reasoning_content field alongside tool_calls. Enable it with "enable_thinking": true in your API request. For multi-turn conversations, persist all reasoning_content in the messages array for consistent context.
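A sketch of a two-turn exchange that persists reasoning_content as described above; field names follow this page, but the exact response shape may vary by server version, so treat this as an assumption-laden example rather than a definitive client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9001/v1", api_key="mimo-v2-flash")

# Turn 1: ask with thinking enabled, then keep reasoning_content in the history
messages = [{"role": "user", "content": "Plan a migration from REST to gRPC"}]
first = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=messages,
    extra_body={"enable_thinking": True},
)
assistant = first.choices[0].message
messages.append({
    "role": "assistant",
    "content": assistant.content,
    # Persist the reasoning so later turns keep a consistent context (assumed field name)
    "reasoning_content": getattr(assistant, "reasoning_content", None),
})

# Turn 2: the follow-up request sees both the earlier answer and its reasoning
messages.append({"role": "user", "content": "Now estimate a rollout timeline"})
second = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=messages,
    extra_body={"enable_thinking": True},
)
print(second.choices[0].message.content)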
Yes! MiMo-V2-Flash is fully compatible with vibe coding tools like Cursor, Cline, and Claude Code. Configure your custom model endpoint to point to the MiMo-V2-Flash API or your local SGLang server. Experience 150 tok/s code generation with SWE-Bench #1 performance at a fraction of the cost.
Comprehensive comparison of MiMo-V2-Flash against leading open-source and proprietary AI models.
| Feature | MiMo-V2-Flash | DeepSeek-V3.2 | Kimi K2 | Claude Sonnet 4.5 | GPT-5 |
|---|---|---|---|---|---|
| Total Parameters | 309B | 671B | 1043B | Unknown | Unknown |
| Active Parameters | 15B | 37B | 32B | Unknown | Unknown |
| Context Window | 256K | 128K | 128K | 200K | 128K |
| Inference Speed | 150 tok/s | ~60 tok/s | ~50 tok/s | ~40 tok/s | ~45 tok/s |
| SWE-Bench Verified | 73.4% | 73.1% | 71.3% | 77.2% | 74.9% |
| Input Cost (per 1M) | $0.10 | $0.60 | $0.55 | $4.00 | $5.00 |
| Open Source | ✓ | ✓ | ✓ | ✗ | ✗ |
| MIT License | ✓ | ✓ | ✗ | ✗ | ✗ |
| MoE Architecture | ✓ | ✓ | ✓ | ✗ | ✗ |
| Multi-Token Prediction | ✓ | ✓ | ✗ | ✗ | ✗ |
Stay updated on the latest developments and upcoming features for MiMo-V2-Flash.
309B parameter model released with 150 tok/s inference, 256K context, and SWE-Bench #1 performance. Full weights available on Hugging Face under MIT license.
Native SGLang integration with MTP acceleration, EAGLE speculative decoding, and optimized inference configurations.
Vision capabilities addition enabling image understanding, chart analysis, and multimodal reasoning tasks.
GGUF, EXL2, and AWQ quantized versions for consumer GPU deployment on RTX 4090 and similar hardware.
Access MiMo-V2-Flash model weights, technical reports, and deployment resources.
Instruction-tuned model with MOPD and Agentic RL post-training.
Pre-trained base model for fine-tuning and research purposes.
Comprehensive technical documentation covering architecture and training.
Open-sourced Multi-Token Prediction weights for research.