Mercury: Innovation in Ultra-Fast Diffusion-Based Language Models

Mercury, announced by Inception Labs, is an innovative large language model based on diffusion that surpasses the limitations of conventional autoregressive models to deliver ultra-fast inference performance.

How Diffusion Models Transform Language Generation

Limitations of Existing Language Models

Traditional autoregressive language models carry the following fundamental constraints:

Sequential generation: only one token can be generated at a time
Increasing latency: wait times accumulate when generating long text
Inefficient GPU utilization: parallel processing capabilities are underutilized
Real-time application constraints: degraded user experience in code autocomplete, agent workflows, and similar scenarios

Mercury’s Diffusion Approach

Mercury addresses these issues through parallel token generation:

# Conventional autoregressive approach
for i in range(sequence_length):
    token = model.generate_next_token(context)
    context.append(token)  # sequential generation

# Mercury diffusion approach
noisy_tokens = initialize_random_noise(sequence_length)
for step in range(diffusion_steps):
    # Improve all tokens simultaneously
    noisy_tokens = model.denoise_all_tokens(noisy_tokens)

The Mercury Coder Model Family

Model Lineup

Mercury Coder Mini

Speed-focused: 1,109 tokens/sec (on H100 GPU)
Use case: real-time code autocomplete, rapid prototyping
Quality: outperforms open-source speed-optimized models

Mercury Coder Small

Balanced performance: 737 tokens/sec
Quality: on par with commercial speed-optimized models
Versatility: supports complex coding tasks and reasoning workloads

Technical Architecture

Diffusion Training Process

Mercury’s training consists of a forward process and a reverse process:

Forward Process (adding noise):

Clean text x -> noisy z1 -> z2 -> ... -> fully noisy zT

Reverse Process (removing noise):

Fully noisy zT -> zT-1 -> ... -> z1 -> recovered text x

Training objective:

L(x) = -E_t[gamma(t) * E_{z_t~q} log p_theta(x|z_t)]

Transformer-Based Architecture

Mercury adopts a Transformer architecture, gaining the following advantages:

Compatibility: fully compatible with existing optimization techniques
Scalability: architecture well suited to large-scale model training
Efficiency: can leverage low-level operation optimizations

Evaluation Results

Coding Benchmark Performance

Model	HumanEval	MBPP	MultiPL-E	Speed (tokens/sec)
Mercury Coder Mini	88.0	77.1	74.1	1,109
Mercury Coder Small	90.0	76.6	76.2	737
GPT-4o Mini	88.0	74.6	72.0	59
Claude 3.5 Haiku	86.0	78.0	72.3	61
Gemini 2.0 Flash Lite	90.0	75.0	79.5	201

Multilingual Code Generation

MultiPL-E benchmark results (accuracy %):

Model	C++	Java	JavaScript	PHP	Bash	TypeScript	Avg
Mercury Coder Mini	78.9	74.5	78.9	72.7	56.5	83.2	74.1
Mercury Coder Small	82.0	80.1	83.9	78.3	50.1	82.6	76.2
Codestral 2501	80.1	72.7	83.2	73.9	47.2	83.2	73.4

Fill-in-the-Middle (FIM) Performance

Performance in code autocomplete scenarios:

Model	Single-Line	Random-Span	Average
Mercury Coder Mini	92.9	71.5	82.2
Mercury Coder Small	93.1	76.5	84.8
Codestral 2501	93.0	72.0	82.5
GPT-4o Mini	74.8	47.0	60.9

Real-World User Evaluation: Copilot Arena

Performance Rankings

Model	Latency (s)	Latency Rank	Elo Score	Quality Rank
DeepSeek V2.5 (FIM)	2.07	11	1025	1
Claude 3.5 Sonnet	1.46	8	1003	1
Mercury Coder Mini	0.25	1	993	2
Codestral	0.31	2	992	2
GPT-4o	0.76	5	980	3

Key observation: Mercury Coder Mini achieved quality rank 2 while simultaneously recording the fastest speed.

Core Technical Innovations

Parallel Inference Optimization

Mercury’s speed gains come from the following system-level optimizations:

Dynamic batching:

class MercuryInferenceEngine:
    def __init__(self):
        self.dynamic_batcher = DynamicBatcher()
        self.custom_kernels = ParallelInferenceKernels()

    def generate(self, prompts, quality_speed_tradeoff=0.5):
        # Automatic quality-speed tradeoff adjustment
        batch = self.dynamic_batcher.optimize_batch(prompts)
        return self.parallel_diffusion_sample(batch)

Custom CUDA kernels:

Maximum utilization of GPU memory bandwidth
Optimized parallel denoising operations
Dynamic batch size adjustment

Guaranteed Compatibility

Mercury provides full compatibility with existing ecosystems:

OpenAI API compatible:

# Existing OpenAI API code can be used as-is
response = openai.ChatCompletion.create(
    model="mercury-coder-mini",  # only the model name changes
    messages=[{"role": "user", "content": "Write a Python function"}],
    max_tokens=500
)

Fine-tuning support:

RLHF (Reinforcement Learning from Human Feedback)
DPO (Direct Preference Optimization)
Conventional instruction tuning methodologies applicable

Practical Use Cases

Code Autocomplete System

class MercuryCodeCompletion:
    def __init__(self):
        self.model = MercuryCoder("mini")
        self.cache = CompletionCache()

    async def complete_code(self, context, cursor_position):
        # Real-time completion at an average latency of 25 ms
        start_time = time.time()
        completion = await self.model.fill_in_middle(
            prefix=context[:cursor_position],
            suffix=context[cursor_position:],
            max_tokens=100
        )
        latency = time.time() - start_time

        # Typically achieves under 25 ms
        assert latency < 0.1, f"Latency exceeded: {latency}s"
        return completion

Agent Workflows

Fast inference handles complex multi-step tasks efficiently:

class CodeReviewAgent:
    def __init__(self):
        self.mercury = MercuryCoder("small")

    async def review_pr(self, code_diff):
        # Run multiple analyses in parallel
        tasks = [
            self.mercury.analyze_security(code_diff),
            self.mercury.check_performance(code_diff),
            self.mercury.suggest_improvements(code_diff),
            self.mercury.generate_tests(code_diff)
        ]

        # Entire analysis completes within a few seconds
        results = await asyncio.gather(*tasks)
        return self.consolidate_review(results)

Scalability and Future Outlook

Scaling Characteristics

Mercury models demonstrate consistent performance improvements as size increases:

Mercury Coder Small consistently outperforms Mini across all benchmarks
Validates scaling laws for diffusion LLMs
Indicates strong potential for further gains with larger models

Application Expansion

Current: coding-specialized model
Planned:

General text generation model
Multimodal model (code + images)
Domain-specialized models (science, mathematics, law, etc.)

Industry Impact

Cost efficiency:

Substantially reduced inference costs
Real-time service deployment becomes feasible
Edge computing environments become viable

User experience innovation:

Natural interaction through minimized latency
Real-time processing of complex reasoning tasks
New possibilities for interactive coding tools

Technical Challenges and Solutions

Inherent Challenges of Diffusion Models

Handling discrete data:

Language consists of discrete tokens, unlike continuous images
Complexity of the noise-adding and noise-removing process
Solution: developed new noising and denoising algorithms

Minimizing inference steps:

Need to reduce diffusion steps while maintaining quality
Solution: adaptive sampling algorithms and custom schedulers

System Optimization

Memory efficiency:

class MemoryEfficientDiffusion:
    def __init__(self):
        self.gradient_checkpointing = True
        self.mixed_precision = True
        self.dynamic_batching = True

    def optimize_memory_usage(self, batch_size, sequence_length):
        # Dynamically adjust memory usage
        optimal_config = self.calculate_optimal_config(
            available_memory=torch.cuda.get_device_properties(0).total_memory,
            batch_size=batch_size,
            sequence_length=sequence_length
        )
        return optimal_config

Conclusion

Mercury has achieved the following innovations through the new paradigm of diffusion-based language models:

Core Achievements

Speed innovation: inference speed up to 10x faster than existing models
Quality maintained: code generation quality equivalent to commercial models
Practicality: verified performance in real developer environments
Scalability: demonstrated potential for scaling to larger models

Industry Impact

Immediate impact:

Transformed user experience for code autocomplete tools
Real-time AI coding assistant deployment becomes feasible
Activation of agent-based development workflows

Long-term outlook:

Fundamental shift in the cost structure of AI inference
Emergence of new forms of interactive AI services
High-performance LLM utilization in edge computing environments

Mercury goes beyond a simple performance improvement: it signifies a paradigm shift for AI language models. This research, which proves that a diffusion-based approach can deliver results in language generation as revolutionary as those it has achieved in image generation, points the way toward developing faster and more efficient AI systems.

Original paper: Mercury: Ultra-Fast Language Models Based on Diffusion
API and playground: platform.inceptionlabs.ai | chat.inceptionlabs.ai

Mercury: Innovation in Ultra-Fast Diffusion-Based Language Models

How Diffusion Models Transform Language Generation

Limitations of Existing Language Models

Mercury’s Diffusion Approach

The Mercury Coder Model Family

Model Lineup

Technical Architecture

Diffusion Training Process

Transformer-Based Architecture

Evaluation Results

Coding Benchmark Performance

Multilingual Code Generation

Fill-in-the-Middle (FIM) Performance

Real-World User Evaluation: Copilot Arena

Performance Rankings

Core Technical Innovations

Parallel Inference Optimization

Guaranteed Compatibility

Practical Use Cases

Code Autocomplete System

Agent Workflows

Scalability and Future Outlook

Scaling Characteristics

Application Expansion

Industry Impact

Technical Challenges and Solutions

Inherent Challenges of Diffusion Models

System Optimization

Conclusion

Core Achievements

Industry Impact

참고

SkillOpt: 에이전트 스킬을 훈련 가능한 텍스트 컴포넌트로 최적화하다 (arXiv:2605.23904)

보상 없이 스스로 진화하는 LLM 에이전트: 월드 노리지 탐색 기반 학습 (arXiv:2604.18131)

코드가 에이전트 하네스다: AI 에이전트 인프라의 세 계층 구조 (arXiv:2605.18747)

Autogenesis: 에이전트가 스스로를 고치는 자기진화 프로토콜 (arXiv:2604.15034)