Mercury, announced by Inception Labs, is an innovative large language model based on diffusion that surpasses the limitations of conventional autoregressive models to deliver ultra-fast inference performance.

How Diffusion Models Transform Language Generation

Limitations of Existing Language Models

Traditional autoregressive language models carry the following fundamental constraints:

  • Sequential generation: only one token can be generated at a time
  • Increasing latency: wait times accumulate when generating long text
  • Inefficient GPU utilization: parallel processing capabilities are underutilized
  • Real-time application constraints: degraded user experience in code autocomplete, agent workflows, and similar scenarios

Mercury’s Diffusion Approach

Mercury addresses these issues through parallel token generation:

# Conventional autoregressive approach
for i in range(sequence_length):
    token = model.generate_next_token(context)
    context.append(token)  # sequential generation

# Mercury diffusion approach
noisy_tokens = initialize_random_noise(sequence_length)
for step in range(diffusion_steps):
    # Improve all tokens simultaneously
    noisy_tokens = model.denoise_all_tokens(noisy_tokens)

The Mercury Coder Model Family

Model Lineup

Mercury Coder Mini

  • Speed-focused: 1,109 tokens/sec (on H100 GPU)
  • Use case: real-time code autocomplete, rapid prototyping
  • Quality: outperforms open-source speed-optimized models

Mercury Coder Small

  • Balanced performance: 737 tokens/sec
  • Quality: on par with commercial speed-optimized models
  • Versatility: supports complex coding tasks and reasoning workloads

Technical Architecture

Diffusion Training Process

Mercury’s training consists of a forward process and a reverse process:

Forward Process (adding noise):

Clean text x -> noisy z1 -> z2 -> ... -> fully noisy zT

Reverse Process (removing noise):

Fully noisy zT -> zT-1 -> ... -> z1 -> recovered text x

Training objective:

L(x) = -E_t[gamma(t) * E_{z_t~q} log p_theta(x|z_t)]

Transformer-Based Architecture

Mercury adopts a Transformer architecture, gaining the following advantages:

  • Compatibility: fully compatible with existing optimization techniques
  • Scalability: architecture well suited to large-scale model training
  • Efficiency: can leverage low-level operation optimizations

Evaluation Results

Coding Benchmark Performance

Model HumanEval MBPP MultiPL-E Speed (tokens/sec)
Mercury Coder Mini 88.0 77.1 74.1 1,109
Mercury Coder Small 90.0 76.6 76.2 737
GPT-4o Mini 88.0 74.6 72.0 59
Claude 3.5 Haiku 86.0 78.0 72.3 61
Gemini 2.0 Flash Lite 90.0 75.0 79.5 201

Multilingual Code Generation

MultiPL-E benchmark results (accuracy %):

Model C++ Java JavaScript PHP Bash TypeScript Avg
Mercury Coder Mini 78.9 74.5 78.9 72.7 56.5 83.2 74.1
Mercury Coder Small 82.0 80.1 83.9 78.3 50.1 82.6 76.2
Codestral 2501 80.1 72.7 83.2 73.9 47.2 83.2 73.4

Fill-in-the-Middle (FIM) Performance

Performance in code autocomplete scenarios:

Model Single-Line Random-Span Average
Mercury Coder Mini 92.9 71.5 82.2
Mercury Coder Small 93.1 76.5 84.8
Codestral 2501 93.0 72.0 82.5
GPT-4o Mini 74.8 47.0 60.9

Real-World User Evaluation: Copilot Arena

Performance Rankings

Model Latency (s) Latency Rank Elo Score Quality Rank
DeepSeek V2.5 (FIM) 2.07 11 1025 1
Claude 3.5 Sonnet 1.46 8 1003 1
Mercury Coder Mini 0.25 1 993 2
Codestral 0.31 2 992 2
GPT-4o 0.76 5 980 3

Key observation: Mercury Coder Mini achieved quality rank 2 while simultaneously recording the fastest speed.

Core Technical Innovations

Parallel Inference Optimization

Mercury’s speed gains come from the following system-level optimizations:

Dynamic batching:

class MercuryInferenceEngine:
    def __init__(self):
        self.dynamic_batcher = DynamicBatcher()
        self.custom_kernels = ParallelInferenceKernels()

    def generate(self, prompts, quality_speed_tradeoff=0.5):
        # Automatic quality-speed tradeoff adjustment
        batch = self.dynamic_batcher.optimize_batch(prompts)
        return self.parallel_diffusion_sample(batch)

Custom CUDA kernels:

  • Maximum utilization of GPU memory bandwidth
  • Optimized parallel denoising operations
  • Dynamic batch size adjustment

Guaranteed Compatibility

Mercury provides full compatibility with existing ecosystems:

OpenAI API compatible:

# Existing OpenAI API code can be used as-is
response = openai.ChatCompletion.create(
    model="mercury-coder-mini",  # only the model name changes
    messages=[{"role": "user", "content": "Write a Python function"}],
    max_tokens=500
)

Fine-tuning support:

  • RLHF (Reinforcement Learning from Human Feedback)
  • DPO (Direct Preference Optimization)
  • Conventional instruction tuning methodologies applicable

Practical Use Cases

Code Autocomplete System

class MercuryCodeCompletion:
    def __init__(self):
        self.model = MercuryCoder("mini")
        self.cache = CompletionCache()

    async def complete_code(self, context, cursor_position):
        # Real-time completion at an average latency of 25 ms
        start_time = time.time()
        completion = await self.model.fill_in_middle(
            prefix=context[:cursor_position],
            suffix=context[cursor_position:],
            max_tokens=100
        )
        latency = time.time() - start_time

        # Typically achieves under 25 ms
        assert latency < 0.1, f"Latency exceeded: {latency}s"
        return completion

Agent Workflows

Fast inference handles complex multi-step tasks efficiently:

class CodeReviewAgent:
    def __init__(self):
        self.mercury = MercuryCoder("small")

    async def review_pr(self, code_diff):
        # Run multiple analyses in parallel
        tasks = [
            self.mercury.analyze_security(code_diff),
            self.mercury.check_performance(code_diff),
            self.mercury.suggest_improvements(code_diff),
            self.mercury.generate_tests(code_diff)
        ]

        # Entire analysis completes within a few seconds
        results = await asyncio.gather(*tasks)
        return self.consolidate_review(results)

Scalability and Future Outlook

Scaling Characteristics

Mercury models demonstrate consistent performance improvements as size increases:

  • Mercury Coder Small consistently outperforms Mini across all benchmarks
  • Validates scaling laws for diffusion LLMs
  • Indicates strong potential for further gains with larger models

Application Expansion

Current: coding-specialized model
Planned:

  • General text generation model
  • Multimodal model (code + images)
  • Domain-specialized models (science, mathematics, law, etc.)

Industry Impact

Cost efficiency:

  • Substantially reduced inference costs
  • Real-time service deployment becomes feasible
  • Edge computing environments become viable

User experience innovation:

  • Natural interaction through minimized latency
  • Real-time processing of complex reasoning tasks
  • New possibilities for interactive coding tools

Technical Challenges and Solutions

Inherent Challenges of Diffusion Models

Handling discrete data:

  • Language consists of discrete tokens, unlike continuous images
  • Complexity of the noise-adding and noise-removing process
  • Solution: developed new noising and denoising algorithms

Minimizing inference steps:

  • Need to reduce diffusion steps while maintaining quality
  • Solution: adaptive sampling algorithms and custom schedulers

System Optimization

Memory efficiency:

class MemoryEfficientDiffusion:
    def __init__(self):
        self.gradient_checkpointing = True
        self.mixed_precision = True
        self.dynamic_batching = True

    def optimize_memory_usage(self, batch_size, sequence_length):
        # Dynamically adjust memory usage
        optimal_config = self.calculate_optimal_config(
            available_memory=torch.cuda.get_device_properties(0).total_memory,
            batch_size=batch_size,
            sequence_length=sequence_length
        )
        return optimal_config

Conclusion

Mercury has achieved the following innovations through the new paradigm of diffusion-based language models:

Core Achievements

  1. Speed innovation: inference speed up to 10x faster than existing models
  2. Quality maintained: code generation quality equivalent to commercial models
  3. Practicality: verified performance in real developer environments
  4. Scalability: demonstrated potential for scaling to larger models

Industry Impact

Immediate impact:

  • Transformed user experience for code autocomplete tools
  • Real-time AI coding assistant deployment becomes feasible
  • Activation of agent-based development workflows

Long-term outlook:

  • Fundamental shift in the cost structure of AI inference
  • Emergence of new forms of interactive AI services
  • High-performance LLM utilization in edge computing environments

Mercury goes beyond a simple performance improvement: it signifies a paradigm shift for AI language models. This research, which proves that a diffusion-based approach can deliver results in language generation as revolutionary as those it has achieved in image generation, points the way toward developing faster and more efficient AI systems.

Original paper: Mercury: Ultra-Fast Language Models Based on Diffusion
API and playground: platform.inceptionlabs.ai | chat.inceptionlabs.ai