Mercury: Innovation in Ultra-Fast Diffusion-Based Language Models
Mercury, announced by Inception Labs, is a diffusion-based large language model. By generating tokens in parallel rather than one at a time, it avoids the sequential bottleneck of conventional autoregressive models and delivers much faster inference.
How Diffusion Models Transform Language Generation
Limitations of Existing Language Models
Traditional autoregressive language models carry the following fundamental constraints:
- Sequential generation: only one token can be generated at a time
- Increasing latency: wait times accumulate when generating long text
- Inefficient GPU utilization: parallel processing capabilities are underutilized
- Real-time application constraints: degraded user experience in code autocomplete, agent workflows, and similar scenarios
Mercury’s Diffusion Approach
Mercury addresses these issues through parallel token generation:
# Conventional autoregressive approach
for i in range(sequence_length):
    token = model.generate_next_token(context)
    context.append(token)  # sequential generation
# Mercury diffusion approach
noisy_tokens = initialize_random_noise(sequence_length)
for step in range(diffusion_steps):
    # Improve all tokens simultaneously
    noisy_tokens = model.denoise_all_tokens(noisy_tokens)
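To make the contrast concrete, here is a minimal runnable sketch. It assumes a masked-token style of discrete diffusion, which the source does not confirm for Mercury, and it uses a hypothetical `toy_model` stub that always proposes the correct tokens. Only the number of forward passes is meaningful here, not the output quality.

```python
import math

MASK = "<mask>"

def toy_model(tokens, target):
    """Hypothetical stand-in for one forward pass: proposes a token for every position."""
    return list(target)

def autoregressive_decode(target):
    """Sequential decoding: one forward pass per generated token."""
    out, passes = [], 0
    for i in range(len(target)):
        proposal = toy_model(out, target)
        passes += 1
        out.append(proposal[i])
    return out, passes

def parallel_denoise(target, steps):
    """Parallel refinement: each pass proposes all positions, and a share is committed."""
    tokens = [MASK] * len(target)
    passes = 0
    for step in range(steps):
        proposal = toy_model(tokens, target)
        passes += 1
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        # Commit an equal share of the remaining masked positions at each step.
        k = math.ceil(len(masked) / (steps - step))
        for i in masked[:k]:
            tokens[i] = proposal[i]
    return tokens, passes

if __name__ == "__main__":
    target = "def add(a, b): return a + b".split()
    print(autoregressive_decode(target)[1], "passes (autoregressive)")
    print(parallel_denoise(target, steps=3)[1], "passes (parallel denoising)")
```

The sequential loop needs as many passes as there are tokens, while the parallel loop needs only as many passes as there are denoising steps, which is where the latency advantage would come from when the step count is much smaller than the sequence length.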
The Mercury Coder Model Family
Model Lineup
Mercury Coder Mini
- Speed-focused: 1,109 tokens/sec (on H100 GPU)
- Use case: real-time code autocomplete, rapid prototyping
- Quality: outperforms open-source speed-optimized models
Mercury Coder Small
- Balanced performance: 737 tokens/sec
- Quality: on par with commercial speed-optimized models
- Versatility: supports complex coding tasks and reasoning workloads
Technical Architecture
Diffusion Training Process
Mercury’s training consists of a forward process and a reverse process:
Forward Process (adding noise):
Clean text x -> noisy z_1 -> z_2 -> ... -> fully noisy z_T
Reverse Process (removing noise):
Fully noisy z_T -> z_{T-1} -> ... -> z_1 -> recovered text x
Training objective:
L(x) = -E_t[gamma(t) * E_{z_t~q} log p_theta(x|z_t)]
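The objective can be read as a weighted denoising loss averaged over noise levels. The sketch below estimates it by Monte Carlo sampling under assumptions the source does not state: the forward process q masks each token independently with probability t, the weighting is gamma(t) = 1/t, and `uniform_log_prob` is a toy stand-in for p_theta that copies visible tokens and guesses masked ones uniformly. All function names are hypothetical.

```python
import math
import random

MASK = "<mask>"

def forward_mask(x, t, rng):
    """Assumed forward process q(z_t | x): mask each token with probability t."""
    return [MASK if rng.random() < t else tok for tok in x]

def uniform_log_prob(x, z_t, vocab_size):
    """Toy log p_theta(x | z_t): visible tokens are copied, masked ones guessed uniformly."""
    n_masked = sum(1 for tok in z_t if tok == MASK)
    return -n_masked * math.log(vocab_size)

def diffusion_loss(x, vocab_size, num_samples=1000, seed=0):
    """Monte Carlo estimate of L(x) = -E_t[gamma(t) * E_{z_t~q} log p_theta(x | z_t)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        t = rng.uniform(0.1, 1.0)       # sample a noise level
        z_t = forward_mask(x, t, rng)   # sample z_t ~ q(z_t | x)
        gamma = 1.0 / t                 # assumed weighting gamma(t)
        total += -gamma * uniform_log_prob(x, z_t, vocab_size)
    return total / num_samples

if __name__ == "__main__":
    x = ["def", "f", "(", ")", ":"]
    print(diffusion_loss(x, vocab_size=100))
```

With this particular weighting, the expected loss of the uniform-guess model is len(x) * log(vocab_size) at every noise level, which gives a simple sanity check for the estimator. A trained model would score lower by predicting masked tokens better than chance.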
Transformer-Based Architecture
Mercury adopts a Transformer architecture, gaining the following advantages:
- Compatibility: fully compatible with existing optimization techniques
- Scalability: architecture well suited to large-scale model training
- Efficiency: can leverage low-level operation optimizations
Evaluation Results
Coding Benchmark Performance
| Model | HumanEval | MBPP | MultiPL-E | Speed (tokens/sec) |
|---|---|---|---|---|
| Mercury Coder Mini | 88.0 | 77.1 | 74.1 | 1,109 |
| Mercury Coder Small | 90.0 | 76.6 | 76.2 | 737 |
| GPT-4o Mini | 88.0 | 74.6 | 72.0 | 59 |
| Claude 3.5 Haiku | 86.0 | 78.0 | 72.3 | 61 |
| Gemini 2.0 Flash Lite | 90.0 | 75.0 | 79.5 | 201 |
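The throughput column translates directly into speedup ratios and generation times. The short calculation below uses only the tokens/sec figures from the table above. The 500-token response length is an illustrative assumption, and the estimate ignores time to first token and network overhead.

```python
# Throughput in tokens/sec, copied from the benchmark table.
throughput = {
    "Mercury Coder Mini": 1109,
    "Mercury Coder Small": 737,
    "GPT-4o Mini": 59,
    "Claude 3.5 Haiku": 61,
    "Gemini 2.0 Flash Lite": 201,
}

def speedup(model, baseline):
    """Ratio of the two models' throughput."""
    return throughput[model] / throughput[baseline]

def seconds_to_generate(model, num_tokens):
    """Rough generation time, ignoring time to first token and network overhead."""
    return num_tokens / throughput[model]

if __name__ == "__main__":
    for baseline in ("GPT-4o Mini", "Claude 3.5 Haiku", "Gemini 2.0 Flash Lite"):
        ratio = speedup("Mercury Coder Mini", baseline)
        print(f"Mercury Coder Mini vs {baseline}: {ratio:.1f}x")
    for model in throughput:
        print(f"{model}: {seconds_to_generate(model, 500):.2f} s for 500 tokens")
```

On these figures the speedup depends heavily on the baseline: it is largest against the slowest models in the table and considerably smaller against Gemini 2.0 Flash Lite.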
Multilingual Code Generation
MultiPL-E benchmark results (accuracy %):
| Model | C++ | Java | JavaScript | PHP | Bash | TypeScript | Avg |
|---|---|---|---|---|---|---|---|
| Mercury Coder Mini | 78.9 | 74.5 | 78.9 | 72.7 | 56.5 | 83.2 | 74.1 |
| Mercury Coder Small | 82.0 | 80.1 | 83.9 | 78.3 | 50.1 | 82.6 | 76.2 |
| Codestral 2501 | 80.1 | 72.7 | 83.2 | 73.9 | 47.2 | 83.2 | 73.4 |
Fill-in-the-Middle (FIM) Performance
Performance in code autocomplete scenarios:
| Model | Single-Line | Random-Span | Average |
|---|---|---|---|
| Mercury Coder Mini | 92.9 | 71.5 | 82.2 |
| Mercury Coder Small | 93.1 | 76.5 | 84.8 |
| Codestral 2501 | 93.0 | 72.0 | 82.5 |
| GPT-4o Mini | 74.8 | 47.0 | 60.9 |
Real-World User Evaluation: Copilot Arena
Performance Rankings
| Model | Latency (s) | Latency Rank | Elo Score | Quality Rank |
|---|---|---|---|---|
| DeepSeek V2.5 (FIM) | 2.07 | 11 | 1025 | 1 |
| Claude 3.5 Sonnet | 1.46 | 8 | 1003 | 1 |
| Mercury Coder Mini | 0.25 | 1 | 993 | 2 |
| Codestral | 0.31 | 2 | 992 | 2 |
| GPT-4o | 0.76 | 5 | 980 | 3 |
Key observation: Mercury Coder Mini tied for quality rank 2 while also recording the lowest latency (0.25 s, latency rank 1).
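To get a feel for how small the Elo gaps in the table are, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes the standard Elo logistic formula with a scale of 400. The source does not describe how Copilot Arena computes its scores, so treat the numbers as an approximation.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected share of comparisons won by A over B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

if __name__ == "__main__":
    # Elo scores copied from the table above.
    mercury_mini, deepseek, codestral = 993, 1025, 992
    print(f"Mercury Coder Mini vs DeepSeek V2.5: {elo_expected_score(mercury_mini, deepseek):.3f}")
    print(f"Mercury Coder Mini vs Codestral: {elo_expected_score(mercury_mini, codestral):.3f}")
```

Under this assumption, a 32-point gap corresponds to the higher-rated model being preferred only slightly more than half the time, which is consistent with models being grouped into shared quality ranks.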
Core Technical Innovations
Parallel Inference Optimization
Mercury’s speed gains come from the following system-level optimizations:
Dynamic batching:
class MercuryInferenceEngine:
    def __init__(self):
        self.dynamic_batcher = DynamicBatcher()
        self.custom_kernels = ParallelInferenceKernels()

    def generate(self, prompts, quality_speed_tradeoff=0.5):
        # Automatic quality-speed tradeoff adjustment
        batch = self.dynamic_batcher.optimize_batch(prompts)
        return self.parallel_diffusion_sample(batch)
Custom CUDA kernels:
- Maximum utilization of GPU memory bandwidth
- Optimized parallel denoising operations
- Dynamic batch size adjustment
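The source does not describe how the batcher works internally. As one plausible illustration of dynamic batching, the sketch below groups queued prompts by length so that each padded batch stays within a token budget. The function name `build_batches` and the budget parameter are hypothetical, and this is not Mercury's actual scheduler.

```python
def build_batches(prompt_lengths, max_tokens_per_batch):
    """Group prompt indices into batches whose padded size fits a token budget.

    Padded size is (longest prompt in the batch) * (number of prompts), so
    sorting by length keeps similar prompts together and limits padding waste.
    """
    batches, current, longest = [], [], 0
    order = sorted(range(len(prompt_lengths)), key=lambda i: prompt_lengths[i])
    for idx in order:
        length = prompt_lengths[idx]
        new_longest = max(longest, length)
        if current and new_longest * (len(current) + 1) > max_tokens_per_batch:
            batches.append(current)        # close the current batch
            current, new_longest = [], length
        current.append(idx)
        longest = new_longest
    if current:
        batches.append(current)
    return batches

if __name__ == "__main__":
    lengths = [10, 200, 12, 180, 15]
    print(build_batches(lengths, max_tokens_per_batch=400))
```

Short prompts end up sharing a batch, and long prompts are packed separately, so the accelerator does less work on padding. A prompt longer than the budget is placed in a batch of its own rather than rejected.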
Guaranteed Compatibility
Mercury is designed to fit into existing ecosystems:
OpenAI API compatible:
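# Existing OpenAI client code can be reused; the model name, base URL, and API key change.
# MERCURY_BASE_URL and MERCURY_API_KEY are placeholder environment variable names.
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ["MERCURY_BASE_URL"], api_key=os.environ["MERCURY_API_KEY"])
response = client.chat.completions.create(
    model="mercury-coder-mini",
    messages=[{"role": "user", "content": "Write a Python function"}],
    max_tokens=500,
)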
Fine-tuning support:
- RLHF (Reinforcement Learning from Human Feedback)
- DPO (Direct Preference Optimization)
- Conventional instruction tuning methodologies applicable
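For readers unfamiliar with DPO from the list above, its per-example loss is short enough to write out. The sketch below implements the standard DPO formula on precomputed sequence log-probabilities. How those log-probabilities would be obtained for a diffusion model (for example through a likelihood bound) is not covered by the source, so they are treated as plain inputs here.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: -log(sigmoid(beta * margin)).

    The margin compares how much more the policy favors the chosen response
    over the rejected one, relative to a frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = -beta * margin
    # softplus(z) = log(1 + exp(z)), computed in a numerically stable form
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

if __name__ == "__main__":
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # policy equals reference
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))   # policy favors the chosen response more
```

When the policy matches the reference, the margin is zero and the loss equals log 2. The loss falls as the policy shifts probability toward the preferred response.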
Practical Use Cases
Code Autocomplete System
import time

class MercuryCodeCompletion:
    def __init__(self):
        self.model = MercuryCoder("mini")
        self.cache = CompletionCache()

    async def complete_code(self, context, cursor_position):
        # Copilot Arena reports an average latency of about 0.25 s for Mercury Coder Mini
        start_time = time.perf_counter()
        completion = await self.model.fill_in_middle(
            prefix=context[:cursor_position],
            suffix=context[cursor_position:],
            max_tokens=100
        )
        latency = time.perf_counter() - start_time
        # Flag completions that exceed an interactive budget (0.5 s is an illustrative value)
        if latency > 0.5:
            print(f"Latency budget exceeded: {latency:.3f}s")
        return completion
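The class above depends on a client that is not shown. To make the prefix/suffix split and the latency measurement testable without any service, the sketch below swaps in a hypothetical `StubFimModel` that returns a fixed completion. The stub, the `timed_completion` helper, and the 0.5 s budget are all assumptions for illustration.

```python
import asyncio
import time

class StubFimModel:
    """Hypothetical stand-in for a fill-in-the-middle model client."""

    async def fill_in_middle(self, prefix, suffix, max_tokens):
        await asyncio.sleep(0.01)  # simulate network and inference time
        return "a + b"

async def timed_completion(model, context, cursor_position, budget_s=0.5):
    """Split the buffer at the cursor, request the middle, and time the call."""
    prefix, suffix = context[:cursor_position], context[cursor_position:]
    start = time.perf_counter()
    completion = await model.fill_in_middle(prefix=prefix, suffix=suffix, max_tokens=100)
    latency = time.perf_counter() - start
    return {
        "text": prefix + completion + suffix,
        "latency_s": latency,
        "within_budget": latency <= budget_s,
    }

if __name__ == "__main__":
    source = "def add(a, b):\n    return \n"
    cursor = source.index("return ") + len("return ")
    print(asyncio.run(timed_completion(StubFimModel(), source, cursor)))
```

Returning the latency and a budget flag, rather than asserting, lets an editor plugin decide whether to show a late completion or drop it.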
Agent Workflows
Fast inference handles complex multi-step tasks efficiently:
import asyncio

class CodeReviewAgent:
    def __init__(self):
        self.mercury = MercuryCoder("small")

    async def review_pr(self, code_diff):
        # Run multiple analyses in parallel
        tasks = [
            self.mercury.analyze_security(code_diff),
            self.mercury.check_performance(code_diff),
            self.mercury.suggest_improvements(code_diff),
            self.mercury.generate_tests(code_diff)
        ]
        # Entire analysis completes within a few seconds
        results = await asyncio.gather(*tasks)
        return self.consolidate_review(results)
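The benefit of `asyncio.gather` in the agent above is that total wall time approaches the slowest single call rather than the sum of all calls. The sketch below demonstrates this with hypothetical `fake_analysis` coroutines that only sleep. No model is involved, and the names are placeholders.

```python
import asyncio
import time

async def fake_analysis(name, seconds):
    """Hypothetical stand-in for one model-backed analysis call."""
    await asyncio.sleep(seconds)
    return f"{name}: ok"

async def review(delay=0.1):
    """Run four analyses concurrently and report the elapsed wall time."""
    names = ["security", "performance", "improvements", "tests"]
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_analysis(n, delay) for n in names))
    elapsed = time.perf_counter() - start
    return results, elapsed

if __name__ == "__main__":
    results, elapsed = asyncio.run(review())
    print(results)
    print(f"elapsed: {elapsed:.2f} s for four 0.1 s calls")
```

`asyncio.gather` returns results in the order the awaitables were passed, so the consolidation step can rely on positions even though the calls finish at different times.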
Scalability and Future Outlook
Scaling Characteristics
Mercury models generally improve as size increases:
- Mercury Coder Small outperforms Mini on HumanEval, the MultiPL-E average, and both FIM tasks, while Mini is slightly ahead on MBPP and on the Bash and TypeScript subsets of MultiPL-E
- Consistent with scaling behavior for diffusion LLMs, although two model sizes are limited evidence
- Suggests potential for further gains with larger models
Application Expansion
Current: coding-specialized model
Planned:
- General text generation model
- Multimodal model (code + images)
- Domain-specialized models (science, mathematics, law, etc.)
Industry Impact
Cost efficiency:
- Substantially reduced inference costs
- Real-time service deployment becomes feasible
- Edge computing environments become viable
User experience innovation:
- Natural interaction through minimized latency
- Real-time processing of complex reasoning tasks
- New possibilities for interactive coding tools
Technical Challenges and Solutions
Inherent Challenges of Diffusion Models
Handling discrete data:
- Language consists of discrete tokens, unlike continuous images
- Complexity of the noise-adding and noise-removing process
- Solution: developed new noising and denoising algorithms
Minimizing inference steps:
- Need to reduce diffusion steps while maintaining quality
- Solution: adaptive sampling algorithms and custom schedulers
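The source does not detail its adaptive sampling algorithm. One common way to reduce step counts in masked-style discrete diffusion is to commit every position whose predicted confidence clears a threshold, so that easy sequences finish in fewer passes. The sketch below shows that idea with a hypothetical toy predictor. It is an illustration of the general technique, not Mercury's method.

```python
MASK = None

def adaptive_denoise(predict, length, threshold=0.9, max_steps=50):
    """Commit all masked positions with confidence >= threshold at each step.

    `predict(tokens)` must return a (token, confidence) pair per position.
    If nothing clears the threshold, the single most confident position is
    committed so the loop always makes progress.
    """
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens and steps < max_steps:
        proposals = predict(tokens)
        steps += 1
        masked = [i for i, tok in enumerate(tokens) if tok is MASK]
        confident = [i for i in masked if proposals[i][1] >= threshold]
        if not confident:
            confident = [max(masked, key=lambda i: proposals[i][1])]
        for i in confident:
            tokens[i] = proposals[i][0]
    return tokens, steps

def make_toy_predictor(target, base_confidence):
    """Toy model: confidence grows as more of the sequence is revealed."""
    def predict(tokens):
        revealed = sum(1 for tok in tokens if tok is not MASK)
        return [(target[i], min(1.0, base_confidence[i] + 0.1 * revealed))
                for i in range(len(target))]
    return predict

if __name__ == "__main__":
    target = ["x", "=", "a", "+"]
    predictor = make_toy_predictor(target, [0.95, 0.65, 0.97, 0.92])
    print(adaptive_denoise(predictor, len(target)))
```

Raising the threshold trades speed for caution: fewer positions are committed per pass, so more passes are needed, but each commitment is made with more context.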
System Optimization
Memory efficiency:
import torch

class MemoryEfficientDiffusion:
    def __init__(self):
        self.gradient_checkpointing = True
        self.mixed_precision = True
        self.dynamic_batching = True

    def optimize_memory_usage(self, batch_size, sequence_length):
        # Dynamically adjust memory usage based on total GPU memory
        optimal_config = self.calculate_optimal_config(
            available_memory=torch.cuda.get_device_properties(0).total_memory,
            batch_size=batch_size,
            sequence_length=sequence_length
        )
        return optimal_config
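The `calculate_optimal_config` method above is left undefined. As a rough idea of what such a calculation might involve, the sketch below picks the largest batch size that fits a memory budget. The `bytes_per_token` figure is a hypothetical parameter that would have to be measured for a real model, and the safety fraction is an arbitrary illustrative default.

```python
def max_batch_size(available_bytes, sequence_length, bytes_per_token, safety_fraction=0.8):
    """Largest batch whose estimated footprint fits within a fraction of memory.

    Assumes memory use grows linearly with batch size and sequence length,
    which ignores model weights and fixed overheads.
    """
    budget = available_bytes * safety_fraction
    per_sequence = sequence_length * bytes_per_token
    return max(0, int(budget // per_sequence))

if __name__ == "__main__":
    gib = 1024 ** 3
    mib = 1024 ** 2
    # Illustrative values: 80 GiB of memory, 2000-token sequences, 1 MiB per token.
    print(max_batch_size(80 * gib, sequence_length=2000, bytes_per_token=mib))
```

In practice the available memory would come from a call such as `torch.cuda.get_device_properties(0).total_memory`, minus whatever the model weights already occupy.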
Conclusion
By applying diffusion to language modeling, Mercury delivers the following:
Core Achievements
- Speed innovation: inference up to 10x faster than speed-optimized autoregressive models
- Quality maintained: code generation quality equivalent to commercial models
- Practicality: verified performance in real developer environments
- Scalability: demonstrated potential for scaling to larger models
Industry Impact
Immediate impact:
- Transformed user experience for code autocomplete tools
- Real-time AI coding assistant deployment becomes feasible
- Agent-based development workflows become practical
Long-term outlook:
- Fundamental shift in the cost structure of AI inference
- Emergence of new forms of interactive AI services
- High-performance LLM utilization in edge computing environments
Mercury is more than an incremental performance improvement: it points to a possible paradigm shift for AI language models. The results suggest that a diffusion-based approach can be as effective for language generation as it has been for image generation, and they point the way toward faster and more efficient AI systems.
Original paper: Mercury: Ultra-Fast Language Models Based on Diffusion
API and playground: platform.inceptionlabs.ai | chat.inceptionlabs.ai