⏱️ Estimated reading time: 15 minutes

Large-Context LLM Inference with oLLM

One of the biggest constraints when working with Large Language Models (LLMs) is the context length limitation. With typical GPU memory, it’s challenging to process long documents or conversation histories in a single pass.

oLLM is a library built to work around this constraint: it can process 100k-token contexts with as little as 8 GB of GPU memory.

What is oLLM?

oLLM is a lightweight Python library built on top of HuggingFace Transformers and PyTorch. It features:

  • Large-context processing: Handle up to 100k tokens
  • Low-cost GPU utilization: Run large models with just 8GB VRAM
  • No quantization: Maintains fp16/bf16 precision
  • SSD offloading: Offloads KV cache and layer weights to SSD
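
The SSD offloading listed above is the key trick: tensors that do not fit in VRAM stay on disk and are brought onto the GPU only while they are needed. The snippet below is a minimal conceptual sketch of that pattern in plain PyTorch, not oLLM's actual implementation (which uses kvikio for much faster disk-to-GPU transfers).

import os
import torch

os.makedirs("./offload", exist_ok=True)

# A large fp16 block standing in for one layer's KV cache (the shape is arbitrary here)
kv_block = torch.randn(2, 32, 4096, 128, dtype=torch.float16, device="cuda:0")

# Offload: persist the block on SSD and release the GPU copy
torch.save(kv_block.cpu(), "./offload/kv_block_0.pt")
del kv_block
torch.cuda.empty_cache()

# ...later, reload the block only for the step that actually needs it
kv_block = torch.load("./offload/kv_block_0.pt").to("cuda:0")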

Supported Models and Performance

Memory Usage on an 8 GB Nvidia RTX 3060 Ti

Model          | Weights             | Context Length | KV Cache | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM Disk
qwen3-next-80B | 160 GB (bf16)       | 50k            | 20 GB    | ~190 GB                    | ~7.5 GB       | 180 GB
gpt-oss-20B    | 13 GB (packed bf16) | 10k            | 1.4 GB   | ~40 GB                     | ~7.3 GB       | 15 GB
llama3-1B-chat | 2 GB (fp16)         | 100k           | 12.6 GB  | ~16 GB                     | ~5 GB         | 15 GB
llama3-3B-chat | 7 GB (fp16)         | 100k           | 34.1 GB  | ~42 GB                     | ~5.3 GB       | 42 GB
llama3-8B-chat | 16 GB (fp16)        | 100k           | 52.4 GB  | ~71 GB                     | ~6.6 GB       | 69 GB
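
The KV cache column explains why long contexts are so expensive: even mid-sized models need tens of gigabytes once the context reaches 100k tokens. The back-of-the-envelope estimate below assumes the full hidden size is cached per layer in fp16 (i.e., no grouped-query savings); with that assumption it reproduces the llama3-8B figure from the table.

def kv_cache_bytes(num_layers, hidden_size, context_len, bytes_per_value=2):
    """Rough KV-cache size: K and V tensors per layer, one hidden-size
    vector per token, stored at 2 bytes per value (fp16/bf16)."""
    return 2 * num_layers * hidden_size * context_len * bytes_per_value

# Llama-3-8B: 32 layers, hidden size 4096, 100k-token context
print(kv_cache_bytes(32, 4096, 100_000) / 1e9)  # ~52.4 GB, matching the table row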

Installation and Setup

1. Create Virtual Environment

# Create virtual environment
python3 -m venv ollm_env
source ollm_env/bin/activate  # Linux/Mac
# or
ollm_env\Scripts\activate  # Windows

2. Install oLLM

# Install via pip
pip install ollm

# Or install from source
git clone https://github.com/Mega4alik/ollm.git
cd ollm
pip install -e .

3. Install Additional Dependencies

# Install kvikio for your CUDA version
pip install kvikio-cu12  # For CUDA 12.x
# or
pip install kvikio-cu11  # For CUDA 11.x
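
If you are not sure which CUDA toolkit your PyTorch build targets, you can check it from Python before choosing a kvikio package (this assumes PyTorch is already installed with CUDA support):

import torch

# Prints e.g. "12.1" for a CUDA 12.x build, or None for a CPU-only build
print(torch.version.cuda)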

4. Additional Installation for qwen3-next Model

# qwen3-next requires a development build of transformers installed from source
pip install git+https://github.com/huggingface/transformers.git

Basic Usage

1. Basic Inference Example

from ollm import Inference, file_get_contents, TextStreamer

# Initialize model
o = Inference("llama3-1B-chat", device="cuda:0", logging=True)
o.ini_model(models_dir="./models/", force_download=False)

# Optional: Offload some layers to CPU for speed boost
o.offload_layers_to_cpu(layers_num=2)

# Set up KV cache (set to None if context is small)
past_key_values = o.DiskCache(cache_dir="./kv_cache/")

# Set up text streamer
text_streamer = TextStreamer(o.tokenizer, skip_prompt=True, skip_special_tokens=False)

# Compose messages
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "List the planets."}
]

# Tokenize the chat prompt and generate a response.
# reasoning_effort is forwarded to the chat template; models whose template does not use it simply ignore it.
input_ids = o.tokenizer.apply_chat_template(
    messages, 
    reasoning_effort="minimal", 
    tokenize=True, 
    add_generation_prompt=True, 
    return_tensors="pt"
).to(o.device)

outputs = o.model.generate(
    input_ids=input_ids,
    past_key_values=past_key_values,
    max_new_tokens=500,
    streamer=text_streamer
).cpu()

# Decode result
answer = o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=False)
print(answer)

2. Run Command

# Run with CUDA memory allocation optimization
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python example.py

Advanced Usage

1. Large Document Analysis

from ollm import Inference, file_get_contents

def analyze_large_document(document_path, model_name="llama3-8B-chat"):
    """Analyze a large document and return a summary of its key points."""
    
    # Initialize model
    o = Inference(model_name, device="cuda:0", logging=True)
    o.ini_model(models_dir="./models/", force_download=False)
    
    # Set up KV cache for large context
    past_key_values = o.DiskCache(cache_dir="./kv_cache/")
    
    # Read document
    document_content = file_get_contents(document_path)
    
    # Analysis prompt
    messages = [
        {
            "role": "system", 
            "content": "You are a document analysis expert. Analyze the given document and summarize key content while extracting important points."
        },
        {
            "role": "user", 
            "content": f"Please analyze the following document:\n\n{document_content}"
        }
    ]
    
    # Tokenize
    input_ids = o.tokenizer.apply_chat_template(
        messages, 
        tokenize=True, 
        add_generation_prompt=True, 
        return_tensors="pt"
    ).to(o.device)
    
    # Generate
    outputs = o.model.generate(
        input_ids=input_ids,
        past_key_values=past_key_values,
        max_new_tokens=1000,
        temperature=0.7,
        do_sample=True
    )
    
    # Return result
    result = o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return result
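
Calling it is straightforward; the path below is only an illustration, so point it at any long text file you want summarized:

summary = analyze_large_document("./documents/annual_report.txt")
print(summary)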

2. Streaming Response Processing

from ollm import Inference, TextStreamer

def stream_response(model_name, messages, max_tokens=500):
    """Generate a response, streaming tokens to stdout as they are produced."""
    
    o = Inference(model_name, device="cuda:0", logging=True)
    o.ini_model(models_dir="./models/", force_download=False)
    
    # Set up text streamer
    text_streamer = TextStreamer(
        o.tokenizer, 
        skip_prompt=True, 
        skip_special_tokens=False
    )
    
    # Tokenize
    input_ids = o.tokenizer.apply_chat_template(
        messages, 
        tokenize=True, 
        add_generation_prompt=True, 
        return_tensors="pt"
    ).to(o.device)
    
    # Generate with streaming
    outputs = o.model.generate(
        input_ids=input_ids,
        max_new_tokens=max_tokens,
        streamer=text_streamer,
        temperature=0.7,
        do_sample=True
    )
    
    return outputs

Real-World Use Cases

1. Contract and Regulatory Document Analysis

def analyze_contract(contract_path):
    """Contract analysis"""
    messages = [
        {
            "role": "system",
            "content": "You are a legal document analysis expert. Analyze the contract's key clauses, risk factors, and rights and obligations clearly."
        },
        {
            "role": "user", 
            "content": f"Please analyze the following contract: {file_get_contents(contract_path)}"
        }
    ]
    return stream_response("llama3-8B-chat", messages, max_tokens=1000)

2. Medical Records Analysis

def analyze_medical_records(records_path):
    """Medical records analysis"""
    messages = [
        {
            "role": "system",
            "content": "You are a medical data analysis expert. Analyze patient records and summarize key diagnoses, treatment processes, and precautions."
        },
        {
            "role": "user",
            "content": f"Please analyze the following medical records: {file_get_contents(records_path)}"
        }
    ]
    return stream_response("llama3-8B-chat", messages, max_tokens=1500)

3. Large Log File Analysis

def analyze_logs(log_path):
    """Log file analysis"""
    messages = [
        {
            "role": "system",
            "content": "You are a system log analysis expert. Analyze logs to identify error patterns, performance issues, and security threats."
        },
        {
            "role": "user",
            "content": f"Please analyze the following log file: {file_get_contents(log_path)}"
        }
    ]
    return stream_response("llama3-8B-chat", messages, max_tokens=2000)
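
Raw log files can exceed even a 100k-token window, so it can help to trim the input before building the prompt. The helper below is a simple character-based sketch; the 3-characters-per-token ratio is a rough assumption you should adjust for your data. It keeps the tail of the file, where the most recent entries usually are:

def truncate_log(text, max_tokens=100_000, chars_per_token=3):
    """Keep roughly the last max_tokens worth of text, assuming an average
    of chars_per_token characters per token."""
    max_chars = max_tokens * chars_per_token
    return text if len(text) <= max_chars else text[-max_chars:]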

Performance Optimization Tips

1. Memory Optimization

# Save GPU memory by offloading layers
o.offload_layers_to_cpu(layers_num=4)  # Offload more layers to CPU

# Disk offloading for KV cache
past_key_values = o.DiskCache(cache_dir="./kv_cache/")

2. Speed Optimization

# Optimize CUDA memory allocation (set before PyTorch initializes CUDA)
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

# Keep the batch size at 1: every additional sequence multiplies KV-cache memory
batch_size = 1  # Adjust based on available memory

3. Model Selection Guide

  • For fast responses: llama3-1B-chat
  • For balanced performance: llama3-8B-chat
  • For highest quality: qwen3-next-80B (requires more disk space)
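
As a rough way to encode this guide in code, a hypothetical helper might look like the following (the priority names and disk threshold are illustrative, not part of oLLM):

def choose_model(priority="balanced", free_disk_gb=100):
    """Pick a model name from the guide above. Thresholds are illustrative."""
    if priority == "speed":
        return "llama3-1B-chat"
    if priority == "quality" and free_disk_gb >= 180:
        return "qwen3-next-80B"  # needs roughly 180 GB of disk for offloaded data
    return "llama3-8B-chat"

model_name = choose_model(priority="quality", free_disk_gb=200)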

Troubleshooting

1. Out of Memory Error

# Solution 1: Offload more layers to CPU
o.offload_layers_to_cpu(layers_num=6)

# Solution 2: Use smaller model
o = Inference("llama3-1B-chat", device="cuda:0")

2. Insufficient Disk Space

# Disable KV cache (for small contexts)
past_key_values = None

# Or use smaller model
o = Inference("llama3-1B-chat", device="cuda:0")

3. Slow Performance

# Optimize CUDA memory (set before PyTorch initializes CUDA)
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

# Adjust layer offloading
o.offload_layers_to_cpu(layers_num=2)  # Offload fewer layers

Conclusion

oLLM is an innovative tool that democratizes large-context LLM inference. With the ability to process 100k-token contexts on just an 8 GB GPU, individual developers and small teams can now perform large-scale document analysis.

Key advantages:

  • Cost efficiency: Run large models without expensive GPUs
  • Flexibility: Support for various models and context lengths
  • Practicality: Tools that can be immediately applied to real work

Use oLLM to efficiently perform various large-scale text processing tasks such as contract analysis, medical record processing, and log analysis!

References