LangExtract: Comprehensive Tutorial for Structured Text Extraction with LLMs
⏱️ Estimated Reading Time: 12 minutes
Introduction
LangExtract is a Python library developed by Google for extracting structured information from unstructured text using Large Language Models (LLMs). With over 15,400 stars on GitHub, it combines precise source grounding with interactive visualization, making it a strong fit for modern data extraction workflows.
What is LangExtract?
LangExtract is designed to bridge the gap between unstructured text data and structured information extraction. Unlike traditional parsing methods, LangExtract leverages the power of advanced LLMs to understand context, relationships, and nuanced information within text documents.
Key Features
- Multi-Model Support: Works with Gemini, OpenAI, and local Ollama models
- Source Grounding: Provides precise attribution to source text
- Interactive Visualization: Built-in tools for exploring extraction results
- Schema Constraints: Enforces structured output formats
- Parallel Processing: Handles large documents efficiently
- Plugin System: Extensible architecture for custom model providers
Installation and Setup
Basic Installation
The simplest way to get started is through pip:
# Create virtual environment
python -m venv langextract_env
source langextract_env/bin/activate # On Windows: langextract_env\Scripts\activate
# Install LangExtract
pip install langextract
Development Installation
For development work or accessing the latest features:
git clone https://github.com/google/langextract.git
cd langextract
# Basic installation
pip install -e .
# With development tools
pip install -e ".[dev]"
# With testing dependencies
pip install -e ".[test]"
Docker Setup
For containerized deployments:
docker build -t langextract .
docker run --rm -e LANGEXTRACT_API_KEY="your-api-key" langextract python your_script.py
API Key Configuration
Cloud Models Setup
LangExtract supports multiple cloud providers. Here’s how to configure API keys:
Option 1: Environment Variables
export LANGEXTRACT_API_KEY="your-api-key-here"
Option 2: .env File (Recommended)
# Create .env file
cat > .env << 'EOF'
LANGEXTRACT_API_KEY=your-api-key-here
EOF
# Secure your API key
echo '.env' >> .gitignore
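If you use the .env approach, load it at the top of your script so the key lands in the environment, where LangExtract picks up LANGEXTRACT_API_KEY automatically. A minimal sketch using the python-dotenv package (pip install python-dotenv):
import os
from dotenv import load_dotenv

load_dotenv()  # copies values from .env into os.environ
assert "LANGEXTRACT_API_KEY" in os.environ  # now visible to LangExtract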
Option 3: Vertex AI Authentication
For enterprise environments using Google Cloud:
import langextract as lx
result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract information...",
    examples=[...],
    model_id="gemini-2.5-flash",
    language_model_params={
        "vertexai": True,
        "project": "your-project-id",
        "location": "global"
    }
)
API Key Sources
- Gemini Models: Get API keys from AI Studio
- OpenAI Models: Access keys from OpenAI Platform
- Vertex AI: For enterprise use with service accounts
Basic Usage Examples
Simple Information Extraction
Let’s start with a basic example extracting contact information:
import langextract as lx
# Sample text
text = """
Dr. Sarah Johnson is a cardiologist at Metro Hospital.
You can reach her at sarah.johnson@metro.com or call (555) 123-4567.
Her office is located at 123 Medical Drive, Suite 456, Boston, MA 02101.
"""
# Define extraction prompt
prompt = "Extract contact information including name, title, email, phone, and address"

# LangExtract expects few-shot examples (lx.data.ExampleData) to guide the model
examples = [
    lx.data.ExampleData(
        text="Contact Jane Smith, CTO, at jane@acme.com or (555) 987-6543.",
        extractions=[
            lx.data.Extraction(
                extraction_class="contact",
                extraction_text="Jane Smith",
                attributes={"title": "CTO", "email": "jane@acme.com", "phone": "(555) 987-6543"},
            )
        ],
    )
]

# Basic extraction
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash"
)
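Each returned extraction is grounded in the source: alongside its class, text, and attributes, it carries character offsets pointing at the exact span it came from. A quick way to inspect the results:
for extraction in result.extractions:
    print(
        extraction.extraction_class,
        repr(extraction.extraction_text),
        extraction.attributes,
        extraction.char_interval,  # start/end offsets of the grounded span
    )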
Structured Extraction with Examples
For more complex scenarios, provide examples to guide the extraction:
import langextract as lx
# Medical text
medical_text = """
Patient presents with chest pain, shortness of breath, and elevated heart rate.
Prescribed Metoprolol 50mg twice daily and Lisinopril 10mg once daily.
Follow-up appointment scheduled in 2 weeks.
"""
# Define examples using LangExtract's example data types
examples = [
    lx.data.ExampleData(
        text="Patient taking Aspirin 81mg daily for prevention",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="Aspirin",
                attributes={
                    "dosage": "81mg",
                    "frequency": "daily",
                    "purpose": "prevention",
                },
            )
        ],
    )
]

# Extract with examples
result = lx.extract(
    text_or_documents=medical_text,
    prompt_description="Extract medication information including name, dosage, frequency, and purpose",
    examples=examples,
    model_id="gemini-2.5-flash"
)
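The extractions can then be folded into whatever structure your pipeline needs, for example plain dicts keyed by medication name:
medications = [
    {"name": e.extraction_text, **(e.attributes or {})}
    for e in result.extractions
    if e.extraction_class == "medication"
]
print(medications)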
Working with Different Model Providers
Using OpenAI Models
import langextract as lx
import os
result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract key information",
    examples=examples,
    model_id="gpt-4o",
    api_key=os.environ.get('OPENAI_API_KEY'),
    fence_output=True,            # OpenAI output is parsed from fenced blocks
    use_schema_constraints=False  # Schema constraints aren't implemented for OpenAI
)
Using Local Models with Ollama
For privacy-focused deployments or offline processing:
# Install and setup Ollama
# Visit ollama.com for installation instructions
ollama pull gemma2:2b
ollama serve
import langextract as lx

result = lx.extract(
    text_or_documents=input_text,
    prompt_description="Extract information",
    examples=examples,
    model_id="gemma2:2b",
    model_url="http://localhost:11434",
    fence_output=False,
    use_schema_constraints=False
)
Advanced Features
Large Document Processing
LangExtract handles large documents efficiently by splitting them into chunks and extracting from the chunks in parallel:
import langextract as lx
import requests
# Download large document (Romeo and Juliet example)
url = "https://www.gutenberg.org/files/1513/1513-0.txt"
response = requests.get(url)
full_text = response.text
# Extract character information
result = lx.extract(
    text_or_documents=full_text,
    prompt_description="Extract character names, relationships, and key scenes",
    examples=examples,       # reuse few-shot examples, as in the earlier snippets
    model_id="gemini-2.5-flash",
    max_workers=8,           # process text chunks in parallel
    extraction_passes=2,     # extra passes improve recall on long texts
    max_char_buffer=1000     # chunk size in characters
)
Schema-Constrained Extraction
LangExtract derives its output schema from the few-shot examples you supply: the extraction classes and attribute keys in your ExampleData objects define the structure the model is constrained to produce. With use_schema_constraints=True (the default), those constraints are enforced as structured output on supported models such as Gemini:
# The extraction classes and attributes in the examples define the schema
examples = [
    lx.data.ExampleData(
        text="Started Metformin 500mg twice daily by mouth",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="Metformin",
                attributes={
                    "dosage": "500mg",
                    "frequency": "twice daily",
                    "route": "oral",
                },
            )
        ],
    )
]

result = lx.extract(
    text_or_documents=medical_text,
    prompt_description="Extract medical information",
    examples=examples,
    model_id="gemini-2.5-flash",
    use_schema_constraints=True
)
Interactive Visualization
LangExtract ships with an interactive HTML visualizer that highlights every extraction in its source context. Save the results to JSONL, then generate the HTML:
# Save the results to a JSONL file
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")

# Generate the interactive visualization from the saved file
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    if hasattr(html_content, "data"):
        f.write(html_content.data)  # in Jupyter/Colab, visualize returns an HTML object
    else:
        f.write(html_content)
Real-World Use Cases
Healthcare: Medical Records Processing
def extract_medical_info(clinical_notes):
    """Extract structured medical information from clinical notes."""
    examples = [
        lx.data.ExampleData(
            text="Patient reports severe headache, prescribed Ibuprofen 600mg every 6 hours",
            extractions=[
                lx.data.Extraction(
                    extraction_class="symptom",
                    extraction_text="severe headache",
                ),
                lx.data.Extraction(
                    extraction_class="medication",
                    extraction_text="Ibuprofen",
                    attributes={"dosage": "600mg", "frequency": "every 6 hours"},
                ),
            ],
        )
    ]
    return lx.extract(
        text_or_documents=clinical_notes,
        prompt_description="Extract symptoms, medications, and treatment plans",
        examples=examples,
        model_id="gemini-2.5-flash"
    )
Legal: Contract Analysis
def extract_contract_terms(contract_text, examples):
    """Extract key terms from legal contracts.

    `examples` should be a list of lx.data.ExampleData objects, as in the
    snippets above.
    """
    prompt = """
    Extract contract information including:
    - Parties involved
    - Contract duration
    - Key obligations
    - Payment terms
    - Termination clauses
    """
    return lx.extract(
        text_or_documents=contract_text,
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-flash",
        temperature=0.1  # lower temperature for more deterministic legal output
    )
Academic: Research Paper Analysis
def extract_research_info(paper_text):
    """Extract structured information from research papers."""
    examples = [
        lx.data.ExampleData(
            text="This study examines 500 participants over 12 months...",
            extractions=[
                lx.data.Extraction(
                    extraction_class="study_design",
                    extraction_text="500 participants over 12 months",
                    attributes={
                        "sample_size": "500",
                        "study_duration": "12 months",
                        "methodology": "longitudinal study",
                    },
                )
            ],
        )
    ]
    return lx.extract(
        text_or_documents=paper_text,
        prompt_description="Extract research methodology, sample size, and key findings",
        examples=examples,
        model_id="gemini-2.5-flash"
    )
Custom Model Providers
LangExtract’s plugin system allows you to add custom model providers. The exact registration interface is documented in the repository's provider docs; the sketch below approximates its shape (pattern-based registration, an infer method yielding scored outputs) rather than reproducing the API verbatim:
import langextract as lx

@lx.providers.registry.register(r"^custom-model", priority=10)
class CustomProvider(lx.inference.BaseLanguageModel):
    """Sketch of a provider plugin; consult the provider docs for specifics."""

    def __init__(self, model_id, api_key=None, **kwargs):
        super().__init__()
        self.model_id = model_id
        self.api_key = api_key

    def infer(self, batch_prompts, **kwargs):
        # Yield one list of scored outputs per input prompt
        for prompt in batch_prompts:
            output = ...  # call your model's API here
            yield [lx.inference.ScoredOutput(score=1.0, output=output)]
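Once registered, the provider is selected by model ID pattern, so a call like the following (model ID chosen to match the pattern above) routes through your class:
result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="custom-model-v1",  # matches the r"^custom-model" pattern registered above
)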
Performance Optimization
Best Practices
- Use Appropriate Models: Choose the right model for your use case
  - Gemini 2.5 Flash: Fast, cost-effective
  - GPT-4o: High accuracy for complex tasks
  - Local models: Privacy and offline processing
- Optimize Prompts: Clear, specific prompts yield better results
- Leverage Examples: Provide 2-3 high-quality examples
- Batch Processing: Process multiple documents in parallel (see the sketch below)
- Schema Constraints: Use schemas for consistent output format
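A minimal batch-processing sketch, assuming text_or_documents also accepts a list of texts; the file names are placeholders, and examples refers to the few-shot examples defined earlier:
from pathlib import Path

# Placeholder paths; each file holds one document to process
paths = ["note1.txt", "note2.txt", "note3.txt"]
documents = [Path(p).read_text() for p in paths]

results = lx.extract(
    text_or_documents=documents,
    prompt_description="Extract medication information",
    examples=examples,           # few-shot examples as in the earlier snippets
    model_id="gemini-2.5-flash",
    max_workers=8,               # chunks are processed in parallel
)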
Error Handling
import langextract as lx
from langextract.exceptions import LangExtractError

try:
    result = lx.extract(
        text_or_documents=text,
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-flash"
    )
except LangExtractError as e:
    print(f"Extraction failed: {e}")
    # Implement fallback logic (retry, smaller chunks, a different model, ...)
Testing and Validation
Unit Testing
import unittest
import langextract as lx

class TestLangExtract(unittest.TestCase):
    def setUp(self):
        self.sample_text = "Dr. John Doe can be reached at john@example.com"
        # Minimal few-shot example to guide the model
        self.examples = [
            lx.data.ExampleData(
                text="Email Jane at jane@example.org",
                extractions=[lx.data.Extraction(extraction_class="email", extraction_text="jane@example.org")],
            )
        ]

    def test_contact_extraction(self):
        result = lx.extract(
            text_or_documents=self.sample_text,
            prompt_description="Extract email addresses",
            examples=self.examples,
            model_id="gemini-2.5-flash"
        )
        self.assertIn("john@example.com", str(result.extractions))

if __name__ == "__main__":
    unittest.main()
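Tests that call a live API are slow and flaky. For pure unit tests you can stub the extraction with unittest.mock; the fake return value below is a plain stand-in object, not LangExtract's real result type:
import unittest
from unittest import mock

import langextract as lx

class TestWithoutNetwork(unittest.TestCase):
    def test_pipeline_handles_empty_results(self):
        fake_result = mock.Mock()
        fake_result.extractions = []  # simulate "nothing found"
        with mock.patch.object(lx, "extract", return_value=fake_result):
            result = lx.extract(
                text_or_documents="some text",
                prompt_description="Extract email addresses",
                examples=[],
                model_id="gemini-2.5-flash",
            )
        self.assertEqual(result.extractions, [])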
Integration Testing
# Run full test suite
pytest tests
# Test specific provider
pytest tests/test_ollama.py
# Run with coverage
pytest --cov=langextract tests
Troubleshooting
Common Issues
- API Key Errors
  - Verify the API key is correctly set
  - Check key permissions and quotas
- Model Availability
  - Ensure the model ID is correct
  - Verify the model is available in your region
- Memory Issues with Large Documents
  - Chunk very large texts into smaller pieces (see the sketch below)
  - Enable parallel processing
- Inconsistent Output Format
  - Use schema constraints
  - Provide more examples
  - Lower the temperature for consistency
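For the memory issue above, LangExtract's own chunking parameters usually suffice; a sketch reusing the prompt and examples variables from earlier:
result = lx.extract(
    text_or_documents=very_large_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    max_char_buffer=800,  # smaller chunks keep per-call context (and memory) low
    max_workers=4,        # limit concurrency to bound memory use
)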
Debug Mode
import logging
logging.basicConfig(level=logging.DEBUG)

result = lx.extract(
    text_or_documents=text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    debug=True
)
Security Considerations
Data Privacy
- Local Processing: Use Ollama for sensitive data
- API Security: Rotate API keys regularly
- Data Retention: Understand provider data policies
Input Validation
import langextract as lx

def safe_extract(text, examples, max_length=10000):
    """Safely extract with input validation."""
    if len(text) > max_length:
        raise ValueError("Input text too long")

    # Normalize whitespace before sending to the model
    text = text.strip()

    return lx.extract(
        text_or_documents=text,
        prompt_description="Extract information",
        examples=examples,
        model_id="gemini-2.5-flash"
    )
Conclusion
LangExtract represents a significant advancement in structured information extraction from unstructured text. Its combination of powerful LLM integration, precise source grounding, and flexible architecture makes it an invaluable tool for modern data processing workflows.
Whether you’re processing medical records, analyzing legal documents, or extracting insights from research papers, LangExtract provides the tools and flexibility needed to transform unstructured text into actionable structured data.
Next Steps
- Explore Examples: Check the official examples
- Join Community: Contribute to the community providers
- Read Documentation: Visit the official documentation
- Try Live Demo: Experience RadExtract demo
Start your journey with LangExtract today and revolutionize how you work with unstructured text data!
💡 Pro Tip: Begin with simple extraction tasks and gradually increase complexity as you become familiar with the library’s capabilities. The key to success with LangExtract is crafting clear prompts and providing good examples.