Complete RAG System Tutorial: Building Production-Ready AI with ArXiv Paper Curator

⏱️ Estimated Reading Time: 12 minutes

Introduction: Building Production-Ready RAG Systems

Retrieval-Augmented Generation (RAG) has become the cornerstone of modern AI applications, enabling LLMs to access and reason over vast knowledge bases. However, most tutorials focus on toy examples that fail to address real-world challenges like scalability, monitoring, and production deployment.

The ArXiv Paper Curator project by Jam With AI bridges this gap by providing a complete 6-week curriculum for building production-ready RAG systems. This tutorial will guide you through the entire journey, from basic setup to advanced monitoring and caching strategies.

🎯 What You’ll Build

By the end of this tutorial, you’ll have constructed a sophisticated RAG system that:

Ingests and processes research papers from ArXiv automatically
Implements hybrid search combining BM25 keyword search with vector similarity
Provides intelligent Q&A capabilities using local LLMs
Includes production monitoring with comprehensive observability
Features caching strategies for optimal performance
Offers a web interface for interactive exploration

🏗️ System Architecture Overview

The ArXiv Paper Curator follows a modular, production-ready architecture:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Data Ingestion │    │  Processing     │    │   Search & RAG  │
│   (ArXiv API)    │───▶│  (PDF + Text)   │───▶│  (Hybrid Index) │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   PostgreSQL    │    │   OpenSearch    │    │    Gradio UI    │
│   (Metadata)    │    │ (Search Index)  │    │ (User Interface)│
└─────────────────┘    └─────────────────┘    └─────────────────┘

📋 Prerequisites

Before starting, ensure you have:

Docker & Docker Compose: For containerized services
Python 3.11+: For running the application
8GB+ RAM: For running all services locally
10GB+ Storage: For papers and indices

🚀 Week-by-Week Implementation Guide

Week 1: Infrastructure Foundation

The first week establishes the core infrastructure using Docker Compose to orchestrate multiple services.

Setting Up the Environment

# Clone the repository
git clone https://github.com/jamwithai/arxiv-paper-curator.git
cd arxiv-paper-curator

# Start all services
make start

# Verify health status
make health

Core Services Architecture

The system leverages several containerized services:

PostgreSQL: Stores paper metadata and relationships
OpenSearch: Provides both keyword and vector search capabilities
FastAPI: Serves the RESTful API endpoints
Gradio: Delivers the interactive web interface

Environment Configuration

# Key configuration variables
DATABASE_URL="postgresql://postgres:password@localhost:5432/arxiv_papers"
OPENSEARCH_URL="http://localhost:9200"
JINA_API_KEY="your-jina-api-key"  # For embeddings
LANGFUSE_SECRET_KEY="your-key"     # For monitoring

Week 2: Data Ingestion Pipeline

Week 2 focuses on building a robust data ingestion pipeline that automatically fetches, processes, and stores research papers.

ArXiv API Integration

The system uses the ArXiv API to fetch papers based on categories and search queries:

async def fetch_papers_from_arxiv(
    query: str = "cat:cs.AI",
    max_results: int = 100,
    start_date: Optional[datetime] = None
) -> List[ArxivPaper]:
    """
    Fetch papers from ArXiv API with comprehensive metadata
    """
    base_url = "http://export.arxiv.org/api/query"
    params = {
        "search_query": query,
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending"
    }

PDF Processing with Docling

The project uses IBM’s Docling library for sophisticated PDF processing that preserves document structure:

from docling.document_converter import DocumentConverter

def process_pdf_with_docling(pdf_path: str) -> ProcessedDocument:
    """
    Extract structured content from research papers
    """
    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    
    return ProcessedDocument(
        title=result.document.title,
        abstract=result.document.abstract,
        sections=result.document.sections,
        figures=result.document.figures,
        tables=result.document.tables
    )

Database Schema Design

The PostgreSQL schema is optimized for research paper metadata:

-- Core papers table
CREATE TABLE papers (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    arxiv_id VARCHAR(50) UNIQUE NOT NULL,
    title TEXT NOT NULL,
    abstract TEXT,
    authors TEXT[],
    categories TEXT[],
    published_date TIMESTAMP,
    pdf_url TEXT,
    content_extracted BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Authors relationship table
CREATE TABLE paper_authors (
    paper_id UUID REFERENCES papers(id),
    author_name TEXT,
    author_order INTEGER,
    PRIMARY KEY (paper_id, author_name)
);

Week 3: Keyword Search with BM25

Week 3 implements sophisticated keyword search using OpenSearch’s BM25 algorithm, providing the foundation for the hybrid search system.

OpenSearch Index Configuration

# BM25-optimized index settings
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
            "analyzer": {
                "scientific_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "scientific_stemmer",
                        "stop_words_filter"
                    ]
                }
            },
            "filter": {
                "scientific_stemmer": {
                    "type": "stemmer",
                    "language": "english"
                },
                "stop_words_filter": {
                    "type": "stop",
                    "stopwords": "_english_"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "scientific_analyzer",
                "boost": 3.0
            },
            "abstract": {
                "type": "text",
                "analyzer": "scientific_analyzer",
                "boost": 2.0
            },
            "content": {
                "type": "text",
                "analyzer": "scientific_analyzer"
            }
        }
    }
}

Advanced BM25 Query Construction

def build_bm25_query(
    query: str,
    filters: Optional[Dict] = None,
    boost_recent: bool = True
) -> Dict:
    """
    Construct sophisticated BM25 queries with field boosting
    """
    base_query = {
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": query,
                            "fields": [
                                "title^3",      # Boost title matches
                                "abstract^2",   # Boost abstract matches
                                "content",      # Standard content matching
                                "authors"       # Include author matching
                            ],
                            "type": "best_fields",
                            "fuzziness": "AUTO"
                        }
                    }
                ]
            }
        }
    }
    
    # Add recency boosting
    if boost_recent:
        base_query["query"]["bool"]["should"] = [
            {
                "function_score": {
                    "functions": [
                        {
                            "gauss": {
                                "published_date": {
                                    "origin": "now",
                                    "scale": "30d",
                                    "decay": 0.5
                                }
                            }
                        }
                    ]
                }
            }
        ]
    
    return base_query

Week 4: Hybrid Search with Embeddings

Week 4 introduces vector embeddings and hybrid search, combining semantic similarity with keyword matching for superior retrieval performance.

Text Chunking Strategy

The system implements section-aware chunking that preserves document structure:

class SectionAwareChunker:
    def __init__(self, max_chunk_size: int = 512, overlap: int = 50):
        self.max_chunk_size = max_chunk_size
        self.overlap = overlap
    
    def chunk_document(self, document: ProcessedDocument) -> List[DocumentChunk]:
        """
        Create chunks that respect section boundaries
        """
        chunks = []
        
        # Process each section separately
        for section in document.sections:
            section_chunks = self._chunk_section(
                section.content,
                section.title,
                section.level
            )
            chunks.extend(section_chunks)
        
        return chunks
    
    def _chunk_section(
        self, 
        content: str, 
        section_title: str, 
        level: int
    ) -> List[DocumentChunk]:
        """
        Chunk section content while preserving context
        """
        sentences = self._split_sentences(content)
        chunks = []
        current_chunk = []
        current_length = 0
        
        for sentence in sentences:
            sentence_length = len(sentence.split())
            
            if (current_length + sentence_length > self.max_chunk_size 
                and current_chunk):
                # Create chunk with section context
                chunk_text = f"{section_title}\n\n" + " ".join(current_chunk)
                chunks.append(DocumentChunk(
                    text=chunk_text,
                    section=section_title,
                    level=level,
                    word_count=current_length
                ))
                
                # Start new chunk with overlap
                overlap_sentences = current_chunk[-self.overlap:]
                current_chunk = overlap_sentences + [sentence]
                current_length = sum(len(s.split()) for s in current_chunk)
            else:
                current_chunk.append(sentence)
                current_length += sentence_length
        
        # Add final chunk
        if current_chunk:
            chunk_text = f"{section_title}\n\n" + " ".join(current_chunk)
            chunks.append(DocumentChunk(
                text=chunk_text,
                section=section_title,
                level=level,
                word_count=current_length
            ))
        
        return chunks

Vector Embedding Generation

The system uses Jina AI’s embedding service for high-quality vector representations:

class JinaEmbeddingService:
    def __init__(self, api_key: str, model: str = "jina-embeddings-v2-base-en"):
        self.api_key = api_key
        self.model = model
        self.client = httpx.AsyncClient()
    
    async def generate_embeddings(
        self, 
        texts: List[str], 
        batch_size: int = 32
    ) -> List[List[float]]:
        """
        Generate embeddings in batches for efficiency
        """
        all_embeddings = []
        
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            batch_embeddings = await self._embed_batch(batch)
            all_embeddings.extend(batch_embeddings)
        
        return all_embeddings
    
    async def _embed_batch(self, texts: List[str]) -> List[List[float]]:
        """
        Process a single batch of texts
        """
        response = await self.client.post(
            "https://api.jina.ai/v1/embeddings",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            },
            json={
                "model": self.model,
                "input": texts,
                "encoding_format": "float"
            }
        )
        
        result = response.json()
        return [item["embedding"] for item in result["data"]]

Hybrid Search Implementation

The hybrid search combines BM25 and vector similarity with sophisticated score fusion:

class HybridSearchEngine:
    def __init__(
        self, 
        opensearch_client, 
        embedding_service,
        bm25_weight: float = 0.6,
        vector_weight: float = 0.4
    ):
        self.opensearch = opensearch_client
        self.embeddings = embedding_service
        self.bm25_weight = bm25_weight
        self.vector_weight = vector_weight
    
    async def search(
        self, 
        query: str, 
        k: int = 10,
        filters: Optional[Dict] = None
    ) -> List[SearchResult]:
        """
        Perform hybrid search with score fusion
        """
        # Generate query embedding
        query_embedding = await self.embeddings.generate_embeddings([query])
        query_vector = query_embedding[0]
        
        # Construct hybrid query
        hybrid_query = {
            "size": k * 2,  # Retrieve more for reranking
            "query": {
                "bool": {
                    "should": [
                        # BM25 component
                        {
                            "multi_match": {
                                "query": query,
                                "fields": ["title^3", "abstract^2", "content"],
                                "type": "best_fields"
                            }
                        },
                        # Vector similarity component
                        {
                            "knn": {
                                "embedding": {
                                    "vector": query_vector,
                                    "k": k
                                }
                            }
                        }
                    ]
                }
            }
        }
        
        # Add filters if provided
        if filters:
            hybrid_query["query"]["bool"]["filter"] = self._build_filters(filters)
        
        # Execute search
        response = await self.opensearch.search(
            index="papers_hybrid",
            body=hybrid_query
        )
        
        # Process and rerank results
        results = self._process_results(response["hits"]["hits"])
        reranked_results = self._rerank_results(results, query, k)
        
        return reranked_results
    
    def _rerank_results(
        self, 
        results: List[SearchResult], 
        query: str, 
        k: int
    ) -> List[SearchResult]:
        """
        Apply sophisticated reranking based on multiple signals
        """
        # Calculate normalized scores
        max_bm25 = max(r.bm25_score for r in results) if results else 1.0
        max_vector = max(r.vector_score for r in results) if results else 1.0
        
        for result in results:
            # Normalize individual scores
            norm_bm25 = result.bm25_score / max_bm25
            norm_vector = result.vector_score / max_vector
            
            # Calculate hybrid score
            result.hybrid_score = (
                self.bm25_weight * norm_bm25 + 
                self.vector_weight * norm_vector
            )
            
            # Apply additional ranking signals
            result.hybrid_score *= self._calculate_quality_multiplier(result)
        
        # Sort by hybrid score and return top k
        return sorted(results, key=lambda x: x.hybrid_score, reverse=True)[:k]

Week 5: Complete RAG System

Week 5 integrates everything into a complete RAG system with sophisticated prompt engineering and response generation.

LLM Integration with Ollama

The system uses Ollama for local LLM inference, providing privacy and control:

class OllamaRAGService:
    def __init__(
        self, 
        base_url: str = "http://localhost:11434",
        model: str = "llama3.1:8b"
    ):
        self.base_url = base_url
        self.model = model
        self.client = httpx.AsyncClient()
    
    async def generate_answer(
        self, 
        query: str, 
        context_chunks: List[DocumentChunk],
        max_context_length: int = 4000
    ) -> RAGResponse:
        """
        Generate contextual answers using retrieved documents
        """
        # Prepare context with smart truncation
        context = self._prepare_context(context_chunks, max_context_length)
        
        # Construct RAG prompt
        prompt = self._build_rag_prompt(query, context)
        
        # Generate response
        response = await self._call_ollama(prompt)
        
        # Post-process and validate
        processed_response = self._post_process_response(
            response, 
            query, 
            context_chunks
        )
        
        return processed_response
    
    def _build_rag_prompt(self, query: str, context: str) -> str:
        """
        Construct sophisticated RAG prompts with role definition
        """
        return f"""You are an expert research assistant specializing in AI and computer science literature. Your task is to provide accurate, comprehensive answers based on the provided research papers.

**Context from Research Papers:**
{context}

**User Question:**
{query}

**Instructions:**
1. Base your answer primarily on the provided context
2. If the context doesn't contain sufficient information, clearly state this
3. Include specific references to papers when making claims
4. Provide technical details when appropriate
5. If discussing methodologies, explain them clearly
6. Highlight any limitations or caveats mentioned in the papers

**Answer:**"""

    def _prepare_context(
        self, 
        chunks: List[DocumentChunk], 
        max_length: int
    ) -> str:
        """
        Intelligently select and format context chunks
        """
        # Sort chunks by relevance score
        sorted_chunks = sorted(chunks, key=lambda x: x.score, reverse=True)
        
        context_parts = []
        current_length = 0
        
        for chunk in sorted_chunks:
            chunk_text = f"[{chunk.paper_title}]\n{chunk.text}\n---\n"
            chunk_length = len(chunk_text.split())
            
            if current_length + chunk_length > max_length:
                break
                
            context_parts.append(chunk_text)
            current_length += chunk_length
        
        return "\n".join(context_parts)

Advanced Prompt Engineering

The system includes specialized prompts for different types of queries:

class PromptTemplates:
    COMPARATIVE_ANALYSIS = """Based on the research papers provided, compare and contrast the following approaches/methods/concepts: {concepts}

Please structure your response as follows:
1. **Overview**: Brief introduction to each concept
2. **Key Similarities**: What do these approaches have in common?
3. **Key Differences**: How do they differ in methodology, assumptions, or results?
4. **Performance Comparison**: If available, compare their effectiveness
5. **Use Cases**: When would you choose one over the other?
6. **Limitations**: What are the constraints of each approach?

Papers referenced: {paper_titles}"""

    METHODOLOGY_EXPLANATION = """Explain the methodology described in the research papers for: {topic}

Please provide:
1. **Problem Definition**: What problem is being addressed?
2. **Approach Overview**: High-level methodology description
3. **Technical Details**: Step-by-step explanation of the method
4. **Implementation Considerations**: Practical aspects of implementation
5. **Evaluation Methods**: How was the approach validated?
6. **Results Summary**: Key findings and performance metrics
7. **Limitations and Future Work**: Acknowledged constraints and next steps

Base your explanation on: {paper_titles}"""

    TREND_ANALYSIS = """Analyze trends and developments in: {research_area}

Based on the provided papers, discuss:
1. **Historical Context**: How has this area evolved?
2. **Current State**: What are the dominant approaches today?
3. **Emerging Patterns**: What new trends are visible?
4. **Key Innovations**: What breakthrough contributions are mentioned?
5. **Open Challenges**: What problems remain unsolved?
6. **Future Directions**: What research directions are suggested?

Papers analyzed: {paper_titles}"""

Week 6: Production Monitoring and Caching

Week 6 focuses on production readiness with comprehensive monitoring and intelligent caching strategies.

Langfuse Integration for Observability

class RAGMonitoringService:
    def __init__(self, langfuse_client):
        self.langfuse = langfuse_client
    
    async def trace_rag_request(
        self, 
        query: str, 
        search_results: List[SearchResult],
        llm_response: RAGResponse,
        execution_time: float
    ) -> None:
        """
        Comprehensive RAG pipeline tracing
        """
        trace = self.langfuse.trace(
            name="rag_query",
            input={"query": query},
            output={"answer": llm_response.answer},
            metadata={
                "execution_time_ms": execution_time * 1000,
                "num_results": len(search_results),
                "model_used": llm_response.model,
                "context_length": llm_response.context_length
            }
        )
        
        # Track search performance
        search_span = trace.span(
            name="search_retrieval",
            input={"query": query},
            output={"num_results": len(search_results)},
            metadata={
                "search_type": "hybrid",
                "bm25_weight": 0.6,
                "vector_weight": 0.4
            }
        )
        
        # Track LLM generation
        generation_span = trace.span(
            name="llm_generation",
            input={"prompt_length": len(llm_response.prompt)},
            output={"response_length": len(llm_response.answer)},
            metadata={
                "model": llm_response.model,
                "temperature": llm_response.temperature,
                "tokens_used": llm_response.token_count
            }
        )
        
        # Track quality metrics
        self._track_quality_metrics(trace, query, llm_response)
    
    def _track_quality_metrics(
        self, 
        trace, 
        query: str, 
        response: RAGResponse
    ) -> None:
        """
        Track response quality indicators
        """
        metrics = {
            "has_citations": len(response.citations) > 0,
            "answer_length": len(response.answer.split()),
            "confidence_score": response.confidence_score,
            "context_utilization": response.context_utilization_score
        }
        
        trace.score(
            name="response_quality",
            value=self._calculate_quality_score(metrics)
        )

Intelligent Caching Strategy

class RAGCacheService:
    def __init__(self, redis_client, ttl: int = 3600):
        self.redis = redis_client
        self.ttl = ttl
    
    async def get_cached_response(
        self, 
        query: str, 
        filters: Optional[Dict] = None
    ) -> Optional[CachedRAGResponse]:
        """
        Retrieve cached responses with semantic similarity matching
        """
        # Generate cache key
        cache_key = self._generate_cache_key(query, filters)
        
        # Check exact match first
        cached = await self.redis.get(cache_key)
        if cached:
            return CachedRAGResponse.parse_raw(cached)
        
        # Check for similar queries using embedding similarity
        similar_response = await self._find_similar_cached_query(query)
        if similar_response and similar_response.similarity_score > 0.85:
            return similar_response
        
        return None
    
    async def cache_response(
        self, 
        query: str, 
        response: RAGResponse,
        filters: Optional[Dict] = None
    ) -> None:
        """
        Cache responses with metadata for intelligent retrieval
        """
        cache_key = self._generate_cache_key(query, filters)
        
        cached_response = CachedRAGResponse(
            query=query,
            response=response,
            filters=filters,
            cached_at=datetime.utcnow(),
            access_count=1,
            quality_score=response.quality_score
        )
        
        # Store with TTL based on quality score
        ttl = self._calculate_adaptive_ttl(response.quality_score)
        
        await self.redis.setex(
            cache_key,
            ttl,
            cached_response.json()
        )
        
        # Update query embedding index for similarity search
        await self._index_query_embedding(query, cache_key)
    
    def _calculate_adaptive_ttl(self, quality_score: float) -> int:
        """
        Adjust cache TTL based on response quality
        """
        base_ttl = self.ttl
        quality_multiplier = min(2.0, max(0.5, quality_score * 2))
        return int(base_ttl * quality_multiplier)

🔧 API Endpoints and Usage

The system provides comprehensive RESTful APIs:

Core Endpoints

# Health check
GET /health

# Paper management
GET /api/v1/papers              # List papers with pagination
GET /api/v1/papers/{id}         # Get specific paper
POST /api/v1/papers/ingest      # Trigger ingestion

# Search endpoints
POST /api/v1/search             # BM25 keyword search
POST /api/v1/hybrid-search      # Hybrid search (BM25 + Vector)
POST /api/v1/ask                # RAG question answering

# Analytics endpoints
GET /api/v1/analytics/search    # Search performance metrics
GET /api/v1/analytics/rag       # RAG system metrics

Example API Usage

import httpx

async def query_rag_system():
    async with httpx.AsyncClient() as client:
        # Perform hybrid search
        search_response = await client.post(
            "http://localhost:8000/api/v1/hybrid-search",
            json={
                "query": "transformer attention mechanisms",
                "k": 5,
                "filters": {
                    "categories": ["cs.AI", "cs.LG"],
                    "date_range": {
                        "start": "2023-01-01",
                        "end": "2024-12-31"
                    }
                }
            }
        )
        
        # Ask RAG question
        rag_response = await client.post(
            "http://localhost:8000/api/v1/ask",
            json={
                "question": "How do transformer attention mechanisms work and what are their limitations?",
                "context_k": 10,
                "model": "llama3.1:8b",
                "temperature": 0.7
            }
        )
        
        return rag_response.json()

🎛️ Gradio Web Interface

The system includes a sophisticated web interface built with Gradio:

import gradio as gr

def create_rag_interface():
    with gr.Blocks(title="ArXiv Paper Curator") as interface:
        gr.Markdown("# 🤖 ArXiv Paper Curator RAG System")
        
        with gr.Tab("🔍 Search Papers"):
            with gr.Row():
                search_query = gr.Textbox(
                    label="Search Query",
                    placeholder="Enter your search terms..."
                )
                search_type = gr.Radio(
                    ["Keyword (BM25)", "Semantic (Vector)", "Hybrid"],
                    value="Hybrid",
                    label="Search Type"
                )
            
            search_button = gr.Button("Search", variant="primary")
            search_results = gr.JSON(label="Search Results")
        
        with gr.Tab("💬 Ask Questions"):
            with gr.Row():
                with gr.Column(scale=2):
                    question = gr.Textbox(
                        label="Question",
                        placeholder="Ask about the research papers...",
                        lines=3
                    )
                    
                    with gr.Row():
                        model_choice = gr.Dropdown(
                            ["llama3.1:8b", "mistral:7b", "codellama:13b"],
                            value="llama3.1:8b",
                            label="Model"
                        )
                        temperature = gr.Slider(
                            0.0, 1.0, 0.7,
                            label="Temperature"
                        )
                
                with gr.Column(scale=1):
                    context_k = gr.Slider(
                        1, 20, 10,
                        label="Context Papers"
                    )
                    include_citations = gr.Checkbox(
                        True,
                        label="Include Citations"
                    )
            
            ask_button = gr.Button("Ask Question", variant="primary")
            
            with gr.Row():
                answer = gr.Textbox(
                    label="Answer",
                    lines=10,
                    max_lines=20
                )
                context_papers = gr.JSON(
                    label="Source Papers"
                )
        
        with gr.Tab("📊 Analytics"):
            refresh_button = gr.Button("Refresh Analytics")
            
            with gr.Row():
                total_papers = gr.Number(label="Total Papers")
                indexed_papers = gr.Number(label="Indexed Papers")
                cache_hit_rate = gr.Number(label="Cache Hit Rate (%)")
            
            performance_chart = gr.Plot(label="Query Performance")
    
    return interface

# Launch the interface
if __name__ == "__main__":
    interface = create_rag_interface()
    interface.launch(server_name="0.0.0.0", server_port=7860)

🚀 Deployment and Scaling

Docker Compose Production Setup

version: '3.8'

services:
  app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://postgres:${POSTGRES_PASSWORD}@db:5432/arxiv_papers
      - OPENSEARCH_URL=http://opensearch:9200
      - REDIS_URL=redis://redis:6379
    depends_on:
      - db
      - opensearch
      - redis
    volumes:
      - ./data:/app/data
  
  db:
    image: postgres:15
    environment:
      - POSTGRES_DB=arxiv_papers
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
  
  opensearch:
    image: opensearchproject/opensearch:2.8.0
    environment:
      - discovery.type=single-node
      - "OPENSEARCH_JAVA_OPTS=-Xms2g -Xmx2g"
    volumes:
      - opensearch_data:/usr/share/opensearch/data
  
  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
  
  gradio:
    build: .
    command: python gradio_launcher.py
    ports:
      - "7860:7860"
    depends_on:
      - app

volumes:
  postgres_data:
  opensearch_data:
  redis_data:

Performance Optimization Tips

Database Optimization:
- Index frequently queried fields
- Use connection pooling
- Implement read replicas for scaling
Search Optimization:
- Optimize OpenSearch cluster settings
- Use index templates for consistent mapping
- Implement search result caching
Embedding Optimization:
- Batch embedding generation
- Cache embeddings for reused content
- Use quantized embeddings for memory efficiency
LLM Optimization:
- Implement model warming
- Use GPU acceleration when available
- Optimize prompt templates for efficiency

🔍 Monitoring and Observability

Key Metrics to Track

# Performance metrics
search_latency = histogram("search_latency_seconds")
rag_latency = histogram("rag_latency_seconds")
cache_hit_rate = gauge("cache_hit_rate")

# Quality metrics
answer_relevance = histogram("answer_relevance_score")
context_utilization = histogram("context_utilization_score")
user_satisfaction = histogram("user_satisfaction_score")

# System metrics
active_connections = gauge("active_database_connections")
opensearch_cluster_health = gauge("opensearch_cluster_health")
embedding_queue_size = gauge("embedding_queue_size")

Langfuse Dashboard Configuration

The system integrates with Langfuse for comprehensive observability:

Trace Analysis: Complete request traces from query to response
Performance Monitoring: Latency breakdowns by component
Quality Tracking: Response quality scores and user feedback
Cost Tracking: Token usage and API costs
Error Monitoring: Failure rates and error categorization

🎯 Advanced Features and Extensions

Extend the system to handle figures and tables from papers:

class MultiModalProcessor:
    async def process_figures(self, paper_id: str) -> List[Figure]:
        """Extract and process figures from research papers"""
        
    async def process_tables(self, paper_id: str) -> List[Table]:
        """Extract and process tables with structure preservation"""

2. Citation Network Analysis

Build citation graphs for enhanced paper discovery:

class CitationNetworkService:
    async def build_citation_graph(self) -> NetworkGraph:
        """Construct citation networks for paper recommendation"""
        
    async def find_influential_papers(self, topic: str) -> List[Paper]:
        """Identify highly cited papers in specific domains"""

3. Personalization Engine

Implement user preference learning:

class PersonalizationService:
    async def learn_user_preferences(self, user_id: str, interactions: List[Interaction]):
        """Learn user preferences from search and reading patterns"""
        
    async def personalize_search_results(self, user_id: str, results: List[SearchResult]) -> List[SearchResult]:
        """Rerank results based on user preferences"""

🏆 Best Practices and Lessons Learned

1. Data Quality Management

Implement robust PDF processing with fallback strategies
Validate extracted content before indexing
Monitor extraction success rates and identify problematic papers
Handle multilingual content appropriately

2. Search Quality Optimization

Tune hybrid search weights based on query types
Implement query expansion for better recall
Use reranking models for improved precision
A/B test different retrieval strategies

3. RAG System Design

Design context-aware prompts for different query types
Implement response validation to catch hallucinations
Use structured output formats for consistent responses
Monitor and improve answer quality continuously

4. Production Readiness

Implement comprehensive monitoring from day one
Design for horizontal scaling with stateless services
Use graceful degradation when components fail
Implement circuit breakers for external dependencies

🔮 Future Enhancements

The ArXiv Paper Curator roadmap includes several exciting developments:

1. Advanced RAG Techniques

Retrieval-Augmented Fine-tuning: Combine retrieval with specialized model training
Multi-Step Reasoning: Implement chain-of-thought for complex queries
Cross-Paper Synthesis: Generate insights by connecting multiple papers

2. Enhanced User Experience

Interactive Visualizations: Paper relationship graphs and trend analysis
Collaborative Features: Shared reading lists and annotations
Mobile Application: Native iOS/Android apps for on-the-go access

3. Research Assistant Features

Automated Literature Reviews: Generate comprehensive survey papers
Research Gap Identification: Highlight unexplored research areas
Methodology Comparison: Side-by-side analysis of different approaches

📚 Additional Resources

Learning Resources

Week-by-Week Notebooks: Detailed Jupyter notebooks for each implementation phase
Video Walkthroughs: Step-by-step implementation guides
Community Forum: Active discussion and support community

Documentation

API Reference: Complete OpenAPI specification
Architecture Guide: Detailed system design documentation
Deployment Guide: Production deployment best practices

Community and Support

GitHub Discussions: Community Q&A and feature requests
Discord Server: Real-time chat and collaboration
Office Hours: Weekly sessions with the development team

🎉 Conclusion

The ArXiv Paper Curator represents a comprehensive approach to building production-ready RAG systems. By following this tutorial, you’ve learned to:

Design scalable RAG architectures that handle real-world complexity
Implement hybrid search systems that combine multiple retrieval strategies
Build robust data pipelines for continuous content ingestion
Deploy monitoring solutions for production observability
Create intuitive interfaces for end-user interaction

The skills and patterns you’ve learned here are directly applicable to any domain where you need to build intelligent systems that can reason over large document collections. Whether you’re working with legal documents, medical research, or technical documentation, the principles remain the same.

Next Steps

Experiment with different domains: Adapt the system for your specific use case
Contribute to the project: Share improvements and extensions with the community
Scale to production: Deploy your own instance and gather real user feedback
Stay updated: Follow the project for new features and enhancements

Remember, building production AI systems is an iterative process. Start with the basics, measure everything, and continuously improve based on real-world usage patterns.

Ready to build your own RAG system? Clone the repository and start with Week 1: ArXiv Paper Curator on GitHub

For the latest updates and community discussions, join our Discord server and follow @jamwithai on Twitter.