PDF to Markdown Without an LLM: LiteParse, RAG Ingest Cost, and Data Sovereignty
The first step of a RAG pipeline is document ingest, and the most common bottleneck in that first step is PDF parsing. LLM-based parsers have become more common recently, but running an LLM over every document racks up cost and latency, and sending sensitive documents to an external model raises data sovereignty concerns. LlamaIndex (Jerry Liu) has released LiteParse, which takes a different approach. It is an Apache 2.0 open-source parser that converts PDFs to markdown without an LLM.
At ThakiCloud we handle RAG document ingest on our Kubernetes-based AI/ML SaaS platform. Let’s look at why a model-independent parser is attractive from a cost and sovereignty standpoint, and where it still needs to be hedged.
What’s Different: Model-Free Parsing
The core differentiator of LiteParse is that it does not use an LLM to parse. The benefits of this design are clear.
- Cost: there is no per-document LLM call cost, so parsing spend does not grow with every added document the way per-call model fees do.
- Latency: parsing is fast because there is no LLM inference round trip.
- Data sovereignty: documents are never sent to an external model. This is a decisive advantage for organizations that need to process sensitive documents in-house.
- Determinism: an LLM-based parser can produce different results for the same document on different calls, whereas a rule-based parser is reproducible.
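The determinism point can be checked empirically. The sketch below is a minimal illustration, not LiteParse's API: `parse_fn` is a hypothetical callable standing in for whatever parser you use, and `toy_rule_based_parser` is an invented placeholder so the example runs on its own. It parses the same bytes several times and compares SHA-256 fingerprints of the output.

```python
import hashlib
from typing import Callable


def output_fingerprint(markdown: str) -> str:
    """Return the SHA-256 hex digest of a parser's markdown output."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def is_reproducible(
    parse_fn: Callable[[bytes], str], pdf_bytes: bytes, runs: int = 3
) -> bool:
    """Parse the same input `runs` times and report whether every output matched."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    fingerprints = {output_fingerprint(parse_fn(pdf_bytes)) for _ in range(runs)}
    return len(fingerprints) == 1


def toy_rule_based_parser(pdf_bytes: bytes) -> str:
    """Hypothetical stand-in for a rule-based parser: same bytes in, same text out."""
    return "# " + pdf_bytes.decode("utf-8", errors="replace").strip()
```

A check like this could run in CI against a small set of fixture PDFs. A stable fingerprint also makes a convenient cache key, so unchanged documents are not re-parsed.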
LiteParse’s team claims the top scores on several benchmarks within the model-free parser category. Note, though, that this claim is self-reported and scoped to the model-free category. It is not an absolute comparison against LLM-based parsers, but rather a claim of “best among parsers that don’t use a model.” Any speed or accuracy figure should be read with that scope in mind.
Trade-offs from a RAG Ingest Perspective
A model-free parser is not a silver bullet. The trade-offs need to be stated clearly.
- Structurally complex documents: tables, multi-column layouts, and scanned image PDFs are areas where rule-based parsers tend to struggle. An LLM vision-based parser may perform better here.
- A hybrid strategy: it is practical to process most ordinary documents quickly and cheaply with a model-free parser, and route only the small minority of structurally complex documents to an LLM-based parser. This design confines LLM cost to the documents where it actually improves quality.
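The hybrid strategy above can be sketched as a small routing function. Everything here is an assumption for illustration: the `DocFeatures` fields, the thresholds, and the `Route` names are hypothetical, and the features are assumed to come from a cheap upstream inspection step. None of it is part of LiteParse.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    MODEL_FREE = "model_free"  # default path: fast, cheap, stays in-house
    LLM = "llm"                # escalation path for structurally complex documents


@dataclass(frozen=True)
class DocFeatures:
    """Hypothetical per-document signals gathered before parsing."""
    page_count: int
    pages_without_text_layer: int  # pages that look like scanned images
    table_count: int
    max_columns: int               # widest column layout seen on any page


def choose_route(
    features: DocFeatures,
    scanned_ratio_limit: float = 0.2,    # assumed threshold, tune on your own data
    tables_per_page_limit: float = 1.0,  # assumed threshold
    column_limit: int = 2,               # assumed threshold
) -> Route:
    """Escalate to the LLM path only when a complexity signal crosses its limit."""
    if features.page_count <= 0:
        raise ValueError("page_count must be positive")
    scanned_ratio = features.pages_without_text_layer / features.page_count
    tables_per_page = features.table_count / features.page_count
    if scanned_ratio > scanned_ratio_limit:
        return Route.LLM
    if tables_per_page > tables_per_page_limit:
        return Route.LLM
    if features.max_columns > column_limit:
        return Route.LLM
    return Route.MODEL_FREE
```

For example, a ten-page report with a text layer on every page, two tables, and a single column stays on the model-free path, while a fully scanned document escalates. Where data sovereignty matters, the escalation target would itself need to be a model hosted in-house.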
The ThakiCloud View: Treating Ingest Cost as a First-Class Citizen
When you run a RAG pipeline in production, ingest cost ends up taking a surprisingly large share of the total. The more documents you have and the more frequently they are updated, the more your operating cost depends on whether you use an LLM for parsing. Setting a model-free parser like LiteParse as the default path, and escalating only complex documents to an LLM-based parser, is a cost-efficient routing strategy.
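A back-of-the-envelope model makes the routing argument concrete. This is a sketch under stated assumptions: the function name, the parameters, and every number in the usage note are hypothetical, and the marginal cost of model-free parsing is treated as zero, which ignores the compute you still pay to run it.

```python
def monthly_llm_parse_cost(
    docs: int,
    pages_per_doc: float,
    reparses_per_month: float,
    llm_price_per_page: float,
    escalation_rate: float,
) -> float:
    """Estimate monthly LLM parsing spend when only a fraction of pages is escalated.

    escalation_rate is the share of pages routed to the LLM-based parser:
    1.0 means everything goes through the LLM, 0.0 means nothing does.
    Model-free parsing is assumed to add no per-page fee (a simplification).
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    pages_parsed = docs * pages_per_doc * reparses_per_month
    return pages_parsed * escalation_rate * llm_price_per_page
```

With made-up inputs such as 10,000 documents of 10 pages each, re-parsed twice a month at an assumed price of 0.01 per page, comparing `escalation_rate=1.0` against `escalation_rate=0.05` shows how the bill scales with the share of documents that actually need the LLM path.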
This is exactly the area we work in. We standardize document ingest pipelines on Kubernetes, route parsers according to document type, and process sensitive documents in-house to guarantee data sovereignty. We treat ingest not as simple preprocessing, but as a first-class design problem where cost, sovereignty, and quality all meet.
Closing Thoughts
LiteParse delivers the message that “an LLM is not always necessary for RAG ingest.” A model-free parser has clear advantages in cost, latency, and data sovereignty, and a hybrid approach that supplements it with an LLM-based parser for complex documents is practical. For engineers who care about treating ingest cost as a first-class citizen, this is exactly the kind of problem we work on every day.
Source: LlamaIndex LiteParse (Apache 2.0). GitHub: https://github.com/run-llama/llama_cloud_services (benchmark scores are self-reported and scoped to the model-free category).