
LiteParse: A Fast, Local PDF Parsing Toolbox for Structured Data

License: Apache 2.0
Docs: https://developers.llamaindex.ai/liteparse/


Looking for LiteParse V1? Follow this link to the old code: https://github.com/run-llama/liteparse/tree/logan/liteparse-v1

LiteParse is a standalone, open-source PDF parsing tool focused exclusively on fast, lightweight parsing. It performs high-quality spatial text parsing with bounding boxes and does not rely on proprietary LLM features or cloud dependencies. Everything runs locally on your machine. When documents push past the limits of local parsing (dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs), LlamaParse, a cloud-based document parser designed for production pipelines, is the companion option. LlamaParse handles the hard parts so your models see clean, structured data and markdown. If you are hitting those limits, you can sign up for LlamaParse for free and explore the cloud option.

Overview and Core Promise

LiteParse is designed for developers who want speed, locality, and control. It embraces a simple yet powerful model: convert various formats to a common input (PDF) and extract structured text with precise bounding boxes. It then merges native text with OCR-derived results to recreate a faithful representation of the document’s layout for downstream pipelines or LLM-backed agents.

Key promises include:

Speed and lightness: optimized for rapid parsing without cloud calls or costly LLM ops.
Local execution: every step runs on your machine, from conversion to extraction to rendering.
Spatial awareness: bounding boxes accompany text to preserve layout, enabling layout-preserving downstream processing.
Flexible OCR: a built-in OCR option and pluggable HTTP OCR servers for higher accuracy or custom setups.
Multi-language and multi-platform support: usable from Rust, Node.js/TypeScript, Python, and in the browser via WASM; runs on Linux, macOS (Intel/ARM), and Windows.

A Unified Flow: Input to Output

LiteParse treats document processing as a single pipeline. It accepts a wide range of input formats and produces several output formats suited to large language models, databases, or document workflows. The same pipeline is accessible through multiple language bindings and runs in diverse environments.

System Architecture in Brief

The project is organized around a Rust core with language bindings and platform-specific wrappers. The core components interoperate through a consistent, well-documented API. The Mermaid diagram below maps the high-level flow from input formats to final outputs and bindings:

flowchart LR
  subgraph Input["Input Formats"]
    PDF["PDF"]
    DOCX["DOCX"]
    XLSX["XLSX"]
    PPTX["PPTX"]
    IMG["Images"]
  end
  subgraph Core["Rust Core"]
    CONV["Format Conversion\nLibreOffice / ImageMagick"]
    EXTRACT["Text Extraction\nPDFium C library"]
    OCR["Selective OCR\nTesseract / HTTP / Custom"]
    MERGE["OCR Merge\nNative text + OCR results"]
    PROJ["Grid Projection\nSpatial layout reconstruction"]
  end
  CONV --> EXTRACT
  EXTRACT --> OCR
  OCR --> MERGE
  MERGE --> PROJ
  PROJ --> Output["Output"]
  DOCX & XLSX & PPTX & IMG --> CONV
  subgraph Output["Output"]
    JSON["Structured JSON\ntext + bounding boxes"]
    TEXT["Plain Text\nlayout-preserved"]
    SCREEN["Screenshots\nPNG rendering"]
  end
  Output --> Bindings
  Bindings["Language Bindings"]
  Bindings --> NAPI["Node.js / TypeScript\nnapi-rs"]
  Bindings --> PYO3["Python\nPyO3"]
  Bindings --> WASM["Browser / WASM\nwasm-bindgen"]
  Bindings --> CLI["CLI\ncargo / npm / pip"]
  PDF --> EXTRACT

Note: This diagram is a schematic representation of the workflow. The actual implementation stitches together multiple subsystems to produce structured outputs ready for downstream processing.

Core Capabilities and Features

LiteParse is built around a concise set of capabilities that cover what most teams need for local document parsing and extraction. Here are the core features you’ll likely leverage first:

Fast Text Parsing: Spatial text parsing using PDFium ensures text is extracted with precise geometry, enabling faithful reproduction of document layout.
Flexible OCR System:
Built-in OCR: Tesseract is bundled with the library for zero-setup OCR out of the box.
HTTP OCR Servers: You can plug in an OCR server (EasyOCR, PaddleOCR, or your own) via a simple interface.
Standard API: A simple, well-defined OCR API specification supports easy integration with various OCR backends and pipelines.
Screenshot Generation: Generate high-quality page screenshots to accompany textual data for LLM agents or human review.
Multiple Output Formats: Choose between JSON for structured data and raw text for quick lookups.
Bounding Boxes: Each piece of text comes with precise bounding box coordinates to preserve spatial information.
Multi-language and Cross-Platform: Use LiteParse from Rust, Node.js/TypeScript, Python, or directly in the browser via WASM.
Broad Platform Support: Compatible with Linux, macOS (Intel and ARM), and Windows.

Input Formats and the Core Workflow

LiteParse emphasizes the ability to handle a broad set of input formats by automatically converting documents to PDF before parsing. This normalization step allows a consistent parsing strategy across diverse sources.

Supported Input Formats

Office Documents (via LibreOffice, automatic conversion):
Word: .doc, .docx, .docm, .odt, .rtf, .pages
PowerPoint: .ppt, .pptx, .pptm, .odp, .key
Spreadsheets: .xls, .xlsx, .xlsm, .ods, .csv, .tsv, .numbers
Quick note: LibreOffice is required for automatic conversion; installation commands vary by OS (macOS, Ubuntu/Debian, Windows).
Images (via ImageMagick):
Formats: .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg
ImageMagick handles conversion to PDF for the parsing step.
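
The routing described above can be pictured as a lookup from file extension to conversion tool. The following is a minimal Python sketch of that idea, not LiteParse's own code: the function name pick_converter and the returned labels are hypothetical, and only the extension lists are taken from this post.

```python
from pathlib import Path

# Extension groups copied from the lists in this post.
OFFICE_EXTENSIONS = {
    ".doc", ".docx", ".docm", ".odt", ".rtf", ".pages",
    ".ppt", ".pptx", ".pptm", ".odp", ".key",
    ".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv", ".numbers",
}
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".svg"}


def pick_converter(path: str) -> str:
    """Return which conversion step a file would need (hypothetical labels)."""
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        return "none"          # already a PDF, nothing to convert
    if ext in OFFICE_EXTENSIONS:
        return "libreoffice"   # office documents go through LibreOffice
    if ext in IMAGE_EXTENSIONS:
        return "imagemagick"   # images go through ImageMagick
    return "unsupported"
```

A check like this can be useful before a batch run, to find files that need LibreOffice or ImageMagick installed before any parsing starts.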

The Core Processing Stages

Format Conversion: LibreOffice (for office documents) and ImageMagick (for images) convert various formats to PDF so that a single parsing engine can work consistently.
Text Extraction: PDFium handles the heavy lifting of extracting text with a focus on preserving page geometry.
OCR (Selective OCR): If OCR is enabled, Tesseract (through a Rust binding) or an HTTP OCR server supplies text for content that the PDF text layer does not capture.
OCR Merge: Native text and OCR results are merged to maximize accuracy and layout fidelity.
Layout Reconstruction: Grid projection and spatial layout reconstruction bring the document’s structure back into a machine-friendly form for downstream use.
Output Production: The system emits JSON with text and bounding boxes, plain text with preserved layout, and page screenshots for agent-based workflows.
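
To make the OCR Merge stage more concrete, here is one plausible way to combine the two sources, sketched in Python. It is an illustration under stated assumptions, not LiteParse's actual algorithm: the TextBox type, the overlap threshold of 0.5, and the rule "native text wins where boxes overlap" are all hypothetical choices, and y is assumed to grow downward.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    text: str
    x1: float
    y1: float
    x2: float
    y2: float
    source: str  # "native" or "ocr"


def overlap_ratio(a: TextBox, b: TextBox) -> float:
    """Fraction of box a's area that is covered by box b."""
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    if w <= 0 or h <= 0:
        return 0.0
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    return (w * h) / area_a if area_a > 0 else 0.0


def merge_boxes(native, ocr, threshold=0.5):
    """Keep all native boxes; add OCR boxes not already covered by native text."""
    merged = list(native)
    for box in ocr:
        if all(overlap_ratio(box, n) < threshold for n in native):
            merged.append(box)
    # Reading order: top-to-bottom, then left-to-right (y assumed to grow downward).
    return sorted(merged, key=lambda b: (b.y1, b.x1))
```

In this sketch, an OCR result lying on top of text the PDF already exposes is dropped, while OCR text from an image-only region (a stamp or a scanned figure, for instance) is kept.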

Output Formats and Bindings

Output Formats:
JSON: Structured JSON that includes text and bounding boxes.
Text: Plain text maintaining layout semantics.
Screenshots: PNG renderings of pages for visual inspection or agent reasoning.
Language Bindings:
Node.js / TypeScript: napi-rs bindings.
Python: PyO3 bindings.
WASM: Browser-ready bindings via wasm-bindgen.
CLI: Rust-based CLI that works across platforms.
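
The layout-preserved plain text output can be understood as text snapped onto a character grid. The Python sketch below shows that general idea only; it is not LiteParse's grid projection. The function name, the (text, x, y) input shape, and the cell sizes char_width and line_height are assumptions made for illustration.

```python
def project_to_grid(boxes, char_width=6.0, line_height=12.0):
    """Place (text, x, y) items on a character grid to approximate page layout.

    boxes is a list of (text, x, y) tuples, with x/y the top-left corner in
    page units. char_width and line_height are assumed cell sizes.
    """
    rows = {}
    for text, x, y in boxes:
        row = round(y / line_height)
        col = round(x / char_width)
        rows.setdefault(row, []).append((col, text))
    if not rows:
        return ""

    lines = []
    # Start at the first occupied row, so the empty top margin is not emitted.
    for row in range(min(rows), max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and col <= len(line):
                line += " "            # items would collide: keep one space
            else:
                line = line.ljust(col)  # pad with spaces up to the column
            line += text
        lines.append(line)
    return "\n".join(lines)
```

Given a header row and a data row whose items share x positions, the output keeps the columns aligned, which is what makes table-like content readable in plain text.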

Installation and Getting Started

LiteParse ships a consistent CLI across its language packages; the WASM package is the exception and is packaged differently. The installation instructions below reflect that shared CLI experience.

Node.js / TypeScript
Install: npm i @llamaindex/liteparse
Library docs: Node.js README at packages/node/README.md
Python
Install: pip install liteparse
Library docs: Python README at packages/python/README.md
Rust
CLI: cargo install liteparse
Library: cargo add liteparse
Library docs: Rust README on crates.io at crates/liteparse/README.md
Browser (WASM)
Install: npm i @llamaindex/liteparse-wasm
Library docs: WASM README at packages/wasm/README.md

Agent Skill: Integrating LiteParse into Automated Pipelines

LiteParse can be loaded as a skill in agent environments. For example, you can download it with the skills CLI:

npx skills add run-llama/llamaparse-agent-skills --skill liteparse

You can also review the SKILL.md file for LiteParse within the llamaparse-agent-skills repository to tailor it to your own skills setup.

Command-Line Usage: Parse, Batch, and Screenshots

The CLI in LiteParse is designed for ease of use and consistency across all installations. Here are representative usage patterns.

Parse Files

Basic parsing
lit parse document.pdf
Parse with a specific output format
lit parse document.pdf --format json -o output.json
Parse specific pages
lit parse document.pdf --target-pages "1-5,10,15-20"
Parse without OCR
lit parse document.pdf --no-ocr
Parse a remote PDF
curl -sL https://example.com/report.pdf | lit parse -
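
For scripted use, the same commands can be driven from another program. The Python sketch below builds a lit parse command from the flags shown above and runs it with the standard subprocess module. The helper names are hypothetical, and the sketch assumes that the parsed result is written to standard output when -o is not given, which this post does not state explicitly.

```python
import shutil
import subprocess


def build_parse_command(pdf_path, fmt=None, output=None, target_pages=None, ocr=True):
    """Assemble a lit parse argument list from the flags shown in this post."""
    cmd = ["lit", "parse", pdf_path]
    if fmt:
        cmd += ["--format", fmt]
    if output:
        cmd += ["-o", output]
    if target_pages:
        cmd += ["--target-pages", target_pages]
    if not ocr:
        cmd.append("--no-ocr")
    return cmd


def run_parse(pdf_path, **options):
    """Run the CLI and return its stdout; raises if lit is missing or fails."""
    if shutil.which("lit") is None:
        raise FileNotFoundError("the lit executable was not found on PATH")
    result = subprocess.run(
        build_parse_command(pdf_path, **options),
        capture_output=True, text=True, check=True,
    )
    # Assumption: without -o, the parsed text is printed to stdout.
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with values such as the page specification.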

Batch Parsing

Process an entire directory of documents
lit batch-parse ./input-directory ./output-directory

Generate Screenshots

Screenshot all pages
lit screenshot document.pdf -o ./screenshots
Screenshot specific pages
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots
Rendering DPI
lit screenshot document.pdf --dpi 300 -o ./screenshots
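
As a worked example of what the DPI setting means for image size: PDF page dimensions are expressed in points of 1/72 inch, so the pixel size of a rendered page is roughly the point size times DPI divided by 72. The short Python sketch below does that arithmetic. The function name is hypothetical, and the exact rounding LiteParse applies is not documented here, so treat the results as estimates.

```python
def pixel_size(width_pt, height_pt, dpi):
    """Estimate rendered pixel dimensions for a page measured in PDF points."""
    # One PDF point is 1/72 inch, so pixels = points * dpi / 72.
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


# US Letter is 612 x 792 points.
print(pixel_size(612, 792, 150))  # (1275, 1650)
print(pixel_size(612, 792, 300))  # (2550, 3300)
```

Doubling the DPI doubles each dimension and so quadruples the pixel count, which matters for disk usage and for any OCR or vision model that consumes the screenshots.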

CLI Reference Highlights

Parse Command Options (highlights)

-o, --output : Output file path
--format : Output format; defaults to text
--no-ocr: Disable OCR
--ocr-language : OCR language (Tesseract format; default eng)
--ocr-server-url : HTTP OCR server URL (uses Tesseract if not provided)
--tessdata-path : Path to tessdata directory
--max-pages : Max pages to parse (default: 1000)
--target-pages : Pages to parse (e.g., "1-5,10,15-20")
--dpi : Rendering DPI (default 150)
--preserve-small-text: Keep very small text
--password : Password for encrypted documents
--num-workers : Concurrent OCR workers
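
The page specification accepted by --target-pages combines single pages and ranges. The Python sketch below shows one way to expand such a string. It assumes pages are numbered from 1 and that ranges are inclusive, which matches the examples above but is not spelled out in this post; the function name is hypothetical and this is not the CLI's own parser.

```python
def parse_target_pages(spec):
    """Expand a spec such as "1-5,10,15-20" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # inclusive range (assumed)
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are assumed to start at 1")
    return sorted(pages)
```

For the example specification "1-5,10,15-20" this yields twelve pages, which is a quick way to sanity-check a selection against the --max-pages limit before running a large job.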

Batch Parse Command Options (highlights)

--format
--no-ocr
--ocr-language
--ocr-server-url
--tessdata-path
--max-pages (default 1000)
--dpi (default 150)
--recursive: Recursively search input directory
--extension : Process only files with this extension
--password
--num-workers
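
The --recursive and --extension options describe a file-discovery step. The Python sketch below mimics that behavior with pathlib so you can preview which files a batch run might pick up. It is an approximation: the function name is hypothetical, and whether the CLI expects the extension with or without a leading dot is not stated above, so the sketch accepts both.

```python
from pathlib import Path


def find_inputs(input_dir, recursive=False, extension=None):
    """List files under input_dir, optionally recursing and filtering by extension."""
    root = Path(input_dir)
    paths = root.rglob("*") if recursive else root.glob("*")
    # Accept "pdf" or ".pdf"; which form the CLI expects is not stated above.
    wanted = "." + extension.lstrip(".").lower() if extension else None
    return sorted(
        p for p in paths
        if p.is_file() and (wanted is None or p.suffix.lower() == wanted)
    )
```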

Screenshot Command Options (highlights)

-o, --output-dir : Output directory
--target-pages : Pages to screenshot
--dpi : Rendering DPI
--password : Password for encrypted documents

OCR Setup: Default and Optional Paths

Default: Tesseract is bundled and enabled by default
Basic flow: lit parse document.pdf
Language customization: lit parse document.pdf --ocr-language fra
Disable OCR: lit parse document.pdf --no-ocr
Offline Environments: Use TESSDATA_PREFIX to point to traineddata files
Example: export TESSDATA_PREFIX=/path/to/tessdata
Then run: lit parse document.pdf --ocr-language eng
Direct Tessdata Path:
lit parse document.pdf --tessdata-path /path/to/tessdata
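
In offline environments, a missing language file is a common cause of OCR failures. Tesseract stores each language model in a file named after its code, such as eng.traineddata, and its language format joins several codes with a plus sign. The Python sketch below checks a tessdata directory for the files a given language setting would need. The function name is hypothetical, and whether LiteParse accepts multi-language values such as eng+fra is an assumption based on the "Tesseract format" wording above.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, languages):
    """Return language codes whose <code>.traineddata file is absent.

    languages uses Tesseract's format, where "+" joins several codes
    (for example "eng+fra").
    """
    root = Path(tessdata_dir)
    codes = [code for code in languages.split("+") if code]
    return [code for code in codes if not (root / f"{code}.traineddata").is_file()]
```

Running a check like this before setting TESSDATA_PREFIX or --tessdata-path makes it clear which traineddata files still need to be copied onto the machine.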

Optional: HTTP OCR Servers

LiteParse can be augmented with HTTP OCR services for higher accuracy or scalability. The project provides ready-to-use wrappers for popular engines:

EasyOCR (via ocr/easyocr)
PaddleOCR (via ocr/paddleocr)

You can implement any OCR service by following the simple LiteParse OCR API specification (OCRAPISPEC.md). The API typically requires:

POST /ocr endpoint
Accepts file and language parameters
Returns JSON: { results: [{ text, bbox: [x1, y1, x2, y2], confidence }] }
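
To illustrate the response shape listed above, the Python sketch below decodes such a JSON body into typed records. Only the field names results, text, bbox, and confidence come from this post. The OcrResult type, the function name, and the min_confidence filter are hypothetical, and the sketch assumes confidence is a number where higher means more certain.

```python
import json
from dataclasses import dataclass


@dataclass
class OcrResult:
    text: str
    bbox: tuple  # (x1, y1, x2, y2)
    confidence: float


def parse_ocr_response(body, min_confidence=0.0):
    """Decode a response body shaped like the JSON above into OcrResult items."""
    payload = json.loads(body)
    results = []
    for item in payload["results"]:
        x1, y1, x2, y2 = item["bbox"]  # raises if bbox does not have 4 values
        result = OcrResult(item["text"], (x1, y1, x2, y2), float(item["confidence"]))
        if result.confidence >= min_confidence:
            results.append(result)
    return results
```

A validator like this is handy when writing your own OCR server, since it fails loudly on a missing key or a malformed bounding box before the server is wired into a parsing run.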

Multi-Format Input Support: Office and Images

One of LiteParse’s strengths is its seamless handling of a broad range of input formats through automatic conversion to PDF prior to parsing.

Office Documents via LibreOffice

Word formats: .doc, .docx, .docm, .odt, .rtf, .pages
PowerPoint formats: .ppt, .pptx, .pptm, .odp, .key
Spreadsheets: .xls, .xlsx, .xlsm, .ods, .csv, .tsv, .numbers

LibreOffice installation commands (examples):

macOS: brew install --cask libreoffice
Ubuntu/Debian: sudo apt-get install libreoffice
Windows: choco install libreoffice-fresh
Note: On Windows you may need to add LibreOffice’s program directory to your PATH, e.g., C:\Program Files\LibreOffice\program

Images via ImageMagick

Supported image formats: .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg

ImageMagick installation commands (examples):

macOS: brew install imagemagick
Ubuntu/Debian: sudo apt-get install imagemagick
Windows: choco install imagemagick.app

Environment Variables and Development

TESSDATA_PREFIX: Path to a directory containing Tesseract traineddata files. This is essential for offline or air-gapped environments.

Development: A Rust Core with Bindings and Packages

LiteParse is organized as a Rust workspace with a core library and language-specific binding crates. The repository structure typically includes:

crates/
  liteparse/: Core library + CLI binary
  liteparse-napi/: Node.js bindings (napi-rs)
  liteparse-python/: Python bindings (PyO3)
  liteparse-wasm/: WASM bindings (wasm-bindgen)
  pdfium/: PDFium Rust wrapper
  pdfium-sys/: PDFium FFI bindings
packages/
  node/: npm package (TS wrapper + native binary)
  python/: PyPI package (Python wrapper + native binary)
  wasm/: WASM npm package

Building

Compile the CLI
cargo build --release -p liteparse
Build Node.js bindings
cd packages/node && npm run build
Build Python bindings
cd packages/python && maturin develop --release
Build WASM
cd packages/wasm && npm run build

The repository also carries development notes in AGENTS.md and CLAUDE.md, which help teams set up environments and coding agents around LiteParse.

Credits and License

LiteParse is Apache-2.0 licensed, with credits acknowledging the foundational projects it builds upon:

PDFium: PDF rendering and text extraction
Tesseract: OCR engine (via tesseract-rs)
EasyOCR: HTTP OCR server (optional)
PaddleOCR: HTTP OCR server (optional)
napi-rs: Node.js native bindings
PyO3: Python native bindings
wasm-bindgen: WebAssembly bindings

A Production-Grade Parser for Local Pipelines

LiteParse is designed for teams that require robust local parsing with a strong emphasis on layout fidelity. It provides the essential building blocks for production pipelines where data must be extracted from documents efficiently, privately, and without reliance on external cloud services. The combination of a fast Rust core, flexible OCR options, and multi-language bindings makes it a solid foundation for:

Data extraction from legal, financial, or administrative documents
Preprocessing for knowledge graphs or relational databases
Preparation of evidence or reports with preserved layout for humans and agents
Agent workflows that need reliable, visual context through page screenshots

Documentation and Learning Resources

Official documentation is available at the Docs URL included at the top of this post.
The project encourages exploring the old LiteParse V1 for historical context and migration paths.
The OSS nature invites community contributions, feature requests, and extensions (for example, additional OCR backends or new output formats).

A Closing View: Why LiteParse Matters

In today’s document-driven ecosystems, the tension between speed, privacy, and accuracy often forces teams toward either cloud-based solutions or heavy on-premises tooling with steep setup costs. LiteParse provides a pragmatic middle ground: a fast, locally run parser that respects privacy and runs with modest dependencies. The built-in Tesseract OCR eliminates immediate setup friction while the option to swap in HTTP OCR servers unlocks higher accuracy for challenging documents. The multi-format input support and PDF-centric pipeline enable a uniform approach to parsing that can be extended into larger document-processing pipelines.

If you are building a document automation system, a data ingest pipeline for ML models, or an agent-based tool that needs precise, layout-aware text extraction, LiteParse is worth evaluating. Combine it with LlamaParse for cloud-based enhancements when your pipeline demands scale or more advanced OCR capabilities. Together, they cover the spectrum from local to cloud, all while keeping your data in your control.

Tags: #PDFParsing #OCR #DocumentProcessing #Rust #WASM #NodeJS #Python #OpenSource #BoundingBoxes #LayoutPreservation

