LiteParse
LiteParse: A Fast, Local PDF Parsing Toolbox for Structured Data
License: Apache 2.0
Docs: https://developers.llamaindex.ai/liteparse/
Looking for LiteParse V1? Follow this link to the old code: https://github.com/run-llama/liteparse/tree/logan/liteparse-v1
LiteParse is a standalone open-source PDF parsing tool focused exclusively on fast and light parsing. It performs high-quality spatial text parsing with bounding boxes and does not rely on proprietary LLM features or cloud dependencies. Everything runs locally on your machine. When the limits of local parsing begin to bite—dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs—LiteParse partners with LlamaParse, a cloud-based document parser designed for production pipelines. LlamaParse handles the hard parts so your models see clean, structured data and markdown. If you’re hitting those limits, you can sign up for LlamaParse free and explore the cloud option.
Overview and Core Promise
LiteParse is designed for developers who want speed, locality, and control. It embraces a simple yet powerful model: convert various formats to a common input (PDF) and extract structured text with precise bounding boxes. It then merges native text with OCR-derived results to recreate a faithful representation of the document’s layout for downstream pipelines or LLM-backed agents.
Key promises include:
- Speed and lightness: optimized for rapid parsing without cloud calls or costly LLM ops.
- Local execution: every step runs on your machine, from conversion to extraction to rendering.
- Spatial awareness: bounding boxes accompany text to preserve layout, enabling layout-preserving downstream processing.
- Flexible OCR: a built-in OCR option and pluggable HTTP OCR servers for higher accuracy or custom setups.
- Multi-language and multi-platform support: usable from Rust, Node.js/TypeScript, Python, and in the browser via WASM; runs on Linux, macOS (Intel/ARM), and Windows.
A Unified Flow: Input to Output
LiteParse embraces a holistic view of document processing. The pipeline begins with a wide range of input formats and ends with multiple output formats suitable for ingestion into large language models, databases, or document workflows. The architecture is designed to be accessible through multiple language bindings and run in diverse environments.
System Architecture in Brief
The project is organized around a Rust core with language bindings and platform-specific wrappers. The core components are designed to interoperate through a consistent, well-documented API. The included diagram below, written in Mermaid, maps the high-level flow from input formats to final outputs and bindings:
flowchart LR
  subgraph Input["Input Formats"]
    PDF["PDF"]
    DOCX["DOCX"]
    XLSX["XLSX"]
    PPTX["PPTX"]
    IMG["Images"]
  end
  subgraph Core["Rust Core"]
    CONV["Format Conversion\nLibreOffice / ImageMagick"]
    EXTRACT["Text Extraction\nPDFium C library"]
    OCR["Selective OCR\nTesseract / HTTP / Custom"]
    MERGE["OCR Merge\nNative text + OCR results"]
    PROJ["Grid Projection\nSpatial layout reconstruction"]
  end
  subgraph Output["Output"]
    JSON["Structured JSON\ntext + bounding boxes"]
    TEXT["Plain Text\nlayout-preserved"]
    SCREEN["Screenshots\nPNG rendering"]
  end
  Bindings["Language Bindings"]
  PDF --> CONV
  DOCX & XLSX & PPTX & IMG --> CONV
  PDF --> EXTRACT
  CONV --> EXTRACT
  EXTRACT --> OCR
  OCR --> MERGE
  MERGE --> PROJ
  PROJ --> Output
  Output --> Bindings
  Bindings --> NAPI["Node.js / TypeScript\nnapi-rs"]
  Bindings --> PYO3["Python\nPyO3"]
  Bindings --> WASM["Browser / WASM\nwasm-bindgen"]
  Bindings --> CLI["CLI\ncargo / npm / pip"]
Note: This diagram is a schematic representation of the workflow. The actual implementation stitches together multiple subsystems to produce structured outputs ready for downstream processing.
Core Capabilities and Features
LiteParse is built around a concise set of capabilities that cover what most teams need for local document parsing and extraction. Here are the core features you’ll likely leverage first:
- Fast Text Parsing: Spatial text parsing using PDFium ensures text is extracted with precise geometry, enabling faithful reproduction of document layout.
- Flexible OCR System:
  - Built-in OCR: Tesseract is bundled with the library for zero-setup OCR out of the box.
  - HTTP OCR Servers: You can plug in an OCR server (EasyOCR, PaddleOCR, or your own) via a simple interface.
  - Standard API: A simple, well-defined OCR API specification supports easy integration with various OCR backends and pipelines.
- Screenshot Generation: Generate high-quality page screenshots to accompany textual data for LLM agents or human review.
- Multiple Output Formats: Choose between JSON for structured data and raw text for quick lookups.
- Bounding Boxes: Each piece of text comes with precise bounding box coordinates to preserve spatial information.
- Multi-language and Cross-Platform: Use LiteParse from Rust, Node.js/TypeScript, Python, or directly in the browser via WASM.
- Broad Platform Support: Compatible with Linux, macOS (Intel and ARM), and Windows.
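To make the bounding-box and layout-preservation features more concrete, here is a minimal Python sketch of how positioned text fragments could be projected onto a character grid so that columns stay aligned in plain text. It is an illustration only, not LiteParse's actual algorithm: the `TextItem` type, its field names, and the fixed `char_width` and `line_height` values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    """Hypothetical text fragment; field names are illustrative, not LiteParse's schema."""
    text: str
    x: float  # left edge in points
    y: float  # top edge in points, origin assumed at the top-left of the page


def project_to_grid(items, char_width=6.0, line_height=12.0):
    """Place each item on a character grid so that columns stay visually aligned."""
    rows = {}
    for item in items:
        row = round(item.y / line_height)
        col = round(item.x / char_width)
        rows.setdefault(row, []).append((col, item.text))

    lines = []
    for row in sorted(rows):
        line = ""
        for col, text in sorted(rows[row]):
            if len(line) < col:
                # Pad with spaces up to the target column.
                line = line.ljust(col)
            elif line:
                # Items collide on the grid: keep at least one separating space.
                line += " "
            line += text
        lines.append(line)
    return "\n".join(lines)


items = [
    TextItem("Total", x=0, y=24),
    TextItem("Name", x=0, y=12),
    TextItem("Amount", x=120, y=12),
    TextItem("42.00", x=120, y=24),
]
print(project_to_grid(items))
```

With these sample items, "Amount" and "42.00" both start at column 20, so the two-column structure survives the conversion to plain text even though the input order was arbitrary.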
Input Formats and the Core Workflow
LiteParse emphasizes the ability to handle a broad set of input formats by automatically converting documents to PDF before parsing. This normalization step allows a consistent parsing strategy across diverse sources.
Supported Input Formats
- Office Documents (via LibreOffice, automatic conversion):
  - Word: .doc, .docx, .docm, .odt, .rtf, .pages
  - PowerPoint: .ppt, .pptx, .pptm, .odp, .key
  - Spreadsheets: .xls, .xlsx, .xlsm, .ods, .csv, .tsv, .numbers
  - Quick note: LibreOffice is required for automatic conversion; installation commands vary by OS (macOS, Ubuntu/Debian, Windows).
- Images (via ImageMagick):
  - Formats: .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg
  - ImageMagick handles conversion to PDF for the parsing step.
The Core Processing Stages
- Format Conversion: LibreOffice (for office documents) and ImageMagick (for images) convert various formats to PDF so that a single parsing engine can work consistently.
- Text Extraction: PDFium handles the heavy lifting of extracting text with a focus on preserving page geometry.
- OCR (Selective OCR): When OCR is enabled, Tesseract (through a Rust binding) or an HTTP OCR server supplies the text that the PDF text layer does not contain.
- OCR Merge: Native text and OCR results are merged to maximize accuracy and layout fidelity.
- Layout Reconstruction: Grid projection and spatial layout reconstruction bring the document’s structure back into a machine-friendly form for downstream use.
- Output Production: The system emits JSON with text and bounding boxes, plain text with preserved layout, and page screenshots for agent-based workflows.
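The OCR Merge stage can be pictured with a small Python sketch. It keeps every native text item and adds an OCR item only when no native item already covers the same region, using intersection-over-union (IoU) of bounding boxes as the overlap measure. This is one plausible strategy written for illustration; the dictionary shape, the `overlap_threshold` value, and the top-left reading order are assumptions, and LiteParse's real merge logic may differ.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_native_and_ocr(native, ocr, overlap_threshold=0.5):
    """Keep all native items; add OCR items only where no native item overlaps."""
    merged = list(native)
    for candidate in ocr:
        covered = any(
            iou(candidate["bbox"], item["bbox"]) >= overlap_threshold
            for item in native
        )
        if not covered:
            merged.append(candidate)
    # Assumed reading order: top-to-bottom, then left-to-right.
    merged.sort(key=lambda item: (item["bbox"][1], item["bbox"][0]))
    return merged


native = [{"text": "Invoice 1042", "bbox": (50, 40, 200, 56)}]
ocr = [
    # Nearly the same region as the native item, so it is treated as a duplicate.
    {"text": "lnvoice 1042", "bbox": (51, 41, 199, 55)},
    # No native text here (for example, a stamp inside an image), so it is kept.
    {"text": "PAID", "bbox": (300, 400, 380, 430)},
]
merged = merge_native_and_ocr(native, ocr)
print([item["text"] for item in merged])
```

Preferring native text where both sources agree on position avoids duplicating content and sidesteps OCR misreads such as "lnvoice", while OCR still fills in regions the text layer does not cover.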
Output Formats and Bindings
- Output Formats:
  - JSON: Structured JSON that includes text and bounding boxes.
  - Text: Plain text maintaining layout semantics.
  - Screenshots: PNG renderings of pages for visual inspection or agent reasoning.
- Language Bindings:
  - Node.js / TypeScript: napi-rs bindings.
  - Python: PyO3 bindings.
  - WASM: Browser-ready bindings via wasm-bindgen.
  - CLI: Rust-based CLI that works across platforms.
Installation and Getting Started
LiteParse exposes the same CLI whether you install it through npm, pip, or cargo; only the WASM package is distributed differently. The instructions below cover each supported language.
- Node.js / TypeScript
  - Install: npm i @llamaindex/liteparse
  - Library docs: Node.js README at packages/node/README.md
- Python
  - Install: pip install liteparse
  - Library docs: Python README at packages/python/README.md
- Rust
  - CLI: cargo install liteparse
  - Library: cargo add liteparse
  - Library docs: Rust README on crates.io at crates/liteparse/README.md
- Browser (WASM)
  - Install: npm i @llamaindex/liteparse-wasm
  - Library docs: WASM README at packages/wasm/README.md
Agent Skill: Integrating LiteParse into Automated Pipelines
LiteParse can be loaded as a skill in agent environments. For example, you can download it with the skills CLI:
- npx skills add run-llama/llamaparse-agent-skills --skill liteparse
You can also review the SKILL.md file for LiteParse within the llamaparse-agent-skills repository to tailor it to your own skills setup.
Command-Line Usage: Parse, Batch, and Screenshots
The CLI in LiteParse is designed for ease of use and consistency across all installations. Here are representative usage patterns.
Parse Files
- Basic parsing
  - lit parse document.pdf
- Parse with a specific output format
  - lit parse document.pdf --format json -o output.json
- Parse specific pages
  - lit parse document.pdf --target-pages "1-5,10,15-20"
- Parse without OCR
  - lit parse document.pdf --no-ocr
- Parse a remote PDF
  - curl -sL https://example.com/report.pdf | lit parse -
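The page specification passed to --target-pages, such as "1-5,10,15-20", is easy to validate before invoking the CLI. The Python sketch below expands such a string into a sorted list of page numbers. It assumes pages are 1-based and that ranges are inclusive, which the example strings suggest but the CLI documentation quoted here does not spell out; `parse_page_ranges` is a hypothetical helper, not part of LiteParse.

```python
def parse_page_ranges(spec, max_page=None):
    """Expand a spec like "1-5,10,15-20" into a sorted list of page numbers.

    Assumes 1-based page numbers and inclusive ranges.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    if max_page is not None:
        # Drop pages beyond the end of the document.
        pages = {page for page in pages if page <= max_page}
    return sorted(pages)


print(parse_page_ranges("1-5,10,15-20"))
print(parse_page_ranges("1,3,5"))
```

Checking the specification up front gives a clear error for inputs like "5-1" instead of relying on however the CLI happens to report them.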
Batch Parsing
- Process an entire directory of documents
  - lit batch-parse ./input-directory ./output-directory
Generate Screenshots
- Screenshot all pages
  - lit screenshot document.pdf -o ./screenshots
- Screenshot specific pages
  - lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots
- Rendering DPI
  - lit screenshot document.pdf --dpi 300 -o ./screenshots
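When choosing a --dpi value it helps to know how large the resulting images will be. PDF page dimensions are expressed in points, with 72 points per inch by default, so a renderer that scales by dpi / 72 produces the pixel sizes computed below. This is a worked calculation under that assumption; `rendered_size` is a hypothetical helper, the default of 150 mirrors the --dpi default listed for the parse command, and LiteParse's exact rounding may differ by a pixel.

```python
POINTS_PER_INCH = 72  # default PDF user-space units per inch


def rendered_size(width_pt, height_pt, dpi=150):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(rendered_size(612, 792))            # (1275, 1650)
print(rendered_size(612, 792, dpi=300))   # (2550, 3300)
```

Doubling the DPI doubles each dimension and therefore quadruples the pixel count, which matters for disk usage and for any image-size limits in downstream models.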
CLI Reference Highlights
Parse Command Options (highlights)
- -o, --output: Output file path
- --format: Output format (defaults to text)
- --no-ocr: Disable OCR
- --ocr-language: OCR language (Tesseract format; default eng)
- --ocr-server-url: HTTP OCR server URL (uses Tesseract if not provided)
- --tessdata-path: Path to tessdata directory
- --max-pages: Max pages to parse (default: 1000)
- --target-pages: Pages to parse (e.g., "1-5,10,15-20")
- --dpi: Rendering DPI (default 150)
- --preserve-small-text: Keep very small text
- --password: Password for encrypted documents
- --num-workers: Concurrent OCR workers
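If you drive the CLI from a script, assembling the argument list in one place keeps the flags consistent. The Python sketch below builds a `lit parse` command from keyword options using only flags listed above. The helper names `build_parse_command` and `run_parse` are hypothetical, and `run_parse` assumes that `lit` is on your PATH and writes the result to stdout when -o is omitted, which you should verify for your installation.

```python
import subprocess


def build_parse_command(path, *, output=None, fmt=None, target_pages=None,
                        ocr=True, ocr_language=None, dpi=None):
    """Build the argument list for `lit parse` from keyword options."""
    cmd = ["lit", "parse", str(path)]
    if fmt:
        cmd += ["--format", fmt]
    if output:
        cmd += ["-o", str(output)]
    if target_pages:
        cmd += ["--target-pages", target_pages]
    if not ocr:
        cmd.append("--no-ocr")
    elif ocr_language:
        # A language only makes sense when OCR stays enabled.
        cmd += ["--ocr-language", ocr_language]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    return cmd


def run_parse(path, **options):
    """Run the CLI and return its stdout (assumes `lit` is installed)."""
    result = subprocess.run(
        build_parse_command(path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


print(build_parse_command("document.pdf", fmt="json", output="output.json"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.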
Batch Parse Command Options (highlights)
- --format
- --no-ocr
- --ocr-language
- --ocr-server-url
- --tessdata-path
- --max-pages (default 1000)
- --dpi (default 150)
- --recursive: Recursively search input directory
- --extension: Process only files with this extension
- --password
- --num-workers
Screenshot Command Options (highlights)
- -o, --output-dir: Output directory
- --target-pages: Pages to screenshot
- --dpi: Rendering DPI
- --password: Password for encrypted documents
OCR Setup: Default and Optional Paths
- Default: Tesseract is bundled and enabled by default
  - Basic flow: lit parse document.pdf
  - Language customization: lit parse document.pdf --ocr-language fra
  - Disable OCR: lit parse document.pdf --no-ocr
- Offline Environments: Use TESSDATA_PREFIX to point to traineddata files
  - Example: export TESSDATA_PREFIX=/path/to/tessdata
  - Then run: lit parse document.pdf --ocr-language eng
- Direct Tessdata Path:
  - lit parse document.pdf --tessdata-path /path/to/tessdata
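In an offline environment, a missing language file is a common cause of OCR failures, so it can be worth checking the tessdata directory before running a job. Tesseract stores each language as a file named after its code, for example eng.traineddata. The Python sketch below reports which requested languages are missing. `missing_traineddata` is a hypothetical helper, and the "+"-joined form ("eng+fra") follows Tesseract's own convention; whether --ocr-language accepts it in LiteParse is an assumption to confirm.

```python
import os
from pathlib import Path


def missing_traineddata(tessdata_dir, languages):
    """Return the language codes whose <code>.traineddata file is absent."""
    root = Path(tessdata_dir)
    codes = [code for code in languages.split("+") if code]
    return [code for code in codes
            if not (root / f"{code}.traineddata").is_file()]


# Only check when the variable is set, as it would be in an air-gapped setup.
prefix = os.environ.get("TESSDATA_PREFIX")
if prefix:
    missing = missing_traineddata(prefix, "eng+fra")
    if missing:
        print("missing traineddata for:", ", ".join(missing))
    else:
        print("all requested languages are available")
```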
Optional: HTTP OCR Servers
LiteParse can be augmented with HTTP OCR services for higher accuracy or scalability. The project provides ready-to-use wrappers for popular engines:
- EasyOCR (via ocr/easyocr)
- PaddleOCR (via ocr/paddleocr)
You can implement any OCR service by following the simple LiteParse OCR API specification (OCRAPISPEC.md). The API typically requires:
- POST /ocr endpoint
- Accepts file and language parameters
- Returns JSON: { results: [{ text, bbox: [x1, y1, x2, y2], confidence }] }
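When writing or testing a custom OCR server, it helps to check that its responses match the shape described above. The Python sketch below parses a response body into typed records and filters out low-confidence entries. The `OcrResult` type and `parse_ocr_response` helper are hypothetical, and the sketch assumes that confidence is a number where higher means more certain and that the bbox satisfies x1 <= x2 and y1 <= y2; consult the OCR API specification for the authoritative definitions.

```python
import json
from dataclasses import dataclass


@dataclass
class OcrResult:
    text: str
    bbox: tuple        # (x1, y1, x2, y2)
    confidence: float


def parse_ocr_response(body, min_confidence=0.0):
    """Parse a { results: [...] } JSON body and drop low-confidence entries."""
    payload = json.loads(body)
    results = []
    for entry in payload["results"]:
        x1, y1, x2, y2 = entry["bbox"]
        if x2 < x1 or y2 < y1:
            raise ValueError(f"malformed bbox: {entry['bbox']!r}")
        confidence = float(entry["confidence"])
        if confidence >= min_confidence:
            results.append(OcrResult(entry["text"], (x1, y1, x2, y2), confidence))
    return results


sample = json.dumps({
    "results": [
        {"text": "Hello", "bbox": [10, 20, 110, 40], "confidence": 0.98},
        {"text": "~", "bbox": [5, 5, 8, 8], "confidence": 0.12},
    ]
})
kept = parse_ocr_response(sample, min_confidence=0.5)
print(kept)
```

Running a check like this against a new backend's output catches shape mismatches, such as a bbox given as width and height instead of corner coordinates, before they surface as misplaced text.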
Multi-Format Input Support: Office and Images
One of LiteParse’s strengths is its seamless handling of a broad range of input formats through automatic conversion to PDF prior to parsing.
Office Documents via LibreOffice
- Word formats: .doc, .docx, .docm, .odt, .rtf, .pages
- PowerPoint formats: .ppt, .pptx, .pptm, .odp, .key
- Spreadsheets: .xls, .xlsx, .xlsm, .ods, .csv, .tsv, .numbers
LibreOffice installation commands (examples):
- macOS: brew install --cask libreoffice
- Ubuntu/Debian: sudo apt-get install libreoffice
- Windows: choco install libreoffice-fresh
- Note: On Windows you may need to add LibreOffice’s program directory to your PATH, e.g., C:\Program Files\LibreOffice\program
Images via ImageMagick
- Supported image formats: .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg
ImageMagick installation commands (examples):
- macOS: brew install imagemagick
- Ubuntu/Debian: sudo apt-get install imagemagick
- Windows: choco install imagemagick.app
Environment Variables and Development
- TESSDATA_PREFIX: Path to a directory containing Tesseract traineddata files. This is essential for offline or air-gapped environments.
Development: A Rust Core with Bindings and Packages
LiteParse is organized as a Rust workspace with a core library and language-specific binding crates. The repository structure typically includes:
- crates/
  - liteparse/: Core library + CLI binary
  - liteparse-napi/: Node.js bindings (napi-rs)
  - liteparse-python/: Python bindings (PyO3)
  - liteparse-wasm/: WASM bindings (wasm-bindgen)
  - pdfium/: PDFium Rust wrapper
  - pdfium-sys/: PDFium FFI bindings
- packages/
  - node/: npm package (TS wrapper + native binary)
  - python/: PyPI package (Python wrapper + native binary)
  - wasm/: WASM npm package
Building
- Compile the CLI
  - cargo build --release -p liteparse
- Build Node.js bindings
  - cd packages/node && npm run build
- Build Python bindings
  - cd packages/python && maturin develop --release
- Build WASM
  - cd packages/wasm && npm run build
The repository also includes AGENTS.md and CLAUDE.md, which give guidance for setting up development environments and coding agents around LiteParse.
Credits and License
LiteParse is Apache-2.0 licensed, with credits acknowledging the foundational projects it builds upon:
- PDFium: PDF rendering and text extraction
- Tesseract: OCR engine (via tesseract-rs)
- EasyOCR: HTTP OCR server (optional)
- PaddleOCR: HTTP OCR server (optional)
- napi-rs: Node.js native bindings
- PyO3: Python native bindings
- wasm-bindgen: WebAssembly bindings
A Production-Grade Parser for Local Pipelines
LiteParse is designed for teams that require robust local parsing with a strong emphasis on layout fidelity. It provides the essential building blocks for production pipelines where data must be extracted from documents efficiently, privately, and without reliance on external cloud services. The combination of a fast Rust core, flexible OCR options, and multi-language bindings makes it a solid foundation for:
- Data extraction from legal, financial, or administrative documents
- Preprocessing for knowledge graphs or relational databases
- Preparation of evidence or reports with preserved layout for humans and agents
- Agent workflows that need reliable, visual context through page screenshots
Documentation and Learning Resources
- Official documentation is available at the Docs URL included at the top of this post.
- The project encourages exploring the old LiteParse V1 for historical context and migration paths.
- The OSS nature invites community contributions, feature requests, and extensions (for example, additional OCR backends or new output formats).
A Closing View: Why LiteParse Matters
In today’s document-driven ecosystems, the tension between speed, privacy, and accuracy often forces teams toward either cloud-based solutions or heavy on-premises tooling with steep setup costs. LiteParse provides a pragmatic middle ground: a fast, locally run parser that respects privacy and runs with modest dependencies. The built-in Tesseract OCR eliminates immediate setup friction while the option to swap in HTTP OCR servers unlocks higher accuracy for challenging documents. The multi-format input support and PDF-centric pipeline enable a uniform approach to parsing that can be extended into larger document-processing pipelines.
If you are building a document automation system, a data ingest pipeline for ML models, or an agent-based tool that needs precise, layout-aware text extraction, LiteParse is worth evaluating. Combine it with LlamaParse for cloud-based enhancements when your pipeline demands scale or more advanced OCR capabilities. Together, they cover the spectrum from local to cloud, all while keeping your data in your control.
Tags: #PDFParsing #OCR #DocumentProcessing #Rust #WASM #NodeJS #Python #OpenSource #BoundingBoxes #LayoutPreservation
Repository: https://github.com/run-llama/liteparse