Surya: A Lightweight yet Powerful Document OCR and Analysis Toolkit from Datalab

1) Introduction and Visual Identity

Datalab has built Surya as a compact, efficient OCR and document understanding model. Surya packs layout analysis, text recognition, and table understanding into a single vision-language model of roughly 650 million parameters. The goal is to deliver robust results across a wide range of document styles and languages while keeping latency and resource usage low enough for practical deployments, including cloud and edge environments.

2) A Quick Glance: What Surya Delivers

Accuracy: Surya scores 83.3% on olmOCR-bench with about 0.65B parameters, which places it among the top-performing models under 3B parameters.
Speed: On modern GPUs, Surya processes content quickly: throughput on an RTX 5090 with vllm reaches approximately 5 pages per second, enabling near real-time document pipelines.
Multilingual Capability: Surya demonstrates strong multilingual performance. In internal testing, it reached an 87.2% pass rate across 91 languages, with 76 languages scoring at least 80% and 38 scoring at least 90%.
End-to-End Document Understanding: Beyond plain OCR, Surya recognizes layout and structure, identifying reading order, headers, figures, and tables, and it performs table recognition with row and column semantics.
Flexible Outputs: Depending on the prompt and mode, Surya can emit a structured layout JSON or a full-page HTML output, which supports downstream data extraction, archiving, search indexing, or human review.

3) The Managed Platform: Accessibility and Quick Start

Datalab offers a managed platform that runs Surya and its high-accuracy variants (including Chandra). New users receive $5 in free credits to explore its capabilities, either by signing up or by trying the public playground.
Why a managed platform matters: a single model family handles layout, OCR, and table recognition, while the backend auto-spawns the necessary inference servers, removing heavy setup from users.

4) Model Information and Core Capabilities

Surya is named after the Hindu sun god, symbolizing universal vision and wide coverage across scripts and layouts. The model’s core is a unified vision-language architecture that combines detection, recognition, and layout comprehension in a single inference pass, with the prompt selecting the output.
Core components:
Text detection and OCR: Per-block or full-page, producing bounding boxes, text content, and confidence scores.
Layout analysis: Reading order, section headers, captions, figures, and other structural elements.
Table recognition: Detecting table regions, rows, columns, and cell semantics; HTML export supports spanning cells and header semantics.
Optional per-page image outputs for debugging or validation.
The model family includes smaller line-level detectors and OCR error detectors for specialized tasks within document streams.

5) Visual Aids: A Quick Look at Model Anatomy

To illustrate Surya’s capabilities, several visuals accompany this section:

The general layout chart and model size reference image set
Per-page excerpts and layout/readability visualizations
Excerpt and text extraction visuals
Layout and table recognition captures

6) Practical Examples: Real Pages, Real Annotations

Surya’s examples demonstrate how a single pass can yield multiple, richly annotated outputs. Each row showcases up to five annotated views of the same page: text-line detection, OCR, layout, reading order, and, when present, table recognition. Example categories include newspaper pages, textbooks, tax forms, handwritten notes, and corporate documents:

Newspaper
Detection image: newspaper.png
OCR image: newspaper_text.png
Layout image: newspaper_layout.png
Reading order image: newspaper_reading.png
Textbook
Detection image: textbook.png
OCR image: textbook_text.png
Layout image: textbook_layout.png
Reading order image: textbook_reading.png
Tax Form
Detection image: form.png
OCR image: form_text.png
Layout image: form_layout.png
Reading order image: form_reading.png
Table recognition image: form_tablerec.png
Handwritten Notes
Detection image: handwritten.png
OCR image: handwritten_text.png
Layout image: handwritten_layout.png
Reading order image: handwritten_reading.png
Table recognition image: handwritten_tablerec.png
Corporate Doc
Detection image: corporate.png
OCR image: corporate_text.png
Layout image: corporate_layout.png
Reading order image: corporate_reading.png
Table recognition image: corporate_tablerec.png

6a) Visual Gallery: Per-Category Snippets

Newspaper: A pair of images shows the raw detector output and the OCR-ready text, providing a sense of how the two stages align in a real-world page.
Textbook: The detection overlay helps confirm section breaks and figure placements, while OCR captures the textual content for downstream extraction.
Tax Form: The layout and reading order visuals reveal how form fields align with their labels and how tabular segments get structured in the final output.
Handwritten Notes: The system handles diverse handwriting and mixed content, with dedicated images illustrating both the handwritten blocks and their OCR results.
Corporate Document: Business documents with multi-column text, headers, and figures demonstrate robust layout parsing alongside precise text extraction.

7) Commercial Use and Licensing

Surya's code is licensed under Apache 2.0, a permissive open-source license that supports broad usage.
The model weights utilize an OpenRAIL-M license variant, which is free for research, personal use, and startups under $5M in funding or revenue. For broader commercial licensing of weights, the pricing page provides pathways for enterprise usage.
This licensing arrangement encourages experimentation and integration while offering a clear path to scale for commercial deployments.

8) Getting Started: Installation and Prerequisites

Installation command:
pip install surya-ocr
Inference backend prerequisites:
NVIDIA GPUs: Docker plus the NVIDIA Container Toolkit to auto-spawn the inference server.
CPU or Apple Silicon: a llama.cpp-based backend is used. On macOS, for example, you can install the binary via:
brew install llama.cpp
Upgrading from Surya v1:
Surya v2 introduces a unified manager and new output schemas. Migrating means switching to SuryaInferenceManager and consuming the new JSON outputs. Key changes:
- SuryaInferenceManager replaces FoundationPredictor.
- All predictors (Layout, Recognition, TableRec) share the same manager instance.
- Outputs shift from line-based results to more structured “blocks”, with HTML for layout-aware rendering.
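Quick startup snippet (conceptual):
Python:
from PIL import Image
from surya.inference import SuryaInferenceManager
from surya.recognition import RecognitionPredictor
image = Image.open("page.png")  # any page image; the path is a placeholder
manager = SuryaInferenceManager()  # one manager, shared by all predictors
rec = RecognitionPredictor(manager)
predictions = rec([image])  # run OCR on a list of page images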

9) Usage Highlights: How Surya Works in Practice

Surya 2 integrates layout, OCR, and table recognition into a single VLM (Vision-Language Model) flow. The inference manager can spawn the correct backend automatically or be pointed to an existing server:
SURYA_INFERENCE_BACKEND=vllm
SURYA_INFERENCE_URL=http://host:port/v1
Settings live in surya/settings.py and can be overridden via environment variables (for example, SURYA_INFERENCE_BACKEND and SURYA_INFERENCE_URL).
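As a minimal sketch of the override mechanism described above, the snippet below builds an environment for a child process that points Surya at an already-running server. The variable names are written with underscores (SURYA_INFERENCE_BACKEND, SURYA_INFERENCE_URL), which is an assumed spelling of the names in this section; the helper surya_env and the example URL are illustrative and not part of Surya.

```python
import os


def surya_env(backend, url=None, base=None):
    """Return a copy of the environment with Surya inference overrides set.

    The variable names are an assumed spelling of the settings named in the text.
    """
    env = dict(os.environ if base is None else base)
    env["SURYA_INFERENCE_BACKEND"] = backend
    if url is not None:
        # Only set the URL when attaching to an existing server.
        env["SURYA_INFERENCE_URL"] = url
    return env


# Start from an empty base so the printed result shows only the overrides.
env = surya_env("vllm", "http://localhost:8000/v1", base={})
print(env)
```

The returned dictionary can be passed as the env argument of subprocess.run when launching a Surya command, which keeps the overrides scoped to that one process.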
Two important modes:
Full-page OCR: A single VLM call per page yields the full textual content along with layout hints.
Block/line mode: Layout-aware OCR where the system processes layout first, then decodes OCR per block.
Output schema evolution:
Text lines have given way to blocks, with HTML output for rendered layout fidelity.
Layout decoding now emphasizes count and structure rather than a top_k threshold.
Table outputs capture more semantic details (headers, spans) in HTML form.

10) Server Lifecycle: Keeping a Server Alive

By default, each Surya command starts and stops the VLM server, which can incur startup costs if you chain several runs.
Use --keep_server to attach subsequent commands to the running server:
surya_ocr DATA_PATH --keep_server
surya_layout DATA_PATH
surya_table DATA_PATH
You can also export SURYA_INFERENCE_KEEP_ALIVE=1 to make this behavior permanent.
Stopping the server: docker stop the surya-vllm-* container or kill the llama-server process.
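The chained runs above can be scripted. The sketch below only builds the argument lists, mirroring the three commands shown in this section; the command names are written with underscores (surya_ocr, surya_layout, surya_table), which is an assumed spelling, and build_pipeline and the sample path are hypothetical.

```python
import shlex


def build_pipeline(data_path):
    """Build argv lists for three chained Surya runs, as shown in the text."""
    return [
        ["surya_ocr", data_path, "--keep_server"],  # flag taken from the text
        ["surya_layout", data_path],
        ["surya_table", data_path],
    ]


commands = build_pipeline("docs/report.pdf")
for argv in commands:
    # Print the shell-quoted form; nothing is executed here.
    print(shlex.join(argv))
```

To actually execute the pipeline, each list can be handed to subprocess.run(argv, check=True) in order, so that a failing step stops the chain.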

11) Interactive Tools: Try Before You Buy

An interactive Streamlit app is available to experiment with Surya on images or PDFs.
Quick start steps:
pip install streamlit pdftext
surya_gui
The first command installs the app's dependencies and the second launches it, so you can upload documents and visualize results live.

12) OCR and Text Recognition: What You Get

The OCR command outputs a JSON with:
blocks: per-block OCR results in reading order
label and raw_label: canonicalized and raw layout labels (e.g., Text, SectionHeader, Table, Form, Picture)
reading_order: block positions in the layout
html: HTML rendering for the block content (math in KaTeX-compatible LaTeX)
polygon and bbox: shapes for the detected blocks
confidence: token-level or block-level confidence
image_bbox: the page’s image bounds
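To make the field list concrete, here is a small sketch that filters blocks by label and confidence and returns them in reading order. The sample page is hand-written to resemble the fields above and is not real Surya output; the exact nesting of the JSON, the raw_label values, and the helper select_blocks are assumptions.

```python
import json

# Hand-written sample shaped like the fields listed above; not real Surya output.
page = json.loads("""
{
  "image_bbox": [0, 0, 1000, 1400],
  "blocks": [
    {"label": "Text", "raw_label": "text", "reading_order": 1,
     "html": "<p>Body paragraph.</p>", "bbox": [50, 200, 950, 400],
     "confidence": 0.97},
    {"label": "SectionHeader", "raw_label": "section-header", "reading_order": 0,
     "html": "<h2>Overview</h2>", "bbox": [50, 100, 950, 160],
     "confidence": 0.99},
    {"label": "Picture", "raw_label": "picture", "reading_order": 2,
     "html": "", "bbox": [50, 450, 500, 900],
     "confidence": 0.62}
  ]
}
""")


def select_blocks(page, labels, min_confidence=0.0):
    """Return blocks whose label is wanted, sorted by reading order."""
    wanted = [
        block for block in page["blocks"]
        if block["label"] in labels and block["confidence"] >= min_confidence
    ]
    return sorted(wanted, key=lambda block: block["reading_order"])


text_blocks = select_blocks(page, {"Text", "SectionHeader"}, min_confidence=0.9)
for block in text_blocks:
    print(block["reading_order"], block["label"], block["html"])
```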
Practical tips to maximize accuracy:
Increase image resolution if text is too small; if the image is already high-res, reduce it to around 2048px width to balance throughput and fidelity.
Preprocess images with binarization or deskewing for degraded originals.
Adjust DETECTOR_BLANK_THRESHOLD and DETECTOR_TEXT_THRESHOLD carefully (both should be in the 0–1 range, with the text threshold higher than the blank threshold).
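Two of these tips reduce to simple checks that can be sketched in code: computing a downscaled size capped at 2048px width, and validating a pair of detector thresholds against the constraints stated above. Both helper functions and the sample values are illustrative assumptions, not part of Surya.

```python
def target_size(width, height, max_width=2048):
    """Scale (width, height) down to max_width, preserving aspect ratio."""
    if width <= max_width:
        return width, height  # already small enough; leave unchanged
    scale = max_width / width
    return max_width, round(height * scale)


def check_thresholds(blank, text):
    """Validate detector thresholds: both in [0, 1], text above blank."""
    if not (0.0 <= blank <= 1.0 and 0.0 <= text <= 1.0):
        raise ValueError("thresholds must be in the 0-1 range")
    if text <= blank:
        raise ValueError("text threshold must be higher than blank threshold")
    return blank, text


print(target_size(4096, 5300))
print(check_thresholds(0.35, 0.6))  # illustrative values only
```

The tuple returned by target_size can be passed to an image library's resize call, for example Pillow's Image.resize.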

13) Text Line Detection and Translation of Layout into Actions

Text line detection returns bounding boxes, polygons, confidence scores, and layout context.
Layout predictions include a canonical set of labels such as Caption, PageHeader, Table, Text, Form, Figure, etc., with the ability to capture the precise reading order and page geometry.
Performance is largely backend-driven; tuning the inference backend can help balance throughput and latency.

14) Layout and Reading Order: Decoding Structure

Layout prediction produces a hierarchical structure of blocks arranged in reading order, with each block carrying a label, HTML representation, and confidence score.
The system exports a JSON that is designed for downstream processors to reconstruct the document precisely as it appeared to a human reader, while enabling programmatic extraction and indexing.
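As a rough illustration of how a downstream processor might rebuild a page, the sketch below joins per-block HTML in reading order into one document. It treats the blocks as a flat list with label, reading_order, and html keys, which is a simplifying assumption about the exported JSON; page_to_html and the sample blocks are hypothetical.

```python
from html import escape


def page_to_html(blocks, title="page"):
    """Join block HTML in reading order into one standalone HTML document."""
    ordered = sorted(blocks, key=lambda block: block["reading_order"])
    parts = []
    for block in ordered:
        label = escape(block["label"], quote=True)
        # Block HTML is inserted as-is because it is already markup.
        parts.append(f'<div class="block" data-label="{label}">{block["html"]}</div>')
    body = "\n".join(parts)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body></html>"
    )


# Sample blocks given out of order to show the sort.
blocks = [
    {"label": "Text", "reading_order": 1, "html": "<p>First paragraph.</p>"},
    {"label": "SectionHeader", "reading_order": 0, "html": "<h1>Report</h1>"},
]
doc = page_to_html(blocks, title="Sample page")
print(doc)
```

Keeping the label in a data attribute lets later steps, such as search indexing, select blocks by type without re-parsing the original JSON.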

15) Table Recognition: Cells, Rows, and Headers

The table recognition module outputs a collection of detected table structures, including rows and columns, their geometry, and the content of cells.
There are two modes:
Simple: basic row/column arrangement derived from intersections
Full: HTML output that preserves cell spanning and header semantics
A helpful companion project, TableConverter, can export detected tables to JSON, Markdown, or HTML for downstream workflows.
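When consuming the Full mode's HTML, spanning cells usually need to be expanded before the table can be treated as a plain grid. The following sketch does that with Python's standard-library HTML parser, repeating each cell's text across its rowspan and colspan. It assumes well-formed table markup; the class TableGrid and the sample table are illustrative and do not reflect Surya's or TableConverter's implementation.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Expand an HTML table into a rectangular grid, repeating spanned cells."""

    def __init__(self):
        super().__init__()
        self.grid = []             # one dict per row: column index -> text
        self.cur_row = -1
        self.open_cell = None      # (rowspan, colspan, is_header) while in a cell
        self.buf = []
        self.header_cells = set()  # (row, col) slots that came from <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cur_row += 1
            while len(self.grid) <= self.cur_row:
                self.grid.append({})
        elif tag in ("td", "th"):
            a = dict(attrs)
            self.open_cell = (int(a.get("rowspan", 1)),
                              int(a.get("colspan", 1)),
                              tag == "th")
            self.buf = []

    def handle_data(self, data):
        if self.open_cell is not None:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.open_cell is not None:
            rowspan, colspan, is_header = self.open_cell
            value = "".join(self.buf).strip()
            col = 0
            while col in self.grid[self.cur_row]:  # skip slots taken by rowspans
                col += 1
            for r in range(self.cur_row, self.cur_row + rowspan):
                while len(self.grid) <= r:
                    self.grid.append({})
                for c in range(col, col + colspan):
                    self.grid[r][c] = value
                    if is_header:
                        self.header_cells.add((r, c))
            self.open_cell = None

    def rows(self):
        """Return the grid as a list of equal-length lists."""
        width = max((max(r) + 1 for r in self.grid if r), default=0)
        return [[r.get(c, "") for c in range(width)] for r in self.grid]


sample = """
<table>
<tr><th colspan="2">Name</th><th>Total</th></tr>
<tr><td rowspan="2">A</td><td>x</td><td>1</td></tr>
<tr><td>y</td><td>2</td></tr>
</table>
"""
parser = TableGrid()
parser.feed(sample)
table_rows = parser.rows()
for row in table_rows:
    print(row)
```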

16) Math and Equations in Documents

Surya 2 handles mathematical expressions inline as part of full-page OCR.
Recognized equations appear within the text HTML, delimited for KaTeX-compatible LaTeX rendering, enabling downstream tools to render complex formulas faithfully.
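For pipelines that need the formulas separately from the prose, a small extraction step can pull them out of the HTML. The sketch below assumes equations are wrapped in math tags, which is a guess at the delimiter since the text only says they are delimited; extract_math and the sample string are hypothetical.

```python
import html
import re

# ASSUMPTION: equations are delimited by <math>...</math> tags.
MATH_RE = re.compile(r"<math\b[^>]*>(.*?)</math>", re.DOTALL)


def extract_math(page_html):
    """Return the LaTeX source of each math span, with HTML entities decoded."""
    return [html.unescape(m).strip() for m in MATH_RE.findall(page_html)]


sample = (
    '<p>Mass-energy: <math>E = mc^2</math>, and the bound '
    '<math display="block">a &lt; b</math> holds.</p>'
)
formulas = extract_math(sample)
print(formulas)
```

If the real output uses a different delimiter, only the regular expression needs to change.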

17) Inference Backends: vllm vs llama.cpp

Layout, OCR, and table recognition share a single, unified VLM backend.
Choices:
vllm (GPU-accelerated, NVIDIA-centric)
llama.cpp (CPU / Apple Silicon)
The SuryaInferenceManager abstracts the backend and can auto-spawn if needed, or point to an existing server:
SURYA_INFERENCE_BACKEND=vllm
SURYA_INFERENCE_URL=http://localhost:8000/v1

18) Benchmarks and Multilingual Prowess

olmOCR-bench standings (Surya 2 on the default preset):
The Surya OCR 2 model, with only 0.65B parameters, scores 83.3%, placing it among the top-performing models under 3B parameters on olmOCR-bench.
Other notable comparisons feature much larger models; Surya emphasizes efficiency without sacrificing core accuracy.
Multilingual evaluation:
Overall pass rate of 87.2% across 91 languages.
Of the 91 languages, 38 score ≥ 90%, 76 score ≥ 80%.
Quick look at languages and scores shows robust cross-language performance, with English and several widely used languages reaching the 90%+ marks.
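The language counts above translate into shares of the 91-language set with a few lines of arithmetic. The figures are the ones reported in this section; the variable names are illustrative.

```python
total = 91        # languages evaluated
at_least_90 = 38  # languages scoring >= 90%
at_least_80 = 76  # languages scoring >= 80%

share_90 = at_least_90 / total
share_80 = at_least_80 / total
between_80_and_90 = at_least_80 - at_least_90  # scored >= 80% but < 90%
below_80 = total - at_least_80

print(f"{share_90:.1%} of languages score at least 90%")
print(f"{share_80:.1%} of languages score at least 80%")
print(between_80_and_90, "languages fall between 80% and 90%")
print(below_80, "languages score below 80%")
```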
Throughput measurements:
On RTX 5090 with vllm: approximately 5 pages per second at higher concurrency, with stable tokens-per-second throughput and stable p50 and p95 latency profiles.
On Apple Silicon with llama.cpp/Metal: throughput scales with the provided parallelism and hardware constraints; the power consumption remains modest relative to the task.

19) Throughput and Deployment Scenarios

Full-page OCR at 96–192 DPI input yields about 2,400 output tokens per page on average, with client-side measurements providing realistic expectations for production use.
Concurrency and batch settings:
GPU setups can leverage high concurrency to maximize throughput, using appropriate max-seq and tokens configurations.
CPU/Apple Silicon deployments require careful tuning of the parallelism and the llama-server settings to balance latency and resource usage.
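For capacity planning, the reported figures can be combined into a back-of-the-envelope estimate. The sketch below uses the roughly 5 pages per second reported for an RTX 5090 with vllm and the average of about 2,400 output tokens per page; the function estimate and the 10,000-page batch are illustrative, and real throughput will vary with hardware and concurrency.

```python
def estimate(pages, pages_per_second=5.0, tokens_per_page=2400):
    """Rough batch estimate from the throughput figures quoted in the text."""
    seconds = pages / pages_per_second
    return {
        "seconds": seconds,
        "minutes": seconds / 60,
        "output_tokens": pages * tokens_per_page,
        "tokens_per_second": pages_per_second * tokens_per_page,
    }


result = estimate(10_000)
print(f"{result['minutes']:.1f} minutes for 10,000 pages")
print(f"{result['output_tokens']:,} output tokens in total")
print(f"{result['tokens_per_second']:,.0f} output tokens per second")
```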

20) Reproducing, Training, and Extending

Surya 2 is a single vision-language model with a footprint of about 650M parameters, trained on diverse document images to emit either structured JSON or full HTML output, depending on the prompt.
A separate, smaller model handles text-line detection, using a tailored EfficientViT SegFormer variant trained on document line annotations.
If you want help fine-tuning Surya on your own data, or want to use Datalab’s training stack, you can reach out via hi@datalab.to.

21) Community and Acknowledgments

The project credits a wide ecosystem of open-source AI work that made Surya possible, including but not limited to:
Qwen3-VL from Alibaba
vllm and llama.cpp for inference
SegFormer and EfficientViT families
timm and transformers libraries
CRAFT for scene text detection
The message is one of gratitude toward the open-source community and the developers who contribute to shared AI tooling.

22) How to Cite Surya

If you use Surya or its associated models in your work, the project provides a BibTeX entry for citation. This helps ensure that developers, researchers, and practitioners acknowledge the effort behind Surya when integrating it into papers, presentations, or production pipelines.

BibTeX entry:
@misc{paruchuri2025surya,
  author = {Vikas Paruchuri and Datalab Team},
  title = {Surya: A lightweight document OCR and analysis toolkit},
  year = {2025},
  howpublished = {\url{https://github.com/datalab-to/surya}},
  note = {GitHub repository},
}

23) Looking Forward: Why Surya Matters for Document Intelligence

Surya represents a practical convergence of accuracy, speed, and versatility for document understanding. By combining layout analysis, full-page OCR, and table recognition into a single, adaptable model, it reduces the complexity and latency that often come with multi-model pipelines.
The managed platform, supported backends, and modular API design make Surya accessible to researchers and practitioners who want dependable results without managing a sprawling stack.
The multilingual depth and broad format coverage (text, headers, captions, tables, and math) allow enterprises to deploy Surya across global document workflows, from invoices and tax forms to textbooks and corporate reports.

24) Final Thoughts: A Platform for Document Intelligence

Surya stands as a testament to the power of unified document AI. It doesn’t just read text; it reads pages as structured information, capturing the reading order, the relationships between blocks, the semantics of tables, and the presence of mathematics, while offering practical deployment options through a managed platform and flexible backends. The combination of robust performance, broad language support, and accessible licensing makes Surya a compelling tool for anyone looking to extract structured data from documents at scale.

25) Image Gallery Recap

Datalab Logo: at the top of the post, reinforcing the brand identity.
Size and Capability Visuals:
olmOCR size chart image to convey model scale and performance context.
Excerpt and layout visuals showing how detection, OCR, and layout work together.
Example Pages:
Newspaper, Textbook, Tax Form, Handwritten Notes, Corporate Doc examples with both detection and transcription visuals.
Each category paired with an OCR-focused image, a layout image, a reading-order image, and a table-recognition image when applicable.
These images illustrate the practical, end-to-end workflow Surya enables, from raw document images to structured, machine-readable outputs.
