LangExtract for Structured Data Extraction
Comprehensive Overview of LangExtract: A Python Library for Structured Information Extraction from Unstructured Text
Introduction to LangExtract
LangExtract is a cutting-edge Python library designed to leverage large language models (LLMs) for extracting structured information from unstructured text documents. Its primary function is to transform raw textual data—such as clinical notes, research papers, legal documents, or literary works—into organized, machine-readable formats while maintaining precise source grounding. By enabling users to define extraction tasks through clear prompts and examples, LangExtract ensures that extracted entities are accurately mapped back to their original contexts within the input text.
The library is particularly valuable for applications requiring high precision in information retrieval, such as healthcare documentation, legal analysis, and literary studies. Its architecture supports a wide range of use cases by allowing users to customize extraction logic without extensive model fine-tuning, making it accessible even to non-experts in AI or NLP.
Core Features and Advantages
1. Precise Source Grounding
One of LangExtract’s most distinctive features is its ability to map every extracted entity directly to its exact location within the source text. This capability ensures that users can visually verify the accuracy of extractions by highlighting relevant passages in the original document. For example, when extracting relationships between characters from Romeo and Juliet, LangExtract can pinpoint specific lines where emotional states or interactions are mentioned, allowing for easy cross-referencing.
This feature is particularly useful in domains like clinical documentation, where maintaining traceability of patient data is critical. By providing a direct link between extracted information and its source context, LangExtract reduces the risk of misinterpretation or errors that could arise from paraphrased or abstracted summaries.
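To make the grounding idea concrete, here is a minimal, self-contained sketch (our own illustrative helper, not the LangExtract API): a verbatim extraction can always be mapped back to character offsets in the source text and verified by slicing.

```python
# Illustrative sketch of "source grounding": every extraction carries the
# character offsets of the exact span it came from, so accuracy can be
# verified directly against the original document.

def ground_extraction(source: str, extraction_text: str):
    """Locate an extracted span verbatim in the source; return (start, end)."""
    start = source.find(extraction_text)
    if start == -1:
        raise ValueError(f"span not found verbatim: {extraction_text!r}")
    return start, start + len(extraction_text)

source = "But soft! What light through yonder window breaks?"
start, end = ground_extraction(source, "yonder window")

# The grounded span is always recoverable from the original text:
assert source[start:end] == "yonder window"
```

Because the span must match verbatim, paraphrased or abstracted extractions fail this check immediately, which is exactly the traceability property described above.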
2. Reliable Structured Outputs
LangExtract enforces a consistent output schema based on user-defined few-shot examples. This ensures that extracted data adheres to predefined structures, eliminating ambiguity in formatting and attributes. For instance, when extracting medication details from clinical notes, the library will consistently return fields such as name, dosage, route_of_administration, and timing—regardless of variations in input phrasing.
This consistency is achieved through controlled generation within supported models like Google’s Gemini, which guarantees robust and predictable results. Unlike open-ended extraction methods that may produce inconsistent or incomplete outputs, LangExtract’s structured approach minimizes variability while maintaining high accuracy.
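The effect of a fixed schema can be sketched in a few lines of plain Python (a conceptual stand-in, not LangExtract internals; `normalize_medication` and `REQUIRED_FIELDS` are our own names): every record is normalized to the same fields regardless of how the input phrased them.

```python
# Conceptual sketch of schema-consistent output: each extracted record is
# coerced to a fixed set of fields, so downstream code never sees a
# missing or unexpected key.

REQUIRED_FIELDS = ("name", "dosage", "route_of_administration", "timing")

def normalize_medication(record: dict) -> dict:
    """Return a record with exactly the required fields, filling gaps with None."""
    return {field: record.get(field) for field in REQUIRED_FIELDS}

raw = {"name": "acetaminophen", "dosage": "500mg", "extra": "ignored"}
med = normalize_medication(raw)
assert set(med) == set(REQUIRED_FIELDS)
assert med["route_of_administration"] is None  # field present even when absent in input
```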
3. Optimization for Long Documents
A significant challenge in information extraction is handling large documents—such as medical records spanning hundreds of pages or full-length novels—that contain sparse but critical information. LangExtract addresses this issue through a multi-faceted optimization strategy:
- Text Chunking: The library divides lengthy documents into smaller, manageable segments to reduce computational overhead and improve processing efficiency.
- Parallel Processing: By leveraging multiple worker threads (e.g., max_workers=20), LangExtract accelerates extraction tasks across large datasets, making it feasible to process entire books or extensive clinical records in a reasonable timeframe.
- Multiple Extraction Passes (extraction_passes=3): To enhance recall, LangExtract performs iterative passes over the text. Each pass refines and expands extracted entities based on previous results, ensuring that critical information is not overlooked.
This approach transforms what might otherwise be a computationally prohibitive task into an efficient, scalable operation. For example, extracting all named characters, relationships, and emotional states from Romeo and Juliet—a text with over 147,000 characters—can be completed in minutes rather than hours or days.
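The chunk-and-parallelize strategy above can be sketched with a toy stand-in (not LangExtract internals; the chunking here is a naive fixed-width split, whereas the library chunks more carefully): split the text into buffers of at most `max_char_buffer` characters, then fan the chunks out to a thread pool.

```python
# Toy sketch of the optimization strategy: fixed-width chunking plus a
# thread pool, with a character count standing in for the per-chunk LLM call.
from concurrent.futures import ThreadPoolExecutor

def chunk_text(text: str, max_char_buffer: int = 1000):
    """Split text into consecutive chunks no longer than max_char_buffer."""
    return [text[i:i + max_char_buffer] for i in range(0, len(text), max_char_buffer)]

def extract_from_chunk(chunk: str) -> int:
    """Stand-in for a per-chunk model call: here, just count characters."""
    return len(chunk)

text = "x" * 2500
chunks = chunk_text(text, max_char_buffer=1000)
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(extract_from_chunk, chunks))

assert len(chunks) == 3           # 1000 + 1000 + 500 characters
assert sum(results) == len(text)  # no characters lost across chunk boundaries
```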
4. Interactive Visualization
One of the most user-friendly aspects of LangExtract is its ability to generate interactive HTML visualizations from extracted data. After running an extraction task, users can save results to a .jsonl file and then generate an HTML report that displays all entities in their original textual context.
This visualization tool provides several benefits:
- Contextual Review: Users can hover over extracted entities to see their exact placement within the source text, allowing for immediate verification of accuracy.
- Scalability: The interactive format supports thousands of entities without performance degradation, making it practical even for large-scale projects.
- Self-contained Output: The generated HTML file is standalone and does not require external dependencies, simplifying deployment and sharing.
For instance, when analyzing a radiology report structured by LangExtract’s RadExtract module, users can visualize findings alongside their corresponding images or annotations, facilitating collaborative review with clinicians or researchers.
5. Flexible LLM Support
LangExtract is designed to accommodate a diverse range of LLMs, ensuring that users can choose the model best suited to their needs:
- Cloud-Based Models: The library supports major cloud providers, including Google’s Gemini family (with versions like gemini-2.5-flash and gemini-2.5-pro), OpenAI’s GPT models (e.g., gpt-4o), and others via custom model providers.
- Local LLMs: Users can integrate local open-source models through platforms like Ollama, which allows for offline processing without relying on API keys. For example, running a lightweight model such as gemma2:2b locally enables extraction tasks even in environments with restricted internet access.
This flexibility ensures that LangExtract remains adaptable to evolving LLM capabilities and cost structures. Whether users prefer the speed and scalability of cloud models or the privacy benefits of local inference, LangExtract provides a seamless integration experience.
6. Domain Adaptability
LangExtract’s strength lies in its ability to be tailored to any domain through simple prompt engineering. Users define extraction tasks by providing clear instructions and high-quality examples, allowing the library to adapt without requiring extensive model fine-tuning.
For example:
- In healthcare, users can define prompts for extracting medications, diagnoses, or patient symptoms from clinical notes.
- In literary analysis, they might focus on character arcs, thematic elements, or stylistic devices in novels.
- In legal contexts, extraction could involve identifying clauses, case precedents, or procedural details.
This adaptability makes LangExtract a versatile tool for researchers, developers, and practitioners across multiple fields. The key to success lies in crafting effective prompts and examples that guide the LLM toward accurate and relevant extractions.
7. Leveraging LLM World Knowledge
While LangExtract prioritizes extracting information directly from the source text, it also allows users to incorporate world knowledge when appropriate. By carefully structuring prompts, users can influence how the model infers attributes beyond those explicitly present in the input.
For example:
- In Romeo and Juliet, a prompt might include additional context like "identity: Capulet family daughter" or "literary_context: tragic heroine" to enrich extracted character profiles.
- In medical documentation, users could request that dosage information be supplemented with general knowledge about drug interactions or side effects.
The balance between text-evidence and world knowledge is controlled through the prompt’s phrasing. Over-reliance on world knowledge can introduce inaccuracies, while overemphasis on source grounding may limit creativity. LangExtract provides the flexibility to strike this balance based on the specific requirements of each task.
Quick Start: Getting Started with LangExtract
1. Defining Your Extraction Task
Before running an extraction, users must define a clear prompt and provide high-quality examples to guide the model’s behavior. The prompt describes what should be extracted, while the examples serve as templates for structured output.
Example Prompt:
prompt = textwrap.dedent("""
Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context.
""")
Example Data (for Romeo and Juliet):
examples = [
lx.data.ExampleData(
text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
extractions=[
lx.data.Extraction(extraction_class="character", extraction_text="ROMEO", attributes={"emotional_state": "wonder"}),
lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!", attributes={"feeling": "gentle awe"}),
lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun", attributes={"type": "metaphor"}),
]
)
]
Key Notes:
- The extraction_text should match verbatim from the example’s text (no paraphrasing).
- Extractions must appear in order of appearance within the source document.
- LangExtract will issue warnings if examples don’t align with this pattern, prompting users to refine their inputs.
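The alignment rules above can be expressed as a small check (our own helper for illustration, not the library’s built-in validator): each extraction text must occur verbatim in the example text, and extractions must come in source order.

```python
# Sketch of the example-alignment rules: spans must occur verbatim,
# in order of appearance, with no paraphrasing.

def check_example_alignment(text: str, extraction_texts: list[str]) -> bool:
    """Return True if every span occurs verbatim and in source order."""
    cursor = 0
    for span in extraction_texts:
        pos = text.find(span, cursor)
        if pos == -1:
            return False  # span missing, paraphrased, or out of order
        cursor = pos + len(span)
    return True

text = "ROMEO. But soft! What light through yonder window breaks?"
assert check_example_alignment(text, ["ROMEO", "But soft!"])
assert not check_example_alignment(text, ["But soft!", "ROMEO"])  # wrong order
```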
2. Running the Extraction
Once the prompt and examples are defined, users can extract information from input text using the lx.extract function:
import langextract as lx
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
result = lx.extract(
text_or_documents=input_text,
prompt_description=prompt,
examples=examples,
model_id="gemini-2.5-flash",
)
Model Selection:
- gemini-2.5-flash is the default choice, offering a balance of speed, cost, and accuracy.
- For complex tasks requiring deeper reasoning, gemini-2.5-pro may yield better results.
- Large-scale or production use should consider Tier 2 Gemini quotas to avoid rate limits.
Model Lifecycle: Users should consult Google’s official documentation for model version updates, as Gemini models follow a defined lifecycle with retirement dates. Staying informed ensures compatibility with the latest stable versions.
3. Visualizing Results
Extracted data can be saved to a .jsonl file and visualized in an interactive HTML report:
# Save results
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
# Generate visualization
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
if hasattr(html_content, 'data'):
f.write(html_content.data)
else:
f.write(html_content)
This generates an interactive HTML file that allows users to explore extracted entities within their original context. For instance, the Romeo and Juliet visualization might highlight lines like "ROMEO" with a note on its emotional state or show relationships between characters across multiple passages.
Scaling to Longer Documents
For documents exceeding typical processing limits—such as full-length novels or extensive clinical records—LangExtract employs optimized strategies:
Direct URL Processing
Users can extract information from entire books directly via URLs, leveraging parallel processing and enhanced sensitivity:
result = lx.extract(
text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
prompt_description=prompt,
examples=examples,
model_id="gemini-2.5-flash",
extraction_passes=3, # Improves recall through multiple passes
max_workers=20, # Parallel processing for speed
max_char_buffer=1000 # Smaller contexts for better accuracy
)
This approach processes Romeo and Juliet (147,843 characters) efficiently, extracting hundreds of entities while maintaining high accuracy. The interactive visualization seamlessly handles large result sets, allowing users to explore thousands of entries without performance degradation.
Vertex AI Batch Processing
For cost-effective large-scale tasks, LangExtract supports integration with Google’s Vertex AI Batch API:
language_model_params = {"vertexai": True, "batch": {"enabled": True}}
This enables batch processing of documents at scale while minimizing per-request costs. Users can refer to the Vertex AI Batch API example for detailed implementation guidance.
Installation and Setup
From PyPI
The simplest installation method is via pip:
pip install langextract
For isolated environments, users can create a virtual environment:
python -m venv langextract_env
source langextract_env/bin/activate # Linux/macOS
langextract_env\Scripts\activate # Windows
pip install langextract
From Source
LangExtract uses modern Python packaging with pyproject.toml for dependency management:
git clone https://github.com/google/langextract.git
cd langextract
# Basic installation
pip install -e .
# Development (includes linting tools)
pip install -e ".[dev]"
# Testing (includes pytest)
pip install -e ".[test]"
Docker
Users can also deploy LangExtract in a Docker container:
docker build -t langextract .
docker run --rm -e LANGEXTRACT_API_KEY="your-api-key" langextract python your_script.py
API Key Setup for Cloud Models
When using cloud-hosted models like Google’s Gemini or OpenAI, users must configure an API key. Local LLMs (such as those via Ollama) do not require authentication.
Where to Get API Keys
- Google AI Studio: https://aistudio.google.com/app/apikey
- Vertex AI: https://cloud.google.com/vertex-ai/generative-ai/docs/sdks/overview
- OpenAI Platform: https://platform.openai.com/api-keys
Setting Up API Keys
Users can configure their key in one of three ways:
- Environment Variable:
export LANGEXTRACT_API_KEY="your-api-key-here"
- .env File (Recommended): Create or edit a .env file:
cat >> .env << 'EOF'
LANGEXTRACT_API_KEY=your-api-key-here
EOF
Add .env to .gitignore to exclude it from version control.
- Directly in Code (Not Recommended for Production):
result = lx.extract(
text_or_documents=input_text,
prompt_description="Extract information...",
examples=[...],
model_id="gemini-2.5-flash",
api_key="your-api-key-here" # Only for testing/development
)
- Vertex AI (Service Accounts):
result = lx.extract(
text_or_documents=input_text,
prompt_description="Extract information...",
examples=[...],
model_id="gemini-2.5-flash",
language_model_params={
"vertexai": True,
"project": "your-project-id",
"location": "global" # or regional endpoint
}
)
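The .env file option above relies on a simple KEY=value convention. In practice projects usually load it with the python-dotenv package; the tiny parser below (our own sketch, not part of LangExtract) just shows what that loading step amounts to.

```python
# Minimal sketch of consuming a .env line: parse KEY=value and export it
# into the process environment, skipping blanks and comments.
import os

def load_env_line(line: str) -> None:
    """Parse a single KEY=value line and export it; ignore comments/blanks."""
    line = line.strip()
    if not line or line.startswith("#") or "=" not in line:
        return
    key, _, value = line.partition("=")
    os.environ[key.strip()] = value.strip()

load_env_line("LANGEXTRACT_API_KEY=your-api-key-here")
assert os.environ["LANGEXTRACT_API_KEY"] == "your-api-key-here"
```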
Adding Custom Model Providers
LangExtract supports custom LLM providers through a lightweight plugin system. Users can extend the library to include new models without modifying its core code.
Key Features of the Provider System:
- Isolated Dependencies: Custom providers run independently, keeping the library’s dependencies clean.
- Priority-Based Resolution: Multiple providers can register for the same model ID, allowing users to choose based on performance or cost.
- Structured Output Schemas: Providers can define schemas via get_schema_class(), ensuring consistent output formats.
How to Create a Custom Provider
Users can follow these steps:
- Register the provider with @registry.register(...).
- Publish an entry point for discovery (e.g., as a Python package).
- Optionally provide a schema class for structured outputs.
- Integrate with the factory via create_model(...).
For detailed guidance, refer to the Provider System Documentation.
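The priority-based resolution idea can be illustrated with a toy registry (purely conceptual; the real decorator is LangExtract’s @registry.register, and the names below are our own): several providers may claim the same model ID, and the highest-priority registration wins.

```python
# Conceptual sketch of priority-based provider resolution.
_PROVIDERS: dict[str, list] = {}

def register(model_id: str, priority: int = 0):
    """Decorator: register a provider class for a model ID with a priority."""
    def decorator(cls):
        _PROVIDERS.setdefault(model_id, []).append((priority, cls))
        return cls
    return decorator

def resolve(model_id: str):
    """Return the highest-priority provider class registered for a model ID."""
    candidates = _PROVIDERS[model_id]
    return max(candidates, key=lambda pair: pair[0])[1]

@register("my-model", priority=1)
class DefaultProvider: ...

@register("my-model", priority=10)
class FastProvider: ...

assert resolve("my-model") is FastProvider  # higher priority wins
```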
Specialized Use Cases
1. Using OpenAI Models
LangExtract supports OpenAI models (e.g., gpt-4o) through an optional dependency:
pip install langextract[openai]
Example Usage:
import langextract as lx
result = lx.extract(
text_or_documents=input_text,
prompt_description=prompt,
examples=examples,
model_id="gpt-4o",
api_key=os.environ.get('OPENAI_API_KEY'),
fence_output=True, # Required for OpenAI
use_schema_constraints=False # LangExtract doesn’t implement schema constraints for OpenAI
)
Note: OpenAI models require fence_output=True and use_schema_constraints=False because LangExtract does not yet enforce schema constraints in this context.
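The reason fence_output=True matters is that models which wrap their JSON in markdown code fences need those fences stripped before parsing. A minimal sketch of that unwrapping step (our own helper, not LangExtract’s actual parser):

```python
# Sketch of unwrapping a fenced model reply before JSON parsing.
import json

def strip_fences(reply: str) -> str:
    """Remove a surrounding ```json ... ``` fence, if one is present."""
    text = reply.strip()
    if text.startswith("```"):
        first_newline = text.index("\n")
        text = text[first_newline + 1:]   # drop the opening ```json line
        text = text.rsplit("```", 1)[0]   # drop the closing fence
    return text.strip()

reply = '```json\n{"extractions": []}\n```'
assert json.loads(strip_fences(reply)) == {"extractions": []}
```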
2. Using Local LLMs with Ollama
LangExtract can integrate local models via platforms like Ollama, enabling offline processing:
pip install langextract # No API key needed for local inference
Example Usage:
import langextract as lx
result = lx.extract(
text_or_documents=input_text,
prompt_description=prompt,
examples=examples,
model_id="gemma2:2b", # Local Ollama model
model_url="http://localhost:11434", # Ollama server endpoint
fence_output=False, # No need for fencing in local inference
use_schema_constraints=False
)
Quick Setup:
- Install Ollama from ollama.com.
- Pull the desired model (e.g., gemma2:2b).
- Run ollama serve.
- For detailed installation and examples, see the Ollama integration guide.
Advanced Examples
1. Romeo and Juliet Full Text Extraction
LangExtract can process entire books directly from URLs, demonstrating its scalability:
result = lx.extract(
text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
prompt_description=prompt,
examples=examples,
model_id="gemini-2.5-flash",
extraction_passes=3,
max_workers=20,
max_char_buffer=1000
)
This example processes the full text of Romeo and Juliet (147,843 characters) in parallel, extracting hundreds of entities with high accuracy. The interactive visualization allows users to explore all findings within their original context.
For more details, see the full Romeo and Juliet extraction example.
2. Medication Extraction
LangExtract excels at extracting structured medical information from clinical notes:
# Example prompt for medication extraction
prompt = textwrap.dedent("""
Extract all medications, dosages, routes, and timing from the following note.
Use exact text where possible. If dosage is not specified, infer a standard dose.
""")
# Example data
examples = [
lx.data.ExampleData(
text="Patient was prescribed acetaminophen 500mg every 6 hours for fever.",
extractions=[
lx.data.Extraction(extraction_class="medication", extraction_text="acetaminophen", attributes={"dosage": "500mg"}),
lx.data.Extraction(extraction_class="timing", extraction_text="every 6 hours", attributes={"schedule": "q6h"}),
]
)
]
# Run extraction
result = lx.extract(
text_or_documents=clinical_note,
prompt_description=prompt,
examples=examples,
model_id="gemini-2.5-flash"
)
Disclaimer: This example is for illustrative purposes only and does not represent a finished or approved medical product. Users should consult healthcare professionals before applying extracted data in clinical settings.
3. Radiology Report Structuring: RadExtract
LangExtract’s RadExtract module provides an interactive demo on HuggingFace Spaces, showcasing its ability to structure radiology reports:
This demo allows users to explore how LangExtract can automatically parse and annotate radiological findings, making it accessible without requiring local setup.
Community Providers
LangExtract encourages community contributions by providing a registry of custom model providers. Users can discover or contribute plugins via the Community Provider Plugins registry.
For guidance on creating a provider plugin, see the Custom Provider Plugin Example.
Contributing to LangExtract
Contributions are welcome and encouraged! Here’s how users can get involved:
1. Development Guidelines
Refer to the CONTRIBUTING.md for detailed instructions on:
- Setting up a development environment.
- Running tests locally.
- Following coding standards and best practices.
2. Contributor License Agreement (CLA)
Before submitting patches, users must sign the Contributor License Agreement. This ensures proper attribution of contributions to Google’s projects.
3. Testing
To run tests locally:
git clone https://github.com/google/langextract.git
cd langextract
# Install test dependencies
pip install -e ".[test]"
# Run all tests
pytest tests
For a full CI matrix, users can execute:
tox # Runs pylint + pytest on Python 3.10 and 3.11
4. Ollama Integration Testing
If Ollama is installed locally, users can run integration tests:
tox -e ollama-integration # Requires Ollama with gemma2:2b model
Development Best Practices
Code Formatting
LangExtract uses automated formatting tools to maintain consistency:
# Auto-format all code
./autoformat.sh
# Or run formatters separately
isort langextract tests --profile google --line-length 80
pyink langextract tests --config pyproject.toml
Pre-commit Hooks
For automatic formatting checks:
pre-commit install # One-time setup
pre-commit run --all-files # Manual run
Linting
Run linting before submitting pull requests:
pylint --rcfile=.pylintrc langextract tests
Disclaimer
LangExtract is not an officially supported Google product. Users should cite the library appropriately when publishing work that relies on it. For health-related applications, compliance with Google’s Health AI Developer Foundations Terms of Use applies.
Conclusion: Why LangExtract Stands Out
LangExtract represents a significant advancement in structured information extraction from unstructured text. Its combination of precise source grounding, scalable processing capabilities, and adaptability to diverse domains makes it an indispensable tool for researchers, developers, and practitioners across multiple fields. By providing a user-friendly interface for defining extraction tasks and visualizing results, LangExtract simplifies the process of transforming raw data into actionable insights—whether in healthcare, literature, or beyond.
With its support for both cloud-based and local LLMs, LangExtract offers flexibility that aligns with evolving technological landscapes. Whether users are processing a single clinical note or extracting entities from an entire novel, LangExtract delivers reliable, structured outputs that can be directly integrated into workflows or further analyzed as needed.
As the field of AI continues to advance, tools like LangExtract will play a crucial role in bridging the gap between raw textual data and meaningful, actionable information—ushering in a new era of intelligent document processing.
Repository: https://github.com/google/langextract