
LangExtract docs

LLM-powered structured extraction from text, grounded to the source.

Source grounding

Every extraction is mapped back to its exact character span in the original text. Extracted values that can't be located in the source are flagged, so hallucinations are easy to filter out.
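
To make the idea concrete, here is a minimal sketch of grounding in plain Python. It is not LangExtract's implementation: the `GroundedExtraction` class and `ground` function are hypothetical names, and exact substring matching is an assumption made for illustration (a real aligner may be more tolerant).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedExtraction:
    """Hypothetical record: an extracted value plus its span in the source."""
    text: str
    start: Optional[int]  # inclusive character offset, None if not located
    end: Optional[int]    # exclusive character offset, None if not located

    @property
    def grounded(self) -> bool:
        return self.start is not None


def ground(source: str, value: str) -> GroundedExtraction:
    """Locate `value` in `source` by exact match (an illustrative assumption)."""
    start = source.find(value) if value else -1
    if start == -1:
        # The value does not appear in the text: flag it as ungrounded.
        return GroundedExtraction(value, None, None)
    return GroundedExtraction(value, start, start + len(value))


source = "Patient was given 250 mg of amoxicillin twice daily."
candidates = ["amoxicillin", "250 mg", "ibuprofen"]  # the last one is not in the text
results = [ground(source, c) for c in candidates]
kept = [r for r in results if r.grounded]  # drops the ungrounded "ibuprofen"
```

Slicing the source with a kept record's `start` and `end` returns the extracted text exactly, which is what makes the span useful for highlighting and auditing.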

Structured output from examples

You don't hand-write a schema. A few high-quality examples shape the output, and on supported models LangExtract applies schema constraints for consistency.

Built for long documents

Text is split into chunks that are processed in parallel, and extraction can run over multiple passes to improve recall on large inputs such as books, reports, and clinical notes.
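
The chunk, parallelize, and repeat pattern can be sketched with the standard library alone. This is an illustration under stated assumptions, not LangExtract's code: `chunk_text`, `extract_numbers`, and `run` are hypothetical names, and a regex stands in for the model call so the example runs offline.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, max_chars):
    """Split text into (offset, chunk) pairs of at most max_chars characters,
    breaking after a space where possible so words stay intact."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1  # keep the space with the earlier chunk
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def extract_numbers(offset, chunk):
    """Stand-in extractor (hypothetical): finds digit runs instead of calling a model.
    Offsets are shifted back into the coordinates of the full text."""
    return {(offset + m.start(), offset + m.end(), m.group())
            for m in re.finditer(r"\d+", chunk)}


def run(text, max_chars=16, passes=2, workers=4):
    """Process chunks in parallel and merge the results of every pass."""
    found = set()
    for _ in range(passes):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for hits in pool.map(lambda c: extract_numbers(*c),
                                 chunk_text(text, max_chars)):
                found |= hits  # union, so repeated findings are deduplicated
    return sorted(found)


text = "Order 66 shipped 1200 units in 2024 to 15 stores."
spans = run(text)
```

Because the stand-in extractor is deterministic, the second pass finds nothing new here. With a model whose output varies between runs, taking the union across passes is one way extra passes could raise recall.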

Interactive visualization

Results export to JSONL and render as a self-contained HTML file that highlights every extracted entity in its original context.
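
As a rough picture of what such output involves, the sketch below writes records as JSONL and wraps extracted spans in `<mark>` tags. The record fields and the `to_jsonl` and `highlight` helpers are hypothetical; LangExtract's actual file layout and HTML are not reproduced here.

```python
import html
import json


def to_jsonl(records):
    """Serialize each record as one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def highlight(text, spans):
    """Wrap each non-overlapping (start, end, label) span in a <mark> tag,
    escaping everything else so the result is safe to embed in HTML."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(html.escape(text[cursor:start]))
        out.append(f'<mark title="{html.escape(label)}">'
                   f'{html.escape(text[start:end])}</mark>')
        cursor = end
    out.append(html.escape(text[cursor:]))
    return "".join(out)


text = "Take 250 mg & rest."
spans = [(5, 11, "dosage")]  # hypothetical span for "250 mg"

jsonl = to_jsonl([{"text": text[s:e], "start": s, "end": e, "label": label}
                  for s, e, label in spans])
page = highlight(text, spans)
```

Keeping character offsets in each record is what lets the same data drive both the line-oriented export and the in-context highlighting.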

Multiple model backends

Google Gemini (the default), OpenAI, and local models via Ollama work out of the box, with a plugin system for adding more providers.

Open source

Released under the Apache 2.0 license. Read the source, file issues, or extend it with your own providers on GitHub.

LangExtract is open source under the Apache 2.0 license and is not an officially supported Google product. This is an independent documentation site and is not official Google documentation.