The structured data extractor supports two extraction methods — LLM and Regex — each suited to different document types and use cases. The extraction method setting determines how each field is populated.

How each extraction method works

Use this table to compare the two methods at a high level — how each processes your documents, the schema format it expects, and what the output looks like.
| | LLM | Regex |
| --- | --- | --- |
| Choose when | Values depend on context; you need nested or typed fields | Values follow a stable, recognizable pattern (for example: invoice numbers, dates, phone numbers) |
| How it works | A model reads meaning from text and populates schema-defined fields with inferred values | The extractor scans partitioned text for named patterns and returns matched strings |
| Schema format | JSON in OpenAI Structured Outputs format: named fields, types, descriptions, optional nesting | name / pattern pairs: a label and a regex for each capture field |
| Output structure | Typed fields: objects, arrays, numbers, booleans, and strings. See output examples. | An array of matched substrings per pattern name. See output examples. |
| Model selection | The provider and model are configurable | No model required: extraction uses a regex engine, not a language model, and matches patterns directly against partitioned text |
If both methods could fit, run a small sample with each and compare quality and maintenance cost before you standardize on one.
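To make the Regex column concrete, here is a minimal sketch of pattern-based extraction using Python's standard `re` module. It is not the extractor's actual implementation: the pattern names, the regexes, the sample text, and the `extract_with_regex` helper are all hypothetical, and the real service may use a different regex dialect or output wrapper. It only illustrates the idea of name / pattern pairs producing an array of matched substrings per pattern name.

```python
import re

# Hypothetical name/pattern pairs: the labels and regexes are illustrative only.
PATTERNS = [
    {"name": "invoice_number", "pattern": r"INV-\d{4,}"},
    {"name": "date", "pattern": r"\d{4}-\d{2}-\d{2}"},
    {"name": "phone", "pattern": r"\(\d{3}\) \d{3}-\d{4}"},
]


def extract_with_regex(text, patterns):
    """Return {name: [matched substrings]} for each name/pattern pair."""
    results = {}
    for entry in patterns:
        compiled = re.compile(entry["pattern"])
        # group(0) is the whole match, so every value is a plain string.
        results[entry["name"]] = [m.group(0) for m in compiled.finditer(text)]
    return results


# Stand-in for partitioned document text (made-up sample).
sample_text = (
    "Invoice INV-20417 issued 2024-03-15, due 2024-04-14. "
    "Questions? Call (555) 010-2368."
)

extracted = extract_with_regex(sample_text, PATTERNS)
print(extracted)
```

Every value comes back as a list of strings, even when a pattern matches once or not at all. Any typing or normalization (for example, turning the date strings into date objects) would have to happen downstream, which is one of the trade-offs to weigh against the LLM method.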

Available options

The following table shows which options are available for each method. Links go to the detail pages where each option is described.
| Option | LLM | Regex |
| --- | --- | --- |
| Model selection: choose the LLM provider and model that powers your extraction | Yes | No |
| Visual schema builder and JSON upload / export: build your schema visually or import from a JSON file | Yes | Yes |
| Schema-only output toggle: return extracted fields only, without Unstructured document elements | Yes | Yes |
| Schema prompt: generate a schema from plain-language instructions | Yes | No |
| Extraction guidance: instruct the LLM how to format or normalize extracted values | Yes | No |
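For the LLM method, a schema prepared for JSON upload might look roughly like the sketch below, which builds a schema in the general shape of OpenAI Structured Outputs and serializes it to JSON. The field names, descriptions, and the outer `name` / `strict` / `schema` wrapper are assumptions for illustration; check the schema detail pages for the exact envelope the extractor expects. The `check_strict_objects` helper is also hypothetical: it only verifies two conventions of OpenAI's strict mode, namely that every object sets `additionalProperties` to `false` and lists all of its properties under `required`.

```python
import json

# Hypothetical schema: field names, descriptions, and wrapper keys are assumptions.
invoice_schema = {
    "name": "invoice",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string", "description": "Invoice identifier"},
            "total": {"type": "number", "description": "Total amount due"},
            "paid": {"type": "boolean", "description": "Whether the invoice is settled"},
            "line_items": {
                "type": "array",
                "description": "One entry per billed item",
                "items": {
                    "type": "object",
                    "properties": {
                        "item_name": {"type": "string"},
                        "quantity": {"type": "integer"},
                    },
                    "required": ["item_name", "quantity"],
                    "additionalProperties": False,
                },
            },
        },
        "required": ["invoice_number", "total", "paid", "line_items"],
        "additionalProperties": False,
    },
}


def check_strict_objects(node, path="schema"):
    """Collect places where an object node breaks the two strict-mode conventions."""
    problems = []
    if not isinstance(node, dict):
        return problems
    if node.get("type") == "object":
        props = node.get("properties", {})
        if node.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties should be false")
        if set(node.get("required", [])) != set(props):
            problems.append(f"{path}: every property should be listed in required")
        for key, child in props.items():
            problems += check_strict_objects(child, f"{path}.{key}")
    elif node.get("type") == "array":
        problems += check_strict_objects(node.get("items", {}), f"{path}[]")
    return problems


problems = check_strict_objects(invoice_schema["schema"])
schema_json = json.dumps(invoice_schema, indent=2)  # text suitable for saving to a .json file
print(problems or "schema looks consistent")
```

Unlike the name / pattern pairs used by the Regex method, this shape carries types, descriptions, and nesting, which is what allows the output to contain numbers, booleans, arrays, and objects rather than matched strings only.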

Next steps

To learn more about the options for each method:

To go straight to step-by-step procedures for using either method: