Structured Extraction with LLM

When the extraction method is LLM, a model reads meaning from your documents and populates schema-defined fields with inferred values. This page covers the LLM options: schema definition, schema prompt, model selection, schema-only output, and extraction guidance. To compare LLM and Regex before choosing, see Choose an extraction method.

Define your schema

You can build your extraction schema directly in the visual schema builder, or generate a starting point from a plain-language prompt. Once generated, you can refine the schema in the builder and export it as JSON. Be aware that generating a new schema from the plain-language prompt will overwrite any existing builder content.

If you already have a schema in the visual schema builder and want to try generating one from a plain-language prompt, export your current schema to a JSON file first. You can upload it again later if you prefer the original.

Visual schema builder and JSON upload/export

On the Start page or in the workflow editor, you can access the visual schema builder in the Define Schema view. From there you can:

Upload a JSON file to the editor.
Edit the fields in the schema directly in the editor.
Export the schema you have defined to a JSON file for reuse.

An extraction schema is a JSON-formatted schema that defines the structure of the data that Unstructured extracts. If you already have an extraction schema defined in a JSON file, you can click Upload JSON to upload the file to Unstructured.

The schema must conform to the OpenAI Structured Outputs guidelines, which are a subset of the JSON Schema language.

The following shows the extraction schema for the sample real estate listing, first in the visual schema builder, then as a JSON schema file.

The LLM visual schema builder:

LLM visual schema builder showing an extraction schema with the Export schema as JSON option

JSON schema file:

{
  "type": "object",
  "properties": {
    "street_address": {
      "type": "string",
      "description": "The full street address of the property including street number, street name, city, state, and postal code"
    },
    "square_footage": {
      "type": "number",
      "description": "The total living space area of the property, in square feet"
    },
    "price": {
      "type": "number",
      "description": "The listed selling price of the property, in local currency"
    },
    "features": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "A list of property features and highlights"
    },
    "agent_contact": {
      "type": "object",
      "properties": {
        "phone": {
          "type": "string",
          "description": "The agent's contact phone number"
        }
      },
      "required": [
        "phone"
      ],
      "additionalProperties": false,
      "description": "Contact information for the real estate agent"
    }
  },
  "additionalProperties": false,
  "required": [
    "street_address",
    "square_footage",
    "price",
    "features",
    "agent_contact"
  ]
}
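Two Structured Outputs constraints are visible in the sample schema above: every object sets additionalProperties to false, and every property is listed in required. The following is a minimal sketch, not part of Unstructured, that checks only those two rules before you upload a file. The function name lint_schema and the shortened schema are illustrative assumptions, and the sketch does not handle union types or the other rules in the guidelines.

```python
def lint_schema(schema, path="$"):
    """Return a list of problems found in the object nodes of a JSON schema.

    Checks two rules only: additionalProperties must be false, and
    every property must be listed in "required".
    """
    problems = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not listed in required: {missing}")
        # Recurse into nested objects such as agent_contact.
        for name, sub in props.items():
            problems.extend(lint_schema(sub, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(lint_schema(schema["items"], f"{path}[]"))
    return problems


# A shortened version of the real estate schema (descriptions omitted).
listing_schema = {
    "type": "object",
    "properties": {
        "price": {"type": "number"},
        "features": {"type": "array", "items": {"type": "string"}},
        "agent_contact": {
            "type": "object",
            "properties": {"phone": {"type": "string"}},
            "required": ["phone"],
            "additionalProperties": False,
        },
    },
    "required": ["price", "features", "agent_contact"],
    "additionalProperties": False,
}

for problem in lint_schema(listing_schema):
    print(problem)
```

An empty result means only that these two rules pass; the schema can still fail other parts of the guidelines.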

Plain language in a schema prompt

A schema prompt is plain-language instructions that describe what to extract from your documents, similar to a prompt you would give a chatbot or AI agent. Unstructured generates an extraction schema from those instructions: a structured definition (fields, types, and constraints) that guides extraction from the source documents.

This option is only available from the Start page.

From the Start page, click Suggest, enter your prompt in the Prompt a Schema dialog, then click Generate schema. Unstructured generates a schema from your instructions and displays it in the visual schema builder.

Prompt a Schema dialog showing a plain-language prompt for a real estate listing

Selecting Generate schema overwrites the existing schema that’s displayed in the Define Schema pane. If you’d like to save the current schema before generating a new one, click the ellipsis (three dots) icon, then click Export schema as JSON.

You can continue to edit the generated schema in the visual schema builder. For this real estate listing example, you might enter the following prompt:

Extract the following information from the listing, and present it in the following format:

- street_address: The full street address of the property including street number, street name, city, state, and postal code.
- square_footage: The total living space area of the property, in square feet.
- price: The listed selling price of the property, in local currency.
- features: A list of property features and highlights.
- agent_contact: Contact information for the real estate agent.
  - phone: The agent's contact phone number.
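Each line of the prompt above mirrors a description value in the JSON schema shown earlier. If you already have an exported schema and want to try the prompt route, one possible way to draft a similar prompt is to render it from the schema's descriptions. This is a sketch under assumptions: prompt_lines and build_prompt are hypothetical helpers, not Unstructured functions, and nested fields are indented by two spaces.

```python
HEADER = (
    "Extract the following information from the listing, "
    "and present it in the following format:"
)


def prompt_lines(schema, indent=0):
    """Render one '- name: description.' line per property, nesting objects."""
    lines = []
    for name, sub in schema.get("properties", {}).items():
        desc = sub.get("description", "").rstrip(".")
        lines.append(f"{'  ' * indent}- {name}: {desc}.")
        if sub.get("type") == "object":
            lines.extend(prompt_lines(sub, indent + 1))
    return lines


def build_prompt(schema):
    """Join the header, a blank line, and the rendered field lines."""
    return "\n".join([HEADER, ""] + prompt_lines(schema))


# Two fields taken from the real estate schema, for illustration.
schema = {
    "type": "object",
    "properties": {
        "price": {
            "type": "number",
            "description": "The listed selling price of the property, in local currency",
        },
        "agent_contact": {
            "type": "object",
            "description": "Contact information for the real estate agent",
            "properties": {
                "phone": {
                    "type": "string",
                    "description": "The agent's contact phone number",
                }
            },
        },
    },
}

print(build_prompt(schema))
```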

The following image shows the generated schema that displays in the visual schema builder.

LLM visual schema builder displaying a schema generated from a plain-language prompt

Select your LLM provider and model

You can choose which provider and model the LLM extraction method uses. For Model, select the provider and model from the drop-down.

Provider and model selection dropdown in the workflow editor Extract node

This option is only available from the workflow editor.

Configure your output

Once your schema determines which fields to extract and what types they return, these settings control what the output looks like. Schema-only output lets you strip away Unstructured’s document elements and return just the extracted fields. Extraction guidance lets you tell the LLM how to format, normalize, or summarize values into the fields your schema defines.

Schema-only output

The Schema-Only Output setting applies to both the LLM and Regex extraction methods. In the workflow editor, select the workflow’s Extract node. Under Output settings, you can set Schema-Only Output to ON or OFF whenever you edit the workflow.

When Schema-Only Output is ON, the Extract node returns only the JSON produced for your explicitly defined fields. In workflow JSON, this is the "extracted data only" layout described under Custom defined output, with no surrounding Unstructured element list.
When Schema-Only Output is OFF (the default), Unstructured also emits the usual document elements and metadata alongside the extracted values. In workflow JSON, this is the "elements with extracted data" layout described in the same Custom defined output section: structured fields under DocumentData plus the rest of the element list.

Schema-Only Output toggle in the Extract node Output settings

This option is only available from the workflow editor.

Extraction guidance

In the workflow editor, use the Extraction Guidance Prompt to tell the LLM how to format, normalize, or present values after your schema defines which fields to extract.

This option is only available from the workflow editor.

The schema still defines what to extract (fields, types, and constraints). Extraction guidance adds plain-language direction for how to format, normalize, or summarize that output when JSON Schema alone is not enough. For example, you can ask the model to standardize addresses, return dates in a consistent format, or summarize long text into a predefined field. Extracted values must still conform to the schema; the prompt only describes presentation and cleanup on top of that contract.

To add or revise an Extraction Guidance Prompt, add or select the Extract node in the workflow editor. From the structured data extractor, click + Add Prompt and enter your plain-language instructions. Saving writes the prompt into the node’s settings, so the guidance is stored with the workflow and later runs, including API operations against that workflow, use the same guidance. You can edit and save the extraction guidance again as you iterate.
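To illustrate the point that guidance changes presentation but not the schema contract, the sketch below checks extracted values against a schema's types and required fields. It is a simplified, standard-library-only illustration with a hypothetical conforms helper; it is not how Unstructured validates output, and it covers only the object, array, string, and number types used in the real estate example.

```python
def conforms(value, schema):
    """Return True if value matches the schema's type and required fields."""
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return False
        props = schema.get("properties", {})
        if any(key not in value for key in schema.get("required", [])):
            return False
        if schema.get("additionalProperties") is False and set(value) - set(props):
            return False
        return all(conforms(value[k], props[k]) for k in value if k in props)
    if kind == "array":
        items = schema.get("items", {})
        return isinstance(value, list) and all(conforms(v, items) for v in value)
    if kind == "string":
        return isinstance(value, str)
    if kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return True  # types outside this sketch are accepted unchecked


schema = {
    "type": "object",
    "properties": {
        "street_address": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["street_address", "price"],
    "additionalProperties": False,
}

# Guidance may standardize how the address string is written,
# but price has to remain a number to satisfy the schema.
normalized = {"street_address": "123 Main St, Springfield, IL 62701", "price": 350000}
as_text = {"street_address": "123 Main St, Springfield, IL 62701", "price": "350,000"}

print(conforms(normalized, schema))
print(conforms(as_text, schema))
```

The address and price values here are made-up sample data, not output from a real run.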

Extraction Guidance Prompt field in the workflow editor Extract node
