Structured data extraction

To begin using the structured data extractor right away, skip ahead to the how-to procedures.

When Unstructured partitions your source documents, the default output is a list of Unstructured document elements. Unstructured expresses these elements in its own format, with types that include Title, NarrativeText, UncategorizedText, Table, Image, and List.

Consider a stack of customer order forms in PDF format, all with an identical layout but different content per order. You might want to extract common fields such as customer IDs, item quantities, and order totals. Normally, that would require custom logic that you write yourself. With Unstructured’s structured data extractor, you instead define the target structure up front, and Unstructured extracts the values in a consistent JSON format that matches your fields, ready to use in your own applications.

The diagram below shows how data flows from a patient form into JSON output that is saved to cloud file storage. From there, you could run a script to insert the JSON as records into a database.

Conceptual flow of structured data extraction
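
As a rough illustration of the last step in this flow, the following Python sketch loads extracted JSON records into a database. Everything specific here is an assumption made for the example: the field names (customer_id, item_quantity, order_total), the sample values, the orders table, and the use of an in-memory SQLite database. In practice you would read the JSON from wherever your workflow saved it.

```python
import json
import sqlite3

# Hypothetical extracted records, one per order form.
# Field names and values are illustrative, not real Unstructured output.
records_json = """
[
  {"customer_id": "C-1001", "item_quantity": 3, "order_total": 59.97},
  {"customer_id": "C-1002", "item_quantity": 1, "order_total": 19.99}
]
"""
records = json.loads(records_json)

# An in-memory SQLite database stands in for your real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id TEXT, item_quantity INTEGER, order_total REAL)"
)

# Named placeholders map each JSON field to its column.
conn.executemany(
    "INSERT INTO orders VALUES (:customer_id, :item_quantity, :order_total)",
    records,
)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)  # 2
```

Because every record has the same fields, the insert needs no per-document parsing.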

To show how the structured data extractor works from a technical perspective, take a look at the following real estate listing PDF. This file is one of the sample files that is available directly from the Start page and the workflow editor’s Source node in the Unstructured user interface (UI). The file’s content is as follows:

Without the structured data extractor, if you run a workflow that references this file, Unstructured extracts the listing’s data in a default format similar to the following (note that the ellipses in this output indicate omitted fields for brevity):

[
  {
    "type": "Title",
    "element_id": "3f1ad705648037cf65e4d029d834a0de",
    "text": "HOME FOR FUTURE",
    "metadata": {
      "...": "..."
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "320ca4f48e63d8bcfba56ec54c9be9af",
    "text": "221 Queen Street, Melbourne VIC 3000",
    "metadata": {
      "...": "..."
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "05f648e815e73fe5140f203a62d8a3cc",
    "text": "2,800 sq. ft living space",
    "metadata": {
      "...": "..."
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "27a9ded56b42f559999e48d1dcd76c9e",
    "text": "Recently renovated kitchen",
    "metadata": {
      "...": "..."
    }
  },
  {
    "...": "..."
  }
]

In the preceding output, the text fields contain information about the listing, such as the street address, the square footage, and one of the listing’s features. Because these are generic text fields, you would have to parse the values out of them yourself. With the structured data extractor, each value is instead captured in its own named field, which your applications can read directly without additional parsing logic.
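
To make the parsing burden concrete, here is a minimal Python sketch of the kind of custom logic the default output might require just to recover the square footage. The regular expression and the find_square_footage helper are hypothetical, written only for this example. The element text is copied from the sample output above, with metadata omitted.

```python
import re

# Default-format elements from the sample output above (metadata omitted).
elements = [
    {"type": "Title", "text": "HOME FOR FUTURE"},
    {"type": "NarrativeText", "text": "221 Queen Street, Melbourne VIC 3000"},
    {"type": "NarrativeText", "text": "2,800 sq. ft living space"},
    {"type": "NarrativeText", "text": "Recently renovated kitchen"},
]

# Hypothetical pattern: matches text such as "2,800 sq. ft".
SQ_FT = re.compile(r"([\d,]+)\s*sq\.?\s*ft", re.IGNORECASE)


def find_square_footage(elements):
    """Return the first square-footage value found in any text field, or None."""
    for element in elements:
        match = SQ_FT.search(element.get("text", ""))
        if match:
            # Drop the thousands separator before converting to an integer.
            return int(match.group(1).replace(",", ""))
    return None


print(find_square_footage(elements))  # 2800
```

A pattern like this only covers one field and one phrasing. A listing that writes the area differently would need another rule, which is the kind of per-field logic the extractor is meant to replace.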

Custom-defined output

By using the structured data extractor in your Unstructured workflows, you can have Unstructured extract the listing’s data in a custom-defined format.

Elements with extracted data format

The first custom-defined output format is known as the elements with extracted data format, as follows (ellipses indicate omitted fields for brevity):

[
  {
    "type": "DocumentData",
    "element_id": "f2ee7334-c00a-4fc0-babc-2fcea28c1fb6",
    "text": "",
    "metadata": {
      "...": "...",
      "extracted_data": {
        "street_address": "221 Queen Street, Melbourne VIC 3000",
        "square_footage": 2800,
        "price": 1000000,
        "features": [
          "Recently renovated kitchen",
          "Smart home automation system",
          "2-car garage with storage space",
          "Spacious open-plan layout with natural lighting",
          "Designer kitchen with quartz countertops and built-in appliances",
          "Master suite with walk-in closet and en-suite bath",
          "Covered patio and landscaped backyard garden"
        ],
        "agent_contact": {
          "phone": "+01 555 123456"
        }
      }
    }
  },
  {
    "type": "Title",
    "element_id": "3f1ad705648037cf65e4d029d834a0de",
    "text": "HOME FOR FUTURE",
    "metadata": {
      "...": "..."
    }
  },
  {
    "...": "..."
  }
]

In this output, the first element — type DocumentData — carries your extracted fields in an extracted_data field under metadata. From the second element onward, Unstructured also outputs the document’s data as its standard document elements and metadata. The following illustrates the layout at a glance:

[
  { "type": "DocumentData", ...      ← first element: carries your extracted fields
    "metadata": {
      "extracted_data": { ... }      ← your schema-defined fields live here
    }
  },
  { "type": "Title", ... },          ← second element onward: standard document elements
  { ... }
]
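
Given this layout, a consumer can locate the extracted fields by looking for the DocumentData element. The following Python sketch assumes the output has been saved as JSON text. The get_extracted_data helper is a hypothetical name, and the sample is a trimmed copy of the output shown earlier.

```python
import json


def get_extracted_data(elements):
    """Return metadata.extracted_data from the first DocumentData element, or None."""
    for element in elements:
        if element.get("type") == "DocumentData":
            return element.get("metadata", {}).get("extracted_data")
    return None


# Trimmed copy of the sample output above.
sample = json.loads("""
[
  {
    "type": "DocumentData",
    "element_id": "f2ee7334-c00a-4fc0-babc-2fcea28c1fb6",
    "text": "",
    "metadata": {
      "extracted_data": {
        "street_address": "221 Queen Street, Melbourne VIC 3000",
        "square_footage": 2800,
        "price": 1000000
      }
    }
  },
  {"type": "Title", "text": "HOME FOR FUTURE", "metadata": {}}
]
""")

data = get_extracted_data(sample)
print(data["street_address"])  # 221 Queen Street, Melbourne VIC 3000
print(data["square_footage"])  # 2800
```

Searching by type rather than always reading index 0 is a defensive choice in this sketch. The documented layout places DocumentData first.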

Extracted data only format

The second custom-defined output format is known as the extracted data only format, as follows:

{
  "street_address": "221 Queen Street, Melbourne VIC 3000",
  "square_footage": 2800,
  "price": 1000000,
  "features": [
    "Recently renovated kitchen",
    "Smart home automation system",
    "2-car garage with storage space",
    "Spacious open-plan layout with natural lighting",
    "Designer kitchen with quartz countertops and built-in appliances",
    "Master suite with walk-in closet and en-suite bath",
    "Covered patio and landscaped backyard garden"
  ],
  "agent_contact": {
    "phone": "+01 555 123456"
  }
}

In this format, the document’s data is output only as the JSON fields you defined in your schema. The following illustrates what is removed compared to the elements with extracted data format, and the resulting structure.

Elements no longer included:

- [                                  ← removed: no surrounding array
-   { "type": "DocumentData",        ← removed: no DocumentData element
-     "metadata": {                  ← removed: no metadata wrapper
-       "extracted_data": {          ← removed: no extracted_data nesting
          "street_address": "...",
          "square_footage": ...,
          "features": [ ... ],
          "agent_contact": { ... }
-       }
-     }
-   },
-   { "type": "Title", ... },        ← removed: no standard document elements
- ]

Extracted data only format:

{
  "street_address": "...",    ← fields appear directly at the root
  "square_footage": ...,
  "features": [ ... ],
  "agent_contact": { ... }
}

In the workflow editor, the Schema-Only Output toggle controls which format is used. Switch it ON to get the extracted data only format, or leave it OFF (the default) to get the elements with extracted data format.
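
If downstream code might receive output produced with either toggle setting, it can normalize both shapes to the same dictionary. The Python sketch below is one possible approach. The extracted_fields helper is a hypothetical name, and the rule it uses (a JSON object at the root means extracted data only; a JSON array means elements with extracted data) is inferred from the two samples on this page.

```python
def extracted_fields(output):
    """Return the schema-defined fields from either output format.

    - dict: treated as the extracted data only format (fields at the root).
    - list: treated as the elements with extracted data format; the fields
      are read from the DocumentData element's metadata.extracted_data.
    """
    if isinstance(output, dict):
        return output
    if isinstance(output, list):
        for element in output:
            if isinstance(element, dict) and element.get("type") == "DocumentData":
                return element.get("metadata", {}).get("extracted_data", {})
        raise ValueError("no DocumentData element found in the element list")
    raise TypeError(f"unsupported output type: {type(output).__name__}")


# Trimmed versions of the two sample outputs on this page.
schema_only = {
    "street_address": "221 Queen Street, Melbourne VIC 3000",
    "square_footage": 2800,
}
with_elements = [
    {"type": "DocumentData", "text": "", "metadata": {"extracted_data": schema_only}},
    {"type": "Title", "text": "HOME FOR FUTURE", "metadata": {}},
]

print(extracted_fields(schema_only) == extracted_fields(with_elements))  # True
```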

For step-by-step instructions in the UI, see the workflow editor procedure.
For API users, these two formats correspond to the output formats described in the Extract node API reference.

Limitations

The structured data extractor is not guaranteed to work with the Pinecone destination connector. Pinecone enforces strict limits on the amount of metadata it can store, and these limits are lower than the amount of metadata that the structured data extractor typically needs to manage.
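
To get a rough sense of whether a given element’s metadata is likely to be a problem, you could measure its serialized size before sending it to a size-limited destination. The Python sketch below is only an approximation. The helper names are hypothetical, the 40 KB figure is an assumed example value that you should confirm against Pinecone’s current documentation, and the way Pinecone itself counts metadata size may differ from the length of a JSON string.

```python
import json

# ASSUMPTION: example limit only; confirm the real value in Pinecone's documentation.
ASSUMED_LIMIT_BYTES = 40 * 1024


def metadata_size_bytes(metadata):
    """Approximate the metadata size as the UTF-8 length of its JSON serialization."""
    return len(json.dumps(metadata, ensure_ascii=False).encode("utf-8"))


def fits_within(metadata, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Return True if the approximate metadata size is at or below the limit."""
    return metadata_size_bytes(metadata) <= limit_bytes


# Trimmed version of the extracted_data sample on this page.
metadata = {
    "extracted_data": {
        "street_address": "221 Queen Street, Melbourne VIC 3000",
        "square_footage": 2800,
        "features": ["Recently renovated kitchen", "Smart home automation system"],
    }
}

print(metadata_size_bytes(metadata), fits_within(metadata))
```

A small sample like this one fits easily. The concern is documents whose extracted fields, combined with the rest of the element metadata, grow past the destination’s limit.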

Next steps

To learn about the two structured data extraction methods — LLM and Regex — and compare them to decide which fits your use case, see Choose an extraction method: LLM or Regex.
To go straight to step-by-step procedures, see Using the structured data extractor.
