Concepts

Ingestion is the term that Unstructured uses for the set of activities that happen when files are input for processing. Ingestion enables multiple files to be processed as a batch.

You can perform ingestion with the following tools:

  • The Unstructured Platform, a no-code user interface and unlimited pay-as-you-go platform for getting all of your data ready for Retrieval Augmented Generation (RAG) and model fine-tuning.
  • The Unstructured Ingest CLI, with unlimited pay-as-you-go and limited free options, which enables you to use command-line scripts to get all of your data ready for RAG and model fine-tuning.
  • The Unstructured Ingest Python library, with unlimited pay-as-you-go and limited free options, which enables you to use Python code to get all of your data ready for RAG and model fine-tuning.

The Unstructured Python SDK and Unstructured JavaScript/TypeScript SDK can process only one file at a time.

Files are ingested from an originating source location. Each batch of ingested files is processed either entirely by Unstructured or entirely locally. The processed data is sent to a target destination location. The kinds of locations you can specify vary:

When you use the Unstructured Platform, the source and destination must both be in cloud storage. Local source or local destination locations are not allowed.

The Unstructured Platform enables you to connect to many kinds of sources and destinations.

If you use the Unstructured Ingest CLI or the Unstructured Ingest Python library, the source or destination can be a cloud storage location or a local location.

Unstructured provides many source and destination connectors.
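
The location rules above can be summarized in a few lines of code. The following is a minimal sketch only: the Tool and Location enums and the locations_allowed function are hypothetical names made up for illustration and are not part of any Unstructured product or library.

```python
from enum import Enum


class Tool(Enum):
    """The three ingestion tools described above (hypothetical enum)."""
    PLATFORM = "platform"
    INGEST_CLI = "ingest-cli"
    INGEST_PYTHON = "ingest-python"


class Location(Enum):
    """The two kinds of source or destination location (hypothetical enum)."""
    CLOUD = "cloud"
    LOCAL = "local"


def locations_allowed(tool: Tool, source: Location, destination: Location) -> bool:
    """Return True if the tool accepts this source and destination pairing."""
    if tool is Tool.PLATFORM:
        # The Platform requires cloud storage on both ends.
        return source is Location.CLOUD and destination is Location.CLOUD
    # The Ingest CLI and Ingest Python library accept cloud or local on either end.
    return True


if __name__ == "__main__":
    print(locations_allowed(Tool.PLATFORM, Location.LOCAL, Location.CLOUD))
    print(locations_allowed(Tool.INGEST_CLI, Location.LOCAL, Location.CLOUD))
```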

Ingestion options for the Unstructured service

In this flow, files are sent to Unstructured for processing, and Unstructured delivers the processed data:

  • This flow always happens for the Unstructured Platform. The Platform only allows sending files from cloud storage and sending processed data to cloud storage.

  • For the Unstructured Ingest CLI or the Unstructured Ingest Python library, to use this flow:

    • When using the Unstructured Ingest CLI, include the --partition-by-api option and set --api-key and --partition-endpoint to a valid, matching Unstructured API key and API URL, respectively.
    • When using the Unstructured Ingest Python library, set partition_by_api=True, and set api_key and partition_endpoint to a valid, matching Unstructured API key and API URL, respectively.

Local ingestion options

In this flow, files are processed locally. No files are sent to Unstructured for processing:

  • This flow never happens for the Unstructured Platform. The Platform does not allow sending files from a local source to Unstructured, or Unstructured sending processed data to a local destination.

  • For the Unstructured Ingest CLI or the Unstructured Ingest Python library, to use this flow:

    • When using the Unstructured Ingest CLI, omit the --partition-by-api, --api-key, and --partition-endpoint options.
    • When using the Unstructured Ingest Python library, omit partition_by_api or explicitly set partition_by_api=False. Also omit api_key and partition_endpoint.
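
The two flows differ only in three partitioning settings, so the choice between them can be made in one place. The following is a minimal sketch under stated assumptions: partition_settings and to_cli_flags are hypothetical helpers written for illustration, not functions from the Unstructured Ingest Python library. Only the setting names (partition_by_api, api_key, partition_endpoint) and the matching CLI options come from the descriptions above. The sketch does not show where the settings are passed in a real pipeline.

```python
from typing import Optional


def partition_settings(api_key: Optional[str] = None,
                       partition_endpoint: Optional[str] = None) -> dict:
    """Return the partitioning settings for one of the two flows (hypothetical helper)."""
    if api_key and partition_endpoint:
        # Unstructured service flow: all three settings are present.
        return {
            "partition_by_api": True,
            "api_key": api_key,
            "partition_endpoint": partition_endpoint,
        }
    if api_key or partition_endpoint:
        # The key and URL must match, so supplying only one is treated as an error.
        raise ValueError("Provide both api_key and partition_endpoint, or neither.")
    # Local flow: all three settings are omitted.
    return {}


def to_cli_flags(settings: dict) -> list:
    """Convert the settings to the equivalent Ingest CLI options (hypothetical helper)."""
    flags = []
    for name, value in settings.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            flags.append(flag)  # boolean switch, for example --partition-by-api
        else:
            flags.extend([flag, str(value)])
    return flags
```

Calling partition_settings() with no arguments yields an empty dictionary and therefore no CLI options, which corresponds to the local flow. Supplying both values yields the settings for the Unstructured service flow.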

Unstructured Ingest CLI

The Unstructured Ingest CLI enables you to use command-line scripts to get all of your data ready for RAG and model fine-tuning.

One approach to using the CLI is installing Python and then running the following command to install the CLI:

pip install unstructured-ingest

This default installation option enables the ingestion of plain text files, HTML, XML, JSON and emails that do not require any extra dependencies. This default option also enables you to specify local source and destination locations.

You might also need to install additional dependencies, depending on your needs. Learn more.

For additional installation options, see:

To display the list of available source connector commands, run the following command:

unstructured-ingest --help

To display the list of available destination connector commands, run the following command:

unstructured-ingest local --help

To display help for a specific source connector command, run the following command:

unstructured-ingest <command-name> --help

To display help for a specific destination connector command, run the following command:

unstructured-ingest local <command-name> --help

To begin using the CLI, see the quickstarts for the:

To migrate from older, deprecated versions of the Ingest CLI that used pip install unstructured, see the migration guide.

Unstructured Ingest Python library

The Unstructured Ingest Python library enables you to use Python code to get all of your data ready for RAG and model fine-tuning.

The following 3-minute video shows how to use the Unstructured Ingest Python library to send multiple PDFs from a local directory in batches to be ingested by Unstructured API services for processing:

One approach to using the Unstructured Ingest Python library is installing Python and then running the following command to install the library and the default connectors:

pip install unstructured-ingest

This default installation option enables the ingestion of plain text files, HTML, XML, JSON and emails that do not require any extra dependencies. This default option also enables you to specify local source and destination locations.
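
After installing, you may want to confirm that the package is importable from the Python environment you intend to use. The following is a minimal sketch that relies only on the standard library: is_installed is a hypothetical helper name, and unstructured_ingest is the import name mentioned in the migration guide on this page.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("unstructured_ingest"):
        print("unstructured_ingest is available in this environment.")
    else:
        print("unstructured_ingest was not found; run: pip install unstructured-ingest")
```

The check only reports whether the top-level package is present. It does not verify that any additional connector dependencies are installed.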

You might also need to install additional dependencies, depending on your needs. Learn more.

For additional installation options, see:

Some source and destination connectors provide newer v2 and older v1 implementations, while some provide only older v1 implementations. You should use the v2 implementations wherever they are available, to help ensure better forward-compatibility of your code. For the lists of available v2 and v1 connectors, see:

To begin using the Unstructured Ingest Python library, see the code examples for the source and destination connectors.

To migrate from older, deprecated versions of the Ingest Python library that used pip install unstructured, see the migration guide.

Migration guide

The older unstructured versions of the Unstructured Ingest CLI and Unstructured Ingest Python library have been replaced and are now deprecated.

To migrate to the newer unstructured-ingest versions of the Ingest CLI and Ingest Python library, do the following:

  1. If you previously ran pip install unstructured only for the purposes of using the Ingest CLI or the Ingest Python library, upgrade to the latest versions by running the following commands:

    a. pip uninstall unstructured
    b. pip install unstructured-ingest

  2. If you previously installed an older version of a source or destination connector, for example pip install "unstructured[azure]" for the Azure Storage connector, upgrade to the latest version by running the following commands:

    a. pip uninstall "unstructured[azure]"
    b. pip install "unstructured-ingest[azure]"

  3. If you were running Python code against an older version of the Ingest Python library, update your import statements by replacing all instances of unstructured.ingest with unstructured_ingest to run against the latest version.
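
The import replacement in step 3 can be scripted. The following is a minimal sketch, assuming a plain textual rename is sufficient for your code: migrate_imports is a hypothetical helper, and some_module in the usage note is a placeholder rather than a real module path. The sketch renames only the namespace and does not handle any other changes between versions.

```python
import re

# Match the old namespace as a whole dotted name, so that unrelated names
# such as unstructured.ingestion or my_unstructured.ingest are left alone.
_OLD_NAMESPACE = re.compile(r"\bunstructured\.ingest\b")


def migrate_imports(source: str) -> str:
    """Replace every instance of unstructured.ingest with unstructured_ingest."""
    return _OLD_NAMESPACE.sub("unstructured_ingest", source)
```

For example, migrate_imports("from unstructured.ingest.some_module import Thing") returns "from unstructured_ingest.some_module import Thing", while imports from other parts of the unstructured package are returned unchanged. Review the result before committing it, because the replacement also applies to matching text in comments and strings.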
