Overview
You can send multiple files in batches to be ingested by Unstructured API services for processing. Ingestion is the term that Unstructured uses for the set of activities that happen when files are input for processing.
You can send batches to Unstructured API services by using the following tools:
- The Unstructured Ingest CLI
- The Unstructured Ingest Python library
The following 3-minute video shows how to use the Unstructured Ingest Python library to send multiple PDFs from a local directory in batches to be ingested by Unstructured API services for processing:
Ingest flow
The Unstructured ingest flow is similar to an extract, transform and load (ETL) data pipeline. Because of this, a customer-defined implementation of the Unstructured ingest flow is sometimes referred to as an ingest pipeline or simply a pipeline. An Unstructured ingest pipeline contains the following logical steps:
Index
Reaches out to the source location and pulls in metadata for each document.
For example, this could include information such as the path to the files to be analyzed.
- For the Unstructured CLI, you can control this behavior, where available for a connector, through its --input-path command option.
- For the Unstructured Ingest Python library’s v2 calling pattern, you can control this behavior, where available for a connector, through its <Prefix>IndexerConfig class (where <Prefix> represents the connector provider’s name, such as Azure for Azure).
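As a rough illustration of what indexing a local source involves, the sketch below walks a directory and records metadata for each file without reading its contents. It uses only the Python standard library. The function name `index_local_files` and the record fields are hypothetical and are not part of the Unstructured Ingest API.

```python
from pathlib import Path


def index_local_files(input_path: str, recursive: bool = True) -> list[dict]:
    """Collect metadata (not content) for each file under input_path.

    Hypothetical helper for illustration; not an Unstructured Ingest function.
    """
    root = Path(input_path)
    candidates = root.rglob("*") if recursive else root.glob("*")
    records = []
    for path in sorted(candidates):
        if path.is_file():
            stat = path.stat()
            records.append(
                {
                    "path": str(path),
                    "name": path.name,
                    "size_bytes": stat.st_size,
                    "modified": stat.st_mtime,
                }
            )
    return records
```

Later steps can then decide what to download from these records alone, which is why indexing is kept separate from downloading.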
Post-Index Filter
After indexing, you might not want to download everything that was indexed.
For example, you might want to download only files that match specific types, file names, paths, or sizes.
For the Unstructured Ingest Python library’s v2 calling pattern, you can control this behavior through the FiltererConfig class.
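To make the idea of filtering indexed entries concrete, here is a minimal sketch that keeps only entries matching a file-name pattern and a size limit. It is a standard-library illustration of the concept, not the FiltererConfig implementation. The function `filter_index`, its parameters, and the record fields are assumptions made for this example.

```python
from fnmatch import fnmatch


def filter_index(records, file_glob=None, max_file_size=None):
    """Keep only indexed records that pass every filter that was supplied.

    Hypothetical helper; the parameter names are assumptions for illustration.
    """
    kept = []
    for record in records:
        # Name filter: the record must match at least one glob pattern.
        if file_glob and not any(fnmatch(record["name"], g) for g in file_glob):
            continue
        # Size filter: skip anything larger than the limit (in bytes).
        if max_file_size is not None and record["size_bytes"] > max_file_size:
            continue
        kept.append(record)
    return kept


indexed = [
    {"name": "report.pdf", "size_bytes": 120_000},
    {"name": "notes.txt", "size_bytes": 900},
    {"name": "scan.pdf", "size_bytes": 50_000_000},
]
selected = filter_index(indexed, file_glob=["*.pdf"], max_file_size=1_000_000)
print([r["name"] for r in selected])  # ['report.pdf']
```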
Download
Using the information generated from the indexer and the filter, downloads the content as files on the local file system for processing. This may require manipulation of the data to prepare it for partitioning.
For example, this could include information such as the path to a local directory to download files to.
- For the Unstructured CLI, you can control this behavior through a connector’s --download-dir command option.
- For the Unstructured Ingest Python library’s v2 calling pattern, you can control this behavior through a connector’s <Prefix>DownloaderConfig class.
Post-Download Filter
After downloading, if uncompression is enabled, you might not want to uncompress everything that was downloaded. The filter that was defined for the Post-Index Filter step is applied again here.
Uncompress
If enabled, searches for any compressed files (Unstructured supports TAR and ZIP) and uncompresses them.
- For the Unstructured CLI, you can control this behavior through the --uncompress command option.
- For the Unstructured Ingest Python library’s v2 calling pattern, you can control this behavior through the UncompressConfig class.
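The following sketch shows one way an uncompress step could find and extract ZIP and TAR archives in a download directory. It relies only on Python's `zipfile` and `tarfile` modules. The function `uncompress_all` and the `_extracted` directory naming are assumptions for illustration and do not describe how Unstructured lays out extracted files.

```python
import tarfile
import zipfile
from pathlib import Path


def uncompress_all(download_dir: str) -> list[Path]:
    """Extract every ZIP and TAR archive found under download_dir.

    Each archive is extracted into a sibling directory named after the
    archive. Hypothetical helper; only extract archives that you trust.
    """
    extracted_dirs = []
    # sorted() materializes the listing first, so directories created
    # during extraction are not visited in the same pass.
    for path in sorted(Path(download_dir).rglob("*")):
        if not path.is_file():
            continue
        target = path.with_name(path.name + "_extracted")
        if zipfile.is_zipfile(path):
            with zipfile.ZipFile(path) as archive:
                archive.extractall(target)
        elif tarfile.is_tarfile(path):
            with tarfile.open(path) as archive:
                archive.extractall(target)
        else:
            continue
        extracted_dirs.append(target)
    return extracted_dirs
```

Files that are neither ZIP nor TAR archives are left untouched, so the step is safe to run over a mixed download directory.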
Post-Uncompress Filter
After downloading (and uncompressing, if enabled), you might not want to partition every resulting file. The filter that was defined for the Post-Index Filter step is applied once more here.
Partition
Generates the structured, enriched content from the local files that have been downloaded, uncompressed if enabled, and filtered. Both local partitioning and partitioning through Unstructured API services are supported. API services-based partitioning runs asynchronously, and local partitioning runs through multiprocessing.
- For the Unstructured Ingest Python library’s v2 calling pattern, you can control this behavior through the PartitionerConfig class.
- For the Unstructured Ingest Python library’s v1 calling pattern, you can control this behavior through the PartitionConfig class.
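The difference between the two execution models can be sketched as follows: CPU-bound local work is spread across worker processes, while network-bound API calls are awaited concurrently in a single process. This is a generic standard-library illustration, not the library's internals. `partition_locally` and `partition_via_api` are hypothetical stand-ins that return placeholder results instead of real partitioned elements.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def partition_locally(path: str) -> dict:
    """Stand-in for CPU-bound local partitioning of one file (hypothetical)."""
    return {"source": path, "mode": "local"}


async def partition_via_api(path: str) -> dict:
    """Stand-in for a network call to a partitioning API (hypothetical)."""
    await asyncio.sleep(0)  # placeholder for waiting on the HTTP response
    return {"source": path, "mode": "api"}


def run_local(paths: list[str], num_processes: int = 2) -> list[dict]:
    """Multiprocessing: each file is partitioned in a worker process."""
    with ProcessPoolExecutor(max_workers=num_processes) as pool:
        return list(pool.map(partition_locally, paths))


async def run_remote(paths: list[str], max_concurrency: int = 4) -> list[dict]:
    """Async: many requests are in flight at once, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path: str) -> dict:
        async with semaphore:
            return await partition_via_api(path)

    return await asyncio.gather(*(bounded(p) for p in paths))
```

In a script, `asyncio.run(run_remote(paths))` drives the asynchronous variant. Calls to `run_local` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.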
Chunk
Optionally, chunks the partitioned content. Chunking can also be run locally or through Unstructured API services, with asynchronous or multiprocessing behavior chosen in the same way as for the partitioner.
- For the Unstructured Ingest Python library’s v2 calling pattern, you can control this behavior through the ChunkerConfig class.
- For the Unstructured Ingest Python library’s v1 calling pattern, you can control this behavior through the ChunkingConfig class.
Embed
Optionally, generates vector embeddings for each element in the structured output. Most of the time, this is done through API calls to a third-party embedding vendor and therefore runs asynchronously. But it can also use a locally available Hugging Face model, which will run through multiprocessing.
- For the Unstructured Ingest Python library’s v2 calling pattern, you can control this behavior through the EmbedderConfig class.
- For the Unstructured Ingest Python library’s v1 calling pattern, you can control this behavior through the EmbeddingConfig class.
Stage
This is an optional step that does not apply most of the time. However, sometimes the data needs to be modified from the existing structure to better support the upload step, such as converting it to a CSV file for tabular-based destinations.
For the Unstructured Ingest Python library’s v2 calling pattern, you can control this behavior, where available for a connector, through its <Prefix>UploadStagerConfig class.
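As an example of the kind of reshaping a stage step might perform, the sketch below flattens a list of element dictionaries into CSV text for a tabular destination. It uses only the standard library. The function `stage_elements_as_csv`, the sample element fields, and the choice to serialize nested values as JSON strings are all assumptions for illustration, not the behavior of any particular connector.

```python
import csv
import io
import json


def stage_elements_as_csv(elements: list[dict]) -> str:
    """Flatten element dicts into CSV text (hypothetical helper).

    Nested dict or list values are stored as JSON strings, and keys
    missing from an element are written as empty cells.
    """
    columns = sorted({key for element in elements for key in element})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for element in elements:
        row = {
            key: json.dumps(value) if isinstance(value, (dict, list)) else value
            for key, value in element.items()
        }
        writer.writerow(row)
    return buffer.getvalue()


sample = [
    {"type": "Title", "text": "Report", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Hello, world."},
]
csv_text = stage_elements_as_csv(sample)
```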
Upload
Writes the local content to the destination. If no destination is provided, the local one will be used, which writes the final result to a location on the local filesystem. If batch uploads are needed, this will run in a single process with access to all documents. If batch uploads are not supported, all documents can be uploaded concurrently by using the asynchronous approach.
For the Unstructured Ingest Python library’s v2 calling pattern, you can control this behavior through a connector’s <Prefix>UploaderConfig class.
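The two upload modes described above can be sketched like this: a destination that takes batches receives all documents in one call, while a destination without batch support receives one document per call, with the calls issued concurrently. Both destination classes and the `upload` function are hypothetical and exist only to illustrate the control flow.

```python
import asyncio


class BatchDestination:
    """Hypothetical destination that accepts all documents in one call."""

    def __init__(self):
        self.calls = []

    def write_batch(self, documents):
        self.calls.append(list(documents))


class PerDocumentDestination:
    """Hypothetical destination that accepts one document per call."""

    def __init__(self):
        self.received = []

    async def write_one(self, document):
        await asyncio.sleep(0)  # placeholder for a network round trip
        self.received.append(document)


def upload(documents, destination):
    """Use one batch call when available, else upload concurrently."""
    if hasattr(destination, "write_batch"):
        # Batch mode: a single process sees every document at once.
        destination.write_batch(documents)
    else:
        # Per-document mode: all uploads are awaited concurrently.
        async def upload_all():
            await asyncio.gather(*(destination.write_one(d) for d in documents))

        asyncio.run(upload_all())
```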
Learn more
- Ingest configuration settings enable you to control how batches are sent and processed.
- Source connectors enable you to send batches from local or remote locations to be ingested by Unstructured API services for processing.
- Destination connectors enable Unstructured API services to send the processed data to local or remote locations.
See also
- The Unstructured Platform enables you to send batches to Unstructured API services from remote locations, and to have Unstructured API services send the processed data to remote locations, all without using code or a CLI.