Process files in batches by using the Unstructured Ingest Python library
The Unstructured Ingest Python library enables you to use Python code to send files in batches to Unstructured for processing, and to tell Unstructured where to deliver the processed data.
The Unstructured Ingest Python library does not work with the Unstructured API.
For information about the Unstructured API, see the Unstructured API Overview.
The following 3-minute video shows how to use the Unstructured Ingest Python library to send multiple PDFs from a local directory in batches to be ingested by Unstructured for processing:
Getting started
You can use the Unstructured Ingest Python library to process files locally, or you can use the Ingest Python library to send files in batches to Unstructured for processing.
Local processing does not use an Unstructured API key or API URL.
Using the Ingest Python library to send files in batches to Unstructured for processing is more robust but requires an Unstructured API key and API URL, as follows:
If you signed up for Unstructured through the For Enterprise page, or if you are using a self-hosted deployment of Unstructured, the following information about signing up, signing in, and getting your Unstructured API key might apply differently to you. For details, contact Unstructured Sales at sales@unstructured.io.
- Go to https://platform.unstructured.io and use your email address, Google account, or GitHub account to sign up for an Unstructured account (if you do not already have one) and sign into the account at the same time. The Unstructured user interface (UI) appears.
- Get your Unstructured API key:
  a. In the Unstructured UI, click API Keys on the sidebar.
  b. Click Generate API Key.
  c. Follow the on-screen instructions to finish generating the key.
  d. Click the Copy icon next to your new key to add the key to your system’s clipboard. If you lose this key, simply return and click the Copy icon again.
By following the preceding instructions, you are signed up for a Developer pay per page account by default.
To save money, consider switching to a Subscribe & Save account instead. To save even more money, consider switching to an Enterprise account instead.
- The default Unstructured API URL for Unstructured Ingest is https://api.unstructuredapp.io/general/v0/general, which is the API URL for the Unstructured Partition Endpoint. You must specify this API URL in your code only if you are not using this default, for example, if you are calling a version of the Unstructured API that is hosted on your own compute infrastructure.
If the Unstructured API is hosted on your own compute infrastructure, the process for generating Unstructured API keys, and the Unstructured API URL that you use, are different. For details, contact Unstructured Sales at sales@unstructured.io.
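One way to keep the API key out of source code is to read it from the environment and fall back to the default API URL only when no override is set. The following is a minimal sketch, not part of the library: the function name load_api_settings and the environment variable names UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL are assumptions chosen for illustration.

```python
import os

# Default Partition Endpoint URL, as stated in this section.
DEFAULT_API_URL = "https://api.unstructuredapp.io/general/v0/general"


def load_api_settings(env=None):
    """Return (api_key, api_url) read from environment-style variables.

    The variable names are illustrative assumptions, not names the
    library requires.
    """
    env = os.environ if env is None else env
    api_key = env.get("UNSTRUCTURED_API_KEY")
    if not api_key:
        raise RuntimeError(
            "Set UNSTRUCTURED_API_KEY before sending files to Unstructured."
        )
    # Use the default URL unless an override (for example, a self-hosted
    # deployment) is supplied.
    api_url = env.get("UNSTRUCTURED_API_URL") or DEFAULT_API_URL
    return api_key, api_url
```

Passing a plain dictionary as env makes the function easy to exercise without touching the real environment.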
Installation
One approach to get started quickly with the Unstructured Ingest Python library is to install Python and then run the following command:
pip install unstructured-ingest
This default installation option enables the ingestion of plain text files, HTML, XML, JSON and emails that do not require any extra dependencies. This default option also enables you to specify local source and destination locations.
You might also need to install additional dependencies, depending on your needs. Learn more.
For additional installation options, see the Unstructured Ingest Python library in the Ingest section.
To migrate from an older installation that used pip install unstructured, see the migration guide.
Usage
For example, to use the Unstructured Ingest Python library to ingest files from a local source (input) location and to deliver the processed data to an Azure Storage account destination (output) location:
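A minimal sketch of such a pipeline follows. It assumes the module layout of version 0.7.0 and later (no .v2 in the import paths) and that the Azure connector dependencies are installed in addition to the default installation. The function name run_local_to_azure and the environment variable names are hypothetical; check the class and parameter names against the connector code examples for your installed version before relying on them.

```python
import os


def run_local_to_azure(input_dir, remote_url, env=None):
    """Partition files under input_dir through the Unstructured API and
    upload the results to an Azure Storage location (sketch only)."""
    env = os.environ if env is None else env

    # Imported at call time so that a missing or older installation fails
    # here with an ImportError. Paths assume version 0.7.0 or later.
    from unstructured_ingest.interfaces import ProcessorConfig
    from unstructured_ingest.pipeline.pipeline import Pipeline
    from unstructured_ingest.processes.connectors.fsspec.azure import (
        AzureAccessConfig,
        AzureConnectionConfig,
        AzureUploaderConfig,
    )
    from unstructured_ingest.processes.connectors.local import (
        LocalConnectionConfig,
        LocalDownloaderConfig,
        LocalIndexerConfig,
    )
    from unstructured_ingest.processes.partitioner import PartitionerConfig

    Pipeline.from_configs(
        context=ProcessorConfig(),
        # Source (input): a local directory.
        indexer_config=LocalIndexerConfig(input_path=input_dir),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        # Send files to Unstructured instead of processing them locally.
        # When the URL variable is unset, None is passed and the library
        # default is expected to apply.
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=env.get("UNSTRUCTURED_API_KEY"),
            partition_endpoint=env.get("UNSTRUCTURED_API_URL"),
        ),
        # Destination (output): an Azure Storage account.
        destination_connection_config=AzureConnectionConfig(
            access_config=AzureAccessConfig(
                account_name=env.get("AZURE_STORAGE_ACCOUNT_NAME"),
                account_key=env.get("AZURE_STORAGE_ACCOUNT_KEY"),
            )
        ),
        uploader_config=AzureUploaderConfig(remote_url=remote_url),
    ).run()
```

A call might look like run_local_to_azure("/path/to/input", "az://my-container/output"), where both arguments are placeholders for your own locations.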
To learn how to use the Unstructured Ingest Python library to work with a specific source (input) and destination (output) location, see the Python code examples for the source and destination connectors that are available for you to choose from.
See also the ingest configuration settings that enable you to further control how batches are sent and processed.
Migration guide
The Unstructured Ingest Python library introduced breaking changes in version 0.7.0 as follows:
- The library has moved the calling pattern (formerly commonly referred to as “v2”) up one level in the package. If you are using this calling pattern in your existing code, beginning with version 0.7.0 you must remove .v2 from your related from...import statements. For example, a statement such as this:
  from unstructured_ingest.v2.pipeline.pipeline import Pipeline
  must change to this, with .v2 removed:
  from unstructured_ingest.pipeline.pipeline import Pipeline
  If you do not make these changes, your existing code will no longer work beginning with version 0.7.0. To run your existing code without making these changes, you must pin your installed library version to 0.6.4 or earlier.
- The earlier calling pattern (commonly referred to as “v1”) has been deprecated and removed from the library and will no longer work beginning with version 0.7.0. If you are using this earlier “v1” calling pattern in your code, you must either manually migrate your code to the new calling pattern (formerly commonly referred to as “v2”), or pin your installed library version to 0.6.4 or earlier to keep using this earlier “v1” calling pattern.
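For the first change above, removing .v2 from import statements is mechanical enough to script. The following sketch is a hypothetical helper, not a tool that ships with the library. It rewrites only lines that start with from or import followed by unstructured_ingest.v2, and leaves every other line untouched. It does not help with code written in the earlier “v1” pattern, which needs a manual migration.

```python
import re

# Match ".v2" only when it directly follows "unstructured_ingest" at the
# start of an import statement, and only as a whole path component.
_V2_IMPORT = re.compile(
    r"^([ \t]*(?:from|import)[ \t]+unstructured_ingest)\.v2(?=[.\s]|$)",
    re.MULTILINE,
)


def strip_v2_from_imports(source: str) -> str:
    """Return source with '.v2' removed from unstructured_ingest imports."""
    return _V2_IMPORT.sub(r"\1", source)


old_code = (
    "from unstructured_ingest.v2.pipeline.pipeline import Pipeline\n"
    "from unstructured_ingest.v2.interfaces import ProcessorConfig\n"
)
new_code = strip_v2_from_imports(old_code)
```

Review the rewritten files before committing them, because the pattern does not cover dynamic imports or module paths that appear inside strings.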
To learn how to use the new calling pattern (formerly commonly referred to as “v2”), see the code examples in the source connectors and destination connectors documentation.
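Whether these breaking changes affect an environment depends on the installed version. The following sketch uses only the Python standard library to compare a version string with 0.7.0. The helper names are hypothetical, and the parsing is deliberately simple: it reads the first three numeric components and ignores pre-release suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# First version with the breaking changes described in this section.
FIRST_BREAKING = (0, 7, 0)


def release_tuple(version_string):
    """Parse 'X.Y.Z...' into a 3-tuple of ints, padding with zeros."""
    numbers = []
    for piece in version_string.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def has_breaking_changes(version_string):
    """True when the version is 0.7.0 or later."""
    return release_tuple(version_string) >= FIRST_BREAKING


def installed_ingest_version():
    """Return the installed unstructured-ingest version, or None."""
    try:
        return version("unstructured-ingest")
    except PackageNotFoundError:
        return None
```

If installed_ingest_version() returns a version for which has_breaking_changes is true and your code still uses the older import paths, either migrate the code or pin the library with a requirement such as unstructured-ingest<=0.6.4.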