The Python and TypeScript SDK clients allow for easy interaction with the Unstructured API. Whether you’re using the free Unstructured API, the Unstructured Serverless API, Unstructured API on Azure/AWS or your local deployment of Unstructured API, you can access the API using the Python or TypeScript SDK.

Installation

Before using the SDK to interact with your Unstructured API, install the library:

pip install unstructured-client

The SDKs use semantic versioning and major bumps could bring breaking changes. It is advised to pin your installed version.

Basics

Let’s start with a simple example in which you send a PDF document to be partitioned via the free Unstructured API:

The TypeScript SDK has the following breaking changes in v0.11.0:

  • Imports under the dist path have moved up a level
  • Enums are now used for parameters with a set of options
    • This includes chunkingStrategy, outputFormat, and strategy
  • All parameters to partition have moved to a partitionParameters object

Python SDK Deprecation Warning (>v0.22.0): Passing shared.PartitionParameters directly to client.general.partition() is still supported, but this legacy method may be deprecated and could break in future releases. Migrate to the operations.PartitionRequest object, as used in the Python example in this section, to stay compatible with future updates.

By default, the SDKs send requests to the free API. To access the paid API, pass the server URL when you create the client. If you receive authentication errors and have checked that your key is correct, make sure that you are calling the correct endpoint.

import unstructured_client
from unstructured_client.models import operations, shared

client = unstructured_client.UnstructuredClient(
    api_key_auth="YOUR_API_KEY",
    # server_url="YOUR_API_URL",
)

filename = "PATH_TO_FILE"
with open(filename, "rb") as f:
    data = f.read()

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=data,
            file_name=filename,
        ),
        # --- Other partition parameters ---
        # Note: Defining `strategy`, `chunking_strategy`, and `output_format`
        # parameters as strings is accepted, but will not pass strict type checking. It is
        # advised to use the defined enum classes as shown below.
        strategy=shared.Strategy.AUTO,
        languages=['eng'],
    ),
)

try:
    res = client.general.partition(request=req)
    print(res.elements[0])
except Exception as e:
    print(e)

Note that currently, the SDKs only support sending one file at a time.
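
Because each request carries a single file, processing several files means looping and sending one request per file. The following is a minimal sketch of that loop, not part of the SDK: `partition_many` and the `partition_one` callable are hypothetical names, and `partition_one` is assumed to wrap the `client.general.partition` call shown above and return the list of elements for one file.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical signature: takes raw bytes and a file name, and returns
# the list of element dicts for that one file.
PartitionOne = Callable[[bytes, str], List[dict]]


def partition_many(
    paths: Iterable[str], partition_one: PartitionOne
) -> Tuple[Dict[str, List[dict]], Dict[str, str]]:
    """Send files one at a time; collect results and per-file errors."""
    results: Dict[str, List[dict]] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            data = Path(path).read_bytes()
            results[path] = partition_one(data, path)
        except Exception as exc:  # keep going if one file fails
            errors[path] = str(exc)
    return results, errors
```

In real use, `partition_one` would build an `operations.PartitionRequest` from the bytes and file name and return `res.elements`.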

Unstructured Serverless API

If you want to use the Unstructured Serverless API, all you need to do is initialize the SDK client with the unique API URL that you received in the same email as your API key. For Unstructured API on Azure/AWS, use the API URL that you configured through those services.

client = unstructured_client.UnstructuredClient(
    api_key_auth="YOUR_API_KEY",
    server_url="YOUR_API_URL",
)
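
To avoid hard-coding the key and URL, one option is to read them from environment variables. This is a sketch under stated assumptions: the variable names `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL` and the helper `client_kwargs_from_env` are illustrative choices, not something the SDK defines.

```python
import os
from typing import Dict, Mapping, Optional


def client_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build keyword arguments for UnstructuredClient from environment variables."""
    env = os.environ if env is None else env
    # Assumed variable names; use whatever your deployment defines.
    api_key = env.get("UNSTRUCTURED_API_KEY")
    if not api_key:
        raise RuntimeError("UNSTRUCTURED_API_KEY is not set")
    kwargs = {"api_key_auth": api_key}
    # Pass server_url only when one is configured, so that the SDK
    # default (the free API) applies otherwise.
    server_url = env.get("UNSTRUCTURED_API_URL")
    if server_url:
        kwargs["server_url"] = server_url
    return kwargs
```

The result can then be unpacked into the constructor, for example `unstructured_client.UnstructuredClient(**client_kwargs_from_env())`.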

Page Splitting

To speed up processing of large PDF files, the splitPdfPage* parameter is true by default. The client splits the PDF into small batches of pages before sending requests to the API, awaits all parallel requests, and combines the responses into a single response object. Splitting applies only to PDF files; the parameter is ignored for other file types.

The number of parallel requests is controlled by splitPdfConcurrencyLevel*. The default is 8 and the max is set to 15 to avoid high resource usage and costs.

If at least one request is successful, the responses are combined into a single response object. An error is returned only if all requests failed or there was an error during splitting.
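
The split, parallel-request, and combine behavior described above can be modeled roughly as follows. This is an illustration of the idea only and not the SDK's actual implementation: the batch size, the function names, and the `partition_batch` callable (which stands in for one API request covering a page range) are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

PageRange = Tuple[int, int]  # inclusive (first_page, last_page)


def page_batches(num_pages: int, batch_size: int) -> List[PageRange]:
    """Split pages 1..num_pages into consecutive inclusive ranges."""
    return [
        (start, min(start + batch_size - 1, num_pages))
        for start in range(1, num_pages + 1, batch_size)
    ]


def partition_in_batches(
    num_pages: int,
    batch_size: int,
    concurrency: int,
    partition_batch: Callable[[PageRange], List[dict]],
) -> List[dict]:
    """Run batches in parallel and merge successful responses in page order."""
    batches = page_batches(num_pages, batch_size)

    def attempt(batch: PageRange):
        try:
            return partition_batch(batch), None
        except Exception as exc:
            return None, exc

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = list(pool.map(attempt, batches))  # map preserves batch order

    failures = [exc for _, exc in outcomes if exc is not None]
    # Raise only when every batch failed; partial success is returned.
    if batches and len(failures) == len(batches):
        raise RuntimeError("all batch requests failed") from failures[0]
    return [el for elements, exc in outcomes if exc is None for el in elements]
```

Note that in this model a failed batch silently drops its pages from the combined result, which is why checking the returned elements for gaps can be worthwhile.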

This feature may lead to unexpected results when chunking because the server does not see the entire document context at once. If you’d like to chunk across the whole document and still get the speedup from parallel processing, you can:

  • Partition the PDF with splitPdfPage set to true, without any chunking parameters
  • Store the returned elements in results.json
  • Partition this JSON file with the desired chunking parameters

filename = "PATH_TO_FILE"
with open(filename, "rb") as f:
    data = f.read()

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=data,
            file_name=filename,
        ),
        split_pdf_page=True,  # Set to `False` to disable PDF splitting
        split_pdf_concurrency_level=10,  # Number of parallel requests
    )
)
res = client.general.partition(request=req)
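
The three-step workaround for chunking across the whole document can be sketched as below. The helper `partition_then_chunk` and the `partition_fn` callable are hypothetical: `partition_fn` is assumed to wrap `client.general.partition`, accept extra partition parameters as keyword arguments, and return the elements as a list of dicts.

```python
import json
from pathlib import Path
from typing import Any, Callable, List

# Hypothetical wrapper: (content, file_name, **partition parameters)
# -> list of element dicts. In practice it would call the SDK client.
PartitionFn = Callable[..., List[dict]]


def partition_then_chunk(
    pdf_path: str,
    json_path: str,
    partition_fn: PartitionFn,
    **chunking_params: Any,
) -> List[dict]:
    # Step 1: partition the PDF with page splitting on and no chunking.
    elements = partition_fn(
        Path(pdf_path).read_bytes(), pdf_path, split_pdf_page=True
    )
    # Step 2: store the returned elements as JSON.
    Path(json_path).write_text(json.dumps(elements), encoding="utf-8")
    # Step 3: partition the JSON file with the chunking parameters, so the
    # server sees every element of the document in a single request.
    return partition_fn(Path(json_path).read_bytes(), json_path, **chunking_params)
```

A call might look like `partition_then_chunk("PATH_TO_FILE", "results.json", partition_fn, chunking_strategy="by_title")`.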

Customizing the Client

Retries

You can also change the retry defaults through the retryConfig* parameter when initializing the client. If a request to the API fails, the client retries it with an exponential backoff strategy, up to a maximum interval of one minute between attempts. The client keeps retrying until the total elapsed time exceeds maxElapsedTime*, which defaults to one hour:

from unstructured_client import UnstructuredClient
from unstructured_client.utils import BackoffStrategy, RetryConfig

client = UnstructuredClient(
    api_key_auth="YOUR_API_KEY",
    retry_config=RetryConfig(
        strategy="backoff",
        retry_connection_errors=True,
        backoff=BackoffStrategy(
            # time intervals are defined in milliseconds
            initial_interval=500,
            max_interval=60000,
            exponent=1.5,
            max_elapsed_time=900000,  # 15 min * 60 s * 1000 ms = 15 minutes
        ),
    )
)
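
To get a feel for what these settings mean, the waits between retries can be approximated with a small model. This is a simplified sketch, not the SDK's retry code: it assumes each wait is the initial interval multiplied by the exponent raised to the attempt number, capped at the maximum interval, and it ignores any random jitter and the time spent on the requests themselves. The function name `backoff_schedule` is illustrative.

```python
from typing import List


def backoff_schedule(
    initial_interval: int,
    max_interval: int,
    exponent: float,
    max_elapsed_time: int,
) -> List[float]:
    """Return the waits (ms) before each retry under capped exponential backoff.

    Simplified model: no jitter, and request time itself is not counted.
    """
    if initial_interval <= 0 or max_interval <= 0:
        raise ValueError("intervals must be positive")
    waits: List[float] = []
    elapsed = 0.0
    attempt = 0
    while True:
        wait = min(initial_interval * exponent**attempt, max_interval)
        if elapsed + wait > max_elapsed_time:
            break  # the next wait would exceed the total time budget
        waits.append(wait)
        elapsed += wait
        attempt += 1
    return waits


# Same values as the configuration above.
schedule = backoff_schedule(500, 60000, 1.5, 900000)
```

With these inputs the waits start at 500 ms, grow by a factor of 1.5 on each attempt, and level off once they reach the 60000 ms cap.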

Disabling SSL Validation

If you disable SSL validation, requests will accept any TLS certificate presented by the server and ignore hostname mismatches and expired certificates, which makes your application vulnerable to man-in-the-middle (MitM) attacks. Only set verify to False for testing.

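import requests
from unstructured_client import UnstructuredClient

http_client = requests.Session()
http_client.verify = False  # disables certificate checks; testing only

client = UnstructuredClient(
    client=http_client,
    ...
)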

Handling the Response

The partition response defaults to a dict format that can be converted to Unstructured elements with the elements_from_dicts utility function as seen below. Otherwise, the API response can be sent directly to your vector store or another destination.

from unstructured.staging.base import elements_from_dicts

...

if res.elements is not None:
    elements = elements_from_dicts(res.elements)
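
If you do not need Unstructured element objects, the dicts can also be processed directly. The sketch below groups element text by element type; it assumes each dict carries `type` and `text` keys, and both the helper `text_by_type` and the sample data are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def text_by_type(elements: Iterable[dict]) -> Dict[str, List[str]]:
    """Group element text by the element's `type` field."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for element in elements:
        text = element.get("text", "")
        if text:  # skip elements that carry no text
            grouped[element.get("type", "Unknown")].append(text)
    return dict(grouped)


# Made-up sample in the general shape of the API's element dicts.
sample = [
    {"type": "Title", "text": "Introduction", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "First paragraph.", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Second paragraph.", "metadata": {"page_number": 2}},
]
grouped = text_by_type(sample)
```

In real use, `res.elements` from the partition call would take the place of `sample`.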

Parameters & examples

Parameter names

The parameter names used in this document are for the TypeScript SDK, which follows the camelCase convention; the Python SDK uses snake_case. Apart from this difference in naming convention, the parameter names are the same in both SDKs across all methods.

  • Refer to the API parameters page for the full list of available parameters.
  • Refer to the Examples page for some inspiration on using the parameters.
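
Since the two SDKs differ only in naming convention, a camelCase name from this document can be translated to its Python spelling mechanically. A small sketch follows; the helper `camel_to_snake` is illustrative, not part of either SDK, and it assumes names contain no runs of consecutive capitals.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a camelCase parameter name to snake_case."""
    # Insert an underscore before every capital letter that is not at
    # the start of the name, then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()
```

For example, `camel_to_snake("splitPdfConcurrencyLevel")` gives the `split_pdf_concurrency_level` name used in the Python example in the Page Splitting section.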