
# Python SDK

<Note>
  The following information applies to the legacy Unstructured Partition Endpoint.

  Unstructured recommends that you use the
  [on-demand jobs](/api-reference/workflow/overview#run-an-on-demand-job) functionality in the
  [Unstructured API](/api-reference/overview) instead. Unstructured's on-demand jobs provide
  many benefits over the legacy Unstructured Partition Endpoint, including support for:

  * Production-level usage.
  * Multiple local input files in batches.
  * The latest and highest-performing models.
  * Post-transform enrichments.
  * All of Unstructured's chunking strategies.
  * The generation of vector embeddings.

  The Unstructured API also provides support for processing files and data in remote locations.
</Note>

The [Unstructured Python SDK](https://github.com/Unstructured-IO/unstructured-python-client) lets you send one file at a time for processing by
the [Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview).

To use the Python SDK, you'll first need to set an environment variable named `UNSTRUCTURED_API_KEY`,
representing your Unstructured API key. To get your API key, do the following:

1. If you do not already have an Unstructured account, [sign up for free](https://unstructured.io/?modal=try-for-free).
   After you sign up, you are automatically signed in to your new Unstructured **Let's Go** account, at [https://platform.unstructured.io](https://platform.unstructured.io).

   <Note>
     To sign up for a **Business** account instead, [contact Unstructured Sales](https://unstructured.io/?modal=contact-sales), or [learn more](/api-reference/overview#pricing).
   </Note>

2. If you have an Unstructured **Let's Go**, **Pay-As-You-Go**, or **Business SaaS** account and are not already signed in, sign in to your account at [https://platform.unstructured.io](https://platform.unstructured.io).

   <Note>
     For other types of **Business** accounts, see your Unstructured account administrator for sign-in instructions,
     or email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).
   </Note>

3. Get your Unstructured API key:<br />

   a. After you sign in to your Unstructured **Let's Go**, **Pay-As-You-Go**, or **Business** account, click **API Keys** on the sidebar.<br />

   <Note>
     For a **Business** account, before you click **API Keys**, make sure you have selected the organizational workspace you want to create an API key
     for. Each API key works with one and only one organizational workspace. [Learn more](/ui/account/workspaces#create-an-api-key-for-a-workspace).
   </Note>

   b. Click **Generate API Key**.<br />
   c. Follow the on-screen instructions to finish generating the key.<br />
   d. Click the **Copy** icon next to your new key to add the key to your system's clipboard. If you lose this key, simply return and click the **Copy** icon again.<br />
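With the key copied, set it as the `UNSTRUCTURED_API_KEY` environment variable that the examples on this page read (replace the placeholder with your actual key):

```bash theme={null}
# macOS/Linux: make the key available to the SDK for the current shell session.
export UNSTRUCTURED_API_KEY="YOUR_API_KEY"
```

On Windows, use `set UNSTRUCTURED_API_KEY=YOUR_API_KEY` in Command Prompt, or `$env:UNSTRUCTURED_API_KEY="YOUR_API_KEY"` in PowerShell.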

## Installation

Before using the SDK to interact with Unstructured, install the library:

```bash theme={null}
pip install unstructured-client
```

<Note>
  The SDK uses semantic versioning, and a major version bump could bring breaking changes. It is advised to
  pin your installed version. See the [migration guide](#migration-guide), later on this page, for breaking change announcements.
</Note>

## Basics

Let's start with a simple example in which you send a PDF document to the Unstructured Partition Endpoint to be partitioned by Unstructured.

```python Python theme={null}
import os, json

import unstructured_client
from unstructured_client.models import operations, shared

client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")
)

filename = "PATH_TO_INPUT_FILE"

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=open(filename, "rb"),
            file_name=filename,
        ),
        strategy=shared.Strategy.VLM,
        vlm_model="gpt-4o",
        vlm_model_provider="openai",
        languages=['eng'],
        split_pdf_page=True,            # If True, splits the PDF file into smaller batches of pages.
        split_pdf_allow_failed=True,    # If True, partitioning continues even if some pages fail.
        split_pdf_concurrency_level=15  # Set the number of concurrent requests to the maximum value: 15.
    ),
)

try:
    res = client.general.partition(
        request=req
    )
    element_dicts = [element for element in res.elements]
    
    # Print the processed data's first element only.
    print(element_dicts[0])

    # Write the processed data to a local file.
    json_elements = json.dumps(element_dicts, indent=2)

    with open("PATH_TO_OUTPUT_FILE", "w") as file:
        file.write(json_elements)
except Exception as e:
    print(e)
```

## Async partitioning

The Python SDK also provides a `partition_async` method. It is equivalent to `partition`, except that it can be used in a non-blocking context. For instance, `asyncio.gather` can be used to concurrently process multiple files inside a directory hierarchy, as demonstrated here:

<CodeGroup>
  ```python Python theme={null}
  import asyncio
  import os
  import json
  import unstructured_client
  from unstructured_client.models import shared

  client = unstructured_client.UnstructuredClient(
      api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")
  )

  async def call_api(filename, input_dir, output_dir):
      req = {
          "partition_parameters": {
              "files": {
                  "content": open(filename, "rb"),
                  "file_name": os.path.basename(filename),
              },
              "strategy": shared.Strategy.VLM,
              "vlm_model": "gpt-4o",
              "vlm_model_provider": "openai",
              "languages": ['eng'],
              "split_pdf_page": True,            # If True, splits the PDF file into smaller batches of pages.
              "split_pdf_allow_failed": True,    # If True, partitioning continues even if some pages fail.
              "split_pdf_concurrency_level": 15  # Set the number of concurrent requests to the maximum value: 15.
          }
      }

      try:
          res = await client.general.partition_async(
              request=req
          )
          element_dicts = [element for element in res.elements]
          json_elements = json.dumps(element_dicts, indent=2)

          # Create the output directory structure.
          relative_path = os.path.relpath(os.path.dirname(filename), input_dir)
          output_subdir = os.path.join(output_dir, relative_path)
          os.makedirs(output_subdir, exist_ok=True)

          # Write the output file.
          output_filename = os.path.join(output_subdir, os.path.basename(filename) + ".json")
          with open(output_filename, "w") as file:
              file.write(json_elements)

      except Exception as e:
          print(f"Error processing {filename}: {e}")

  async def process_files(input_directory, output_directory):
      tasks = []

      for root, _, files in os.walk(input_directory):
          for file in files:
              if not file.endswith('.json'):
                  full_path = os.path.join(root, file)
                  tasks.append(call_api(full_path, input_directory, output_directory))

      await asyncio.gather(*tasks)

  if __name__ == "__main__":
      asyncio.run(process_files(
          input_directory=os.getenv("LOCAL_FILE_INPUT_DIR"), 
          output_directory=os.getenv("LOCAL_FILE_OUTPUT_DIR")
      ))
  ```
</CodeGroup>

## Page splitting

To speed up the processing of large PDF files, the `split_pdf_page`[\*](#parameter-names) parameter is `True` by default. This
causes the PDF to be split into small batches of pages before requests are sent to the API. The client
awaits all parallel requests and combines the responses into a single response object. This behavior is specific to PDF files;
the parameter is ignored for other file types.

The number of parallel requests is controlled by `split_pdf_concurrency_level`[\*](#parameter-names).
The default is 8, and the maximum is capped at 15 to avoid high resource usage and costs.

If at least one request is successful, the responses are combined into a single response object. An
error is returned only if all requests failed or there was an error during splitting.
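Conceptually, the combine-or-fail behavior described above works like the following sketch (an illustration only, not the SDK's actual internals; a failed batch is represented here as `None`):

```python Python theme={null}
def combine_batches(batch_results):
    """Merge per-batch element lists into a single combined list."""
    # Keep only the batches that succeeded.
    successes = [elements for elements in batch_results if elements is not None]
    if not successes:
        # An error is surfaced only when every split request failed.
        raise RuntimeError("All split requests failed")
    # Otherwise, concatenate the per-batch element lists into one result.
    combined = []
    for elements in successes:
        combined.extend(elements)
    return combined
```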

<Note>
  This feature may lead to unexpected results when chunking because the server does not see the entire
  document context at once. If you'd like to chunk across the whole document and still get the speedup from
  parallel processing, you can:

  * Partition the PDF with `split_pdf_page` set to `True`, without any chunking parameters.
  * Store the returned elements in `results.json`.
  * Partition this JSON file with the desired chunking parameters.
</Note>

```python Python theme={null}
req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=file.read(),
            file_name=filename,
        ),
        strategy=shared.Strategy.VLM,
        vlm_model="gpt-4o",
        vlm_model_provider="openai",
        split_pdf_page=True,            # If True, splits the PDF file into smaller batches of pages.
        split_pdf_allow_failed=True,    # If True, partitioning continues even if some pages fail.
        split_pdf_concurrency_level=15  # Set the number of concurrent requests to the maximum value: 15.
    )
)
res = client.general.partition(
    request=req
)
```

## Customizing the client

### Retries

You can change the retry defaults through the `retry_config`[\*](#parameter-names)
parameter when initializing the client. If a request to the API fails, the client retries the
request with an exponential backoff strategy, up to a maximum interval of one minute. The
client keeps retrying until the total elapsed time exceeds `max_elapsed_time`[\*](#parameter-names),
which defaults to one hour. You can override these defaults, for example as follows:

```python Python theme={null}
# ...
from unstructured_client.utils import RetryConfig, BackoffStrategy

client = UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    retry_config=RetryConfig(
        strategy="backoff",
        retry_connection_errors=True,
        backoff=BackoffStrategy(
            initial_interval=500,     # 0.5 seconds: 500 milliseconds
            max_interval=60000,       # 1 minute: 60 seconds * 1000 milliseconds
            exponent=1.5,
            max_elapsed_time=900000,  # 15 minutes: 15 * 60 seconds * 1000 milliseconds
        ),
    )
)
```

Available `RetryConfig` settings include:

| Setting                            | Description                                                                       |
| ---------------------------------- | --------------------------------------------------------------------------------- |
| `strategy`                         | The strategy to use for retries. The only supported value is `backoff`.           |
| `retry_connection_errors`          | True to retry on connection errors.                                               |
| `BackoffStrategy.initial_interval` | After the first error, wait the specified number of milliseconds before retrying. |
| `BackoffStrategy.max_interval`     | The maximum wait time between retries, in milliseconds.                           |
| `BackoffStrategy.exponent`         | After each retry, increase the wait time exponentially by the specified factor.   |
| `BackoffStrategy.max_elapsed_time` | Stop retrying after the specified number of milliseconds.                         |
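To see how the `BackoffStrategy` settings interact, the following sketch computes the wait schedule they imply (illustrative arithmetic only, not the SDK's internal retry loop):

```python Python theme={null}
def backoff_waits(initial_interval=500, max_interval=60000,
                  exponent=1.5, max_elapsed_time=900000):
    """Return the successive wait times (in ms) until max_elapsed_time is spent."""
    waits = []
    elapsed = 0.0
    interval = float(initial_interval)
    while True:
        wait = min(interval, max_interval)     # cap each wait at max_interval
        if elapsed + wait > max_elapsed_time:  # stop once the time budget is spent
            break
        waits.append(wait)
        elapsed += wait
        interval *= exponent                   # grow the interval exponentially
    return waits

print(backoff_waits()[:4])  # [500.0, 750.0, 1125.0, 1687.5]
```

With the defaults above, each wait is 1.5 times the previous one until it hits the one-minute cap, and retries stop once the 15-minute budget is exhausted.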

### Disabling SSL validation

If you disable SSL validation, requests will accept any TLS certificate
presented by the server and ignore hostname mismatches and/or expired certificates,
which will make your application vulnerable to man-in-the-middle (MitM) attacks.
Only set this to `False` for testing.

```python Python theme={null}
import requests

http_client = requests.Session()
http_client.verify = False

client = UnstructuredClient(
    client=http_client,
    ...
)
```

## Handling the response

The partition response defaults to a dict format that can be converted to Unstructured elements with
the `elements_from_dicts` utility function as seen below. Otherwise, the API response can be sent directly
to your vector store or another destination.

```python Python theme={null}
from unstructured.staging.base import elements_from_dicts

# ...

if res.elements is not None:
    elements = elements_from_dicts(res.elements)
```

## Parameters & examples

The parameter names used in this document are for the Python SDK, which follows the snake\_case convention. The JavaScript/TypeScript SDK follows the camelCase
convention. Other than this difference in naming convention,
the parameter names are the same across both SDKs' methods.
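The mapping between the two conventions is mechanical, as this generic illustration shows (not a function exported by either SDK):

```python Python theme={null}
def snake_to_camel(name: str) -> str:
    # Keep the first word lowercase; capitalize the first letter of each following word.
    head, *rest = name.split("_")
    return head + "".join(word.capitalize() for word in rest)

print(snake_to_camel("split_pdf_concurrency_level"))  # splitPdfConcurrencyLevel
```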

* Refer to the [API parameters](/api-reference/legacy-api/partition/api-parameters) page for the full list of available parameters.
* Refer to the [Examples](/api-reference/legacy-api/partition/examples) page for some inspiration on using the parameters.

## Migration guide

There are breaking changes beginning with Python SDK version 0.26.0. If you encounter any errors when upgrading, the solutions are listed below.

**If you see the error: `AttributeError: 'PartitionParameters' object has no attribute 'partition_parameters'`**

Before 0.26.0, the SDK accepted a `PartitionParameters` object as input to the `sdk.general.partition` function. Beginning with 0.26.0, this object must be wrapped in a `PartitionRequest` object. The old behavior was deprecated in 0.23.0 and removed in 0.26.0.

```python  theme={null}
# Instead of:
from unstructured_client.models import shared

req = shared.PartitionParameters(
    files=files,
)

resp = s.general.partition(
    request=req
)

# Switch to:
from unstructured_client.models import shared, operations

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=files,
    )
)

resp = s.general.partition(
    request=req
)
```

**If you see the error: `TypeError: BaseModel.__init__() takes 1 positional argument but 2 were given`**

Beginning with 0.26.0, the `PartitionRequest` constructor no longer allows for positional arguments. You must specify `partition_parameters` by name.

```python  theme={null}
# Instead of:
req = operations.PartitionRequest(
    shared.PartitionParameters(
        files=files,
    )
)

# Switch to:
req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=files,
    )
)
```

**If you see the error: `TypeError: General.partition() takes 1 positional argument but 2 were given`**

Beginning with 0.26.0, the `partition` function no longer allows for positional arguments. You must specify `request` by name.

```python  theme={null}
# Instead of:
resp = s.general.partition(req)

# Switch to:
resp = s.general.partition(
    request=req
)
```
