This quickstart requires you to install the Unstructured Python SDK on your local machine. If you cannot or do not want to install or run anything on your local machine, you can skip over to the notebook Unstructured API On-Demand Jobs Quickstart instead. The notebook is remotely hosted on Google Colab and requires no local machine setup steps.
This quickstart shows how to use the Unstructured Python SDK to have Unstructured process local files by using the Unstructured API’s on-demand jobs functionality. This functionality is part of the Unstructured API’s collection of workflow operations. On-demand jobs take one or more local files as input. Unstructured outputs the local files’ contents as a series of Unstructured document elements and metadata. This format is ideal for retrieval-augmented generation (RAG), agentic AI, and model fine-tuning.
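For reference, each document element in Unstructured's output is a JSON object with fields such as type, text, and metadata. A hypothetical example element follows (the field values are illustrative, not taken from a real run):

```python
# A hypothetical Unstructured document element; values are illustrative only.
example_element = {
    "type": "NarrativeText",                  # The element's category, e.g. Title, NarrativeText, Table.
    "element_id": "abc123",                   # Hypothetical element ID.
    "text": "Example paragraph text.",        # The extracted text content.
    "metadata": {
        "filename": "example.pdf",            # The source file the element came from.
        "page_number": 1                      # Where in the source file the element was found.
    },
}
```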
The on-demand jobs functionality is designed to process local files only. To process files (and data) in remote file and blob storage, databases, and vector stores, you must use other workflow operations in the Unstructured API. To learn how, see for example the notebook Dropbox-To-Pinecone Connector API Quickstart for Unstructured.

Requirements

To run this quickstart, you will need:
  • An Unstructured account. To sign up for an account, go to https://unstructured.io/?modal=try-for-free. After you sign up, you are immediately signed in to your new Unstructured account, at https://platform.unstructured.io.
  • An Unstructured API key, as follows:
    1. After you are signed in to your account, on the sidebar click API Keys.
    2. Click Generate New Key.
    3. Enter some meaningful display name for the key, and then click Continue.
    4. Next to the new key’s name, click the Copy icon. The key’s value is copied to your system’s clipboard. If you lose this key, simply return to the list and click Copy again.
  • Python 3.9 or higher installed on your local machine.
  • A Python virtual environment is recommended for isolating and versioning Python project code dependencies on your local machine, but this is not required. This quickstart uses the popular Python package and project manager uv for managing virtual environments. Installation and use of uv are described in the following steps.
  • One or more local files for Unstructured to process. The following code example assumes that the local files you want to process are in a folder named input, which is in the same project directory as your Python code. Creation of this folder is described in the following steps. The files’ types must be in the list of supported file types.
    Each on-demand job is limited to 10 files, and each file is limited to 10 MB in size.
  • A destination folder for Unstructured to send its processed results to. The following code example assumes that the destination folder is named output and is in the same project directory as your Python code. Creation of this folder is described in the following steps.
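The 10-file and 10 MB limits above can be checked locally before a job is created. The following is a minimal sketch; the helper name and its size-dictionary interface are illustrative, not part of the SDK:

```python
MAX_FILES = 10
MAX_FILE_BYTES = 10 * 1024 * 1024  # 10 MB per file.

def check_job_limits(file_sizes):
    """Check a batch of candidate input files against the on-demand job limits.

    file_sizes: a dict mapping filename to size in bytes
    (for example, built from os.path.getsize over the input folder).
    Returns a list of human-readable problems; an empty list means the batch is OK.
    """
    problems = []
    if len(file_sizes) > MAX_FILES:
        problems.append(f"Too many files: {len(file_sizes)} exceeds the {MAX_FILES}-file limit.")
    for name, size in file_sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name} is {size} bytes, over the 10 MB per-file limit.")
    return problems
```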

Step 1: Create a Python virtual environment

In this step, you use uv to create a new Python project, and a virtual environment within this project, for this quickstart.
If you do not want to use uv or create a Python project, you can do the following instead:
  1. Install the Unstructured Python SDK globally on your local machine by running the following command:
    pip install unstructured-client
    
  2. Skip ahead to Step 2.
  1. Otherwise, install uv on your local machine, if it is not already installed.
  2. Create a new, blank folder on your local machine for this quickstart, and then switch to this new folder. This example creates a folder named unstructured_api_quickstart within your current working directory and then switches to this new folder:
    mkdir unstructured_api_quickstart
    cd unstructured_api_quickstart
    
  3. Create a new uv project for this quickstart by running the following command from within the new folder:
    uv init
    
  4. Create a new virtual environment within this project by running the following command:
    uv venv
    
  5. Activate the virtual environment by running the following command:
    source .venv/bin/activate
    
  6. Install the Unstructured Python SDK into the virtual environment by running the following command:
    uv add unstructured-client
    

Step 2: Add the example code

  1. In the same directory as the project’s main.py code file, add two folders named input and output. Your project directory should now look like this:
    unstructured_api_quickstart/
      ├── .venv/
      ├── input/  <- Upload your input files here.
      ├── output/ <- Unstructured will download its output files here.
      ├── .gitignore
      ├── .python-version
      ├── main.py <- Your Python code will go here.
      ├── pyproject.toml
      ├── README.md
      └── uv.lock
    
    If you do not want to use uv or create a Python project, just create a blank file named main.py and two folders named input and output, all in the same local directory, for example:
    <parent-directory>/
      ├── input/  <- Upload your input files here.
      ├── output/ <- Unstructured will download its output files here.
      └── main.py <- Your Python code will go here.
    
  2. Upload the files you want Unstructured to process into the new input folder.
  3. Overwrite the contents of the main.py file with the following code. In this code, replace <your-unstructured-api-key> (in the main() function, near the end of the following code) with the value of your Unstructured API key. Then save this file.
    from unstructured_client import UnstructuredClient
    from unstructured_client.models.operations import CreateJobRequest
    from unstructured_client.models.operations import DownloadJobOutputRequest
    from unstructured_client.models.shared import BodyCreateJob, InputFiles, JobInformation
    import json, os, time
    from typing import Optional
    
    
    def run_on_demand_job(
            client: UnstructuredClient,
            input_dir: str,
            job_template_id: Optional[str] = None, 
            job_nodes: Optional[list[dict[str, object]]] = None
    ) -> tuple[str, list[str], list[dict[str, str]]]:
        """Runs an Unstructured on-demand job.
    
        Arguments:
        - client {UnstructuredClient}: The initialized Unstructured API client to use.
        - input_dir {str}: The directory that contains the input files.
        - job_template_id: {Optional[str]}: If this job is to use a workflow template, the ID of the workflow template to use.
        - job_nodes {Optional[list[dict[str, object]]]}: If this job is to use a custom workflow definition, the list of custom workflow nodes to use.
    
        Raises:
        - ValueError: If neither a job template ID nor job nodes are specified.
            
        Returns:
        - job_id {str}: The ID of the on-demand job.
        - job_input_file_ids {list[str]}: The input file IDs of the on-demand job.
        - job_output_node_files {list[dict[str, str]]}: The output node files of the on-demand job.
        """
        files = []
    
        for filename in os.listdir(input_dir):
            full_path = os.path.join(input_dir, filename)
    
            # Skip non-files (for example, directories).
            if not os.path.isfile(full_path):
                continue
    
            files.append(
                (
                    InputFiles(
                        content=open(full_path, "rb"),
                        file_name=filename,
                    # Assumes PDF inputs; set this to match each file's actual type.
                    content_type="application/pdf"
                    )
                )
            )
    
        if job_template_id is not None:
            request_data = json.dumps({"template_id": job_template_id})
        elif job_nodes is not None:
            request_data = json.dumps({"job_nodes": job_nodes})
        else:
            raise ValueError("Specify either a job template ID or job nodes.")
    
        response = client.jobs.create_job(
            request=CreateJobRequest(
                body_create_job=BodyCreateJob(
                    request_data=request_data,
                    input_files=files
                )
            )
        )
    
        job_id = response.job_information.id
        job_input_file_ids = response.job_information.input_file_ids
        job_output_node_files = response.job_information.output_node_files
    
        return job_id, job_input_file_ids, job_output_node_files
    
    
    def poll_for_job_status(client: UnstructuredClient, job_id: str) -> JobInformation:
        """Keeps checking a job's status until the job is completed.
    
        Arguments:
        - client {UnstructuredClient}: The initialized Unstructured API client to use.
        - job_id {str}: The job ID to check the status of.
    
        Returns:
        - job {JobInformation}: Information about the Unstructured job.
        """
        while True:
            response = client.jobs.get_job(
                request={
                    "job_id": job_id
                }
            )
    
            job = response.job_information
    
            if job.status == "SCHEDULED":
                print("Job is scheduled, polling again in 10 seconds...")
                time.sleep(10)
            elif job.status == "IN_PROGRESS":
                print("Job is in progress, polling again in 10 seconds...")
                time.sleep(10)
            else:
                print(f"Job finished with status '{job.status}'.")
                break
    
        return job
    
    
    def download_job_output(
            client: UnstructuredClient,
            job_id: str,
            job_input_file_ids: list[str],
            output_dir: str
    ) -> None:
        """Downloads the output of an Unstructured job.
    
        Arguments:
        - client {UnstructuredClient}: The initialized Unstructured API client to use.
        - job_id {str}: The job ID to download the output from.
        - job_input_file_ids {list[str]}: The input file IDs of the job.
        - output_dir {str}: The directory to download the output into.
        """
        for job_input_file_id in job_input_file_ids:
            print(f"Attempting to get processed results from file_id '{job_input_file_id}'...")
    
            response = client.jobs.download_job_output(
                request=DownloadJobOutputRequest(
                    job_id=job_id,
                    file_id=job_input_file_id
                )
            )
    
            output_path = os.path.join(output_dir, f"{job_input_file_id}.json")
    
            with open(output_path, "w") as f:
                json.dump(response.any, f, indent=4)
    
            print(f"Saved output for file_id '{job_input_file_id}' to '{output_path}'.\n")
    
    
    def main():
        # API key and source/destination folder paths.
        UNSTRUCTURED_API_KEY = "<your-unstructured-api-key>"
        INPUT_FOLDER_PATH = "./input"
        OUTPUT_FOLDER_PATH = "./output"
    
        # On-demand job settings.
        job_template_id = "hi_res_and_enrichment"
        job_nodes = [] # Applies only if the job is to use a custom workflow definition.
    
        # Internal tracking variables.
        job_id = ""
        job_input_file_ids = []
        job_output_node_files = []
    
        with UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY) as client:
            print("-" * 80)
            print(f"Attempting to run the on-demand job, ingesting the input files from '{INPUT_FOLDER_PATH}'...")
            job_id, job_input_file_ids, job_output_node_files = run_on_demand_job(
                client = client,
                input_dir = INPUT_FOLDER_PATH,
                job_template_id = job_template_id
            )
    
            print(f"Job ID: {job_id}\n")
            print("Input file details:\n")
    
            for job_input_file_id in job_input_file_ids:
                print(job_input_file_id)
    
            print("\nOutput node file details:\n")
    
            for output_node_file in job_output_node_files:
                print(output_node_file)
    
            print("-" * 80)
            print("Polling for job status...")
    
            job = poll_for_job_status(client, job_id)
            
            print(f"Job details:\n---\n{job.model_dump_json(indent=4)}")
        
            if job.status != "COMPLETED":
                print("Job did not complete successfully. Stopping this script without downloading any output.")
                exit(1)
    
            print("-" * 80)
            print("Attempting to download the job output...")
            download_job_output(client, job_id, job_input_file_ids, OUTPUT_FOLDER_PATH)
            
            print("-" * 80)
            print(f"Script completed. Check the output folder '{OUTPUT_FOLDER_PATH}' for the results.")
            exit(0)
    
    
    if __name__ == "__main__":
        main()
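The example above hardcodes the API key in main.py for simplicity. As an alternative (an option, not a requirement of this quickstart), you can read the key from an environment variable instead, for example:

```python
import os

def get_api_key(env_var="UNSTRUCTURED_API_KEY"):
    """Read the Unstructured API key from an environment variable instead of hardcoding it."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your Unstructured API key.")
    return key
```

In main(), you would then replace the hardcoded UNSTRUCTURED_API_KEY string with a call to this helper, and set the environment variable in your shell before running the script.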
    

Step 3: Run the code and view the results

  1. Run the code in the main.py file, by running the following command:
    uv run main.py
    
    If you do not want to use uv or create a Python project, you can run the code by running the following command instead:
    python main.py
    
  2. After the code finishes running, look in the output folder to see Unstructured’s results.
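Each output file is a JSON array of document elements. As an optional follow-up (not part of the quickstart itself), you can summarize which element types a results file contains; a minimal sketch:

```python
from collections import Counter

def summarize_elements(elements):
    """Count Unstructured document elements by their 'type' field.

    elements: a list of element dicts, as parsed from one output JSON file.
    Returns a Counter mapping element type (e.g. Title, NarrativeText) to count.
    """
    return Counter(el.get("type", "Unknown") for el in elements)
```

For example, pass in the result of json.load on one of the JSON files in the output folder to see how many Title, NarrativeText, Table, and other elements Unstructured produced.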

Next steps

This quickstart showed you how to use the Unstructured Python SDK to process local files by using Unstructured’s on-demand jobs functionality. To learn more about processing local files with Unstructured, see the example notebook Unstructured API On-Demand Jobs Walkthrough, which runs remotely in Google Colab and requires no local setup. To learn how to process files (and data) in remote file and blob storage, databases, and vector stores, see for example the notebook Dropbox-To-Pinecone Connector API Quickstart for Unstructured, which also runs remotely in Google Colab.