Follow these steps to run the Unstructured open source library inside a Docker container.

1

Install and run Docker

If you do not have Docker already installed and running, you can install and run a tool such as Docker Desktop, which is available for macOS, Windows, and Linux. For installation instructions, see the Docker Desktop documentation for your operating system.

2

Pull the Unstructured Docker image

If you are an experienced Docker user, you plan to parse only a single type of data, and you want to accelerate the image-building process, you can build your own Docker image instead of pulling the latest prebuilt image.

The following steps are for AMD64-based systems.

If you are using an ARM64-based system (such as Apple Silicon), follow the instructions on the Docker CLI tab in this step instead.

  1. In your Docker Desktop UI’s search box, enter downloads.unstructured.io/unstructured-io/unstructured:latest.
  2. On the Images tab, next to unstructured-io/unstructured, click Pull.

To list the available images on your machine, in the sidebar, click Images.

To remove this image from your machine at any time, click the trash can (Delete) icon next to the image in the list of available images.

3

Create and run a container from the image

The following steps are for AMD64-based systems.

If you are using an ARM64-based system (such as Apple Silicon), follow the instructions on the Docker CLI tab in this step instead.

  1. In the Docker Desktop UI’s sidebar, click Images.
  2. Next to unstructured-io/unstructured, click the play (Run) icon.
  3. Expand Optional settings.
  4. For Container name, enter some name for your container, such as unstructured.
  5. In the sidebar, click Containers.
  6. Next to your container, click the play (Start) icon.

4

Interact with the Unstructured open source library by running code inside the container

  1. In the Docker Desktop UI, in the lower right corner, click the Terminal button.

  2. To start a terminal session inside the container, run the following command, replacing <container-name> with the name of your container, such as unstructured:

    docker exec -it <container-name> bash
  3. Run Unstructured open source library calls from inside the container. For example, start the Python interpreter:

    python

    And then run the following commands, one command at a time, to make calls to the Unstructured open source library. These calls process a PDF file in the /app/example-docs/pdf directory named layout-parser-paper.pdf. The processed data is written as a JSON file named layout-parser-paper-output.json in that same directory:

    >>> from unstructured.partition.pdf import partition_pdf
    >>> from unstructured.staging.base import elements_to_json
    >>> elements = partition_pdf(filename="/app/example-docs/pdf/layout-parser-paper.pdf")
    >>> elements_to_json(elements=elements, filename="/app/example-docs/pdf/layout-parser-paper-output.json")

    After the last call finishes running, exit the Python interpreter, and then print the contents of the JSON file to the terminal:

    >>> exit()
    
    cat /app/example-docs/pdf/layout-parser-paper-output.json
  4. To exit the terminal session, run the following command, or press Ctrl+D:

    exit
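
As an alternative to printing the raw file with cat, you can summarize the output from the Python interpreter inside the container. The following is a minimal sketch, not part of the Unstructured library: summarize_elements is a hypothetical helper, and it assumes that the JSON file is a list of objects that each carry a "type" field, which is how elements_to_json is understood to serialize elements.

```python
import json
import os
from collections import Counter


def summarize_elements(json_path: str) -> Counter:
    """Count the elements in an Unstructured JSON output file by their "type" field."""
    with open(json_path, encoding="utf-8") as f:
        # Assumption: the file holds a list of dicts, one per element.
        elements = json.load(f)
    # Elements without a "type" key are grouped under the label "Unknown".
    return Counter(element.get("type", "Unknown") for element in elements)


if __name__ == "__main__":
    path = "/app/example-docs/pdf/layout-parser-paper-output.json"
    # Only summarize if the file from the earlier calls is present.
    if os.path.exists(path):
        for element_type, count in summarize_elements(path).most_common():
            print(f"{element_type}: {count}")
```

The per-type counts give a quick check that the PDF was partitioned into more than one kind of element before you read through the full JSON.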

5

Interact with the Unstructured open source library by running code outside the container

You can also interact with the Unstructured open source library by running code that is on the same machine as the running container but not within the container itself. To do this, you can use the Docker CLI to create a container that mounts the local directory containing the code into the container itself, and then run that code from the container.
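
The docker run commands in this step differ only in the --platform value. As a rough illustration of how the pieces fit together, the following sketch assembles the same command from its parts and picks the platform from the host CPU architecture. The helper names (docker_platform, build_run_command) and the sample paths are hypothetical, and the sketch only prints the command rather than running it.

```python
import platform
import shlex
from typing import List, Optional

IMAGE = "downloads.unstructured.io/unstructured-io/unstructured:latest"


def docker_platform(machine: str) -> str:
    """Map a CPU architecture name to the matching Docker platform string."""
    if machine.lower() in ("arm64", "aarch64"):
        return "linux/arm64"
    # Assumption: anything else is treated as AMD64, the default platform.
    return "linux/amd64"


def build_run_command(
    host_path: str,
    container_path: str,
    container_name: str,
    machine: Optional[str] = None,
) -> List[str]:
    """Assemble `docker run` arguments for a detached container with a bind mount."""
    machine = machine or platform.machine()
    return [
        "docker", "run", "-dt",
        "-v", f"{host_path}:{container_path}",  # mount host directory into the container
        f"--platform={docker_platform(machine)}",
        "--name", container_name,
        IMAGE,
    ]


if __name__ == "__main__":
    # Hypothetical placeholder values; substitute your own paths and name.
    command = build_run_command(
        "/Users/me/my_example_code", "/app/my_example_code", "unstructured_mount"
    )
    print(shlex.join(command))
```

Printing the command first lets you check the mount and platform arguments before you paste it into a terminal.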

  1. Run one of the following commands, replacing the following placeholders with the appropriate values:

    • Replace <host-path> with the path to the directory containing your code, for example /Users/<username>/my_example_code/.
    • Replace <container-path> with the path to some directory within the container to mount <host-path> into, for example /app/my_example_code/. If <container-path> does not already exist, it will be created at the same time that the container is created.
    • Replace <container-name> with some name for your container, such as unstructured_mount.

    For AMD64-based systems, run the following command:

    # The AMD64 platform is the default.
    docker run -dt -v <host-path>:<container-path> --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest
    
    # Or, to explicitly specify the AMD64 platform:
    docker run -dt -v <host-path>:<container-path> --platform=linux/amd64 --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest

    For ARM64-based systems (such as Apple Silicon), run the following command instead:

    docker run -dt -v <host-path>:<container-path> --platform=linux/arm64 --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest
  2. Start a terminal session inside the container by running the following command, replacing <container-name> with the name of your container, such as unstructured_mount:

    docker exec -it <container-name> bash
  3. Add <container-path> to the PYTHONPATH environment variable within the container by running the following commands, replacing <container-path> with the path to the target directory within the container:

    PYTHONPATH="${PYTHONPATH}:<container-path>"
    export PYTHONPATH
  4. Run Unstructured open source library calls, referencing your code from <container-path>.

    For example, if you have a file named main.py in <host-path> that contains the four Python statements from step 4 (the two imports and the calls to partition_pdf and elements_to_json, without the >>> prompts), you can run it as follows, replacing <container-path> with the path to the target directory within the container:

    python <container-path>/main.py

    To print the contents of the JSON file to the terminal, run the following command:

    cat /app/example-docs/pdf/layout-parser-paper-output.json
  5. To exit the terminal session, run the following command, or press Ctrl+D:

    exit
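
For reference, the main.py file mentioned in this step could look like the following sketch. It reuses the four calls from step 4 unchanged; the constants, the main function, and the check for the input file are additions for illustration and are not required by the library.

```python
from pathlib import Path

# Paths inside the container, as used in the interactive example in step 4.
INPUT_PDF = Path("/app/example-docs/pdf/layout-parser-paper.pdf")
OUTPUT_JSON = INPUT_PDF.with_name("layout-parser-paper-output.json")


def main() -> None:
    if not INPUT_PDF.is_file():
        # Most likely the script is being run outside the container.
        print(f"Input file not found: {INPUT_PDF}")
        return

    # Imported here so that a missing input file is reported before the
    # heavier library import.
    from unstructured.partition.pdf import partition_pdf
    from unstructured.staging.base import elements_to_json

    elements = partition_pdf(filename=str(INPUT_PDF))
    elements_to_json(elements=elements, filename=str(OUTPUT_JSON))
    print(f"Wrote {len(elements)} elements to {OUTPUT_JSON}")


if __name__ == "__main__":
    main()
```

Because the file lives in <host-path> on your machine and is mounted into the container, you can edit it with your usual editor and rerun it in the container without rebuilding or restarting anything.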

6

Stop running the container

If you do not need to keep the container running, you can stop it as follows:

  1. In the Docker Desktop UI, in the sidebar, click Containers.
  2. Next to your container, click the square (Stop) icon.

Building your own Docker image

You can build your own Docker image instead of pulling the latest prebuilt image. If you plan to parse only a single type of data, you can speed up the build by excluding packages or requirements that are needed only for other data types. Refer to the Dockerfile to determine which lines are necessary for your requirements. Run the following commands from the root of a local clone of the Unstructured repository, where the Makefile and Dockerfile are located:

# Build the Docker image.
make docker-build

# Start a Bash shell inside the running Docker container.
make docker-start-bash