Docker installation
Follow these steps to run the Unstructured open source library inside a Docker container.
Install and run Docker
If you do not already have Docker installed and running, you can install and run a tool such as Docker Desktop, which is available for macOS, Windows, and Linux. For installation instructions, see the Docker Desktop documentation for your operating system.
Pull the Unstructured Docker image
The following steps are for AMD64-based systems.
If you are using an ARM64-based system (such as Apple Silicon), follow the Docker CLI instructions in this step instead.
- In your Docker Desktop UI’s search box, enter downloads.unstructured.io/unstructured-io/unstructured:latest.
- On the Images tab, next to unstructured-io/unstructured, click Pull.
To list the available images on your machine, in the sidebar, click Images.
To remove this image from your machine at any time, click the trash can (Delete) icon next to the image in the list of available images.
From your terminal or command prompt, run the following Docker CLI commands.
For AMD64-based systems, run the following command:
For ARM64-based systems (such as Apple Silicon), run the following command instead:
To list the available images on your machine, run the following command:
To remove this image from your machine at any time, run the following command:
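As a sketch, the four Docker CLI steps above map onto the standard docker pull, docker images, and docker rmi commands. The image reference is the one used in the Docker Desktop steps. The --platform linux/arm64 flag is an assumption about how the ARM64 variant of the image is selected; run only the pull command that matches your system.

```shell
# AMD64-based systems: pull the latest prebuilt image.
docker pull downloads.unstructured.io/unstructured-io/unstructured:latest

# ARM64-based systems (such as Apple Silicon): request the ARM64 variant instead.
# Assumption: the image is published for the linux/arm64 platform.
docker pull --platform linux/arm64 downloads.unstructured.io/unstructured-io/unstructured:latest

# List the images available on your machine.
docker images

# Remove the image from your machine at any time.
docker rmi downloads.unstructured.io/unstructured-io/unstructured:latest
```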
Create and run a container from the image
The following steps are for AMD64-based systems.
If you are using an ARM64-based system (such as Apple Silicon), follow the Docker CLI instructions in this step instead.
- In the Docker Desktop UI’s sidebar, click Images.
- Next to unstructured-io/unstructured, click the play (Run) icon.
- Expand Optional settings.
- For Container name, enter some name for your container, such as unstructured.
- In the sidebar, click Containers.
- Next to your container, click the play (Start) icon.
For AMD64-based systems, run the following command, replacing <container-name> with some name for your container, such as unstructured:
For ARM64-based systems (such as Apple Silicon), run the following command instead, replacing <container-name> with some name for your container, such as unstructured:
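A minimal sketch of the two commands described above, using the standard docker run options. The -d and -t flags (run detached, with a pseudo-terminal so that the container keeps running) and the --platform linux/arm64 flag are assumptions rather than settings confirmed by this page.

```shell
# AMD64-based systems: create and start a named container from the image.
# -d runs the container in the background; -t allocates a pseudo-terminal.
docker run -dt --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest

# ARM64-based systems (such as Apple Silicon): same command, requesting the ARM64 variant.
docker run -dt --platform linux/arm64 --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest
```

Run only the command that matches your system, replacing <container-name> before you run it.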
Interact with the Unstructured open source library by running code inside the container
- In the Docker Desktop UI, in the lower right corner, click the Terminal button.
- To start a terminal session inside the container, run the following command, replacing <container-name> with the name of your container, such as unstructured:
- Run Unstructured open source library calls from inside the container. For example, start the Python interpreter:
  And then run the following commands, one command at a time, to make calls to the Unstructured open source library. These calls process a PDF file in the /app/example-docs/pdf directory named layout-parser-paper.pdf. The processed data is written as a JSON file named layout-parser-paper-output.json in that same directory:
  After the last call finishes running, exit the Python interpreter, and then print the contents of the JSON file to the terminal:
- To exit the terminal session, run the following command, or press Ctrl+D:
- From your terminal or command prompt, start a terminal session inside the container by running the following command, replacing <container-name> with the name of your container, such as unstructured:
- Run Unstructured open source library calls from inside the container. For example, start the Python interpreter:
  And then run the following commands, one command at a time, to make calls to the Unstructured open source library. These calls process a PDF file in the /app/example-docs/pdf directory named layout-parser-paper.pdf. The processed data is written as a JSON file named layout-parser-paper-output.json in that same directory:
  After the last call finishes running, exit the Python interpreter, and then print the contents of the JSON file to the terminal:
- To exit the terminal session, run the following command, or press Ctrl+D:
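The session described in the steps above might look like the following sketch. It assumes that Bash and python3 are available in the image, and it passes the Python calls to the interpreter through a here-document rather than typing them one at a time at the interactive prompt. The choice of partition_pdf and elements_to_json from the unstructured library is an assumption about which calls the example uses; the file paths are the ones named in the steps.

```shell
# Start an interactive Bash session inside the running container.
docker exec -it <container-name> bash

# The remaining commands run inside the container.
# Send four Python statements to the interpreter in one go.
python3 - <<'EOF'
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json

# Partition the sample PDF into a list of document elements.
elements = partition_pdf(filename="/app/example-docs/pdf/layout-parser-paper.pdf")

# Write the elements as JSON next to the source PDF.
elements_to_json(
    elements=elements,
    filename="/app/example-docs/pdf/layout-parser-paper-output.json",
)
EOF

# Print the contents of the JSON file to the terminal.
cat /app/example-docs/pdf/layout-parser-paper-output.json

# Exit the terminal session (or press Ctrl+D).
exit
```

To follow the steps literally instead, run python3 with no arguments, enter the four statements at the interpreter prompt one at a time, and then call exit() before running cat.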
Interact with the Unstructured open source library by running code outside the container
You can also interact with the Unstructured open source library by running code that is on the same machine as the running container but not within the container itself. To do this, you can use the Docker CLI to create a container that mounts the local directory containing the code into the container itself, and then run that code from the container.
- Run one of the following commands, replacing the following placeholders with the appropriate values:
  - Replace <host-path> with the path to the directory containing your code, for example /Users/<username>/my_example_code/.
  - Replace <container-path> with the path to some directory within the container to mount <host-path> into, for example /app/my_example_code/. If <container-path> does not already exist, it will be created at the same time that the container is created.
  - Replace <container-name> with some name for your container, such as unstructured_mount.
  For AMD64-based systems, run the following command:
  For ARM64-based systems (such as Apple Silicon), run the following command instead:
- Start a terminal session inside the container by running the following command, replacing <container-name> with the name of your container, such as unstructured_mount:
- Add <container-path> to the PYTHONPATH environment variable within the container by running the following commands, replacing <container-path> with the path to the target directory within the container:
- Run Unstructured open source library calls, referencing your code from <container-path>.
  For example, if you have a file named main.py in <host-path> that contains the four Python calls from the previous section, you can run it as follows, replacing <container-path> with the path to the target directory within the container:
  To print the contents of the JSON file to the terminal, run the following command:
- To exit the terminal session, run the following command, or press Ctrl+D:
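Taken together, the steps above can be sketched as follows. The -v option is the standard Docker CLI bind-mount syntax (host path, a colon, then container path). The -dt and --platform linux/arm64 flags, the availability of Bash in the image, and the output path (which assumes main.py writes to the same JSON file as in the previous section) are assumptions.

```shell
# AMD64-based systems: create the container with <host-path> mounted at <container-path>.
docker run -dt --name <container-name> -v <host-path>:<container-path> downloads.unstructured.io/unstructured-io/unstructured:latest

# ARM64-based systems (such as Apple Silicon): run this command instead.
docker run -dt --platform linux/arm64 --name <container-name> -v <host-path>:<container-path> downloads.unstructured.io/unstructured-io/unstructured:latest

# Start a terminal session inside the container.
docker exec -it <container-name> bash

# The remaining commands run inside the container.
# Append the mounted directory to PYTHONPATH so Python can import code from it.
export PYTHONPATH="${PYTHONPATH}:<container-path>"

# Run the script from the mounted directory.
python3 <container-path>/main.py

# Print the JSON output (assumes main.py wrote to this path).
cat /app/example-docs/pdf/layout-parser-paper-output.json

# Exit the terminal session (or press Ctrl+D).
exit
```

Because the directory is mounted rather than copied, edits you make to main.py on the host are visible inside the container the next time you run the script.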
Stop running the container
If you do not need to keep running the container, you can stop it as follows:
- In the Docker Desktop UI, in the sidebar, click Containers.
- Next to your container, click the square (Stop) icon.
Alternatively, from your terminal or command prompt, run the following command, replacing <container-name> with the name of your container, such as unstructured or unstructured_mount:
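With the Docker CLI, stopping the container is the standard docker stop command. The docker ps check that follows it is an optional addition, not a step from this page.

```shell
# Stop the running container.
docker stop <container-name>

# Optional: list all containers, including stopped ones, to confirm its status.
docker ps -a
```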
Building your own Docker image
You can build your own Docker image instead of pulling the latest prebuilt image. If you only plan to parse a single type of data, you can accelerate the build process by excluding certain packages or requirements needed for other data types. Refer to the Dockerfile to determine which lines are necessary for your requirements.