Prerequisites
uv and venv are described in the following steps. However, uv and venv are not required to use the Unstructured open source library.

You can use uv to install Python if needed, as described in the following steps.

A sample PDF file named layout-parser-paper.pdf, which you can download in a later step. (The Unstructured open source library provides support for additional file types as well.)

Install uv
To install uv, use curl with sh:

Or use wget with sh instead:

To install uv by using other approaches such as PyPI, Homebrew, or WinGet, see Installing uv.

Install Python
uv will detect and use Python if you already have it installed. To view a list of installed Python versions, run the following command:

If needed, install a specific version of Python with uv by running the following command. For example, this command installs Python 3.12 for use with uv:

Create a uv project
Use uv to create a project by switching to the directory on your development machine where you want to create the project and then running the following command:

Create a venv virtual environment
Use uv to create a virtual environment with venv by running the following command:

Activate the virtual environment
To activate the venv virtual environment, run one of the following commands:

For bash or zsh, run source .venv/bin/activate
For fish, run source .venv/bin/activate.fish
For csh or tcsh, run source .venv/bin/activate.csh
For pwsh, run .venv/bin/Activate.ps1

To deactivate the virtual environment at any time, run deactivate.

Install the Unstructured open source library
Use uv to install the Unstructured open source library by running the following command:

This installs support for plain text files (.txt), HTML files (.html), XML files (.xml), and emails (.eml, .msg, and .p7s) without any additional dependencies.

To work with other file types, you must also install the matching extra dependencies, as follows, replacing <extra> with the appropriate extra for the target file type:

all-docs (for all supported file types in this list)
csv (for .csv files only)
docx (for .doc and .docx files only)
epub (for .epub files only)
image (for all supported image file types: .bmp, .heic, .jpeg, .png, and .tiff)
md (for .md files only)
odt (for .odt files only)
org (for .org files only)
pdf (for .pdf files only)
pptx (for .ppt and .pptx files only)
rst (for .rst files only)
rtf (for .rtf files only)
tsv (for .tsv files only)
xlsx (for .xls and .xlsx files only)

Install system dependencies
tesseract-lang (for additional language support).
The dependency that handles .epub, .odt, and .rtf files. (For .rtf files, you must have version 2.14.2 or newer. Running this script will install the correct version for you.)

Download the sample PDF file
Download the layout-parser-paper.pdf file from the following location to your local development machine:

https://github.com/Unstructured-IO/unstructured/tree/main/example-docs/pdf

(You can also use any other PDF file that you want to work with instead of this sample file, if you prefer.)

Add the Python code
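As a rough illustration of what main.py could contain, the sketch below partitions the sample PDF and writes the resulting elements to a JSON file next to it. It is an assumption, not the original listing: it relies on partition_pdf from unstructured.partition.pdf and elements_to_json from unstructured.staging.base, passes the auto strategy explicitly, and uses two hypothetical helper names (output_path_for and partition_to_json) that do not come from the library.

```python
from pathlib import Path

# Placeholder from this walkthrough: replace <path/to> with the real directory.
INPUT_PDF = Path("<path/to>/layout-parser-paper.pdf")


def output_path_for(pdf_path: Path) -> Path:
    """Return the sibling '<name>-output.json' path for a given PDF (hypothetical helper)."""
    return pdf_path.with_name(f"{pdf_path.stem}-output.json")


def partition_to_json(pdf_path: Path) -> Path:
    """Partition a PDF into elements and write them to a JSON file (hypothetical helper)."""
    # Imported inside the function so the path helper above still works
    # on a machine where the library is not installed yet.
    from unstructured.partition.pdf import partition_pdf
    from unstructured.staging.base import elements_to_json

    # Assumption: the auto strategy is named explicitly here for clarity.
    elements = partition_pdf(filename=str(pdf_path), strategy="auto")
    out_path = output_path_for(pdf_path)
    elements_to_json(elements=elements, filename=str(out_path))
    return out_path


if __name__ == "__main__":
    if INPUT_PDF.is_file():
        print(f"Wrote {partition_to_json(INPUT_PDF)}")
    else:
        print(f"PDF not found: {INPUT_PDF}")
```

Keeping the output path in a small helper means the JSON file always lands beside the input PDF, whatever PDF file you point the script at.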
In the main.py file, add the Python code for this step, replacing <path/to> with the path to the layout-parser-paper.pdf file that you downloaded to your local development machine.

(If you want to use a different PDF file, replace layout-parser-paper with the name of that PDF file instead.)

Run the Python code
Use uv to run the preceding Python code by running the following command:

View the output
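For a quick overview of the layout-parser-paper-output.json file without reading it line by line, a short script can count the elements by type. This is a minimal sketch that assumes the output file holds a JSON list of element objects, each with a "type" key; the function names summarize_elements and load_elements are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_elements(elements):
    """Count how many elements of each type appear in a list of element dicts."""
    # Elements without a "type" key are grouped under "unknown".
    return Counter(element.get("type", "unknown") for element in elements)


def load_elements(json_path):
    """Load the list of element dictionaries from an output JSON file."""
    return json.loads(Path(json_path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Placeholder from this walkthrough: replace <path/to> with the real directory.
    output_file = Path("<path/to>/layout-parser-paper-output.json")
    if output_file.is_file():
        summary = summarize_elements(load_elements(output_file))
        for element_type, count in summary.most_common():
            print(f"{element_type}: {count}")
    else:
        print(f"Output file not found: {output_file}")
```

The summary is only a convenience. The full text and metadata of each element remain in the JSON file itself.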
Open the layout-parser-paper-output.json file in your editor. This file will be in the same location as the original layout-parser-paper.pdf file.

(If you used a different PDF file, the output file will be named <your-file-name>-output.json instead.)

Learn about partitioning functions other than partition_pdf for converting other types of files into standard Unstructured document elements and metadata.

This walkthrough uses the auto partitioning strategy. Learn about other available partitioning strategies for fine-tuned approaches to converting different types of files into Unstructured document elements.
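To make the idea of switching strategies concrete, the sketch below passes a strategy name through to partition_pdf. The list of names (auto, fast, hi_res, ocr_only) is an assumption to confirm against the library documentation, some strategies may need additional dependencies, and the helper names check_strategy and partition_with_strategy are hypothetical.

```python
# Assumed strategy names for partition_pdf; confirm against the library documentation.
KNOWN_STRATEGIES = ("auto", "fast", "hi_res", "ocr_only")


def check_strategy(strategy):
    """Return the strategy name unchanged, or raise ValueError if it is not recognized."""
    if strategy not in KNOWN_STRATEGIES:
        raise ValueError(
            f"Unknown strategy {strategy!r}; expected one of {KNOWN_STRATEGIES}"
        )
    return strategy


def partition_with_strategy(pdf_path, strategy="auto"):
    """Partition a PDF file using the named strategy (hypothetical helper)."""
    # Imported lazily so the name check above can run without the library installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=str(pdf_path), strategy=check_strategy(strategy))
```

Checking the name up front surfaces a typo immediately, before any time is spent processing the file.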