Full installation
To install the Unstructured open source library on a local development machine, run one or more of the following commands.
These commands assume that you are using the Python package and project manager uv, running within an activated venv virtual environment that was created with uv. However, uv and venv are not required.
To work with all supported file types, install the all-docs extra, for example by running uv pip install "unstructured[all-docs]".
To conserve disk space and reduce code dependencies, you can instead install support for only a default set of file types, for example by running uv pip install unstructured.
The preceding command supports plain text files (.txt), HTML files (.html), XML files (.xml), and emails (.eml, .msg, and .p7s) by default.
To further conserve disk space and reduce code dependencies, you can instead install support for a single file type, for example by running uv pip install "unstructured[<extra>]", replacing <extra> with the appropriate extra for the target file type.
The following file type extras are available:
- all-docs (for all supported file types in this list)
- csv (for .csv files only)
- docx (for .doc and .docx files only)
- epub (for .epub files only)
- image (for all supported image file types: .bmp, .heic, .jpeg, .png, and .tiff)
- md (for .md files only)
- odt (for .odt files only)
- org (for .org files only)
- pdf (for .pdf files only)
- pptx (for .ppt and .pptx files only)
- rst (for .rst files only)
- rtf (for .rtf files only)
- tsv (for .tsv files only)
- xlsx (for .xls and .xlsx files only)
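To decide which extras a project needs, it can help to map file extensions back to the extras listed above. The following is a minimal sketch, not part of the Unstructured library: the names EXTRA_FOR_EXTENSION, DEFAULT_EXTENSIONS, and extras_needed are hypothetical, and the mapping is simply transcribed from the list above and the default file types mentioned earlier.

```python
from pathlib import Path

# Extension-to-extra mapping transcribed from the list of extras above.
EXTRA_FOR_EXTENSION = {
    ".csv": "csv",
    ".doc": "docx", ".docx": "docx",
    ".epub": "epub",
    ".bmp": "image", ".heic": "image", ".jpeg": "image",
    ".png": "image", ".tiff": "image",
    ".md": "md",
    ".odt": "odt",
    ".org": "org",
    ".pdf": "pdf",
    ".ppt": "pptx", ".pptx": "pptx",
    ".rst": "rst",
    ".rtf": "rtf",
    ".tsv": "tsv",
    ".xls": "xlsx", ".xlsx": "xlsx",
}

# File types that the default installation handles without any extra.
DEFAULT_EXTENSIONS = {".txt", ".html", ".xml", ".eml", ".msg", ".p7s"}


def extras_needed(filenames):
    """Return the sorted list of extras required to cover the given file names."""
    extras = set()
    for name in filenames:
        ext = Path(name).suffix.lower()
        if ext in DEFAULT_EXTENSIONS:
            continue  # already covered by the default installation
        if ext not in EXTRA_FOR_EXTENSION:
            raise ValueError(f"no extra is listed for {name!r}")
        extras.add(EXTRA_FOR_EXTENSION[ext])
    return sorted(extras)


print(extras_needed(["report.PDF", "notes.txt", "slides.pptx", "old.ppt"]))
# prints ['pdf', 'pptx']
```

If a project touches most of these file types, installing all-docs is likely simpler than listing each extra individually.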
Note that you can install multiple extras at the same time by separating them with commas, for example: uv pip install "unstructured[docx,pptx]"
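As a minimal sketch of how extras combine, the following Python helper builds the comma-separated requirement string that you would then quote and pass to your installer (for example, pip install or uv pip install). The function name requirement and the KNOWN_EXTRAS set are hypothetical; the set is transcribed from the list of extras above.

```python
# Extras transcribed from the list above.
KNOWN_EXTRAS = {
    "all-docs", "csv", "docx", "epub", "image", "md", "odt",
    "org", "pdf", "pptx", "rst", "rtf", "tsv", "xlsx",
}


def requirement(extras=()):
    """Build a requirement string such as 'unstructured[docx,pptx]'."""
    extras = list(extras)
    unknown = sorted(set(extras) - KNOWN_EXTRAS)
    if unknown:
        raise ValueError(f"unknown extras: {', '.join(unknown)}")
    if not extras:
        return "unstructured"  # default set of file types only
    # Drop duplicates while keeping the caller's order.
    unique = list(dict.fromkeys(extras))
    return f"unstructured[{','.join(unique)}]"


print(requirement(["docx", "pptx"]))
# prints unstructured[docx,pptx]
```

Quote the resulting string on the command line, because square brackets are special characters in some shells.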
For maximum compatibility, you should also install the following system dependencies:
- libmagic-dev (for filetype detection)
- poppler-utils and tesseract-ocr (for images and PDFs), and tesseract-lang (for additional language support)
- libreoffice (for Microsoft Office documents)
- pandoc (for .epub, .odt, and .rtf files. For .rtf files, you must have version 2.14.2 or newer. Running this script will install the correct version for you.)
Installation instructions for these system dependencies vary by operating system. For details, follow the preceding links or see your operating system’s documentation.
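After installing the system dependencies, a quick check can confirm that they are discoverable and that pandoc meets the 2.14.2 minimum for .rtf files. The following is a sketch under stated assumptions rather than an official check: the executable names (pdftoppm for poppler-utils, tesseract, soffice for libreoffice, and pandoc) and the library name magic are assumptions about how these packages are typically exposed, and the function names are hypothetical.

```python
import ctypes.util
import re
import shutil
import subprocess

# Assumed executable name for each system package (not stated in the text above).
EXECUTABLES = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}

# Minimum pandoc version required for .rtf files.
MIN_PANDOC = (2, 14, 2)


def missing_dependencies(which=shutil.which,
                         find_library=ctypes.util.find_library):
    """Return the sorted names of packages that could not be located."""
    missing = [pkg for pkg, exe in EXECUTABLES.items() if which(exe) is None]
    # libmagic is a shared library, so look it up as a library, not on PATH.
    if find_library("magic") is None:
        missing.append("libmagic-dev")
    return sorted(missing)


def parse_pandoc_version(version_output):
    """Extract a version tuple from the output of `pandoc --version`."""
    match = re.search(r"pandoc(?:\.exe)?\s+(\d+(?:\.\d+)*)", version_output)
    if match is None:
        raise ValueError("could not find a pandoc version number")
    return tuple(int(part) for part in match.group(1).split("."))


def pandoc_supports_rtf(version_output):
    """Return True if the reported pandoc version is 2.14.2 or newer."""
    return parse_pandoc_version(version_output) >= MIN_PANDOC


if __name__ == "__main__":
    print("missing:", missing_dependencies())
    if shutil.which("pandoc"):
        result = subprocess.run(["pandoc", "--version"],
                                capture_output=True, text=True, check=True)
        print("pandoc OK for .rtf:", pandoc_supports_rtf(result.stdout))
```

The lookup functions are passed in as parameters so that the check can be exercised without the tools installed. A package reported as missing may still be installed under a different executable name on your operating system.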