Basic Usage

To install the extras for every supported document type, run:

pip install "unstructured[all-docs]"

unstructured also requires the following system dependencies: libmagic, poppler, libreoffice, pandoc, and tesseract. Installation steps for these dependencies vary by operating system. We recommend running unstructured from the officially supported Docker image, which already has them installed.
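
If you install the system dependencies yourself, a quick check that they are reachable can save debugging time later. The sketch below is illustrative only: the function name is hypothetical, and the executable names are assumptions that differ between operating systems and package managers. libmagic is a shared library rather than a command, so the `file` utility is used as a rough proxy for it.

```python
import shutil

# Assumed executable names for each system dependency. These are guesses
# that vary by platform and packaging, not names taken from unstructured.
ASSUMED_EXECUTABLES = {
    "libmagic": ["file"],  # proxy only: libmagic itself is a library
    "poppler": ["pdftoppm", "pdftotext"],
    "libreoffice": ["soffice", "libreoffice"],
    "pandoc": ["pandoc"],
    "tesseract": ["tesseract"],
}


def find_system_dependencies(which=shutil.which):
    """Map each dependency to the first matching executable path, or None."""
    found = {}
    for name, candidates in ASSUMED_EXECUTABLES.items():
        paths = (which(candidate) for candidate in candidates)
        found[name] = next((path for path in paths if path), None)
    return found


if __name__ == "__main__":
    for name, path in find_system_dependencies().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

A missing entry only means no executable with one of the assumed names is on PATH; the dependency may still be installed under another name or location.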

Installation for Specific Document Types

If you’re processing document types beyond the basics, you can install the necessary extras:

pip install "unstructured[docx,pptx]"

Available document types:

"csv", "doc", "docx", "epub", "image", "md", "msg", "odt", "org", "pdf", "ppt", "pptx", "rtf", "rst", "tsv", "xlsx"

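When the set of document types is decided in a script or a build step, the requirement string can be assembled and checked against the list above before it is passed to pip. This is a minimal sketch: the helper name `pip_requirement` is hypothetical and not part of unstructured.

```python
# Document types copied from the list above.
DOCUMENT_TYPES = {
    "csv", "doc", "docx", "epub", "image", "md", "msg", "odt",
    "org", "pdf", "ppt", "pptx", "rtf", "rst", "tsv", "xlsx",
}


def pip_requirement(doc_types):
    """Build a requirement string such as 'unstructured[docx,pptx]'."""
    requested = set(doc_types)
    unknown = sorted(requested - DOCUMENT_TYPES)
    if unknown:
        raise ValueError(f"unknown document types: {', '.join(unknown)}")
    if not requested:
        return "unstructured"
    # Sort so the same selection always yields the same string.
    return f"unstructured[{','.join(sorted(requested))}]"


if __name__ == "__main__":
    print(pip_requirement(["pptx", "docx"]))
```

Rejecting unknown names up front surfaces a typo in an extra name when the string is built rather than at install time.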
Installation for Specific Data Connectors

To use a data connector, install the matching extra of the unstructured-ingest package:

pip install "unstructured-ingest[s3]"

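The command above installs a single connector. As a sketch of how several might be requested at once from a setup script, the helper below builds the pip command for the current interpreter without running it. The function name `pip_install_command` is hypothetical, and combining extras with commas relies on standard pip syntax, the same form used for document types.

```python
import shlex
import sys


def pip_install_command(connectors, package="unstructured-ingest"):
    """Return the argv list that installs `package` with the given extras."""
    extras = ",".join(sorted(set(connectors)))
    if not extras:
        raise ValueError("name at least one connector")
    # `sys.executable -m pip` targets the interpreter running this script.
    return [sys.executable, "-m", "pip", "install", f"{package}[{extras}]"]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(shlex.join(pip_install_command(["s3", "gcs"])))
```

To run the command rather than print it, the returned list can be passed to `subprocess.run(..., check=True)`.
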
Available data connectors:

"airtable", "azure", "azure-cognitive-search", "biomed", "box", "confluence", "couchbase", "delta-table", "discord", "dropbox", "elasticsearch", "gcs", "github", "gitlab", "google-drive", "jira", "mongodb", "notion", "opensearch", "onedrive", "outlook", "reddit", "s3", "sharepoint", "salesforce", "slack", "wikipedia"