Installation

  1. Installing the open source library: You can install the core SDK using pip:

    pip install unstructured

    Plain text files, HTML, XML, JSON, and Emails are immediately supported without any additional dependencies.

    If you need to process other document types, you can install the extras required by following the Full Installation

  2. System Dependencies: Ensure the subsequent system dependencies are installed. Your requirements might vary based on the document types you’re handling:

    • libmagic-dev : Essential for filetype detection.

    • poppler-utils : Needed for images and PDFs.

    • tesseract-ocr : Essential for images and PDFs.

    • libreoffice : For MS Office documents.

    • pandoc : For EPUBs, RTFs, and Open Office documents. Please note that to handle RTF files, you need version 2.14.2 or newer. Running this script will install the correct version for you.

Validating Installation

After installation, confirm the setup by executing the below Python code:

from unstructured.partition.auto import partition
elements = partition(filename="example-docs/eml/fake-email.eml")

If you’ve opted for the “local-inference” installation, you should also be able to execute:

from unstructured.partition.auto import partition
elements = partition("example-docs/pdf/layout-parser-paper.pdf")

If these code snippets run without errors, congratulations! Your unstructured installation is successful and ready for use.

The following section will cover basic concepts and usage patterns in unstructured. After reading this section, you should be able to:

  • Partitioning a document with the partition function.

  • Understand how documents are structured in unstructured.

  • Convert a document to a dictionary and/or save it as a JSON.

The example documents in this section come from the example-docs directory in the unstructured repo.

Before running the code in this make sure you’ve installed the unstructured library and all dependencies using the instructions in the quickstart section.

Partitioning a document

In this section, we’ll cut right to the chase and get to the most important part of the library: partitioning a document. The goal of document partitioning is to read in a source document, split the document into sections, categorize those sections, and extract the text associated with those sections. Depending on the document type, unstructured uses different methods for partitioning a document. We’ll cover those in a later section. For now, we’ll use the simplest API in the library, the partition function. The partition function will detect the filetype of the source document and route it to the appropriate partitioning function. You can try out the partition function by running the cell below.

from unstructured.partition.auto import partition

elements = partition(filename="example-10k.html")

You can also pass in a file as a file-like object using the following workflow:

with open("example-10k.html", "rb") as f:
    elements = partition(file=f)

The partition function uses libmagic for filetype detection. If libmagic is not present and the user passes a filename, partition falls back to detecting the filetype using the file extension. libmagic is required if you’d like to pass a file-like object to partition. We highly recommend installing libmagic and you may observe different file detection behaviors if libmagic is not installed`.

The partition methods and functions in the open source library process single files only. To process multiple files at a time, use the Unstructured Ingest CLI or the Unstructured Ingest Python library with their provided source connectors and destination connectors.

Quickstart Tutorial

If you’re eager to dive in, head over Getting Started on Google Colab to get a hands-on introduction to the unstructured library. In a few minutes, you’ll have a basic workflow set up and running!

For more detailed information about specific components or advanced features, explore the rest of the documentation.