Connect GitHub to your preprocessing pipeline, and batch process all your documents using unstructured-ingest to store structured outputs locally on your filesystem.

First, install the GitHub dependencies as shown here.

pip install "unstructured-ingest[github]"

Provide the GitHub repo URL (url) to fetch the files from, e.g. "https://github.com/Unstructured-IO/unstructured" or "Unstructured-IO/unstructured". Also supply your GitHub access token (git-access-token); see GitHub's authentication documentation for details.

Optionally, specify a branch and what file types to limit the ingestion to:

  • git-branch: The branch to fetch files from. If not given, the default repository branch is used.
  • git-file-glob: A comma-separated list of file globs that limits which file types are accepted, e.g. '*.html,*.txt'.
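
As a rough sketch of how the optional settings above might be combined, the snippet below assembles the connector arguments in a Bash array and appends each optional flag only when a value is set. The variable names GITHUB_ACCESS_TOKEN, GIT_BRANCH, and GIT_FILE_GLOB and the helper build_github_args are hypothetical, and the flag spellings assume each parameter maps to a `--` option of the same name, as `--url` and `--git-branch` do in the example on this page. The command is printed for review rather than executed.

```shell
#!/usr/bin/env bash
# Sketch only: the variable names and the helper below are illustrative,
# not part of unstructured-ingest.

build_github_args() {
  local url="$1"
  GITHUB_ARGS=(github --url "$url")
  # Append each optional flag only when its value is non-empty.
  if [[ -n "${GITHUB_ACCESS_TOKEN:-}" ]]; then
    GITHUB_ARGS+=(--git-access-token "$GITHUB_ACCESS_TOKEN")
  fi
  if [[ -n "${GIT_BRANCH:-}" ]]; then
    GITHUB_ARGS+=(--git-branch "$GIT_BRANCH")
  fi
  if [[ -n "${GIT_FILE_GLOB:-}" ]]; then
    GITHUB_ARGS+=(--git-file-glob "$GIT_FILE_GLOB")
  fi
}

GIT_BRANCH="main"
GIT_FILE_GLOB="*.html,*.txt"
build_github_args "Unstructured-IO/unstructured"

# Print the assembled command instead of running it.
printf '%q ' unstructured-ingest "${GITHUB_ARGS[@]}"
printf '\n'
```

Keeping the glob in a quoted variable stops the shell from expanding '*.html,*.txt' against local files before the value reaches the CLI.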

Make sure to set the --partition-by-api flag and pass in your API key with --api-key:

#!/usr/bin/env bash

unstructured-ingest \
  github \
    --url Unstructured-IO/unstructured \
    --git-branch main \
    --output-dir "$LOCAL_FILE_OUTPUT_DIR" \
    --num-processes 2 \
    --verbose \
    --partition-by-api \
    --api-key "$UNSTRUCTURED_API_KEY" \
    --strategy hi_res

Additionally, if you’re using the Unstructured SaaS API, a locally deployed Unstructured API, or an Unstructured API deployed on Azure or AWS, you must also specify the API URL with the --partition-endpoint argument.
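
One way to wire this in, as a minimal sketch: add --partition-endpoint only when an API URL is configured. UNSTRUCTURED_API_URL, DRY_RUN, the helper partition_args, and the fallback output directory ./github-output are hypothetical names chosen for illustration; substitute the URL of your own deployment. By default the script prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Sketch only: UNSTRUCTURED_API_URL, DRY_RUN, and ./github-output are
# illustrative assumptions, not names defined by unstructured-ingest.

partition_args() {
  PARTITION_ARGS=(--partition-by-api --api-key "${UNSTRUCTURED_API_KEY:-}")
  # Add the endpoint flag only when an API URL has been provided.
  if [[ -n "${UNSTRUCTURED_API_URL:-}" ]]; then
    PARTITION_ARGS+=(--partition-endpoint "$UNSTRUCTURED_API_URL")
  fi
}

partition_args
CMD=(unstructured-ingest github
  --url Unstructured-IO/unstructured
  --output-dir "${LOCAL_FILE_OUTPUT_DIR:-./github-output}"
  "${PARTITION_ARGS[@]}")

if [[ "${DRY_RUN:-1}" == "1" ]]; then
  # Show the command without running it.
  printf '%q ' "${CMD[@]}"
  printf '\n'
else
  "${CMD[@]}"
fi
```

Set DRY_RUN=0 once the printed command looks right.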