Connect GitHub to your preprocessing pipeline and use unstructured-ingest to batch process the files in a repository, storing the structured outputs locally on your filesystem.

First, install the GitHub dependencies:

pip install "unstructured-ingest[github]"

Provide the GitHub repo URL (url) to fetch the files from, either as a full URL such as "https://github.com/Unstructured-IO/unstructured" or in the short owner/repo form "Unstructured-IO/unstructured", and supply your GitHub access token (git-access-token). For details on creating a token, see the GitHub authentication documentation.

Optionally, specify a branch and what file types to limit the ingestion to:

  • git-branch: The branch to fetch files from. If not given, the default repository branch is used.
  • git-file-glob: A comma-separated list of file globs to limit which types of files are accepted, e.g. '*.html,*.txt'.

#!/usr/bin/env bash

unstructured-ingest \
  github \
    --url Unstructured-IO/unstructured \
    --git-branch main \
    --output-dir "$LOCAL_FILE_OUTPUT_DIR" \
    --num-processes 2 \
    --verbose \
    --strategy hi_res
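
The script above omits the access token and the file glob. As a minimal sketch of how these optional settings might be added, the variant below builds the argument list in a Bash array and appends --git-access-token and --git-file-glob only when the corresponding variables are set. The variable names GITHUB_ACCESS_TOKEN, GIT_FILE_GLOB, and RUN_INGEST, the helper build_github_args, and the fallback output directory are assumptions made for illustration; they are not part of the CLI.

```shell
#!/usr/bin/env bash

# Build the argument list for the github connector.
# GITHUB_ACCESS_TOKEN and GIT_FILE_GLOB are hypothetical variable names;
# each optional flag is added only when its variable is non-empty.
build_github_args() {
  GITHUB_ARGS=(
    github
    --url "Unstructured-IO/unstructured"
    --git-branch "main"
    # Fallback directory name is an assumption for this sketch.
    --output-dir "${LOCAL_FILE_OUTPUT_DIR:-github-ingest-output}"
  )
  if [ -n "${GITHUB_ACCESS_TOKEN:-}" ]; then
    GITHUB_ARGS+=(--git-access-token "$GITHUB_ACCESS_TOKEN")
  fi
  if [ -n "${GIT_FILE_GLOB:-}" ]; then
    GITHUB_ARGS+=(--git-file-glob "$GIT_FILE_GLOB")
  fi
}

build_github_args

# RUN_INGEST is a hypothetical guard so the script can be checked without
# starting a real ingest run. The arguments are not echoed, to avoid
# printing the token.
if [ "${RUN_INGEST:-0}" = "1" ]; then
  unstructured-ingest "${GITHUB_ARGS[@]}"
else
  echo "Prepared ${#GITHUB_ARGS[@]} arguments; set RUN_INGEST=1 to run unstructured-ingest."
fi
```

Keeping the arguments in an array preserves quoting, so a glob such as '*.html,*.txt' reaches the CLI as a single value instead of being expanded by the shell.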

For a full list of the options the Unstructured Ingest CLI accepts, run unstructured-ingest github --help.