Connect Delta Table to your preprocessing pipeline, and use unstructured-ingest to batch process all your documents and store the structured outputs locally on your filesystem.

Make sure to have the Delta Table dependencies installed:

Shell
pip install "unstructured-ingest[delta-table]"

AWS credentials need to be available for use with the storage options. Specify the URI to the DeltaTable using the table-uri argument, and pass a dictionary of the options to use for the storage backend via storage_options.

#!/usr/bin/env bash

unstructured-ingest \
  delta-table \
    --table-uri s3://utic-dev-tech-fixtures/sample-delta-lake-data/deltatable/ \
    --output-dir "$LOCAL_FILE_OUTPUT_DIR" \
    --storage_options "AWS_REGION=us-east-2,AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
    --verbose \
    --strategy hi_res
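
In the command above, the storage options are written as a single string of comma-separated KEY=VALUE pairs. A minimal sketch of assembling that string from separate pairs follows. The helper name build_storage_options and the STORAGE_OPTIONS variable are hypothetical and not part of unstructured-ingest, and the sketch assumes no value contains a comma.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of unstructured-ingest): join its arguments
# with commas. The function body runs in a subshell, so the IFS change does
# not leak into the calling script.
build_storage_options() (
  IFS=,
  printf '%s' "$*"
)

# Assemble the same pairs used in the command above. Unset credentials
# expand to empty strings here rather than raising an error.
STORAGE_OPTIONS="$(build_storage_options \
  "AWS_REGION=us-east-2" \
  "AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID:-}" \
  "AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY:-}")"
```

The resulting value can then be passed as --storage_options "$STORAGE_OPTIONS", which keeps each pair on its own line and easier to review.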

For a full list of the options the Unstructured Ingest CLI accepts, run unstructured-ingest delta-table --help.