The following information applies only to the Unstructured Ingest CLI and the Unstructured Ingest Python library.

The Unstructured SDKs for Python and JavaScript/TypeScript and the Unstructured open-source library do not support this functionality.

Task

You want to process only files with specified extensions, only files at or below a specified size, or both.

Approach

For the Ingest CLI, use the following command options. For the Ingest Python library, use the following parameters for the FiltererConfig object.

  • Use --file-glob (CLI) or file_glob (Python) to specify which files to process by extension, as a list of glob patterns such as *.pdf.
  • Use --max-file-size (CLI) or max_file_size (Python) to specify the maximum size of files to process, in bytes.
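To make the combined effect of the two options concrete, here is a minimal sketch of the selection logic in plain Python. It is an illustration only and not the Ingest library's implementation: the function name would_process is hypothetical, and the sketch assumes that patterns are shell-style globs matched against the file name and that a file exactly at the size limit is still processed.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Optional, Sequence


def would_process(
    path: Path,
    size_bytes: int,
    file_glob: Optional[Sequence[str]] = None,
    max_file_size: Optional[int] = None,
) -> bool:
    """Hypothetical helper: return True if a file passes both filters.

    Assumption: a filter that is not set (None or empty) lets every file through.
    """
    # Extension filter: the file name must match at least one glob pattern.
    if file_glob and not any(fnmatch(path.name, pattern) for pattern in file_glob):
        return False
    # Size filter: files larger than the limit (in bytes) are skipped.
    if max_file_size is not None and size_bytes > max_file_size:
        return False
    return True


# Made-up file names and sizes, for illustration only.
candidates = [
    (Path("report.pdf"), 80_000),
    (Path("message.eml"), 250_000),
    (Path("notes.txt"), 1_000),
]
selected = [
    p.name
    for p, size in candidates
    if would_process(p, size, file_glob=["*.pdf", "*.eml"], max_file_size=100_000)
]
print(selected)  # ['report.pdf']
```

When both filters are set, a file must satisfy both: message.eml matches a pattern but is too large, and notes.txt is small enough but matches no pattern.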

To run this example

The following example processes only .pdf and .eml files that are 100 KB or smaller. To run it, you need a directory that contains a mixture of file types, including at least one .pdf file and one .eml file, with at least one of those files being 100 KB or smaller.
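Before running the example, it can help to confirm that the input directory meets these prerequisites. The following sketch uses only the Python standard library. The function name check_sample_dir is hypothetical, and the sketch assumes that 100 KB means 100,000 bytes; adjust the constant if you count 1 KB as 1,024 bytes.

```python
from pathlib import Path

# Assumption: 100 KB is taken as 100,000 bytes (decimal kilobytes).
MAX_BYTES = 100_000


def check_sample_dir(directory: Path, max_bytes: int = MAX_BYTES) -> dict:
    """Hypothetical helper: report whether a directory suits the example."""
    # Collect every regular file, including those in subdirectories.
    files = [p for p in directory.rglob("*") if p.is_file()]
    pdf_or_eml = [p for p in files if p.suffix.lower() in {".pdf", ".eml"}]
    return {
        "has_pdf": any(p.suffix.lower() == ".pdf" for p in files),
        "has_eml": any(p.suffix.lower() == ".eml" for p in files),
        # At least one .pdf or .eml file must be at or below the size limit.
        "has_small_match": any(p.stat().st_size <= max_bytes for p in pdf_or_eml),
    }


# Example call (replace the path with your own input directory):
# print(check_sample_dir(Path("path/to/your/files")))
```

If all three values are True, the directory contains what the example expects.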

Code