Connect Reddit to your preprocessing pipeline and use unstructured-ingest to batch process your documents, storing the structured outputs locally on your filesystem.

First, install the Reddit dependencies:

pip install "unstructured-ingest[reddit]"

You must provide:

  • A client ID and a client secret to authenticate yourself. Learn how to get them here.
  • user-agent: The user agent request header to use when calling the Reddit API.
  • subreddit-name: The name of a subreddit, without the “r/” prefix, e.g. “machinelearning”.

Optionally you can choose to specify:

  • search-query: If set, return posts using this query. Otherwise, use hot posts.
  • num-posts: If set, limits the number of posts to pull in.

#!/usr/bin/env bash

unstructured-ingest \
  reddit \
    --subreddit-name machinelearning \
    --client-id "$REDDIT_CLIENT_ID" \
    --client-secret "$REDDIT_CLIENT_SECRET" \
    --user-agent "Unstructured Ingest Subreddit fetcher by /u/..." \
    --search-query "Unstructured" \
    --num-posts 10 \
    --output-dir "$LOCAL_FILE_OUTPUT_DIR" \
    --num-processes 2 \
    --verbose \
    --strategy hi_res
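
The command above reads the client ID, client secret, and output directory from environment variables. As a minimal sketch, assuming you want the script to stop early when one of them is unset, a check like the following could run before the unstructured-ingest call. The require_vars helper is hypothetical and is not part of unstructured-ingest.

```shell
#!/usr/bin/env bash

# Hypothetical helper, not part of unstructured-ingest: report every named
# variable that is unset or empty, and return non-zero if any are missing.
require_vars() {
  local missing=0 name value
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, a line such as the following, placed before the unstructured-ingest call, would stop the script when a value is missing:

require_vars REDDIT_CLIENT_ID REDDIT_CLIENT_SECRET LOCAL_FILE_OUTPUT_DIR || exit 1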

For a full list of the options the Unstructured Ingest CLI accepts, run unstructured-ingest reddit --help.