Google Cloud Storage
Connect Google Cloud Storage to your preprocessing pipeline, and use the Unstructured Ingest CLI or the Unstructured Ingest Python library to batch process all your documents and store structured outputs locally on your filesystem.
You will need:
The Google Cloud Storage prerequisites:
- A Google Cloud service account. See Create a service account.
- A service account key for the service account. See Create a service account key in Create and delete service account keys.
To ensure maximum compatibility across Unstructured service offerings, give the service account key information to Unstructured as a single-line string that contains the contents of the downloaded service account key file (and not the service account key file itself). To print this single-line string without line breaks, suitable for copying, you can run one of the following commands from your Terminal or Command Prompt. In the command, replace <path-to-downloaded-key-file> with the path to the service account key file that you downloaded by following the preceding instructions.
- For macOS or Linux:
- For Windows:
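As a platform-independent alternative, the single-line string can also be produced with a few lines of Python. This is a minimal sketch, not the command from the original instructions: the function name `key_file_to_single_line` is hypothetical, and it assumes the downloaded key file is valid JSON. It re-serializes the JSON compactly rather than only stripping line breaks, so the whitespace differs from the file but the content is equivalent.

```python
import json
from pathlib import Path


def key_file_to_single_line(path: str) -> str:
    """Return the contents of a JSON key file as a single line.

    Hypothetical helper for illustration; assumes the file holds valid JSON.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Compact separators drop the spaces json.dumps inserts by default.
    # Newlines inside string values (such as the private key) stay escaped
    # as "\n", so the result contains no literal line breaks.
    return json.dumps(data, separators=(",", ":"))
```

Calling `print(key_file_to_single_line("<path-to-downloaded-key-file>"))` prints a string that can be copied into `GCS_SERVICE_ACCOUNT_KEY`.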
- The URI for a Google Cloud Storage bucket. This URI consists of the target bucket name, plus any target folder within the bucket, expressed as gs://<bucket-name>[/folder-name]. See Create a bucket.
This bucket must have, at minimum, one of the following roles applied to the target Google Cloud service account:
  - Storage Object Viewer for bucket read access.
  - Storage Object Creator for bucket write access.
  - The Storage Object Admin role provides read and write access, plus access to additional bucket operations.
To apply one of these roles to a service account for a bucket, see Add a principal to a bucket-level policy in Set and manage IAM policies on buckets.
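To check a bucket URI before passing it to the connector, the gs://<bucket-name>[/folder-name] format described above can be validated with a small helper. This is an illustrative sketch only: `split_gcs_uri` is a hypothetical name and is not part of the Unstructured Ingest library.

```python
def split_gcs_uri(uri: str) -> tuple[str, str]:
    """Split gs://<bucket-name>[/folder-name] into (bucket, folder).

    Hypothetical helper for illustration. The folder is an empty string
    when the URI names only a bucket.
    """
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a URI starting with {prefix!r}: {uri!r}")
    # Everything before the first "/" is the bucket; the rest is the folder.
    bucket, _, folder = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in URI: {uri!r}")
    return bucket, folder.strip("/")
```

For example, `split_gcs_uri("gs://my-bucket/my-folder")` separates the bucket name from the folder, while a value without the `gs://` prefix raises `ValueError`.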
The Google Cloud Storage connector dependencies:
You might also need to install additional dependencies, depending on your needs. Learn more.
The following environment variables:
- GCS_SERVICE_ACCOUNT_KEY - The Google Cloud service account key for Google Cloud Storage, represented by --service-account-key (CLI) or service_account_key (Python).
- GCS_REMOTE_URL - The Google Cloud Storage bucket URL, represented by --remote-url (CLI) or remote_url (Python).
These environment variables:
- UNSTRUCTURED_API_KEY - Your Unstructured API key value.
- UNSTRUCTURED_API_URL - Your Unstructured API URL.
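Before running the pipeline, it can help to confirm that all four environment variables listed above are set. The following is a minimal sketch using only the Python standard library; the names `REQUIRED_VARS` and `missing_env_vars` are hypothetical and not part of any Unstructured package.

```python
import os

# The environment variables named in this page.
REQUIRED_VARS = (
    "GCS_SERVICE_ACCOUNT_KEY",
    "GCS_REMOTE_URL",
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_API_URL",
)


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration. Pass a dict to check something
    other than the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

An empty list from `missing_env_vars()` means every variable has a non-empty value. The check does not validate the values themselves.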
Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. You can use any supported destination connector. This example uses the local destination connector: