> ## Documentation Index
> Fetch the complete documentation index at: https://docs.unstructured.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Google Drive

Connect Google Drive to your preprocessing pipeline, and use the Unstructured Ingest CLI or the Unstructured Ingest Python library to batch process all your documents and store structured outputs locally on your filesystem.

The requirements are as follows.

<iframe width="560" height="315" src="https://www.youtube.com/embed/ZiRD7hC-wh8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

* A [Google Cloud account](https://console.cloud.google.com).

* The **Google Drive API** enabled in the account.
  [Learn how](https://cloud.google.com/apis/docs/getting-started#enabling_apis).

* Within the account, a Google Cloud service account and its related `credentials.json` key file or its contents in JSON format.
  [Create a service account](https://developers.google.com/workspace/guides/create-credentials#create_a_service_account).
  [Create credentials for a service account](https://developers.google.com/workspace/guides/create-credentials#create_credentials_for_a_service_account).

  To ensure maximum compatibility across Unstructured service offerings, you should give the service account key information to Unstructured as
  a single-line string that contains the contents of the downloaded service account key file (and not the service account key file itself).
  To print this single-line string without line breaks, suitable for copying, you can run one of the following commands from your Terminal or Command Prompt.
  In this command, replace `<path-to-downloaded-key-file>` with the path to the `credentials.json` key file that you downloaded by following the preceding instructions.

  * For macOS or Linux:

    ```text theme={null}
    tr -d '\n' < <path-to-downloaded-key-file>
    ```

  * For Windows:

    ```text theme={null}
    (Get-Content -Path "<path-to-downloaded-key-file>" -Raw).Replace("`r`n", "").Replace("`n", "")
    ```

* A Google Drive [shared folder](https://support.google.com/drive/answer/2375091) or [shared drive](https://support.google.com/a/users/answer/7212025).

* Give the service account access to the shared folder or shared drive. To do this, share the folder or drive with the service account's email address.
  [Learn how](https://support.google.com/drive/answer/7166529).
  [Learn more](https://www.googlecloudcommunity.com/gc/Workspace-Q-A/Can-i-give-access-to-document-of-google-drive-to-service-account/m-p/530106).

* Get the shared folder's ID or shared drive's ID. This is a part of the URL for your Google Drive shared folder or shared drive, represented in the following URL as `{folder_id}`: `https://drive.google.com/drive/folders/{folder-id}`.

## Document permissions metadata

The source connector outputs any permissions information that it can find in the source location about the processed source documents and associates that information with each
corresponding element that is generated. This permissions information is output into the `permissions_data` field, which is within the
`data_source` field under the element's `metadata` field. This information lists the users or groups, if any, that have
permissions to read, update, or delete the element's associated source document.

<Warning>
  The permissions metadata Unstructured outputs should not be used for runtime authorization
  or access control enforcement.

  Unstructured outputs document permissions metadata that is accurate only
  at the point in time when Unstructured ingested the corresponding document
  to which those permissions applied. Because this metadata is a
  point-in-time copy of the permissions in the source location, these
  metadata outputs that are sent to your destination location are not
  always guaranteed to match the current permissions in the source location.

  Also, be aware that Unstructured updates permission metadata for a document only when the document's content has changed.

  This is because Unstructured performs incremental processing of documents only when documents' *content* has changed—not when only the documents'
  *permissions* have changed. Whenever Unstructured performs incremental processing of documents for a
  workflow (in other words, if **Reprocess All Files** is turned off or set
  to false for a workflow), that worfklow will not output metadata for any
  document permissions that have been added, changed, or removed since the
  previous workflow run, *unless* the corresponding documents' content
  has also been changed since the previous workflow run.
</Warning>

The following example shows what the output looks like. Ellipses indicate content that has been omitted from this example for brevity.

```json theme={null}
[
    {
        "...": "...",
        "metadata": {
            "...": "...",
            "data_source": {
                "...": "...",
                "permissions_data": [
                    {
                        "read": {
                            "users": [
                                "11111111111111111111"
                            ],
                            "groups": [
                                "22222222222222222222",
                                "33333333333333333333"
                            ]
                        }
                    },
                    {
                        "update": {
                            "users": [
                                "44444444444444444444",
                                "55555555555555555555"
                            ],
                            "groups": [
                                "66666666666666666666",
                            ]
                        }
                    },
                    {
                        "delete": {
                            "users": [
                                "77777777777777777777"
                            ],
                            "groups": [
                                "88888888888888888888"
                            ]
                        }
                    }
                ],
                "...": "..."
            }
        }
    }
]
```

To look up information about a particular Google Cloud user, use the user's ID along with the
[Admin SDK API](https://developers.google.com/workspace/admin/directory/reference/rest/v1/users/get) or the
[People API](https://developers.google.com/people/api/rest/v1/people/get) for Google Cloud.

To look up information about a particular Google Cloud group, use the group's ID along with the
[Admin SDK API](https://developers.google.com/workspace/admin/directory/reference/rest/v1/groups/get) or the
[Cloud Identity API](https://cloud.google.com/identity/docs/reference/rest/v1/groups/get) for Google Cloud.

The Google Drive connector dependencies:

```bash CLI, Python theme={null}
pip install "unstructured-ingest[google-drive]"
```

You might also need to install additional dependencies, depending on your needs. [Learn more](/open-source/ingestion/ingest-dependencies).

The following environment variables:

* `GOOGLE_DRIVE_FOLDER_ID` - The folder or drive ID, represented by `--drive-id` (CLI) or `drive_id` (Python).
* One of the following:

  * `GCP_SERVICE_ACCOUNT_KEY_FILEPATH` - The path to the `credentials.json` key file, represented by `--service-account-key-path` (CLI) or `service_account_key_path` (Python).
  * `GCP_SERVICE_ACCOUNT_KEY_STRING` - The contents of the `credentials.json` key file as a string, represented by `--service-account-key` (CLI) or `service_account_key` (Python).

<Note>
  To use `--extensions` with a comma-separated list (CLI) or `extensions` with an array of strings (Python) to process files with only those extensions,
  do not include the leading dot in the file extensions. For example, use `pdf` or `docx` instead of `.pdf` or `.docx`.
</Note>

Now call the Unstructured CLI or Python. The destination connector can be any of the ones supported. This example uses the local destination connector.

This example sends data to Unstructured for processing by default. To process data locally instead, see the instructions at the end of this page.

<CodeGroup>
  ```bash CLI theme={null}
  #!/usr/bin/env bash

  # Chunking and embedding are optional.

  unstructured-ingest \
    google-drive \
      --download-dir $LOCAL_FILE_DOWNLOAD_DIR \
      --drive-id $GOOGLE_DRIVE_FOLDER_ID \
      --service-account-key-path $GCP_SERVICE_ACCOUNT_KEY_FILEPATH \ # Or
      --service-account-key $GCP_SERVICE_ACCOUNT_KEY_STRING \
      --partition-by-api \
      --api-key $UNSTRUCTURED_API_KEY \
      --partition-endpoint $UNSTRUCTURED_API_URL \
      --strategy hi_res \
      --chunking-strategy by_title \
      --embedding-provider huggingface \
      --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
    local \
      --output-dir $LOCAL_FILE_OUTPUT_DIR
  ```

  ```python Python Ingest theme={null}
  import os

  from unstructured_ingest.pipeline.pipeline import Pipeline
  from unstructured_ingest.interfaces import ProcessorConfig
  from unstructured_ingest.processes.connectors.google_drive import (
      GoogleDriveConnectionConfig,
      GoogleDriveAccessConfig,
      GoogleDriveIndexerConfig,
      GoogleDriveDownloaderConfig
  )
  from unstructured_ingest.processes.partitioner import PartitionerConfig
  from unstructured_ingest.processes.chunker import ChunkerConfig
  from unstructured_ingest.processes.embedder import EmbedderConfig
  from unstructured_ingest.processes.connectors.local import LocalUploaderConfig

  # Chunking and embedding are optional.

  if __name__ == "__main__":
      Pipeline.from_configs(
          context=ProcessorConfig(),
          indexer_config=GoogleDriveIndexerConfig(),
          downloader_config=GoogleDriveDownloaderConfig(download_dir=os.getenv("LOCAL_FILE_DOWNLOAD_DIR")),
          source_connection_config=GoogleDriveConnectionConfig(
              access_config=GoogleDriveAccessConfig(
                  service_account_key_path=os.getenv("GCP_SERVICE_ACCOUNT_KEY_FILEPATH"), # Or
                  service_account_key=os.getenv("GCP_SERVICE_ACCOUNT_KEY_STRING")
              ),
              drive_id=os.getenv("GOOGLE_DRIVE_FOLDER_ID"),
          ),
          partitioner_config=PartitionerConfig(
              partition_by_api=True,
              api_key=os.getenv("UNSTRUCTURED_API_KEY"),
              partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
              strategy="hi_res",
              additional_partition_args={
                  "split_pdf_page": True,
                  "split_pdf_allow_failed": True,
                  "split_pdf_concurrency_level": 15
              }
          ),
          chunker_config=ChunkerConfig(chunking_strategy="by_title"),
          embedder_config=EmbedderConfig(embedding_provider="huggingface"),
          uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
      ).run()
  ```
</CodeGroup>

For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the `--partition-by-api` option (CLI) or `partition_by_api` (Python) parameter to specify where files are processed:

* To do local file processing, omit `--partition-by-api` (CLI) or `partition_by_api` (Python), or explicitly specify `partition_by_api=False` (Python).

  Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:

  * `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  * `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  * The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`

* To send files to the legacy [Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview) for processing, specify `--partition-by-api` (CLI) or `partition_by_api=True` (Python).

  Unstructured also requires an Unstructured API key and API URL, by adding the following:

  * `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  * `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  * The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`, representing your API key and API URL, respectively.

  <Note>
    You must specify the API URL only if you are not using the default API URL for Unstructured Ingest, which applies to **Let's Go**, **Pay-As-You-Go**, and **Business SaaS** accounts.

    The default API URL for Unstructured Ingest is `https://api.unstructuredapp.io/general/v0/general`, which is the API URL for the legacy [Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview). However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).

    If you do not have an API key, [get one now](/api-reference/legacy-api/partition/overview).

    If you are using a **Business** account, the process
    for generating Unstructured API keys, and the Unstructured API URL that you use, are different.
    For instructions, see your Unstructured account administrator, or email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).
  </Note>

## Output a key file's contents as a string

If you need to convert the contents of a `credentials.json` file into a string, you could use a Python script such as the following.
This script takes the local path to the key file as input and outputs the key file's contents as a string.

```python Python theme={null}
# Filename: json_file_to_string.py

import json
import sys

def json_file_to_string(file_path):
    try:
        # Read the JSON file.
        with open(file_path, 'r') as file:
            # Load the JSON data.
            data = json.load(file)
        
        # Convert the JSON data back to a string, with no whitespace.
        json_string = json.dumps(data, separators=(',', ':'))
        
        return json_string
    
    except FileNotFoundError:
        print(f"Error: File '{file_path}' not found.")
        return None
    except json.JSONDecodeError:
        print(f"Error: '{file_path}' is not a valid JSON file.")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python json_file_to_string.py <path_to_json_file>")
        sys.exit(1)
    
    file_path = sys.argv[1]
    result = json_file_to_string(file_path)
    
    if result:
        print(result)
```