> ## Documentation Index
> Fetch the complete documentation index at: https://docs.unstructured.io/llms.txt
> Use this file to discover all available pages before exploring further.

# PostgreSQL

Connect PostgreSQL to your preprocessing pipeline, and use the Unstructured Ingest CLI or the Unstructured Ingest Python library to batch process all your documents and store structured outputs locally on your filesystem.

The requirements are as follows.

* For the [Unstructured UI](/ui/overview) or the [Unstructured API](/api-reference/overview), local PostgreSQL installations are not supported.
* For [Unstructured Ingest](/open-source/ingestion/overview), local and non-local PostgreSQL installations are supported.

The following video shows how to set up [Amazon RDS for PostgreSQL](https://aws.amazon.com/rds/postgresql/):

<iframe width="560" height="315" src="https://www.youtube.com/embed/QuIlEimejDs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

The following video shows how to set up [Azure Database for PostgreSQL](https://azure.microsoft.com/products/postgresql):

<iframe width="560" height="315" src="https://www.youtube.com/embed/6lvtBUFI7eQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

* A PostgreSQL instance.

  * [Create an Amazon RDS for PostgreSQL instance](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_GettingStarted.CreatingConnecting.PostgreSQL.html).
  * [Create an Azure Database for PostgreSQL server](https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/how-to-deploy-on-azure-free-account).
  * [Install PostgreSQL locally](https://www.postgresql.org/docs/current/tutorial-install.html).

* The host name and port number for the instance.

  * For Amazon RDS for PostgreSQL, learn how to [get the host name and port number](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ConnectToPostgreSQLInstance.html#postgresql-endpoint).
  * For Azure Database for PostgreSQL, learn how to [get the host](https://learn.microsoft.com/azure/postgresql/flexible-server/quickstart-create-server#get-the-connection-information). The port number is `5432`.
  * For local PostgreSQL installations, these values are in the `postgresql.conf` file's `listen_addresses` and `port` settings. This file should be on the same machine as the instance. These values might also already be set as environment variables named `PGHOST` and `PGPORT` on the same machine as the instance.
  * For other installation types, see your PostgreSQL provider's documentation.

* Depending on your network security requirements, you might need to allow access to your instance only from specific IP addresses.

  To get Unstructured's IP address ranges, go to
  [https://assets.p6m.u10d.net/publicitems/ip-prefixes.json](https://assets.p6m.u10d.net/publicitems/ip-prefixes.json)
  and allow all of the `ip_prefix` fields' values that are listed.

  <Note>These IP address ranges are subject to change. You can always find the latest ones in the preceding file.</Note>

  To learn how to allow these IP address ranges, see your PostgreSQL provider's documentation, for example with
  [Amazon RDS for PostgreSQL](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.RDSSecurityGroups.html) or
  [Azure Database for PostgreSQL](https://learn.microsoft.com/azure/postgresql/flexible-server/how-to-manage-firewall-portal#create-a-firewall-rule-after-server-is-created).

  <Note>
    For Amazon RDS for PostgreSQL, Amazon recommends that you set the instance's **Public access** setting to **No** by default, as this
    approach is more secure. This means that no
    resources can connect to the instance outside of the instance's associated Virtual Private Cloud (VPC) without extra configuration.
    [Learn more](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_VPC.WorkingWithRDSInstanceinaVPC.html#USER_VPC.Hiding).
    [Access an Amazon RDS instance in a VPC](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_VPC.Scenarios.html).

    If you must enable public access, set the instance's **Public access** setting to **Yes**, and then adjust the instance's related
    security group to allow this access.
    [Learn how](https://repost.aws/en/questions/QUxemKa9u5TV6CmLiO-r5prg/lost-public-access-to-aws-rds-postgresql-instance).

    [Troubleshoot issues with connecting to Amazon RDS instances](https://repost.aws/knowledge-center/rds-connectivity-instance-subnet-vpc).
  </Note>

* A database in the instance.

  * For Amazon RDS for PostgreSQL and Azure Database for PostgreSQL, the default database name is `postgres` unless a custom database name was specified during the instance creation process.
  * For local PostgreSQL installations, learn how to [create a database](https://www.postgresql.org/docs/current/tutorial-createdb.html).
  * For other installation types, see your PostgreSQL provider's documentation.

* A table in the database. Learn how to [create a table](https://www.postgresql.org/docs/current/tutorial-table.html).

  For the destination connector, the table must have a defined schema before Unstructured can write to the table. The minimum viable
  schema for Unstructured contains only the fields `id`, `element_id`, `record_id`, `text`, (and `embeddings`, if you are using `pgvector` and generating vector embeddings), as follows.
  `type` is an optional field, but highly recommended.

  If you are using `pgvector` and generating vector embeddings, the number of dimensions (in this example, `1536`) must match the number of dimensions for the associated embedding model that you use in any related Unstructured workflows or ingestion pipelines.

  <CodeGroup>
    ```sql PostgreSQL theme={null}
    CREATE TABLE elements (
        id UUID PRIMARY KEY,
        element_id TEXT,
        record_id TEXT,
        text TEXT,
        type TEXT
    );
    ```

    ```sql PostgreSQL with pgvector theme={null}
    CREATE EXTENSION vector;

    CREATE TABLE elements (
        id UUID PRIMARY KEY,
        element_id TEXT,
        record_id TEXT,
        text TEXT,
        type TEXT,
        embeddings vector(1536)
    );
    ```
  </CodeGroup>

  For objects in the `metadata` field that Unstructured produces and that you want to store in PostgreSQL, you must create fields in your table's schema that
  follows Unstructured's `metadata` field naming convention. For example, if Unstructured produces a `metadata` field with the following
  child objects:

  ```json  theme={null}
  "metadata": {
    "is_extracted": "true",
    "coordinates": {
      "points": [
        [
          134.20055555555555,
          241.36027777777795
        ],
        [
          134.20055555555555,
          420.0269444444447
        ],
        [
          529.7005555555555,
          420.0269444444447
        ],
        [
          529.7005555555555,
          241.36027777777795
        ]
      ],
      "system": "PixelSpace",
      "layout_width": 1654,
      "layout_height": 2339
    },
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 1,
    "image_mime_type": "image/jpeg",
    "filename": "realestate.pdf",
    "data_source": {
      "url": "file:///home/etl/node/downloads/00000000-0000-0000-0000-000000000001/7458635f-realestate.pdf",
      "record_locator": {
        "protocol": "file",
        "remote_file_path": "file:///home/etl/node/downloads/00000000-0000-0000-0000-000000000001/7458635f-realestate.pdf"
      }
    }
  }
  ```

  You could create corresponding fields in your table's schema by using the following field names and data types:

  <CodeGroup>
    ```sql PostgreSQL theme={null}
    -- The fields "id", "element_id", "record_id", and "text" are required.
    -- "type" is an optional field, but highly recommended.
    -- All other "metadata" fields are optional.
    CREATE TABLE elements (
        id UUID PRIMARY KEY,
        element_id TEXT,
        record_id TEXT,
        text TEXT,
        type TEXT,
        is_extracted TEXT,
        points JSONB,
        system TEXT,
        layout_width INTEGER,
        layout_height INTEGER,
        filetype TEXT,
        languages TEXT[],
        page_number TEXT,
        image_mime_type TEXT,
        url TEXT,
        record_locator JSONB
    );
    ```

    ```sql PostgreSQL with pgvector  theme={null}
    -- The fields "id", "element_id", "record_id", and "text" are required.
    -- "embeddings" is required if you are generating vector embeddings.
    --   If you are generating embeddings, the number of dimensions in "embeddings" 
    --   must match the number of dimensions for the associated embedding model 
    --   that you use in any related Unstructured workflows or ingestion pipelines.
    -- "type" is an optional field, but highly recommended.
    -- All other "metadata" fields are optional.
    CREATE EXTENSION vector;

    CREATE TABLE elements (
        id UUID PRIMARY KEY,
        element_id TEXT,
        record_id TEXT,
        text TEXT,
        type TEXT,
        embeddings vector(1536),
        is_extracted TEXT,
        points JSONB,
        system TEXT,
        layout_width INTEGER,
        layout_height INTEGER,
        filetype TEXT,
        languages TEXT[],
        page_number TEXT,
        image_mime_type TEXT,
        url TEXT,
        record_locator JSONB
    );
    ```
  </CodeGroup>

  Unstructured cannot provide a schema that is guaranteed to work in all
  circumstances. This is because these schemas will vary based on your source files' types; how you
  want Unstructured to partition, chunk, and generate embeddings; any custom post-processing code that you run; and other factors.

  See also:

  * [CREATE TABLE](https://www.postgresql.org/docs/current/sql-createtable.html) for PostgreSQL
  * [CREATE TABLE](https://github.com/pgvector/pgvector) for PostrgreSQL with pgvector
  * [Unstructured document elements and metadata](/api-reference/legacy-api/partition/document-elements)

  The following video shows how to use the `psql` utility to connect to PostgreSQL, list databases, and list and create tables:

  <iframe width="560" height="315" src="https://www.youtube.com/embed/IKo-4QHdNF4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

* A user in the database, and a password for the user.

  * For Amazon RDS for PostgreSQL, learn how to [create a user](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.Roles.html).
  * For Azure Database for PostgreSQL, learn how to [create a user](https://learn.microsoft.com/azure/postgresql/flexible-server/how-to-create-users).
  * For local PostgreSQL installations, learn how to [create a user](https://www.postgresql.org/docs/current/sql-createuser.html).
  * For other installation types, see your PostgreSQL provider's documentation.

* Database access for the user.

  * For Amazon RDS for PostgreSQL, learn how to [control user access](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.Access.html).
  * For Azure Database for PostgreSQL, learn how to [control user access](https://www.postgresql.org/docs/current/sql-createuser.html).
  * For local PostgreSQL installations, learn how to [give database access to a user](https://www.postgresql.org/docs/current/sql-grant.html).
  * For other installation types, see your PostgreSQL provider's documentation.

The PostgreSQL connector dependencies:

```bash CLI, Python theme={null}
pip install "unstructured-ingest[postgres]"
```

You might also need to install additional dependencies, depending on your needs. [Learn more](/open-source/ingestion/ingest-dependencies).

The following environment variables:

* `PGHOST` - The host name, represented by `--host` (CLI) or `host` (Python).
* `PGPORT` - The port number, represented by `--port` (CLI) or `port` (Python).
* `PGUSER` - The username, represented by `--username` (CLI) or `username` (Python).
* `PGPASSWORD` - The user's password, represented by `--password` (CLI) or `password` (Python).
* `PGDATABASE` - The name of the database, represented by `--database` (CLI) or `database` (Python).

Now call the Unstructured CLI or Python. The destination connector can be any of the ones supported. This example uses the local destination connector:

This example sends data to Unstructured for processing by default. To process files locally instead, see the instructions at the end of this page.

<CodeGroup>
  ```bash CLI theme={null}
  #!/usr/bin/env bash

  # Chunking and embedding are optional.

  unstructured-ingest \
    postgres \
      --host $PGHOST \
      --port $PGPORT \
      --database $PGDATABASE \
      --table-name elements \
      --id-column id \
      --username $PGUSER \
      --password $PGPASSWORD \
      --download-dir $LOCAL_FILE_DOWNLOAD_DIR \
      --chunking-strategy by_title \
      --embedding-provider huggingface \
      --output-dir $LOCAL_FILE_OUTPUT_DIR \
      --partition-by-api \
      --api-key $UNSTRUCTURED_API_KEY \
      --partition-endpoint $UNSTRUCTURED_API_URL
  ```

  ```python Python Ingest theme={null}
  import os

  from unstructured_ingest.pipeline.pipeline import Pipeline
  from unstructured_ingest.interfaces import ProcessorConfig

  from unstructured_ingest.processes.connectors.sql.postgres import (
      PostgresIndexerConfig,
      PostgresDownloaderConfig,
      PostgresConnectionConfig,
      PostgresAccessConfig
  )

  from unstructured_ingest.processes.partitioner import PartitionerConfig
  from unstructured_ingest.processes.chunker import ChunkerConfig
  from unstructured_ingest.processes.embedder import EmbedderConfig

  from unstructured_ingest.processes.connectors.local import (
      LocalConnectionConfig,
      LocalUploaderConfig
  )

  # Chunking and embedding are optional.

  if __name__ == "__main__":
      Pipeline.from_configs(
          context=ProcessorConfig(),
          indexer_config=PostgresIndexerConfig(
              table_name="elements",
              id_column="id"
          ),
          downloader_config=PostgresDownloaderConfig(download_dir=os.getenv("LOCAL_FILE_DOWNLOAD_DIR")),
          source_connection_config=PostgresConnectionConfig(
              access_config=PostgresAccessConfig(password=os.getenv("PGPASSWORD")),
              host=os.getenv("PGHOST"),
              port=os.getenv("PGPORT"),
              username=os.getenv("PGUSER"),
              database=os.getenv("PGDATABASE")
          ),
          partitioner_config=PartitionerConfig(
              partition_by_api=True,
              api_key=os.getenv("UNSTRUCTURED_API_KEY"),
              partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
              additional_partition_args={
                  "split_pdf_page": True,
                  "split_pdf_allow_failed": True,
                  "split_pdf_concurrency_level": 15
              }
          ),
          chunker_config=ChunkerConfig(chunking_strategy="by_title"),
          embedder_config=EmbedderConfig(embedding_provider="huggingface"),
          destination_connection_config=LocalConnectionConfig(),
          uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
      ).run()
  ```
</CodeGroup>

For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the `--partition-by-api` option (CLI) or `partition_by_api` (Python) parameter to specify where files are processed:

* To do local file processing, omit `--partition-by-api` (CLI) or `partition_by_api` (Python), or explicitly specify `partition_by_api=False` (Python).

  Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:

  * `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  * `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  * The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`

* To send files to the legacy [Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview) for processing, specify `--partition-by-api` (CLI) or `partition_by_api=True` (Python).

  Unstructured also requires an Unstructured API key and API URL, by adding the following:

  * `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  * `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  * The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`, representing your API key and API URL, respectively.

  <Note>
    You must specify the API URL only if you are not using the default API URL for Unstructured Ingest, which applies to **Let's Go**, **Pay-As-You-Go**, and **Business SaaS** accounts.

    The default API URL for Unstructured Ingest is `https://api.unstructuredapp.io/general/v0/general`, which is the API URL for the legacy[Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview). However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).

    If you do not have an API key, [get one now](/api-reference/legacy-api/partition/overview).

    If you are using a **Business** account, the process
    for generating Unstructured API keys, and the Unstructured API URL that you use, are different.
    For instructions, see your Unstructured account administrator, or email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).
  </Note>
