> ## Documentation Index
> Fetch the complete documentation index at: https://docs.unstructured.io/llms.txt
> Use this file to discover all available pages before exploring further.

# PostgreSQL

Batch process all your records to store structured outputs in a PostgreSQL schema.

Insert query is currently limited to append.

The requirements are as follows.

* For the [Unstructured Pipelines](/pipelines/overview) or the [Unstructured API](/api-reference/overview), local PostgreSQL installations are not supported.
* For [Unstructured Ingest](/open-source/ingestion/overview), local and non-local PostgreSQL installations are supported.

The following video shows how to set up [Amazon RDS for PostgreSQL](https://aws.amazon.com/rds/postgresql/):

<iframe width="560" height="315" src="https://www.youtube.com/embed/QuIlEimejDs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

The following video shows how to set up [Azure Database for PostgreSQL](https://azure.microsoft.com/products/postgresql):

<iframe width="560" height="315" src="https://www.youtube.com/embed/6lvtBUFI7eQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

* A PostgreSQL instance.

  * [Create an Amazon RDS for PostgreSQL instance](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_GettingStarted.CreatingConnecting.PostgreSQL.html).
  * [Create an Azure Database for PostgreSQL server](https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/how-to-deploy-on-azure-free-account).
  * [Install PostgreSQL locally](https://www.postgresql.org/docs/current/tutorial-install.html).

* The host name and port number for the instance.

  * For Amazon RDS for PostgreSQL, learn how to [get the host name and port number](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ConnectToPostgreSQLInstance.html#postgresql-endpoint).
  * For Azure Database for PostgreSQL, learn how to [get the host](https://learn.microsoft.com/azure/postgresql/flexible-server/quickstart-create-server#get-the-connection-information). The port number is `5432`.
  * For local PostgreSQL installations, these values are in the `postgresql.conf` file's `listen_addresses` and `port` settings. This file should be on the same machine as the instance. These values might also already be set as environment variables named `PGHOST` and `PGPORT` on the same machine as the instance.
  * For other installation types, see your PostgreSQL provider's documentation.

* Depending on your network security requirements, you might need to allow access to your instance only from specific IP addresses.

  To get Unstructured's IP address ranges, go to
  [https://assets.p6m.u10d.net/publicitems/ip-prefixes.json](https://assets.p6m.u10d.net/publicitems/ip-prefixes.json)
  and allow all of the `ip_prefix` fields' values that are listed.

  <Note>These IP address ranges are subject to change. You can always find the latest ones in the preceding file.</Note>

  To learn how to allow these IP address ranges, see your PostgreSQL provider's documentation, for example with
  [Amazon RDS for PostgreSQL](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.RDSSecurityGroups.html) or
  [Azure Database for PostgreSQL](https://learn.microsoft.com/azure/postgresql/flexible-server/how-to-manage-firewall-portal#create-a-firewall-rule-after-server-is-created).

  <Note>
    For Amazon RDS for PostgreSQL, Amazon recommends that you set the instance's **Public access** setting to **No** by default, as this
    approach is more secure. This means that no
    resources can connect to the instance outside of the instance's associated Virtual Private Cloud (VPC) without extra configuration.
    [Learn more](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_VPC.WorkingWithRDSInstanceinaVPC.html#USER_VPC.Hiding).
    [Access an Amazon RDS instance in a VPC](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_VPC.Scenarios.html).

    If you must enable public access, set the instance's **Public access** setting to **Yes**, and then adjust the instance's related
    security group to allow this access.
    [Learn how](https://repost.aws/en/questions/QUxemKa9u5TV6CmLiO-r5prg/lost-public-access-to-aws-rds-postgresql-instance).

    [Troubleshoot issues with connecting to Amazon RDS instances](https://repost.aws/knowledge-center/rds-connectivity-instance-subnet-vpc).
  </Note>

* A database in the instance.

  * For Amazon RDS for PostgreSQL and Azure Database for PostgreSQL, the default database name is `postgres` unless a custom database name was specified during the instance creation process.
  * For local PostgreSQL installations, learn how to [create a database](https://www.postgresql.org/docs/current/tutorial-createdb.html).
  * For other installation types, see your PostgreSQL provider's documentation.

* A table in the database. Learn how to [create a table](https://www.postgresql.org/docs/current/tutorial-table.html).

  For the destination connector, the table must have a defined schema before Unstructured can write to the table. The minimum viable
  schema for Unstructured contains only the fields `id`, `element_id`, `record_id`, `text`, (and `embeddings`, if you are using `pgvector` and generating vector embeddings), as follows.
  `type` is an optional field, but highly recommended.

  If you are using `pgvector` and generating vector embeddings, the number of dimensions (in this example, `1536`) must match the number of dimensions for the associated embedding model that you use in any related Unstructured workflows or ingestion pipelines.

  <CodeGroup>
    ```sql PostgreSQL theme={null}
    CREATE TABLE elements (
        id UUID PRIMARY KEY,
        element_id TEXT,
        record_id TEXT,
        text TEXT,
        type TEXT
    );
    ```

    ```sql PostgreSQL with pgvector theme={null}
    CREATE EXTENSION vector;

    CREATE TABLE elements (
        id UUID PRIMARY KEY,
        element_id TEXT,
        record_id TEXT,
        text TEXT,
        type TEXT,
        embeddings vector(1536)
    );
    ```
  </CodeGroup>

  <Info>
    The `record_id`, `element_id`, and `id` fields are closely related, but each has a distinct purpose. For more information, see [How connectors use record IDs, element IDs, and IDs](/api-reference/record-element-id).
  </Info>

  For objects in the `metadata` field that Unstructured produces and that you want to store in PostgreSQL, you must create fields in your table's schema that
  follows Unstructured's `metadata` field naming convention. For example, if Unstructured produces a `metadata` field with the following
  child objects:

  ```json theme={null}
  "metadata": {
    "is_extracted": "true",
    "coordinates": {
      "points": [
        [
          134.20055555555555,
          241.36027777777795
        ],
        [
          134.20055555555555,
          420.0269444444447
        ],
        [
          529.7005555555555,
          420.0269444444447
        ],
        [
          529.7005555555555,
          241.36027777777795
        ]
      ],
      "system": "PixelSpace",
      "layout_width": 1654,
      "layout_height": 2339
    },
    "filetype": "application/pdf",
    "languages": [
      "eng"
    ],
    "page_number": 1,
    "image_mime_type": "image/jpeg",
    "filename": "realestate.pdf",
    "data_source": {
      "url": "file:///home/etl/node/downloads/00000000-0000-0000-0000-000000000001/7458635f-realestate.pdf",
      "record_locator": {
        "protocol": "file",
        "remote_file_path": "file:///home/etl/node/downloads/00000000-0000-0000-0000-000000000001/7458635f-realestate.pdf"
      }
    }
  }
  ```

  You could create corresponding fields in your table's schema by using the following field names and data types:

  <CodeGroup>
    ```sql PostgreSQL theme={null}
    -- The fields "id", "element_id", "record_id", and "text" are required.
    -- "type" is an optional field, but highly recommended.
    -- All other "metadata" fields are optional.
    CREATE TABLE elements (
        id UUID PRIMARY KEY,
        element_id TEXT,
        record_id TEXT,
        text TEXT,
        type TEXT,
        is_extracted TEXT,
        points JSONB,
        system TEXT,
        layout_width INTEGER,
        layout_height INTEGER,
        filetype TEXT,
        languages TEXT[],
        page_number TEXT,
        image_mime_type TEXT,
        url TEXT,
        record_locator JSONB
    );
    ```

    ```sql PostgreSQL with pgvector  theme={null}
    -- The fields "id", "element_id", "record_id", and "text" are required.
    -- "embeddings" is required if you are generating vector embeddings.
    --   If you are generating embeddings, the number of dimensions in "embeddings" 
    --   must match the number of dimensions for the associated embedding model 
    --   that you use in any related Unstructured workflows or ingestion pipelines.
    -- "type" is an optional field, but highly recommended.
    -- All other "metadata" fields are optional.
    CREATE EXTENSION vector;

    CREATE TABLE elements (
        id UUID PRIMARY KEY,
        element_id TEXT,
        record_id TEXT,
        text TEXT,
        type TEXT,
        embeddings vector(1536),
        is_extracted TEXT,
        points JSONB,
        system TEXT,
        layout_width INTEGER,
        layout_height INTEGER,
        filetype TEXT,
        languages TEXT[],
        page_number TEXT,
        image_mime_type TEXT,
        url TEXT,
        record_locator JSONB
    );
    ```
  </CodeGroup>

  Unstructured cannot provide a schema that is guaranteed to work in all
  circumstances. This is because these schemas will vary based on your source files' types; how you
  want Unstructured to partition, chunk, and generate embeddings; any custom post-processing code that you run; and other factors.

  See also:

  * [CREATE TABLE](https://www.postgresql.org/docs/current/sql-createtable.html) for PostgreSQL
  * [CREATE TABLE](https://github.com/pgvector/pgvector) for PostrgreSQL with pgvector
  * [Unstructured document elements and metadata](/api-reference/legacy-api/partition/document-elements)

  The following video shows how to use the `psql` utility to connect to PostgreSQL, list databases, and list and create tables:

  <iframe width="560" height="315" src="https://www.youtube.com/embed/IKo-4QHdNF4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

* A user in the database, and a password for the user.

  * For Amazon RDS for PostgreSQL, learn how to [create a user](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.Roles.html).
  * For Azure Database for PostgreSQL, learn how to [create a user](https://learn.microsoft.com/azure/postgresql/flexible-server/how-to-create-users).
  * For local PostgreSQL installations, learn how to [create a user](https://www.postgresql.org/docs/current/sql-createuser.html).
  * For other installation types, see your PostgreSQL provider's documentation.

* Database access for the user.

  * For Amazon RDS for PostgreSQL, learn how to [control user access](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.Access.html).
  * For Azure Database for PostgreSQL, learn how to [control user access](https://www.postgresql.org/docs/current/sql-createuser.html).
  * For local PostgreSQL installations, learn how to [give database access to a user](https://www.postgresql.org/docs/current/sql-grant.html).
  * For other installation types, see your PostgreSQL provider's documentation.

The PostgreSQL connector dependencies:

```bash CLI, Python theme={null}
pip install "unstructured-ingest[postgres]"
```

You might also need to install additional dependencies, depending on your needs. [Learn more](/open-source/ingestion/ingest-dependencies).

The following environment variables:

* `PGHOST` - The host name, represented by `--host` (CLI) or `host` (Python).
* `PGPORT` - The port number, represented by `--port` (CLI) or `port` (Python).
* `PGUSER` - The username, represented by `--username` (CLI) or `username` (Python).
* `PGPASSWORD` - The user's password, represented by `--password` (CLI) or `password` (Python).
* `PGDATABASE` - The name of the database, represented by `--database` (CLI) or `database` (Python).

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. The source connector can be any of the ones supported. This example uses the local source connector:

This example sends files to Unstructured for processing by default. To process files locally instead, see the instructions at the end of this page.

<CodeGroup>
  ```bash CLI theme={null}
  #!/usr/bin/env bash

  # Specify which fields to output in the processed data. This can help prevent
  # database record insert issues, where a particular field in the processed data
  # does not match a column in the database table on insert.
  metadata_includes="id,element_id,text,embeddings,type,system,layout_width,\
  layout_height,points,url,version,date_created,date_modified,date_processed,\
  permissions_data,record_locator,category_depth,parent_id,attached_filename,\
  filetype,last_modified,file_directory,filename,languages,page_number,links,\
  page_name,link_urls,link_texts,sent_from,sent_to,subject,section,\
  header_footer_type,emphasized_text_contents,emphasized_text_tags,\
  text_as_html,regex_metadata,detection_class_prob"

  unstructured-ingest \
    local \
      --input-path $LOCAL_FILE_INPUT_DIR \
      --output-dir $LOCAL_FILE_OUTPUT_DIR \
      --num-processes 2 \
      --verbose \
      --strategy hi_res \
      --partition-by-api \
      --api-key $UNSTRUCTURED_API_KEY \
      --partition-endpoint $UNSTRUCTURED_API_URL \
      --metadata-include "$metadata_includes" \
      --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
    postgres \
      --username $PGUSER \
      --password $PGPASSWORD \
      --host $PGHOST \
      --port $PGPORT \
      --database $PGDATABASE
  ```

  ```python Python Ingest theme={null}
  import os

  from unstructured_ingest.pipeline.pipeline import Pipeline
  from unstructured_ingest.interfaces import ProcessorConfig

  from unstructured_ingest.processes.connectors.sql.postgres import (
      PostgresConnectionConfig,
      PostgresAccessConfig,
      PostgresUploaderConfig,
      PostgresUploadStagerConfig
  )
  from unstructured_ingest.processes.connectors.local import (
      LocalIndexerConfig,
      LocalDownloaderConfig,
      LocalConnectionConfig
  )
  from unstructured_ingest.processes.partitioner import PartitionerConfig
  from unstructured_ingest.processes.chunker import ChunkerConfig
  from unstructured_ingest.processes.embedder import EmbedderConfig

  # Chunking and embedding are optional.

  if __name__ == "__main__":
      # Specify which fields to output in the processed data. This can help prevent
      # database record insert issues, where a particular field in the processed data
      # does not match a column in the database table on insert.
      metadata_includes = [
          "id", "element_id", "text", "embeddings", "type", "system", "layout_width",
          "layout_height", "points", "url", "version", "date_created", "date_modified",
          "date_processed", "permissions_data", "record_locator", "category_depth",
          "parent_id", "attached_filename", "filetype", "last_modified", "file_directory",
          "filename", "languages", "page_number", "links", "page_name", "link_urls",
          "link_texts", "sent_from", "sent_to", "subject", "section", "header_footer_type",
          "emphasized_text_contents", "emphasized_text_tags", "text_as_html", "regex_metadata",
          "detection_class_prob"
      ]

      Pipeline.from_configs(
          context=ProcessorConfig(),
          indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
          downloader_config=LocalDownloaderConfig(),
          source_connection_config=LocalConnectionConfig(),
          partitioner_config=PartitionerConfig(
              partition_by_api=True,
              api_key=os.getenv("UNSTRUCTURED_API_KEY"),
              partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
              metadata_include=metadata_includes,
              additional_partition_args={
                  "split_pdf_page": True,
                  "split_pdf_allow_failed": True,
                  "split_pdf_concurrency_level": 15
              }
          ),
          chunker_config=ChunkerConfig(chunking_strategy="by_title"),
          embedder_config=EmbedderConfig(embedding_provider="huggingface"),
          destination_connection_config=PostgresConnectionConfig(
              access_config=PostgresAccessConfig(password=os.getenv("PGPASSWORD")),
              host=os.getenv("PGHOST"),
              port=os.getenv("PGPORT"),
              username=os.getenv("PGUSER"),
              database=os.getenv("PGDATABASE")
          ),
          stager_config=PostgresUploadStagerConfig(),
          uploader_config=PostgresUploaderConfig()
      ).run()
  ```
</CodeGroup>

For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the `--partition-by-api` option (CLI) or `partition_by_api` (Python) parameter to specify where files are processed:

* To do local file processing, omit `--partition-by-api` (CLI) or `partition_by_api` (Python), or explicitly specify `partition_by_api=False` (Python).

  Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:

  * `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  * `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  * The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`

* To send files to the legacy [Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview) for processing, specify `--partition-by-api` (CLI) or `partition_by_api=True` (Python).

  Unstructured also requires an Unstructured API key and API URL, by adding the following:

  * `--api-key $UNSTRUCTURED_API_KEY` (CLI) or `api_key=os.getenv("UNSTRUCTURED_API_KEY")` (Python)
  * `--partition-endpoint $UNSTRUCTURED_API_URL` (CLI) or `partition_endpoint=os.getenv("UNSTRUCTURED_API_URL")` (Python)
  * The environment variables `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL`, representing your API key and API URL, respectively.

  <Note>
    You must specify the API URL only if you are not using the default API URL for Unstructured Ingest, which applies to **Let's Go**, **Pay-As-You-Go**, and **Business SaaS** accounts.

    The default API URL for Unstructured Ingest is `https://api.unstructuredapp.io/general/v0/general`, which is the API URL for the legacy [Unstructured Partition Endpoint](/api-reference/legacy-api/partition/overview). However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).

    If you do not have an API key, [get one now](/api-reference/legacy-api/partition/overview).

    If you are using a **Business** account, the process
    for generating Unstructured API keys, and the Unstructured API URL that you use, are different.
    For instructions, see your Unstructured account administrator, or email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).
  </Note>