Teradata - Unstructured

Batch process all your records to store structured outputs in Teradata. The requirements are as follows.

A Teradata Vantage system that can be accessed by its host name or IP address. For example, a Teradata Vantage system in Teradata ClearScape Analytics Experience includes:
- A Teradata ClearScape Analytics Experience account.
- An environment in the account.
- A Teradata Vantage database in the environment.
- The name and password for a Teradata user who has the appropriate access to the database.
Learn how to create these in Teradata ClearScape Analytics Experience.
The system’s corresponding host name or IP address. For example, you can get this value from Teradata ClearScape Analytics Experience as follows:
1. Sign in to your Teradata ClearScape Analytics Experience account.
2. On the sidebar, under Environments, click the name of the database’s corresponding environment.
3. Under Connection details for Vantage database, use the Host value.
The name of the target database in the system. To get a list of available databases in the system, you can run a Teradata SQL query such as the following:
SELECT DatabaseName FROM DBC.DatabasesV ORDER BY DatabaseName;
The name of the target table in the database. To get a list of available tables in a database, you can run a Teradata SQL query such as the following, replacing <database-name> with the name of the target database:
SELECT TableName FROM DBC.TablesV WHERE DatabaseName = '<database-name>' AND TableKind = 'T' ORDER BY TableName;
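The catalog queries above can also be run from Python. The following is a minimal sketch, not part of Unstructured. It assumes that the teradatasql driver is installed (pip install teradatasql) and that it accepts host, user, and password keyword arguments and ? parameter markers. The names first_column and list_tables are hypothetical helpers introduced only for illustration.

```python
"""Sketch: list tables through the Teradata catalog views."""

DATABASES_QUERY = "SELECT DatabaseName FROM DBC.DatabasesV ORDER BY DatabaseName"

# The database name is passed as a bound parameter (?) instead of
# being pasted into the SQL text.
TABLES_QUERY = (
    "SELECT TableName FROM DBC.TablesV "
    "WHERE DatabaseName = ? AND TableKind = 'T' "
    "ORDER BY TableName"
)


def first_column(rows):
    """Return the first value of every row as a whitespace-stripped string."""
    return [str(row[0]).strip() for row in rows]


def list_tables(host, user, password, database_name):
    """Hypothetical helper: return the table names in database_name.

    Assumption: the teradatasql package is available. It is imported
    lazily so that this module loads even without the driver installed.
    """
    import teradatasql  # assumption: pip install teradatasql

    with teradatasql.connect(host=host, user=user, password=password) as con:
        with con.cursor() as cur:
            cur.execute(TABLES_QUERY, [database_name])
            return first_column(cur.fetchall())
```

Calling list_tables requires a reachable Teradata Vantage system. The first_column helper strips whitespace defensively, on the assumption that catalog names can come back padded.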
When Unstructured writes rows to a table, the table’s columns must have a schema that is compatible with Unstructured. Unstructured cannot provide a schema that is guaranteed to work for everyone in all circumstances. This is because these schemas will vary based on your source files’ types; how you want Unstructured to partition, chunk, and generate embeddings; any custom post-processing code that you run; and other factors. In any case, note the following about table schemas:
- The following columns are always required by Unstructured: record_id and element_id.
- The following columns are optional for Unstructured, but highly recommended: text and type.
- The rest of the columns are optional and typically will be output by Unstructured as part of the metadata field.
- If Unstructured is generating vector embeddings, the embeddings column is also required.
  The destination connector outputs Unstructured-generated embeddings that are not directly compatible with Teradata Enterprise Vector Store. To use embeddings with Teradata Enterprise Vector Store, Unstructured recommends that you choose from among the following options:
  - Define a column in your target table named embeddings that is of type VARCHAR(64000), to store the Unstructured-generated embeddings. After Unstructured adds its embeddings to your embeddings column, choose from among Teradata’s options to convert the embeddings column’s VARCHAR values to the Teradata VECTOR Data Type yourself.
  - Omit any columns named embedding, message, or num_tokens from your target table. Then choose from among Teradata’s options (such as AI_TextEmbeddings) to have Teradata generate the embeddings for you, instead of having Unstructured generate them.
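If you choose the VARCHAR(64000) option, it can help to confirm that a serialized vector fits in the column before loading. The following is a rough sketch only. It assumes that the embeddings are stored as JSON-style array text; the exact serialization that the connector uses is not stated here, so treat the numbers as an estimate. The function names are hypothetical.

```python
import json

VARCHAR_LIMIT = 64000  # Matches the VARCHAR(64000) column described above.


def serialized_length(embedding):
    """Return the length in characters of the vector as a JSON array string.

    Assumption: JSON-style text; the connector's real format may differ.
    """
    return len(json.dumps(embedding))


def fits_in_column(embedding, limit=VARCHAR_LIMIT):
    """Return True if the serialized vector is within the column limit."""
    return serialized_length(embedding) <= limit
```

Vectors with many dimensions and long decimal expansions are the ones most likely to approach the limit, so the check is most useful with high-dimensional embedding models.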
Here is an example table schema that is compatible with Unstructured. It includes all of the required and recommended columns, as well as a few additional columns that are typically output by Unstructured as part of the metadata field. Be sure to replace <database-name> with the name of the target database and <table-name> with the name of the target table (by Unstructured convention, the table name is typically elements, but this is not a requirement).
CREATE SET TABLE "<database-name>"."<table-name>" (
    "id" VARCHAR(64) NOT NULL,
    PRIMARY KEY ("id"),
    "record_id" VARCHAR(64),
    "element_id" VARCHAR(64),
    "text" VARCHAR(32000) CHARACTER SET UNICODE,
    "type" VARCHAR(50),
    "embeddings" VARCHAR(64000), -- Add this column only if Unstructured is generating vector embeddings.
    "last_modified" VARCHAR(50),
    "languages" VARCHAR(200),
    "file_directory" VARCHAR(500),
    "filename" VARCHAR(255),
    "filetype" VARCHAR(50),
    "record_locator" VARCHAR(1000),
    "date_created" VARCHAR(50),
    "date_modified" VARCHAR(50),
    "date_processed" VARCHAR(50),
    "permissions_data" VARCHAR(1000),
    "filesize_bytes" INTEGER,
    "parent_id" VARCHAR(64)
)
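The column rules above (record_id and element_id required, text and type recommended, embeddings required only when Unstructured generates embeddings) can be checked against a table's column list before a run. The following is a small illustrative sketch; check_columns is a hypothetical helper, not an Unstructured function.

```python
REQUIRED = {"record_id", "element_id"}
RECOMMENDED = {"text", "type"}


def check_columns(columns, embeddings_enabled=False):
    """Report which required and recommended columns are absent.

    columns is an iterable of column names from the target table.
    Names are compared case-insensitively.
    """
    present = {name.lower() for name in columns}
    required = set(REQUIRED)
    if embeddings_enabled:
        # The embeddings column is needed only when Unstructured
        # generates vector embeddings.
        required.add("embeddings")
    return {
        "missing_required": sorted(required - present),
        "missing_recommended": sorted(RECOMMENDED - present),
    }
```

For example, check_columns(["id", "record_id", "text"], embeddings_enabled=True) reports element_id and embeddings as missing required columns and type as a missing recommended column.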
For the source connector, the name of the primary key column in the table (for example, a column named id, typically defined as "id" VARCHAR(64) NOT NULL, PRIMARY KEY ("id")).
For the source connector, the names of any specific columns to fetch from the table. By default, all columns are fetched unless otherwise specified.
For the destination connector, the name of the column in the table that uniquely identifies each record, which Unstructured uses to perform any necessary record updates. By convention, Unstructured expects this column to be named record_id.
The name of the Teradata user who has the appropriate access to the target database. For example, you can get this from Teradata ClearScape Analytics Experience as follows:
1. Sign in to your Teradata ClearScape Analytics Experience account.
2. On the sidebar, under Environments, click the name of the database’s corresponding environment.
3. Under Connection details for Vantage database, use the Username value.
The Teradata SQL command to get a list of available users is as follows:
SELECT UserName FROM DBC.UsersV ORDER BY UserName;
The password for the user, which was set up when the user was created. If the user has forgotten their password, the Teradata SQL command to change a user’s password is as follows, replacing <user-name> with the name of the user and <new-password> with the new password:
MODIFY USER <user-name> AS PASSWORD = <new-password>;
To change a user’s password, you must be an administrator (such as the DBC user or another user with DROP USER privileges).

The record_id, element_id, and id fields are closely related, but each has a distinct purpose. For more information, see How connectors use record IDs, element IDs, and IDs.

The Teradata connector dependencies:

CLI, Python

pip install "unstructured-ingest[teradata]"

You might also need to install additional dependencies, depending on your needs.

These environment variables:

TERADATA_HOST - The host name, represented by --host (CLI) or host (Python).
TERADATA_PORT - The port number, represented by --dbs-port (CLI) or dbs_port (Python). This is optional, and the default is 1025 if not otherwise specified.
TERADATA_USERNAME - The name of the user who has access to the database, represented by --user (CLI) or user (Python).
TERADATA_PASSWORD - The user’s password, represented by --password (CLI) or password (Python).
TERADATA_DATABASE - The name of the database, represented by --database (CLI) or database (Python). If not otherwise specified, the default database name is used. To get the name of the default database, you can run the Teradata SQL command SELECT DATABASE;.
TERADATA_TABLE - The name of the table, represented by --table-name (CLI) or table_name (Python).
TERADATA_ID_COLUMN - For the source connector, the name of the column that uniquely identifies each record in the table, represented by --id-column (CLI) or id_column (Python).
TERADATA_RECORD_ID_KEY - For the destination connector, the name of the column in the table that uniquely identifies each record, represented by --record-id-key (CLI) or record_id_key (Python).
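Before running the connector, it can be useful to confirm that these environment variables are set and to apply the documented defaults. The following is a minimal sketch. Treating the host, user name, password, and table as mandatory is an assumption made for this example, and load_teradata_settings is a hypothetical helper rather than part of the Unstructured Ingest library.

```python
import os

# Assumption for this sketch: these four variables must be set.
REQUIRED_VARS = (
    "TERADATA_HOST",
    "TERADATA_USERNAME",
    "TERADATA_PASSWORD",
    "TERADATA_TABLE",
)
DEFAULT_PORT = 1025  # Documented default when TERADATA_PORT is not set.


def load_teradata_settings(env=None):
    """Collect connector settings from environment variables.

    Pass a dict as env to test without touching the real environment.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["TERADATA_HOST"],
        "dbs_port": int(env.get("TERADATA_PORT") or DEFAULT_PORT),
        "user": env["TERADATA_USERNAME"],
        "password": env["TERADATA_PASSWORD"],
        # None means that the user's default database is used.
        "database": env.get("TERADATA_DATABASE") or None,
        "table_name": env["TERADATA_TABLE"],
        # record_id is the conventional column name.
        "record_id_key": env.get("TERADATA_RECORD_ID_KEY") or "record_id",
    }
```

The dictionary keys mirror the Python parameter names listed above (host, dbs_port, user, password, database, table_name, and record_id_key), so the values can be passed on to the corresponding configuration objects.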

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. The CLI example comes first, followed by the Python example. The source connector can be any of the ones supported. These examples use the local source connector and send files to Unstructured for processing by default. To process files locally instead, see the instructions at the end of this page.

#!/usr/bin/env bash

# Chunking and embedding are optional.

unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --num-processes 2 \
    --verbose \
    --strategy fast \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --metadata-include "$metadata_includes" \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
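  teradata \
    --host $TERADATA_HOST \
    --database $TERADATA_DATABASE \
    --user $TERADATA_USERNAME \
    --password $TERADATA_PASSWORD \
    --table-name $TERADATA_TABLE \
    --record-id-key $TERADATA_RECORD_ID_KEY \
    --batch-size 50

# To use a port other than the default (1025), add the following option
# to the teradata arguments above. Do not leave it as a commented-out line
# inside the command, because a comment ends the backslash continuation:
#   --dbs-port $TERADATA_PORT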

import os

from unstructured_ingest.pipeline.pipeline import Pipeline
from unstructured_ingest.interfaces import ProcessorConfig

from unstructured_ingest.processes.connectors.sql.teradata import (
    TeradataConnectionConfig,
    TeradataAccessConfig,
    TeradataUploaderConfig,
    TeradataUploadStagerConfig
)
from unstructured_ingest.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig
)
from unstructured_ingest.processes.partitioner import PartitionerConfig
from unstructured_ingest.processes.chunker import ChunkerConfig
from unstructured_ingest.processes.embedder import EmbedderConfig

# Chunking and embedding are optional.

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        chunker_config=ChunkerConfig(chunking_strategy="by_title"),
        embedder_config=EmbedderConfig(embedding_provider="huggingface"),
        destination_connection_config=TeradataConnectionConfig(
            access_config=TeradataAccessConfig(
                password=os.getenv("TERADATA_PASSWORD")
            ),
            host=os.getenv("TERADATA_HOST"),
            # dbs_port=os.getenv("TERADATA_PORT"),
            database=os.getenv("TERADATA_DATABASE"),
            user=os.getenv("TERADATA_USERNAME")
        ),
        stager_config=TeradataUploadStagerConfig(),
        uploader_config=TeradataUploaderConfig(
            batch_size=50,
            table_name=os.getenv("TERADATA_TABLE"),
            record_id_key=os.getenv("TERADATA_RECORD_ID_KEY")
        )
    ).run()

For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the --partition-by-api option (CLI) or partition_by_api (Python) parameter to specify where files are processed:

To do local file processing, omit --partition-by-api (CLI) or partition_by_api (Python), or explicitly specify partition_by_api=False (Python). Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:
- --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
- --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
- The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL
To send files to the legacy Unstructured Partition Endpoint for processing, specify --partition-by-api (CLI) or partition_by_api=True (Python). This also requires an Unstructured API key and API URL, which you provide by adding the following:
- --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
- --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
- The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL, representing your API key and API URL, respectively.
You must specify the API URL only if you are not using the default API URL for Unstructured Ingest, which applies to Let’s Go, Pay-As-You-Go, and Business SaaS accounts.
The default API URL for Unstructured Ingest is https://api.unstructuredapp.io/general/v0/general, which is the API URL for the legacy Unstructured Partition Endpoint. However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, request support.
If you do not have an API key, get one now.
If you are using a Business account, the process for generating Unstructured API keys, and the Unstructured API URL that you use, are different. For instructions, see your Unstructured account administrator, or request support.
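The choice between local and API-based processing described above can be sketched as a small function that builds the partitioner settings. This is an illustration under stated assumptions, not the library's own logic: partitioner_kwargs is a hypothetical helper, and falling back to the default API URL when UNSTRUCTURED_API_URL is unset is a choice made for this sketch.

```python
import os

# Default API URL for Unstructured Ingest, as stated above.
DEFAULT_API_URL = "https://api.unstructuredapp.io/general/v0/general"


def partitioner_kwargs(partition_by_api, env=None):
    """Build keyword arguments for the partitioner configuration.

    Pass a dict as env to test without touching the real environment.
    """
    env = os.environ if env is None else env
    if not partition_by_api:
        # Local processing needs neither an API key nor an API URL.
        return {"partition_by_api": False}
    api_key = env.get("UNSTRUCTURED_API_KEY")
    if not api_key:
        raise ValueError("UNSTRUCTURED_API_KEY is required for API processing")
    return {
        "partition_by_api": True,
        "api_key": api_key,
        # Assumption: use the default URL only when none was provided.
        "partition_endpoint": env.get("UNSTRUCTURED_API_URL") or DEFAULT_API_URL,
    }
```

The resulting dictionary uses the same parameter names as the Python example earlier on this page (partition_by_api, api_key, and partition_endpoint), so it could be unpacked into PartitionerConfig.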