Snowflake - Unstructured

This page was recently updated. What do you think about it? Let us know!.

Connect Snowflake to your preprocessing pipeline, and use the Unstructured Ingest CLI or the Unstructured Ingest Python library to batch process all your documents and store structured outputs locally on your filesystem.

The requirements are as follows.

A Snowflake account and its account identifier.

To get the identifier for the current Snowflake account:
1. Log in to Snowsight with your Snowflake account.
2. In Snowsight, on the navigation menu, click your username, and then click Account > View account details.
3. On the Account tab, note the value of the Account Identifier field.
Alternatively, the following Snowflake query returns the current account’s identifier:
```
SELECT CURRENT_ORGANIZATION_NAME() || '-' || CURRENT_ACCOUNT_NAME() AS "Account Identifier"
```
The Snowflake user’s login name (not its username) and its password in the account.

To view the login name for a user:
1. Log in to Snowsight with your Snowflake account.
2. In Snowsight, on the navigation menu, click Admin > Users & Roles.
3. On the Users tab, in the list of available users, click the name of the target user.
4. In the About tile, note the Login Name for the user.
Alternatively, the following Snowflake query returns information about the user with the username of <my-user>, including their login_name value representing their login name:
```
SHOW USERS LIKE '<my-user>';
```
The name of the Snowflake role that the user belongs to and that also has sufficient access to the Snowflake database, schema, table, and host.
- To create a database in Snowflake, the role needs to be granted CREATE DATABASE privilege at the current account level; and USAGE privilege on the warehouse that is used to create the database.
- To create a schema in a database in Snowflake, the role needs to be granted USAGE privilege on the database and the warehouse that is used to create the schema; and CREATE SCHEMA on the database.
- To create a table in a schema in Snowflake, the role needs to be granted USAGE privilege on the database and schema and the warehouse that is used to create the table; and CREATE TABLE on the schema.
- To write to a table in Snowflake, the role needs to be granted USAGE privilege on the database and schema and the warehouse that is used to write to the table; and INSERT on the table.
- To read from a table in Snowflake, the role needs to be granted USAGE privilege on the database and schema and the warehouse that is used to write to the table; and SELECT on the table.
To view a list of available roles in the current Snowflake account:
1. Log in to Snowsight with your Snowflake account.
2. In Snowsight, on the navigation menu, click Admin > Users & Roles.
3. Click the Roles tab.
Alternatively, the following Snowflake query returns a list of available roles in the current account:
```
SHOW ROLES;
```
Grant privileges to a role. Learn more.
The Snowflake warehouse’s hostname and its port number in the account.

To view a list of available warehouses in the current Snowflake account:
1. Log in to Snowsight with your Snowflake account.
2. In Snowsight, on the navigation menu, click Admin > Warehouses. This view does not provide access to the warehouses’ hostnames or port numbers. To get this information, you must run a Snowflake query.
The following Snowflake query returns a list of available warehouse types, hostnames, and port numbers in the current account. Look for the row with a type of SNOWFLAKE_DEPLOYMENT:
```
SELECT t.VALUE:type::VARCHAR as type,
       t.VALUE:host::VARCHAR as host,
       t.VALUE:port as port
FROM TABLE(FLATTEN(input => PARSE_JSON(SYSTEM$ALLOWLIST()))) AS t;
```
The name of the Snowflake database in the account.

To view a list of available databases in the current Snowflake account:
1. Log in to Snowsight with your Snowflake account.
2. In Snowsight, on the navigation menu, click Data > Databases.
Alternatively, the following Snowflake query returns a list of available databases in the current account:
```
SHOW DATABASES;
```
The name of the schema in the database.

To view a list of available schemas for a database in the current Snowflake account:
1. Log in to Snowsight with your Snowflake account.
2. In Snowsight, on the navigation menu, click Data > Databases.
3. Expand the name of the target database.
Alternatively, the following Snowflake query returns a list of available schemas in the current account:
```
SHOW SCHEMAS;
```
The following Snowflake query returns a list of available schemas for the database named <database_name> in the current account:
```
SHOW SCHEMAS IN DATABASE <database_name>;
```

The name of the table in the schema.

To view a list of available tables for a schema in a database in the current Snowflake account:

Log in to Snowsight with your Snowflake account.
In Snowsight, on the navigation menu, click Data > Databases.
Expand the name of the database that contains the target schema.
Expand the name of the target schema.
Expand Tables.

Alternatively, the following Snowflake query returns a list of available tables for the schema named <schema_name> in the datbase named <database_name> in the current account:

SHOW TABLES IN SCHEMA <database_name>.<schema_name>;

Snowflake requires the target table to have a defined schema before Unstructured can write to the table. The recommended table schema for Unstructured is as follows. In the following CREATE TABLE statement, replace the following placeholders with the appropriate values:

<database_name>: The name of the target database in the Snowflake account.
<schema_name>: The name of the target schema in the database.
<number-of-dimensions>: The number of dimensions for any embeddings that you plan to use. This value must match the number of dimensions for any embeddings that are
specified in your related Unstructured workflows or pipelines. If you plan to use Snowflake vector embedding generation or Snowflake vector search, this value must match the number of dimensions that you plan to have Snowflake generate or search against.

SQL

CREATE TABLE <database_name>.<schema_name>.ELEMENTS (
  ID VARCHAR(36) PRIMARY KEY NOT NULL DEFAULT UUID_STRING(),
  RECORD_ID VARCHAR,
  ELEMENT_ID VARCHAR,
  TEXT VARCHAR,
  EMBEDDINGS VECTOR(FLOAT, <number-of-dimensions>),
  TYPE VARCHAR,
  SYSTEM VARCHAR,
  LAYOUT_WIDTH DECIMAL,
  LAYOUT_HEIGHT DECIMAL,
  POINTS VARCHAR,
  URL VARCHAR,
  VERSION VARCHAR,
  DATE_CREATED TIMESTAMP_TZ,
  DATE_MODIFIED TIMESTAMP_TZ,
  DATE_PROCESSED TIMESTAMP_TZ,
  PERMISSIONS_DATA VARCHAR,
  RECORD_LOCATOR VARCHAR,
  CATEGORY_DEPTH INTEGER,
  PARENT_ID VARCHAR,
  ATTACHED_FILENAME VARCHAR,
  FILETYPE VARCHAR,
  LAST_MODIFIED TIMESTAMP_TZ,
  FILE_DIRECTORY VARCHAR,
  FILENAME VARCHAR,
  LANGUAGES ARRAY,
  PAGE_NUMBER VARCHAR,
  LINKS VARCHAR,
  PAGE_NAME VARCHAR,
  LINK_URLS ARRAY,
  LINK_TEXTS ARRAY,
  SENT_FROM ARRAY,
  SENT_TO ARRAY,
  SUBJECT VARCHAR,
  SECTION VARCHAR,
  HEADER_FOOTER_TYPE VARCHAR,
  EMPHASIZED_TEXT_CONTENTS ARRAY,
  EMPHASIZED_TEXT_TAGS ARRAY,
  TEXT_AS_HTML VARCHAR,
  REGEX_METADATA VARCHAR,
  DETECTION_CLASS_PROB DECIMAL,
  IMAGE_BASE64 VARCHAR,
  IMAGE_MIME_TYPE VARCHAR,
  ORIG_ELEMENTS VARCHAR,
  IS_CONTINUATION BOOLEAN
);

The name of the column in the table that uniquely identifies each record (for example, RECORD_ID).

The Snowflake connector dependencies:

CLI, Python

pip install "unstructured-ingest[snowflake]"

You might also need to install additional dependencies, depending on your needs. Learn more.

These environment variables:

SNOWFLAKE_ACCOUNT - The ID of the target Snowflake account, represented by --account (CLI) or account (Python).
SNOWFLAKE_USER - The name of the target Snowflake user, represented by --user (CLI) or user (Python).
SNOWFLAKE_PASSWORD - The user’s password, represented by --password (CLI) or password (Python).
SNOWFLAKE_ROLE - The target role for the user, represented by --role (CLI) or role (Python).
SNOWFLAKE_HOST - The hostname for the target Snowflake warehouse, represented by --host (CLI) or host (Python).
SNOWFLAKE_PORT - The warehouse’s port number, represented by --port (CLI) or port (Python). The default is 443 if not otherwise specified.
SNOWFLAKE_DATABASE - The name of the target Snowflake database, represented by --database (CLI) or database (Python).
SNOWFLAKE_SCHEMA - The name of the target schema in the database, represented by --schema (CLI) or schema (Python).
SNOWFLAKE_TABLE - The name of the target table in the schema, represented by --table-name (CLI) or table_name (Python). For the destination connector, the default is elements if not otherwise specified.
SNOWFLAKE_RECORD_ID_KEY - The name of the column in the table that uniquely identifies each record, represented by:
- For the source connector, --id-column (CLI) or id_column (Python).
- For the destination connector, --record-id-key (CLI) or record_id_key (Python). For the destination connector, the default is record_id if not otherwise specified.

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. The destination connector can be any of the ones supported. This example uses the local destination connector:

This example sends data to Unstructured for processing by default. To process data locally instead, see the instructions at the end of this page.

#!/usr/bin/env bash

unstructured \
  snowflake \
    --account $SNOWFLAKE_ACCOUNT \
    --user $SNOWFLAKE_USER \
    --password $SNOWFLAKE_PASSWORD \
    --role $SNOWFLAKE_ROLE \
    --host $SNOWFLAKE_HOST \
    --port $SNOWFLAKE_PORT \
    --database $SNOWFLAKE_DATABASE \
    --schema $SNOWFLAKE_SCHEMA \
    --batch-size 50 \
    --table-name $SNOWFLAKE_TABLE \
    --id-column $SNOWFLAKE_RECORD_ID_KEY \
    --download-dir $LOCAL_FILE_DOWNLOAD_DIR\
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --output-dir $LOCAL_FILE_OUTPUT_DIR

For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the --partition-by-api option (CLI) or partition_by_api (Python) parameter to specify where files are processed:

To do local file processing, omit --partition-by-api (CLI) or partition_by_api (Python), or explicitly specify partition_by_api=False (Python).

Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:
- --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
- --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
- The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL
To send files to the Unstructured Partition Endpoint for processing, specify --partition-by-api (CLI) or partition_by_api=True (Python).

Unstructured also requires an Unstructured API key and API URL, by adding the following:
- --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
- --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
- The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL, representing your API key and API URL, respectively.
You must specify the API URL only if you are not using the default API URL for Unstructured Ingest, for example, if you are using a self-hosted instance of the Unstructured API.
The default API URL for Unstructured Ingest is https://api.unstructuredapp.io/general/v0/general, which is the API URL for the Unstructured Partition Endpoint. However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, contact Unstructured Sales at sales@unstructured.io.
If you do not have an API key, get one now.
If the Unstructured API is self-hosted, the process for generating Unstructured API keys, and the Unstructured API URL that you use, are different. For details, contact Unstructured Sales at sales@unstructured.io.

This page was recently updated. What do you think about it? Let us know!.

The requirements are as follows.

A Snowflake account and its account identifier.

To get the identifier for the current Snowflake account:
1. Log in to Snowsight with your Snowflake account.
2. In Snowsight, on the navigation menu, click your username, and then click Account > View account details.
3. On the Account tab, note the value of the Account Identifier field.
Alternatively, the following Snowflake query returns the current account’s identifier:
```
SELECT CURRENT_ORGANIZATION_NAME() || '-' || CURRENT_ACCOUNT_NAME() AS "Account Identifier"
```
The Snowflake user’s login name (not its username) and its password in the account.

To view the login name for a user:
1. Log in to Snowsight with your Snowflake account.
2. In Snowsight, on the navigation menu, click Admin > Users & Roles.
3. On the Users tab, in the list of available users, click the name of the target user.
4. In the About tile, note the Login Name for the user.
Alternatively, the following Snowflake query returns information about the user with the username of <my-user>, including their login_name value representing their login name:
```
SHOW USERS LIKE '<my-user>';
```
The name of the Snowflake role that the user belongs to and that also has sufficient access to the Snowflake database, schema, table, and host.
- To create a database in Snowflake, the role needs to be granted CREATE DATABASE privilege at the current account level; and USAGE privilege on the warehouse that is used to create the database.
- To create a schema in a database in Snowflake, the role needs to be granted USAGE privilege on the database and the warehouse that is used to create the schema; and CREATE SCHEMA on the database.
- To create a table in a schema in Snowflake, the role needs to be granted USAGE privilege on the database and schema and the warehouse that is used to create the table; and CREATE TABLE on the schema.
- To write to a table in Snowflake, the role needs to be granted USAGE privilege on the database and schema and the warehouse that is used to write to the table; and INSERT on the table.
- To read from a table in Snowflake, the role needs to be granted USAGE privilege on the database and schema and the warehouse that is used to write to the table; and SELECT on the table.
To view a list of available roles in the current Snowflake account:
1. Log in to Snowsight with your Snowflake account.
2. In Snowsight, on the navigation menu, click Admin > Users & Roles.
3. Click the Roles tab.
Alternatively, the following Snowflake query returns a list of available roles in the current account:
```
SHOW ROLES;
```
Grant privileges to a role. Learn more.
The Snowflake warehouse’s hostname and its port number in the account.

To view a list of available warehouses in the current Snowflake account:
1. Log in to Snowsight with your Snowflake account.
2. In Snowsight, on the navigation menu, click Admin > Warehouses. This view does not provide access to the warehouses’ hostnames or port numbers. To get this information, you must run a Snowflake query.
The following Snowflake query returns a list of available warehouse types, hostnames, and port numbers in the current account. Look for the row with a type of SNOWFLAKE_DEPLOYMENT:
```
SELECT t.VALUE:type::VARCHAR as type,
       t.VALUE:host::VARCHAR as host,
       t.VALUE:port as port
FROM TABLE(FLATTEN(input => PARSE_JSON(SYSTEM$ALLOWLIST()))) AS t;
```
The name of the Snowflake database in the account.

To view a list of available databases in the current Snowflake account:
1. Log in to Snowsight with your Snowflake account.
2. In Snowsight, on the navigation menu, click Data > Databases.
Alternatively, the following Snowflake query returns a list of available databases in the current account:
```
SHOW DATABASES;
```
The name of the schema in the database.

To view a list of available schemas for a database in the current Snowflake account:
1. Log in to Snowsight with your Snowflake account.
2. In Snowsight, on the navigation menu, click Data > Databases.
3. Expand the name of the target database.
Alternatively, the following Snowflake query returns a list of available schemas in the current account:
```
SHOW SCHEMAS;
```
The following Snowflake query returns a list of available schemas for the database named <database_name> in the current account:
```
SHOW SCHEMAS IN DATABASE <database_name>;
```

The name of the table in the schema.

To view a list of available tables for a schema in a database in the current Snowflake account:

Log in to Snowsight with your Snowflake account.
In Snowsight, on the navigation menu, click Data > Databases.
Expand the name of the database that contains the target schema.
Expand the name of the target schema.
Expand Tables.

Alternatively, the following Snowflake query returns a list of available tables for the schema named <schema_name> in the datbase named <database_name> in the current account:

SHOW TABLES IN SCHEMA <database_name>.<schema_name>;

<database_name>: The name of the target database in the Snowflake account.
<schema_name>: The name of the target schema in the database.
<number-of-dimensions>: The number of dimensions for any embeddings that you plan to use. This value must match the number of dimensions for any embeddings that are
specified in your related Unstructured workflows or pipelines. If you plan to use Snowflake vector embedding generation or Snowflake vector search, this value must match the number of dimensions that you plan to have Snowflake generate or search against.

SQL

CREATE TABLE <database_name>.<schema_name>.ELEMENTS (
  ID VARCHAR(36) PRIMARY KEY NOT NULL DEFAULT UUID_STRING(),
  RECORD_ID VARCHAR,
  ELEMENT_ID VARCHAR,
  TEXT VARCHAR,
  EMBEDDINGS VECTOR(FLOAT, <number-of-dimensions>),
  TYPE VARCHAR,
  SYSTEM VARCHAR,
  LAYOUT_WIDTH DECIMAL,
  LAYOUT_HEIGHT DECIMAL,
  POINTS VARCHAR,
  URL VARCHAR,
  VERSION VARCHAR,
  DATE_CREATED TIMESTAMP_TZ,
  DATE_MODIFIED TIMESTAMP_TZ,
  DATE_PROCESSED TIMESTAMP_TZ,
  PERMISSIONS_DATA VARCHAR,
  RECORD_LOCATOR VARCHAR,
  CATEGORY_DEPTH INTEGER,
  PARENT_ID VARCHAR,
  ATTACHED_FILENAME VARCHAR,
  FILETYPE VARCHAR,
  LAST_MODIFIED TIMESTAMP_TZ,
  FILE_DIRECTORY VARCHAR,
  FILENAME VARCHAR,
  LANGUAGES ARRAY,
  PAGE_NUMBER VARCHAR,
  LINKS VARCHAR,
  PAGE_NAME VARCHAR,
  LINK_URLS ARRAY,
  LINK_TEXTS ARRAY,
  SENT_FROM ARRAY,
  SENT_TO ARRAY,
  SUBJECT VARCHAR,
  SECTION VARCHAR,
  HEADER_FOOTER_TYPE VARCHAR,
  EMPHASIZED_TEXT_CONTENTS ARRAY,
  EMPHASIZED_TEXT_TAGS ARRAY,
  TEXT_AS_HTML VARCHAR,
  REGEX_METADATA VARCHAR,
  DETECTION_CLASS_PROB DECIMAL,
  IMAGE_BASE64 VARCHAR,
  IMAGE_MIME_TYPE VARCHAR,
  ORIG_ELEMENTS VARCHAR,
  IS_CONTINUATION BOOLEAN
);

The name of the column in the table that uniquely identifies each record (for example, RECORD_ID).

The Snowflake connector dependencies:

CLI, Python

pip install "unstructured-ingest[snowflake]"

You might also need to install additional dependencies, depending on your needs. Learn more.

These environment variables:

SNOWFLAKE_ACCOUNT - The ID of the target Snowflake account, represented by --account (CLI) or account (Python).
SNOWFLAKE_USER - The name of the target Snowflake user, represented by --user (CLI) or user (Python).
SNOWFLAKE_PASSWORD - The user’s password, represented by --password (CLI) or password (Python).
SNOWFLAKE_ROLE - The target role for the user, represented by --role (CLI) or role (Python).
SNOWFLAKE_HOST - The hostname for the target Snowflake warehouse, represented by --host (CLI) or host (Python).
SNOWFLAKE_PORT - The warehouse’s port number, represented by --port (CLI) or port (Python). The default is 443 if not otherwise specified.
SNOWFLAKE_DATABASE - The name of the target Snowflake database, represented by --database (CLI) or database (Python).
SNOWFLAKE_SCHEMA - The name of the target schema in the database, represented by --schema (CLI) or schema (Python).
SNOWFLAKE_TABLE - The name of the target table in the schema, represented by --table-name (CLI) or table_name (Python). For the destination connector, the default is elements if not otherwise specified.
SNOWFLAKE_RECORD_ID_KEY - The name of the column in the table that uniquely identifies each record, represented by:
- For the source connector, --id-column (CLI) or id_column (Python).
- For the destination connector, --record-id-key (CLI) or record_id_key (Python). For the destination connector, the default is record_id if not otherwise specified.

Now call the Unstructured Ingest CLI or the Unstructured Ingest Python library. The destination connector can be any of the ones supported. This example uses the local destination connector:

This example sends data to Unstructured for processing by default. To process data locally instead, see the instructions at the end of this page.

#!/usr/bin/env bash

unstructured \
  snowflake \
    --account $SNOWFLAKE_ACCOUNT \
    --user $SNOWFLAKE_USER \
    --password $SNOWFLAKE_PASSWORD \
    --role $SNOWFLAKE_ROLE \
    --host $SNOWFLAKE_HOST \
    --port $SNOWFLAKE_PORT \
    --database $SNOWFLAKE_DATABASE \
    --schema $SNOWFLAKE_SCHEMA \
    --batch-size 50 \
    --table-name $SNOWFLAKE_TABLE \
    --id-column $SNOWFLAKE_RECORD_ID_KEY \
    --download-dir $LOCAL_FILE_DOWNLOAD_DIR\
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --output-dir $LOCAL_FILE_OUTPUT_DIR

To do local file processing, omit --partition-by-api (CLI) or partition_by_api (Python), or explicitly specify partition_by_api=False (Python).

Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:
- --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
- --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
- The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL
To send files to the Unstructured Partition Endpoint for processing, specify --partition-by-api (CLI) or partition_by_api=True (Python).

Unstructured also requires an Unstructured API key and API URL, by adding the following:
- --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
- --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
- The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL, representing your API key and API URL, respectively.
You must specify the API URL only if you are not using the default API URL for Unstructured Ingest, for example, if you are using a self-hosted instance of the Unstructured API.
The default API URL for Unstructured Ingest is https://api.unstructuredapp.io/general/v0/general, which is the API URL for the Unstructured Partition Endpoint. However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, contact Unstructured Sales at sales@unstructured.io.
If you do not have an API key, get one now.
If the Unstructured API is self-hosted, the process for generating Unstructured API keys, and the Unstructured API URL that you use, are different. For details, contact Unstructured Sales at sales@unstructured.io.

FAQ