LanceDB - Unstructured

This page was recently updated. What do you think about it? Let us know!.

Batch process all your records to store structured outputs in LanceDB.

The requirements are as follows.

A LanceDB open source software (OSS) installation on a local machine, a server, or a virtual machine. (LanceDB Cloud is not supported.)
For LanceDB OSS with local data storage:
- The local path to the folder where the LanceDB data is (or will be) stored. See Connect to a database in the LanceDB documentation.
- The name of the target LanceDB table within the local data folder.
  
  Unstructured recommends that the target table have a field named record_id with a text string data type. Unstructured can use this field to do intelligent record overwrites. Without this field, duplicate records might be written to the table or, in some cases, the operation could fail altogether.
For LanceDB OSS with data storage in an Amazon S3 bucket:
- The URI for the target Amazon S3 bucket and any target folder path within that bucket. Use the format s3://<bucket-name>[/<folder-name>].
- The name of the target LanceDB table within the Amazon S3 bucket.
  
  Unstructured recommends that the target table have a field named record_id with a text string data type. Unstructured can use this field to do intelligent record overwrites. Without this field, duplicate records might be written to the table or, in some cases, the operation could fail altogether.
- The AWS access key ID and AWS secret access key for the AWS IAM entity that has access to the Amazon S3 bucket.
For more information, see AWS S3 in the LanceDB documentation, along with the following video:
For LanceDB OSS with data storage in an Azure Blob Storage account:
- The name of the target Azure Blob Storage account. = The URI for the target container within that Azure Blob Storage account and any target folder path within that container. Use the format az://<container-name>[/<folder-name>].
- The name of the target LanceDB table within the Azure Blob Storage account.
  
  Unstructured recommends that the target table have a field named record_id with a text string data type. Unstructured can use this field to do intelligent record overwrites. Without this field, duplicate records might be written to the table or, in some cases, the operation could fail altogether.
- The access key for the Azure Blob Storage account.
For more information, see Azure Blob Storage in the LanceDB documentation, along with the following video:
For LanceDB OSS with data storage in a Google Cloud Storage bucket:
- The URI for the target Google Cloud Storage bucket and any target folder path within that bucket. Use the format gs://<bucket-name>[/<folder-name>].
- The name of the target LanceDB table within the Google Cloud Storage bucket.
  
  Unstructured recommends that the target table have a field named record_id with a text string data type. Unstructured can use this field to do intelligent record overwrites. Without this field, duplicate records might be written to the table or, in some cases, the operation could fail altogether.
- A single-line string that contains the contents of the downloaded service account key file for the Google Cloud service account that has access to the Google Cloud Storage bucket.
For more information, see Google Cloud Storage in the LanceDB documentation, along with the following video:

The LanceDB connector dependencies:

CLI, Python

pip install "unstructured-ingest[lancedb]"

You might also need to install additional dependencies, depending on your needs. Learn more.

The following environment variables:

For LanceDB OSS with local data storage:
- LANCEDB_URI - The local path to the folder where the LanceDB data is stored, represented by --uri (CLI) or uri (Python).
- LANCEDB_TABLE - The name of the target LanceDB table within the local data folder, represented by --table-name (CLI) or table_name (Python).
For LanceDB OSS with data storage in an Amazon S3 bucket:
- LANCEDB_URI - The URI for the target Amazon S3 bucket and any target folder path within that bucket. Use the format s3://<bucket-name>[/<folder-name>]. This is represented by --uri (CLI) or uri (Python).
- LANCEDB_TABLE - The name of the target LanceDB table within the Amazon S3 bucket, rrepresented by --table-name (CLI) or table_name (Python).
- AWS_ACCESS_KEY_ID - The AWS access key ID for the AWS IAM entity that has access to the Amazon S3 bucket, represented by --aws-access-key-id (CLI) or aws_access_key_id (Python).
- AWS_SECRET_ACCESS_KEY - The AWS secret access key for the AWS IAM entity that has access to the Amazon S3 bucket, represented by --aws-secret-access-key (CLI) or aws_secret_access_key (Python).
For LanceDB OSS with data storage in an Azure Blob Storage account:
- LANCEDB_URI - The URI for the target container within that Azure Blob Storage account and any target folder path within that container. Use the format az://<container-name>[/<folder-name>]. This is represented by --uri (CLI) or uri (Python).
- LANCEDB_TABLE - The name of the target LanceDB table within the Azure Blob Storage account, represented by --table-name (CLI) or table_name (Python).
- AZURE_STORAGE_ACCOUNT_NAME - The name of the target Azure Blob Storage account, represented by --azure-storage-account-name (CLI) or azure_storage_account_name (Python).
- AZURE_STORAGE_ACCOUNT_KEY - The access key for the Azure Blob Storage account, represented by --azure-storage-account-key (CLI) or azure_storage_account_key (Python).
For LanceDB OSS with data storage in a Google Cloud Storage bucket:
- LANCEDB_URI - The URI for the target Google Cloud Storage bucket and any target folder path within that bucket. Use the format gs://<bucket-name>[/<folder-name>]. This is represented by --uri (CLI) or uri (Python).
- LANCEDB_TABLE - The name of the target LanceDB table within the Google Cloud Storage bucket, represented by --table-name (CLI) or table_name (Python).
- GCS_SERVICE_ACCOUNT_KEY - A single-line string that contains the contents of the downloaded service account key file for the Google Cloud service account that has access to the Google Cloud Storage bucket, represented by --google-service-account-key (CLI) or google_service_account_key (Python).

Now call the Unstructured CLI or Python. The source connector can be any of the ones supported. This example uses the local source connector:

This example sends files to Unstructured for processing by default. To process files locally instead, see the instructions at the end of this page.

#!/usr/bin/env bash

# Chunking and embedding are optional.

# For LanceDB OSS with local data storage:
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  lancedb-local \
    --uri $LANCEDB_URI \
    --table-name $LANCEDB_TABLE

# For LanceDB OSS with data storage in an Amazon S3 bucket:
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  lancedb-aws \
    --aws-access-key-id $AWS_ACCESS_KEY_ID \
    --aws-secret-access-key $AWS_SECRET_ACCESS_KEY \
    --uri $LANCEDB_URI \
    --table-name $LANCEDB_TABLE \
    --timeout 30s

# For LanceDB OSS with data storage in an Azure Blob Storage account:
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  lancedb-azure \
    --azure-storage-account-name $AZURE_STORAGE_ACCOUNT_NAME \
    --azure-storage-account-key $AZURE_STORAGE_ACCOUNT_KEY \
    --uri $LANCEDB_URI \
    --table-name $LANCEDB_TABLE \
    --timeout 30s

# For LanceDB OSS with data storage in a Google Cloud Storage bucket:
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  lancedb-gcs \
    --google-service-account-key $GCS_SERVICE_ACCOUNT_KEY \
    --uri $LANCEDB_URI \
    --table-name $LANCEDB_TABLE \
    --timeout 30s

For the Unstructured Ingest CLI and the Unstructured Ingest Python library, you can use the --partition-by-api option (CLI) or partition_by_api (Python) parameter to specify where files are processed:

To do local file processing, omit --partition-by-api (CLI) or partition_by_api (Python), or explicitly specify partition_by_api=False (Python).

Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:
- --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
- --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
- The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL
To send files to the Unstructured Partition Endpoint for processing, specify --partition-by-api (CLI) or partition_by_api=True (Python).

Unstructured also requires an Unstructured API key and API URL, by adding the following:
- --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
- --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
- The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL, representing your API key and API URL, respectively.
You must specify the API URL only if you are not using the default API URL for Unstructured Ingest, for example, if you are using a self-hosted instance of the Unstructured API.
The default API URL for Unstructured Ingest is https://api.unstructuredapp.io/general/v0/general, which is the API URL for the Unstructured Partition Endpoint. However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, contact Unstructured Sales at sales@unstructured.io.
If you do not have an API key, get one now.
If the Unstructured API is self-hosted, the process for generating Unstructured API keys, and the Unstructured API URL that you use, are different. For details, contact Unstructured Sales at sales@unstructured.io.

This page was recently updated. What do you think about it? Let us know!.

Batch process all your records to store structured outputs in LanceDB.

The requirements are as follows.

A LanceDB open source software (OSS) installation on a local machine, a server, or a virtual machine. (LanceDB Cloud is not supported.)
For LanceDB OSS with local data storage:
- The local path to the folder where the LanceDB data is (or will be) stored. See Connect to a database in the LanceDB documentation.
- The name of the target LanceDB table within the local data folder.
  
  Unstructured recommends that the target table have a field named record_id with a text string data type. Unstructured can use this field to do intelligent record overwrites. Without this field, duplicate records might be written to the table or, in some cases, the operation could fail altogether.
For LanceDB OSS with data storage in an Amazon S3 bucket:
- The URI for the target Amazon S3 bucket and any target folder path within that bucket. Use the format s3://<bucket-name>[/<folder-name>].
- The name of the target LanceDB table within the Amazon S3 bucket.
  
  Unstructured recommends that the target table have a field named record_id with a text string data type. Unstructured can use this field to do intelligent record overwrites. Without this field, duplicate records might be written to the table or, in some cases, the operation could fail altogether.
- The AWS access key ID and AWS secret access key for the AWS IAM entity that has access to the Amazon S3 bucket.
For more information, see AWS S3 in the LanceDB documentation, along with the following video:
For LanceDB OSS with data storage in an Azure Blob Storage account:
- The name of the target Azure Blob Storage account. = The URI for the target container within that Azure Blob Storage account and any target folder path within that container. Use the format az://<container-name>[/<folder-name>].
- The name of the target LanceDB table within the Azure Blob Storage account.
  
  Unstructured recommends that the target table have a field named record_id with a text string data type. Unstructured can use this field to do intelligent record overwrites. Without this field, duplicate records might be written to the table or, in some cases, the operation could fail altogether.
- The access key for the Azure Blob Storage account.
For more information, see Azure Blob Storage in the LanceDB documentation, along with the following video:
For LanceDB OSS with data storage in a Google Cloud Storage bucket:
- The URI for the target Google Cloud Storage bucket and any target folder path within that bucket. Use the format gs://<bucket-name>[/<folder-name>].
- The name of the target LanceDB table within the Google Cloud Storage bucket.
  
  Unstructured recommends that the target table have a field named record_id with a text string data type. Unstructured can use this field to do intelligent record overwrites. Without this field, duplicate records might be written to the table or, in some cases, the operation could fail altogether.
- A single-line string that contains the contents of the downloaded service account key file for the Google Cloud service account that has access to the Google Cloud Storage bucket.
For more information, see Google Cloud Storage in the LanceDB documentation, along with the following video:

The LanceDB connector dependencies:

CLI, Python

pip install "unstructured-ingest[lancedb]"

You might also need to install additional dependencies, depending on your needs. Learn more.

The following environment variables:

For LanceDB OSS with local data storage:
- LANCEDB_URI - The local path to the folder where the LanceDB data is stored, represented by --uri (CLI) or uri (Python).
- LANCEDB_TABLE - The name of the target LanceDB table within the local data folder, represented by --table-name (CLI) or table_name (Python).
For LanceDB OSS with data storage in an Amazon S3 bucket:
- LANCEDB_URI - The URI for the target Amazon S3 bucket and any target folder path within that bucket. Use the format s3://<bucket-name>[/<folder-name>]. This is represented by --uri (CLI) or uri (Python).
- LANCEDB_TABLE - The name of the target LanceDB table within the Amazon S3 bucket, rrepresented by --table-name (CLI) or table_name (Python).
- AWS_ACCESS_KEY_ID - The AWS access key ID for the AWS IAM entity that has access to the Amazon S3 bucket, represented by --aws-access-key-id (CLI) or aws_access_key_id (Python).
- AWS_SECRET_ACCESS_KEY - The AWS secret access key for the AWS IAM entity that has access to the Amazon S3 bucket, represented by --aws-secret-access-key (CLI) or aws_secret_access_key (Python).
For LanceDB OSS with data storage in an Azure Blob Storage account:
- LANCEDB_URI - The URI for the target container within that Azure Blob Storage account and any target folder path within that container. Use the format az://<container-name>[/<folder-name>]. This is represented by --uri (CLI) or uri (Python).
- LANCEDB_TABLE - The name of the target LanceDB table within the Azure Blob Storage account, represented by --table-name (CLI) or table_name (Python).
- AZURE_STORAGE_ACCOUNT_NAME - The name of the target Azure Blob Storage account, represented by --azure-storage-account-name (CLI) or azure_storage_account_name (Python).
- AZURE_STORAGE_ACCOUNT_KEY - The access key for the Azure Blob Storage account, represented by --azure-storage-account-key (CLI) or azure_storage_account_key (Python).
For LanceDB OSS with data storage in a Google Cloud Storage bucket:
- LANCEDB_URI - The URI for the target Google Cloud Storage bucket and any target folder path within that bucket. Use the format gs://<bucket-name>[/<folder-name>]. This is represented by --uri (CLI) or uri (Python).
- LANCEDB_TABLE - The name of the target LanceDB table within the Google Cloud Storage bucket, represented by --table-name (CLI) or table_name (Python).
- GCS_SERVICE_ACCOUNT_KEY - A single-line string that contains the contents of the downloaded service account key file for the Google Cloud service account that has access to the Google Cloud Storage bucket, represented by --google-service-account-key (CLI) or google_service_account_key (Python).

Now call the Unstructured CLI or Python. The source connector can be any of the ones supported. This example uses the local source connector:

This example sends files to Unstructured for processing by default. To process files locally instead, see the instructions at the end of this page.

#!/usr/bin/env bash

# Chunking and embedding are optional.

# For LanceDB OSS with local data storage:
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  lancedb-local \
    --uri $LANCEDB_URI \
    --table-name $LANCEDB_TABLE

# For LanceDB OSS with data storage in an Amazon S3 bucket:
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  lancedb-aws \
    --aws-access-key-id $AWS_ACCESS_KEY_ID \
    --aws-secret-access-key $AWS_SECRET_ACCESS_KEY \
    --uri $LANCEDB_URI \
    --table-name $LANCEDB_TABLE \
    --timeout 30s

# For LanceDB OSS with data storage in an Azure Blob Storage account:
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  lancedb-azure \
    --azure-storage-account-name $AZURE_STORAGE_ACCOUNT_NAME \
    --azure-storage-account-key $AZURE_STORAGE_ACCOUNT_KEY \
    --uri $LANCEDB_URI \
    --table-name $LANCEDB_TABLE \
    --timeout 30s

# For LanceDB OSS with data storage in a Google Cloud Storage bucket:
unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --chunking-strategy by_title \
    --embedding-provider huggingface \
    --partition-by-api \
    --api-key $UNSTRUCTURED_API_KEY \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
  lancedb-gcs \
    --google-service-account-key $GCS_SERVICE_ACCOUNT_KEY \
    --uri $LANCEDB_URI \
    --table-name $LANCEDB_TABLE \
    --timeout 30s

To do local file processing, omit --partition-by-api (CLI) or partition_by_api (Python), or explicitly specify partition_by_api=False (Python).

Local file processing does not use an Unstructured API key or API URL, so you can also omit the following, if they appear:
- --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
- --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
- The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL
To send files to the Unstructured Partition Endpoint for processing, specify --partition-by-api (CLI) or partition_by_api=True (Python).

Unstructured also requires an Unstructured API key and API URL, by adding the following:
- --api-key $UNSTRUCTURED_API_KEY (CLI) or api_key=os.getenv("UNSTRUCTURED_API_KEY") (Python)
- --partition-endpoint $UNSTRUCTURED_API_URL (CLI) or partition_endpoint=os.getenv("UNSTRUCTURED_API_URL") (Python)
- The environment variables UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL, representing your API key and API URL, respectively.
You must specify the API URL only if you are not using the default API URL for Unstructured Ingest, for example, if you are using a self-hosted instance of the Unstructured API.
The default API URL for Unstructured Ingest is https://api.unstructuredapp.io/general/v0/general, which is the API URL for the Unstructured Partition Endpoint. However, you should always use the URL that was provided to you when your Unstructured account was created. If you do not have this URL, contact Unstructured Sales at sales@unstructured.io.
If you do not have an API key, get one now.
If the Unstructured API is self-hosted, the process for generating Unstructured API keys, and the Unstructured API URL that you use, are different. For details, contact Unstructured Sales at sales@unstructured.io.

FAQ