# Unstructured API on AWS

Source: https://docs.unstructured.io/api-reference/legacy-api/aws

Follow these steps to deploy the Unstructured API service into your AWS account.

The Unstructured API on AWS is deprecated. It is no longer supported and is not being actively updated. Unstructured is now available on the AWS Marketplace as a private offering. To explore supported options for running Unstructured within your virtual private cloud (VPC), email Unstructured Sales at [sales@unstructured.io](mailto:sales@unstructured.io).

This page is not being actively updated. It might contain out-of-date information. This page is provided for legacy reference purposes only.

This article describes how to create several interrelated resources in your AWS account. Your AWS account will be charged on an ongoing basis for these resources, even if you are not actively using them.

Manually stopping or terminating the associated Amazon EC2 instances alone will not reduce these ongoing charges.

To stop accruing all related ongoing charges, you must delete all of the associated AWS resources. To do this, see [Manage related AWS account costs](#manage-related-aws-account-costs).
*Estimated time to complete: 30 minutes*

The requirements are as follows.

1. **An AWS account**:
   * If you have an existing account, log in: [https://aws.amazon.com/](https://aws.amazon.com) > **Sign In to the Console**.
   * If you do not have an existing account, create one: [https://aws.amazon.com/free](https://aws.amazon.com/free) > **Create a Free Account**.
2. **IAM permissions**: In a later step, AWS CloudFormation creates required infrastructure in your account. To learn how to create the associated role, see [Creating IAM Roles with AWS CloudFormation](https://blog.awsfundamentals.com/aws-iam-roles-with-aws-cloudformation#heading-creating-iam-roles-with-aws-cloudformation).
3. **SSH key pair**: For secure access to the Amazon EC2 instance that CloudFormation creates in a later step, create an SSH key pair. To learn how, see [Create a key pair for your Amazon EC2 instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/create-key-pairs.html).

## Part I: Setting up the Virtual Private Cloud (VPC)

*Note: If you have already configured a Virtual Private Cloud (VPC) for your organization that meets the requirements for deploying the Unstructured API, you can skip this part and proceed to Part II. Ensure that your existing VPC setup includes the necessary subnets, internet gateway, and route tables as outlined in this guide.*

In Part I, you will build a resilient and secure infrastructure within AWS by setting up a Virtual Private Cloud (VPC). Your VPC will use a dual-tiered subnet model consisting of both **public** and **private** subnets across multiple Availability Zones (AZs). You will establish the foundational network structure for deploying the Unstructured API by creating two public subnets and one private subnet within your VPC. The public subnets will host resources that require direct access to the internet, such as a load balancer, enabling them to communicate with external users.
The private subnet is designed for resources that should not be directly accessible from the internet, such as an Amazon EC2 instance. ![Infrastructure Diagram](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Infrastructure_Diagram.png) 1. **Access the VPC dashboard**: a. In the AWS Management Console, in the top menu bar, click **Services > Networking & Content Delivery > VPC**.
b. In the sidebar, click **Your VPCs**, and then click **Create VPC**.
2. **Create the VPC**: a. Select **VPC only**.
b. Enter a **Name tag** for your VPC.
c. Specify the **IPv4 CIDR block** (for example, `10.0.0.0/16`).
d. You may leave **IPv6 CIDR block**, **Tenancy**, and **Tags** settings at their defaults.
e. Click **Create VPC**. ![create vpc](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/VPC_Step2.png) 3. **Create the subnets**: a. After creating the VPC, in the sidebar, click **Subnets**.
b. Click **Create subnet**.
c. In the **VPC ID** dropdown menu, select the VPC that you just created.
d. For the first public subnet:
* Enter a **Subnet name**.
* Select an **Availability Zone**.
* Confirm the **IPv4 VPC CIDR block** (for example, `10.0.0.0/16`).
* Specify the **IPv4 subnet CIDR block** (for example, `10.0.1.0/24`).
* You may leave the **Tags** setting at its default.
* Click **Add new subnet**. (Do not click **Create subnet** yet.)
e. Repeat the process for the second public subnet with a different **Availability Zone** and **IPv4 subnet CIDR block** (for example, `10.0.2.0/24`).
* *Note: Each subnet must reside entirely within one Availability Zone and cannot span zones. If you specify the same Availability Zone or IPv4 subnet CIDR block as the first public subnet, AWS CloudFormation might fail in a later step.* To learn more, see [Subnet basics](https://docs.aws.amazon.com/vpc/latest/userguide/configure-subnets.html#subnet-basics).
* Click **Add new subnet**. (Do not click **Create subnet** yet.)
f. Repeat the process for the private subnet with a different **Availability Zone** and **IPv4 subnet CIDR block** (for example, `10.0.3.0/24`).
* *Note: Each subnet must reside entirely within one Availability Zone and cannot span zones. If you specify the same Availability Zone or IPv4 subnet CIDR block as the first or second public subnet, AWS CloudFormation might fail in a later step.*
g. Click **Create subnet**. ![create subnet](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/VPC_Step3.png)
4. **Create the internet gateway (for the public subnets)**: a. In the sidebar, click **Internet gateways**.
b. Click **Create internet gateway**, enter a **Name tag**, and click **Create internet gateway**.
c. In the sidebar, click **Internet gateways** again.
d. Click the **Internet gateway ID** for the internet gateway that you just created.
e. Click **Actions > Attach to VPC**.
f. In the **Available VPCs** dropdown list, select the VPC from *Step 2 - Create the VPC*.
g. Click **Attach internet gateway**.
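If you prefer to script this setup, steps 2 through 4 can also be performed with the AWS CLI. The following is only a sketch: the name tag, Availability Zones, and CIDR blocks mirror the examples above, and the `<vpc-id>` and `<igw-id>` placeholders stand for the resource IDs returned by the earlier commands in your account.

```shell
# Create the VPC (step 2).
aws ec2 create-vpc --cidr-block 10.0.0.0/16 \
  --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=unstructured-vpc}]'

# Create the two public subnets and one private subnet (step 3),
# each in its own Availability Zone.
aws ec2 create-subnet --vpc-id <vpc-id> --cidr-block 10.0.1.0/24 --availability-zone us-east-1a
aws ec2 create-subnet --vpc-id <vpc-id> --cidr-block 10.0.2.0/24 --availability-zone us-east-1b
aws ec2 create-subnet --vpc-id <vpc-id> --cidr-block 10.0.3.0/24 --availability-zone us-east-1c

# Create the internet gateway and attach it to the VPC (step 4).
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --internet-gateway-id <igw-id> --vpc-id <vpc-id>
```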
![create internet gateway](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/VPC_Step4.png)
5. **Set up route tables (for the public subnets)**: AWS automatically created a default route table in *Step 3 - Create the subnets*. To tailor your network architecture, you will create a new route table specifically for your public subnets, which will include a route to the internet gateway from *Step 4 - Create the internet gateway (for the public subnets)*.
a. In the sidebar, click **Route tables**.
b. Click **Create route table**.
c. Enter a **Name**.
d. Select the **VPC** from *Step 2 - Create the VPC*.
e. Click **Create route table**. ![create route table](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/VPC_Step5.png)
6. **Associate public subnets to the route table and internet gateway**: a. Connect the **public subnets** to the **route table** from *Step 5 - Set up route tables (for the public subnets)*:
* In the sidebar, click **Subnets**. * Select the first public subnet from *Step 3 - Create the subnets*. * Click **Actions > Edit route table association**. * In the **Route table ID** dropdown list, select the route table from *Step 5 - Set up route tables (for the public subnets)*, and then click **Save**. * Repeat the process for the second public subnet. b. Now, you’ll ensure that the two public subnets can access the internet by connecting the route table to the internet gateway:
* In the sidebar, click **Route tables**.
* Select the route table from *Step 5 - Set up route tables (for the public subnets)*.
* Click **Actions > Edit routes**.
* Click **Add route**. In the **Destination** box, enter `0.0.0.0/0`, which represents all IP addresses. In the **Target** dropdown list, select **Internet Gateway**, and select the internet gateway from *Step 4 - Create the internet gateway (for the public subnets)*.
* Click **Save changes** to establish the route, granting internet access to the first and second public subnets at the same time.
c. For the **private subnet**:
* In the sidebar, click **Subnets**.
* Select the private subnet from *Step 3 - Create the subnets*.
* Click **Actions > Edit route table association**.
* In the **Route table ID** dropdown list, select the main route table, or create and then select a new route table without a route to the internet gateway.
* Click **Save**. ![connect public subnet to route table](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/VPC_Step6.png) ![edit routes](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/VPC_Step7.png)
7. **Inspect the VPC resource map**: You can check the configurations from the resource maps on the VPC details dashboard by clicking **Your VPCs** in the sidebar, clicking the **VPC ID** for your VPC, and then clicking the **Resource map** tab. ![VPC Resource Maps](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/VPC_Step8.png)

## Part II: Deploying the Unstructured API from the AWS Marketplace

8. **Go to the Unstructured API page on AWS Marketplace**: a. Leaving the VPC dashboard from Part I open, in a separate web browser tab, go to the [Unstructured API](http://aws.amazon.com/marketplace/pp/prodview-fuvslrofyuato) product page in the AWS Marketplace.
b. Click **Continue to Subscribe**.
c. Review the terms and conditions.
d. Click **Continue to Configuration**.
![Unstructured API on AWS Marketplace](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Marketplace_Step8.png)
9. **Configure the CloudFormation template**: a. In the **Fulfillment option** dropdown list, select **CloudFormation Template**.
b. For **Fulfillment option** and **Software version**, leave the default `UnstructuredAPI` template and software version.
c. In the **Region** dropdown list, select the Region that corresponds to the VPC from Part I.
* *Note: You must select the same Region where you set up the VPC in Part I. To find the Region, on the VPC dashboard tab from Part I that you left open, with your VPC displayed, find the VPC's Region name next to your username in the top navigation bar.*
d. Click **Continue to Launch**.
e. In the **Choose Action** dropdown list, select **Launch CloudFormation**.
f. Click **Launch**. ![CloudFormation Configuration](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Marketplace_Step9.png)
10. **Create the CloudFormation stack**: After you click **Launch**, the **Create stack** page appears in CloudFormation.
**Step 1: Create the stack**
a. Leave **Choose an existing template** selected.
b. Leave **Amazon S3 URL** selected and the default **Amazon S3 URL** value unchanged.
c. Click **Next**. ![Create Stack](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Marketplace_Step10a.png)
**Step 2: Specify the stack's details**
a. Enter a unique **Stack name**.
b. In the **Parameters** section, in the **InstanceType** dropdown list, select **m5.xlarge**.
c. In the **KeyName** dropdown list, select the name of the SSH key pair from the beginning of this article.
d. In the **LoadBalancerScheme** dropdown list, select **internet-facing**.
e. For **SSHLocation**, enter `0.0.0.0/0` only if you intend to allow SSH access from anywhere on the internet.
* **Note**: It is generally recommended to limit SSH access to a specific IP range for enhanced security.
This can be done by setting the `SSHLocation` to the IP address or range associated with your organization. Please consult your IT department or VPN vendor to obtain the correct IP information for these settings.
* AWS provides `AWS Client VPN`, which is a managed client-based VPN service that enables secure access to AWS resources and resources in your on-premises network. To learn more, see [Getting started with AWS Client VPN](https://docs.aws.amazon.com/vpn/latest/clientvpn-admin/cvpn-getting-started.html).
f. In the **Subnets** dropdown multiselect list, select the two public subnets and the private subnet from Part I.
g. In the **VPC** dropdown list, select the VPC from Part I.
h. You can leave the default values for all of the other **Parameters** fields.
i. Click **Next**. ![Specify stack details](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Marketplace_Step10b.png)
**Step 3: Configure the stack's options**
a. You can leave the default values, or specify any non-default stack options.
b. Click **Next**. ![Specify stack options](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Marketplace_Step10c.png)
**Step 4: Review**
a. Review the stack's settings.
b. Click **Submit**. ![Review stack](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Marketplace_Step10d.png)
11. **Get the Unstructured API endpoint**: a. The CloudFormation details page for the stack appears. If you do not see it, on the sidebar, click **Stacks**, and then click the name of your stack.
b. Check the status of the CloudFormation stack. A successful deployment will show a **CREATE\_COMPLETE** value for the **Status** field on the **Stack Info** tab on this stack's details page. The deployment can take several minutes.
c. After a successful deployment, click the **Resources** tab on this stack's details page. Then click the **Physical ID** link next to **ApplicationLoadBalancer** on this tab.
d.
On the **EC2 > Load balancers > (Load balancer ID)** page that appears, copy the **DNS Name** value, which is shown as an **(A Record)** and ends with `.elb.amazonaws.com`.
* Note: You will use this **DNS Name** in place of the `<load-balancer-dns-name>` placeholder in the following healthcheck and data processing steps. ![Unstructured API Endpoint](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Marketplace_Step11.png)

## Healthcheck

Perform a health check by running this [curl](https://curl.se/) command from a terminal on your local machine, replacing `<load-balancer-dns-name>` with your application load balancer's DNS name. This health check can take several minutes:

```bash
curl http://<load-balancer-dns-name>/healthcheck
```

![Healthcheck](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/healthcheck.png)

## Data processing

For example, run one of the following, setting the following environment variables to make your code more portable:

* Set `UNSTRUCTURED_API_URL` to `http://`, followed by your load balancer's DNS name, followed by `/general/v0/general`. You can now use this value in place of the [Unstructured Partition Endpoint](/api-reference/partition/overview) URL as described elsewhere in the Unstructured API documentation.
* Set `LOCAL_FILE_INPUT_DIR` to the path on your local machine to the files for the Unstructured API to process. If you do not have any input files available, you can download any of the ones from the [example-docs](https://github.com/Unstructured-IO/unstructured-ingest/tree/main/example-docs) folder in GitHub.
* Set `LOCAL_FILE_OUTPUT_DIR` to the path on your local machine for the Unstructured API to send the processed output in JSON format.

You must first [install the Unstructured Ingest CLI](/ingestion/overview#unstructured-ingest-cli).
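For example, the environment variables described above can be set in a shell session like this sketch, where the DNS name and local paths are placeholders for your own values:

```shell
# Your load balancer's DNS name, plus the partitioning path.
export UNSTRUCTURED_API_URL="http://<load-balancer-dns-name>/general/v0/general"
# Local input and output directories for processing.
export LOCAL_FILE_INPUT_DIR="/path/to/local/input"
export LOCAL_FILE_OUTPUT_DIR="/path/to/local/output"
```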
Because you are calling a private API and therefore do not need an Unstructured API key, you can omit the command-line option `--api-key`. Or, for better code portability, it is recommended that you first set the environment variable `UNSTRUCTURED_API_KEY` to an empty string and then include the command-line option `--api-key`.

```bash CLI
unstructured-ingest \
  local \
  --input-path $LOCAL_FILE_INPUT_DIR \
  --output-dir $LOCAL_FILE_OUTPUT_DIR \
  --partition-by-api \
  --api-key $UNSTRUCTURED_API_KEY \
  --partition-endpoint $UNSTRUCTURED_API_URL \
  --strategy hi_res \
  --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"
```

You must first [install the Unstructured Ingest Python library](/ingestion/python-ingest). Because you are calling a private API and therefore do not need an Unstructured API key, you can omit the parameter `api_key`. Or, for better code portability, it is recommended that you first set the environment variable `UNSTRUCTURED_API_KEY` to an empty string and then include the parameter `api_key`.
```python Python Ingest
import os

from unstructured_ingest.pipeline.pipeline import Pipeline
from unstructured_ingest.interfaces import ProcessorConfig
from unstructured_ingest.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.processes.partitioner import PartitionerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
    ).run()
```

## Accessing the hosting EC2 instance

If you need to access the Amazon EC2 instance that hosts the Unstructured API, do the following:

1. In the CloudFormation console, open the details page for the stack from Part II. If you do not see it, on the CloudFormation console's sidebar, click **Stacks**, and then click the name of your stack.
2. Click the **Resources** tab on this stack's details page. Then click the **Physical ID** link next to **EC2TargetGroup** on this tab.
3. On the **EC2 > Target groups > (CloudFormation stack name)** page that appears, on the **Targets** tab, click the **Instance ID** link.
4. In the list of instances that appears, click the **Instance ID** link.
5. Click **Connect**, and then follow any of the on-screen options to access the EC2 instance.
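If you choose the SSH client option, a connection from your local machine looks like the following sketch. The key pair file is the one from the requirements at the start of this article; the `ec2-user` username is an assumption that depends on the instance's AMI, and the public DNS name is a placeholder shown on the instance's details page.

```shell
# Connect to the hosting EC2 instance over SSH (username depends on the AMI).
ssh -i /path/to/your-key-pair.pem ec2-user@<instance-public-dns>
```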
## Manage related AWS account costs

After you run the CloudFormation stack that you created in Part II, charges will begin accruing to your AWS account on an ongoing basis for related AWS resources. The amounts of these charges vary based on where these resources are located, which resources are covered by AWS Free Tier offerings, the extent to which you customize these resources' settings, how much you use these resources, and other factors.

Stopping or terminating the related Amazon EC2 instances alone will not eliminate these ongoing charges. To stop these charges from accruing, [delete the CloudFormation stack](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-delete-stack.html) that you created and ran in Part II. This stops and deletes all of the related AWS resources.

Before you delete the stack, note the following:

* You should click the **Resources** tab on the stack's details page to be aware of the associated resources that will be deleted.
* You should note any resource dependencies, resources with deletion protection or termination protection enabled, or nested stacks that might prevent stack deletion, and resolve these issues before you delete the stack. To find these kinds of issues:
  * On the **Template** tab on the stack's details page, look for occurrences of the `DependsOn` attribute, which are set to the name of the resource dependency.
  * On the **Template** tab on the stack's details page, look for occurrences of the `DeletionPolicy` attribute set to `Retain` or the `UpdateReplacePolicy` attribute set to `Retain`. The associated resources have deletion protection enabled.
  * On the **Stack info** tab on the stack's details page, look for the **Termination protection** field. If it is set to **Activated**, termination protection is enabled.
  * On the **Resources** tab on the stack's details page, look for resources with their **Type** set to `AWS::CloudFormation::Stack`. These indicate nested stacks.
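Once you have checked for these blockers, the stack can be deleted from the console or, equivalently, from the AWS CLI. This is a sketch; replace the stack name placeholder with your own:

```shell
# Delete the CloudFormation stack and all of its resources.
aws cloudformation delete-stack --stack-name <stack-name>
# Optionally block until the deletion finishes.
aws cloudformation wait stack-delete-complete --stack-name <stack-name>
```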
After you delete the stack, you should check your [AWS Billing and Cost Management dashboard](https://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/billinginfo.html) to confirm that associated charges are no longer accruing.

# Unstructured API on Azure

Source: https://docs.unstructured.io/api-reference/legacy-api/azure

The Unstructured API on Azure is deprecated. It is no longer supported and is not being actively updated. Unstructured is now available on the Azure Marketplace as a private offering. To explore supported options for running Unstructured within your virtual private cloud (VPC), email Unstructured Sales at [sales@unstructured.io](mailto:sales@unstructured.io).

This page is not being actively updated. It might contain out-of-date information. This page is provided for legacy reference purposes only.

Follow these steps to deploy the Unstructured API service into your Azure account.

This article describes how to create several interrelated resources in your Azure account. Your Azure account will be charged on an ongoing basis for these resources, even if you are not actively using them.

Manually shutting down the associated Azure virtual machine when you are not using it can help reduce—but not fully eliminate—these ongoing charges.

To stop accruing all related ongoing charges, you must delete all of the associated Azure resources.
Go to [https://portal.azure.com](https://portal.azure.com/).

Go to the [Unstructured Data Preprocessing - Customer Hosted API](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/unstructured1691024866136.customer_api_v1?tab=Overview/) offering in the Azure Marketplace. ![Azure Marketplace](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Azure_Step2.png)

1. Click **Get It Now** and fill out the form.
2. Read the terms and click **Continue**.
3. Click **Create**. ![Deployment Process](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Azure_Step3.png)

1. On the **Create a virtual machine** page, click the **Basics** tab.
2. In the **Project details** section, select an existing **Subscription** and a **Resource group** from the dropdown menus, or create a new resource group by clicking **Create new**. ![project details](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Azure_Step4a.png)
3. In the **Instance details** section, enter a name in the **Virtual machine name** field. Note this name, as you will need it in later steps.
4. Select a **Region** from the dropdown menu.
5. For **Image**, select **Unstructured Customer Hosted API Hourly - x64 Gen2** (*default*).
6. For **Size**, select a VM size from the dropdown menu, or leave the default VM size selection. To learn more, see [Azure VM comparisons](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/). ![instance details](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Azure_Step4b.png)
7. In the **Administrator account** section, for **Authentication type**, select **SSH public key** or **Password**.
8. Enter the credential settings, depending on the authentication type. ![administrator account](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Azure_Step4c.png)

Do not click **Review + create** yet. You must still set up the load balancer.

1. Click the **Networking** tab.
2.
In the **Networking interface** section, fill out the following fields: * **Virtual network**: Click **Create new**, or select an existing virtual network from the dropdown menu. To learn more, see [Quickstart: Use the Azure portal to create a virtual network](https://learn.microsoft.com/en-us/azure/virtual-network/quick-create-portal). * **Subnet**: Click **Manage subnet configuration**, or select a new or existing subnet from the dropdown menu. To learn more, see [Add, change, or delete a virtual network subnet](https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-manage-subnet?tabs=azure-portal). * **Configure network security group**: Click **Create new**, or select an existing security group from the dropdown menu. To learn more, see [Create, change, or delete a network security group](https://learn.microsoft.com/en-us/azure/virtual-network/manage-network-security-group?tabs=network-security-group-portal). 3. In the **Load balancing** section, fill out the following fields: * **Load balancing options**: Select **Azure load balancer**. * **Select a load balancer**: Click **Create a load balancer** and fill out the following fields in the pop-up window, or select an existing load balancer from the dropdown menu and note this name as you will need it in later steps: * Enter a **Load balancer name**. Note this name, as you will need it in later steps. * For **Type**, select **Public** or **Internal**. * For **Protocol**, select **TCP** or **UDP**. * Set both **Port** and **Backend port** to **80**. ![load balancer](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Azure_Step5.png) 4. Click **Create**. 1. Click **Review + create**. 2. Wait for validation. 3. Click **Create**. ![deployment](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Azure_Step6.png) 1. Go to your load balancer: in the Azure portal's **Search resources, services and docs** box, enter and then select **Load balancers**. 2. 
Search for and open the new load balancer that you created earlier, or the existing load balancer that you chose earlier.
3. Make any necessary settings updates to your new or existing load balancer, based on the recommended configurations in the [Load balancer network settings](#load-balancer-network-settings) section, later on this page.

1. Go to your virtual machine: in the Azure portal's **Search resources, services and docs** box, enter and then select **Virtual machines**.
2. Search for and open the new virtual machine that you created earlier, using the name that you entered earlier.
3. On the **Overview** tab, under **Properties**, note the **Public IP address** for the **Load balancer**. ![retrieve public ip](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/api/Azure_Step7.png)
4. The deployed endpoint URL is `http://<load-balancer-public-ip-address>/general/v0/general`. Note this endpoint URL, as you will need it later to call the Unstructured API.

Note the API environment variables in the [API environment variables](#api-environment-variables) section, later on this page. If you need to set any of these in the Docker container on the virtual machine, do the following:

1. If the virtual machine is not already running from earlier, click the **Start** icon.
2. After the virtual machine starts, click the **Connect** icon, and then click **Connect** from the drop-down list.
3. Follow the on-screen directions for one of the available options to connect to the virtual machine and display a connected terminal.
4. Stop the running container in the virtual machine, so that you can restart it later with the environment variables set: In the connected terminal, run the following command: `sudo docker container ls`.
5. Note the `CONTAINER ID` value for the running container.
6. Run the following command, replacing `<container-id>` with the `CONTAINER ID` value:

```bash
sudo docker container rm --force <container-id>
```

7.
Now run the container again, setting the environment variables at the same time: Run the following command: `sudo docker image ls`.
8. Note the `REPOSITORY` and `TAG` values for the Docker image.
9. Run the following command, replacing `<repository>` and `<tag>` with the `REPOSITORY` and `TAG` values for the Docker image, and replacing `<name1>=<value1>`, `<name2>=<value2>`, and so on with the environment variable name and value pairs:

```bash
sudo docker run -d --restart unless-stopped \
  -p 80:5000 \
  -e <name1>=<value1> -e <name2>=<value2> -e <name3>=<value3> \
  <repository>:<tag>
```

10. Verify that the environment variables were set correctly: Run the following command:

```bash
sudo docker container ls
```

11. Note the `CONTAINER ID` value for the running container.
12. Run the following command, replacing `<container-id>` with the `CONTAINER ID` value:

```bash
sudo docker exec <container-id> bash -c 'printenv'
```

13. The environment variables should be in the list that appears.

To help manage your overall costs, you should click the **Stop** icon whenever you are not using this virtual machine to call the Unstructured API.

You can now use the running virtual machine to call the Unstructured API. For example, run one of the following, setting the following environment variables to make your code more portable:

* Set `UNSTRUCTURED_API_URL` to `http://`, followed by your load balancer's public IP address, followed by `/general/v0/general`.
* Set `LOCAL_FILE_INPUT_DIR` to the path on your local machine to the files for the Unstructured API to process. If you do not have any input files available, you can download any of the ones from the [example-docs](https://github.com/Unstructured-IO/unstructured-ingest/tree/main/example-docs) folder in GitHub.
* Set `LOCAL_FILE_OUTPUT_DIR` to the path on your local machine for the Unstructured API to send the processed output in JSON format.

You must first [install the Unstructured Ingest CLI](/ingestion/overview#unstructured-ingest-cli).
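For example, these environment variables can be set in a shell session like this sketch, where the public IP address and local paths are placeholders for your own values:

```shell
# Your load balancer's public IP address, plus the partitioning path.
export UNSTRUCTURED_API_URL="http://<load-balancer-public-ip>/general/v0/general"
# Local input and output directories for processing.
export LOCAL_FILE_INPUT_DIR="/path/to/local/input"
export LOCAL_FILE_OUTPUT_DIR="/path/to/local/output"
```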
Because you are calling a private API and therefore do not need an Unstructured API key, you can omit the command-line option `--api-key`. Or, for better code portability, it is recommended that you first set the environment variable `UNSTRUCTURED_API_KEY` to an empty string and then include the command-line option `--api-key`.

```bash CLI
unstructured-ingest \
  local \
  --input-path $LOCAL_FILE_INPUT_DIR \
  --output-dir $LOCAL_FILE_OUTPUT_DIR \
  --partition-by-api \
  --api-key $UNSTRUCTURED_API_KEY \
  --partition-endpoint $UNSTRUCTURED_API_URL \
  --strategy hi_res \
  --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"
```

You must first [install the Unstructured Ingest Python library](/ingestion/python-ingest). Because you are calling a private API and therefore do not need an Unstructured API key, you can omit the parameter `api_key`. Or, for better code portability, it is recommended that you first set the environment variable `UNSTRUCTURED_API_KEY` to an empty string and then include the parameter `api_key`.
```python Python Ingest
import os

from unstructured_ingest.pipeline.pipeline import Pipeline
from unstructured_ingest.interfaces import ProcessorConfig
from unstructured_ingest.processes.connectors.local import (
    LocalIndexerConfig,
    LocalDownloaderConfig,
    LocalConnectionConfig,
    LocalUploaderConfig
)
from unstructured_ingest.processes.partitioner import PartitionerConfig

if __name__ == "__main__":
    Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=os.getenv("UNSTRUCTURED_API_KEY"),
            partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
            strategy="hi_res",
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
    ).run()
```

To help manage your overall costs, you should stop running the associated virtual machine whenever you are not using it to call the Unstructured API.

## Load balancer network settings

Unstructured recommends the following load balancer settings, which you should set on your deployment's load balancer soon after you finalize and deploy it.
On the load balancer's **Overview** tab in the Azure portal:

* **SKU**: **Standard**

On the load balancer's **Settings** tab in the Azure portal:

* **Frontend IP configuration**: Private IP
* **Backend pools**: VMSS
* **Health probes**:
  * **Protocol**: **HTTP**, or **HTTPS** (this requires setting up a reverse proxy on the VMSS set to do TLS termination)
  * **Port**: `80` or `443` (this can be any port that the backend VMs are listening on)
  * **Path**: `/healthcheck`
  * **Interval (seconds)**: `5`
* **Load balancing rules**:
  * **Protocol**: **TCP**
  * **Port**: `443` for HTTPS, or `80` for HTTP
  * **Backend port**: `443` for HTTPS, or `80` for HTTP
  * **Idle timeout (minutes)**: `60`
  * **Enable TCP Reset** box: Checked
* **Inbound NAT rules**:
  * **Frontend Port**: `443` for HTTPS, or `80` for HTTP
  * **Backend port**: `443` for HTTPS, or `80` for HTTP
  * **Protocol**: **TCP**
  * **Enable TCP Reset** box: Checked
  * **Idle timeout (minutes)**: `60`

## API environment variables

Unstructured supports the following environment variables, which you can set in the Docker image on the virtual machine, as needed:

* `ALLOW_ORIGINS`: CORS-allowed origins.
* `UNSTRUCTURED_ALLOWED_MIMETYPE`: The list of allowed MIME types, if you want to limit the file types that can be processed.
* `UNSTRUCTURED_API_KEY`: The default Unstructured API key to use.
* `UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB`: The minimum amount of free memory in MB to allow for processing a file. If this memory is too low, the server will return a `503` error.
* `UNSTRUCTURED_PDF_HI_RES_MAX_PAGES`: The maximum number of pages in a PDF file that the Unstructured API will not reject, if the `hi_res` strategy is used. The default is `300`.
* `UNSTRUCTURED_REDIRECT_ROOT_URL`: If this is set, redirect a `GET` request to the Unstructured API to use this URL instead.
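For example, following the container restart procedure described earlier on this page, a `docker run` invocation that sets two of these variables might look like the following sketch. The image name, tag, and variable values are placeholders; `-p 80:5000` matches the port mapping used earlier.

```shell
# Restart the API container with two of the supported environment variables set.
sudo docker run -d --restart unless-stopped \
  -p 80:5000 \
  -e UNSTRUCTURED_PDF_HI_RES_MAX_PAGES=500 \
  -e UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB=2048 \
  <repository>:<tag>
```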
# Overview Source: https://docs.unstructured.io/api-reference/legacy-api/overview Unstructured has deprecated the following APIs: * The [Unstructured API on AWS](/api-reference/legacy-api/aws) is deprecated. It is no longer supported and is not being actively updated. Unstructured is now available on the AWS Marketplace as a private offering. To explore supported options for running Unstructured within your virtual private cloud (VPC), email Unstructured Sales at [sales@unstructured.io](mailto:sales@unstructured.io). * The [Unstructured API on Azure](/api-reference/legacy-api/azure) is deprecated. It is no longer supported and is not being actively updated. Unstructured is now available on the Azure Marketplace as a private offering. To explore supported options for running Unstructured within your virtual private cloud (VPC), email Unstructured Sales at [sales@unstructured.io](mailto:sales@unstructured.io). These pages are not being actively updated. They might contain out-of-date information. These pages are provided for legacy reference purposes only. # Overview Source: https://docs.unstructured.io/api-reference/overview The Unstructured API consists of two parts: * The [Unstructured Workflow Endpoint](/api-reference/workflow/overview) enables a full range of partitioning, chunking, embedding, and enrichment options for your files and data. It is designed to batch-process files and data in remote locations; send processed results to various storage, databases, and vector stores; and use the latest and highest-performing models on the market today. It has built-in logic to deliver the highest quality results at the lowest cost. [Learn more](/api-reference/workflow/overview). * The [Unstructured Partition Endpoint](/api-reference/partition/overview) is intended for rapid prototyping of Unstructured's various partitioning strategies, with limited support for chunking. It is designed to process local files only, one file at a time.
Use the [Unstructured Workflow Endpoint](/api-reference/workflow/overview) for production-level scenarios, file processing in batches, files and data in remote locations, generating embeddings, applying post-transform enrichments, using the latest and highest-performing models, and for the highest quality results at the lowest cost. [Learn more](/api-reference/workflow/overview). # Benefits over open source The Unstructured API provides the following benefits beyond the [Unstructured open source library](/open-source/introduction/overview) offering: * Designed for production scenarios. * Significantly increased performance on document and table extraction. * Access to newer and more sophisticated vision transformer models. * Access to Unstructured's fine-tuned OCR models. * Access to Unstructured's by-page and by-similarity chunking strategies. * Adherence to security and SOC2 Type 1, SOC2 Type 2, and HIPAA compliance standards. * Authentication and identity management. * Incremental data loading. * Image extraction from documents. * More sophisticated document hierarchy detection. * Unstructured manages code dependencies, for instance for libraries such as Tesseract. * Unstructured manages its own infrastructure, including parallelization and other performance optimizations. ## Get support Should you require any assistance or have any questions regarding the Unstructured API, please [contact us directly](https://unstructured.io/contact). # Partition Endpoint parameters Source: https://docs.unstructured.io/api-reference/partition/api-parameters The Unstructured Partition Endpoint provides parameters to customize the processing of documents. The only required parameter is `files`: the file that you want to process. The available parameters are described in the following tables.
[//]: # "TODO: In the future we should automatically generate this table to make sure it's up to date" | POST, Python | JavaScript/TypeScript | Description | | ------------------------------------------ | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `files` (*shared.Files*) | `files` (*File*, *Blob*, *shared.Files*) | The file to process. | | `chunking_strategy` (*str*) | `chunkingStrategy` (*string*) | Use one of the supported strategies to chunk the returned elements after partitioning. When no chunking strategy is specified, no chunking is performed and any other chunking parameters provided are ignored. Supported strategies: `basic`, `by_title`, `by_page`, and `by_similarity`. [Learn more](/api-reference/partition/chunking). | | `content_type` (*str*) | `contentType` (*string*) | A hint to Unstructured about the content type to use (such as `text/markdown`), when there are problems processing a specific file. This value is a MIME type in the format `type/subtype`. For available MIME types, see [model.py](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/file_utils/model.py). | | `coordinates` (*bool*) | `coordinates` (*boolean*) | True to return bounding box coordinates for each element extracted with OCR. Default: false. [Learn more](/api-reference/partition/examples#saving-bounding-box-coordinates). | | `encoding` (*str*) | `encoding` (*string*) | The encoding method used to decode the text input. Default: `utf-8`. 
| | `extract_image_block_types` (*List\[str]*) | `extractImageBlockTypes` (*string\[]*) | The types of elements to extract, for use in extracting image blocks as Base64 encoded data stored in element metadata fields, for example: `["Image","Table"]`. Supported filetypes are image and PDF. [Learn more](/api-reference/partition/extract-image-block-types). | | `gz_uncompressed_content_type` (*str*) | `gzUncompressedContentType` (*string*) | If the file is gzipped, use this content type after unzipping. Example: `application/pdf` | | `hi_res_model_name` (*str*) | `hiResModelName` (*string*) | The name of the inference model used when strategy is `hi_res`. Options are `layout_v1.1.0` and `yolox`. Default: `layout_v1.1.0`. [Learn more](/api-reference/partition/examples#changing-partition-strategy-for-a-pdf). | | `include_page_breaks` (*bool*) | `includePageBreaks` (*boolean*) | True for the output to include page breaks if the filetype supports it. Default: false. | | `languages` (*List\[str]*) | `languages` (*string\[]*) | The languages present in the document, for use in partitioning and OCR. [View the list of available languages](https://github.com/tesseract-ocr/tessdata). [Learn more](/api-reference/partition/examples#specifying-the-language-of-a-document-for-better-ocr-results). | | `output_format` (*str*) | `outputFormat` (*string*) | The format of the response. Supported formats are `application/json` and `text/csv`. Default: `application/json`. | | `pdf_infer_table_structure` (*bool*) | `pdfInferTableStructure` (*boolean*) | **Deprecated!** Use `skip_infer_table_types` instead. If true and `strategy` is `hi_res`, any `Table` elements extracted from a PDF will include an additional metadata field, `text_as_html`, where the value (string) is just a transformation of the data into an HTML table. | | `skip_infer_table_types` (*List\[str]*) | `skipInferTableTypes` (*string\[]*) | The document types that you want to skip table extraction for. Default: `[]`.
| | `starting_page_number` (*int*) | `startingPageNumber` (*number*) | The page number to be assigned to the first page in the document. This information will be included in elements' metadata and can be especially useful when partitioning a document that is part of a larger document. | | `strategy` (*str*) | `strategy` (*string*) | The strategy to use for partitioning PDF and image files. Options are `auto`, `vlm`, `hi_res`, `fast`, and `ocr_only`. Default: `auto`. [Learn more](/api-reference/partition/partitioning). | | `unique_element_ids` (*bool*) | `uniqueElementIds` (*boolean*) | True to assign UUIDs to element IDs, which guarantees their uniqueness (useful when using them as primary keys in a database). Otherwise a SHA-256 of the element's text is used. Default: false. | | `vlm_model` (*str*) | (Not yet available) | Applies only when `strategy` is `vlm`. The name of the vision language model (VLM) to use for partitioning. `vlm_model_provider` must also be specified. For a list of allowed values, see the end of this article. | | `vlm_model_provider` (*str*) | (Not yet available) | Applies only when `strategy` is `vlm`. The name of the vision language model (VLM) provider to use for partitioning. `vlm_model` must also be specified. For a list of allowed values, see the end of this article. | | `xml_keep_tags` (*bool*) | `xmlKeepTags` (*boolean*) | True to retain the XML tags in the output. Otherwise, only the text from within the tags is extracted. Only applies to XML documents. | The following parameters only apply when a chunking strategy is specified. Otherwise, they are ignored. [Learn more](/api-reference/partition/chunking).
| POST, Python | JavaScript/TypeScript | Description | | -------------------------------- | --------------------------------- | ----------- | | `combine_under_n_chars` (*int*) | `combineUnderNChars` (*number*) | Applies only when the chunking strategy is set to `by_title`. Use this parameter to combine small chunks until the combined chunk reaches a length of `n` characters. This can mitigate the appearance of small chunks created when short paragraphs, not intended as section headings, are identified as `Title` elements in certain documents. Default: the same value as `max_characters`. | | `include_orig_elements` (*bool*) | `includeOrigElements` (*boolean*) | True (the default) to have the elements that are used to form a chunk appear in `.metadata.orig_elements` for that chunk. | | `max_characters` (*int*) | `maxCharacters` (*number*) | Cut off new sections after reaching a length of `n` characters. (This is a hard maximum.) Default: 500. | | `multipage_sections` (*bool*) | `multipageSections` (*boolean*) | Applies only when the chunking strategy is set to `by_title`. Determines if a chunk can include elements from more than one page. Default: true. | | `new_after_n_chars` (*int*) | `newAfterNChars` (*number*) | Applies only when a chunking strategy is specified. Cuts off new sections after reaching a length of `n` characters. (This is a soft maximum.) Default: 1500. | | `overlap` (*int*) | `overlap` (*number*) | A prefix of this many trailing characters from the prior text-split chunk is applied to second and later chunks formed from oversized elements by text-splitting. Default: none.
| | `overlap_all` (*bool*) | `overlapAll` (*boolean*) | True to have an overlap also applied to "normal" chunks formed by combining whole elements. Use with caution, as this can introduce noise into otherwise clean semantic units. Default: none. | | `similarity_threshold` (*float*) | `similarityThreshold` (*number*) | Applies only when the chunking strategy is set to `by_similarity`. The minimum similarity text in consecutive elements must have to be included in the same chunk. Must be between 0.0 and 1.0, exclusive (0.01 to 0.99, inclusive). Default: 0.5. | The following parameters are specific to the Python and JavaScript/TypeScript clients and are not sent to the server. [Learn more](/api-reference/partition/sdk-python#page-splitting). | POST, Python | JavaScript/TypeScript | Description | | ------------------------------------- | ------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `split_pdf_page` (*bool*) | `splitPdfPage` (*boolean*) | True to split the PDF file client-side. [Learn more](/api-reference/partition/sdk-python#page-splitting). | | `split_pdf_allow_failed` (*bool*) | `splitPdfAllowFailed` (*boolean*) | When `true`, a failed split request will not stop the processing of the rest of the document. The affected page range will be ignored in the results. When `false`, a failed split request will cause the entire document to fail. Default: `false`. | | `split_pdf_concurrency_level` (*int*) | `splitPdfConcurrencyLevel` (*number*) | The number of split files to be sent concurrently. Default: 5. Maximum: 15. | | `split_pdf_page_range` (*List\[int]*) | `splitPdfPageRange` (*number\[]*) | A list of 2 integers within the range `[1, length_of_pdf]`. 
When pdf splitting is enabled, this will send only the specified page range to the API. | Need help getting started? Check out the [Examples page](/api-reference/partition/examples) for some inspiration. Allowed values for `vlm_model_provider` and `vlm_model` pairs include the following: | `vlm_model_provider` | `vlm_model` | | -------------------- | ---------------------------------------------- | | `anthropic` | `claude-3-5-sonnet-20241022` | | `anthropic_bedrock` | `claude-3-5-sonnet-20241022` | | `bedrock` | `us.amazon.nova-lite-v1:0` | | `bedrock` | `us.amazon.nova-pro-v1:0` | | `bedrock` | `us.anthropic.claude-3-5-sonnet-20241022-v2:0` | | `bedrock` | `us.anthropic.claude-3-haiku-20240307-v1:0` | | `bedrock` | `us.anthropic.claude-3-opus-20240229-v1:0` | | `bedrock` | `us.anthropic.claude-3-sonnet-20240229-v1:0` | | `bedrock` | `us.meta.llama3-2-11b-instruct-v1:0` | | `bedrock` | `us.meta.llama3-2-90b-instruct-v1:0` | | `openai` | `gpt-4o` | | `vertexai` | `gemini-2.0-flash-001` | # Endpoint validation errors Source: https://docs.unstructured.io/api-reference/partition/api-validation-errors This section details the structure of HTTP validation errors returned by the Unstructured Partition Endpoint. ## HTTPValidationError **Type**: object **Title**: HTTPValidationError **Detail** * **Type**: array * **Description**: An array of ValidationError items, providing detailed information about the validation errors encountered. ## ValidationError **Type**: object **Title**: ValidationError **Required Fields**: loc, msg, type * **Location (loc)** * **Type**: array * **Description**: The location of the validation error in the request. Each item in the array can be either a string (e.g., field name) or an integer (e.g., array index). * **Message (msg)** * **Type**: string * **Description**: A descriptive message about the validation error. * **Error Type (type)** * **Type**: string * **Description**: The type of validation error, categorizing the nature of the error. 
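To make these errors actionable in a client, you can flatten each `ValidationError` into a readable string. The sketch below assumes the response body uses a lowercase `detail` key (the usual FastAPI-style wire format); `format_validation_errors` is a hypothetical helper name:

```python
def format_validation_errors(payload: dict) -> list:
    # Flatten each ValidationError's loc/msg/type into one readable line.
    messages = []
    for err in payload.get("detail", []):
        location = ".".join(str(part) for part in err["loc"])
        messages.append(f"{location}: {err['msg']} ({err['type']})")
    return messages

sample = {
    "detail": [
        {
            "loc": ["body", "strategy"],
            "msg": "value is not a valid enumeration member",
            "type": "type_error.enum",
        }
    ]
}
print(format_validation_errors(sample))
# ['body.strategy: value is not a valid enumeration member (type_error.enum)']
```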
# Chunking strategies Source: https://docs.unstructured.io/api-reference/partition/chunking Chunking functions use metadata and document elements detected with partition functions to split a document into appropriately-sized chunks for use cases such as Retrieval Augmented Generation (RAG). If you are familiar with chunking methods that split long text documents into smaller chunks, you'll notice that Unstructured's methods differ slightly, since the partitioning step already divides an entire document into its structural elements. Individual elements are only split if they exceed the desired maximum chunk size. Two or more consecutive text elements that will together fit within `max_characters` are combined. After chunking, you will only have elements of the following types: * `CompositeElement`: Any text element will become a `CompositeElement` after chunking. A composite element can be a combination of two or more original text elements that together fit within the maximum chunk size. It can also be a single element that doesn't leave room in the chunk for any others but fits by itself. Or it can be a fragment of an original text element that was too big to fit in one chunk and required splitting. * `Table`: A table element is not combined with other elements, and if it fits within `max_characters` it remains as is. * `TableChunk`: Large tables that exceed the `max_characters` chunk size are split into special `TableChunk` elements. ### "basic" chunking strategy * The basic strategy combines sequential elements to maximally fill each chunk while respecting both the specified `max_characters` (hard-max) and `new_after_n_chars` (soft-max) option values. * A single element that by itself exceeds the hard-max is isolated (never combined with another element) and then divided into two or more chunks using text-splitting. * A `Table` element is always isolated and never combined with another element.
A `Table` can be oversized, like any other text element, and in that case is divided into two or more `TableChunk` elements using text-splitting. * If specified, `overlap` is applied between chunks formed by splitting oversized elements and is also applied between other chunks when `overlap_all` is `True`. ### "by\_title" chunking strategy The `by_title` chunking strategy preserves section boundaries and optionally page boundaries as well. “Preserving” here means that a single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one started, even if the next element would fit in the prior chunk. In addition to the behaviors of the `basic` strategy above, the `by_title` strategy has the following behaviors: * **Detect section headings.** A `Title` element is considered to start a new section. When a `Title` element is encountered, the prior chunk is closed and a new chunk started, even if the `Title` element would fit in the prior chunk. * **Respect page boundaries.** Page boundaries can optionally also be respected using the `multipage_sections` argument. This defaults to `True` meaning that a page break does *not* start a new chunk. Setting this to `False` will separate elements that occur on different pages into distinct chunks. * **Combine small sections.** In certain documents, partitioning may identify a list-item or other short paragraph as a `Title` element even though it does not serve as a section heading. This can produce chunks substantially smaller than desired. This behavior can be mitigated using the `combine_text_under_n_chars` argument. This defaults to the same value as `max_characters` such that sequential small sections are combined to maximally fill the chunking window. Setting this to `0` will disable section combining. ### "by\_page" chunking strategy Only available in the Unstructured UI and API. 
The `by_page` chunking strategy ensures that content from different pages does not end up in the same chunk. When a new page is detected, the existing chunk is completed and a new one is started, even if the next element would fit in the prior chunk. ### "by\_similarity" chunking strategy Only available in the Unstructured UI and API. The `by_similarity` chunking strategy employs the `sentence-transformers/multi-qa-mpnet-base-dot-v1` embedding model to identify topically similar sequential elements and combine them into chunks. As with other strategies, chunks will never exceed the hard-maximum chunk size set by `max_characters`. For this reason, not all elements that share a topic will necessarily appear in the same chunk. However, with this strategy you can guarantee that two elements with low similarity will not be combined in a single chunk. You can control the minimum topical similarity that elements must have to be placed in the same chunk by setting the `similarity_threshold` parameter. `similarity_threshold` expects a value between 0.0 and 1.0 specifying the minimum similarity that text in consecutive elements must have to be included in the same chunk. The default is 0.5. ## Learn more [Chunking for RAG: best practices](https://unstructured.io/blog/chunking-for-rag-best-practices) # Document elements and metadata Source: https://docs.unstructured.io/api-reference/partition/document-elements `unstructured` simplifies and streamlines the preprocessing of structured and unstructured documents for downstream tasks. This means that no matter where your data is and no matter what format that data is in, Unstructured's toolkit will transform and preprocess that data into an easily digestible and usable format that is uniform across data formats. When you partition a document with Unstructured, the result is a list of document `Element` objects. These element objects represent different components of the source document.
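Because each element carries a `type`, filtering for downstream tasks reduces to a simple comprehension. A minimal sketch over literal element dicts (illustrative sample data; no API call is made):

```python
elements = [
    {"type": "Title", "text": "Intro", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Some prose.", "metadata": {"page_number": 1}},
    {"type": "Footer", "text": "p. 1", "metadata": {"page_number": 1}},
]

# Keep only narrative content, dropping titles, headers, and footers by type.
narrative = [el["text"] for el in elements if el["type"] == "NarrativeText"]
print(narrative)  # ['Some prose.']
```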
## Element example Here's an example of what an element might look like: ``` {'type': 'NarrativeText', 'element_id': '5ef1d1117721f0472c1ad825991d7d37', 'text': 'The Unstructured API documentation covers the following API services:', 'metadata': {'last_modified': '2024-05-01T14:15:22', 'page_number': 1, 'languages': ['eng'], 'parent_id': '56f24319ae258b735cac3ec2a271b1d9', 'file_directory': '/content', 'filename': 'Unstructured - Unstructured.html', 'filetype': 'text/html'}} ``` As you can see, every element will have a `type`, an `element_id`, the extracted `text`, and some `metadata` which may vary depending on the element type, document structure, and some additional parameters used during partitioning and/or chunking. Let's explore some of these document element components in more detail. ## Element type Instead of treating all documents like a wall of plain text, Unstructured preserves the semantic structure of the documents. This gives you more control and flexibility over how you further use the processed documents and allows you to take their structure into consideration. At the same time, normalizing data from various file formats to the Unstructured element type scheme lets you treat all documents the same in your downstream processing, regardless of source format. For example, if you plan to summarize a document, you may only be interested in the narrative of the document, and not care about footers and headers. You can easily filter out the elements you don't need using their type. Here are some examples of the element types your document may contain: | Element type | Description | | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | | `Formula` | An element containing formulas in a document. | | `FigureCaption` | An element for capturing text associated with figure captions. 
| | `NarrativeText` | NarrativeText is an element consisting of multiple, well-formulated sentences. This excludes elements such as titles, headers, footers, and captions. | | `ListItem` | ListItem is a NarrativeText element that is part of a list. | | `Title` | A text element for capturing titles. | | `Address` | A text element for capturing physical addresses. | | `EmailAddress` | A text element for capturing email addresses. | | `Image` | A text element for capturing image metadata. | | `PageBreak` | An element for capturing page breaks. | | `Table` | An element for capturing tables. | | `Header` | An element for capturing document headers. | | `Footer` | An element for capturing document footers. | | `CodeSnippet` | An element for capturing code snippets. | | `PageNumber` | An element for capturing page numbers. | | `UncategorizedText` | Base element for capturing free text from within a document. | If you apply chunking during partitioning of a document or later, you will also see the `CompositeElement` type. `CompositeElement` is a chunk formed from text (non-Table) elements. It is only produced by chunking. A composite element may be formed by combining one or more sequential elements produced by partitioning. For example, several individual list items may be combined into a single chunk. ## Element ID By default, the element ID is a SHA-256 hash of the element's text, its position on the page, the page number it's on, and the name of the document file. This ensures that the ID is deterministic and unique at the document level. To obtain globally unique IDs in the output (UUIDs), you can pass `unique_element_ids=True` into any of the partition functions. This can be helpful if you'd like to use the IDs as a primary key in a database, for example. ## Metadata Unstructured tracks a variety of metadata about the elements extracted from documents.
Here are a couple of examples of what element metadata enables you to do: * filter document elements based on an element metadata value. For instance, you may want to limit your scope to elements from a certain page, or you may want to use only elements that have an email matching a regular expression in their metadata. * map an element to the document page where it occurred so that original page can be retrieved when that element matches search criteria. Metadata is tracked at the element level. You can access the metadata for a given document element with `element.metadata`. For a dictionary representation, use `element.metadata.to_dict()`. ### Common metadata fields All document types return the following metadata fields when the information is available from the source file: | Metadata field name | Description | | -------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `filename` | Filename | | `file_directory` | File directory | | `last_modified` | Last modified Date | | `filetype` | File type | | `coordinates` | XY Bounding Box Coordinates. See notes below for further details about the bounding box. | | `parent_id` | Element Hierarchy. `parent_id` may be used to infer where an element resides within the overall hierarchy of a document. For instance, a NarrativeText element may have a Title element as a parent (a “sub-title”), which in turn may have another Title element as its parent (a "title"). | | `category_depth` | Element depth relative to other elements of the same category. 
Category depth is the depth of an element relative to other elements of the same category. It’s set by a document partitioner and enables the hierarchy post-processor to compute more accurate hierarchies. Category depth may be set using native document hierarchies, e.g. reflecting `<h1>`, `<h2>`, or `<h3>` tags within an HTML document or the indentation level of a bulleted list item in a Word document. | | `text_as_html` | HTML representation of extracted tables. Only applicable to table elements. | | `languages` | Document languages. At document level or element level. The list is ordered by probability of being the primary language of the text. | | `emphasized_text_contents` | Emphasized text (bold or italic) in the original document. | | `emphasized_text_tags` | Tags on text that is emphasized in the original document. | | `is_continuation` | True if the element is a continuation of a previous element. Only relevant for chunking, if an element was divided into two due to `max_characters`. | | `detection_class_prob` | Detection model class probabilities. From unstructured-inference, hi-res strategy. | Notes on common metadata fields: #### Metadata for document hierarchy `parent_id` and `category_depth` enhance hierarchy detection to identify the document structure in various file formats by measuring the relative depth of an element within its category. This is especially useful in documents with native hierarchies like HTML or Word files, where elements like headings or list items inherently define structure. #### Element's coordinates Some document types support location data for the elements, usually in the form of bounding boxes. If it exists, an element's location data is available with `element.metadata.coordinates`. The `coordinates` property of an `ElementMetadata` stores: * `points`: These specify the corners of the bounding box starting from the top left corner and proceeding counter-clockwise. The points represent pixels, the origin is in the top left, and the `y` coordinate increases in the downward direction. * `system`: The points have an associated coordinate system. A typical example of a coordinate system is `PixelSpace`, which is used for representing the coordinates of images. The coordinate system has a name, orientation, layout width, and layout height.
The Unstructured Open Source library offers a way to change the coordinates of an element to a new coordinate system by using the `Element.convert_coordinates_to_new_system` method. If the `in_place` flag is `True`, the coordinate system and points of the element are updated in place and the new coordinates are returned. If the `in_place` flag is `False`, only the altered coordinates are returned. ```python from unstructured.documents.elements import Element from unstructured.documents.coordinates import PixelSpace, RelativeCoordinateSystem coordinates = ((10, 10), (10, 100), (200, 100), (200, 10)) coordinate_system = PixelSpace(width=850, height=1100) element = Element(coordinates=coordinates, coordinate_system=coordinate_system) print(element.metadata.coordinates.to_dict()) print(element.metadata.coordinates.system.orientation) print(element.metadata.coordinates.system.width) print(element.metadata.coordinates.system.height) element.convert_coordinates_to_new_system(RelativeCoordinateSystem(), in_place=True) # Should now be in terms of new coordinate system print(element.metadata.coordinates.to_dict()) print(element.metadata.coordinates.system.orientation) print(element.metadata.coordinates.system.width) print(element.metadata.coordinates.system.height) ``` ### Additional metadata fields by document type | Field Name | Applicable Doc Types | Description | | ---------------------- | -------------------- | ------------------------------------------------------------------------------- | | `page_number` | DOCX, PDF, PPT, XLSX | Page number | | `page_name` | XLSX | Sheet name in an Excel document | | `sent_from` | EML | Email sender | | `sent_to` | EML | Email recipient | | `subject` | EML | Email subject | | `attached_to_filename` | MSG | filename that attachment file is attached to | | `header_footer_type` | Word Doc | Pages a header or footer applies to: "primary", "even\_only", and "first\_page" | | `link_urls` | HTML | The url associated with a link in a 
document. | | `link_texts` | HTML | The text associated with a link in a document. | | `section` | EPUB | Book section title corresponding to table of contents | Notes on additional metadata by document type: #### Email Emails will include `sent_from`, `sent_to`, and `subject` metadata. `sent_from` is a list of strings because the [RFC 822](https://www.rfc-editor.org/rfc/rfc822) spec for emails allows for multiple sent from email addresses. #### Microsoft Excel documents For Excel documents, `ElementMetadata` will contain a `page_name` element, which corresponds to the sheet name in the Excel document. #### Microsoft Word documents Headers and footers in Word documents include a `header_footer_type` indicating which page a header or footer applies to. Valid values are `"primary"`, `"even_only"`, and `"first_page"`. ### Table-specific metadata For `Table` elements, the raw text of the table will be stored in the `text` attribute for the Element, and HTML representation of the table will be available in the element metadata under `element.metadata.text_as_html`. By default, Unstructured will automatically extract all tables for all doc types unless you set `skip_infer_table_types` parameter. Here's an example of a table element. The `text` of the element will look like this: ``` Dataset Base Model1 Large Model Notes PubLayNet [38] F / M M Layouts of modern scientific documents PRImA [3] M - Layouts of scanned modern magazines and scientific reports Newspaper [17] F - Layouts of scanned US newspapers from the 20th century TableBank [18] F F Table region on modern scientific and business document HJDataset [31] F / M - Layouts of history Japanese documents ``` And the `text_as_html` metadata for the same element will look like this: ```py
<table>
  <thead>
    <tr><th>Dataset</th><th>Base Model1</th><th>Large Model</th><th>Notes</th></tr>
  </thead>
  <tbody>
    <tr><td>PubLayNet [38]</td><td>F / M</td><td>M</td><td>Layouts of modern scientific documents</td></tr>
    <tr><td>PRImA [3]</td><td>M</td><td>-</td><td>Layouts of scanned modern magazines and scientific reports</td></tr>
    <tr><td>Newspaper [17]</td><td>F</td><td>-</td><td>Layouts of scanned US newspapers from the 20th century</td></tr>
    <tr><td>TableBank [18]</td><td>F</td><td>F</td><td>Table region on modern scientific and business document</td></tr>
    <tr><td>HJDataset [31]</td><td>F / M</td><td>-</td><td>Layouts of history Japanese documents</td></tr>
  </tbody>
</table>
```

### Data connector metadata fields

Documents processed through source connectors include additional document metadata. These additional fields only ever appear if the source document was processed by a connector.

#### Common data connector metadata fields

* Data Source metadata (on JSON output):
  * url
  * version
  * date created
  * date modified
  * date processed
  * record locator
* Record locator is specific to each connector

#### Additional metadata fields by connector type (via record locator)

| Source connector      | Additional metadata                       |
| --------------------- | ----------------------------------------- |
| airtable              | base id, table id, view id                |
| azure (from fsspec)   | protocol, remote file path                |
| box (from fsspec)     | protocol, remote file path                |
| confluence            | url, page id                              |
| discord               | channel                                   |
| dropbox (from fsspec) | protocol, remote file path                |
| elasticsearch         | url, index name, document id              |
| fsspec                | protocol, remote file path                |
| google drive          | drive id, file id                         |
| gcs (from fsspec)     | protocol, remote file path                |
| jira                  | base url, issue key                       |
| onedrive              | user principal name, server relative path |
| outlook               | message id, user email                    |
| s3 (from fsspec)      | protocol, remote file path                |
| sharepoint            | server path, site url                     |
| wikipedia             | page title, page url                      |

# Examples

Source: https://docs.unstructured.io/api-reference/partition/examples

This page provides some examples of accessing the Unstructured Partition Endpoint via different methods.

To use these examples, you'll first need to set an environment variable named `UNSTRUCTURED_API_KEY`, representing your Unstructured API key. [Get your API key](/api-reference/partition/overview).
For the POST and Unstructured JavaScript/TypeScript SDK examples, you'll also need to set an environment variable named `UNSTRUCTURED_API_URL` to the value `https://api.unstructuredapp.io/general/v0/general`.

For the Unstructured Python SDK, you do not need to set an environment variable named `UNSTRUCTURED_API_URL`, as the Python SDK uses the API URL of `https://api.unstructuredapp.io/general/v0/general` by default. (The Unstructured JavaScript/TypeScript SDK does not have this feature yet; you must always specify the API URL.)

### Changing partition strategy for a PDF

Here's how you can modify the partition strategy for a PDF file and select an alternative model to use with the Unstructured API.

```bash POST
curl -X 'POST' $UNSTRUCTURED_API_URL \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "unstructured-api-key: $UNSTRUCTURED_API_KEY" \
  -F 'files=@sample-docs/layout-parser-paper.pdf' \
  -F 'strategy=vlm' \
  -F 'vlm_model_provider=openai' \
  -F 'vlm_model=gpt-4o'
```

```python Python
import asyncio
import os
import json
import unstructured_client
from unstructured_client.models import shared

client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")
)

async def call_api(filename, input_dir, output_dir):
    req = {
        "partition_parameters": {
            "files": {
                "content": open(filename, "rb"),
                "file_name": os.path.basename(filename),
            },
            "strategy": shared.Strategy.VLM,
            "vlm_model": "gpt-4o",
            "vlm_model_provider": "openai",
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15
        }
    }

    try:
        res = await client.general.partition_async(
            request=req
        )

        element_dicts = [element for element in res.elements]
        json_elements = json.dumps(element_dicts, indent=2)

        # Create the output directory structure.
        relative_path = os.path.relpath(os.path.dirname(filename), input_dir)
        output_subdir = os.path.join(output_dir, relative_path)
        os.makedirs(output_subdir, exist_ok=True)

        # Write the output file.
        output_filename = os.path.join(output_subdir, os.path.basename(filename) + ".json")

        with open(output_filename, "w") as file:
            file.write(json_elements)
    except Exception as e:
        print(f"Error processing {filename}: {e}")

async def process_files(input_directory, output_directory):
    tasks = []

    for root, _, files in os.walk(input_directory):
        for file in files:
            if not file.endswith('.json'):
                full_path = os.path.join(root, file)
                tasks.append(call_api(full_path, input_directory, output_directory))

    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(process_files(
        input_directory=os.getenv("LOCAL_FILE_INPUT_DIR"),
        output_directory=os.getenv("LOCAL_FILE_OUTPUT_DIR")
    ))
```

```typescript TypeScript
import { UnstructuredClient } from "unstructured-client";
import * as fs from "fs";
import * as path from "path";
import { Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/sdk/models/operations";

// Send all files in the source path to Unstructured for processing.
// Send the processed data to the destination path.
function processFiles(
  client: UnstructuredClient,
  sourcePath: string,
  destinationPath: string
): void {
  // If an output directory does not exist for the corresponding input
  // directory, then create it.
  if (!fs.existsSync(destinationPath)) {
    fs.mkdirSync(destinationPath, { recursive: true });
  }

  // Get all folders and files at the current level of the input directory.
  const items = fs.readdirSync(sourcePath);

  // For each folder and file in the input directory...
  for (const item of items) {
    const inputPath = path.join(sourcePath, item);
    const outputPath = path.join(destinationPath, item)

    // If it's a folder, call this function recursively.
    if (fs.statSync(inputPath).isDirectory()) {
      processFiles(client, inputPath, outputPath);
    } else {
      // If it's a file, send it to Unstructured for processing.
const data = fs.readFileSync(inputPath); client.general.partition({ partitionParameters: { files: { content: data, fileName: inputPath }, strategy: Strategy.HiRes, hiResModelName: "layout_v1.1.0", splitPdfPage: true, splitPdfConcurrencyLevel: 15, splitPdfAllowFailed: true } }).then((res: PartitionResponse) => { // If successfully processed, write the processed data to // the destination directory. if (res.statusCode == 200) { const jsonElements = JSON.stringify(res, null, 2) fs.writeFileSync(outputPath + ".json", jsonElements) } }).catch((e) => { if (e.statusCode) { console.log(e.statusCode); console.log(e.body); } else { console.log(e); } }); } } } const client = new UnstructuredClient({ security: { apiKeyAuth: process.env.UNSTRUCTURED_API_KEY }, serverURL: process.env.UNSTRUCTURED_API_URL }); processFiles( client, process.env.LOCAL_FILE_INPUT_DIR, process.env.LOCAL_FILE_OUTPUT_DIR ); ``` If you have a local deployment of the Unstructured API, you can use other supported models, such as `yolox`. ### Specifying the language of a document for better OCR results For better OCR results, you can specify what languages your document is in using the `languages` parameter. [View the list of available languages](https://github.com/tesseract-ocr/tessdata). 
```bash POST
curl -X 'POST' $UNSTRUCTURED_API_URL \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "unstructured-api-key: $UNSTRUCTURED_API_KEY" \
  -F 'files=@sample-docs/korean.png' \
  -F 'strategy=vlm' \
  -F 'vlm_model_provider=openai' \
  -F 'vlm_model=gpt-4o' \
  -F 'languages=kor'
```

```python Python
import asyncio
import os
import json
import unstructured_client
from unstructured_client.models import shared

client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")
)

async def call_api(filename, input_dir, output_dir):
    req = {
        "partition_parameters": {
            "files": {
                "content": open(filename, "rb"),
                "file_name": os.path.basename(filename),
            },
            "strategy": shared.Strategy.OCR_ONLY,
            "languages": ["kor"],
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15
        }
    }

    try:
        res = await client.general.partition_async(
            request=req
        )

        element_dicts = [element for element in res.elements]
        json_elements = json.dumps(element_dicts, indent=2)

        # Create the output directory structure.
        relative_path = os.path.relpath(os.path.dirname(filename), input_dir)
        output_subdir = os.path.join(output_dir, relative_path)
        os.makedirs(output_subdir, exist_ok=True)

        # Write the output file.
        output_filename = os.path.join(output_subdir, os.path.basename(filename) + ".json")

        with open(output_filename, "w") as file:
            file.write(json_elements)
    except Exception as e:
        print(f"Error processing {filename}: {e}")

async def process_files(input_directory, output_directory):
    tasks = []

    for root, _, files in os.walk(input_directory):
        for file in files:
            if not file.endswith('.json'):
                full_path = os.path.join(root, file)
                tasks.append(call_api(full_path, input_directory, output_directory))

    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(process_files(
        input_directory=os.getenv("LOCAL_FILE_INPUT_DIR"),
        output_directory=os.getenv("LOCAL_FILE_OUTPUT_DIR")
    ))
```

```typescript TypeScript
import { UnstructuredClient } from "unstructured-client";
import * as fs from "fs";
import * as path from "path";
import { Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/sdk/models/operations";

// Send all files in the source path to Unstructured for processing.
// Send the processed data to the destination path.
function processFiles(
  client: UnstructuredClient,
  sourcePath: string,
  destinationPath: string
): void {
  // If an output directory does not exist for the corresponding input
  // directory, then create it.
  if (!fs.existsSync(destinationPath)) {
    fs.mkdirSync(destinationPath, { recursive: true });
  }

  // Get all folders and files at the current level of the input directory.
  const items = fs.readdirSync(sourcePath);

  // For each folder and file in the input directory...
  for (const item of items) {
    const inputPath = path.join(sourcePath, item);
    const outputPath = path.join(destinationPath, item)

    // If it's a folder, call this function recursively.
    if (fs.statSync(inputPath).isDirectory()) {
      processFiles(client, inputPath, outputPath);
    } else {
      // If it's a file, send it to Unstructured for processing.
const data = fs.readFileSync(inputPath); client.general.partition({ partitionParameters: { files: { content: data, fileName: inputPath }, strategy: Strategy.OcrOnly, languages: ["kor"], splitPdfPage: true, splitPdfConcurrencyLevel: 15, splitPdfAllowFailed: true } }).then((res: PartitionResponse) => { // If successfully processed, write the processed data to // the destination directory. if (res.statusCode == 200) { const jsonElements = JSON.stringify(res, null, 2) fs.writeFileSync(outputPath + ".json", jsonElements) } }).catch((e) => { if (e.statusCode) { console.log(e.statusCode); console.log(e.body); } else { console.log(e); } }); } } } const client = new UnstructuredClient({ security: { apiKeyAuth: process.env.UNSTRUCTURED_API_KEY }, serverURL: process.env.UNSTRUCTURED_API_URL }); processFiles( client, process.env.LOCAL_FILE_INPUT_DIR, process.env.LOCAL_FILE_OUTPUT_DIR ); ``` ### Saving bounding box coordinates When elements are extracted from PDFs or images, it may be useful to get their bounding boxes as well. Set the `coordinates` parameter to `true` to add this field to the elements in the response. 
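What comes back then includes a `coordinates` object in each element's `metadata`. The following sketch shows the general shape of that object; the element text and all numeric values here are illustrative, not actual output:

```json
{
  "type": "NarrativeText",
  "element_id": "3800f7ff047e97cbf0a5f7df7ad52c80",
  "text": "Example paragraph text from the page.",
  "metadata": {
    "coordinates": {
      "points": [[110.0, 90.0], [110.0, 250.0], [480.0, 250.0], [480.0, 90.0]],
      "system": "PixelSpace",
      "layout_width": 1700,
      "layout_height": 2200
    }
  }
}
```

The `points` list traces the corners of the element's bounding box in the coordinate `system` given, relative to the page's `layout_width` and `layout_height`.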
```bash POST
curl -X 'POST' $UNSTRUCTURED_API_URL \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "unstructured-api-key: $UNSTRUCTURED_API_KEY" \
  -F 'files=@sample-docs/layout-parser-paper.pdf' \
  -F 'coordinates=true' \
  -F 'strategy=hi_res'
```

```python Python
import asyncio
import os
import json
import unstructured_client
from unstructured_client.models import shared

client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")
)

async def call_api(filename, input_dir, output_dir):
    req = {
        "partition_parameters": {
            "files": {
                "content": open(filename, "rb"),
                "file_name": os.path.basename(filename),
            },
            "strategy": shared.Strategy.HI_RES,
            "coordinates": True,
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15
        }
    }

    try:
        res = await client.general.partition_async(
            request=req
        )

        element_dicts = [element for element in res.elements]
        json_elements = json.dumps(element_dicts, indent=2)

        # Create the output directory structure.
        relative_path = os.path.relpath(os.path.dirname(filename), input_dir)
        output_subdir = os.path.join(output_dir, relative_path)
        os.makedirs(output_subdir, exist_ok=True)

        # Write the output file.
        output_filename = os.path.join(output_subdir, os.path.basename(filename) + ".json")

        with open(output_filename, "w") as file:
            file.write(json_elements)
    except Exception as e:
        print(f"Error processing {filename}: {e}")

async def process_files(input_directory, output_directory):
    tasks = []

    for root, _, files in os.walk(input_directory):
        for file in files:
            if not file.endswith('.json'):
                full_path = os.path.join(root, file)
                tasks.append(call_api(full_path, input_directory, output_directory))

    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(process_files(
        input_directory=os.getenv("LOCAL_FILE_INPUT_DIR"),
        output_directory=os.getenv("LOCAL_FILE_OUTPUT_DIR")
    ))
```

```typescript TypeScript
import { UnstructuredClient } from "unstructured-client";
import * as fs from "fs";
import * as path from "path";
import { Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/sdk/models/operations";

// Send all files in the source path to Unstructured for processing.
// Send the processed data to the destination path.
function processFiles(
  client: UnstructuredClient,
  sourcePath: string,
  destinationPath: string
): void {
  // If an output directory does not exist for the corresponding input
  // directory, then create it.
  if (!fs.existsSync(destinationPath)) {
    fs.mkdirSync(destinationPath, { recursive: true });
  }

  // Get all folders and files at the current level of the input directory.
  const items = fs.readdirSync(sourcePath);

  // For each folder and file in the input directory...
  for (const item of items) {
    const inputPath = path.join(sourcePath, item);
    const outputPath = path.join(destinationPath, item)

    // If it's a folder, call this function recursively.
    if (fs.statSync(inputPath).isDirectory()) {
      processFiles(client, inputPath, outputPath);
    } else {
      // If it's a file, send it to Unstructured for processing.
const data = fs.readFileSync(inputPath); client.general.partition({ partitionParameters: { files: { content: data, fileName: inputPath }, strategy: Strategy.HiRes, coordinates: true, splitPdfPage: true, splitPdfConcurrencyLevel: 15, splitPdfAllowFailed: true } }).then((res: PartitionResponse) => { // If successfully processed, write the processed data to // the destination directory. if (res.statusCode == 200) { const jsonElements = JSON.stringify(res, null, 2) fs.writeFileSync(outputPath + ".json", jsonElements) } }).catch((e) => { if (e.statusCode) { console.log(e.statusCode); console.log(e.body); } else { console.log(e); } }); } } } const client = new UnstructuredClient({ security: { apiKeyAuth: process.env.UNSTRUCTURED_API_KEY }, serverURL: process.env.UNSTRUCTURED_API_URL }); processFiles( client, process.env.LOCAL_FILE_INPUT_DIR, process.env.LOCAL_FILE_OUTPUT_DIR ); ``` ### Returning unique element IDs By default, the element ID is a SHA-256 hash of the element text. This is to ensure that the ID is deterministic. One downside is that the ID is not guaranteed to be unique. Different elements with the same text will have the same ID, and there could also be hash collisions. To use UUIDs in the output instead, set `unique_element_ids=true`. Note: this means that the element IDs will be random, so with every partition of the same file, you will get different IDs. This can be helpful if you'd like to use the IDs as a primary key in a database, for example. 
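The difference between the two ID schemes is easy to see with Python's standard library. This is a sketch of the concept only, not the service's exact hashing code:

```python
import hashlib
import uuid

text = "An example element."

# Deterministic: hashing the same text always yields the same ID, so
# re-partitioning a file reproduces the IDs -- but elements with
# identical text collide.
id_a = hashlib.sha256(text.encode("utf-8")).hexdigest()
id_b = hashlib.sha256(text.encode("utf-8")).hexdigest()
print(id_a == id_b)  # True

# Random: UUIDs are unique across runs, so every partition of the same
# file produces different IDs.
uuid_a = str(uuid.uuid4())
uuid_b = str(uuid.uuid4())
print(uuid_a == uuid_b)  # False
```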
```bash POST
curl -X 'POST' $UNSTRUCTURED_API_URL \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "unstructured-api-key: $UNSTRUCTURED_API_KEY" \
  -F 'files=@sample-docs/layout-parser-paper-fast.pdf' \
  -F 'unique_element_ids=true' \
  -F 'strategy=vlm' \
  -F 'vlm_model_provider=openai' \
  -F 'vlm_model=gpt-4o'
```

```python Python
import asyncio
import os
import json
import unstructured_client
from unstructured_client.models import shared

client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")
)

async def call_api(filename, input_dir, output_dir):
    req = {
        "partition_parameters": {
            "files": {
                "content": open(filename, "rb"),
                "file_name": os.path.basename(filename),
            },
            "strategy": shared.Strategy.VLM,
            "vlm_model": "gpt-4o",
            "vlm_model_provider": "openai",
            "unique_element_ids": True,
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15
        }
    }

    try:
        res = await client.general.partition_async(
            request=req
        )

        element_dicts = [element for element in res.elements]
        json_elements = json.dumps(element_dicts, indent=2)

        # Create the output directory structure.
        relative_path = os.path.relpath(os.path.dirname(filename), input_dir)
        output_subdir = os.path.join(output_dir, relative_path)
        os.makedirs(output_subdir, exist_ok=True)

        # Write the output file.
        output_filename = os.path.join(output_subdir, os.path.basename(filename) + ".json")

        with open(output_filename, "w") as file:
            file.write(json_elements)
    except Exception as e:
        print(f"Error processing {filename}: {e}")

async def process_files(input_directory, output_directory):
    tasks = []

    for root, _, files in os.walk(input_directory):
        for file in files:
            if not file.endswith('.json'):
                full_path = os.path.join(root, file)
                tasks.append(call_api(full_path, input_directory, output_directory))

    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(process_files(
        input_directory=os.getenv("LOCAL_FILE_INPUT_DIR"),
        output_directory=os.getenv("LOCAL_FILE_OUTPUT_DIR")
    ))
```

```typescript TypeScript
import { UnstructuredClient } from "unstructured-client";
import * as fs from "fs";
import * as path from "path";
import { Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/sdk/models/operations";

// Send all files in the source path to Unstructured for processing.
// Send the processed data to the destination path.
function processFiles(
  client: UnstructuredClient,
  sourcePath: string,
  destinationPath: string
): void {
  // If an output directory does not exist for the corresponding input
  // directory, then create it.
  if (!fs.existsSync(destinationPath)) {
    fs.mkdirSync(destinationPath, { recursive: true });
  }

  // Get all folders and files at the current level of the input directory.
  const items = fs.readdirSync(sourcePath);

  // For each folder and file in the input directory...
  for (const item of items) {
    const inputPath = path.join(sourcePath, item);
    const outputPath = path.join(destinationPath, item)

    // If it's a folder, call this function recursively.
    if (fs.statSync(inputPath).isDirectory()) {
      processFiles(client, inputPath, outputPath);
    } else {
      // If it's a file, send it to Unstructured for processing.
const data = fs.readFileSync(inputPath); client.general.partition({ partitionParameters: { files: { content: data, fileName: inputPath }, uniqueElementIds: true, strategy: Strategy.HiRes, splitPdfPage: true, splitPdfConcurrencyLevel: 15, splitPdfAllowFailed: true } }).then((res: PartitionResponse) => { // If successfully processed, write the processed data to // the destination directory. if (res.statusCode == 200) { const jsonElements = JSON.stringify(res, null, 2) fs.writeFileSync(outputPath + ".json", jsonElements) } }).catch((e) => { if (e.statusCode) { console.log(e.statusCode); console.log(e.body); } else { console.log(e); } }); } } } const client = new UnstructuredClient({ security: { apiKeyAuth: process.env.UNSTRUCTURED_API_KEY }, serverURL: process.env.UNSTRUCTURED_API_URL }); processFiles( client, process.env.LOCAL_FILE_INPUT_DIR, process.env.LOCAL_FILE_OUTPUT_DIR ); ``` ### Adding the chunking step after partitioning You can combine partitioning and subsequent chunking in a single request by setting the `chunking_strategy` parameter. By default, the `chunking_strategy` is set to `None`, and no chunking is performed. [//]: # "TODO: add a link to the concepts section about chunking strategies. 
Need to create the shared Concepts section first"

```bash POST
curl -X 'POST' $UNSTRUCTURED_API_URL \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H "unstructured-api-key: $UNSTRUCTURED_API_KEY" \
  -F 'files=@sample-docs/layout-parser-paper-fast.pdf' \
  -F 'chunking_strategy=by_title' \
  -F 'max_characters=1024' \
  -F 'strategy=vlm' \
  -F 'vlm_model_provider=openai' \
  -F 'vlm_model=gpt-4o'
```

```python Python
import asyncio
import os
import json
import unstructured_client
from unstructured_client.models import shared

client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")
)

async def call_api(filename, input_dir, output_dir):
    req = {
        "partition_parameters": {
            "files": {
                "content": open(filename, "rb"),
                "file_name": os.path.basename(filename),
            },
            "chunking_strategy": "by_title",
            "max_characters": 1024,
            "strategy": shared.Strategy.VLM,
            "vlm_model": "gpt-4o",
            "vlm_model_provider": "openai",
            "split_pdf_page": True,
            "split_pdf_allow_failed": True,
            "split_pdf_concurrency_level": 15
        }
    }

    try:
        res = await client.general.partition_async(
            request=req
        )

        element_dicts = [element for element in res.elements]
        json_elements = json.dumps(element_dicts, indent=2)

        # Create the output directory structure.
        relative_path = os.path.relpath(os.path.dirname(filename), input_dir)
        output_subdir = os.path.join(output_dir, relative_path)
        os.makedirs(output_subdir, exist_ok=True)

        # Write the output file.
        output_filename = os.path.join(output_subdir, os.path.basename(filename) + ".json")

        with open(output_filename, "w") as file:
            file.write(json_elements)
    except Exception as e:
        print(f"Error processing {filename}: {e}")

async def process_files(input_directory, output_directory):
    tasks = []

    for root, _, files in os.walk(input_directory):
        for file in files:
            if not file.endswith('.json'):
                full_path = os.path.join(root, file)
                tasks.append(call_api(full_path, input_directory, output_directory))

    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(process_files(
        input_directory=os.getenv("LOCAL_FILE_INPUT_DIR"),
        output_directory=os.getenv("LOCAL_FILE_OUTPUT_DIR")
    ))
```

```typescript TypeScript
import { UnstructuredClient } from "unstructured-client";
import * as fs from "fs";
import * as path from "path";
import { ChunkingStrategy, Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/sdk/models/operations";

// Send all files in the source path to Unstructured for processing.
// Send the processed data to the destination path.
function processFiles(
  client: UnstructuredClient,
  sourcePath: string,
  destinationPath: string
): void {
  // If an output directory does not exist for the corresponding input
  // directory, then create it.
  if (!fs.existsSync(destinationPath)) {
    fs.mkdirSync(destinationPath, { recursive: true });
  }

  // Get all folders and files at the current level of the input directory.
  const items = fs.readdirSync(sourcePath);

  // For each folder and file in the input directory...
  for (const item of items) {
    const inputPath = path.join(sourcePath, item);
    const outputPath = path.join(destinationPath, item)

    // If it's a folder, call this function recursively.
    if (fs.statSync(inputPath).isDirectory()) {
      processFiles(client, inputPath, outputPath);
    } else {
      // If it's a file, send it to Unstructured for processing.
const data = fs.readFileSync(inputPath); client.general.partition({ partitionParameters: { files: { content: data, fileName: inputPath }, strategy: Strategy.HiRes, chunkingStrategy: ChunkingStrategy.ByTitle, maxCharacters: 1024, splitPdfPage: true, splitPdfConcurrencyLevel: 15, splitPdfAllowFailed: true } }).then((res: PartitionResponse) => { // If successfully processed, write the processed data to // the destination directory. if (res.statusCode == 200) { const jsonElements = JSON.stringify(res, null, 2) fs.writeFileSync(outputPath + ".json", jsonElements) } }).catch((e) => { if (e.statusCode) { console.log(e.statusCode); console.log(e.body); } else { console.log(e); } }); } } } const client = new UnstructuredClient({ security: { apiKeyAuth: process.env.UNSTRUCTURED_API_KEY }, serverURL: process.env.UNSTRUCTURED_API_URL }); processFiles( client, process.env.LOCAL_FILE_INPUT_DIR, process.env.LOCAL_FILE_OUTPUT_DIR ); ``` # Extract images and tables from documents Source: https://docs.unstructured.io/api-reference/partition/extract-image-block-types ## Task You want to get, decode, and show elements, such as images and tables, that are embedded in a PDF document. ## Approach Extract the Base64-encoded representation of specific elements, such as images and tables, in the document. For each of these extracted elements, decode the Base64-encoded representation of the element into its original visual representation and then show it. ## To run this example You will need a document that is one of the document types supported by the `extract_image_block_types` argument. See the `extract_image_block_types` entry in [API Parameters](/api-reference/partition/api-parameters). This example uses a PDF file with embedded images and tables. ## Code For the [Unstructured Python SDK](/api-reference/partition/sdk-python), you'll need: These environment variables: * `UNSTRUCTURED_API_KEY` - Your Unstructured API key value. * `UNSTRUCTURED_API_URL` - Your Unstructured API URL. 
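For example, in a POSIX shell these variables might be set as follows. The key value here is a placeholder for your actual API key:

```shell
export UNSTRUCTURED_API_KEY="<your-unstructured-api-key>"
export UNSTRUCTURED_API_URL="https://api.unstructuredapp.io/general/v0/general"
```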
```python Python SDK
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

import os
import base64

from PIL import Image
import io

if __name__ == "__main__":
    client = UnstructuredClient(
        api_key_auth=os.getenv("UNSTRUCTURED_API_KEY")
    )

    # Source: https://github.com/Unstructured-IO/unstructured/blob/main/example-docs/embedded-images-tables.pdf
    # Where to get the input file and store the processed data, relative to this .py file.
    local_input_filepath = "local-ingest-input-pdf/embedded-images-tables.pdf"
    local_output_filepath = "local-ingest-output/embedded-images-tables.json"

    with open(local_input_filepath, "rb") as f:
        files = shared.Files(
            content=f.read(),
            file_name=local_input_filepath
        )

    request = operations.PartitionRequest(
        shared.PartitionParameters(
            files=files,
            strategy=shared.Strategy.HI_RES,
            split_pdf_page=True,
            split_pdf_allow_failed=True,
            split_pdf_concurrency_level=15,
            # Extract the Base64-encoded representation of each
            # processed "Image" and "Table" element. Extract each into
            # an "image_base64" object, as a child of the
            # "metadata" object, for that element in the result.
            # Element type names, such as "Image" and "Table" here,
            # are case-insensitive.
            # Any available Unstructured element type is allowed.
            extract_image_block_types=["Image", "Table"]
        )
    )

    try:
        # Use the synchronous partition call here; awaiting partition_async
        # would require an async function to run it in.
        result = client.general.partition(
            request=request
        )

        for element in result.elements:
            if "image_base64" in element["metadata"]:
                # Decode the Base64-encoded representation of the
                # processed "Image" or "Table" element into its original
                # visual representation, and then show it.
image_data = base64.b64decode(element["metadata"]["image_base64"]) image = Image.open(io.BytesIO(image_data)) image.show() except Exception as e: print(e) ``` ## See also * [Extract text as HTML](/api-reference/partition/text-as-html) * [Table extraction from PDF](/examplecode/codesamples/apioss/table-extraction-from-pdf) # Generate a JSON schema for a file Source: https://docs.unstructured.io/api-reference/partition/generate-schema ## Task You want to generate a schema for a JSON file that Unstructured produces, so that you can validate, test, and document related JSON files across your systems. ## Approach Use a Python package such as [genson](https://pypi.org/project/genson/) to generate schemas for your JSON files. The `genson` package is not owned or supported by Unstructured. For questions and requests, see the [Issues](https://github.com/wolverdude/genson/issues) tab of the `genson` repository in GitHub. ## Generate a schema from the terminal Use [pip](https://pip.pypa.io/en/stable/installation/) to install the [genson](https://pypi.org/project/genson/) package. ```bash pip install genson ``` By default, `genson` generates the JSON schema as a single string without any line breaks or indented whitespace. To pretty-print the schema that `genson` produces, install the [jq](https://jqlang.github.io/jq/) utility. The `jq` utility is not owned or supported by Unstructured. For questions and requests, see the [Issues](https://github.com/jqlang/jq/issues) tab of the `jq` repository in GitHub. 1. Run the `genson` command, specifying the path to the input (source) JSON file, and the path to the output (target) JSON schema file to be generated. Use `jq` to pretty-print the schema's content into the file to be generated. ```bash genson "/path/to/input/file.json" | jq '.' > "/path/to/output/schema.json" ``` 2. You can find the generated JSON schema file in the output path that you specified. 
## Generate a schema from Python code In your Python project, install the [genson](https://pypi.org/project/genson/) package. ```bash pip install genson ``` 1. Set the following local environment variables: * Set `LOCAL_FILE_INPUT_PATH` to the local path to the input (source) JSON file. * Set `LOCAL_FILE_OUTPUT_PATH` to the local path to the output (target) JSON schema file to be generated. 2. Add the following Python code file to your project: ```python import os, json from genson import SchemaBuilder def json_schema_from_file( input_file_path: str, output_schema_path: str ) -> None: try: with open(input_file_path, "r") as file: json_data = json.load(file) builder = SchemaBuilder() builder.add_object(json_data) schema = builder.to_schema() try: with open(output_schema_path, "w") as schema_file: json.dump(schema, schema_file, indent=2) except IOError as e: raise IOError(f"Error writing to output file: {e}") print(f"JSON schema successfully generated and saved to '{output_schema_path}'.") except FileNotFoundError: print(f"Error: Input file '{input_file_path}' not found.") except IOError as e: print(f"I/O error occurred: {e}") except Exception as e: print(f"An unexpected error occurred: {e}") if __name__ == "__main__": json_schema_from_file( input_file_path=os.getenv("LOCAL_FILE_INPUT_PATH"), output_schema_path=os.getenv("LOCAL_FILE_OUTPUT_PATH") ) ``` 3. Run the Python code file. 4. Check the path specified by `LOCAL_FILE_OUTPUT_PATH` for the generated JSON schema file. # Get chunked elements Source: https://docs.unstructured.io/api-reference/partition/get-chunked-elements ## Task You want to get, and print or save, the contents of elements that have been chunked. 
## Approach Chunked elements are typically represented in this format: ```json { "type": "CompositeElement", "element_id": "3800f7ff047e97cbf0a5f7df7ad52c80", "text": "4,5 Exception noted in \u201cSpecial Considerations for Promotional Labeling and Advertising Material\u201d\n\n3", "metadata": { "filetype": "application/pdf", "languages": ["eng"], "page_number": 6, "orig_elements": "eJy ... Full content omitted for brevity ... x8=", "filename": "Portable-Document-Format-Specifications.pdf" } } ``` ```json { "type": "Table", "element_id": "758127b42c51b93b59abf08640d1ccab", "text": "Gels and karyotypes High pressure liquid chromatography 600 dpi (8 bit grayscale depth) 300 dpi", "metadata": { "text_as_html": "
<table><tr><td>Gels and karyotypes</td><td>600 dpi (8 bit grayscale depth)</td></tr><tr><td>High pressure liquid chromatography</td><td>300 dpi</td></tr></table>
", "filetype": "application/pdf", "languages": ["eng"], "page_number": 8, "orig_elements": "eJy ... Full content omitted for brevity ... MnD", "filename": "Portable-Document-Format-Specifications.pdf" } ``` To get the element's chunked content, extract the contents of the element's `text` field. The chunked content might not contain all of its associated content. To get all of the elements that were used to derive this chunked content, extract the contents of the element's `orig_elements` field, which is nested inside of its parent `metadata` field. The contents of the `orig_elements` field is in compressed Base64 gzipped format. To get the content, Base64-decode the bytes, decompress them, and then decode them using UTF-8. ## To run this example You will need to chunk a document during processing. This example uses a PDF file chunked into 200- to 300-character elements. ## Code For the [Unstructured Python SDK](/api-reference/partition/sdk-python), you'll need: These environment variables: * `UNSTRUCTURED_API_KEY` - Your Unstructured API key value. * `UNSTRUCTURED_API_URL` - Your Unstructured API URL. ```python from unstructured_client import UnstructuredClient from unstructured_client.models import operations, shared from unstructured.staging.base import elements_from_dicts, elements_to_json import os, base64, zlib from typing import List, Dict, Any # Extract the contents of an orig_elements field. 
def extract_orig_elements(orig_elements): decoded_orig_elements = base64.b64decode(orig_elements) decompressed_orig_elements = zlib.decompress(decoded_orig_elements) return decompressed_orig_elements.decode('utf-8') # Source file: https://www.fda.gov/files/drugs/published/Portable-Document-Format-Specifications.pdf input_filepath = "local-ingest-input-pdf/Portable-Document-Format-Specifications.pdf" output_filepath = "local-ingest-output-json/Portable-Document-Format-Specifications.json" client = UnstructuredClient( api_key_auth=os.getenv("UNSTRUCTURED_API_KEY") ) with open(input_filepath, "rb") as f: files = shared.Files( content=f.read(), file_name=input_filepath ) # Chunk the document with a basic chunking strategy. # Create chunks that are at least around 200 characters (soft limit) # but never more than 300 characters (hard maximum). req = operations.PartitionRequest( shared.PartitionParameters( files=files, strategy=shared.Strategy.VLM, vlm_model="gpt-4o", vlm_model_provider="openai", split_pdf_page=True, split_pdf_allow_failed=True, split_pdf_concurrency_level=15, chunking_strategy="basic", new_after_n_chars=200, max_characters=300 ) ) try: res = await client.general.partition_async( request=req ) # Create a dictionary that will hold only # a transposed version of the returned elements. # For instance, we just want to capture each element's ID, # the chunk's text, and the chunk's associated elements in context. orig_elements_dict: List[Dict[str, Any]] = [] for element in res.elements: # For each chunk that has an "orig_elements" field... if "orig_elements" in element["metadata"]: # ...get the chunk's associated elements in context... orig_elements = extract_orig_elements(element["metadata"]["orig_elements"]) # ...and then transpose it and other associated fields into a separate dictionary. 
orig_elements_dict.append({ "element_id": element["element_id"], "text": element["text"], "orig_elements": json.loads(orig_elements) }) # Convert the elements into a JSON object. orig_elements_json = json.dumps(orig_elements_dict, indent=2) # Write the JSON to a file. with open(output_filepath, "w") as file: file.write(orig_elements_json) except Exception as e: print(e) ``` ## See also * [Recovering chunk elements](/open-source/core-functionality/chunking#recovering-chunk-elements) * [Chunking strategies](/api-reference/partition/chunking) # Get element contents Source: https://docs.unstructured.io/api-reference/partition/get-elements ## Task You want to get, manipulate, and print or save, the contents of the [document elements and metadata](/api-reference/partition/document-elements) from the processed data that Unstructured returns. ## Approach Each element in the document elements contains fields for that element's type, its ID, the extracted text, and associated metadata. The programmatic approach you take to get these document elements will depend on which SDK you use: For the [Unstructured Python SDK](/api-reference/partition/sdk-python), calling an `UnstructuredClient` object's `general.partition_async` method returns a `PartitionResponse` object. This `PartitionResponse` object's `elements` variable contains a list of key-value dictionaries (`List[Dict[str, Any]]`). For example: ```python Python # ... res = await client.general.partition_async(request=req) # Do something with the elements, for example: save_elements_to_file(res.elements) # ... ``` You can use standard Python [list operations](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists) on this list. You can also use standard Python [looping techniques](https://docs.python.org/3/tutorial/datastructures.html#looping-techniques) on this list to access each element in this list. 
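For instance, here is a minimal, self-contained sketch of such list operations, using hard-coded element dictionaries in place of a real `res.elements` response (the field values below are made up for illustration):

```python
# Hypothetical element dictionaries, shaped like those in res.elements;
# a real response would come from the Partition Endpoint instead.
elements = [
    {"type": "Title", "element_id": "a1", "text": "Introduction", "metadata": {"filename": "report.pdf"}},
    {"type": "NarrativeText", "element_id": "b2", "text": "Some body text.", "metadata": {"filename": "report.pdf"}},
    {"type": "Title", "element_id": "c3", "text": "Conclusion", "metadata": {"filename": "report.pdf"}},
]

# Filter with a list comprehension, and join the extracted text.
titles = [e for e in elements if e["type"] == "Title"]
combined_text = " ".join(e["text"] for e in elements)
```

Here, `titles` contains the two `Title` elements, and `combined_text` joins the `text` fields of all three elements.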
To work with an individual element's contents, you can use standard [dictionary operations](https://docs.python.org/3/library/stdtypes.html#dict) on the element. For example:

```python Python
# ...
res = await client.general.partition_async(request=req)

for element in res.elements:
    # Do something with each element, for example:
    save_element_to_database(f"{element['element_id']}")
    save_element_to_database(f"{element['text']}")
    save_element_to_database(f"{element['metadata']['filename']}\n")
# ...
```

To serialize this list as JSON, you can:

1. Use the `elements_from_dicts` function to convert the list of key-value dictionaries (`Iterable[Dict[str, Any]]`) into a list of elements (`Iterable[Element]`).
2. Use the `elements_to_json` function to convert the list of elements into a JSON-formatted string and then print or save that string.

For example:

```python Python
from unstructured.staging.base import elements_from_dicts, elements_to_json

# ...
res = await client.general.partition_async(request=req)

dict_elements = elements_from_dicts(
    element_dicts=res.elements
)

elements_to_json(
    elements=dict_elements,
    indent=2,
    filename=output_filepath
)
# ...
```

For the [Unstructured JavaScript/TypeScript SDK](/api-reference/partition/sdk-jsts), calling an `UnstructuredClient` object's `general.partition` method returns a `Promise` object. The resolved `PartitionResponse` object's `elements` property contains an `Array` of string-keyed objects (`{ [k: string]: any; }[]`). For example:

```typescript TypeScript
// ...
client.general.partition({
  partitionParameters: {
    files: {
      content: data,
      fileName: inputFilepath
    },
    strategy: Strategy.HiRes,
    splitPdfPage: true,
    splitPdfAllowFailed: true,
    splitPdfConcurrencyLevel: 15
  }
}).then((res) => {
  if (res.statusCode == 200) {
    // Do something with the elements, for example:
    saveElementsToFile(res.elements)
  }
})
// ...
```

You can use standard [Array operations](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array) on this array. You can also use standard `Array` techniques such as [forEach](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/forEach) to access each object in this array. For example:

```typescript TypeScript
// ...
client.general.partition({
  partitionParameters: {
    files: {
      content: data,
      fileName: inputFilepath
    },
    strategy: Strategy.HiRes,
    splitPdfPage: true,
    splitPdfAllowFailed: true,
    splitPdfConcurrencyLevel: 15
  }
}).then((res) => {
  if (res.statusCode == 200) {
    res.elements.forEach(element => {
      // Do something with each element, for example:
      saveElementToDatabase(`${element["element_id"]}`)
      saveElementToDatabase(`${element["text"]}`)
      saveElementToDatabase(`${element["metadata"]["filename"]}`)
    })
  }
})
// ...
```

To serialize this array as JSON, you can use the standard [JSON.stringify](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/stringify) function to serialize it to a JSON-formatted string and the Node.js [fs.writeFileSync](https://nodejs.org/api/fs.html#fswritefilesyncfile-data-options) function to save it as a file. For example:

```typescript TypeScript
// ...
client.general.partition({
  partitionParameters: {
    files: {
      content: data,
      fileName: inputFilepath
    },
    strategy: Strategy.HiRes,
    splitPdfPage: true,
    splitPdfAllowFailed: true,
    splitPdfConcurrencyLevel: 15
  }
}).then((res) => {
  if (res.statusCode == 200) {
    const jsonElements = JSON.stringify(res.elements, null, 2)
    fs.writeFileSync(outputFilepath, jsonElements)
  }
})
// ...
```

# Overview

Source: https://docs.unstructured.io/api-reference/partition/overview

The Unstructured Partition Endpoint, part of the [Unstructured API](/api-reference/overview), is intended for rapid prototyping of Unstructured's various partitioning strategies, with limited support for chunking.
It is designed to process local files only, one file at a time. Use the [Unstructured Workflow Endpoint](/api-reference/workflow/overview) for production-level scenarios: processing files in batches, working with files and data in remote locations, generating embeddings, applying post-transform enrichments, using the latest and highest-performing models, and getting the highest-quality results at the lowest cost.

## Get started

To call the Unstructured Partition Endpoint, you need an Unstructured account and an Unstructured API key.

If you signed up for Unstructured through the [For Enterprise](https://unstructured.io/enterprise) page, or if you are using a [self-hosted](/self-hosted/overview) deployment of Unstructured, the following information about signing up, signing in, and getting your Unstructured API key might apply differently to you. For details, contact Unstructured Sales at [sales@unstructured.io](mailto:sales@unstructured.io).

1. Sign in to your Unstructured account:

   * If you do not already have an Unstructured account, go to [https://unstructured.io/contact](https://unstructured.io/contact) and fill out the online form to indicate your interest.
   * If you already have an Unstructured account, go to [https://platform.unstructured.io](https://platform.unstructured.io) and sign in by using the email address, Google account, or GitHub account that is associated with your Unstructured account.

   The Unstructured user interface (UI) then appears, and you can start using it right away.

2. Get your Unstructured API key:

a. In the Unstructured UI, click **API Keys** on the sidebar.
b. Click **Generate API Key**.
c. Follow the on-screen instructions to finish generating the key.
d. Click the **Copy** icon next to your new key to add the key to your system's clipboard. If you lose this key, simply return and click the **Copy** icon again.
[Try the quickstart](#quickstart).

## Set up billing

If you signed up for a subscribe-and-save plan, or if you signed up through the [For Enterprise](https://unstructured.io/enterprise) page, your billing setup and terms will be different. For billing guidance, contact Unstructured Sales at [sales@unstructured.io](mailto:sales@unstructured.io).

You must set up your billing information to keep using the Unstructured Partition Endpoint:

![Unstructured account settings](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/ui/AccountSettings.png)

![Unstructured account payment method](https://mintlify.s3.us-west-1.amazonaws.com/unstructured-53/img/ui/AccountBillingPaymentMethod.png)

1. Sign in to your Unstructured account, at [https://platform.unstructured.io](https://platform.unstructured.io).
2. At the bottom of the sidebar, click your user icon, and then click **Account Settings**.
3. Click the **Billing** tab.
4. Click **Manage Payment Method**, follow the on-screen instructions to enter or update your payment details, and then click **Save card**.

Your card is billed monthly based on your usage. The **Billing** page shows a billing overview for the current month and a list of your billing invoices.

You can save money by switching from a pay-per-page plan to a subscribe-and-save plan. To do this, go to the [Unstructured Subscribe & Save](https://unstructured.io/subscribeandsave) page and complete the on-screen instructions.

We calculate a page as follows:

* For these file types, a page is a page, slide, or image: .pdf, .pptx, and .tiff.
* For .docx files that have page metadata, we calculate the number of pages based on that metadata.
* For all other file types, we calculate the number of pages as the file's size divided by 100 KB.
* For non-file data, we calculate a page as 100 KB of incoming data to be processed.
## Quickstart

This example uses the [curl](https://curl.se/) utility on your local machine to call the Unstructured Partition Endpoint. It sends a source (input) file from your local machine to the Unstructured Partition Endpoint, which then delivers the processed data to a destination (output) location, also on your local machine. Data is processed on Unstructured-hosted compute resources.

If you do not have a source file readily available, you could use, for example, a sample PDF file containing the text of the United States Constitution, available for download from [https://constitutioncenter.org/media/files/constitution.pdf](https://constitutioncenter.org/media/files/constitution.pdf).

From your terminal or Command Prompt, set the following two environment variables:

* Replace `<api-url>` with the Unstructured Partition Endpoint URL, which is `https://api.unstructuredapp.io/general/v0/general`.
* Replace `<api-key>` with your Unstructured API key, which you generated earlier on this page.

```bash
export UNSTRUCTURED_API_URL=<api-url>
export UNSTRUCTURED_API_KEY="<api-key>"
```

Run the following `curl` command, replacing `<local-file-path>` with the path to the source file on your local machine. If the source file is not a PDF file, then remove `;type=application/pdf` from the final `--form` option in this command.

```bash
curl --request 'POST' \
  "$UNSTRUCTURED_API_URL" \
  --header 'accept: application/json' \
  --header "unstructured-api-key: $UNSTRUCTURED_API_KEY" \
  --header 'content-Type: multipart/form-data' \
  --form 'content_type=string' \
  --form 'strategy=vlm' \
  --form 'vlm_model_provider=openai' \
  --form 'vlm_model=gpt-4o' \
  --form 'output_format=application/json' \
  --form 'files=@<local-file-path>;type=application/pdf'
```

After you run the `curl` command, the results are printed to your terminal or Command Prompt. The command might take several minutes to complete. By default, the JSON is printed without indenting or other whitespace.
You can pretty-print the JSON output by using utilities such as [jq](https://jqlang.org/tutorial/) in future command runs. You can also pipe the JSON output to a local file by using the `curl` option [`-o, --output <file>`](https://curl.se/docs/manpage.html#-o) in future command runs.

You can also call the Unstructured Partition Endpoint by using the [Unstructured Python SDK](/api-reference/partition/sdk-python) or the [Unstructured JavaScript/TypeScript SDK](/api-reference/partition/sdk-jsts).

# Partitioning strategies

Source: https://docs.unstructured.io/api-reference/partition/partitioning

For certain document types, such as images and PDFs, Unstructured products offer a variety of different ways to preprocess them, controlled by the `strategy` parameter.

PDF documents, for example, vary in quality and complexity. In simple cases, traditional NLP extraction techniques may be enough to extract all the text out of a document. In other cases, advanced image-to-text models are required to process a PDF.

You can think of the strategies as "rule-based" workflows (thus "fast") or "model-based" workflows (slower, because they require model inference, but you get "higher resolution", thus "hi\_res"). When choosing a partitioning strategy for your files, be mindful of the quality/speed trade-off. For example, the `fast` strategy is roughly 100x faster than leading image-to-text models.

**Available options:**

* `auto` (default strategy): The "auto" strategy will choose the partitioning strategy based on document characteristics and the function kwargs.
* `fast`: The "rule-based" strategy leverages traditional NLP extraction techniques to quickly pull all the text elements. The "fast" strategy is not recommended for image-based file types.
* `hi_res`: The "model-based" strategy identifies the layout of the document.
  The advantage of "hi\_res" is that it uses the document layout to gain additional information about document elements. We recommend using this strategy if your use case is highly sensitive to correct classification of document elements.

* `ocr_only`: Another "model-based" strategy that leverages Optical Character Recognition (OCR) to extract text from image-based files.
* `vlm`: Uses a vision language model (VLM) to extract text from these file types: `.bmp`, `.gif`, `.heic`, `.jpeg`, `.jpg`, `.pdf`, `.png`, `.tiff`, and `.webp`.

# Pipeline

Source: https://docs.unstructured.io/api-reference/partition/pipeline-1

post /general/v0/general
Description

# Process an individual file by making a direct POST request

Source: https://docs.unstructured.io/api-reference/partition/post-requests

Watch the following 4-minute video to learn how to make POST requests to the Unstructured Partition Endpoint to process individual files: