The Unstructured API on Azure is deprecated. It is no longer supported and is not being actively updated. Unstructured is now available on the Azure Marketplace as a private offering. To explore supported options for running Unstructured within your virtual private cloud (VPC), email Unstructured Sales at sales@unstructured.io.
This page is not being actively updated. It might contain out-of-date information. This page is provided for legacy reference purposes only.
Follow these steps to deploy the Unstructured API service into your Azure account.
This article describes how to create several interrelated resources in your Azure account.
Your Azure account will be charged on an ongoing basis for these resources, even if you are not actively using them.
Manually shutting down the associated Azure virtual machine when you are not using it can help reduce—but not fully eliminate—these ongoing charges.
To stop accruing all related ongoing charges, you must delete all of the associated Azure resources.
Log in to the Azure Portal
Go to https://portal.azure.com.
Access the Azure Marketplace
Go to the Unstructured Data Preprocessing - Customer Hosted API offering in the Azure Marketplace.
Start the deployment process
Configure the deployment options
On the Create a virtual machine page, click the Basics tab.
In the Project details section, select an existing Subscription, and select a Resource group from the dropdown menus, or create a new resource group by clicking Create new.
In the Instance details section, enter a name in the Virtual machine name field. Note this name, as you will need it in later steps.
Select a Region from the dropdown menu.
For Image, select Unstructured Customer Hosted API Hourly - x64 Gen2 (default).
For Size, select a VM size from the dropdown menu, or leave the default VM size selection. To learn more, see Azure VM comparisons.
In the Administrator account section, for Authentication type, select SSH public key or Password.
Enter the credential settings, depending on the authentication type.
Set up the load balancer
Click the Networking tab.
In the Networking interface section, fill out the following fields:
In the Load balancing section, fill out the following fields:
Load balancing options: Select Azure load balancer.
Select a load balancer: Click Create a load balancer and fill out the following fields in the pop-up window, or select an existing load balancer from the dropdown menu and note this name as you will need it in later steps:
Click Create.
Finalize and deploy
Click Review + create.
Wait for validation.
Click Create.
Post-deployment: additional load balancer configuration
Post-deployment: get the deployed endpoint URL
Go to your virtual machine: in the Azure portal’s Search resources, services and docs box, enter and then select Virtual machines.
Search for and open the new virtual machine that you created earlier, using the name that you entered earlier.
On the Overview tab, under Properties, note the Public IP address for the Load balancer.
The deployed endpoint URL is http://<load-balancer-public-IP-address>/general/v0/general. Note this endpoint URL, as you will need it later to call the Unstructured API.
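As a small illustration of how this endpoint URL is assembled, the following Python sketch builds it from a load balancer IP address. The helper name `build_endpoint_url` and the sample address are hypothetical; only the `http://` scheme and the `/general/v0/general` path come from this page.

```python
import ipaddress

API_PATH = "/general/v0/general"  # path given on this page


def build_endpoint_url(load_balancer_ip: str, scheme: str = "http") -> str:
    """Return the endpoint URL for a load balancer's public IP address.

    Hypothetical helper for illustration. Raises ValueError if the
    input is not a valid IPv4 or IPv6 address.
    """
    ip = ipaddress.ip_address(load_balancer_ip.strip())
    # IPv6 literals must be wrapped in brackets inside a URL.
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"{scheme}://{host}{API_PATH}"


if __name__ == "__main__":
    # 203.0.113.10 is a documentation-only example address.
    print(build_endpoint_url("203.0.113.10"))
```

Validating the address before building the URL catches copy-and-paste mistakes early, before any request is sent.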
Post-deployment: set API environment variables
Note the API environment variables in the API environment variables section, later on this page. If you need to set any of these in the Docker container on the virtual machine, do the following:
If the virtual machine is not already running from earlier, click the Start icon.
After the virtual machine starts, click the Connect icon, and then click Connect from the drop-down list.
Follow the on-screen directions for one of the available options to connect to the virtual machine and display a connected terminal.
Stop the running container in the virtual machine, so that you can restart it later with the environment variables set:
In the connected terminal, run the following command: sudo docker container ls.
Note the CONTAINER ID value for the running container.
Run the following command, replacing <CONTAINER ID> with the CONTAINER ID value:
Now run the container again, setting the environment variables at the same time:
Run the following command: sudo docker image ls.
Note the REPOSITORY and TAG values for the Docker image.
Run the following command, replacing <REPOSITORY> and <TAG> with the REPOSITORY and TAG values for the Docker image, and replacing <VAR1>=<value1>, <VAR2>=<value2> and so on with the environment variable name and value pairs:
Verify that the environment variables were set correctly:
Run the following command:
Note the CONTAINER ID value for the running container.
Run the following command, replacing <CONTAINER ID> with the CONTAINER ID value:
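To illustrate the step that runs the container again with environment variables, here is a hedged Python sketch that assembles a `docker run` argument list. It is not necessarily the exact command intended for this image: the helper name `docker_run_command` is hypothetical, and the sketch uses only the standard Docker options `-d` (detached) and `-e NAME=value`. Any port mappings or other options that the image needs are not shown and would have to be added.

```python
import shlex


def docker_run_command(repository: str, tag: str, env: dict[str, str]) -> list[str]:
    """Build a `docker run` argument list that sets environment variables.

    Hypothetical helper. Port mappings and any other options required
    by the image are intentionally left out of this sketch.
    """
    cmd = ["sudo", "docker", "run", "-d"]
    for name, value in env.items():
        # Each variable becomes its own "-e NAME=value" pair.
        cmd += ["-e", f"{name}={value}"]
    cmd.append(f"{repository}:{tag}")
    return cmd


if __name__ == "__main__":
    # Placeholder image name and value, for illustration only.
    args = docker_run_command(
        "<REPOSITORY>",
        "<TAG>",
        {"UNSTRUCTURED_PDF_HI_RES_MAX_PAGES": "300"},
    )
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(args))
```

Building the command as a list rather than a single string keeps values that contain spaces intact when the list is passed to `subprocess.run`.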
Call the Unstructured API
You can now use the running virtual machine to call the Unstructured API. For example, run one of the following, setting the following environment variables to make your code more portable:
UNSTRUCTURED_API_URL to http://, followed by your load balancer’s public IP address, followed by /general/v0/general.
LOCAL_FILE_INPUT_DIR to the path on your local machine to the files for the Unstructured API to process. If you do not have any input files available, you can download any of the ones from the example-docs folder in GitHub.
LOCAL_FILE_OUTPUT_DIR to the path on your local machine for the Unstructured API to send the processed output in JSON format.
Ingest CLI
You must first install the Unstructured Ingest CLI.
Because you are calling a private API and therefore do not need an Unstructured API key, you can omit the command-line option --api-key. Or, for better code portability, first set the environment variable UNSTRUCTURED_API_KEY to an empty string and then include the command-line option --api-key.
Ingest Python library
You must first install the Unstructured Ingest Python library.
Because you are calling a private API and therefore do not need an Unstructured API key, you can omit the parameter api_key. Or, for better code portability, first set the environment variable UNSTRUCTURED_API_KEY to an empty string and then include the parameter api_key.
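The environment variables described above can be gathered in one place before any client code runs. The following Python sketch shows one possible way to do that. The names `ClientSettings` and `load_settings` are hypothetical and are not part of the Unstructured Ingest library; the sketch only reads the variables named on this page and defaults UNSTRUCTURED_API_KEY to an empty string, as suggested for a private deployment.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping

REQUIRED = ("UNSTRUCTURED_API_URL", "LOCAL_FILE_INPUT_DIR", "LOCAL_FILE_OUTPUT_DIR")


@dataclass(frozen=True)
class ClientSettings:
    api_url: str
    api_key: str
    input_dir: Path
    output_dir: Path


def load_settings(env: Mapping[str, str] = os.environ) -> ClientSettings:
    """Read the variables named on this page; fail early if one is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return ClientSettings(
        api_url=env["UNSTRUCTURED_API_URL"],
        # A private deployment needs no key, so default to an empty string.
        api_key=env.get("UNSTRUCTURED_API_KEY", ""),
        input_dir=Path(env["LOCAL_FILE_INPUT_DIR"]),
        output_dir=Path(env["LOCAL_FILE_OUTPUT_DIR"]),
    )
```

Passing the resulting `api_key` value through unchanged means the same calling code works against both a private deployment (empty key) and a deployment that requires a key.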
Unstructured recommends the following load balancer settings, which you should set on your deployment’s load balancer soon after you finalize and deploy it.
On the load balancer’s Overview tab in the Azure portal:
On the load balancer’s Settings tab in the Azure portal:
Frontend IP configuration: Private IP
Backend pools: VMSS
Health probes:
80 or 443 (this can be any port that the backend VMs are listening on)
/healthcheck
5
Load balancing rules:
443 for HTTPS, or 80 for HTTP
443 for HTTPS, or 80 for HTTP
60
Inbound NAT rules:
443 for HTTPS, or 80 for HTTP
443 for HTTPS, or 80 for HTTP
60
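To check by hand what a health probe would see, a sketch like the following can request the /healthcheck path on a backend port. The function names are hypothetical, the mapping of port 443 to HTTPS follows the port choices listed above, and the 5-second timeout is an arbitrary choice for this example rather than a documented setting.

```python
import urllib.error
import urllib.request


def health_probe_url(host: str, port: int = 80, path: str = "/healthcheck") -> str:
    """Build the URL a health probe would request (hypothetical helper)."""
    scheme = "https" if port == 443 else "http"
    return f"{scheme}://{host}:{port}{path}"


def is_healthy(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection failures, timeouts, and HTTP error statuses all
        # count as unhealthy in this sketch.
        return False


if __name__ == "__main__":
    # Replace the placeholder address with a backend VM address.
    print(is_healthy(health_probe_url("10.0.0.4", 80)))
```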
Unstructured supports the following environment variables, which you can set in the Docker image on the virtual machine, as needed:
ALLOW_ORIGINS: CORS-allowed origins.
UNSTRUCTURED_ALLOWED_MIMETYPE: The list of allowed MIME types, if you want to limit the file types that can be processed.
UNSTRUCTURED_API_KEY: The default Unstructured API key to use.
UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB: The minimum amount of free memory in MB to allow for processing a file. If this memory is too low, the server will return a 503 error.
UNSTRUCTURED_PDF_HI_RES_MAX_PAGES: The maximum number of pages in a PDF file that the Unstructured API will not reject, if the hi_res strategy is used. The default is 300.
UNSTRUCTURED_REDIRECT_ROOT_URL: If this is set, redirect a GET request to the Unstructured API to use this URL instead.