Unstructured API services
Getting started with API services
Process individual files
Batch processing and ingestion
- Overview
- Ingest CLI
- Ingest Python library
- Ingest dependencies
- Ingest configuration
- Source connectors
- Destination connectors
How to
- Choose a partitioning strategy
- Choose a hi-res model
- Get element contents
- Process a subset of files
- Set embedding behavior
- Parse simple PDFs and HTML
- Set partitioning behavior
- Set chunking behavior
- Output unique element IDs
- Output bounding box coordinates
- Set document language for better OCR
- Extract tables as HTML
- Extract images and tables from documents
- Get chunked elements
- Change element coordinate systems
- Work with PowerPoint files
- Use LangChain and Ollama
- Use LangChain and Llama 3
- Transform a JSON file into a different schema
- Generate a JSON schema for a file
Troubleshooting
Endpoints
Examples
This page provides some examples of accessing Unstructured API via different methods.
For each of these examples, you’ll need:
These environment variables:
UNSTRUCTURED_API_KEY
- Your Unstructured API key value.UNSTRUCTURED_API_URL
- Your Unstructured API URL.
If you do not specify the API URL, your Unstructured Serverless API pay-as-you-go account will be used by default. You must always specify your Serverless API key.
To use the Free Unstructured API, you must always specify your Free API key, and the Free API URL which is https://api.unstructured.io/general/v0/general
To use the pay-as-you-go Unstructured API on Azure or AWS with the SDKs, you must always specify the corresponding API URL. See the Azure or AWS instructions.
Changing partition strategy for a PDF
Here’s how you can modify partition strategy for a PDF file, and select an alternative model to use with Unstructured API.
The hi_res
strategy supports different models, and the default is layout_v1.1.0
.
unstructured-ingest \
local \
--input-path $LOCAL_FILE_INPUT_DIR \
--output-dir $LOCAL_FILE_OUTPUT_DIR \
--strategy hi_res \
--hi-res-model-name layout_v1.1.0 \
--partition-by-api \
--api-key $UNSTRUCTURED_API_KEY \
--partition-endpoint $UNSTRUCTURED_API_URL \
--additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"
import os
from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
LocalIndexerConfig,
LocalDownloaderConfig,
LocalConnectionConfig,
LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
if __name__ == "__main__":
Pipeline.from_configs(
context=ProcessorConfig(),
indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
downloader_config=LocalDownloaderConfig(),
source_connection_config=LocalConnectionConfig(),
partitioner_config=PartitionerConfig(
strategy="hi_res",
hi_res_model_name="layout_v1.0.0",
partition_by_api=True,
api_key=os.getenv("UNSTRUCTURED_API_KEY"),
partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
additional_partition_args={
"split_pdf_page": True,
"split_pdf_allow_failed": True,
"split_pdf_concurrency_level": 15
}
),
uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
).run()
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
curl -X 'POST' $UNSTRUCTURED_API_URL \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H 'unstructured-api-key: $UNSTRUCTURED_API_KEY' \
-F 'files=@sample-docs/layout-parser-paper.pdf' \
-F 'strategy=hi_res' \
-F 'hi_res_model_name=layout_v1.1.0'
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
import os
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared
from unstructured_client.models.errors import SDKError
client = UnstructuredClient(
api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
api_url=os.getenv("UNSTRUCTURED_API_URL"),
)
filename = "sample-docs/layout-parser-paper.pdf"
file = open(filename, "rb")
req = operations.PartitionRequest(
partition_parameters=shared.PartitionParameters(
# Note that this currently only supports a single file.
files=shared.Files(
content=file.read(),
file_name=filename,
),
strategy=shared.Strategy.HI_RES,
hi_res_model_name="layout_v1.1.0",
split_pdf_page=True,
split_pdf_allow_failed=True,
split_pdf_concurrency_level=15
)
)
try:
res = client.general.partition(request=req)
print(res.elements[0])
except SDKError as e:
print(e)
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
import { UnstructuredClient } from "unstructured-client";
import { Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/dist/sdk/models/operations";
import * as fs from "fs";
const key = process.env.UNSTRUCTURED_API_KEY;
const url = process.env.UNSTRUCTURED_API_URL;
const client = new UnstructuredClient({
serverURL: url,
security: {
apiKeyAuth: key,
},
});
const filename = "sample-docs/layout-parser-paper.pdf";
const data = fs.readFileSync(filename);
client.general.partition({
files: {
content: data,
fileName: filename,
},
strategy: Strategy.HiRes,
hiResModelName: "layout_v1.1.0",
splitPdfPage: true,
splitPdfConcurrencyLevel: 15,
splitPdfAllowFailed: true
}).then((res: PartitionResponse) => {
if (res.statusCode == 200) {
console.log(res.elements?.[0]);
}
}).catch((e) => {
console.log(e.statusCode);
console.log(e.body);
});
If you have a local deployment of the Unstructured API, you can use other supported models, such as yolox
.
Specifying the language of a document for better OCR results
For better OCR results, you can specify what languages your document is in using the languages
parameter.
View the list of available languages.
unstructured-ingest \
local \
--input-path $LOCAL_FILE_INPUT_DIR \
--output-dir $LOCAL_FILE_OUTPUT_DIR \
--strategy ocr_only \
--ocr-languages kor \
--partition-by-api \
--api-key $UNSTRUCTURED_API_KEY \
--partition-endpoint $UNSTRUCTURED_API_URL \
--additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"
import os
from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
LocalIndexerConfig,
LocalDownloaderConfig,
LocalConnectionConfig,
LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
if __name__ == "__main__":
Pipeline.from_configs(
context=ProcessorConfig(),
indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
downloader_config=LocalDownloaderConfig(),
source_connection_config=LocalConnectionConfig(),
partitioner_config=PartitionerConfig(
strategy="ocr_only",
ocr_languages=["kor"],
partition_by_api=True,
api_key=os.getenv("UNSTRUCTURED_API_KEY"),
partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
additional_partition_args={
"split_pdf_page": True,
"split_pdf_allow_failed": True,
"split_pdf_concurrency_level": 15
}
),
uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
).run()
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
curl -X 'POST' $UNSTRUCTURED_API_URL \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H 'unstructured-api-key: $UNSTRUCTURED_API_KEY' \
-F 'files=@sample-docs/korean.png' \
-F 'strategy=ocr_only' \
-F 'languages=kor'
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
import os
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared
from unstructured_client.models.errors import SDKError
client = UnstructuredClient(
api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
server_url=os.getenv("UNSTRUCTURED_API_URL"),
)
filename = "sample-docs/korean.png"
file = open(filename, "rb")
req = operations.PartitionRequest(
partition_parameters=shared.PartitionParameters(
# Note that this currently only supports a single file.
files=shared.Files(
content=file.read(),
file_name=filename,
),
strategy=shared.Strategy.OCR_ONLY,
languages=["kor"],
split_pdf_page=True,
split_pdf_allow_failed=True,
split_pdf_concurrency_level=15
)
)
try:
res = client.general.partition(request=req)
print(res.elements[0])
except SDKError as e:
print(e)
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
import { UnstructuredClient } from "unstructured-client";
import { Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/dist/sdk/models/operations";
import * as fs from "fs";
const key = process.env.UNSTRUCTURED_API_KEY;
const url = process.env.UNSTRUCTURED_API_URL;
const client = new UnstructuredClient({
serverURL: url,
security: {
apiKeyAuth: key,
},
});
const filename = "sample-docs/korean.png";
const data = fs.readFileSync(filename);
client.general.partition({
files: {
content: data,
fileName: filename,
},
strategy: Strategy.OcrOnly,
languages: ["kor"],
splitPdfPage: true,
splitPdfConcurrencyLevel: 15,
splitPdfAllowFailed: true
}).then((res: PartitionResponse) => {
if (res.statusCode == 200) {
console.log(res.elements?.[0]);
}
}).catch((e) => {
console.log(e.statusCode);
console.log(e.body);
});
Saving bounding box coordinates
When elements are extracted from PDFs or images, it may be useful to get their bounding boxes as well.
Set the coordinates
parameter to true
to add this field to the elements in the response.
unstructured-ingest \
local \
--input-path $LOCAL_FILE_INPUT_DIR \
--output-dir $LOCAL_FILE_OUTPUT_DIR \
--partition-by-api \
--api-key $UNSTRUCTURED_API_KEY \
--partition-endpoint $UNSTRUCTURED_API_URL \
--strategy hi_res \
--additional-partition-args="{\"coordinates\":\"true\", \"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"
import os
from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
LocalIndexerConfig,
LocalDownloaderConfig,
LocalConnectionConfig,
LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
if __name__ == "__main__":
Pipeline.from_configs(
context=ProcessorConfig(),
indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
downloader_config=LocalDownloaderConfig(),
source_connection_config=LocalConnectionConfig(),
partitioner_config=PartitionerConfig(
partition_by_api=True,
api_key=os.getenv("UNSTRUCTURED_API_KEY"),
partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
strategy="hi_res",
additional_partition_args={
"coordinates": True,
"split_pdf_page": True,
"split_pdf_allow_failed": True,
"split_pdf_concurrency_level": 15
}
),
uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
).run()
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
curl -X 'POST' $UNSTRUCTURED_API_URL \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H 'unstructured-api-key: $UNSTRUCTURED_API_KEY' \
-F 'files=@sample-docs/layout-parser-paper.pdf' \
-F 'coordinates=true' \
-F 'strategy=hi_res'
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
import os
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared
from unstructured_client.models.errors import SDKError
client = UnstructuredClient(
api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
server_url=os.getenv("UNSTRUCTURED_API_URL"),
)
filename = "sample-docs/layout-parser-paper.pdf"
file = open(filename, "rb")
req = operations.PartitionRequest(
partition_parameters=shared.PartitionParameters(
# Note that this currently only supports a single file.
files=shared.Files(
content=file.read(),
file_name=filename,
),
strategy=shared.Strategy.HI_RES,
coordinates=True,
split_pdf_page=True,
split_pdf_allow_failed=True,
split_pdf_concurrency_level=15
)
)
try:
res = client.general.partition(request=req)
print(res.elements[0])
except SDKError as e:
print(e)
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
import { UnstructuredClient } from "unstructured-client";
import { Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/dist/sdk/models/operations";
import * as fs from "fs";
const key = process.env.UNSTRUCTURED_API_KEY;
const url = process.env.UNSTRUCTURED_API_URL;
const client = new UnstructuredClient({
serverURL: url,
security: {
apiKeyAuth: key,
},
});
const filename = "sample-docs/layout-parser-paper.pdf";
const data = fs.readFileSync(filename);
client.general.partition({
files: {
content: data,
fileName: filename,
},
coordinates: true,
strategy: Strategy.HiRes,
splitPdfPage: true,
splitPdfConcurrencyLevel: 15,
splitPdfAllowFailed: true
}).then((res: PartitionResponse) => {
if (res.statusCode == 200) {
console.log(res.elements?.[0]);
}
}).catch((e) => {
console.log(e.statusCode);
console.log(e.body);
});
Returning unique element IDs
By default, the element ID is a SHA-256 hash of the element text. This is to ensure that
the ID is deterministic. One downside is that the ID is not guaranteed to be unique.
Different elements with the same text will have the same ID, and there could also be hash collisions.
To use UUIDs in the output instead, set unique_element_ids=true
. Note: this means that the element IDs
will be random, so with every partition of the same file, you will get different IDs.
This can be helpful if you’d like to use the IDs as a primary key in a database, for example.
unstructured-ingest \
local \
--input-path $LOCAL_FILE_INPUT_DIR \
--output-dir $LOCAL_FILE_OUTPUT_DIR \
--partition-by-api \
--api-key $UNSTRUCTURED_API_KEY \
--partition-endpoint $UNSTRUCTURED_API_URL \
--strategy hi_res \
--additional-partition-args="{\"unique_element_ids\":\"true\", \"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"
import os
from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
LocalIndexerConfig,
LocalDownloaderConfig,
LocalConnectionConfig,
LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
if __name__ == "__main__":
Pipeline.from_configs(
context=ProcessorConfig(),
indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
downloader_config=LocalDownloaderConfig(),
source_connection_config=LocalConnectionConfig(),
partitioner_config=PartitionerConfig(
partition_by_api=True,
api_key=os.getenv("UNSTRUCTURED_API_KEY"),
partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
strategy="hi_res",
additional_partition_args={
"unique_element_ids": True,
"split_pdf_page": True,
"split_pdf_allow_failed": True,
"split_pdf_concurrency_level": 15
}
),
uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
).run()
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
curl -X 'POST' $UNSTRUCTURED_API_URL \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H 'unstructured-api-key: $UNSTRUCTURED_API_KEY' \
-F 'files=@sample-docs/layout-parser-paper-fast.pdf' \
-F 'unique_element_ids=true'
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
import os
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared
from unstructured_client.models.errors import SDKError
client = UnstructuredClient(
api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
server_url=os.getenv("UNSTRUCTURED_API_URL"),
)
filename = "sample-docs/layout-parser-paper-fast.pdf"
file = open(filename, "rb")
req = operations.PartitionRequest(
partition_parameters=shared.PartitionParameters(
# Note that this currently only supports a single file.
files=shared.Files(
content=file.read(),
file_name=filename,
),
unique_element_ids=True,
strategy=shared.Strategy.HI_RES,
split_pdf_page=True,
split_pdf_allow_failed=True,
split_pdf_concurrency_level=15
)
)
try:
res = client.general.partition(request=req)
print(res.elements[0])
except SDKError as e:
print(e)
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
import { UnstructuredClient } from "unstructured-client";
import { Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/dist/sdk/models/operations";
import * as fs from "fs";
const key = process.env.UNSTRUCTURED_API_KEY;
const url = process.env.UNSTRUCTURED_API_URL;
const client = new UnstructuredClient({
serverURL: url,
security: {
apiKeyAuth: key,
},
});
const filename = "sample-docs/layout-parser-paper-fast.pdf";
const data = fs.readFileSync(filename);
client.general.partition({
files: {
content: data,
fileName: filename,
},
uniqueElementIds: true,
strategy: Strategy.HiRes,
splitPdfPage: true,
splitPdfConcurrencyLevel: 15,
splitPdfAllowFailed: true
}).then((res: PartitionResponse) => {
if (res.statusCode == 200) {
console.log(res.elements?.[0]);
}
}).catch((e) => {
console.log(e.statusCode);
console.log(e.body);
});
Adding the chunking step after partitioning
You can combine partitioning and subsequent chunking in a single request by setting the chunking_strategy
parameter.
By default, the chunking_strategy
is set to None
, and no chunking is performed.
unstructured-ingest \
local \
--input-path $LOCAL_FILE_INPUT_DIR \
--output-dir $LOCAL_FILE_OUTPUT_DIR \
--chunking-strategy by_title \
--chunk-max-characters 1024 \
--partition-by-api \
--api-key $UNSTRUCTURED_API_KEY \
--partition-endpoint $UNSTRUCTURED_API_URL \
--strategy hi_res \
--additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}"
import os
from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.processes.connectors.local import (
LocalIndexerConfig,
LocalDownloaderConfig,
LocalConnectionConfig,
LocalUploaderConfig
)
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
if __name__ == "__main__":
Pipeline.from_configs(
context=ProcessorConfig(),
indexer_config=LocalIndexerConfig(input_path=os.getenv("LOCAL_FILE_INPUT_DIR")),
downloader_config=LocalDownloaderConfig(),
source_connection_config=LocalConnectionConfig(),
partitioner_config=PartitionerConfig(
partition_by_api=True,
api_key=os.getenv("UNSTRUCTURED_API_KEY"),
partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
strategy="hi_res",
additional_partition_args={
"split_pdf_page": True,
"split_pdf_allow_failed": True,
"split_pdf_concurrency_level": 15
}
),
chunker_config=ChunkerConfig(
chunking_strategy="by_title",
chunk_max_characters=1024
),
uploader_config=LocalUploaderConfig(output_dir=os.getenv("LOCAL_FILE_OUTPUT_DIR"))
).run()
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
curl -X 'POST' $UNSTRUCTURED_API_URL \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H 'unstructured-api-key: $UNSTRUCTURED_API_KEY' \
-F 'files=@sample-docs/layout-parser-paper-fast.pdf' \
-F 'chunking_strategy=by_title' \
-F 'max_characters=1024' \
-F 'strategy=hi_res'
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
import os
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared
from unstructured_client.models.errors import SDKError
client = UnstructuredClient(
api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
server_url=os.getenv("UNSTRUCTURED_API_URL"),
)
filename = "sample-docs/layout-parser-paper-fast.pdf"
file = open(filename, "rb")
req = operations.PartitionRequest(
partition_parameters=shared.PartitionParameters(
# Note that this currently only supports a single file.
files=shared.Files(
content=file.read(),
file_name=filename,
),
chunking_strategy="by_title",
max_characters=1024,
strategy=shared.Strategy.HI_RES,
split_pdf_page=True,
split_pdf_allow_failed=True,
split_pdf_concurrency_level=15
)
)
try:
res = client.general.partition(request=req)
print(res.elements[0])
except SDKError as e:
print(e)
Unstructured recommends that you use the Unstructured Ingest CLI or the Unstructured Ingest Python library if any of the following apply to you:
- You need to work with documents in cloud storage.
- You want to cache the results of processing multiple files in batches.
- You want more precise control over document-processing pipeline stages such as partitioning, chunking, filtering, staging, and embedding.
import { UnstructuredClient } from "unstructured-client";
import { Strategy } from "unstructured-client/sdk/models/shared/index.js";
import { PartitionResponse } from "unstructured-client/dist/sdk/models/operations";
import * as fs from "fs";
const key = process.env.UNSTRUCTURED_API_KEY;
const url = process.env.UNSTRUCTURED_API_URL;
const client = new UnstructuredClient({
serverURL: url,
security: {
apiKeyAuth: key,
},
});
const filename = "sample-docs/layout-parser-paper-fast.pdf";
const data = fs.readFileSync(filename);
client.general.partition({
files: {
content: data,
fileName: filename,
},
chunkingStrategy: "by_title",
maxCharacters: 1024,
strategy: Strategy.HiRes,
splitPdfPage: true,
splitPdfConcurrencyLevel: 15,
splitPdfAllowFailed: true
}).then((res: PartitionResponse) => {
if (res.statusCode == 200) {
console.log(res.elements?.[0]);
}
}).catch((e) => {
console.log(e.statusCode);
console.log(e.body);
});
Was this page helpful?