Changing partition strategy for a PDF
Here’s how you can modify the partition strategy for a PDF file and select an alternative model to use with the Unstructured API. The hi_res strategy supports different models, and the default is layout_v1.1.0.
One alternative model is yolox.
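As a rough illustration of the request described above, the sketch below builds the form fields for a hi_res partition and posts the file with the third-party requests library. The field names strategy and hi_res_model_name, the unstructured-api-key header, the files upload field, and the helper names are assumptions to verify against the API reference for your deployment; this is not the original example from this page.

```python
def build_partition_fields(strategy="hi_res", hi_res_model_name=None):
    """Return form fields for a partition request (field names are assumptions)."""
    fields = {"strategy": strategy}
    # Only send a model name when overriding the default hi_res model.
    if hi_res_model_name is not None:
        fields["hi_res_model_name"] = hi_res_model_name
    return fields


def partition_pdf(path, api_url, api_key, fields):
    """Post a PDF with the given fields. Needs `requests` and network access."""
    import requests  # third-party; imported here so the builder works without it

    with open(path, "rb") as f:
        response = requests.post(
            api_url,
            headers={"unstructured-api-key": api_key},  # assumed header name
            files={"files": f},  # assumed upload field name
            data=fields,
        )
    response.raise_for_status()
    return response.json()


# Select the hi_res strategy with yolox instead of the default model.
fields = build_partition_fields(hi_res_model_name="yolox")
print(fields)
```

Calling partition_pdf with your own file path, API URL, and key would send the request; it is left uncalled here because it needs credentials.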
Specifying the language of a document for better OCR results
For better OCR results, you can specify what languages your document is in using the languages parameter.
View the list of available languages.
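A minimal sketch of passing languages, assuming the parameter accepts several values sent as repeated form fields and that the request is made with the third-party requests library. The language codes eng and kor, the header and upload field names, and both helper names are illustrative assumptions; check the list of available languages for the codes your deployment accepts.

```python
def build_language_fields(languages):
    """Return one ("languages", code) pair per language for a multipart form."""
    if not languages:
        raise ValueError("provide at least one language code")
    return [("languages", code) for code in languages]


def partition_with_languages(path, api_url, api_key, languages):
    """Post a file with language hints. Needs `requests` and network access."""
    import requests  # third-party; imported lazily

    with open(path, "rb") as f:
        response = requests.post(
            api_url,
            headers={"unstructured-api-key": api_key},  # assumed header name
            files={"files": f},  # assumed upload field name
            data=build_language_fields(languages),  # repeated form fields
        )
    response.raise_for_status()
    return response.json()


# Example codes only; they are assumptions, not taken from this page.
pairs = build_language_fields(["eng", "kor"])
print(pairs)
```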
Saving bounding box coordinates
When elements are extracted from PDFs or images, it may be useful to get their bounding boxes as well. Set the coordinates parameter to true to add this field to the elements in the response.
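To illustrate what can be done with the returned coordinates, the sketch below reduces an element's corner points to a simple bounding box. The response shape (a points list under metadata.coordinates) and the sample values are assumptions written by hand for this example, not output captured from the API.

```python
import json

# Hand-written, hypothetical response fragment; real responses carry more fields.
sample_response = json.loads("""
[
  {"type": "Title", "text": "Annual Report",
   "metadata": {"coordinates": {"points": [[10.0, 20.0], [10.0, 50.0],
                                           [200.0, 50.0], [200.0, 20.0]]}}},
  {"type": "NarrativeText", "text": "No coordinates here.", "metadata": {}}
]
""")


def bounding_box(element):
    """Return (x_min, y_min, x_max, y_max), or None if no coordinates are present."""
    coords = element.get("metadata", {}).get("coordinates")
    if not coords or not coords.get("points"):
        return None
    xs = [point[0] for point in coords["points"]]
    ys = [point[1] for point in coords["points"]]
    return (min(xs), min(ys), max(xs), max(ys))


for element in sample_response:
    print(element["type"], bounding_box(element))
```

Elements without the field yield None, which is what you would expect when coordinates is left at its default.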
Returning unique element IDs
By default, the element ID is a SHA-256 hash of the element text. This ensures that the ID is deterministic. One downside is that the ID is not guaranteed to be unique: different elements with the same text will have the same ID, and there could also be hash collisions. To use UUIDs in the output instead, set unique_element_ids=true. Note: this means that the element IDs will be random, so with every partition of the same file, you will get different IDs.
Unique IDs can be helpful if you’d like to use them as a primary key in a database, for example.
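The trade-off between the two ID schemes can be seen in a small standalone sketch. It hashes text with SHA-256 as the description says, but it is only an illustration: the library may differ in detail (for example in truncation or extra inputs), and the function names are made up for this example.

```python
import hashlib
import uuid


def deterministic_id(text):
    """Hash-based ID: the same text always gives the same ID."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def random_id():
    """UUID-based ID: unique per call, but different on every run."""
    return str(uuid.uuid4())


# Two elements share the text "Page 1", as repeated headers often do.
texts = ["Page 1", "Introduction", "Page 1"]

hashed = [deterministic_id(t) for t in texts]
unique = [random_id() for _ in texts]

print(len(set(hashed)))  # 2: the duplicate text collapses to one ID
print(len(set(unique)))  # 3: every element gets its own ID
```

With hash-based IDs, inserting these three elements into a table keyed on the ID would fail or overwrite a row, which is the case unique_element_ids is meant to avoid.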
Adding the chunking step after partitioning
You can combine partitioning and subsequent chunking in a single request by setting the chunking_strategy parameter.
By default, chunking_strategy is set to None, and no chunking is performed.
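A minimal sketch of how the parameter might be added to the request's form fields, mirroring the default described above by omitting the field when no strategy is chosen. The value by_title is used as an assumed example of a strategy name, and the helper name is hypothetical; consult the API reference for the strategies your deployment supports.

```python
def build_chunking_fields(chunking_strategy=None):
    """Return form fields for chunking; empty when chunking is not requested."""
    fields = {}
    # Leaving the field out keeps the default behaviour: no chunking.
    if chunking_strategy is not None:
        fields["chunking_strategy"] = chunking_strategy
    return fields


# Default: partition only.
print(build_chunking_fields())

# Partition, then chunk in the same request ("by_title" is an assumed value).
print(build_chunking_fields("by_title"))
```

These fields would be sent alongside the uploaded file in the same request as any other partition parameters.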