File extension |
---|
.abw |
.bmp |
.csv |
.cwk |
.dbf |
.dif * |
.doc |
.docm |
.docx |
.dot |
.dotm |
.eml |
.epub |
.et |
.eth |
.fods |
.heic |
.htm |
.html |
.hwp |
.jpeg |
.jpg |
.md |
.mcw |
.msg |
.mw |
.odt |
.org |
.p7s |
.pbd |
.pdf |
.png |
.pot |
.ppt |
.pptm |
.pptx |
.prn |
.rst |
.rtf |
.sdp |
.sxg |
.tiff |
.txt |
.tsv |
.xls |
.xlsx |
.xml |
.zabw |
Category | File types |
---|---|
Apple | .cwk , .mcw |
CSV | .csv |
Data Interchange | .dif * |
dBase | .dbf |
Email | .eml , .msg , .p7s |
EPUB | .epub |
HTML | .htm , .html |
Image | .bmp , .heic , .jpeg , .jpg , .png , .prn , .tiff |
Markdown | .md |
OpenOffice | .odt |
Org Mode | .org |
Other | .eth , .pbd , .sdp |
PDF | .pdf |
Plain text | .txt |
PowerPoint | .pot , .ppt , .pptm , .pptx |
reStructured Text | .rst |
Rich Text | .rtf |
Spreadsheet | .et , .fods , .mw , .xls , .xlsx |
StarOffice | .sxg |
TSV | .tsv |
Word processing | .abw , .doc , .docm , .docx , .dot , .dotm , .hwp , .zabw |
XML | .xml |
* For .dif, \n characters in .dif files are supported, but \r\n characters will raise the error UnsupportedFileFormatError: Partitioning is not supported for the FileType.UNK file type.
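One way to avoid this error is to rewrite a .dif file with \n line endings before processing it. The following is a minimal sketch that uses only the Python standard library; the function name and file paths are hypothetical and are not part of any Unstructured API.

```python
from pathlib import Path


def normalize_dif_line_endings(src: Path, dst: Path) -> int:
    """Copy src to dst, replacing every CRLF sequence with a single LF.

    Returns the number of CRLF sequences that were replaced.
    """
    data = src.read_bytes()
    crlf_count = data.count(b"\r\n")
    # Work on raw bytes so the file's text encoding is left untouched.
    dst.write_bytes(data.replace(b"\r\n", b"\n"))
    return crlf_count


if __name__ == "__main__":
    # Hypothetical paths, shown for illustration only.
    source = Path("report.dif")
    if source.exists():
        replaced = normalize_dif_line_endings(source, Path("report-lf.dif"))
        print(f"Replaced {replaced} CRLF line endings")
```

Writing to a separate destination file keeps the original intact in case the converted copy needs to be compared against it.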
.pdf, .pptx, and .tiff.
.docx files that have page metadata, Unstructured calculates the number of pages based on that metadata.
.json file to your local machine.
This approach enables rapid, local, run-adjust-repeat prototyping of end-to-end Unstructured ETL+ workflows with a full range of Unstructured features.
After you get the results you want, you can then attach remote source and destination connectors to both ends of your existing workflow to begin processing remote files and data at scale in production.
To run this quickstart, you will need a local file with a size of 10 MB or less and one of the following file types:
File type |
---|
.bmp |
.csv |
.doc |
.docx |
.email |
.epub |
.heic |
.html |
.jpg |
.md |
.odt |
.org |
.pdf |
.pot |
.potm |
.ppt |
.pptm |
.pptx |
.rst |
.rtf |
.sgl |
.tiff |
.txt |
.tsv |
.xls |
.xlsx |
.xml |
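Before starting, it can help to confirm that a local file meets both requirements. The following is a minimal sketch, not part of Unstructured itself: the function name is hypothetical, the extension set is copied from the table above, and "10 MB" is assumed to mean 10 * 1024 * 1024 bytes.

```python
from pathlib import Path

# File types copied from the quickstart table above.
SUPPORTED_EXTENSIONS = {
    ".bmp", ".csv", ".doc", ".docx", ".email", ".epub", ".heic", ".html",
    ".jpg", ".md", ".odt", ".org", ".pdf", ".pot", ".potm", ".ppt",
    ".pptm", ".pptx", ".rst", ".rtf", ".sgl", ".tiff", ".txt", ".tsv",
    ".xls", ".xlsx", ".xml",
}

# Assumption: "10 MB" is interpreted as 10 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 10 * 1024 * 1024


def check_quickstart_file(path: Path) -> list[str]:
    """Return a list of problems found; an empty list means the file qualifies."""
    problems = []
    # Compare extensions case-insensitively, so ".PDF" matches ".pdf".
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {path.suffix or '(none)'}")
    size = path.stat().st_size
    if size > MAX_SIZE_BYTES:
        problems.append(f"file is {size} bytes; the limit is {MAX_SIZE_BYTES}")
    return problems


if __name__ == "__main__":
    # Hypothetical file name, shown for illustration only.
    candidate = Path("sample.pdf")
    if candidate.exists():
        issues = check_quickstart_file(candidate)
        print("OK" if not issues else "; ".join(issues))
```

Returning a list of problems rather than a single boolean lets the caller report every failed requirement at once.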
Sign up and sign in
Create a workflow
Process a local file
.json file to your local machine by clicking Download full JSON.
Add more nodes to the workflow
Next steps
.json files in a file or object store, or as records in a database or vector store.
Unstructured-IO/unstructured-ingest repository in GitHub.
Sign up, sign in, and get your API key
Create and set up the S3 bucket
In the S3 bucket, a folder named input represents the source location. This is where your files to be processed will be stored. The S3 URI to the source location will be s3://<your-bucket-name>/input.
Inside of the same S3 bucket, a folder named output represents the destination location. This is where Unstructured will put the processed data. The S3 URI to the destination location will be s3://<your-bucket-name>/output.
Learn how to create an S3 bucket and set it up for Unstructured. (Do not run the Python SDK code or REST commands at the end of those setup instructions.)
Run the quickstart notebook
View the processed data