Delta Table source connector
This sample code uses the Unstructured Open Source Library.
Objectives
- Extract text and metadata from a PDF file using the Unstructured.io Python SDK.
- Process and store this data in a Databricks Delta Table.
- Retrieve data from the Delta Table using the Unstructured.io Delta Table Connector.
Prerequisites
- Unstructured Python SDK
- Databricks account and workspace
- AWS S3 for Delta Table storage
Processing and Storing Data in a Databricks Delta Table
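- Initialize PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()  # reuses the active session if one exists
- Convert the JSON output into a DataFrame
dataframe = spark.createDataFrame(res.elements)  # res is assumed to be the response from the PDF extraction step (not shown here)
- Store the DataFrame as a Delta Table
dataframe.write.mode("overwrite").format("delta").saveAsTable("delta_table")  # "overwrite" replaces any existing table with this name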
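The conversion step above passes `res.elements` directly to `spark.createDataFrame`. If each element is a nested dictionary, Spark's schema inference may have trouble with fields that differ from one element to the next. The following is a minimal sketch of one way to flatten the elements into rows of scalar values first. It assumes each element has top-level `type`, `element_id`, and `text` keys plus a nested `metadata` dictionary; that layout, the sample values, and the `flatten_elements` helper are illustrative assumptions, not part of the Unstructured or PySpark APIs.

```python
import json
from typing import Any, Dict, List


def flatten_elements(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn nested element dictionaries into flat rows with scalar columns."""
    rows = []
    for element in elements:
        # Tolerate elements with a missing or empty metadata entry.
        metadata = element.get("metadata") or {}
        rows.append(
            {
                "element_id": element.get("element_id"),
                "type": element.get("type"),
                "text": element.get("text"),
                # Hypothetical choice: promote two metadata fields to columns.
                "filename": metadata.get("filename"),
                "page_number": metadata.get("page_number"),
                # Keep the full metadata as a JSON string so nothing is lost.
                "metadata_json": json.dumps(metadata, sort_keys=True),
            }
        )
    return rows


# Sample input shaped like the assumed element layout (illustrative values only).
sample_elements = [
    {
        "type": "Title",
        "element_id": "a1",
        "text": "Quarterly Report",
        "metadata": {"filename": "report.pdf", "page_number": 1, "languages": ["eng"]},
    },
    {
        "type": "NarrativeText",
        "element_id": "b2",
        "text": "Revenue grew.",
        "metadata": {"filename": "report.pdf", "page_number": 2},
    },
]

rows = flatten_elements(sample_elements)
print(rows[0]["metadata_json"])

# With the Spark session from the steps above, the flat rows could then be used as:
# dataframe = spark.createDataFrame(rows)
```

Storing the full metadata as a JSON string keeps the table schema stable even when elements carry different metadata keys, at the cost of parsing that column again when individual fields are needed later.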
Conclusion
This documentation covers the essential steps for converting unstructured PDF data into structured data and storing it in a Databricks Delta Table. It also outlines how to extract this data for further use.