Requirements
To complete this walkthrough, you must first have:
- An Unstructured account.
- A Firecrawl account and a Firecrawl API key for that account.
- An AWS account and a properly configured Amazon S3 bucket with bucket access credentials within that account.
- A Pinecone account and a Pinecone API key for that account.
- Python and the necessary development toolchain installed on your local development machine.
Unstructured account
Before you begin, you must have an Unstructured account and an Unstructured API key, as follows:
- If you do not already have an Unstructured account, sign up for free. After you sign up, you are automatically signed in to your new Unstructured Starter account, at https://platform.unstructured.io. To sign up for a Team or Enterprise account instead, contact Unstructured Sales, or learn more.
- If you have an Unstructured Starter or Team account and are not already signed in, sign in to your account at https://platform.unstructured.io. For an Enterprise account, see your Unstructured account administrator for instructions, or email Unstructured Support at support@unstructured.io.
- Get your Unstructured API key:
  a. After you sign in to your Unstructured Starter account, click API Keys on the sidebar.
  b. Click Generate API Key. For a Team or Enterprise account, before you click API Keys, make sure you have selected the organizational workspace you want to create an API key for. Each API key works with one and only one organizational workspace. Learn more.
  c. Follow the on-screen instructions to finish generating the key.
  d. Click the Copy icon next to your new key to add the key to your system's clipboard. If you lose this key, simply return and click the Copy icon again.
Firecrawl account and API key
Before you begin, you must have a Firecrawl account and a Firecrawl API key, as follows:
- Sign up or sign in to your Firecrawl account.
- After you sign in, get your Firecrawl API key.
Amazon S3 bucket with access permissions and credentials
This walkthrough uses an Amazon S3 bucket as the Unstructured ETL+ workflow's source. So, before you begin, you must have an AWS account, and within that account, a properly configured Amazon S3 bucket with bucket access credentials consisting of an AWS access key ID and secret access key for the AWS IAM user that has access to the bucket. Follow the S3 connector instructions to create and configure the bucket and get the bucket access credentials if you do not already have this all set up. (In these instructions, do not follow the directions to use the Unstructured UI to create the S3 source connector. You will do this later when you create the Unstructured ETL+ workflow.)
Pinecone account and API key
Before you begin, you must have a Pinecone account and a Pinecone API key, as follows:
- Sign up or log in to your Pinecone account.
- After you log in, get your Pinecone API key.
Python and development toolchain installation
Before you can start coding on your local machine, you must install Python, and you should also install a Python package and project manager to manage your project's Firecrawl, AWS, Pinecone, and other code dependencies. This walkthrough uses the popular Python package and project manager uv (although uv is not required to use Firecrawl, AWS, or Pinecone with Python).
1
Install uv
If you need to install uv, run one of the following commands, depending on your operating system. To use curl with sh:
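```bash
# uv's standalone installer script, using curl (see Installing uv for the current command)
curl -LsSf https://astral.sh/uv/install.sh | sh
```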
To use wget with sh instead:
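```bash
# The same installer script, fetched with wget instead of curl
wget -qO- https://astral.sh/uv/install.sh | sh
```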
To install uv by using other approaches such as PyPI, Homebrew, or WinGet, see Installing uv.
2
Install Python
This walkthrough should work with Python 3.9 or later and has been tested with Python 3.12.
uv will detect and use Python if you already have it installed. To view a list of installed Python versions, run the following command:
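```bash
uv python list
```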
If you need to install Python, you can install a specific version of Python for use with uv by running the following command. For example, this command installs Python 3.12 for use with uv:
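```bash
uv python install 3.12
```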
3
Create the project directory
Switch to the directory on your local development machine where you want to create the project directory for this walkthrough.
This example creates a project directory named firecrawl_unstructured_demo within your current working directory and then switches to this new project directory:
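```bash
mkdir firecrawl_unstructured_demo
cd firecrawl_unstructured_demo
```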
4
Initialize the project
From within the new project directory, use uv to initialize the project by running the following command:
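```bash
uv init
```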
5
Create a venv virtual environment
To isolate and manage your project's code dependencies, you should create a virtual environment. This walkthrough uses the popular Python virtual environment manager venv (although venv is not required to use Firecrawl, AWS, or Pinecone with Python). From the root of your project directory, use uv to create a virtual environment with venv by running the following command:
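```bash
uv venv
```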
6
Activate the virtual environment
To activate the venv virtual environment, run one of the following commands from the root of your project directory:
- For bash or zsh, run source .venv/bin/activate
- For fish, run source .venv/bin/activate.fish
- For csh or tcsh, run source .venv/bin/activate.csh
- For pwsh, run .venv/bin/Activate.ps1
If you need to deactivate the virtual environment at any time, run the following command:
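```bash
deactivate
```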
7
Install the Firecrawl Python SDK
You will use the Firecrawl Python SDK later to extract data from a website. Use uv to install the Firecrawl Python SDK package by running the following command:
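For example, assuming the SDK's PyPI package name firecrawl-py:

```bash
uv add firecrawl-py
```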
8
Install the AWS SDK for Python
You will use the AWS SDK for Python later to have Firecrawl add the extracted data to your Amazon S3 bucket.
- Use uv to install the AWS SDK for Python package by running the following command:
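```bash
# boto3 is the AWS SDK for Python
uv add boto3
```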
- To enable the AWS SDK for Python to authenticate with your AWS account, create a file named credentials in the ~/.aws/ directory for macOS or Linux, or the <drive>:\Users\<username>\.aws\ directory for Windows. Then add the AWS access key ID and secret access key of the AWS IAM user that has access to your Amazon S3 bucket, and the short code for the AWS Region of the bucket (for example, us-east-1), to the credentials file. In the following credentials file example, replace the following placeholders:
  - Replace <your-access-key-id> with the AWS access key ID of the AWS IAM user that has access to the bucket.
  - Replace <your-secret-access-key> with the secret access key for the related access key ID.
  - Replace <the-aws-short-region-code-for-your-bucket> with the short code for the AWS Region of the bucket.
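For example, a minimal credentials file with a single default profile, following the placeholders above, might look like this:

```ini
[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>
region = <the-aws-short-region-code-for-your-bucket>
```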
9
Install the Pinecone Python SDK
You will use the Pinecone Python SDK later to create the Pinecone serverless index and, in an optional step, to help enable RAG chat on the data that will be stored in the index. Use uv to install the Pinecone Python SDK package, along with the grpc extra to enable the programmatic creation of a Pinecone serverless index later, by running the following command:
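For example, assuming the SDK's PyPI package name pinecone with its grpc extra:

```bash
uv add "pinecone[grpc]"
```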
Step 1: Extract the website data and add the data to the bucket
In this step, you use Firecrawl to extract data from a website and then add the extracted data to the Amazon S3 bucket.
1
Add your Firecrawl API key to your project
- Set your Firecrawl API key in your project as an environment variable for your code to use. To do this:
  a. In the root of your code project, create a file named .env.
  b. Add the following environment variable to this .env file, and replace <your-firecrawl-api-key> with your Firecrawl API key:
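For example, using FIRECRAWL_API_KEY as the variable name (the name that the script sketch later in this walkthrough reads):

```
FIRECRAWL_API_KEY=<your-firecrawl-api-key>
```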
- To enable your Firecrawl code to access the environment variable, use uv to install the dotenv package by running the following command:
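The dotenv module is published on PyPI as python-dotenv:

```bash
uv add python-dotenv
```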
- In the .gitignore file in the root of your project, add the following line, to help prevent accidentally checking in your Firecrawl API key (or anything else in the .env file) into any shared code repositories later:
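```
# Keep the local .env file (and the API keys in it) out of version control
.env
```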
2
Create the Python script to extract the website data
- In the .env file, add the following line, which defines an environment variable representing the base URL of the website to crawl. This walkthrough uses a website named Books to Scrape that contains fictitious data (although you can use any accessible website you want):
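For example, using CRAWL_BASE_URL as an illustrative variable name (match whatever name your script reads):

```
CRAWL_BASE_URL=https://books.toscrape.com
```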
- Also in the .env file, add the following line, which defines an environment variable representing the name of the target Amazon S3 bucket to have Firecrawl store the website crawl results in. Replace <the-name-of-your-bucket> with the name of the bucket. Enter only the name of the bucket; do not prefix the bucket's name with s3://, and do not include any trailing slash (/) after the bucket name:
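For example, using S3_BUCKET_NAME as an illustrative variable name:

```
S3_BUCKET_NAME=<the-name-of-your-bucket>
```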
- Create a file named firecrawl_extract.py in the root of your project directory, and add the following code to it:
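The following is a minimal sketch of such a script. It assumes the .env variable names shown earlier (FIRECRAWL_API_KEY, CRAWL_BASE_URL, and S3_BUCKET_NAME), and the exact crawl_url arguments and response shape vary across firecrawl-py versions, so adjust it to match the version you installed:

```python
# firecrawl_extract.py
import json
import os
from datetime import datetime, timezone

import boto3
from dotenv import load_dotenv
from firecrawl import FirecrawlApp

# Load FIRECRAWL_API_KEY, CRAWL_BASE_URL, and S3_BUCKET_NAME from the .env file.
load_dotenv()

FIRECRAWL_API_KEY = os.environ["FIRECRAWL_API_KEY"]
CRAWL_BASE_URL = os.environ["CRAWL_BASE_URL"]
S3_BUCKET_NAME = os.environ["S3_BUCKET_NAME"]


def save_to_s3(pages: list, full_results: dict, timestamp: str) -> None:
    """Add the full crawl results and each crawled page to the S3 bucket."""
    s3 = boto3.client("s3")

    # Save the full crawl results as a single JSON object.
    full_results_key = f"crawls/{timestamp}/full_results.json"
    s3.put_object(
        Bucket=S3_BUCKET_NAME,
        Key=full_results_key,
        Body=json.dumps(full_results, default=str),
    )

    # Save each crawled page's HTML as its own object under .../pages/.
    for i, page in enumerate(pages):
        page_key = f"crawls/{timestamp}/pages/page_{i}.html"
        s3.put_object(
            Bucket=S3_BUCKET_NAME,
            Key=page_key,
            Body=(page.get("html") or ""),
        )

    print(f"Saved {len(pages)} pages to s3://{S3_BUCKET_NAME}/crawls/{timestamp}/pages/")


def main() -> None:
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)

    # Crawl the website. The limit argument caps the number of crawled pages,
    # and the html format asks Firecrawl to return each page's raw HTML.
    # Adjust these arguments to match the firecrawl-py version you installed.
    crawl_results = app.crawl_url(
        CRAWL_BASE_URL,
        params={"limit": 10, "scrapeOptions": {"formats": ["html"]}},
    )

    # Normalize the response to plain dictionaries, whether the SDK returned a
    # dict or a response object that exposes a .dict() method.
    full_results = (
        crawl_results if isinstance(crawl_results, dict) else crawl_results.dict()
    )
    pages = full_results.get("data") or []

    save_to_s3(pages, full_results, timestamp)


if __name__ == "__main__":
    main()
```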
The preceding code does the following:
- Loads from your .env file your Firecrawl API key, the base URL for Firecrawl to use for website crawling, and the S3 bucket name for Firecrawl to send the website crawl results to.
- Calls a function named main that uses Firecrawl to crawl the website.
- The main function then calls the save_to_s3 function, which adds the website crawl results to the S3 bucket.
You can change where the results are stored in the bucket by changing the full_results_key and page_key variables. For example, you might want to save the website crawl results in a single folder and then keep overwriting those results with new results as they come in, instead of adding new results to separate subfolders. You can also change the number of crawled pages by changing the limit argument, outputting the results as Markdown instead of HTML, and so on. For more information, see the Firecrawl Python SDK documentation.
3
Run the script to extract the data
From the root of your project, use uv to run the script to extract the data from the website by running the following command:
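```bash
uv run firecrawl_extract.py
```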
After the script runs, you should see output that shows where the crawl results were saved, including <your-bucket-name> (the name of your bucket) and <timestamp> (the timestamp generated by the script).
To verify that the data was successfully added, open your Amazon S3 console and look in the crawls folder within your bucket.
Step 2: Create a Pinecone serverless index
In this step, you create a Pinecone serverless index that Unstructured will use to store its results based on the extracted website data that Firecrawl sends to your Amazon S3 bucket.
1
Add your Pinecone API key to your project
Set your Pinecone API key as an environment variable for your code to use. To do this, add the following environment variable to your .env file, and replace <your-pinecone-api-key> with your Pinecone API key:
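For example, using PINECONE_API_KEY as the variable name (the name that the script sketches later in this walkthrough read):

```
PINECONE_API_KEY=<your-pinecone-api-key>
```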
2
Add the Pinecone index name to your project
Add the following line to your .env file, and replace <the-name-of-your-index> with the name of the serverless index you want to create:
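For example, using PINECONE_INDEX_NAME as an illustrative variable name:

```
PINECONE_INDEX_NAME=<the-name-of-your-index>
```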
3
Create the Python script to create the Pinecone serverless index
Create a file named pinecone_create_index.py in the root of your project directory and add the following code to it:
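The following is a minimal sketch of such a script. It assumes the PINECONE_API_KEY and PINECONE_INDEX_NAME variables from your .env file; the aws cloud and us-east-1 region are examples, so choose the serverless cloud and region you want:

```python
# pinecone_create_index.py
import os

from dotenv import load_dotenv
from pinecone import Pinecone, ServerlessSpec

# Load PINECONE_API_KEY and PINECONE_INDEX_NAME from the .env file.
load_dotenv()

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index_name = os.environ["PINECONE_INDEX_NAME"]

# Create a dense serverless index sized for the 3072-dimension embeddings that
# the Unstructured workflow's text-embedding-3-large embedder will generate.
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=3072,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # example cloud and region
    )
    print(f"Created serverless index '{index_name}'.")
else:
    print(f"Index '{index_name}' already exists.")
```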
The preceding code creates in your Pinecone account a serverless index with the specified name for the specified cloud provider and within the specified region for that provider. The index is configured to accept vectors with 3072 dimensions with a cosine metric and a dense vector type.
4
Run the script to create the index
From the root of your project, use uv to run the script to create the Pinecone serverless index by running the following command:
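```bash
uv run pinecone_create_index.py
```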
After the script runs, to verify that the index was successfully added, log in to your Pinecone account. Then open your default project, and click Indexes under Database in the sidebar to view the new index that was added.
Step 3: Set up and run the Unstructured ETL+ workflow
In this step, you set up and run the Unstructured ETL+ workflow that will take the extracted website data from your Amazon S3 bucket, process it to turn it into Unstructured elements and metadata, add vector embeddings to the elements, and then add all of this information to your Pinecone serverless index.
1
Create the source connector
- If you are not already signed in, sign in to your Unstructured account.
- On the sidebar, click Connectors.
- With Sources selected, click New.
- Enter some name for the source connector, such as s3-firecrawl-source.
- For Provider, click Amazon S3.
- Click Continue.
- For Bucket URI, enter s3://<your-bucket-name>/crawls/<timestamp>/pages/, replacing <your-bucket-name> with the name of your bucket and <timestamp> with the timestamp generated by the script in Step 1.
- For Authentication Method, select Secret.
- Enter your AWS access key ID and secret access key.
- Click Continue.
- Click Save and Test.
2
Create the destination connector
- On the sidebar, click Connectors.
- With Destinations selected, click New.
- Enter some name for the destination connector, such as pinecone-firecrawl-destination.
- For Provider, click Pinecone.
- Click Continue.
- For Index Name, enter the name of the Pinecone index you created earlier, such as firecrawl-dense-index.
- Leave Namespace blank. Unstructured will automatically use the default namespace in the index.
- Leave Batch Size set to the default value of 50.
- For API Key, enter your Pinecone API key.
- Click Save and Test.
3
Create the workflow
- On the sidebar, click Workflows.
- Click New Workflow.
- With Build it Myself selected, click Continue.
- In the workflow designer, click the Source workflow node.
- In the node settings pane, on the Connectors tab, select the name of the source connector you created earlier, such as s3-firecrawl-source.
- Click the Destination workflow node.
- In the node settings pane, on the Connectors tab, select the name of the destination connector you created earlier, such as pinecone-firecrawl-destination.
- Between the Partitioner and Destination nodes, click the + (add node) button.
- Click Transform > Embedder.
- Click the Embedder workflow node.
- In the Settings pane, under Azure OpenAI, select text-embedding-3-large.
- Click any blank space in the workflow designer.
- In the workflow settings pane, click Details.
- For Name, click the pencil (edit) icon and then enter some name for the workflow, such as firecrawl-s3-to-pinecone-workflow. Then click the checkmark (save) icon.
- Switch Active to on.
- Click Save.
4
Run the workflow as a job
- On the sidebar, click Workflows.
- Next to the name of the workflow you created earlier, click Run.
5
Monitor the job
- On the sidebar, click Jobs.
- Click the name of the workflow you created earlier.
- Monitor the job’s progress in the job details pane until the status changes to Finished.
6
View the results
- One way to view the results is to create a file named pinecone_fetch_from_index.py in the root of your project directory and add the following code to it:
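A minimal sketch of such a script, assuming the PINECONE_API_KEY and PINECONE_INDEX_NAME variables from your .env file and the index's default namespace, might look like this:

```python
# pinecone_fetch_from_index.py
import os

from dotenv import load_dotenv
from pinecone import Pinecone

# Load PINECONE_API_KEY and PINECONE_INDEX_NAME from the .env file.
load_dotenv()

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ["PINECONE_INDEX_NAME"])

# List the record IDs in the index's default namespace in batches, fetch each
# batch, and print each record's ID and metadata. The vector values themselves
# are not printed.
for id_batch in index.list():
    fetched = index.fetch(ids=list(id_batch))
    for record_id, record in fetched.vectors.items():
        print(record_id)
        print(record.metadata)
        print("---")
```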
- From the root of your project, use uv to run the script as follows. The output will show each record's ID and metadata; for brevity and performance, the records' vector embeddings are not included in the output:
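```bash
uv run pinecone_fetch_from_index.py
```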