Firecrawl
Firecrawl is a tool that allows you to crawl websites and extract data from them.
You can use Unstructured to process the data that Firecrawl extracts from these websites. Unstructured turns this extracted website data into a single set of standard Unstructured document elements and metadata, improving the accuracy, relevance, and performance of your RAG applications, agentic AI workflows, model fine tuning, and more.
This example provides a hands-on walkthrough of how to use Firecrawl with Unstructured. In this example, you use Python code to have Firecrawl extract data from a fictitious website and then add the extracted data to an Amazon S3 bucket within your AWS account. From there, you use the Unstructured user interface (UI) to set up an ETL+ workflow that takes the extracted data from the S3 bucket, converts it into Unstructured document elements and metadata, adds vector embeddings, and then sends the results into a serverless index within a Pinecone vector store.
Requirements
To complete this walkthrough, you must first have:
- An Unstructured account.
- A Firecrawl account and a Firecrawl API key for that account.
- An AWS account and a properly configured Amazon S3 bucket with bucket access credentials within that account.
- A Pinecone account and a Pinecone API key for that account.
- Python and the necessary development toolchain installed on your local development machine.
The following sections describe how to get these requirements.
Unstructured account
Before you begin, you must have an Unstructured account and an Unstructured API key, as follows:
If you signed up for Unstructured through the For Enterprise page, or if you are using a self-hosted deployment of Unstructured, the following sign-up information does not apply to you, and you can skip ahead to the next step after you sign in. If you are not sure how to sign in, contact Unstructured Sales at sales@unstructured.io.
- If you do not already have an Unstructured account, go to https://unstructured.io/contact and fill out the online form to indicate your interest.
- If you already have an Unstructured account, go to https://platform.unstructured.io and sign in by using the email address, Google account, or GitHub account that is associated with your Unstructured account. The Unstructured user interface (UI) then appears, and you can start using it right away.
Firecrawl account and API key
Before you begin, you must have a Firecrawl account and a Firecrawl API key, as follows:
- Sign up or sign in to your Firecrawl account.
- After you sign in, get your Firecrawl API key.
Amazon S3 bucket with access permissions and credentials
This walkthrough uses an Amazon S3 bucket as the Unstructured ETL+ workflow’s source. So, before you begin, you must have an AWS account, and within that account, a properly configured Amazon S3 bucket with bucket access credentials consisting of an AWS access key ID and secret access key for the AWS IAM user that has access to the bucket. Follow the S3 connector instructions to create and configure the bucket and get the bucket access credentials if you do not already have this all set up. (In these instructions, do not follow the directions to use the Unstructured UI to create the S3 source connector. You will do this later when you create the Unstructured ETL+ workflow.)
Pinecone account and API key
Before you begin, you must have a Pinecone account and a Pinecone API key, as follows:
- Sign up or log in to your Pinecone account.
- After you log in, get your Pinecone API key.
Python and development toolchain installation
Before you can start coding on your local machine, you must install Python, and you should also install a Python package and project manager to manage your project’s Firecrawl, AWS, Pinecone, and other code dependencies.
This walkthrough uses the popular Python package and project manager uv (although uv is not required to use Firecrawl, AWS, or Pinecone with Python).
Install uv
To install uv, run one of the following commands, depending on your operating system:
To use curl with sh:
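At the time of writing, the uv documentation gives this standalone installer command; check the uv installation page for the current form:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```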
To use wget with sh instead:
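The equivalent wget form from the uv documentation:

```bash
wget -qO- https://astral.sh/uv/install.sh | sh
```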
To use PowerShell with irm to download the script and run it with iex:
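The Windows installer command from the uv documentation:

```powershell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```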
If you need to install uv by using other approaches such as PyPI, Homebrew, or WinGet, see Installing uv.
Install Python
uv will detect and use Python if you already have it installed.
To view a list of installed Python versions, run the following command:
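The command is:

```bash
uv python list
```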
If, however, you do not already have Python installed, you can install a version of Python for use with uv by running the following command. For example, this command installs Python 3.12 for use with uv:
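A typical invocation:

```bash
uv python install 3.12
```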
Create the project directory
Switch to the directory on your local development machine where you want to create the project directory for this walkthrough.
This example creates a project directory named firecrawl_unstructured_demo within your current working directory and then switches to this new project directory:
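For example:

```bash
mkdir firecrawl_unstructured_demo
cd firecrawl_unstructured_demo
```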
Initialize the project
From within the new project directory, use uv to initialize the project by running the following command:
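The command is:

```bash
uv init
```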
Create a venv virtual environment
To isolate and manage your project’s code dependencies, you should create a virtual environment. This walkthrough uses the popular Python virtual environment manager venv (although venv is not required to use Firecrawl, AWS, or Pinecone with Python).
From the root of your project directory, use uv to create a virtual environment with venv by running the following command:
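The command is:

```bash
uv venv
```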
Activate the virtual environment
To activate the venv virtual environment, run one of the following commands from the root of your project directory:
- For bash or zsh, run source .venv/bin/activate
- For fish, run source .venv/bin/activate.fish
- For csh or tcsh, run source .venv/bin/activate.csh
- For pwsh, run .venv/bin/Activate.ps1
- For cmd.exe, run .venv\Scripts\activate.bat
- For PowerShell, run .venv\Scripts\Activate.ps1
If you need to deactivate the virtual environment at any time, run the following command:
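The venv deactivation command is the same on all platforms:

```bash
deactivate
```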
Install the Firecrawl Python SDK
You will use the Firecrawl Python SDK later to extract data from a website.
Use uv to install the Firecrawl Python SDK package by running the following command:
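Assuming the SDK is published on PyPI as firecrawl-py (check the Firecrawl documentation if the package name has changed):

```bash
uv add firecrawl-py
```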
Install the AWS SDK for Python
You will use the AWS SDK for Python later to have Firecrawl add the extracted data to your Amazon S3 bucket.
- Use uv to install the AWS SDK for Python package by running the following command:
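The AWS SDK for Python is published as boto3:

```bash
uv add boto3
```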
- To enable the AWS SDK for Python to authenticate with your AWS account, create a file named credentials in the ~/.aws/ directory for macOS or Linux, or the <drive>:\Users\<username>\.aws\ directory for Windows. Then add the AWS access key ID and secret access key of the AWS IAM user that has access to your Amazon S3 bucket, and the short code for the AWS Region of the bucket (for example, us-east-1), to the credentials file.

  In the following credentials file example, replace the following placeholders:

  - Replace <your-access-key-id> with the AWS access key ID of the AWS IAM user that has access to the bucket.
  - Replace <your-secret-access-key> with the secret access key for the related access key ID.
  - Replace <the-aws-short-region-code-for-your-bucket> with the short code for the AWS Region of the bucket.
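A credentials file following that pattern would look like this (the [default] profile name is an assumption; use the profile your tooling expects):

```ini
[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>
region = <the-aws-short-region-code-for-your-bucket>
```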
Install the Pinecone Python SDK
You will use the Pinecone Python SDK later to create the Pinecone serverless index and, in an optional step, to help enable RAG chat on the data that will be stored in the index.
Use uv to install the Pinecone Python SDK package, along with the grpc extra to enable the programmatic creation of a Pinecone serverless index later, by running the following command:
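Assuming the SDK is published on PyPI as pinecone, with a grpc extra:

```bash
uv add "pinecone[grpc]"
```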
Step 1: Extract the website data and add the data to the bucket
In this step, you use Firecrawl to extract data from a website and then add the extracted data to the Amazon S3 bucket.
Add your Firecrawl API key to your project
- Set your Firecrawl API key in your project as an environment variable for your code to use. To do this:

  a. In the root of your code project, create a file named .env.

  b. Add the following environment variable to this .env file, and replace <your-firecrawl-api-key> with your Firecrawl API key:
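For example, assuming the variable name FIRECRAWL_API_KEY (the name that the script sketch later in this walkthrough reads):

```
FIRECRAWL_API_KEY=<your-firecrawl-api-key>
```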
- To enable your Firecrawl code to access the environment variable, use uv to install the dotenv package by running the following command:
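The dotenv module is typically installed from the python-dotenv package:

```bash
uv add python-dotenv
```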
- In the .gitignore file in the root of your project, add the following line, to help prevent accidentally checking in your Firecrawl API key (or anything else in the .env file) into any shared code repositories later:
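The line to add is:

```
.env
```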
Create the Python script to extract the website data
- In the .env file, add the following line, which defines an environment variable representing the base URL of the website to crawl. This walkthrough uses a website named Books to Scrape that contains fictitious data (although you can use any accessible website you want):
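For example, assuming the variable name CRAWL_BASE_URL (the name that the script sketch after this list reads) and the Books to Scrape site:

```
CRAWL_BASE_URL=https://books.toscrape.com
```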
- Also in the .env file, add the following line, which defines an environment variable representing the name of the target Amazon S3 bucket for Firecrawl to store the website crawl results in. Replace <the-name-of-your-bucket> with the name of the bucket. Enter only the name of the bucket. Do not prefix the bucket’s name with s3://. Do not include any trailing slash (/) after the bucket name:
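For example, assuming the variable name S3_BUCKET_NAME (also read by the script sketch after this list):

```
S3_BUCKET_NAME=<the-name-of-your-bucket>
```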
- Create a file named firecrawl_extract.py in the root of your project directory, and add code to it along the lines of the sketch shown after this list. The code does the following:
  - Loads from your .env file your Firecrawl API key, the base URL for Firecrawl to use for website crawling, and the S3 bucket name for Firecrawl to send the website crawl results to.
  - Calls a function named main that uses Firecrawl to crawl the website.
  - The main function then calls the save_to_s3 function, which adds the website crawl results to the S3 bucket.
  You can change where the website crawl results are saved in your S3 bucket by changing the full_results_key and page_key variables. For example, you might want to save the website crawl results in a single folder and then keep overwriting those results with new results as they come in, instead of adding new results to separate subfolders.

  You can also change the number of crawled pages by changing the limit argument, output the results as Markdown instead of HTML, and so on. For more information, see the Firecrawl Python SDK documentation.
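The following is a minimal sketch of such a script, assuming the environment variable names shown earlier (FIRECRAWL_API_KEY, CRAWL_BASE_URL, S3_BUCKET_NAME). The exact Firecrawl SDK call signature and return shape vary by SDK version, so check the Firecrawl Python SDK documentation before using it:

```python
# firecrawl_extract.py
# Minimal sketch: crawl a website with Firecrawl and add the results to an S3 bucket.
# Environment variable names and helper function names here are illustrative.

import json
import os
from datetime import datetime, timezone

import boto3
from dotenv import load_dotenv
from firecrawl import FirecrawlApp  # firecrawl-py

load_dotenv()

FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY")
CRAWL_BASE_URL = os.getenv("CRAWL_BASE_URL")
S3_BUCKET_NAME = os.getenv("S3_BUCKET_NAME")


def save_to_s3(crawl_result: dict, bucket_name: str) -> str:
    """Add the website crawl results to the S3 bucket and return the timestamp used."""
    s3 = boto3.client("s3")
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")

    # Save the full crawl results as a single JSON object.
    full_results_key = f"crawls/{timestamp}/full_results.json"
    s3.put_object(
        Bucket=bucket_name,
        Key=full_results_key,
        Body=json.dumps(crawl_result, default=str).encode("utf-8"),
    )

    # Save each crawled page's HTML as its own object under pages/.
    for i, page in enumerate(crawl_result.get("data", [])):
        page_key = f"crawls/{timestamp}/pages/page_{i}.html"
        s3.put_object(
            Bucket=bucket_name,
            Key=page_key,
            Body=(page.get("html") or "").encode("utf-8"),
        )

    return timestamp


def main() -> None:
    app = FirecrawlApp(api_key=FIRECRAWL_API_KEY)

    # firecrawl-py v1-style call; newer SDK versions use keyword arguments
    # (for example, limit=... and scrape_options=...) instead of a params dict.
    # The limit argument caps the number of crawled pages, and formats controls
    # whether HTML, Markdown, or both are returned for each page.
    crawl_result = app.crawl_url(
        CRAWL_BASE_URL,
        params={
            "limit": 10,
            "scrapeOptions": {"formats": ["html"]},
        },
    )

    # This assumes the SDK returns a dict with a 'data' list of pages, each with
    # an 'html' field; newer SDK versions may return model objects instead.
    timestamp = save_to_s3(crawl_result, S3_BUCKET_NAME)
    print(f"Saved crawl results to s3://{S3_BUCKET_NAME}/crawls/{timestamp}/")


if __name__ == "__main__":
    main()
```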
Run the script to extract the data
From the root of your project, use uv to run the script to extract the data from the website by running the following command:
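The command is:

```bash
uv run firecrawl_extract.py
```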
After the script runs, you should see output confirming where the crawl results were saved, where <your-bucket-name> is the name of your bucket, and <timestamp> is the timestamp generated by the script.
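With the sketch script shown earlier, for example, the message looks similar to this:

```
Saved crawl results to s3://<your-bucket-name>/crawls/<timestamp>/
```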
To verify that the data was successfully added, open your Amazon S3 console and look in the crawls folder within your bucket.
Step 2: Create a Pinecone serverless index
In this step, you create a Pinecone serverless index that Unstructured will use to store its results based on the extracted website data that Firecrawl sends to your Amazon S3 bucket.
Add your Pinecone API key to your project
Set your Pinecone API key as an environment variable for your code to use. To do this, add the following environment variable to your .env file, and replace <your-pinecone-api-key> with your Pinecone API key:
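For example, assuming the variable name PINECONE_API_KEY (the name read by the script sketches that follow):

```
PINECONE_API_KEY=<your-pinecone-api-key>
```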
Add the Pinecone index name to your project
Add the following line to your .env file, and replace <the-name-of-your-index> with the name of the serverless index you want to create:
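For example, assuming the variable name PINECONE_INDEX_NAME:

```
PINECONE_INDEX_NAME=<the-name-of-your-index>
```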
Create the Python script to create the Pinecone serverless index
Create a file named pinecone_create_index.py in the root of your project directory and add the following code to it:
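A minimal sketch of such a script, assuming the PINECONE_API_KEY and PINECONE_INDEX_NAME variable names shown earlier and the AWS us-east-1 region for the serverless index; adjust the cloud and region to match your Pinecone project:

```python
# pinecone_create_index.py
# Minimal sketch: create a Pinecone serverless index to receive the workflow's output.
# The cloud provider and region chosen here (AWS, us-east-1) are illustrative.

import os

from dotenv import load_dotenv
from pinecone import ServerlessSpec
from pinecone.grpc import PineconeGRPC as Pinecone

load_dotenv()

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index_name = os.getenv("PINECONE_INDEX_NAME")

# 3072 dimensions matches the text-embedding-3-large model selected later in the
# workflow's Embedder node. vector_type is accepted by recent SDK versions; on
# older versions, omit it (dense is the default).
pc.create_index(
    name=index_name,
    dimension=3072,
    metric="cosine",
    vector_type="dense",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

print(f"Created serverless index '{index_name}'.")
```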
The preceding code creates in your Pinecone account a serverless index with the specified name for the specified cloud provider and within the specified region for that provider. The index is configured to accept vectors with 3072 dimensions with a cosine metric and a dense vector type.
Run the script to create the index
From the root of your project, use uv to run the script to create the Pinecone serverless index by running the following command:
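The command is:

```bash
uv run pinecone_create_index.py
```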
After the script runs, to verify that the index was successfully added, log in to your Pinecone account. Then open your default project, and click Indexes under Database in the sidebar to view the new index that was added.
Step 3: Set up and run the Unstructured ETL+ workflow
In this step, you set up and run the Unstructured ETL+ workflow that will take the extracted website data from your Amazon S3 bucket, process it to turn it into Unstructured elements and metadata, add vector embeddings to the elements, and then add all of this information to your Pinecone serverless index.
Create the source connector
- Sign in to your Unstructured account.
- On the sidebar, click Connectors.
- With Sources selected, click New.
- Enter some name for the source connector, such as s3-firecrawl-source.
- For Provider, click Amazon S3.
- Click Continue.
- For Bucket URI, enter s3://<your-bucket-name>/crawls/<timestamp>/pages/, replacing <your-bucket-name> with the name of your bucket, and <timestamp> with the timestamp generated by the script in Step 1.
- For Authentication Method, select Secret.
- Enter your AWS access key ID and secret access key.
- Click Continue.
- Click Save and Test.
Create the destination connector
- On the sidebar, click Connectors.
- With Destinations selected, click New.
- Enter some name for the destination connector, such as pinecone-firecrawl-destination.
- For Provider, click Pinecone.
- Click Continue.
- For Index Name, enter the name of the Pinecone index you created earlier, such as firecrawl-dense-index.
- Leave Namespace blank. Unstructured will automatically use the default namespace in the index.
- Leave Batch Size set to the default value of 50.
- For API Key, enter your Pinecone API key.
- Click Save and Test.
Create the workflow
- On the sidebar, click Workflows.
- Click New Workflow.
- With Build it Myself selected, click Continue.
- In the workflow designer, click the Source workflow node.
- In the node settings pane, on the Connectors tab, select the name of the source connector you created earlier, such as s3-firecrawl-source.
- Click the Destination workflow node.
- In the node settings pane, on the Connectors tab, select the name of the destination connector you created earlier, such as pinecone-firecrawl-destination.
- Between the Partitioner and Destination nodes, click the + (add node) button.
- Click Transform > Embedder.
- Click the Embedder workflow node.
- In the Settings pane, under Azure OpenAI, select text-embedding-3-large.
- Click any blank space in the workflow designer.
- In the workflow settings pane, click Details.
- For Name, click the pencil (edit) icon and then enter some name for the workflow, such as firecrawl-s3-to-pinecone-workflow. Then click the checkmark (save) icon.
- Switch Active to on.
- Click Save.
Run the workflow as a job
- On the sidebar, click Workflows.
- Next to the name of the workflow you created earlier, click Run.
Monitor the job
- On the sidebar, click Jobs.
- Click the name of the workflow you created earlier.
- Monitor the job’s progress in the job details pane until the status changes to Finished.
View the results
- One way to view the results is to create a file named pinecone_fetch_from_index.py in the root of your project directory and add the following code to it:
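A minimal sketch of such a script, assuming the same environment variable names as earlier; it pages through the index's record IDs, fetches each batch, and prints IDs and metadata without the embeddings:

```python
# pinecone_fetch_from_index.py
# Minimal sketch: print each record's ID and metadata from the Pinecone index.
# Vector embeddings are intentionally not printed.

import os

from dotenv import load_dotenv
from pinecone.grpc import PineconeGRPC as Pinecone

load_dotenv()

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index = pc.Index(os.getenv("PINECONE_INDEX_NAME"))

# index.list() yields batches of record IDs from the index's default namespace.
for id_batch in index.list():
    fetched = index.fetch(ids=list(id_batch))
    for record_id, record in fetched.vectors.items():
        print(f"ID: {record_id}")
        print(f"Metadata: {record.metadata}")
        print()
```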
- From the root of your project, use uv to run the script, as shown in the command after this list. The output will show each record’s ID and metadata. For brevity and performance, the records’ vector embeddings are not included in the output.
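The command is:

```bash
uv run pinecone_fetch_from_index.py
```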