- Create an S3 vector bucket.
- Create a vector index in the bucket.
- Add the contents of one or more source JSON output files that have been generated by Unstructured to the vector index.
- Query the vector index against the contents of the source JSON output files that were added.
Requirements
To use this example, you will need:
- A set of one or more JSON output files that have been generated by Unstructured and stored somewhere on your local development machine. For maximum compatibility with this example, these files must contain vector embeddings that were generated by Amazon Bedrock, by using the Titan Text Embeddings V2 (amazon.titan-embed-text-v2:0) embedding model, with 1024 dimensions. To get these files, you will need:
  - An Unstructured account, as follows:
    - If you do not already have an Unstructured account, sign up for free. After you sign up, you are automatically signed in to your new Unstructured Starter account, at https://platform.unstructured.io. To sign up for a Team or Enterprise account instead, contact Unstructured Sales, or learn more.
    - If you have an Unstructured Starter or Team account and are not already signed in, sign in to your account at https://platform.unstructured.io. For an Enterprise account, see your Unstructured account administrator for instructions, or email Unstructured Support at support@unstructured.io.
  - A workflow that generates vector embeddings and adds them to the JSON output files. Learn how to create a custom workflow and add an Embedder node to that workflow. The destination connector for your workflow must generate JSON output files. These include destination connectors for file storage services such as Databricks Volumes, Google Cloud Storage, OneDrive, and S3. Destination connectors for databases such as Elasticsearch, Kafka, and MongoDB, and vector stores such as Astra DB, Pinecone, and Weaviate, do not generate JSON output files.
  - After your workflow generates the JSON output files, you must copy them from your workflow’s destination location to some location on your local development machine.
- Python installed on your local development machine.
- An AWS account. If you do not already have one, create an AWS account.
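Before continuing, it can help to confirm that your local JSON output files actually contain 1024-dimension embeddings. The following is a minimal sketch and not part of the original example. It assumes that each file holds a JSON array of element objects with an embeddings field; the function name is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path

EXPECTED_DIMENSIONS = 1024  # Titan Text Embeddings V2, as used in this example


def count_embedding_dimensions(directory: str) -> Counter:
    """Tally embedding lengths across all *.json files in a directory.

    Assumes each file is a JSON array of objects with an "embeddings" list.
    Objects without embeddings are tallied under the key 0.
    """
    tally = Counter()
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            elements = json.load(f)
        for element in elements:
            tally[len(element.get("embeddings") or [])] += 1
    return tally


if __name__ == "__main__":
    source_dir = "<source-json-file-path>"
    if source_dir.startswith("<"):
        print("Replace <source-json-file-path> before running.")
    else:
        for dims, count in sorted(count_embedding_dimensions(source_dir).items()):
            note = "" if dims == EXPECTED_DIMENSIONS else "  <-- unexpected"
            print(f"{dims} dimensions: {count} element(s){note}")
```

Any count reported under a dimension other than 1024 suggests that the workflow's Embedder node settings differ from what this example expects.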
Step 1: Create the S3 vector bucket
- Sign in to the AWS Management Console.
- Open the Amazon S3 Console.
- On the sidebar, click Vector buckets.
- Click Create vector bucket.
- For Vector bucket name, enter some name for your bucket.
- For Encryption, select an encryption method, or leave the default.
- Click Create vector bucket.
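If you would rather script this step than use the console, the following sketch shows one possible equivalent that uses the Boto3 s3vectors client. It assumes that Boto3 is installed and that credentials are already configured, as described in Step 3. The placeholder names and the function name are hypothetical, and the bucket is created with the service's default encryption.

```python
def create_vector_bucket(client, bucket_name: str) -> None:
    """Create an S3 vector bucket, leaving encryption at the service default."""
    client.create_vector_bucket(vectorBucketName=bucket_name)


if __name__ == "__main__":
    bucket_name = "<bucket-name>"  # hypothetical placeholder
    region = "<region-id>"         # hypothetical placeholder, for example us-east-1
    if bucket_name.startswith("<"):
        print("Replace the placeholders before running.")
    else:
        # Imported here so the helper above can be exercised without AWS access.
        import boto3

        create_vector_bucket(boto3.client("s3vectors", region_name=region), bucket_name)
        print(f"Created vector bucket {bucket_name}.")
```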
Step 2: Add the vector index to the bucket
- With the list of vector buckets showing from the previous step, click the name of the bucket that you just created.
- Click Create vector index.
- For Vector index name, enter some name for your index.
- For Dimension, enter the number of dimensions that Unstructured generated for your vector embeddings. For example, for the Titan Text Embeddings V2 (amazon.titan-embed-text-v2:0) embedding model, enter 1024. If you are not sure how many dimensions to enter, see your workflow’s Embedder node settings.
- Select the appropriate Distance metric for your embedding model. For example, for the Titan Text Embeddings V2 (amazon.titan-embed-text-v2:0) embedding model, select Cosine. If you are not sure which distance metric to use, see your embedding model’s documentation.
- Expand Additional settings.
- Within Metadata configuration, under Non-filterable metadata, click Add key.
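- For Key, enter text. This stores the text field from each object in the JSON output files alongside its vector in the index, so that the original text can be returned with query results. Keys marked as non-filterable are returned in results but cannot be used in query filters.
- Click Create vector index.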
- After the vector index is created, copy the value of the index’s Amazon Resource Name (ARN), as you will need it in later steps. This ARN takes the format arn:aws:s3vectors:<region-id>:<account>:bucket/<bucket-name>/index/<index-name>.
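The console settings in this step can also be expressed in code. The following sketch is one possible equivalent that uses the Boto3 s3vectors client, assuming that Boto3 is installed and credentials are configured as described in Step 3. The function name and the default argument values are illustrative choices that mirror the Titan Text Embeddings V2 settings above.

```python
def create_vector_index(client, bucket_name: str, index_name: str,
                        dimension: int = 1024, distance_metric: str = "cosine") -> None:
    """Create a vector index whose "text" metadata key is non-filterable."""
    client.create_index(
        vectorBucketName=bucket_name,
        indexName=index_name,
        dataType="float32",
        dimension=dimension,              # must match the embedding length
        distanceMetric=distance_metric,   # "cosine" or "euclidean"
        metadataConfiguration={"nonFilterableMetadataKeys": ["text"]},
    )


if __name__ == "__main__":
    bucket_name = "<bucket-name>"
    index_name = "<index-name>"
    region = "<region-id>"
    if bucket_name.startswith("<") or index_name.startswith("<"):
        print("Replace the placeholders before running.")
    else:
        # Imported here so the helper above can be exercised without AWS access.
        import boto3

        client = boto3.client("s3vectors", region_name=region)
        create_vector_index(client, bucket_name, index_name)
        print(f"Created vector index {index_name} in {bucket_name}.")
```

The dimension and distance metric are set when the index is created, so it is worth double-checking them against your Embedder node settings first.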
Step 3: Add the source JSON output files’ contents to the vector index
- In your local Python virtual environment, install the boto3 library. The uuid module is part of the Python standard library and does not need to be installed separately.
- Set up Boto3 credentials for your AWS account. The following steps assume that you have set up your Boto3 credentials outside of the following code, for example by setting environment variables or configuring a shared credentials file. One approach to getting and setting up Boto3 credentials is to create an AWS access key and secret access key and then use the AWS Command Line Interface (AWS CLI) to set up your credentials on your local development machine.
- Add the following code to a Python script file in your virtual environment, replacing the following placeholders:
  - Replace <source-json-file-path> with the path to the directory that contains your JSON output files.
  - Replace <index-arn> with the ARN of the vector index that you created previously in Step 2.
  - Replace <index-region-short-id> with the short ID of the region where your vector index is located, for example us-east-1.
- Run the script to add the JSON output files’ contents to the vector index. Each object in each JSON output file is added as a vector entry in the vector index.
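As a minimal sketch of what a script for this step might look like, the following code reads each JSON output file, converts every element that has embeddings into a vector entry, and sends the entries to the index with the Boto3 put_vectors operation. The helper names and the batch size are assumptions, not the original example's code. The sketch uses each element's element_id as the vector key and falls back to a random UUID when no element_id is present.

```python
import json
import uuid
from pathlib import Path

SOURCE_DIR = "<source-json-file-path>"
INDEX_ARN = "<index-arn>"
REGION = "<index-region-short-id>"
BATCH_SIZE = 500  # assumed per-request limit; lower it if requests are rejected


def element_to_vector(element: dict) -> dict:
    """Convert one Unstructured element into a put_vectors entry."""
    return {
        "key": element.get("element_id") or str(uuid.uuid4()),
        "data": {"float32": [float(x) for x in element["embeddings"]]},
        "metadata": {"text": element.get("text", "")},
    }


def load_vectors(directory: str) -> list:
    """Load every element that has embeddings from all *.json files."""
    vectors = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            elements = json.load(f)
        vectors.extend(element_to_vector(e) for e in elements if e.get("embeddings"))
    return vectors


def add_vectors(client, index_arn: str, vectors: list, batch_size: int = BATCH_SIZE) -> int:
    """Send the vectors to the index in batches and return how many were sent."""
    for start in range(0, len(vectors), batch_size):
        client.put_vectors(indexArn=index_arn, vectors=vectors[start:start + batch_size])
    return len(vectors)


if __name__ == "__main__":
    if INDEX_ARN.startswith("<") or SOURCE_DIR.startswith("<"):
        print("Replace the placeholders before running.")
    else:
        # Imported here so the helpers above can be exercised without AWS access.
        import boto3

        client = boto3.client("s3vectors", region_name=REGION)
        count = add_vectors(client, INDEX_ARN, load_vectors(SOURCE_DIR))
        print(f"Added {count} vector entries.")
```

Because the key comes from element_id where available, running the sketch again on the same files overwrites existing entries rather than adding duplicates. Elements that share an element_id would also overwrite one another.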
Step 4: Query the vector index
- In your local Python virtual environment, install the numpy library.
- Add the following code to another Python script file in your virtual environment, replacing the following placeholders:
  - Replace <index-arn> with the ARN of the vector index that you created previously in Step 2.
  - Replace <index-region-short-id> with the short ID of the region where your vector index is located, for example us-east-1.
  - Replace <sentence-to-embed> with the search text that you want to embed for the query.
- Run the script to query the vector index and see the query results.
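A query script for this step could be sketched as follows. It embeds the search text with the same Titan Text Embeddings V2 model through the Amazon Bedrock runtime, then passes the resulting vector to the Boto3 query_vectors operation. The helper names and the top_k default are assumptions. The sketch also assumes that your AWS account has access to the model in the same region as the index. It passes the embedding as a plain Python list, so it does not itself use numpy.

```python
import json

INDEX_ARN = "<index-arn>"
REGION = "<index-region-short-id>"
QUERY_TEXT = "<sentence-to-embed>"
MODEL_ID = "amazon.titan-embed-text-v2:0"


def embed_text(bedrock_client, text: str, dimensions: int = 1024) -> list:
    """Return the Titan Text Embeddings V2 embedding for the given text."""
    response = bedrock_client.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps({"inputText": text, "dimensions": dimensions}),
    )
    return json.loads(response["body"].read())["embedding"]


def query_index(s3vectors_client, index_arn: str, embedding: list, top_k: int = 5) -> list:
    """Return the top_k nearest entries, with their distance and metadata."""
    response = s3vectors_client.query_vectors(
        indexArn=index_arn,
        topK=top_k,
        queryVector={"float32": embedding},
        returnMetadata=True,
        returnDistance=True,
    )
    return response["vectors"]


if __name__ == "__main__":
    if INDEX_ARN.startswith("<") or QUERY_TEXT.startswith("<"):
        print("Replace the placeholders before running.")
    else:
        # Imported here so the helpers above can be exercised without AWS access.
        import boto3

        bedrock = boto3.client("bedrock-runtime", region_name=REGION)
        s3vectors = boto3.client("s3vectors", region_name=REGION)
        embedding = embed_text(bedrock, QUERY_TEXT)
        for match in query_index(s3vectors, INDEX_ARN, embedding):
            text = match.get("metadata", {}).get("text", "")
            print(f"{match.get('distance')}\t{match['key']}\t{text[:80]}")
```

The query embedding must come from the same model and use the same number of dimensions as the embeddings stored in the index, otherwise the distances are not meaningful.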
Appendix: Additional operations
Use the following code examples to perform additional vector index and vector bucket operations.
List all entries in a vector index
Replace the following placeholders:
- Replace <index-arn> with the ARN of the vector index that you created earlier in Step 2.
- Replace <index-region-short-id> with the short ID of the region where your vector index is located, for example us-east-1.
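One way to list every entry is to call the Boto3 list_vectors operation repeatedly, following the nextToken value until no more pages remain. The following is a minimal sketch under that assumption; the generator name and page size are illustrative.

```python
INDEX_ARN = "<index-arn>"
REGION = "<index-region-short-id>"


def iter_vectors(client, index_arn: str, page_size: int = 500):
    """Yield every entry in the index, following nextToken pagination."""
    kwargs = {"indexArn": index_arn, "maxResults": page_size, "returnMetadata": True}
    while True:
        response = client.list_vectors(**kwargs)
        yield from response.get("vectors", [])
        token = response.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token


if __name__ == "__main__":
    if INDEX_ARN.startswith("<"):
        print("Replace the placeholders before running.")
    else:
        # Imported here so the helper above can be exercised without AWS access.
        import boto3

        client = boto3.client("s3vectors", region_name=REGION)
        for entry in iter_vectors(client, INDEX_ARN):
            text = entry.get("metadata", {}).get("text", "")
            print(f"{entry['key']}\t{text[:80]}")
```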
Delete all entries from a vector index
This operation will permanently delete all vector entries in the vector index. This operation cannot be undone.
- Replace <index-arn> with the ARN of the vector index that you created earlier in Step 2.
- Replace <index-region-short-id> with the short ID of the region where your vector index is located, for example us-east-1.
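A possible approach, sketched below, is to collect every key with list_vectors first and then remove the keys in batches with delete_vectors. The helper names and the batch size are assumptions. Collecting the keys before deleting avoids changing the index while it is still being paged through.

```python
INDEX_ARN = "<index-arn>"
REGION = "<index-region-short-id>"
BATCH_SIZE = 500  # assumed per-request limit; lower it if requests are rejected


def list_keys(client, index_arn: str) -> list:
    """Return the keys of all entries in the index."""
    keys, kwargs = [], {"indexArn": index_arn}
    while True:
        response = client.list_vectors(**kwargs)
        keys.extend(v["key"] for v in response.get("vectors", []))
        token = response.get("nextToken")
        if not token:
            return keys
        kwargs["nextToken"] = token


def delete_all_vectors(client, index_arn: str, batch_size: int = BATCH_SIZE) -> int:
    """Permanently delete every entry in the index; returns the number deleted."""
    keys = list_keys(client, index_arn)
    for start in range(0, len(keys), batch_size):
        client.delete_vectors(indexArn=index_arn, keys=keys[start:start + batch_size])
    return len(keys)


if __name__ == "__main__":
    if INDEX_ARN.startswith("<"):
        print("Replace the placeholders before running.")
    else:
        # Imported here so the helpers above can be exercised without AWS access.
        import boto3

        client = boto3.client("s3vectors", region_name=REGION)
        print(f"Deleted {delete_all_vectors(client, INDEX_ARN)} vector entries.")
```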
Delete a vector index
This operation will permanently delete a vector index. This operation cannot be undone.
- Replace <index-arn> with the ARN of the vector index that you created earlier in Step 2.
- Replace <index-region-short-id> with the short ID of the region where your vector index is located, for example us-east-1.
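Deleting an index comes down to a single Boto3 delete_index call. The sketch below adds one precaution that is an assumption of this sketch rather than a requirement of the service: it first checks that the value matches the index ARN format shown in Step 2, so that a bucket ARN or an unreplaced placeholder is rejected before anything is deleted.

```python
import re

INDEX_ARN = "<index-arn>"
REGION = "<index-region-short-id>"

# Follows the format arn:aws:s3vectors:<region-id>:<account>:bucket/<bucket-name>/index/<index-name>
INDEX_ARN_PATTERN = re.compile(r"^arn:aws[^:]*:s3vectors:[^:]+:[^:]+:bucket/[^/]+/index/[^/]+$")


def delete_index(client, index_arn: str) -> None:
    """Permanently delete the vector index identified by index_arn."""
    if not INDEX_ARN_PATTERN.match(index_arn):
        raise ValueError(f"Not a vector index ARN: {index_arn!r}")
    client.delete_index(indexArn=index_arn)


if __name__ == "__main__":
    if INDEX_ARN.startswith("<"):
        print("Replace the placeholders before running.")
    else:
        # Imported here so the helper above can be exercised without AWS access.
        import boto3

        delete_index(boto3.client("s3vectors", region_name=REGION), INDEX_ARN)
        print("Deleted the vector index.")
```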
Delete a vector bucket
This operation will permanently delete a vector bucket. This operation cannot be undone.
- Replace <bucket-arn> with the ARN of the vector bucket that you created earlier in Step 1. To get the ARN, do the following:
  - In the Amazon S3 console, on the sidebar, click Vector buckets.
  - Next to the name of the vector bucket that you want to delete, click the copy button next to the bucket’s Amazon Resource Name (ARN).
- Replace <index-region-short-id> with the short ID of the region where your vector bucket is located, for example us-east-1.
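A sketch of this operation with the Boto3 delete_vector_bucket call follows. The function name is hypothetical. The sketch assumes that any vector indexes in the bucket have already been deleted, since the request is likely to be rejected otherwise.

```python
BUCKET_ARN = "<bucket-arn>"
REGION = "<index-region-short-id>"


def delete_vector_bucket(client, bucket_arn: str) -> None:
    """Permanently delete the vector bucket identified by bucket_arn."""
    if "/index/" in bucket_arn:
        # Guard against passing an index ARN by mistake.
        raise ValueError(f"Expected a bucket ARN, got an index ARN: {bucket_arn!r}")
    client.delete_vector_bucket(vectorBucketArn=bucket_arn)


if __name__ == "__main__":
    if BUCKET_ARN.startswith("<"):
        print("Replace the placeholders before running.")
    else:
        # Imported here so the helper above can be exercised without AWS access.
        import boto3

        delete_vector_bucket(boto3.client("s3vectors", region_name=REGION), BUCKET_ARN)
        print("Deleted the vector bucket.")
```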