Personally identifiable information (PII) detection is an important part of an organization’s overall strategy for minimizing potential harm from misuse of, or unauthorized access to, individuals’ private data. Detecting PII is a first step toward preventing identity theft, maintaining privacy, and building trust with customers and users. Organizations might also need to comply with various data protection regulations, making PII detection a crucial part of the organization’s legal compliance framework.

This hands-on walkthrough demonstrates how to use the Microsoft Presidio SDK to identify and then redact PII in Unstructured JSON output files. Presidio can identify and then redact or anonymize entities in text and images, such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data, and more. In this walkthrough, you will use Python code to connect to a folder within an Amazon S3 bucket that already contains a collection of Unstructured JSON output files. For each file, your code will use Presidio to identify PII that matches specific patterns and then redact that PII. Your code will then write the Unstructured JSON output files’ contents, with the PII redacted, to a separate folder within an S3 bucket. You can then compare the original JSON files generated by Unstructured to the redacted files generated by Presidio to see the impact of using Presidio for PII detection.

Redaction directly removes or obscures PII, for example by replacing names with placeholders or by blacking out sensitive text. The example code in this walkthrough replaces the detected PII characters within text strings with placeholder text. Another technique, anonymization, can involve redaction but also includes techniques such as generalization (for example, replacing specific dates with age ranges), suppression (removing entire data fields), and data masking (replacing data with random values while preserving the format). Presidio supports both redaction and anonymization. Although the term “anonymize” appears throughout this walkthrough’s example code, only redaction is explored here; Presidio’s broader anonymization techniques are not.
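To make the distinction concrete, the following minimal sketch (separate from this walkthrough’s main script) contrasts Presidio’s replace operator, which this walkthrough uses for redaction, with its mask operator, one form of data masking. The sample text and placeholder values are illustrative assumptions.

    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine
    from presidio_anonymizer.entities import OperatorConfig

    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()

    text = "Call Jane Doe at 212-555-0100."  # Hypothetical sample text.
    results = analyzer.analyze(text=text, language="en")

    # Redaction: replace each detected entity with a fixed placeholder.
    # The "DEFAULT" key applies the operator to all detected entity types.
    redacted = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={"DEFAULT": OperatorConfig(operator_name="replace", params={"new_value": "<REDACTED>"})}
    )
    print(redacted.text)  # For example: "Call <REDACTED> at <REDACTED>."

    # Masking: overwrite detected characters with a masking character,
    # preserving each value's length and position in the text.
    masked = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={"DEFAULT": OperatorConfig(operator_name="mask", params={"masking_char": "*", "chars_to_mask": 100, "from_end": False})}
    )
    print(masked.text)  # For example: "Call ******** at ************."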
The Microsoft Presidio SDK is not offered, maintained, or supported by Unstructured. For questions or issues related to Presidio, see the following resources:
  • For general discussions, use the discussion board in the Presidio repository on GitHub.
  • For questions or issues, file an issue in the Presidio repository on GitHub.
  • For other matters, email presidio@microsoft.com.
The example code in this walkthrough is provided as a general reference only. It is not intended to substitute for a complete PII detection strategy. The example code is not guaranteed to detect and redact all possible PII. For instance, the code looks for PII only in text strings. It does not look for PII in non-text fields such as image_base64 and orig_elements within Unstructured metadata, and it does not look for PII in images.

Requirements

To use this example, you will need:
  • An Unstructured account, as follows:
    • If you do not already have an Unstructured account, go to https://unstructured.io/contact and fill out the online form to indicate your interest.
    • If you already have an Unstructured account, sign in by using the URL of the sign-in page that Unstructured provided to you when your Unstructured account was created. After you sign in, the Unstructured user interface (UI) appears, and you can start using it right away. If you do not have this URL, contact Unstructured Sales at sales@unstructured.io.
  • A set of one or more Unstructured JSON output files that have been generated by Unstructured and stored in a folder within an Amazon S3 bucket that you have access to. One way to generate these files is to use an Unstructured workflow that relies on an S3 destination connector to store these Unstructured JSON output files. Learn how to create an S3 destination connector and create a custom workflow that uses your S3 destination connector.
  • Python installed on your local development machine.

Create and run the Python code

  1. In your local Python virtual environment, install the following libraries:
    • boto3
    • presidio_analyzer
    • presidio_anonymizer
    For example, if you are using uv, you can install these libraries into your local uv virtual environment with the following command:
    uv add boto3 presidio_analyzer presidio_anonymizer
    
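    If you use pip instead of uv, the equivalent command is:
    pip install boto3 presidio_analyzer presidio_anonymizer
    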
  2. In your local Python virtual environment, install the appropriate natural language processing (NLP) models for spaCy, which Presidio relies on for various internal tasks related to named entity recognition (NER) and PII identification. To find the appropriate model for your use case, do the following:
    a. Go to spaCy Trained Models & Pipelines.
    b. On the sidebar, click your target language, for example English.
    c. Click the model you want to use, for example en_core_web_lg.
    d. Click Release details.
    e. At the bottom of the release details page, in the Assets section, right-click the filename ending in .whl, for example en_core_web_lg-3.8.0-py3-none-any.whl, and select Copy Link Address from the context menu.
    f. Install the model into your local Python virtual environment by using the model’s name and the .whl URL that you just copied. For example, if you are using uv, you can install the preceding model with a command such as the following:
    uv pip install en_core_web_lg@https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl
    
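    To confirm that the model installed correctly, you can optionally try loading it with spaCy. This is a sanity check only; the model name below assumes that you installed en_core_web_lg:
    import spacy

    # spacy.load raises an OSError if the named model is not installed.
    nlp = spacy.load("en_core_web_lg")
    print(nlp("Hello, world!"))
    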
  3. Set up Boto3 credentials for your AWS account. The following steps assume that you have set up your Boto3 credentials from outside of the following code, such as by setting environment variables or configuring a shared credentials file. One approach to getting and setting up Boto3 credentials is to create an AWS access key and secret access key and then use the AWS Command Line Interface (AWS CLI) to set up your credentials on your local development machine.
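    Before you run the main script, you can optionally verify that Boto3 can find your credentials by making a simple AWS Security Token Service (STS) call, as in the following sketch:
    import boto3

    # get_caller_identity raises an error if no valid credentials are found.
    print(boto3.client('sts').get_caller_identity()['Account'])
    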
  4. Add the following code to a Python script file in your virtual environment, replacing the following placeholders:
    • Replace <input-bucket-name> with the name of the Amazon S3 bucket that contains your original Unstructured JSON files. This is the same bucket that you used for your S3 destination connector.
    • Replace <input-folder-prefix> with the path to the folder within the input bucket that contains your original Unstructured JSON files.
    • Replace <output-bucket-name> with the name of the S3 bucket that will receive copies of your Unstructured JSON files’ contents with the PII redacted. This can be the same bucket as the input bucket or a different bucket.
    • Replace <output-folder-prefix> with the path to the folder within the output bucket that will receive those redacted copies. This must not be the same folder as the input folder.
    • Replace <bucket-region-short-id> with the short ID of the region where your buckets are located, for example us-east-1.
    For the operators variable, the code specifies a list of operators for built-in Presidio entities. These operators look for common entities such as credit card numbers, email addresses, phone numbers, and more. You can remove any entities from this list that you do not want your code to look for, add operators for additional built-in entities, or add your own custom entities (see the custom recognizer sketch after the script).
    import boto3
    import json
    
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine
    from presidio_anonymizer.entities import OperatorConfig
    
    operators = {
        "CREDIT_CARD": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_CREDIT_CARD>'}),
        "CRYPTO": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_CRYPTO>'}),
        "EMAIL_ADDRESS": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_EMAIL_ADDRESS>'}),
        "IBAN_CODE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_IBAN_CODE>'}),
        "IP_ADDRESS": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_IP_ADDRESS>'}),
        "NRP": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_NRP>'}),
        "LOCATION": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_LOCATION>'}), 
        "PERSON": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_PERSON>'}),
        "PHONE_NUMBER": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_PHONE_NUMBER>'}),
        "MEDICAL_LICENSE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_MEDICAL_LICENSE>'}),
        "URL": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_URL>'}),
        "US_BANK_NUMBER": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_BANK_NUMBER>'}),
        "US_DRIVER_LICENSE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_DRIVER_LICENSE>'}),
        "US_ITIN": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_ITIN>'}),
        "US_PASSPORT": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_PASSPORT>'}),
        "US_SSN": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_SSN>'})
    }
    
    # Recursively check for string values in the provided JSON object (in this case, 
    # the "metadata" field of the JSON object) and redact them 
    # as appropriate.
    def check_string_values(obj, analyzer, anonymizer):
        if isinstance(obj, dict):
            for key, value in obj.items():
                # Skip analyzing Base64-encoded fields.
                if key == 'image_base64' or key == 'orig_elements':
                    pass
                elif isinstance(value, str):
                    anonymized_results = anonymizer.anonymize(
                        text=value,
                        analyzer_results=analyzer.analyze(text=value, language="en"),
                        operators=operators
                    )
                    # Write the redacted text back into the object. Assigning
                    # to the loop variable "value" alone would not update "obj".
                    obj[key] = anonymized_results.text
                # Recurse through nested "metadata" fields.
                elif isinstance(value, dict):
                    check_string_values(value, analyzer, anonymizer)
                # Skip analyzing other non-string fields, such as lists and numbers.
                else:
                    pass
        return obj
    
    def main():
        s3_input_bucket_name    = '<input-bucket-name>'
        s3_input_folder_prefix  = '<input-folder-prefix>'
        s3_output_bucket_name   = '<output-bucket-name>'
        s3_output_folder_prefix = '<output-folder-prefix>'
        s3_bucket_region        = '<bucket-region-short-id>'
    
        s3_client = boto3.client('s3')
    
        # Load the JSON files from the input folder.
        # Normalize the input folder prefix to ensure it ends with '/'.
        if not s3_input_folder_prefix.endswith('/'):
            s3_input_folder_prefix += '/'
    
        paginator = s3_client.get_paginator('list_objects_v2')
        page_iterator = paginator.paginate(
            Bucket=s3_input_bucket_name, 
            Prefix=s3_input_folder_prefix
        )
        files = []
    
        # Get the list of file keys from the input folder to analyze.
        # A file's key is the full path to the file within the bucket.
        # For example, if the input folder's name is "original" and the 
        # input file's name is "file1.json", the file's key is 
        # "original/file1.json".
        # There could be multiple "pages" of file listings, so loop
        # through each page to ensure that no files are missed.
        for page in page_iterator:
            # "Contents" is missing if the folder is empty or the 
            # intended prefix is not found.
            if 'Contents' in page:
                for obj in page['Contents']:
                    key = obj['Key']
                    if not key.endswith('/'):  # Skip if it's a folder placeholder.
                        files.append(key)
                        print(f"Found file: {s3_input_bucket_name}/{key}")
    
        analyzer = AnalyzerEngine()
        anonymizer = AnonymizerEngine()
        s3_resource = boto3.resource('s3')
    
        # For each JSON file to analyze, load the JSON data.
        for key in files:
            print(f"Analyzing file: {s3_input_bucket_name}/{key}")
            content_object = s3_resource.Object(
                bucket_name=s3_input_bucket_name, 
                key=key 
            )
    
            file_content = content_object.get()['Body'].read().decode('utf-8')  # Bytes to text.
            json_data = json.loads(file_content) # Text to JSON.
    
            # For each element in the JSON data...
            for element in json_data:
                print(f"    Analyzing element with ID: {element['element_id']} in file {s3_input_bucket_name}/{key}")
                # If there is a "text" field...
                if 'text' in element:
                    # ...get the text content...
                    text_element = element['text']
                    # ...and analyze and redact the text content as appropriate.
                    anonymized_results = anonymizer.anonymize(
                        text=text_element,
                        analyzer_results=analyzer.analyze(text=text_element, language="en"),
                        operators=operators
                    )
                    element['text'] = anonymized_results.text
                # If there is a "metadata" field...
                if 'metadata' in element:
                    # ...get the metadata content...
                    metadata_element = element['metadata']
                    # ...and analyze and redact the metadata content as appropriate.
                    element['metadata'] = check_string_values(metadata_element, analyzer, anonymizer)
    
            # Get the filename from the key.
            filename = key.split(s3_input_folder_prefix)[1]
    
            # Normalize the output folder prefix to ensure it ends with '/'.
            if not s3_output_folder_prefix.endswith('/'):
                s3_output_folder_prefix += '/'
            
            # Then save the JSON data with its redactions to the output folder.
            print(f"Saving file: {s3_output_bucket_name}/{s3_output_folder_prefix}{filename}")
            s3_client.put_object(
                Bucket=s3_output_bucket_name,
                Key=f"{s3_output_folder_prefix}{filename}",
                Body=json.dumps(obj=json_data, indent=4).encode('utf-8')
            )
    
    if __name__ == "__main__":
        main()
    
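    To add a custom entity to the operators list, pair a new operator entry with a custom Presidio recognizer. The following minimal sketch, separate from the preceding script, registers a hypothetical EMPLOYEE_ID entity; its name, regular expression, and score are illustrative assumptions:
    from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer
    from presidio_anonymizer import AnonymizerEngine
    from presidio_anonymizer.entities import OperatorConfig

    # Hypothetical pattern: "EMP-" followed by six digits, for example "EMP-123456".
    employee_id_recognizer = PatternRecognizer(
        supported_entity="EMPLOYEE_ID",
        patterns=[Pattern(name="employee_id_pattern", regex=r"EMP-\d{6}", score=0.5)]
    )

    analyzer = AnalyzerEngine()
    # Register the custom recognizer alongside Presidio's built-in recognizers.
    analyzer.registry.add_recognizer(employee_id_recognizer)

    anonymizer = AnonymizerEngine()
    text = "Employee EMP-123456 approved the request."
    results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer.analyze(text=text, language="en"),
        operators={"EMPLOYEE_ID": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_EMPLOYEE_ID>'})}
    )
    print(results.text)  # "Employee <REDACTED_EMPLOYEE_ID> approved the request."
    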
  5. Run the Python script.
  6. Go to the output folder in S3 and explore the generated files, searching for the <REDACTED_ placeholders in the generated files’ contents.
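    For example, an element whose original text was "Contact Jane Doe at jane@example.com" (a hypothetical value) might appear in the output as "Contact <REDACTED_PERSON> at <REDACTED_EMAIL_ADDRESS>".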