# PII detection

Detecting personally identifiable information (PII) is an important part of an organization's overall strategy for
minimizing potential harm from misuse of, or unauthorized access to, individuals' private data. Detecting PII is a first step
toward helping prevent identity theft, maintaining privacy, and building trust with customers and users. Organizations
might also need to comply with various data protection regulations, making PII detection a crucial part of the
organization's legal compliance framework.

This hands-on walkthrough demonstrates how to use the [Microsoft Presidio SDK](https://microsoft.github.io/presidio/)
to identify and then redact PII in Unstructured JSON output files. Presidio can identify
and then redact or anonymize entities in text and images, including credit card numbers, names, locations, social security numbers,
bitcoin wallets, US phone numbers, financial data, and more.

In this walkthrough, you will use Python code to connect to a folder within an Amazon S3 bucket that already contains a
collection of Unstructured JSON output files. For each file, your code will use Presidio to identify PII that matches
specific patterns and then redact the identified PII. Your code will then write the files' redacted contents
to a separate folder within an S3 bucket. You can then compare the original JSON output files generated by
Unstructured to the redacted versions generated by Presidio and see the impact of using Presidio for PII detection.

*Redaction* directly removes or obscures PII, such as replacing names with placeholders, or blacking out sensitive text.
The example code from this walkthrough replaces the detected PII characters within text strings with placeholder text. Another technique, *anonymization*,
can involve redaction, but also includes techniques such as *generalization* (for example, replacing specific dates
with age ranges), *suppression* (removing entire data fields), and *data masking* (replacing data with random values
while preserving the format). Presidio supports both redaction and anonymization. Although the term "anonymize" is
visible throughout this walkthrough's example code, Presidio's available anonymization techniques are not explored here—only redaction.
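As a minimal illustration of replace-style redaction, the following sketch uses a simple regular expression as a stand-in for Presidio's email recognizer. This is illustrative only; Presidio itself, which you will use later in this walkthrough, uses more sophisticated recognizers.

```python theme={null}
import re

# Toy stand-in for a PII recognizer: find email-like spans and
# replace each one with a fixed placeholder string.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact_emails(text):
    return EMAIL_RE.sub("<REDACTED_EMAIL_ADDRESS>", text)

print(redact_emails("Contact jane.doe@example.com for details."))
# Contact <REDACTED_EMAIL_ADDRESS> for details.
```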

<Note>
  The Microsoft Presidio SDK is not offered, maintained, or supported by Unstructured. For questions or issues
  related to Presidio, see the following resources:

  * For general discussions, use the [discussion board](https://github.com/microsoft/presidio/discussions) in the Presidio repository on GitHub.
  * For questions or issues, file an [issue](https://github.com/microsoft/presidio/issues) in the Presidio repository on GitHub.
  * For other matters, email [presidio@microsoft.com](mailto:presidio@microsoft.com).

  The example code in this walkthrough is provided as a general reference only. It is not intended to substitute
  for a complete PII detection strategy.

  The example code in this walkthrough is not guaranteed to detect and redact all possible PII. For instance, the code
  looks for PII only in text strings. It does not look for PII in non-text fields such as `image_base64` and `orig_elements`
  within Unstructured metadata, and it does not look for PII in images.
</Note>

## Requirements

To use this example, you will need:

* An Unstructured account, as follows:

  1. If you do not already have an Unstructured account, [sign up for free](https://unstructured.io/?modal=try-for-free).
     After you sign up, you are automatically signed in to your new Unstructured **Let's Go** account, at [https://platform.unstructured.io](https://platform.unstructured.io).

     <Note>
       To sign up for a **Business** account instead, [contact Unstructured Sales](https://unstructured.io/?modal=contact-sales), or [learn more](/ui/overview#how-am-i-billed%3F).
     </Note>

  2. If you have an Unstructured **Let's Go**, **Pay-As-You-Go**, or **Business SaaS** account and are not already signed in, sign in to your account at [https://platform.unstructured.io](https://platform.unstructured.io).

     <Note>
       For other types of **Business** accounts, see your Unstructured account administrator for sign-in instructions,
       or email Unstructured Support at [support@unstructured.io](mailto:support@unstructured.io).
     </Note>

* A set of one or more Unstructured JSON output files that have been generated by Unstructured and stored in a folder within an
  Amazon S3 bucket that you have access to. One way to generate these files is to use an Unstructured workflow that
  relies on an S3 destination connector to store these Unstructured JSON output files. Learn how to [create an S3 destination connector](/ui/destinations/s3) and
  [create a custom workflow](/ui/workflows#create-a-custom-workflow) that uses your S3 destination connector.

* Python installed on your local development machine.

## Create and run the Python code

1. In your local Python virtual environment, install the following libraries:

   * `boto3`
   * `presidio_analyzer`
   * `presidio_anonymizer`

   For example, if you are using [uv](https://docs.astral.sh/uv/), you can install these libraries into your local `uv`
   virtual environment with the following command:

   ```bash theme={null}
   uv add boto3 presidio_analyzer presidio_anonymizer
   ```

2. In your local Python virtual environment, install the appropriate natural language processing (NLP) models for
   [spaCy](https://spacy.io/), which Presidio relies on for various internal tasks related to named entity recognition (NER) and
   PII identification.

   To find the appropriate model for your use case, do the following:

   a. Go to [spaCy Trained Models & Pipelines](https://spacy.io/models).<br />
   b. On the sidebar, click your target language, for example **English**.<br />
   c. Click the model you want to use, for example **en\_core\_web\_lg**.<br />
   d. Click **Release details**.<br />
   e. At the bottom of the release details page, in the **Assets** section, right-click the filename ending in `.whl`, for
   example **en\_core\_web\_lg-3.8.0-py3-none-any.whl**, and select **Copy Link Address** from the context menu.<br />
   f. Install the model into your local Python virtual environment by using the model's name and the URL that you just copied.
   For example, if you are using `uv`, you can install the preceding model with a command such as the following:<br />

   ```bash theme={null}
   uv pip install en_core_web_lg@https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl
   ```

3. [Set up Boto3 credentials](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for your AWS account.
   The code that follows assumes that you have set up your Boto3 credentials outside of the code itself, for example by setting
   [environment variables](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#environment-variables) or by
   configuring a [shared credentials file](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#shared-credentials-file).

   One approach to getting and setting up Boto3 credentials is to [create an AWS access key and secret access key](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey)
   and then use the [AWS Command Line Interface (AWS CLI)](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
   to [set up your credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-methods) on your local development machine.
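
   For example, after you run `aws configure` with the AWS CLI, the shared credentials file at `~/.aws/credentials` looks similar to the following (placeholder values shown):

   ```ini theme={null}
   [default]
   aws_access_key_id = <your-access-key-id>
   aws_secret_access_key = <your-secret-access-key>
   ```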

   <iframe width="560" height="315" src="https://www.youtube.com/embed/MoFTaGJE65Q" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen />

4. Add the following code to a Python script file in your virtual environment, replacing the following placeholders:

   * Replace `<input-bucket-name>` with the name of the Amazon S3 bucket that contains your original Unstructured JSON files. This is the same
     bucket that you used for your S3 destination connector.
   * Replace `<input-folder-prefix>` with the path to the folder within the input bucket that contains your original Unstructured JSON files.
   * Replace `<output-bucket-name>` with the name of the S3 bucket that will contain copies of the contents of your Unstructured JSON files,
     with the redacted content within those files' copies. This can be the same bucket as the input bucket, or a different bucket.
   * Replace `<output-folder-prefix>` with the path to the folder within the output bucket that will contain copies of the contents of your Unstructured
     JSON files, with the redacted content within those files' copies. This must not be the same folder as the input folder.
   * Replace `<bucket-region-short-id>` with the short ID of the region where your buckets are located, for example `us-east-1`.

   The `operators` variable specifies a list of operators for built-in Presidio entities. These operators look for common entities such as
   credit card numbers, email addresses, phone numbers, and more. You can remove from this list any entities that you do not
   want your code to look for, add operators for additional
   [built-in entities](https://microsoft.github.io/presidio/supported_entities/) that you want your code to look for, or
   [add your own custom entities](https://microsoft.github.io/presidio/analyzer/adding_recognizers/) to this list.

   ```python theme={null}
   import boto3
   import json

   from presidio_analyzer import AnalyzerEngine
   from presidio_anonymizer import AnonymizerEngine
   from presidio_anonymizer.entities import OperatorConfig

   operators = {
       "CREDIT_CARD": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_CREDIT_CARD>'}),
       "CRYPTO": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_CRYPTO>'}),
       "EMAIL_ADDRESS": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_EMAIL_ADDRESS>'}),
       "IBAN_CODE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_IBAN_CODE>'}),
       "IP_ADDRESS": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_IP_ADDRESS>'}),
       "NRP": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_NRP>'}),
       "LOCATION": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_LOCATION>'}), 
       "PERSON": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_PERSON>'}),
       "PHONE_NUMBER": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_PHONE_NUMBER>'}),
       "MEDICAL_LICENSE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_MEDICAL_LICENSE>'}),
       "URL": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_URL>'}),
       "US_BANK_NUMBER": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_BANK_NUMBER>'}),
       "US_DRIVER_LICENSE": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_DRIVER_LICENSE>'}),
       "US_ITIN": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_ITIN>'}),
       "US_PASSPORT": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_PASSPORT>'}),
       "US_SSN": OperatorConfig(operator_name="replace", params={'new_value': '<REDACTED_US_SSN>'})
   }

   # Recursively check for string values in the provided JSON object (in this case, 
   # the "metadata" field of the JSON object) and redact them 
   # as appropriate.
   def check_string_values(obj, analyzer, anonymizer):
       if isinstance(obj, dict):
           for key, value in obj.items():
               # Skip analyzing Base64-encoded fields.
               if key == 'image_base64' or key == 'orig_elements':
                   pass
               elif isinstance(value, str):
                   anonymized_results = anonymizer.anonymize(
                       text=value,
                       analyzer_results=analyzer.analyze(text=value, language="en"),
                       operators=operators
                   )
                   obj[key] = anonymized_results.text
               # Recurse through nested "metadata" fields.
               elif isinstance(value, dict):
                   check_string_values(value, analyzer, anonymizer)
               # Skip analyzing non-string fields.
               else:
                   pass
       return obj

   def main():
       s3_input_bucket_name    = '<input-bucket-name>'
       s3_input_folder_prefix  = '<input-folder-prefix>'
       s3_output_bucket_name   = '<output-bucket-name>'
       s3_output_folder_prefix = '<output-folder-prefix>'
       s3_bucket_region        = '<bucket-region-short-id>'

       s3_client = boto3.client('s3')

       # Load the JSON files from the input folder.
       # Normalize the input folder prefix to ensure it ends with '/'.
       if not s3_input_folder_prefix.endswith('/'):
           s3_input_folder_prefix += '/'

       paginator = s3_client.get_paginator('list_objects_v2')
       page_iterator = paginator.paginate(
           Bucket=s3_input_bucket_name, 
           Prefix=s3_input_folder_prefix
       )
       files = []

       # Get the list of file keys from the input folder to analyze.
       # A file's key is the full path to the file within the bucket.
       # For example, if the input folder's name is "original" and the 
       # input file's name is "file1.json", the file's key is 
       # "original/file1.json".
       # There could be multiple "pages" of file listings available, 
       # so each of these "pages" must be looped through, so that 
       # no files are missed.
       for page in page_iterator:
           # "Contents" is missing if the folder is empty or the 
           # intended prefix is not found.
           if 'Contents' in page:
               for obj in page['Contents']:
                   key = obj['Key']
                   if not key.endswith('/'):  # Skip if it's a folder placeholder.
                       files.append(key)
                       print(f"Found file: {s3_input_bucket_name}/{key}")

       analyzer = AnalyzerEngine()
       anonymizer = AnonymizerEngine()
       s3_resource = boto3.resource('s3')

       # For each JSON file to analyze, load the JSON data.
       for key in files:
           print(f"Analyzing file: {s3_input_bucket_name}/{key}")
           content_object = s3_resource.Object(
               bucket_name=s3_input_bucket_name, 
               key=key 
           )

           file_content = content_object.get()['Body'].read().decode('utf-8')  # Bytes to text.
           json_data = json.loads(file_content) # Text to JSON.

           # For each element in the JSON data...
           for element in json_data:
               print(f"    Analyzing element with ID: {element['element_id']} in file {s3_input_bucket_name}/{key}")
               # If there is a "text" field...
               if 'text' in element:
                   # ...get the text content...
                   text_element = element['text']
                   # ...and analyze and redact the text content as appropriate.
                   anonymized_results = anonymizer.anonymize(
                       text=text_element,
                       analyzer_results=analyzer.analyze(text=text_element, language="en"),
                       operators=operators
                   )
                   element['text'] = anonymized_results.text
               # If there is a "metadata" field...
               if 'metadata' in element:
                   # ...get the metadata content...
                   metadata_element = element['metadata']
                   # ...and analyze and redact the metadata content as appropriate.
                   element['metadata'] = check_string_values(metadata_element, analyzer, anonymizer)

           # Get the filename from the key.
           filename = key.split(s3_input_folder_prefix)[1]

           # Normalize the output folder prefix to ensure it ends with '/'.
           if not s3_output_folder_prefix.endswith('/'):
               s3_output_folder_prefix += '/'
           
           # Then save the JSON data with its redactions to the output folder.
           print(f"Saving file: {s3_output_bucket_name}/{s3_output_folder_prefix}{filename}")
           s3_client.put_object(
               Bucket=s3_output_bucket_name,
               Key=f"{s3_output_folder_prefix}{filename}",
               Body=json.dumps(obj=json_data, indent=4).encode('utf-8')
           )

   if __name__ == "__main__":
       main()
   ```
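
   If you want to see the recursive metadata walk in isolation, the following sketch exercises the same pattern with a stand-in redactor. The `redact` function here is a hypothetical placeholder that flags any string containing `@`; it is not a Presidio API.

   ```python theme={null}
   # Stand-in redactor: flags any string containing "@" instead of
   # calling Presidio's analyzer and anonymizer.
   def redact(text):
       return "<REDACTED>" if "@" in text else text

   def check_string_values(obj):
       if isinstance(obj, dict):
           for key, value in obj.items():
               # Skip Base64-encoded fields, as in the script above.
               if key in ("image_base64", "orig_elements"):
                   continue
               if isinstance(value, str):
                   # Assign back into the dict so the change sticks.
                   obj[key] = redact(value)
               elif isinstance(value, dict):
                   check_string_values(value)
       return obj

   metadata = {"filename": "a.pdf", "sent_from": {"email": "jane@example.com"}}
   print(check_string_values(metadata))
   # {'filename': 'a.pdf', 'sent_from': {'email': '<REDACTED>'}}
   ```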

5. Run the Python script.

6. Go to the output folder in S3 and explore the generated files, searching for the `<REDACTED_` placeholders in the generated files' contents.
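
   For example, an element whose original `text` field contained an email address and a phone number might appear in the redacted output similar to the following (the structure and values here are illustrative):

   ```json theme={null}
   {
       "element_id": "f3a9...",
       "type": "NarrativeText",
       "text": "For help, contact <REDACTED_EMAIL_ADDRESS> or call <REDACTED_PHONE_NUMBER>.",
       "metadata": {
           "filename": "example.pdf",
           "languages": ["eng"]
       }
   }
   ```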
