Task

You want to generate a schema for a JSON file that Unstructured produces, so that you can validate, test, and document related JSON files across your systems.

Approach

Use a Python package such as genson to generate schemas for your JSON files.

The genson package is not owned or supported by Unstructured. For questions and requests, see the Issues tab of the genson repository in GitHub.

Generate a schema from the terminal

1

Install genson

Use pip to install the genson package.

pip install genson
2

Install jq

By default, genson generates the JSON schema as a single string without any line breaks or indented whitespace.

To pretty-print the schema that genson produces, install the jq utility.

The jq utility is not owned or supported by Unstructured. For questions and requests, see the Issues tab of the jq repository in GitHub.
3

Generate the schema

  1. Run the genson command, specifying the path to the input (source) JSON file, and the path to the output (target) JSON schema file to be generated. Use jq to pretty-print the schema’s content into the file to be generated.

    genson "/path/to/input/file.json" | jq '.' > "/path/to/output/schema.json"
    
  2. You can find the generated JSON schema file in the output path that you specified.

Generate a schema from Python code

1

Install dependencies

In your Python project, install the genson package.

pip install genson
2

Add and run the schema generation code

  1. Set the following local environment variables:

    • Set LOCAL_FILE_INPUT_PATH to the local path to the input (source) JSON file.
    • Set LOCAL_FILE_OUTPUT_PATH to the local path to the output (target) JSON schema file to be generated.
  2. Add the following Python code file to your project:

    import os, json
    from genson import SchemaBuilder
    
    def json_schema_from_file(
        input_file_path: str,
        output_schema_path: str
    ) -> None:
        try:
            with open(input_file_path, "r") as file:
                json_data = json.load(file)
    
            builder = SchemaBuilder()
            builder.add_object(json_data)
    
            schema = builder.to_schema()
    
            try:
                with open(output_schema_path, "w") as schema_file:
                    json.dump(schema, schema_file, indent=2)
            except IOError as e:
                raise IOError(f"Error writing to output file: {e}")
            
            print(f"JSON schema successfully generated and saved to '{output_schema_path}'.")
        except FileNotFoundError:
            print(f"Error: Input file '{input_file_path}' not found.")
        except IOError as e:
            print(f"I/O error occurred: {e}")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
    
    if __name__ == "__main__":
        json_schema_from_file(
            input_file_path=os.getenv("LOCAL_FILE_INPUT_PATH"),
            output_schema_path=os.getenv("LOCAL_FILE_OUTPUT_PATH")
        )
    
  3. Run the Python code file.

  4. Check the path specified by LOCAL_FILE_OUTPUT_PATH for the generated JSON schema file.