As part of data preparation for an NLP model, you often need to clean your data before passing it into the model. Unwanted content in your output, for example, can degrade the quality of your NLP model. To help with this, the unstructured library includes cleaning functions that sanitize output before it is sent to downstream applications.
Some cleaning functions apply automatically. For example, the output "Philadelphia Eaglesâ\x80\x99 victory" automatically gets converted to "Philadelphia Eagles' victory" in partition_html using the replace_unicode_quotes cleaning function. You can see how that works in the code snippet below:
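The following is a minimal standalone sketch of the conversion described above, not the library's implementation. The helper name fix_mojibake_quotes and its two-entry mapping are assumptions made for illustration; replace_unicode_quotes itself handles more sequences.

```python
# Hypothetical stand-in for illustration only. Each key is a quotation mark
# whose UTF-8 bytes were decoded with a single-byte codec by mistake.
MOJIBAKE_QUOTES = {
    "â\x80\x99": "'",  # mis-decoded right single quotation mark (U+2019)
    "â\x80\x98": "'",  # mis-decoded left single quotation mark (U+2018)
}


def fix_mojibake_quotes(text: str) -> str:
    """Replace mis-decoded quote sequences with a plain apostrophe."""
    for broken, fixed in MOJIBAKE_QUOTES.items():
        text = text.replace(broken, fixed)
    return text


cleaned = fix_mojibake_quotes("Philadelphia Eaglesâ\x80\x99 victory")
print(cleaned)
```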
Document elements in unstructured include an apply method that allows you to apply text cleaning to the document element without instantiating a new element. The apply method expects a callable that takes a string as input and produces another string as output. In the example below, we invoke the replace_unicode_quotes cleaning function using the apply method.
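To make the contract concrete, here is a small sketch of an element with an apply method that accepts string-to-string callables and updates the text in place. TextElement and straighten_quotes are hypothetical stand-ins written for this example; they are not the library's element classes or cleaners.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TextElement:
    """Hypothetical stand-in for a document element that holds text."""

    text: str

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        """Run each cleaner in order and update the element in place."""
        cleaned = self.text
        for cleaner in cleaners:
            cleaned = cleaner(cleaned)
        if not isinstance(cleaned, str):
            raise ValueError("Cleaners must produce a string.")
        self.text = cleaned


def straighten_quotes(text: str) -> str:
    """Hypothetical cleaner: swap curly double quotes for straight ones."""
    return text.replace("\u201c", '"').replace("\u201d", '"')


element = TextElement("  \u201cA lovely quote!\u201d  ")
element.apply(straighten_quotes, str.strip)  # any str -> str callable works
print(element.text)
```

No new element is created here: the same object carries the cleaned text afterwards.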
Because a cleaning function is just a str -> str function, users can also easily include their own cleaning functions for custom data preparation tasks. In the example below, we remove citations from a section of text.
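A custom cleaner of this kind can be sketched with a regular expression. The pattern below assumes citations are bracketed numbers of one to three digits, such as [1] or [42]; the sample sentence is made up for the example.

```python
import re

# Assumption: a citation is a bracketed number with one to three digits.
CITATION = re.compile(r"\[\d{1,3}\]")


def remove_citations(text: str) -> str:
    """Delete bracketed numeric citations from the text."""
    return CITATION.sub("", text)


sample = "Cleaning improves model quality.[1] It also reduces noise.[23]"
print(remove_citations(sample))
```

Because remove_citations takes a string and returns a string, it can be passed to any API that expects a cleaning function.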
The sections below describe the cleaning functions in the unstructured library.
bytes_string_to_string
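Converts an output string that looks like a byte string to a string using the specified encoding. This sometimes happens in partition_html when there is a character like an emoji that isn’t expected by the HTML parser. In that case, the encoded bytes get processed.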
Examples:
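As a standalone sketch of the idea (not the library's code), the helper below assumes every character in the input stands for exactly one byte value, rebuilds those bytes, and decodes them with the requested encoding. The name bytes_like_to_string is hypothetical.

```python
def bytes_like_to_string(text: str, encoding: str = "utf-8") -> str:
    """Reinterpret each character's code point as one byte, then decode."""
    # bytes() raises ValueError if any code point is 256 or higher.
    raw = bytes(ord(char) for char in text)
    return raw.decode(encoding)


garbled = "Hello ð\x9f\x98\x80"  # the four bytes of a UTF-8 encoded emoji
print(bytes_like_to_string(garbled))
```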
For more information about the bytes_string_to_string function, you can check the source code.
clean
Options:
Applies clean_bullets if bullets=True.
Applies clean_extra_whitespace if extra_whitespace=True.
Applies clean_dashes if dashes=True.
Applies clean_trailing_punctuation if trailing_punctuation=True.
Lowercases the output if lowercase=True.
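The flag-driven behavior of these options can be sketched as follows. Each step here is a simplified stand-in written for this example, and the order of the steps is an assumption; the library's own helpers may behave differently in edge cases.

```python
import re


def clean_sketch(
    text: str,
    bullets: bool = False,
    extra_whitespace: bool = False,
    dashes: bool = False,
    trailing_punctuation: bool = False,
    lowercase: bool = False,
) -> str:
    """Apply each simplified cleaning step only when its flag is True."""
    if lowercase:
        text = text.lower()
    if trailing_punctuation:
        text = text.strip().rstrip(".,:;")
    if dashes:
        text = re.sub(r"[-\u2013]", " ", text)  # hyphen and en dash
    if extra_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    if bullets:
        # Drop one leading bullet character and the spaces after it.
        text = re.sub(r"^\s*[\u25cf\u2022*-]\s*", "", text)
    return text


print(clean_sketch("● An excellent point!", bullets=True, lowercase=True))
```

With every flag left at False, the text passes through unchanged.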
For more information about the clean function, you can check the source code.
clean_bullets
For more information about the clean_bullets function, you can check the source code.
clean_dashes
Removes dashes from a section of text, including special characters such as \u2013.
Examples:
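A standalone sketch of dash removal, covering both the ASCII hyphen and the \u2013 en dash. The helper name remove_dashes is hypothetical, and replacing each dash with a space (rather than deleting it) is an assumption of this sketch.

```python
import re


def remove_dashes(text: str) -> str:
    """Replace hyphens and en dashes with spaces, then trim the ends."""
    return re.sub(r"[-\u2013]", " ", text).strip()


print(remove_dashes("ITEM 1A: RISK-FACTORS\u2013"))
```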
For more information about the clean_dashes function, you can check the source code.
clean_non_ascii_chars
For more information about the clean_non_ascii_chars function, you can check the source code.
clean_ordered_bullets
For more information about the clean_ordered_bullets function, you can check the source code.
clean_postfix
Options:
Ignores case if ignore_case is set to True. The default is False.
Strips trailing whitespace if strip is set to True. The default is True.
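The two options can be illustrated with a short regex-based sketch. The helper name strip_postfix, its pattern argument, and the choice to anchor the pattern at the end of the string are assumptions of this example, not a description of the library's internals.

```python
import re


def strip_postfix(
    text: str, pattern: str, ignore_case: bool = False, strip: bool = True
) -> str:
    """Remove a trailing match of pattern, honoring both options."""
    flags = re.IGNORECASE if ignore_case else 0
    cleaned = re.sub(rf"(?:{pattern})$", "", text, flags=flags)
    return cleaned.rstrip() if strip else cleaned


print(strip_postfix("The end! END", r"(END|STOP)", ignore_case=True))
```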
For more information about the clean_postfix function, you can check the source code.
clean_prefix
Options:
Ignores case if ignore_case is set to True. The default is False.
Strips leading whitespace if strip is set to True. The default is True.
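For the prefix case, the pattern is anchored at the start of the string and leading whitespace is trimmed afterwards. As before, strip_prefix is a hypothetical helper and only a sketch of the documented options.

```python
import re


def strip_prefix(
    text: str, pattern: str, ignore_case: bool = False, strip: bool = True
) -> str:
    """Remove a leading match of pattern, honoring both options."""
    flags = re.IGNORECASE if ignore_case else 0
    cleaned = re.sub(rf"^(?:{pattern})", "", text, flags=flags)
    return cleaned.lstrip() if strip else cleaned


text = "SUMMARY: This is the best summary of all time!"
print(strip_prefix(text, r"(SUMMARY|DESCRIPTION):", ignore_case=True))
```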
For more information about the clean_prefix function, you can check the source code.
clean_trailing_punctuation
For more information about the clean_trailing_punctuation function, you can check the source code.
group_broken_paragraphs
Groups together paragraphs that are broken up by line breaks, which is common in .txt files. By default, group_broken_paragraphs groups together lines split by \n. You can change that behavior with the line_split kwarg. The function considers \n\n to be a paragraph break by default. You can change that behavior with the paragraph_split kwarg.
Examples:
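The interplay of the two kwargs can be sketched as below: split the text into paragraphs first, then rejoin the broken lines inside each paragraph. This is a simplified stand-in with a hypothetical name, and the default regular expressions are assumptions chosen to match the \n and \n\n defaults described above.

```python
import re


def regroup_paragraphs(
    text: str,
    line_split: re.Pattern = re.compile(r"\s*\n\s*"),
    paragraph_split: re.Pattern = re.compile(r"\s*\n\s*\n\s*"),
) -> str:
    """Join lines inside each paragraph while keeping paragraph breaks."""
    paragraphs = []
    for paragraph in paragraph_split.split(text):
        if paragraph.strip():
            paragraphs.append(line_split.sub(" ", paragraph.strip()))
    return "\n\n".join(paragraphs)


broken = (
    "The big brown fox\nwas walking down the lane.\n\n"
    "At the end of the lane, the\nfox met a bear."
)
print(regroup_paragraphs(broken))
```

Passing a different compiled pattern for either argument changes what counts as a line break or a paragraph break.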
For more information about the group_broken_paragraphs function, you can check the source code.
remove_punctuation
For more information about the remove_punctuation function, you can check the source code.
replace_unicode_quotes
Replaces unicode quote characters such as \x91 in strings.
Examples:
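A minimal sketch of this kind of replacement, using a translation table for the four quote characters in the \x91 to \x94 range. The table and the helper name replace_control_quotes are assumptions for illustration; the library function covers additional sequences.

```python
# Quote bytes from Windows-1252 that show up as control characters when the
# text was decoded with the wrong codec, mapped to the intended quotes.
QUOTE_TABLE = str.maketrans(
    {
        "\x91": "\u2018",  # left single quotation mark
        "\x92": "\u2019",  # right single quotation mark
        "\x93": "\u201c",  # left double quotation mark
        "\x94": "\u201d",  # right double quotation mark
    }
)


def replace_control_quotes(text: str) -> str:
    """Swap stray quote control characters for typographic quotes."""
    return text.translate(QUOTE_TABLE)


print(replace_control_quotes("\x93A lovely quote!\x94"))
```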
For more information about the replace_unicode_quotes function, you can check the source code.
translate_text
The translate_text cleaning function translates text between languages. It uses the Helsinki NLP MT models from transformers for machine translation and works for Russian, Chinese, Arabic, and many other languages.
Parameters:
text: the input string to translate.
source_lang: the two-letter language code for the source language of the text. If source_lang is not specified, the language will be detected using langdetect.
target_lang: the two-letter language code for the target language for translation. Defaults to "en".
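How the parameter defaults interact can be sketched without loading any model. In the sketch below, resolve_translation_model is a hypothetical helper, the detect callable stands in for langdetect, and the "Helsinki-NLP/opus-mt-{source}-{target}" checkpoint name is an assumption about how a model would be chosen from the two language codes.

```python
from typing import Callable, Optional


def resolve_translation_model(
    text: str,
    detect: Callable[[str], str],
    source_lang: Optional[str] = None,
    target_lang: str = "en",
) -> str:
    """Fall back to detection when source_lang is missing, then build an
    assumed checkpoint name from the source and target codes."""
    source = source_lang if source_lang is not None else detect(text)
    return f"Helsinki-NLP/opus-mt-{source}-{target_lang}"


def fake_detect(_text: str) -> str:
    """Placeholder detector for the example; always reports Russian."""
    return "ru"


print(resolve_translation_model("some text", fake_detect))
print(resolve_translation_model("some text", fake_detect, source_lang="zh"))
```

An explicit source_lang always wins over detection, and target_lang only needs to be passed when the output should not be English.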
For more information about the translate_text function, you can check the source code.