Cleaning
Data usually needs to be cleaned before it is passed into an NLP model. Unwanted content in your output, for example, can degrade the quality of the model. To help with this, the unstructured library includes cleaning functions that sanitize output before it is sent to downstream applications.
Some cleaning functions apply automatically. In the example in the Partition section, the output Philadelphia Eaglesâ\x80\x99 victory automatically gets converted to Philadelphia Eagles' victory in partition_html using the replace_unicode_quotes cleaning function. You can see how that works in the code snippet below:
from unstructured.cleaners.core import replace_unicode_quotes
replace_unicode_quotes("Philadelphia Eaglesâ\x80\x99 victory")
Document elements in unstructured include an apply method that allows you to apply a text cleaning function to the document element without instantiating a new element. The apply method expects a callable that takes a string as input and produces another string as output. In the example below, we invoke the replace_unicode_quotes cleaning function using the apply method.
from unstructured.documents.elements import Text
element = Text("Philadelphia Eaglesâ\x80\x99 victory")
element.apply(replace_unicode_quotes)
print(element)
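To make the in-place behavior concrete, here is a minimal stand-in for an element with an apply method. MiniElement is a hypothetical class written only for illustration. It is not the library's Text class, and it assumes nothing beyond what is described above: apply takes str -> str callables and updates the element's own text.

```python
from typing import Callable


class MiniElement:
    """Hypothetical stand-in for a document element (not unstructured's Text)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        """Run each cleaner on the text in order and store the result in place."""
        cleaned = self.text
        for cleaner in cleaners:
            cleaned = cleaner(cleaned)
        if not isinstance(cleaned, str):
            raise ValueError("Cleaner produced a non-string output.")
        self.text = cleaned

    def __str__(self) -> str:
        return self.text


element = MiniElement("  An Excellent Point!  ")
element.apply(str.strip, str.lower)
print(element)  # an excellent point!
```

Because the text is updated on the existing object, any references to the element elsewhere in a pipeline see the cleaned text as well.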
Since a cleaning function is just a str -> str function, users can also easily include their own cleaning functions for custom data preparation tasks. In the example below, we remove citations from a section of text.
import re
# Use a raw string so the regex backslashes are not treated as string escapes
remove_citations = lambda text: re.sub(r"\[\d{1,3}\]", "", text)
element = Text("[1] Geolocated combat footage has confirmed Russian gains in the Dvorichne area northwest of Svatove.")
element.apply(remove_citations)
print(element)
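Since cleaners are plain str -> str callables, several of them can be chained into a single callable. The sketch below uses only the standard library. The compose helper and the Cleaner alias are hypothetical names introduced here, not part of unstructured.

```python
import re
from functools import reduce
from typing import Callable

Cleaner = Callable[[str], str]


def compose(*cleaners: Cleaner) -> Cleaner:
    """Return one cleaner that applies the given cleaners left to right."""
    return lambda text: reduce(lambda acc, fn: fn(acc), cleaners, text)


def remove_citations(text: str) -> str:
    """Drop bracketed numeric citations such as [1] or [123]."""
    return re.sub(r"\[\d{1,3}\]", "", text)


# Remove the citation first, then trim the whitespace it leaves behind.
clean_sentence = compose(remove_citations, str.strip)
print(clean_sentence("[1] Geolocated combat footage has confirmed gains."))
# Geolocated combat footage has confirmed gains.
```

A composed cleaner like clean_sentence is itself a str -> str callable, so it should be usable anywhere a single cleaning function is expected.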
See below for a full list of cleaning functions in the unstructured library.
bytes_string_to_string
Converts an output string that looks like a byte string to a string using the specified encoding. This happens sometimes in partition_html when there is a character like an emoji that isn’t expected by the HTML parser. In that case, the encoded bytes get processed.
Examples:
from unstructured.cleaners.core import bytes_string_to_string
text = "Hello ð\x9f\x98\x80"
# The output should be "Hello 😀"
bytes_string_to_string(text, encoding="utf-8")
from unstructured.cleaners.core import bytes_string_to_string
from unstructured.partition.html import partition_html
text = """\n<html charset="utf-8"><p>Hello 😀</p></html>"""
elements = partition_html(text=text)
elements[0].apply(bytes_string_to_string)
# The output should be "Hello 😀"
elements[0].text
For more information about the bytes_string_to_string function, you can check the source code here.
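The idea behind this kind of repair can be sketched with the standard library alone: treat each character's code point as a single byte, then decode those bytes with the intended encoding. The function name decode_byte_like_string is hypothetical, and this is an approximation of the technique rather than the library's implementation.

```python
def decode_byte_like_string(text: str, encoding: str = "utf-8") -> str:
    """Reinterpret each character's code point as one byte, then decode.

    Assumes every character is below U+0100; otherwise encode() raises
    UnicodeEncodeError.
    """
    raw = text.encode("latin-1")  # one byte per character
    return raw.decode(encoding)


# The four characters after "Hello " are the UTF-8 bytes of U+1F600.
print(decode_byte_like_string("Hello \xf0\x9f\x98\x80"))  # Hello 😀
```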
clean
Cleans a section of text with options including removing bullets, extra whitespace, dashes and trailing punctuation. Optionally, you can choose to lowercase the output.
Options:
- Applies clean_bullets if bullets=True.
- Applies clean_extra_whitespace if extra_whitespace=True.
- Applies clean_dashes if dashes=True.
- Applies clean_trailing_punctuation if trailing_punctuation=True.
- Lowercases the output if lowercase=True.
Examples:
from unstructured.cleaners.core import clean
# Returns "an excellent point!"
clean("● An excellent point!", bullets=True, lowercase=True)
# Returns "ITEM 1A: RISK FACTORS"
clean("ITEM 1A: RISK-FACTORS", extra_whitespace=True, dashes=True)
For more information about the clean function, you can check the source code here.
clean_bullets
Removes bullets from the beginning of text. Bullets that do not appear at the beginning of the text are not removed.
Examples:
from unstructured.cleaners.core import clean_bullets
# Returns "An excellent point!"
clean_bullets("● An excellent point!")
# Returns "I love Morse Code! ●●●"
clean_bullets("I love Morse Code! ●●●")
For more information about the clean_bullets function, you can check the source code here.
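The "beginning of text only" rule can be expressed with an anchored regular expression. The sketch below is an approximation for illustration: strip_leading_bullet and its small set of bullet characters are assumptions, and the library's own pattern may cover more characters.

```python
import re

# Hypothetical, deliberately small set of bullet characters.
# The ^ anchor restricts the match to the start of the string.
LEADING_BULLET_RE = re.compile(r"^\s*[●•▪‣]\s*")


def strip_leading_bullet(text: str) -> str:
    """Remove one bullet (and surrounding spaces) from the start of text."""
    return LEADING_BULLET_RE.sub("", text, count=1)


print(strip_leading_bullet("● An excellent point!"))   # An excellent point!
print(strip_leading_bullet("I love Morse Code! ●●●"))  # I love Morse Code! ●●●
```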
clean_dashes
Removes dashes from a section of text. Also handles special characters such as \u2013.
Examples:
from unstructured.cleaners.core import clean_dashes
# Returns "ITEM 1A: RISK FACTORS"
clean_dashes("ITEM 1A: RISK-FACTORS\u2013")
For more information about the clean_dashes function, you can check the source code here.
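A rough standard-library equivalent replaces each dash with a space and trims the result. The name dashes_to_spaces is hypothetical, and the character class here covers only the ASCII hyphen and the en dash (\u2013) mentioned above.

```python
import re

# ASCII hyphen-minus and en dash; extend the class for other dash characters.
DASH_RE = re.compile(r"[-\u2013]")


def dashes_to_spaces(text: str) -> str:
    """Replace each dash with a space, then strip leading/trailing whitespace."""
    return DASH_RE.sub(" ", text).strip()


print(dashes_to_spaces("ITEM 1A: RISK-FACTORS\u2013"))  # ITEM 1A: RISK FACTORS
```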
clean_non_ascii_chars
Removes non-ASCII characters from a string.
Examples:
from unstructured.cleaners.core import clean_non_ascii_chars
text = "\x88This text contains ®non-ascii characters!●"
# Returns "This text contains non-ascii characters!"
clean_non_ascii_chars(text)
For more information about the clean_non_ascii_chars function, you can check the source code here.
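One common way to drop non-ASCII characters in plain Python is an encode/decode round trip that ignores anything outside the ASCII range. The helper name drop_non_ascii is hypothetical, and this is a sketch of the technique rather than a statement about the library's internals.

```python
def drop_non_ascii(text: str) -> str:
    """Discard every character that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


sample = "\x88This text contains ®non-ascii characters!●"
print(drop_non_ascii(sample))  # This text contains non-ascii characters!
```

Note that this removes accented letters and other legitimate non-ASCII text too, so it suits inputs that are expected to be plain English.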
clean_ordered_bullets
Removes alphanumeric bullets from the beginning of text, up to three “sub-section” levels.
Examples:
from unstructured.cleaners.core import clean_ordered_bullets
# Returns "This is a very important point"
clean_ordered_bullets("1.1 This is a very important point")
# Returns "This is a very important point ●"
clean_ordered_bullets("a.b This is a very important point ●")
For more information about the clean_ordered_bullets function, you can check the source code here.
clean_postfix
Removes the postfix from a string if it matches a specified pattern.
Options:
- Ignores case if ignore_case is set to True. The default is False.
- Strips trailing whitespace if strip is set to True. The default is True.
Examples:
from unstructured.cleaners.core import clean_postfix
text = "The end! END"
# Returns "The end!"
clean_postfix(text, r"(END|STOP)", ignore_case=True)
For more information about the clean_postfix function, you can check the source code here.
clean_prefix
Removes the prefix from a string if it matches a specified pattern.
Options:
- Ignores case if ignore_case is set to True. The default is False.
- Strips leading whitespace if strip is set to True. The default is True.
Examples:
from unstructured.cleaners.core import clean_prefix
text = "SUMMARY: This is the best summary of all time!"
# Returns "This is the best summary of all time!"
clean_prefix(text, r"(SUMMARY|DESCRIPTION):", ignore_case=True)
For more information about the clean_prefix function, you can check the source code here.
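The prefix and postfix cleaners described above can be approximated with anchored regular expressions. In this sketch, strip_prefix and strip_postfix are hypothetical names. They mirror the ignore_case and strip options documented here but are not the library's code.

```python
import re


def strip_prefix(text: str, pattern: str, ignore_case: bool = False,
                 strip: bool = True) -> str:
    """Remove `pattern` only when it matches at the start of `text`."""
    flags = re.IGNORECASE if ignore_case else 0
    cleaned = re.sub(rf"^(?:{pattern})", "", text, flags=flags)
    return cleaned.lstrip() if strip else cleaned


def strip_postfix(text: str, pattern: str, ignore_case: bool = False,
                  strip: bool = True) -> str:
    """Remove `pattern` only when it matches at the end of `text`."""
    flags = re.IGNORECASE if ignore_case else 0
    cleaned = re.sub(rf"(?:{pattern})$", "", text, flags=flags)
    return cleaned.rstrip() if strip else cleaned


print(strip_prefix("SUMMARY: This is the best summary of all time!",
                   r"(SUMMARY|DESCRIPTION):", ignore_case=True))
# This is the best summary of all time!
print(strip_postfix("The end! END", r"(END|STOP)", ignore_case=True))
# The end!
```

Wrapping the caller's pattern in a non-capturing group keeps the ^ and $ anchors applied to the whole alternation rather than to its first or last branch.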
clean_trailing_punctuation
Removes trailing punctuation from a section of text.
Examples:
from unstructured.cleaners.core import clean_trailing_punctuation
# Returns "ITEM 1A: RISK FACTORS"
clean_trailing_punctuation("ITEM 1A: RISK FACTORS.")
For more information about the clean_trailing_punctuation function, you can check the source code here.
group_broken_paragraphs
Groups together paragraphs that are broken up with line breaks for visual or formatting purposes. This is common in .txt files. By default, group_broken_paragraphs groups together lines split by \n. You can change that behavior with the line_split kwarg. The function considers \n\n to be a paragraph break by default. You can change that behavior with the paragraph_split kwarg.
Examples:
from unstructured.cleaners.core import group_broken_paragraphs
text = """The big brown fox
was walking down the lane.

At the end of the lane, the
fox met a bear."""
group_broken_paragraphs(text)
import re
from unstructured.cleaners.core import group_broken_paragraphs
para_split_re = re.compile(r"(\s*\n\s*){3}")
text = """The big brown fox

was walking down the lane.


At the end of the lane, the

fox met a bear."""
group_broken_paragraphs(text, paragraph_split=para_split_re)
For more information about the group_broken_paragraphs function, you can check the source code here.
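The two-level splitting described above (paragraph breaks first, then line breaks within each paragraph) can be sketched with the standard library. The function group_lines and its default patterns are assumptions made for illustration and may differ from the library's defaults.

```python
import re

# Assumed defaults: a single newline splits lines, a blank line splits paragraphs.
LINE_SPLIT_RE = re.compile(r"\s*\n\s*")
PARAGRAPH_SPLIT_RE = re.compile(r"\s*\n\s*\n\s*")


def group_lines(text: str,
                line_split: re.Pattern = LINE_SPLIT_RE,
                paragraph_split: re.Pattern = PARAGRAPH_SPLIT_RE) -> str:
    """Join visually wrapped lines while keeping paragraph boundaries.

    Assumes the patterns contain no capturing groups, since re.split would
    otherwise return the captured separators as extra items.
    """
    paragraphs = paragraph_split.split(text)
    joined = [line_split.sub(" ", p).strip() for p in paragraphs if p.strip()]
    return "\n\n".join(joined)


text = (
    "The big brown fox\nwas walking down the lane.\n\n"
    "At the end of the lane, the\nfox met a bear."
)
print(group_lines(text))
# The big brown fox was walking down the lane.
#
# At the end of the lane, the fox met a bear.
```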
remove_punctuation
Removes ASCII and unicode punctuation from a string.
Examples:
from unstructured.cleaners.core import remove_punctuation
# Returns "A lovely quote"
remove_punctuation("“A lovely quote!”")
For more information about the remove_punctuation function, you can check the source code here.
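Covering Unicode punctuation as well as ASCII is the part that a simple string.punctuation filter would miss. One way to sketch it with the standard library is to drop every character whose Unicode general category starts with "P". The helper name strip_punctuation is hypothetical, and the library may use a different approach.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop characters in any Unicode punctuation category (Pc, Pd, Po, ...)."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


print(strip_punctuation("\u201cA lovely quote!\u201d"))  # A lovely quote
```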
replace_unicode_quotes
Replaces unicode quote characters such as \x91 in strings.
Examples:
from unstructured.cleaners.core import replace_unicode_quotes
# Returns "“A lovely quote!”"
replace_unicode_quotes("\x93A lovely quote!\x94")
# Returns "‘A lovely quote!’"
replace_unicode_quotes("\x91A lovely quote!\x92")
For more information about the replace_unicode_quotes function, you can check the source code here.
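Conceptually, this kind of cleaner is a lookup table from mis-encoded sequences to the intended quote characters. The sketch below shows the idea with a small, hypothetical table. QUOTE_REPLACEMENTS and replace_quotes are illustrative names, and the library's actual table is likely larger.

```python
# Hypothetical subset of replacements, chosen to match the examples in this page.
QUOTE_REPLACEMENTS = {
    "\x91": "\u2018",          # left single quote
    "\x92": "\u2019",          # right single quote
    "\x93": "\u201c",          # left double quote
    "\x94": "\u201d",          # right double quote
    "\u00e2\x80\x99": "'",     # UTF-8 right single quote misread byte by byte
}


def replace_quotes(text: str) -> str:
    """Replace each known mis-encoded quote sequence with its intended character."""
    for bad, good in QUOTE_REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text


print(replace_quotes("\x93A lovely quote!\x94"))  # “A lovely quote!”
print(replace_quotes("Philadelphia Eagles\u00e2\x80\x99 victory"))
# Philadelphia Eagles' victory
```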
translate_text
The translate_text cleaning function translates text between languages. translate_text uses the Helsinki NLP MT models from transformers for machine translation. It works for Russian, Chinese, Arabic, and many other languages.
Parameters:
- text: the input string to translate.
- source_lang: the two-letter language code for the source language of the text. If source_lang is not specified, the language will be detected using langdetect.
- target_lang: the two-letter language code for the target language for translation. Defaults to "en".
Examples:
from unstructured.cleaners.translate import translate_text
# Output is "I'm a Berliner!"
translate_text("Ich bin ein Berliner!")
# Output is "I can also translate Russian!"
translate_text("Я тоже можно переводать русский язык!", "ru", "en")
For more information about the translate_text function, you can check the source code here.