Philadelphia Eaglesâ\x80\x99 victory automatically gets converted to Philadelphia Eagles' victory in partition_html using the replace_unicode_quotes cleaning function. You can see how that works in the code snippet below:
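A minimal standard-library sketch of this kind of substitution follows. The function name fix_mojibake_quotes and its replacement table are assumptions made for illustration; this is not the library's implementation of replace_unicode_quotes.

```python
# Illustrative sketch only: NOT the library's replace_unicode_quotes.
# The table is an assumption covering a few common mis-decoded quote sequences.
# "\xe2" is the character that displays as an a-with-circumflex.
MOJIBAKE_QUOTES = {
    "\xe2\x80\x99": "'",  # mis-decoded right single quote (U+2019)
    "\xe2\x80\x98": "'",  # mis-decoded left single quote (U+2018)
    "\xe2\x80\x9c": '"',  # mis-decoded left double quote (U+201C)
    "\xe2\x80\x9d": '"',  # mis-decoded right double quote (U+201D)
}


def fix_mojibake_quotes(text: str) -> str:
    # Replace each garbled three-character sequence with a plain ASCII quote.
    for garbled, plain in MOJIBAKE_QUOTES.items():
        text = text.replace(garbled, plain)
    return text


cleaned = fix_mojibake_quotes("Philadelphia Eagles\xe2\x80\x99 victory")
print(cleaned)  # Philadelphia Eagles' victory
```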
Document elements in unstructured include an apply method that allows you to apply text cleaning to the document element without instantiating a new element. The apply method expects a callable that takes a string as input and produces another string as output. In the example below, we invoke the replace_unicode_quotes cleaning function using the apply method.
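The pattern can be sketched without the library installed. SketchElement and straighten_apostrophe below are hypothetical stand-ins, not the library's element class or its replace_unicode_quotes function, and accepting several cleaners in one call is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SketchElement:
    """Hypothetical stand-in for a document element; not the library's class."""

    text: str

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        # Run each cleaner in order and store the result on the same element.
        for cleaner in cleaners:
            cleaned = cleaner(self.text)
            if not isinstance(cleaned, str):
                raise ValueError("A cleaner must return a string.")
            self.text = cleaned


def straighten_apostrophe(text: str) -> str:
    # Assumed cleaner for illustration: repairs one mis-decoded apostrophe.
    return text.replace("\xe2\x80\x99", "'")


element = SketchElement("Philadelphia Eagles\xe2\x80\x99 victory")
element.apply(straighten_apostrophe)
print(element.text)  # Philadelphia Eagles' victory
```

The element is updated in place, so no second element has to be created to hold the cleaned text.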
Since a cleaning function is just a str -> str function, users can also easily include their own cleaning functions for custom data preparation tasks. In the example below, we remove citations from a section of text.
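A custom cleaner of this kind needs nothing beyond a regular expression. The sketch below assumes citations look like bracketed numbers of one to three digits; the function name and sample sentence are illustrative.

```python
import re


def remove_citations(text: str) -> str:
    # Drop bracketed numeric citations such as [1] or [12].
    return re.sub(r"\[\d{1,3}\]", "", text)


result = remove_citations("Geolocated footage has confirmed the report.[1][12]")
print(result)  # Geolocated footage has confirmed the report.
```

Because the function takes a string and returns a string, it has the shape that the apply method described above expects.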
See below for a full list of cleaning functions in the unstructured library.
bytes_string_to_string
Converts an output string that looks like a byte string to a string using the specified encoding. This happens sometimes in partition_html when there is a character like an emoji that isn’t expected by the HTML parser. In that case, the encoded bytes get processed.
Examples:
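One way to picture the conversion is to treat each character of the byte-like string as a single raw byte and decode the result. This is a sketch of the idea under that assumption, with a hypothetical function name; it raises ValueError if any character has a code point above 255.

```python
def decode_byte_like_string(text: str, encoding: str = "utf-8") -> str:
    # Treat each character's code point (0-255) as one raw byte, then decode.
    raw = bytes(ord(ch) for ch in text)
    return raw.decode(encoding)


# The four characters after "Hello " are the UTF-8 bytes of U+1F600, read as text.
garbled = "Hello \xf0\x9f\x98\x80"
fixed = decode_byte_like_string(garbled)
print(fixed)  # Hello followed by a grinning-face emoji
```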
For more information about the bytes_string_to_string function, you can check the source code here.
clean
Cleans a section of text with options including removing bullets, extra whitespace, dashes and trailing punctuation. Optionally, you can choose to lowercase the output.
Options:
- Applies clean_bullets if bullets=True.
- Applies clean_extra_whitespace if extra_whitespace=True.
- Applies clean_dashes if dashes=True.
- Applies clean_trailing_punctuation if trailing_punctuation=True.
- Lowercases the output if lowercase=True.
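The way these flags combine can be sketched as a single function that runs each step only when its flag is set. clean_sketch is a hypothetical, simplified re-implementation: the order of the steps, the characters each step handles, and the final strip are all assumptions, not the library's behavior.

```python
import re


def clean_sketch(text, bullets=False, extra_whitespace=False, dashes=False,
                 trailing_punctuation=False, lowercase=False):
    # Hypothetical re-implementation for illustration; each step is simplified.
    if lowercase:
        text = text.lower()
    if bullets:
        # Remove one leading bullet character (U+2022 or U+25CF).
        text = re.sub(r"^\s*[\u2022\u25cf]\s*", "", text)
    if trailing_punctuation:
        text = text.strip().rstrip(".,:;")
    if dashes:
        # Replace hyphens and en dashes (U+2013) with spaces.
        text = re.sub(r"[-\u2013]", " ", text)
    if extra_whitespace:
        text = re.sub(r"\s+", " ", text)
    return text.strip()


a = clean_sketch("\u25cf An excellent point!  ",
                 bullets=True, extra_whitespace=True, lowercase=True)
b = clean_sketch("ITEM 1A: RISK-FACTORS.",
                 dashes=True, trailing_punctuation=True, lowercase=True)
print(a)  # an excellent point!
print(b)  # item 1a: risk factors
```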
For more information about the clean function, you can check the source code here.
clean_bullets
Removes bullets from the beginning of text. Bullets that do not appear at the beginning of the text are not removed.
Examples:
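The leading-only behavior can be illustrated with an anchored regular expression. The set of bullet characters below is an assumption chosen for the sketch, and strip_leading_bullet is a hypothetical name, not the library function.

```python
import re

# Assumed bullet set: U+2022, U+25CF, U+25E6, U+2023, U+2043.
LEADING_BULLET = re.compile(r"^\s*[\u2022\u25cf\u25e6\u2023\u2043]\s*")


def strip_leading_bullet(text: str) -> str:
    # The ^ anchor means only a bullet at the start of the text is removed.
    return LEADING_BULLET.sub("", text, count=1)


first = strip_leading_bullet("\u25cf An excellent point!")
second = strip_leading_bullet("I love Morse Code! \u25cf\u25cf\u25cf")
print(first)   # An excellent point!
print(second)  # unchanged: the bullets are not at the beginning
```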
For more information about the clean_bullets function, you can check the source code here.
clean_dashes
Removes dashes from a section of text. Also handles special characters such as \u2013.
Examples:
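A short sketch of the idea, assuming that dashes are replaced with spaces and the result is stripped; replace_dashes is a hypothetical name and only the hyphen and the en dash (\u2013) are handled.

```python
import re


def replace_dashes(text: str) -> str:
    # Replace ASCII hyphens and en dashes (U+2013) with spaces, then trim.
    return re.sub(r"[-\u2013]", " ", text).strip()


dash_free = replace_dashes("ITEM 1A: RISK-FACTORS\u2013")
print(dash_free)  # ITEM 1A: RISK FACTORS
```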
For more information about the clean_dashes function, you can check the source code here.
clean_non_ascii_chars
Removes non-ASCII characters from a string.
Examples:
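In plain Python, one common way to get this effect is to encode to ASCII while ignoring characters that cannot be represented. The function name below is hypothetical and the approach is a sketch, not necessarily how the library does it.

```python
def drop_non_ascii(text: str) -> str:
    # Characters outside the ASCII range are silently discarded by "ignore".
    return text.encode("ascii", "ignore").decode("ascii")


ascii_only = drop_non_ascii("\x88This text contains\u00aenon-ascii characters!\u25cf")
print(ascii_only)  # This text containsnon-ascii characters!
```

Note that removed characters are not replaced with a space, so words on either side of a removed character run together.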
For more information about the clean_non_ascii_chars function, you can check the source code here.
clean_ordered_bullets
Removes alphanumeric bullets from the beginning of text up to three “sub-section” levels.
Examples:
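The following sketch approximates this with a regular expression. It assumes a bullet is one to three dot-separated levels of one or two alphanumeric characters each, containing at least one dot; the pattern and the name strip_ordered_bullet are illustrative, and, being a heuristic, the pattern also matches short abbreviations such as "Mr." at the start of the text.

```python
import re

# Lookahead requires a dot in the first token; then up to three levels like 1.2.3
ORDERED_BULLET = re.compile(
    r"^(?=\S*\.)[A-Za-z0-9]{1,2}(?:\.[A-Za-z0-9]{1,2}){0,2}\.?\s+"
)


def strip_ordered_bullet(text: str) -> str:
    # Remove a leading ordered bullet if present; otherwise return text as is.
    return ORDERED_BULLET.sub("", text, count=1)


numeric = strip_ordered_bullet("1.1 This is a very important point")
alpha = strip_ordered_bullet("a.b This is a very important point")
plain = strip_ordered_bullet("An apple a day")
print(numeric)  # This is a very important point
print(alpha)    # This is a very important point
print(plain)    # An apple a day
```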
For more information about the clean_ordered_bullets function, you can check the source code here.
clean_postfix
Removes the postfix from a string if it matches a specified pattern.
Options:
- Ignores case if ignore_case is set to True. The default is False.
- Strips trailing whitespace if strip is set to True. The default is True.
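The interaction between the pattern and the two options can be sketched as follows. strip_postfix is a hypothetical function that mirrors the documented options; it assumes the pattern is a regular expression anchored at the end of the string.

```python
import re


def strip_postfix(text: str, pattern: str,
                  ignore_case: bool = False, strip: bool = True) -> str:
    # Anchor the pattern at the end of the string so only a postfix is removed.
    flags = re.IGNORECASE if ignore_case else 0
    cleaned = re.sub(rf"(?:{pattern})$", "", text, flags=flags)
    return cleaned.rstrip() if strip else cleaned


stripped = strip_postfix("The end! END", r"(END|STOP)", ignore_case=True)
unstripped = strip_postfix("The end! END", r"(END|STOP)", strip=False)
print(repr(stripped))    # 'The end!'
print(repr(unstripped))  # 'The end! '
```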
For more information about the clean_postfix function, you can check the source code here.
clean_prefix
Removes the prefix from a string if it matches a specified pattern.
Options:
- Ignores case if ignore_case is set to True. The default is False.
- Strips leading whitespace if strip is set to True. The default is True.
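A matching sketch for the prefix case, again with a hypothetical function name and the assumption that the pattern is a regular expression, this time anchored at the start of the string.

```python
import re


def strip_prefix(text: str, pattern: str,
                 ignore_case: bool = False, strip: bool = True) -> str:
    # Anchor the pattern at the start of the string so only a prefix is removed.
    flags = re.IGNORECASE if ignore_case else 0
    cleaned = re.sub(rf"^(?:{pattern})", "", text, flags=flags)
    return cleaned.lstrip() if strip else cleaned


summary = strip_prefix(
    "SUMMARY: This is the best summary of all time!",
    r"(SUMMARY|DESCRIPTION):",
    ignore_case=True,
)
print(summary)  # This is the best summary of all time!
```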
For more information about the clean_prefix function, you can check the source code here.
clean_trailing_punctuation
Removes trailing punctuation from a section of text.
Examples:
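A minimal sketch, assuming that "trailing punctuation" means periods, commas, colons, and semicolons at the end of the text; the function name and that character set are assumptions.

```python
def strip_trailing_punctuation(text: str) -> str:
    # Trim surrounding whitespace first so punctuation before it is reachable.
    return text.strip().rstrip(".,:;")


heading = strip_trailing_punctuation("ITEM 1A: RISK FACTORS.")
print(heading)  # ITEM 1A: RISK FACTORS
```

The colon inside the text is kept because only characters at the very end are affected.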
For more information about the clean_trailing_punctuation function, you can check the source code here.
group_broken_paragraphs
Groups together paragraphs that are broken up with line breaks for visual or formatting purposes. This is common in .txt files. By default, group_broken_paragraphs groups together lines split by \n. You can change that behavior with the line_split kwarg. The function considers \n\n to be a paragraph break by default. You can change that behavior with the paragraph_split kwarg.
Examples:
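The two-level splitting described above can be sketched with two regular expressions. join_broken_lines is a hypothetical function; accepting the patterns as plain strings, and the exact default patterns, are assumptions of this sketch rather than the library's signature.

```python
import re


def join_broken_lines(text: str,
                      line_split: str = r"\s*\n\s*",
                      paragraph_split: str = r"\s*\n\s*\n\s*") -> str:
    # Split into paragraphs first, then rejoin the wrapped lines inside each one.
    paragraphs = re.split(paragraph_split, text)
    joined = [re.sub(line_split, " ", p.strip()) for p in paragraphs if p.strip()]
    return "\n\n".join(joined)


wrapped = (
    "The big brown fox\n"
    "was walking down the lane.\n"
    "\n"
    "At the end of the lane, the\n"
    "fox met a bear."
)
grouped = join_broken_lines(wrapped)
print(grouped)
```

Each paragraph comes back as a single line, and paragraphs remain separated by a blank line.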
For more information about the group_broken_paragraphs function, you can check the source code here.
remove_punctuation
Removes ASCII and Unicode punctuation from a string.
Examples:
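One way to cover both ASCII and Unicode punctuation is to filter on the Unicode general category. The sketch below assumes "punctuation" means any character whose category starts with "P"; the function name is hypothetical, and symbols such as "$" or "+" (category "S") are left alone under this assumption.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    # Keep every character whose Unicode category is not a punctuation class.
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


quote = strip_punctuation("\u201cA lovely quote!\u201d")
print(quote)  # A lovely quote
```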
For more information about the remove_punctuation function, you can check the source code here.
replace_unicode_quotes
Replaces Unicode quote characters, such as \x91, in strings.
Examples:
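Characters such as \x91 through \x94 are the positions that Windows-1252 uses for curly quotes. A small sketch of mapping them to the corresponding Unicode quotation marks follows; the table and the function name are assumptions for illustration, and the library function may handle more cases.

```python
# Windows-1252 quote positions mapped to their Unicode quotation marks.
CP1252_QUOTES = {
    0x91: "\u2018",  # left single quotation mark
    0x92: "\u2019",  # right single quotation mark
    0x93: "\u201c",  # left double quotation mark
    0x94: "\u201d",  # right double quotation mark
}


def replace_cp1252_quotes(text: str) -> str:
    # str.translate swaps each listed code point in a single pass.
    return text.translate(CP1252_QUOTES)


quoted = replace_cp1252_quotes("\x93What a lovely quote!\x94")
print(quoted)
```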
For more information about the replace_unicode_quotes function, you can check the source code here.
translate_text
The translate_text cleaning function translates text between languages. It uses the Helsinki NLP MT models from transformers for machine translation and works for Russian, Chinese, Arabic, and many other languages.
Parameters:
- text: the input string to translate.
- source_lang: the two letter language code for the source language of the text. If source_lang is not specified, the language will be detected using langdetect.
- target_lang: the two letter language code for the target language for translation. Defaults to "en".
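To show how the two language codes relate to a model choice, the sketch below builds a checkpoint name from them. It assumes the Helsinki-NLP/opus-mt-{source}-{target} naming pattern used for the Helsinki NLP checkpoints on the Hugging Face Hub; opus_mt_model_name is a hypothetical helper, it does not download or run a model, and it does not check that a checkpoint exists for the given pair.

```python
def opus_mt_model_name(source_lang: str, target_lang: str = "en") -> str:
    # Assumed naming convention: Helsinki-NLP/opus-mt-<source>-<target>.
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


model_name = opus_mt_model_name("ru")
print(model_name)  # Helsinki-NLP/opus-mt-ru-en
```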
For more information about the translate_text function, you can check the source code here.
