Extracting
extract_datetimetz
Extracts the date, time, and timezone in the Received
field(s) from an .eml
file. extract_datetimetz
takes in a string and returns a datetime.datetime object from the input string.
from unstructured.cleaners.extract import extract_datetimetz
text = """from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by
\n ABC.DEF.local2 ([ba23::58b5:2236:45g2:88h2%25]) with mapi id\
n 32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"""
# Returns datetime.datetime(2021, 3, 26, 11, 4, 9, tzinfo=datetime.timezone(datetime.timedelta(seconds=43200)))
extract_datetimetz(text)
For more information about the extract_datetimetz
function, you can check the source code here.
extract_email_address
Extracts email addresses from a string input and returns a list of all the email addresses in the input string.
from unstructured.cleaners.extract import extract_email_address
text = """Me me@email.com and You <You@email.com>
([ba23::58b5:2236:45g2:88h2]) (10.0.2.01)"""
# Returns "['me@email.com', 'you@email.com']"
extract_email_address(text)
For more information about the extract_email_address
function, you can check the source code here.
extract_ip_address
Extracts IPv4 and IPv6 IP addresses in the input string and returns a list of all IP address in input string.
from unstructured.cleaners.extract import extract_ip_address
text = """Me me@email.com and You <You@email.com>
([ba23::58b5:2236:45g2:88h2]) (10.0.2.01)"""
# Returns "['ba23::58b5:2236:45g2:88h2', '10.0.2.01']"
extract_ip_address(text)
For more information about the extract_ip_address
function, you can check the source code here.
extract_ip_address_name
Extracts the names of each IP address in the Received
field(s) from an .eml
file. extract_ip_address_name
takes in a string and returns a list of all IP addresses in the input string.
from unstructured.cleaners.extract import extract_ip_address_name
text = """from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by
\n ABC.DEF.local2 ([ba23::58b5:2236:45g2:88h2%25]) with mapi id\
n 32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"""
# Returns "['ABC.DEF.local', 'ABC.DEF.local2']"
extract_ip_address_name(text)
For more information about the extract_ip_address_name
function, you can check the source code here.
extract_mapi_id
Extracts the mapi id
in the Received
field(s) from an .eml
file. extract_mapi_id
takes in a string and returns a list of a string containing the mapi id
in the input string.
from unstructured.cleaners.extract import extract_mapi_id
text = """from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by
\n ABC.DEF.local2 ([ba23::58b5:2236:45g2:88h2%25]) with mapi id\
n 32.88.5467.123; Fri, 26 Mar 2021 11:04:09 +1200"""
# Returns "['32.88.5467.123']"
extract_mapi_id(text)
For more information about the extract_mapi_id
function, you can check the source code here.
extract_ordered_bullets
Extracts alphanumeric bullets from the beginning of text up to three “sub-section” levels.
Examples:
from unstructured.cleaners.extract import extract_ordered_bullets
# Returns ("1", "1", None)
extract_ordered_bullets("1.1 This is a very important point")
# Returns ("a", "1", None)
extract_ordered_bullets("a.1 This is a very important point")
For more information about the extract_ordered_bullets
function, you can check the source code here.
extract_text_after
Extracts text that occurs after the specified pattern.
Options:
-
If
index
is set, extract after the(index + 1)
th occurrence of the pattern. The default is0
. -
Strips trailing whitespace if
strip
is set toTrue
. The default isTrue
.
Examples:
from unstructured.cleaners.extract import extract_text_after
text = "SPEAKER 1: Look at me, I'm flying!"
# Returns "Look at me, I'm flying!"
extract_text_after(text, r"SPEAKER \d{1}:")
For more information about the extract_text_after
function, you can check the source code here.
extract_text_before
Extracts text that occurs before the specified pattern.
Options:
-
If
index
is set, extract before the(index + 1)
th occurrence of the pattern. The default is0
. -
Strips leading whitespace if
strip
is set toTrue
. The default isTrue
.
Examples:
from unstructured.cleaners.extract import extract_text_before
text = "Here I am! STOP Look at me! STOP I'm flying! STOP"
# Returns "Here I am!"
extract_text_before(text, r"STOP")
For more information about the extract_text_before
function, you can check the source code here.
extract_us_phone_number
Extracts a phone number from a section of text.
Examples:
from unstructured.cleaners.extract import extract_us_phone_number
# Returns "215-867-5309"
extract_us_phone_number("Phone number: 215-867-5309")
For more information about the extract_us_phone_number
function, you can check the source code here.
group_broken_paragraphs
Groups together paragraphs that are broken up with line breaks for visual or formatting purposes. This is common in .txt
files. By default, group_broken_paragraphs
groups together lines split by \n
. You can change that behavior with the line_split
kwarg. The function considers \n\n
to be a paragraph break by default. You can change that behavior with the paragraph_split
kwarg.
Examples:
from unstructured.cleaners.core import group_broken_paragraphs
text = """The big brown fox
was walking down the lane.
At the end of the lane, the
fox met a bear."""
group_broken_paragraphs(text)
import re
from unstructured.cleaners.core import group_broken_paragraphs
para_split_re = re.compile(r"(\s*\n\s*){3}")
text = """The big brown fox
was walking down the lane.
At the end of the lane, the
fox met a bear."""
group_broken_paragraphs(text, paragraph_split=para_split_re)
For more information about the group_broken_paragraphs
function, you can check the source code here.
remove_punctuation
Removes ASCII and unicode punctuation from a string.
Examples:
from unstructured.cleaners.core import remove_punctuation
# Returns "A lovely quote"
remove_punctuation("“A lovely quote!”")
For more information about the remove_punctuation
function, you can check the source code here.
replace_unicode_quotes
Replaces unicode quote characters such as \x91
in strings.
Examples:
from unstructured.cleaners.core import replace_unicode_quotes
# Returns "“A lovely quote!”"
replace_unicode_characters("\x93A lovely quote!\x94")
# Returns ""‘A lovely quote!’"
replace_unicode_characters("\x91A lovely quote!\x92")
For more information about the replace_unicode_quotes
function, you can check the source code here.
translate_text
The translate_text
cleaning function translates text between languages. translate_text
uses the Helsinki NLP MT models from transformers
for machine translation. Works for Russian, Chinese, Arabic, and many other languages.
Parameters:
-
text
: the input string to translate. -
source_lang
: the two letter language code for the source language of the text. Ifsource_lang
is not specified, the language will be detected usinglangdetect
. -
target_lang
: the two letter language code for the target language for translation. Defaults to"en"
.
Examples:
from unstructured.cleaners.translate import translate_text
# Output is "I'm a Berliner!"
translate_text("Ich bin ein Berliner!")
# Output is "I can also translate Russian!"
translate_text("Я тоже можно переводать русский язык!", "ru", "en")
For more information about the translate_text
function, you can check the source code here.