Extracting
extract_datetimetz
Extracts the date, time, and timezone in the Received field(s) from an .eml file. extract_datetimetz takes in a string and returns a datetime.datetime object from the input string.
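As a rough illustration of the idea, the sketch below parses the timestamp at the end of a Received header using only the standard library. The name sketch_extract_datetimetz and the sample header are hypothetical; this is not the library's implementation, and it assumes the date follows the last semicolon in the header.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def sketch_extract_datetimetz(received: str) -> datetime:
    """Hypothetical stand-in: parse the timestamp that ends a Received header."""
    # Assumption: the date-time follows the last semicolon of the header.
    _, _, date_part = received.rpartition(";")
    # parsedate_to_datetime returns a timezone-aware datetime when an offset is present.
    return parsedate_to_datetime(date_part.strip())


header = (
    "from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by ABC.DEF.local "
    "with mapi id 14.03.0513.000; Fri, 26 Mar 2021 11:04:09 +1200"
)
parsed = sketch_extract_datetimetz(header)
print(parsed.isoformat())  # 2021-03-26T11:04:09+12:00
```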
For more information about the extract_datetimetz function, you can check the source code here.
extract_email_address
Extracts email addresses from a string input and returns a list of all the email addresses in the input string.
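The behavior can be approximated with a single regular expression. The sketch below is a simplified stand-in under that assumption: sketch_extract_email_address and EMAIL_RE are hypothetical names, and the pattern is deliberately looser than a full address grammar.

```python
import re
from typing import List

# Simplified pattern (an assumption): local part, "@", domain, dot, alphabetic TLD.
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.IGNORECASE)


def sketch_extract_email_address(text: str) -> List[str]:
    """Hypothetical stand-in: return every address-like substring, in order."""
    return EMAIL_RE.findall(text)


emails = sketch_extract_email_address("Me me@email.com and You <you@email.com>")
print(emails)  # ['me@email.com', 'you@email.com']
```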
For more information about the extract_email_address function, you can check the source code here.
extract_ip_address
Extracts IPv4 and IPv6 addresses from the input string and returns a list of all IP addresses found.
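One way to sketch this with the standard library is to pull out candidate tokens and let the ipaddress module decide which ones are valid. sketch_extract_ip_address is a hypothetical name, and validating with ipaddress is a choice made for this example rather than the library's approach.

```python
import ipaddress
import re
from typing import List


def sketch_extract_ip_address(text: str) -> List[str]:
    """Hypothetical stand-in: return valid IPv4 and IPv6 addresses found in text."""
    found = []
    # Candidate tokens are runs of hex digits, colons, and dots.
    for token in re.findall(r"[0-9A-Fa-f:.]+", text):
        candidate = token.rstrip(".")  # drop sentence-ending periods
        try:
            ipaddress.ip_address(candidate)
        except ValueError:
            continue  # not a valid IPv4 or IPv6 address
        found.append(candidate)
    return found


ips = sketch_extract_ip_address(
    "Relayed from 192.168.0.1 via 2001:db8::1 to the gateway."
)
print(ips)  # ['192.168.0.1', '2001:db8::1']
```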
For more information about the extract_ip_address function, you can check the source code here.
extract_ip_address_name
Extracts the name associated with each IP address in the Received field(s) from an .eml file. extract_ip_address_name takes in a string and returns a list of the IP address names found in the input string.
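The sketch below illustrates the idea on a sample Received header. It assumes each name appears after "from" or "by" and immediately before a parenthesized, bracketed address; sketch_extract_ip_address_name and the sample header are hypothetical.

```python
import re
from typing import List

# Assumption: a host name follows "from" or "by" and precedes "([address])".
NAME_RE = re.compile(r"\b(?:from|by)\s+([\w.-]+)\s+\(\[")


def sketch_extract_ip_address_name(received: str) -> List[str]:
    """Hypothetical stand-in: return the host names paired with bracketed addresses."""
    return NAME_RE.findall(received)


header = (
    "from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by ABC.DEF.local2 "
    "([ba23::58b5:2236:45g2:88h2%25]) with mapi id 14.03.0513.000; "
    "Fri, 26 Mar 2021 11:04:09 +1200"
)
names = sketch_extract_ip_address_name(header)
print(names)  # ['ABC.DEF.local', 'ABC.DEF.local2']
```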
For more information about the extract_ip_address_name function, you can check the source code here.
extract_mapi_id
Extracts the mapi id in the Received field(s) from an .eml file. extract_mapi_id takes in a string and returns a list containing the mapi id string found in the input string.
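A minimal sketch of the extraction, assuming the identifier is the dotted number that follows the words "mapi id" in the header. sketch_extract_mapi_id and the sample header are hypothetical, and the pattern is not taken from the library.

```python
import re
from typing import List


def sketch_extract_mapi_id(received: str) -> List[str]:
    """Hypothetical stand-in: return the dotted number that follows 'mapi id'."""
    # [\d.]+ stops at the semicolon that ends the identifier.
    return re.findall(r"mapi id\s+([\d.]+)", received)


header = (
    "from ABC.DEF.local ([ba23::58b5:2236:45g2:88h2]) by ABC.DEF.local "
    "with mapi id 14.03.0513.000; Fri, 26 Mar 2021 11:04:09 +1200"
)
mapi_ids = sketch_extract_mapi_id(header)
print(mapi_ids)  # ['14.03.0513.000']
```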
For more information about the extract_mapi_id function, you can check the source code here.
extract_ordered_bullets
Extracts alphanumeric bullets from the beginning of text up to three “sub-section” levels.
Examples:
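The following standard-library sketch shows one way to split a leading bullet such as "1.1" into up to three levels. sketch_extract_ordered_bullets is a hypothetical stand-in; the three-tuple return shape and the length heuristic are assumptions made for this example.

```python
from typing import Optional, Tuple

Levels = Tuple[Optional[str], Optional[str], Optional[str]]
NO_BULLET: Levels = (None, None, None)


def sketch_extract_ordered_bullets(text: str) -> Levels:
    """Hypothetical stand-in: split a leading bullet into up to three levels."""
    tokens = text.split()
    if not tokens:
        return NO_BULLET
    first = tokens[0]
    # A bullet needs at least one dot and no empty level such as "1..2".
    if "." not in first or ".." in first:
        return NO_BULLET
    parts = first.rstrip(".").split(".")
    if not all(part.isalnum() for part in parts):
        return NO_BULLET
    # Heuristic (an assumption): long first levels are ordinary words, not bullets.
    if len(parts[0]) > 2:
        return NO_BULLET
    padded = (parts + [None, None, None])[:3]
    return (padded[0], padded[1], padded[2])


print(sketch_extract_ordered_bullets("1.1 This is a very important point"))
print(sketch_extract_ordered_bullets("a.1 This is a very important point"))
```

The two calls print ('1', '1', None) and ('a', '1', None).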
For more information about the extract_ordered_bullets function, you can check the source code here.
extract_text_after
Extracts text that occurs after the specified pattern.
Options:
- If index is set, extract after the (index + 1)th occurrence of the pattern. The default is 0.
- Strips trailing whitespace if strip is set to True. The default is True.
Examples:
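The sketch below mirrors the pattern, index, and strip options with the standard re module. sketch_extract_text_after is a hypothetical stand-in rather than the library's code; as a simplification it strips whitespace from both ends of the result, and raising ValueError for a missing occurrence is an assumption.

```python
import re


def sketch_extract_text_after(
    text: str, pattern: str, index: int = 0, strip: bool = True
) -> str:
    """Hypothetical stand-in: text following the (index + 1)th match of pattern."""
    matches = list(re.finditer(pattern, text))
    if index >= len(matches):
        raise ValueError(f"pattern occurs {len(matches)} time(s); index {index} is out of range")
    after = text[matches[index].end():]
    # Simplification: strip both ends when strip is True.
    return after.strip() if strip else after


text = "SPEAKER 1: Look at me, I'm flying!"
print(sketch_extract_text_after(text, r"SPEAKER \d{1}:"))  # Look at me, I'm flying!
```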
For more information about the extract_text_after function, you can check the source code here.
extract_text_before
Extracts text that occurs before the specified pattern.
Options:
- If index is set, extract before the (index + 1)th occurrence of the pattern. The default is 0.
- Strips leading whitespace if strip is set to True. The default is True.
Examples:
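The sketch below mirrors the pattern, index, and strip options with the standard re module. sketch_extract_text_before is a hypothetical stand-in rather than the library's code; as a simplification it strips whitespace from both ends of the result, and raising ValueError for a missing occurrence is an assumption.

```python
import re


def sketch_extract_text_before(
    text: str, pattern: str, index: int = 0, strip: bool = True
) -> str:
    """Hypothetical stand-in: text preceding the (index + 1)th match of pattern."""
    matches = list(re.finditer(pattern, text))
    if index >= len(matches):
        raise ValueError(f"pattern occurs {len(matches)} time(s); index {index} is out of range")
    before = text[:matches[index].start()]
    # Simplification: strip both ends when strip is True.
    return before.strip() if strip else before


text = "Here I am! STOP Look at me! STOP I'm flying! STOP"
print(sketch_extract_text_before(text, r"STOP"))  # Here I am!
```

Passing index=1 to the same call returns everything before the second STOP.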
For more information about the extract_text_before function, you can check the source code here.
extract_us_phone_number
Extracts a phone number from a section of text.
Examples:
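A minimal regular-expression sketch of the idea follows. sketch_extract_us_phone_number and the pattern are hypothetical and cover only common US formats; returning an empty string when nothing matches is an assumption of this example.

```python
import re

# Assumption: optional +1 prefix, optional parentheses, and space/dot/dash separators.
US_PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def sketch_extract_us_phone_number(text: str) -> str:
    """Hypothetical stand-in: return the first US-style phone number, or ""."""
    match = US_PHONE_RE.search(text)
    return match.group(0) if match else ""


print(sketch_extract_us_phone_number("Phone number: 215-867-5309"))  # 215-867-5309
```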
For more information about the extract_us_phone_number function, you can check the source code here.
group_broken_paragraphs
Groups together paragraphs that are broken up with line breaks for visual or formatting purposes. This is common in .txt files. By default, group_broken_paragraphs groups together lines split by \n. You can change that behavior with the line_split kwarg. The function considers \n\n to be a paragraph break by default. You can change that behavior with the paragraph_split kwarg.
Examples:
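The grouping logic can be sketched with two regular-expression splits. sketch_group_broken_paragraphs is a hypothetical stand-in; passing line_split and paragraph_split as regex strings, and their default values here, are assumptions of this example.

```python
import re


def sketch_group_broken_paragraphs(
    text: str,
    line_split: str = r"\s*\n\s*",
    paragraph_split: str = r"\s*\n\s*\n\s*",
) -> str:
    """Hypothetical stand-in: rejoin lines within each paragraph."""
    grouped = []
    # Split on paragraph breaks first, then rejoin the lines inside each paragraph.
    for paragraph in re.split(paragraph_split, text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        grouped.append(" ".join(re.split(line_split, paragraph)))
    return "\n\n".join(grouped)


text = (
    "The big brown fox\nwas walking down the lane.\n\n"
    "At the end of the lane, the\nfox met a bear."
)
print(sketch_group_broken_paragraphs(text))
```

The output keeps the blank line between the two paragraphs while each paragraph becomes a single line.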
For more information about the group_broken_paragraphs function, you can check the source code here.
remove_punctuation
Removes ASCII and unicode punctuation from a string.
Examples:
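One way to sketch this with the standard library is to drop every character whose Unicode general category starts with "P". sketch_remove_punctuation is a hypothetical stand-in, and relying on Unicode categories is a choice made for this example.

```python
import unicodedata


def sketch_remove_punctuation(text: str) -> str:
    """Hypothetical stand-in: drop characters in the Unicode punctuation categories."""
    # Categories starting with "P" cover ASCII marks such as "!" and curly quotes.
    # Symbols such as "$" or "+" are in "S" categories and are kept.
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


print(sketch_remove_punctuation("\u201cA lovely quote!\u201d"))  # A lovely quote
```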
For more information about the remove_punctuation function, you can check the source code here.
replace_unicode_quotes
Replaces unicode quote characters such as \x91 in strings.
Examples:
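The sketch below handles only the four Windows-1252 quote bytes (\x91 to \x94) and maps them to their Unicode equivalents. sketch_replace_unicode_quotes and the mapping table are hypothetical; the real function may cover more characters than this subset.

```python
# Windows-1252 quote code points mapped to their Unicode equivalents.
CP1252_QUOTES = {
    "\x91": "\u2018",  # left single quotation mark
    "\x92": "\u2019",  # right single quotation mark
    "\x93": "\u201c",  # left double quotation mark
    "\x94": "\u201d",  # right double quotation mark
}
QUOTE_TABLE = str.maketrans(CP1252_QUOTES)


def sketch_replace_unicode_quotes(text: str) -> str:
    """Hypothetical stand-in: replace the four quote code points listed above."""
    return text.translate(QUOTE_TABLE)


print(sketch_replace_unicode_quotes("\x93A lovely quote!\x94"))
```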
For more information about the replace_unicode_quotes function, you can check the source code here.
translate_text
The translate_text cleaning function translates text between languages. translate_text uses the Helsinki NLP MT models from transformers for machine translation. It works for Russian, Chinese, Arabic, and many other languages.
Parameters:
- text: the input string to translate.
- source_lang: the two letter language code for the source language of the text. If source_lang is not specified, the language will be detected using langdetect.
- target_lang: the two letter language code for the target language for translation. Defaults to "en".
Examples:
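Running a translation requires downloading a model, so the sketch below only shows how a language pair could map to a model name. It assumes the Helsinki-NLP/opus-mt-{source}-{target} naming used on the Hugging Face Hub; sketch_opus_mt_model_name is a hypothetical helper, and whether translate_text resolves names exactly this way is an assumption.

```python
def sketch_opus_mt_model_name(source_lang: str, target_lang: str = "en") -> str:
    """Hypothetical helper: build a model name for a two-letter language pair."""
    # Assumption: models follow the Helsinki-NLP/opus-mt-<source>-<target> convention.
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


# target_lang defaults to "en", matching the parameter description above.
print(sketch_opus_mt_model_name("ru"))  # Helsinki-NLP/opus-mt-ru-en
```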
For more information about the translate_text function, you can check the source code here.