Basic Module

Module for basic preprocessing techniques: lowercasing, URL replacement, symbol replacement, underscoring of multi-part words, etc.

Some extra techniques that are very useful for testing the preprocess library in real scenarios are also included in this module as functions: replace_dot_sequence, add_doc_ending_point.

Quick Example

The original text (first six lines):

For other (optional) flags of <opencv_createsamples>, see the official... documentation
at http://Docs.opencv.org/doc/user_guide/ug_traincascade.html.
[ 99 ]
www.it-ebooks.info.
Generating Haar Cascades for Custom 8.4 Targets

from preprocess.basic import replace_urls, multipart_words
from preprocess.data import test_text_path

path = test_text_path()

with open(path) as doc:
    text = doc.read()

resultant_text = replace_urls(multipart_words(text))
print("\n".join(resultant_text.split('\n')[:10]))

The output should look like this:

For other (optional) flags of <opencv_createsamples>, see the official... documentation
at http___Docs_opencv_org_doc_user_guide_ug_traincascade_html.
[ 99 ]
www_it_ebooks_info.
Generating Haar Cascades for Custom 8_4 Targets

Detailed Docs

  1. Normalize Module: normalization functions.

API Reference

preprocess.basic.add_doc_ending_point(text: str) -> str

Add Final Text Dot

Derived from the clean_punctuation script, but with less functionality: it only adds an ending point at the end of the document.

Note

This function guarantees that the last sentence has an ending point. The sentence tokenization process can then be standardized, because every sentence, even the last one, has an ending point.

Parameters

text: str

text to process

Returns

text: str

The same text with a dot added at the end, if one was missing.
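
The behaviour described above can be illustrated with a minimal sketch. It is not the library's implementation: the name add_ending_point_sketch is hypothetical, and stripping trailing whitespace before the check is an assumption made for this example.

```python
def add_ending_point_sketch(text: str) -> str:
    """Return the text with a final dot, adding one only if it is missing."""
    # Assumption: trailing whitespace is dropped before checking the last character.
    stripped = text.rstrip()
    if not stripped:
        # Nothing to terminate in an empty or whitespace-only document.
        return text
    if stripped.endswith("."):
        return stripped
    return stripped + "."


print(add_ending_point_sketch("First sentence. Second sentence"))
# First sentence. Second sentence.
```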

preprocess.basic.del_digits(text)

Delete words composed only of digits.
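
As a rough illustration only (the function name and the regular expression are assumptions, not taken from the library), removing digit-only words could look like this:

```python
import re


def del_digits_sketch(text: str) -> str:
    """Remove words made only of digits; words that merely contain digits stay."""
    # \b\d+\b matches a run of digits delimited by word boundaries,
    # so "99" is removed while "opencv3" is kept.
    without_digits = re.sub(r"\b\d+\b", "", text)
    # Collapse the double spaces left behind by the removed words.
    return re.sub(r" {2,}", " ", without_digits).strip()


print(del_digits_sketch("page 99 of opencv3 in 2015"))
# page of opencv3 in
```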

preprocess.basic.del_tokens_len_one(text: str) -> str

Delete tokens with length = 1.

This is a basic kind of stopword filtering.
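
A minimal sketch of this filter, assuming plain whitespace tokenization (the library may tokenize differently, and the function name is hypothetical):

```python
def del_tokens_len_one_sketch(text: str) -> str:
    """Drop every whitespace-separated token that is a single character long."""
    # Assumption: tokens are split on whitespace and re-joined with single spaces,
    # so the original line structure is not preserved in this sketch.
    return " ".join(token for token in text.split() if len(token) > 1)


print(del_tokens_len_one_sketch("a cat sat on a mat x"))
# cat sat on mat
```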

preprocess.basic.expand_abbrevs(text: str, lang='en', type='classic') -> str

Abbreviation expansion. Replace classical abbreviations with their corresponding long forms, as written in a list of international abbreviations.

Cite

https://en.wikipedia.org/wiki/Abbreviation
https://en.wikipedia.org/wiki/List_of_classical_abbreviations

preprocess.basic.expand_contractions(text: str, lang='en') -> str

Expand English contractions.
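
The general idea can be sketched with a lookup table and one regular expression. The table below is a tiny hypothetical sample, and the function name is illustrative; the mapping actually used by the library is not shown on this page.

```python
import re

# Hypothetical, deliberately small mapping used only for this sketch.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "don't": "do not",
}

# One alternation over all known contractions, matched case-insensitively.
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions_sketch(text: str) -> str:
    """Replace each known contraction with its long form."""

    def _expand(match: "re.Match") -> str:
        # The original casing is not restored: "It's" becomes "it is".
        return CONTRACTIONS[match.group(0).lower()]

    return _CONTRACTION_RE.sub(_expand, text)


print(expand_contractions_sketch("It's late and I can't stay"))
# it is late and I cannot stay
```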

preprocess.basic.extraspace_for_endingpoints(text: str) -> str

Add an extra whitespace (if there isn’t any) between the last letter of a sentence and its ending point, so that all sentences can be parsed more easily by a very distinctive ending point.

This function makes it possible to skip abbreviation dots during the sentence-parsing subprocess.

The original objective of this function was to preserve line breaks in datasets with one sentence per line (e.g. paraphrase detection, STS).

replace_punctuation also attempts to do this, but because of the complexity of the regular expressions in replace_punctuation, this function is the one that guarantees that 100% of sentence dots are separated from any other character by at least one whitespace.
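
The effect can be sketched with a single substitution. This is an illustration under assumptions, not the library's code: the function name is hypothetical, and the rule "a dot glued to a non-dot character and followed by whitespace or end of text" is a simplification chosen for the example.

```python
import re


def extraspace_sketch(text: str) -> str:
    """Insert a space before a sentence-final dot that is glued to the last word."""
    # (?<=[^\s.]) : the dot directly follows a character that is neither
    #               whitespace nor another dot (dot runs such as "..." are skipped)
    # \.          : the dot itself
    # (?=\s|$)    : the dot is followed by whitespace or by the end of the text
    return re.sub(r"(?<=[^\s.])\.(?=\s|$)", " .", text)


print(extraspace_sketch("The cat sat.\nThe dog ran ."))
# The cat sat .
# The dog ran .
```

Dots that already have a space in front of them are left alone, so applying the sketch twice gives the same result as applying it once.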

preprocess.basic.hyphenation(text: str, collocations: list) -> str

Originally made to underscore the collocations in the original text. The recursive search for collocations makes it possible to find important expressions that define the topic (of course, there are better ways to do this, using Deep Learning and other more complex techniques).

Once the collocations are hyphenated, they turn into single words and are not mixed with the rest. For example, if you hyphenate the collocation [natural, language] as “natural_language”, it will be more informative in a Luhn term evaluation than “natural” and “language” used separately.

Parameters

text: str

normalized text

collocations: list of tuples

List of collocations

Returns

text: str

The same text with all collocations hyphenated with the underscore character
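
A possible sketch of this idea follows. The function name, the longest-first ordering, and the whitespace-tolerant pattern are all assumptions made for illustration; the library's recursive search is not reproduced here.

```python
import re
from typing import List, Tuple


def hyphenation_sketch(text: str, collocations: List[Tuple[str, ...]]) -> str:
    """Join the words of each collocation found in the text with underscores."""
    # Assumption: longer collocations are handled first, so that
    # ("natural", "language", "processing") wins over ("natural", "language").
    for words in sorted(collocations, key=len, reverse=True):
        pattern = r"\b" + r"\s+".join(re.escape(word) for word in words) + r"\b"
        joined = "_".join(words)
        # A callable replacement avoids any escape processing of the joined string.
        text = re.sub(pattern, lambda _match, joined=joined: joined, text)
    return text


sample = "natural language processing uses natural language models"
pairs = [("natural", "language"), ("natural", "language", "processing")]
print(hyphenation_sketch(sample, pairs))
# natural_language_processing uses natural_language models
```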

preprocess.basic.lowercase(text: str) -> str

Return lowercase of string.

preprocess.basic.multipart_words(text: str) -> str

In NLP, hyphenated words like ‘end-of-line’ are called multi-part words.

All hyphens in multi-part words are replaced by the underscore character.

Note

Syllable segmentation in rich-format text adds extra hyphens to the text; those hyphens are removed in replace_punctuation.
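
The hyphen handling can be sketched as follows. This illustrative function (a hypothetical name, not the library's code) covers only hyphens placed directly between two word characters; any other replacement the real function performs is not modelled.

```python
import re


def multipart_words_sketch(text: str) -> str:
    """Replace hyphens inside multi-part words with underscores."""
    # A hyphen counts only when it sits directly between two word characters,
    # so a free-standing dash surrounded by spaces is left untouched.
    return re.sub(r"(?<=\w)-(?=\w)", "_", text)


print(multipart_words_sketch("an end-of-line marker - not this"))
# an end_of_line marker - not this
```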

preprocess.basic.normalize_abbrevs(text: str, lang='en') -> str

Recognize abbreviations marked with periods between initials and a dot after the last initial or after the whole abbreviation. Entity name abbreviations are usually written like this. The matched dots are underscored.

This function is meant to help sentence tokenizers with the end-of-sentence ambiguities introduced by some of these dots. To help with semantic analysis, use abbreviation expansion (expand_abbrevs) or deep Named Entity Tagging techniques.

Abbreviation Definition

An abbreviation is a shortened form of a written word or phrase. Abbreviations may be used to save space and time [Merriam-Webster2020a]. The accepted style here is “to use periods after uppercase letters, and after mixed-case abbreviations (E.g.: Jr., Mrs., A.M., …)”.

Note

In a case like U_S., the function expects you to resolve the remaining dot at the end of preprocessing. If a capital letter follows, the dot marks the end of a sentence; otherwise it must be erased.

Warning

Full lowercase abbreviations are not supported yet (E.g.: a.m., etc., …).

References

[Merriam-Webster2020a] Definition of Abbreviation. Merriam-Webster, 2020.
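
To make the dot-underscoring of initials concrete, here is a minimal sketch. It assumes, based on the U_S. example in the note above, that the inner dots are underscored and the final dot is kept. It handles only uppercase initials (not Jr. or Mrs.), and the function name is hypothetical.

```python
import re

# Two or more "X." groups in a row, for example U.S. or A.M. or U.S.A.
_INITIALS_RE = re.compile(r"\b(?:[A-Z]\.)+[A-Z]\.")


def underscore_initials_sketch(text: str) -> str:
    """Underscore the inner dots of initial-based abbreviations, keeping the last dot."""

    def _underscore(match: "re.Match") -> str:
        abbrev = match.group(0)
        # Assumption: the final dot is kept so a later step can decide
        # whether it also closes a sentence.
        return abbrev[:-1].replace(".", "_") + "."

    return _INITIALS_RE.sub(_underscore, text)


print(underscore_initials_sketch("The U.S. Army met at 9 A.M. in the U.S.A. today."))
# The U_S. Army met at 9 A_M. in the U_S_A. today.
```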

preprocess.basic.replace_dot_sequence(text: str) -> str

Replace a contiguous dot sequence with the same number of whitespace characters.

Please read the documentation carefully to see all the conventions adopted to replace these sequences, and how to maintain dot sentence delimiters for sentence tokenizers.

Note

It can’t be implemented without the finditer function. The expression r’(\w+)[.]\s*[.]+[\s|[.]]*’ changes the dot sequences, but it cannot keep track of the number of whitespace characters. This function is also used for the alignment process after normalization, where maintaining the length of the original text is important.
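
A length-preserving replacement built on finditer can be sketched as below. The function name and the rule "two or more consecutive dots form a sequence" are assumptions for illustration; the conventions the library uses to keep sentence delimiters are not reproduced.

```python
import re

# Assumption: a "dot sequence" is a run of two or more consecutive dots.
_DOT_RUN_RE = re.compile(r"\.{2,}")


def replace_dot_sequence_sketch(text: str) -> str:
    """Blank out each run of dots with the same number of spaces."""
    chars = list(text)
    for match in _DOT_RUN_RE.finditer(text):
        # Overwrite every position of the run, so the text length never changes.
        for index in range(match.start(), match.end()):
            chars[index] = " "
    return "".join(chars)


original = "see the official... documentation."
result = replace_dot_sequence_sketch(original)
print(repr(result))
# 'see the official    documentation.'
print(len(result) == len(original))
# True
```

Because every character keeps its position, offsets computed on the processed text still point at the same place in the original text.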

preprocess.basic.replace_punctuation(text: str) -> str

Replace all punctuation characters based on the patterns contained in the punctuation script. The regular expressions are ordered by structural elements (e.g. syllabic word division), paragraph transformations and sentence transformations.

Note

All syntactic and morphological transformations that depend on punctuation signs must be done before applying replace_punctuation.

It is important to apply replace_symbols before this function. Abbreviation recognition, multipart words, replace_dots and replace_urls also work with punctuation signs, so if those signs are not underscored or transformed first, this function will make its own decisions about the remaining punctuation signs. For example, the sentence tokenization will change in case of rare quotations: “.
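
Why the order matters can be shown with two simplified stand-ins. Both helper functions below are hypothetical and far simpler than the library functions they imitate; the sketch only demonstrates that a generic punctuation pass destroys the hyphens that the multi-part word step needs.

```python
import re


def underscore_hyphens(text: str) -> str:
    # Stand-in for the multi-part word step: hyphens between word characters become "_".
    return re.sub(r"(?<=\w)-(?=\w)", "_", text)


def strip_punctuation(text: str) -> str:
    # Stand-in for a generic punctuation pass: anything that is not a
    # word character or whitespace is replaced by a space ("_" counts as a word character).
    return re.sub(r"[^\w\s]", " ", text)


text = "an end-of-line marker"

# Punctuation-dependent step first: the multi-part word survives as one token.
right_order = strip_punctuation(underscore_hyphens(text))
# Generic punctuation pass first: the hyphens are gone before they can be underscored.
wrong_order = underscore_hyphens(strip_punctuation(text))

print(right_order)
# an end_of_line marker
print(wrong_order)
# an end of line marker
```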