This notebook is a reviewed English version (2020) of “Normalización de Textos con Python.ipynb” from the nlp_pydata2018 collection on GitHub.

Text Normalization with Python

The objective of this notebook is to show functions written from scratch to teach NLP concepts; many of these examples are in fact small segments of real text-preprocessing code.

Text normalization is the subprocess that merges different ways of writing a term into a single appropriate and acceptable form. For example, a single document can contain the words “Señor”, “señor”, “Sr.” and “Sr”; all of them must be normalized to a unique form.[1]
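
As a minimal sketch of this idea, assuming a hypothetical variant pattern and a canonical form chosen only for illustration (neither comes from the original notebook), a single case-insensitive substitution can collapse the variants:

```python
import re

# Hypothetical variant pattern and canonical form, chosen only for illustration.
VARIANTS = re.compile(r'\b(?:señor|sr)\b[.]?', flags=re.IGNORECASE)
CANONICAL = 'señor'

def normalize_title(text):
    # Replace every variant, with or without a trailing dot, by the canonical form.
    return VARIANTS.sub(CANONICAL, text)

print(normalize_title('Señor Pérez, Sr. Gómez y Sr López'))
```

Note that this sketch also consumes the dot of “Sr.”, so an abbreviation at the very end of a sentence would lose its sentence-end dot; a real implementation would need to handle that case.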

Tips:

  • The most important sign here is the sentence-end dot. (Abel2015)

  • The second most important sign is the underscore, “_”. This sign makes it possible to mark collocations for later text preprocessing.

  • A whitespace before and after every sentence-end dot makes the regular expressions used for tokenization simpler. (Abel2015)
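
The tips above can be illustrated with a short sketch. The helper names `pad_sentence_dots` and `mark_collocation` are hypothetical, and treating every dot followed by whitespace or end of text as a sentence end is a simplifying assumption:

```python
import re

def pad_sentence_dots(text):
    # Assumption: a dot followed by whitespace or end of text is a sentence end.
    # Surround it with exactly one space on each side.
    return re.sub(r'\s*[.](?=\s|$)\s*', ' . ', text)

def mark_collocation(text, collocation):
    # Join the words of a known collocation with underscores.
    return text.replace(collocation, collocation.replace(' ', '_'))

sample = 'He lives in the United States. He likes white wine.'
sample = mark_collocation(sample, 'United States')
sample = mark_collocation(sample, 'white wine')
# After padding, a plain whitespace split is enough to tokenize.
print(pad_sentence_dots(sample).split())
```

With the dots isolated and the collocations joined, a whitespace split yields the sentence-end dots and the collocations as single tokens.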

Preparing the scope for preprocessing

[1]:
import re
import string

LETTERS = ''.join([string.ascii_letters, string.digits])

Punctuation Signs

Signs out of ASCII & Latin1 range

This is an example of unusual quotation marks, outside the ASCII range, which often appear in rich texts. The function filters those quotation marks to avoid rare characters.

[2]:
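def punctuation_filter(text):
   # Replace stray characters and every kind of double quote with a space.
   text = re.sub(
                 u'(?:\xc2|\xa0)|'                             # Stray Â and non-breaking space.
                 u'(?:\\xe2\\x80\\x9d|\\xe2\\x80\\x9c)|'       # “” as mis-decoded UTF-8 bytes.
                 u'(?:\u201c|\u201d)|'                         # “” as Unicode characters.
                 u'(?:["]|[\'])'                               # Straight double & single quotes.
                 ,' ',text)
   # Replace curly single quotes with a straight apostrophe.
   text = re.sub(u'(?:\\xe2\\x80\\x99|\\xe2\\x80\\x98)|'       # ‘’ as mis-decoded UTF-8 bytes.
                 u'(?:\u2018|\u2019)'                          # ‘’ as Unicode characters.
                 ,'\'',text)
   # Replace en dashes with a spaced ASCII hyphen.
   text = re.sub(u'(?:\\xe2\\x80\\x93)|'                       # En dash – as mis-decoded UTF-8 bytes.
                 u'(?:\u2013)'                                 # En dash as a Unicode character.
                 ,' - ',text)
   return text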

Note: This function is only a small example; a more developed version is included in preprocess.punctuation. The most important detail is that texts in which these errors have not been cleaned will raise errors in the rest of the normalization pipeline.
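
For single-character replacements, a translation table is a possible alternative to chained substitutions. The sketch below is an assumption, not the code of preprocess.punctuation; the table `QUOTE_TABLE` and the function name are hypothetical and cover only part of the characters handled above:

```python
# Hypothetical translation table covering a subset of the characters above.
QUOTE_TABLE = str.maketrans({
    '\u201c': ' ',    # left curly double quote
    '\u201d': ' ',    # right curly double quote
    '"': ' ',         # straight double quote
    "'": ' ',         # straight single quote
    '\u2018': "'",    # left curly single quote
    '\u2019': "'",    # right curly single quote
    '\u2013': ' - ',  # en dash
    '\xa0': ' ',      # non-breaking space
})

def translate_punctuation(text):
    # One pass over the text; each character is mapped independently.
    return text.translate(QUOTE_TABLE)

print(translate_punctuation('\u201cHola\u201d \u2018x\u2019 a\u2013b'))
```

Because the mapping is applied in a single pass, the straight apostrophes produced from curly quotes are not removed again, which mirrors the order of the substitutions in the function above.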

The 3 dots sign …

The location of sentence end points is important for semantic analysis, because sentence tokenization depends on it. However, the three-dots sign is very complex to handle with regular expressions. Although it is not yet clear which pattern should replace it, the following code simply removes it.

Note: this code was problematic because of the whitespace that can appear between dots.

[3]:
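def del_contiguous_point_support(text):
   # match runs of dots that may be separated by whitespace, e.g. '...' or '. . .'
   for i in re.finditer(r'[.]\s*?[.]+?[\s.]*', text):
      # replacements keep the text length, so the match offsets stay valid
      for j in range(i.start(),i.end()):
         if text[j] == '.' or text[j]==' ':
            text = text[:j]+' '+text[j+1:]
   return text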
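
The same effect can be sketched with a single substitution that blanks each run of dots while preserving the text length. This is an illustrative alternative, not the notebook's method; the name `blank_ellipsis` and the exact pattern are assumptions:

```python
import re

# A dot followed by one or more further dots (optionally separated by
# whitespace), or the single-character ellipsis U+2026.
ELLIPSIS = re.compile(r'[.](?:\s*[.])+|\u2026')

def blank_ellipsis(text):
    # Replace the whole run by spaces of the same length, so offsets are kept.
    return ELLIPSIS.sub(lambda m: ' ' * len(m.group(0)), text)

print(repr(blank_ellipsis('Wait... ok.')))
```

A single sentence-end dot is left untouched, because the pattern requires at least two dots.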

Special Tokens

Changes at the morphological and lexical level.

Emails and Multi-Word Expressions

Some tokens, such as the email address pedro@gmail.com, or expressions like teaching - learning and Firefox-v0.8, must be preserved for their semantic value, either as nouns or as nominal phrases.

[5]:
def contiguos_string_recognition_support(text):
   text = re.sub(r'\n-', '\n- ', text)
   # support for email addresses is inside the regexp
   for i in re.finditer(r'[.]\w*|-\w*|@\w*', text):
      for j in range(i.start(), i.end()):
         # replace punctuation glued to the next character with '_'
         if (j < len(text)-1 and text[j] in string.punctuation
               and text[j+1] not in string.whitespace):
            text = text[:j]+'_'+text[j+1:]
   return text
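
To see the effect on email addresses in isolation, here is a minimal sketch restricted to that case. The `EMAIL` pattern is a simplified assumption (it is not a full address validator) and `protect_emails` is a hypothetical name:

```python
import re
import string

# Simplified, hypothetical email pattern; real addresses can be more complex.
EMAIL = re.compile(r'[\w.+-]+@[\w-]+(?:[.][\w-]+)+')

def protect_emails(text):
    def underscore(match):
        # Turn every punctuation sign inside the address into '_'.
        return ''.join('_' if c in string.punctuation else c
                       for c in match.group(0))
    return EMAIL.sub(underscore, text)

print(protect_emails('Write to pedro@gmail.com today.'))
```

The address becomes a single underscore-joined token, while the sentence-end dot after “today” is not touched.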

URLs

URLs are another kind of special token.

[6]:
def url_string_recognition_support(text):
   # raw strings keep the regex escapes (\S, \s) intact
   for i in re.finditer(r'www\S*(?=[.]+?\s+?)|www\S*(?=\s+?)|http\S*(?=[.]+?\s+?)'
                        +r'|http\S*(?=\s+?)',text):
      for j in range(i.start(),i.end()):
         if text[j] in string.punctuation:
            text = text[:j]+'_'+text[j+1:]
   return text

This function handles two URL situations: a URL followed by whitespace (expression www\S*(?=\s+?)), and a URL that is the final token of a sentence (expression www\S*(?=[.]+?\s+?)), e.g. “… www.google.com.”

Note: It is important that the parsed string (text) ends with at least one whitespace. Otherwise, in a case such as “text = ‘www.google.com’”, the regular expressions would also have to identify that ‘m’ is the end of the string. This would make the recognition function more complex, when the problem can actually be solved by adding a whitespace to the end of the string before parsing it. This is very simple to implement in the flow (see, e.g., the section add_text_end_dot).
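
For comparison, the following sketch shows what the extra complexity looks like when the end of the string is handled inside the pattern instead of padding the text. The pattern and the name `protect_urls` are assumptions made for illustration:

```python
import re
import string

# Lazy match that stops right before optional dots followed by
# whitespace or the end of the string.
URL = re.compile(r'(?:www|http)\S*?(?=[.]*(?:\s|$))')

def protect_urls(text):
    def underscore(match):
        # Turn every punctuation sign inside the URL into '_'.
        return ''.join('_' if c in string.punctuation else c
                       for c in match.group(0))
    return URL.sub(underscore, text)

print(protect_urls('Visit www.google.com.'))
print(protect_urls('www.google.com'))
```

Appending a whitespace before parsing, as the note suggests, remains the simpler option when the original patterns are kept.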

Acronyms and Abbreviations

Acronyms, abbreviations and similar forms are a special type of token. Recognizing them requires a well-polished dictionary, or perhaps a good algorithm to recognize some of them (current solutions are based on Machine Learning). However, there are several dictionaries, such as the LibreOffice one, that could be used and improved.

[7]:
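def abbrev_recognition_support(text):
   # dots inside abbreviations are escaped so they only match a literal '.'
   for i in re.finditer(r'Dr(?=[.]+?)|Ms[.]C(?=[.]+?)|Ph[.]D(?=[.]+?)|Ing(?=[.]+?)|Lic(?=[.]+?)',
                        text):
      # replace the dot that follows the abbreviation with '_'
      text = text[:i.end()]+'_'+text[i.end()+1:]
   return text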

Hypothesis: Algorithms that look up each token in a list or dictionary may be somewhat slower than regular expressions. This is because a lookup in a data structure is needed once for each token, whereas a regular expression scans and replaces over the complete text once for each pattern.
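
One way to combine both ideas is to compile a list of abbreviations into a single alternation, so the text is scanned only once. This is a sketch under assumptions: the `ABBREVIATIONS` list is a small hypothetical sample and `mark_abbreviations` is not part of the original code:

```python
import re

# Small hypothetical sample; a real list would come from a dictionary file.
ABBREVIATIONS = ['Dr', 'Ing', 'Lic', 'Ph.D', 'Ms.C']

# Longest entries first, each one escaped, followed by the closing dot.
ABBR_PATTERN = re.compile(
    r'\b(?:'
    + '|'.join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r')[.]')

def mark_abbreviations(text):
    # Replace the dot that closes the abbreviation with '_'.
    return ABBR_PATTERN.sub(lambda m: m.group(0)[:-1] + '_', text)

print(mark_abbreviations('Dr. Pérez and Ing. Ruiz.'))
```

The sentence-end dot after “Ruiz” is kept because “Ruiz” is not in the list.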

[8]:
# Pending: version 2 with a dictionary from LibreOffice or Google Translator.
abbr = open('data/abbr').read()
abbrDict = {}
pattern = ':'
for word in abbr.split('\n'):
    abbrDict[word] = word
print (len(abbrDict))

def abbr_filter(text, dic):
    ntext = ''
    for word in text.split(' '):
        if word in dic:
            word = dic[word]
        ntext = ntext + word + '_'
    return ntext
481

Profiling of Abbreviation Detection

[9]:
from time import perf_counter as clock  # time.clock() was removed in Python 3.8
text = '' # Build a test text.
for word in abbrDict:
    text += word+' '
for n in range(2):
    text += text

print (len(text))
11484
[10]:
print ('Expr')
start_time1=clock()
%timeit abbrev_recognition_support(text)
end_time1=clock()-start_time1
print ('Time based on Regular Expressions %.4f' %end_time1)
Expr
10000 loops, best of 3: 105 µs per loop
Time based on Regular Expressions 4.5335
[11]:
print ('Dict')
start_time2=clock()
%timeit abbr_filter(text,abbrDict)
end_time2=clock()-start_time2
print ('Time based on dictionaries %.4f' %end_time2)
Dict
1000 loops, best of 3: 1.07 ms per loop
Time based on dictionaries 4.5355

Profiling Result

Indeed, the dictionary-based search is about 10 times slower than the one based on regular expressions (1.07 ms versus 105 µs per loop), measured on a test text of more than 11,000 characters.
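
Outside a notebook, where the %timeit magic is not available, a similar comparison can be sketched with the standard timeit module. The sample text, the abbreviation set and both counting functions below are hypothetical stand-ins, so the timings will not match the figures above:

```python
import re
import timeit

# Hypothetical stand-ins for the abbreviation data and the test text.
SAMPLE_ABBR = {'Dr.', 'Ing.', 'Lic.'}
TEXT = 'Dr. Ana and Ing. Luis met Lic. Eva . ' * 200
PATTERN = re.compile(r'\b(?:Dr|Ing|Lic)(?=[.])')

def regex_count(text):
    # One scan over the whole text.
    return len(PATTERN.findall(text))

def dict_count(text):
    # One set lookup per token.
    return sum(1 for word in text.split(' ') if word in SAMPLE_ABBR)

regex_time = timeit.timeit(lambda: regex_count(TEXT), number=100)
dict_time = timeit.timeit(lambda: dict_count(TEXT), number=100)
print('regex: %.4f s, dict: %.4f s' % (regex_time, dict_time))
```

Both functions find the same number of abbreviations, so only the elapsed time differs; which one is faster depends on the data and on the Python version.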

Stopwords

Although stopwords carry little meaning within a sentence and generally act as connectors, we treat them separately because of their importance in NLP, mainly for computational efficiency and for the quality of similarity results.

[13]:
def del_char_len_one(text):
   # remove any single word character surrounded by whitespace
   text = re.sub(r'\s\w\s',' ',text)
   return text
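
A token-based variant can remove a stopword list and single characters in one pass while keeping the sentence-end dots. This is a sketch under assumptions: the `STOPWORDS` set is a tiny hypothetical sample and `remove_stopwords` is not part of the original code:

```python
# Tiny hypothetical stopword sample; real lists are much longer.
STOPWORDS = {'de', 'la', 'que', 'el', 'en', 'y', 'a'}

def remove_stopwords(text, stopwords=STOPWORDS):
    kept = []
    for token in text.split():
        if token.lower() in stopwords:
            continue  # drop stopwords
        if len(token) == 1 and token.isalnum():
            continue  # drop single letters or digits, but keep signs such as '.'
        kept.append(token)
    return ' '.join(kept)

print(remove_stopwords('La casa de x y z .'))
```

Working on tokens also removes consecutive single characters, which a pattern that consumes the surrounding whitespace can miss.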

Structural Normalization

The next function only adds a dot at the end of the document if there isn’t one. This avoids difficulties when tokenizing the last sentence.

add_text_end_dot

[13]:
def add_text_end_dot(text):
   # ignore trailing whitespace when looking for the final dot
   stripped = text.rstrip()
   # add a dot only if the last visible character is not already one
   if not stripped.endswith('.'):
      stripped += '.'
   return stripped

Normalization Pipeline

This process may differ depending on your final goal, that is, on the purpose your final data is designed for.

[15]:
import time
from nltk.tokenize import RegexpTokenizer, WordPunctTokenizer
from preprocess.punctuation import Replacer
from preprocess.data import tnlp1_path

inita = time.time()
doc_name = tnlp1_path()[:-4]
with open(doc_name+'.txt','r') as text:
    print('---------')
    #Count unique terms
    tokenizer = RegexpTokenizer(r"\s+", gaps=True)
    tokensa = tokenizer.tokenize(text.read())   # tokenize the file content, not the file object
    tokens_uniqueA = set(tokensa)

#-------------------Special tokens recognition and normalization
initg = time.time()

with open(doc_name+'.txt','r') as source:
    text = source.read()   # the functions below expect a string, not a file object
    print ('processing urls')
    text = url_string_recognition_support(text)
    print ('processing some special punctuation signs')
    text = punctuation_filter(text)
    print ('clean contiguous dots')
    text = del_contiguous_point_support(text)
    print ('abbrev recognition and normalization')
    #~ text = abbrev_recognition_support(text)
    print ('contiguous string recognition')
    # This step takes a long time; the reason still needs to be investigated.
    text = contiguos_string_recognition_support(text)

with open(doc_name+'1_normalized_tokens.txt', 'w') as txt:
    txt.write(text)

#-------------------Clean all punctuation sign
print ('- Cleaning punctuation signs.')
text = open('test/2.3/out_'+doc_name+'1_normalized_tokens.txt','r').read()
replacer = Replacer()
chunk = replacer.replace(text)

texto = open('test/2.3/out_'+doc_name+'2_tokens_including_points.txt','w')
texto.write(chunk)
texto.close()

text = open('test/2.3/out_'+doc_name+'2_tokens_including_points.txt','r').read()
tokenizer = RegexpTokenizer(r"\s+", gaps=True)
tokens = tokenizer.tokenize(text)

#Counting unique terms
tokens_uniqueD = set(tokens)

timeg = time.time() - initg

print ('-----CLEANING-------------: ', timeg)
print ('tokens data type is:', type(tokens))
print ("Number of tokens after cleaning is: ", len(tokens),
"\nDeleted "+str(len(tokens)-len(tokensa))+" tokens during cleaning.",
"\n Deleted uniques: ", len(tokens_uniqueD)-len(tokens_uniqueA))

text = open('test/2.3/out_'+doc_name+'2_tokens_including_points.txt', 'r').read()
text = add_text_end_dot(text)

texto = open('test/2.3/out_'+doc_name+'6_clean_punctuation.txt', 'w')
texto.write(text)
texto.close()

timefa = time.time() - inita
print ('Number of unique terms when filtering: ', len(tokens_uniqueD))

print ('Made in ', timefa)
print (time.ctime())
---------
processing urls
processing some special punctuation signs
clean contiguous dots
abbrev recognition and normalization
contiguous string recognition
- Cleaning punctuation signs.
-----CLEANING-------------:  0.01045370101928711
tokens data type is: <class 'list'>
Number of tokens after cleaning is:  886
Deleted 42 tokens during cleaning.
 Deleted uniques:  -28
Number of unique terms when filtering:  346
Made in  0.01274251937866211
Fri Sep  2 14:47:59 2016
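
The overall flow can also be expressed as an ordered list of steps applied to a string, which makes it easy to reorder or disable steps depending on the goal. The sketch below is a simplified assumption, with a hypothetical `run_pipeline` helper and stand-in steps instead of the functions and files used above:

```python
import time

def run_pipeline(text, steps):
    """Apply each (name, function) step in order and report its duration."""
    for name, step in steps:
        start = time.perf_counter()
        text = step(text)
        print('%s: %.6f s' % (name, time.perf_counter() - start))
    return text

# Stand-in steps, used here only so the sketch runs on its own.
steps = [
    ('strip outer whitespace', str.strip),
    ('collapse whitespace', lambda t: ' '.join(t.split())),
]
result = run_pipeline('  ACID   es un  acrónimo .  ', steps)
print(result)
```

In a real run, the list would hold the normalization functions defined in this notebook, in the order used above.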

Result Analysis

Comparing the algorithm's result with a human-normalized version of the same text.

[16]:
textout = open('test/2.3/out_'+doc_name+'6_clean_punctuation.txt').read()
texthuman = open('test/2.3/'+doc_name+'_human_analysis.txt').read()
lineout = []
linehuman=[]

for line in textout.split('.'):
   lineout.append(line)
for line in texthuman.split('.'):
   linehuman.append(line)

for i in range(15):#max(len(lineout),len(linehuman))):
   if i < len(lineout):
        print (lineout[i])
   if i < len(linehuman):
        print (linehuman[i])
   print  ('-----')
ACID
ACID
-----
 En bases de datos se denomina ACID a un conjunto de características necesarias para que una serie de instrucciones puedan ser consideradas como una transacción

En bases de datos se denomina ACID a un conjunto de características necesarias para que una serie de instrucciones puedan ser consideradas como una transacción
-----
 Así pues si un sistema de gestión de bases de datos es ACID compliant quiere decir que el mismo cuenta con las funcionalidades necesarias para que sus transacciones tengan las características ACID
 Así pues, si un sistema de gestión de bases de datos es ACID compliant quiere decir que el mismo cuenta con las funcionalidades necesarias para que sus transacciones tengan las características ACID
-----
 En concreto ACID es un acrónimo de Atomicity Consistency Isolation and Durability


En concreto ACID es un acrónimo de Atomicity, Consistency, Isolation and Durability
-----
 Atomicidad Consistencia Aislamiento y Durabilidad en español
 Atomicidad, Consistencia, Aislamiento y Durabilidad en español
-----
 Definiciones


Definiciones
-----
 Atomicidad es la propiedad que asegura que la operación se ha realizado o no y por lo tanto ante un fallo del sistema no puede quedar a medias

- Atomicidad: es la propiedad que asegura que la operación se ha realizado o no, y por lo tanto ante un fallo del sistema no puede quedar a medias
-----
 Se dice que una operación es atómica cuando es imposible para otra parte de un sistema encontrar pasos intermedios
 Se dice que una operación es atómica cuando es imposible para otra parte de un sistema encontrar pasos intermedios
-----
 Si esta operación consiste en una serie de pasos todos ellos ocurren o ninguno
 Si esta operación consiste en una serie de pasos, todos ellos ocurren o ninguno
-----
 Por ejemplo en el caso de una transacción bancaria o se ejecuta tanto el depósito como la deducción o ninguna acción es realizada
 Por ejemplo, en el caso de una transacción bancaria o se ejecuta tanto el depósito como la deducción o ninguna acción es realizada
-----
 Consistencia

- Consistencia
-----
 Integridad
 Integridad
-----
 Es la propiedad que asegura que sólo se empieza aquello que se puede acabar
 Es la propiedad que asegura que sólo se empieza aquello que se puede acabar
-----
 Por lo tanto se ejecutan aquellas operaciones que no van a romper las reglas y directrices de integridad de la base de datos
 Por lo tanto se ejecutan aquellas operaciones que no van a romper las reglas y directrices de integridad de la base de datos
-----
 La propiedad de consistencia sostiene que cualquier transacción llevará a la base de datos desde un estado válido a otro también válido
 La propiedad de consistencia sostiene que cualquier transacción llevará a la base de datos desde un estado válido a otro también válido
-----
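
The visual comparison above could be complemented with a similarity score per sentence pair. This is only a sketch, assuming that sentences are aligned by position after splitting on dots; `compare_sentences` is a hypothetical helper built on the standard difflib module:

```python
import difflib

def compare_sentences(machine_text, human_text):
    # Split on dots and drop empty fragments.
    machine = [s.strip() for s in machine_text.split('.') if s.strip()]
    human = [s.strip() for s in human_text.split('.') if s.strip()]
    # Assumption: sentences are aligned by position in both texts.
    return [(m, h, difflib.SequenceMatcher(None, m, h).ratio())
            for m, h in zip(machine, human)]

for m, h, score in compare_sentences('Hola mundo. Adios', 'Hola, mundo. Adios'):
    print('%.2f | %s | %s' % (score, m, h))
```

A ratio of 1.0 means both versions of the sentence are identical; lower values point to sentences worth reviewing by hand.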

References

[1] [Indurkhya2008] Nitin Indurkhya. Handbook of Natural Language Processing. 2008, p. 10. ISBN: 978-1-4200-8593-8

Alphabetic Index

Collocations: sequences of words that appear together very frequently and, because of that, become new linguistic codes. E.g. “black night”, “white wine”, “United States of America”, etc.