{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is a reviewed English version (2020) of \"*Normalización de Textos con Python.ipynb*\" from the collection [nlp_pydata2018 on GitHub](https://github.com/sorice/nlp_pydata2018/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Normalization with Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The objective of this notebook is to show _from-scratch_ functions\n", "written to teach NLP concepts; many of these examples are in fact\n", "small segments of real _text-preproc_ function code.\n", "\n", "**Text Normalization**: the subprocess of merging\n", "different ways of writing into a single appropriate and acceptable form; \n", "for example, a single document can contain the words \"Señor\", \"señor\", \n", "\"Sr.\", \"Sr\", and all of them must be normalized to a unique form.[[1](#Indurkhya2008)]\n", "\n", "**Tips**:\n", "\n", "* The most important sign here is the **sentence end dot**. (Abel2015)\n", "* The second most important sign is the **underscore** or \"_\". This sign allows marking\n", "  [collocations](#collocations) for later text preprocessing.\n", "* A whitespace before and after every *sentence end dot* simplifies the \n", "  Regular Expressions used to tokenize. (Abel2015)" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Preparing the scope for preprocessing**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import re\n", "import string\n", "\n", "LETTERS = ''.join([string.ascii_letters, string.digits])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Punctuation Signs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Signs out of ASCII & Latin1 range\n", "\n", "This is an example of rare quotation marks outside the ASCII range, which usually appear in rich text. This function filters those quotation marks to avoid rare characters." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "def punctuation_filter(text):\n", "    text = re.sub(\n", "        u'(?:\\xc2|\\xa0)|'  # Del stray Â and non-breaking spaces.\n", "        u'(?:\\xe2\\x80\\x9d|\\xe2\\x80\\x9c)|'  # Del “” in ascii.\n", "        u'(?:\u201c|\u201d)|'  # Del “” in utf8.\n", "        u'(?:[\"]|[\\'])'  # Delete double & single quotes\n", "        ,' ',text)\n", "    text = re.sub(u'(?:\\xe2\\x80\\x99|\\xe2\\x80\\x98)|'  # Replace ‘’ in ascii.\n", "        u'(?:\u2018|\u2019)'  # Replace ‘’ in utf8.\n", "        ,'\\'',text)\n", "    text = re.sub(u'(?:\\xe2\\x80\\x93)|'  # Replace rare hyphens (–) in ascii.\n", "        u'(?:\u2013)'  # Long hyphen utf8 encoding.\n", "        ,' - ',text)\n", "    return text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: This function is only a small example; a more developed version is included in ``preprocess.punctuation``. The most important detail is that texts in which these errors \n", "are not cleaned will raise errors in the rest of the normalization pipeline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The 3 dots sign ...\n", "\n", "For semantic analysis it is important to locate the sentence **end points**, which drive sentence tokenization. 
However, the three dots (ellipsis) are a very complex sign to handle with regular expressions.\n", "Although it is not yet clear which pattern should replace them, the following code\n", "removes them.\n", "\n", "**Note**: this code was problematic because of the whitespace between dots." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "def del_contiguous_point_support(text):\n", "    # two or more dots, optionally separated by whitespace, plus trailing dots/whitespace\n", "    for i in re.finditer(r'[.]\\s*?[.]+?[\\s.]*',text):\n", "        for j in range(i.start(),i.end()):\n", "            if text[j] == '.' or text[j]==' ':\n", "                text = text[:j]+' '+text[j+1:]\n", "    return text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Special Tokens\n", "\n", "*Changes at the morphological and lexical level.*\n", "\n", "### Emails and Multi-Word Expressions\n", "\n", "Some tokens, such as the email **pedro@gmail.com** or the expressions **teaching - learning** and\n", "**Firefox-v0.8**, must be preserved for their semantic value, either as nouns or as noun phrases." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "def contiguos_string_recognition_support(text):\n", "    text = re.sub('\\n-','\\n- ',text)\n", "    # support for email address is inside the regexp\n", "    for i in re.finditer(r'[.]\\w*|-\\w*|@\\w*',text):\n", "        for j in range(i.start(),i.end()):\n", "            if j<(len(text)-1) and text[j] in string.punctuation and text[\n", "                j+1] not in string.whitespace:\n", "                text = text[:j]+'_'+text[j+1:]\n", "    return text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### URLs\n", "\n", "URLs are another special token." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "def url_string_recognition_support(text):\n", "    for i in re.finditer(r'www\\S*(?=[.]+?\\s+?)|www\\S*(?=\\s+?)|http\\S*(?=[.]+?\\s+?)'\n", "                         +r'|http\\S*(?=\\s+?)',text):\n", "        for j in range(i.start(),i.end()):\n", "            if text[j] in string.punctuation:\n", "                text = text[:j]+'_'+text[j+1:]\n", "    return text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this function two URL situations are analyzed: a URL followed by a space (**Expr.** `www\\S*(?=\\s+?)`),\n", "and a URL as the final token of a sentence (**Expr.** `www\\S*(?=[.]+?\\s+?)`), **Eg.**: *... www.google.com.*\n", "\n", "**Note**: It is important that the parsed string (*text*) ends with at least one whitespace.\n", "Otherwise, in a case such as\n", "*\"text = 'www.google.com'\"*, the regular expressions would also have to identify that *'m'* is\n", "the end of the string.\n", "This would make the recognition function more complex, when the problem can actually be\n", "solved by adding a whitespace to the end of the string before parsing it.\n", "This is very simple to implement in the flow (see, **Eg.**, section\n", "[add_text_end_dot](#add_text_end_dot))." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Acronyms and Abbreviations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Acronyms, abbreviations, and similar forms** are a special type of token. Handling them requires\n", "a well-polished dictionary, or perhaps a good recognition algorithm (current solutions are based on Machine Learning). However, there are several dictionaries, such as the LibreOffice ones, that could be used and improved." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "def abbrev_recognition_support(text):\n", "    # replace the dot that follows a known abbreviation with '_'\n", "    for i in re.finditer('Dr(?=[.]+?)|Ms[.]C(?=[.]+?)|Ph[.]D(?=[.]+?)|Ing(?=[.]+?)|Lic(?=[.]+?)',\n", "                         text):\n", "        text = text[:i.end()]+'_'+text[i.end()+1:]\n", "    return text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hypothesis**: Algorithms that search for a string in a list or dictionary may be somewhat slower\n", "than regular expressions. This is because a lookup in the data structure\n", "is needed once for each token, whereas with regular expressions the complete text\n", "is scanned and replaced once for each pattern." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "481\n" ] } ], "source": [ "# Pending: version 2 with the LibreOffice or Google Translator dictionary.\n", "with open('data/abbr') as f:\n", "    abbr = f.read()\n", "abbrDict = {}\n", "pattern = ':'\n", "for word in abbr.split('\\n'):\n", "    abbrDict[word] = word\n", "print (len(abbrDict))\n", "\n", "def abbr_filter(text, dic):\n", "    ntext = ''\n", "    for word in text.split(' '):\n", "        if word in dic:\n", "            word = dic[word]\n", "        ntext = ntext + word + '_'\n", "    return ntext" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Profiling of Abbreviation Detection" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "11484\n" ] } ], "source": [ "# time.clock was removed in Python 3.8; perf_counter is its replacement\n", "from time import perf_counter as clock\n", "text = '' # Building a test text.\n", "for word in abbrDict:\n", "    text += word+' '\n", "for n in range(2):\n", "    text += text\n", "\n", "print (len(text))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Expr\n", "10000 loops, best of 3: 105 µs per loop\n", "Tiempo 
basado en expresiones regulares 4.5335\n" ] } ], "source": [ "print ('Expr')\n", "start_time1=clock()\n", "%timeit abbrev_recognition_support(text)\n", "end_time1=clock()-start_time1\n", "print ('Time based on Regular Expressions %.4f' %end_time1)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dict\n", "1000 loops, best of 3: 1.07 ms per loop\n", "Tiempo basado en uso de diccionarios 4.5355\n" ] } ], "source": [ "print ('Dict')\n", "start_time2=clock()\n", "%timeit abbr_filter(text,abbrDict)\n", "end_time2=clock()-start_time2\n", "print ('Time based on dictionaries %.4f' %end_time2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Profiling Result\n", "\n", "Indeed, the dictionary-based abbreviation search is about 10 times slower than the one based on\n", "regular expressions (1.07 ms versus 105 µs per loop), evaluated on a test text of 11484 characters.\n", "Note that the regular expression covers only five abbreviations while the dictionary holds 481 entries, so the comparison is only indicative." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stopwords" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although stopwords are essentially meaningless tokens within the sentence and generally act\n", "as connectors, we treat them separately because of their importance in NLP, fundamentally in the\n", "analysis of computational efficiency and of the quality of similarity results." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "def del_char_len_one(text):\n", "    # remove single-character tokens surrounded by whitespace\n", "    text = re.sub(r'\\s\\w\\s',' ',text)\n", "    return text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Structural Normalization\n", "\n", "The next function only adds a dot at the end of the document if there isn't one. 
This avoids difficulties when tokenizing the last sentence.\n", "\n", "### add_text_end_dot" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "outputs": [], "source": [ "def add_text_end_dot(text):\n", "    end = len(text)-1\n", "    i = 0\n", "    # walk back over the trailing non-alphanumeric characters, counting dots\n", "    while end >= 0 and text[end] not in LETTERS:\n", "        if text[end] == '.':\n", "            i+=1\n", "        end-=1\n", "    # if no dot follows the last letter or digit, add one '.'\n", "    if i==0:\n", "        text += '.'\n", "    return text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Normalization Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This process may differ depending on your final goal, that is, the task your final data is designed for." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "---------\n", "processing urls\n", "processing some special punctuation signs\n", "clean contiguous dots\n", "abbrev recognition and normalization\n", "contiguous string recognition\n", "- Limpiando los signos de puntuación.\n", "-----LIMPIEZA-------------: 0.01045370101928711\n", "El tipo de datos de tokens es: \n", "La cantidad de tokens después de limpiar es: 886 \n", "Eliminados 42 tokens durante la limpieza. 
\n", " Eliminados únicos: -28\n", "La cantidad de términos únicos al filtrar es: 346\n", "Finalizado en 0.01274251937866211\n", "Fri Sep 2 14:47:59 2016\n" ] } ], "source": [ "import time\n", "from nltk.tokenize import RegexpTokenizer, WordPunctTokenizer\n", "from preprocess.punctuation import Replacer\n", "from preprocess.data import tnlp1_path\n", "\n", "inita = time.time()\n", "doc_name = tnlp1_path()[:-4]\n", "with open(doc_name+'.txt','r') as f:\n", "    text = f.read()  # tokenize the file content, not the file object\n", "    print('---------')\n", "    #Count unique terms\n", "    tokenizer = RegexpTokenizer(r\"\\s+\", gaps=True)\n", "    tokensa = tokenizer.tokenize(text)\n", "    tokens_uniqueA = set(tokensa)\n", "\n", "#-------------------Special tokens recognition and normalization\n", "initg = time.time()\n", "\n", "with open(doc_name+'.txt','r') as f:\n", "    text = f.read()\n", "    print ('processing urls')\n", "    text = url_string_recognition_support(text)\n", "    print ('processing some special punctuation signs')\n", "    text = punctuation_filter(text)\n", "    print ('clean contiguous dots')\n", "    text = del_contiguous_point_support(text)\n", "    print ('abbrev recognition and normalization')\n", "    #~ text = abbrev_recognition_support(text)\n", "    print ('contiguous string recognition')\n", "    # This one takes long; the reason still needs to be investigated\n", "    text = contiguos_string_recognition_support(text)\n", "\n", "# written to the same path the next step reads from\n", "with open('test/2.3/out_'+doc_name+'1_normalized_tokens.txt', 'w') as txt:\n", "    txt.write(text)\n", "\n", "#-------------------Clean all punctuation signs\n", "print ('- Cleaning punctuation signs.')\n", "text = open('test/2.3/out_'+doc_name+'1_normalized_tokens.txt','r').read()\n", "replacer = Replacer()\n", "chunk = replacer.replace(text)\n", "\n", "texto = open('test/2.3/out_'+doc_name+'2_tokens_including_points.txt','w')\n", "texto.write(chunk)\n", "texto.close()\n", "\n", "text = open('test/2.3/out_'+doc_name+'2_tokens_including_points.txt','r').read()\n", "tokenizer = RegexpTokenizer(r\"\\s+\", gaps=True)\n", "tokens = tokenizer.tokenize(text)\n", "\n", "#Counting unique 
terms\n", "tokens_uniqueD = set(tokens)\n", "\n", "timeg = time.time() - initg\n", "\n", "print ('-----CLEANING-------------: ', timeg)\n", "print ('tokens data type is:', type(tokens))\n", "print (\"Number of tokens after cleaning is: \", len(tokens),\n", "       \"\\nDeleted \"+str(len(tokens)-len(tokensa))+\" tokens during cleaning.\",\n", "       \"\\n Deleted uniques: \", len(tokens_uniqueD)-len(tokens_uniqueA))\n", "\n", "text = open('test/2.3/out_'+doc_name+'2_tokens_including_points.txt', 'r').read()\n", "text = add_text_end_dot(text)\n", "\n", "texto = open('test/2.3/out_'+doc_name+'6_clean_punctuation.txt', 'w')\n", "texto.write(text)\n", "texto.close()\n", "\n", "timefa = time.time() - inita\n", "print ('Number of unique terms when filtering: ', len(tokens_uniqueD))\n", "\n", "print ('Made in ', timefa)\n", "print (time.ctime())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Result Analysis\n", "\n", "Comparing the algorithm's result with the human one." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ACID \n", "ACID \n", "-----\n", " En bases de datos se denomina ACID a un conjunto de características necesarias para que una serie de instrucciones puedan ser consideradas como una transacción \n", " \n", "En bases de datos se denomina ACID a un conjunto de características necesarias para que una serie de instrucciones puedan ser consideradas como una transacción \n", "-----\n", " Así pues si un sistema de gestión de bases de datos es ACID compliant quiere decir que el mismo cuenta con las funcionalidades necesarias para que sus transacciones tengan las características ACID \n", " Así pues, si un sistema de gestión de bases de datos es ACID compliant quiere decir que el mismo cuenta con las funcionalidades necesarias para que sus transacciones tengan las características ACID\n", "-----\n", " En concreto ACID es un acrónimo de Atomicity Consistency Isolation and Durability 
\n", "\n", "\n", "En concreto ACID es un acrónimo de Atomicity, Consistency, Isolation and Durability\n", "-----\n", " Atomicidad Consistencia Aislamiento y Durabilidad en español \n", " Atomicidad, Consistencia, Aislamiento y Durabilidad en español\n", "-----\n", " Definiciones \n", "\n", "\n", "Definiciones \n", "-----\n", " Atomicidad es la propiedad que asegura que la operación se ha realizado o no y por lo tanto ante un fallo del sistema no puede quedar a medias \n", " \n", "- Atomicidad: es la propiedad que asegura que la operación se ha realizado o no, y por lo tanto ante un fallo del sistema no puede quedar a medias\n", "-----\n", " Se dice que una operación es atómica cuando es imposible para otra parte de un sistema encontrar pasos intermedios \n", " Se dice que una operación es atómica cuando es imposible para otra parte de un sistema encontrar pasos intermedios\n", "-----\n", " Si esta operación consiste en una serie de pasos todos ellos ocurren o ninguno \n", " Si esta operación consiste en una serie de pasos, todos ellos ocurren o ninguno\n", "-----\n", " Por ejemplo en el caso de una transacción bancaria o se ejecuta tanto el depósito como la deducción o ninguna acción es realizada \n", " Por ejemplo, en el caso de una transacción bancaria o se ejecuta tanto el depósito como la deducción o ninguna acción es realizada\n", "-----\n", " Consistencia \n", "\n", "- Consistencia\n", "-----\n", " Integridad \n", " Integridad\n", "-----\n", " Es la propiedad que asegura que sólo se empieza aquello que se puede acabar \n", " Es la propiedad que asegura que sólo se empieza aquello que se puede acabar\n", "-----\n", " Por lo tanto se ejecutan aquellas operaciones que no van a romper las reglas y directrices de integridad de la base de datos \n", " Por lo tanto se ejecutan aquellas operaciones que no van a romper las reglas y directrices de integridad de la base de datos\n", "-----\n", " La propiedad de consistencia sostiene que cualquier transacción llevará a 
la base de datos desde un estado válido a otro también válido \n", " La propiedad de consistencia sostiene que cualquier transacción llevará a la base de datos desde un estado válido a otro también válido\n", "-----\n" ] } ], "source": [ "textout = open('test/2.3/out_'+doc_name+'6_clean_punctuation.txt').read()\n", "texthuman = open('test/2.3/'+doc_name+'_human_analysis.txt').read()\n", "lineout = []\n", "linehuman=[]\n", "\n", "# split both versions into sentences on the end dot\n", "for line in textout.split('.'):\n", "    lineout.append(line)\n", "for line in texthuman.split('.'):\n", "    linehuman.append(line)\n", "\n", "# print the first 15 sentence pairs: algorithm output first, human version second\n", "for i in range(15):#max(len(lineout),len(linehuman))):\n", "    if i < len(lineout):\n", "        print (lineout[i])\n", "    if i < len(linehuman):\n", "        print (linehuman[i])\n", "    print ('-----')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "\n", "[1] *[Indurkhya2008]* Nitin Indurkhya. Book **Handbook of Natural Language Processing**. 2008. \n", "p. 10 **ISBN**: 978-1-4200-8593-8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Alphabetic Index\n", "\n", "**Collocations**: sequences of words that appear together very frequently and, because of that, become new linguistic codes. Eg. “black night”, “white wine”, \n", "“United States of America”, etc." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 4 }