Quick Start

If preprocess is new for you, this guide will help you to run your first examples:

preprocess use some external library models, before _v0.5_ depends on NLTK and StanfordParser, which in turns depends on java. Java dependences must be installed by hand and the models must be collocated in the right place before starting the below installation.

NLP pretrained models are usually heavy, so preprocess do not include on it.

Config the enviroment

  1. Pretrained model downloading
  2. Extract models in /opt/stanford/

Note

By default preprocess/data/cfg/stanford.cfg is configured on the direction /opt/stanford.

Installation

preprocess is a Python 3 package and works well with 3.4 or later.

$ pip install preprocess

Using preprocess (Basic)

import preprocess
preprocess.basic.__all__

['lowercase',
 'replace_urls',
 'replace_symbols',
 'replace_dot_sequence',
 'multipart_words',
 'abbreviations',
 'expand_contractions',
 'replace_punctuation',
 'extraspace_for_endingpoints',
 'add_doc_ending_point',
 'del_tokens_len_one',
 'hyphenation']

Having a string as input all those functions are applicable.

Note

Some examples are designed based on real text obtained from pdf to text transformers.

string = "Free software is common in Europe... and UK it is not an exception"
preprocess.lowercase(string)

"free software is common in europe... and uk it is not an exception"

Replacing dot sequences. This function in particular, due to its uses in aligment, do not change the input string length in the output.

  • ... it is not considered a sentece separator so it is whiten.

  • three dots sequences are priority or two dots followed by a white space

string = ". ... a.... b. .. . c.. .. d.. . e"
preprocess.replace_dot_sequence(string)

'    . a   . b     . c      d   . e'

Underscore sign are used as indestructible punctuation sign during preprocessing. The multi-part words are hyphenated words ([Read more about hyphen](https://en.wikipedia.org/wiki/Hyphen)) and are very important in compound-modifier constructions, this are morphological changes which modifies the meaning of another word.

string = "Hyphen multi-part words"
preprocess.multipart_words(string)

'Hyphen multi_part words

Another example of underscoring function in preprocess is hyphenation, but its uses has been designed after having a list of tuples with collocations. Se more about collocations in preprocess.grams.collocations

collocations = [("natural", "language")]
string = "In natural language processing multi-words are important"
preprocess.hypenation(string, collocations)

'In natural_language processing multi-words are important'

Combining Techniques

This is a basic example of how to execute a designed order of basic preprocessing.

order = ['abbreviations', 'lowercase', 'replace_dot_sequence', 'multipart_words', 'replace_punctuation']
string = "Free-software is common in Europe... and U.K. it is not an exception."
preprocess.pipeline(string, flow=order)

'free_software is common in europe and u_k_ is not an exception '

Note

Not all the preprocess function in all modules can be concatenated.

Ngrams uses

string = 'free software is common in europe'
preprocess.ngrams(string)

['free software ', 'software is ', 'is common ', 'common in ', 'in europe ']
preprocess.ngrams(string, n=3)

['free software is ',
 'software is common ',
 'is common in ',
 'common in europe ']

Using Shallow

string = 'free software is common in europe'
preprocess.shallow.remove_stopwords(string)

' free software common europe'

Deep example

string = 'free software is common in europe'
preprocess.deep.syntdep(string)

[('common', 'nsubj', 'software'),
 ('software', 'amod', 'free'),
 ('common', 'cop', 'is'),
 ('common', 'nmod', 'europe'),
 ('europe', 'case', 'in')]

Working with PDF

The variety of pdf generators made “not standard” pdf documents, so in the reverse process those resultant text come with rare chars, strange puntuation-signs sequences, page brakes, numbers inside the string sequences, line brakes which divide words syllables, etc.

preprocess include Creative Commons books in PDF and text format to be preprocessed and to compare results with other libraries and platforms.

from preprocess.data import freeculture_pdfpath
pdf_path = freeculture_pdfpath()
text = preprocess.utils.io.pdftotxt(pdf_path)
len(text)

596767

Hopefully this examples gives you an idea of how to transform texts with preprocess library. The preprocessing stage is important to improve results as Applying Preprocessing tutorial shows. It is recommended that after this quickstart you read the preprocess API to learn complicated examples.