.. -*- mode: rst -*- Quick Start ============ If preprocess is new for you, this guide will help you to run your first examples: :mod:`preprocess` use some external library models, before _v0.5_ depends on NLTK and StanfordParser, which in turns depends on java. Java dependences must be `installed by hand `_ and the models must be collocated in the right place before starting the below installation. NLP pretrained models are usually heavy, so :mod:`preprocess` do not include on it. Config the enviroment ----------------------- 1. Pretrained model downloading - download `StanfordParser Syntactic Parser models `_ - download `StanfordParser POS Tagger models `_ - download `StanfordParser NER models `_ 2. Extract models in /opt/stanford/ .. NOTE:: By default *preprocess/data/cfg/stanford.cfg* is configured on the direction */opt/stanford*. Installation ------------ :mod:`preprocess` is a Python 3 package and works well with 3.4 or later. .. code-block:: bash $ pip install preprocess Using preprocess (Basic) ------------------------ .. code-block:: python import preprocess preprocess.basic.__all__ ['lowercase', 'replace_urls', 'replace_symbols', 'replace_dot_sequence', 'multipart_words', 'abbreviations', 'expand_contractions', 'replace_punctuation', 'extraspace_for_endingpoints', 'add_doc_ending_point', 'del_tokens_len_one', 'hyphenation'] Having a string as input all those functions are applicable. .. Note:: Some examples are designed based on real text obtained from pdf to text transformers. .. code-block:: python string = "Free software is common in Europe... and UK it is not an exception" preprocess.lowercase(string) "free software is common in europe... and uk it is not an exception" Replacing dot sequences. This function in particular, due to its uses in aligment, do not change the input string length in the output. * ``...`` it is not considered a sentece separator so it is whiten. * three dots sequences are priority or two dots followed by a white space .. code-block:: python string = ". ... a.... b. .. . c.. .. d.. . e" preprocess.replace_dot_sequence(string) ' . a . b . c d . e' Underscore sign are used as indestructible punctuation sign during preprocessing. The multi-part words are hyphenated words ([Read more about hyphen](https://en.wikipedia.org/wiki/Hyphen)) and are very important in compound-modifier constructions, this are morphological changes which modifies the meaning of another word. .. code-block:: python string = "Hyphen multi-part words" preprocess.multipart_words(string) 'Hyphen multi_part words Another example of underscoring function in preprocess is ``hyphenation``, but its uses has been designed after having a list of tuples with collocations. Se more about ``collocations`` in `preprocess.grams.collocations <../_modules/preprocess/grams/collocations.html#Collocations>`_ .. code-block:: python collocations = [("natural", "language")] string = "In natural language processing multi-words are important" preprocess.hypenation(string, collocations) 'In natural_language processing multi-words are important' Combining Techniques -------------------- This is a basic example of how to execute a designed order of basic preprocessing. .. code-block:: python order = ['abbreviations', 'lowercase', 'replace_dot_sequence', 'multipart_words', 'replace_punctuation'] string = "Free-software is common in Europe... and U.K. it is not an exception." preprocess.pipeline(string, flow=order) 'free_software is common in europe and u_k_ is not an exception ' .. Note:: Not all the preprocess function in all modules can be concatenated. Ngrams uses ----------- .. code-block:: python string = 'free software is common in europe' preprocess.ngrams(string) ['free software ', 'software is ', 'is common ', 'common in ', 'in europe '] .. code-block:: python preprocess.ngrams(string, n=3) ['free software is ', 'software is common ', 'is common in ', 'common in europe '] Using Shallow ------------- .. code-block:: python string = 'free software is common in europe' preprocess.shallow.remove_stopwords(string) ' free software common europe' Deep example ------------- .. code-block:: python string = 'free software is common in europe' preprocess.deep.syntdep(string) [('common', 'nsubj', 'software'), ('software', 'amod', 'free'), ('common', 'cop', 'is'), ('common', 'nmod', 'europe'), ('europe', 'case', 'in')] Working with PDF ---------------- The variety of pdf generators made *"not standard"* pdf documents, so in the reverse process those resultant text come with rare chars, strange puntuation-signs sequences, page brakes, numbers inside the string sequences, line brakes which divide words syllables, etc. ``preprocess`` include *Creative Commons* books in PDF and text format to be preprocessed and to compare results with other libraries and platforms. .. code-block:: python from preprocess.data import freeculture_pdfpath pdf_path = freeculture_pdfpath() text = preprocess.utils.io.pdftotxt(pdf_path) len(text) 596767 Hopefully this examples gives you an idea of how to transform texts with preprocess library. The preprocessing stage is important to improve results as `Applying Preprocessing `_ tutorial shows. It is recommended that after this *quickstart* you read the `preprocess API <../api/index.html>`_ to learn complicated examples.