Preprocessing and API

This is the preprocess API documentation. This section covers some advanced definitions of NLP techniques, also which of them are preprocessing techniques and the complete currently available set of preprocessing techniques.

NLP techniques

Techniques grouping in this library is based in the proposal contained in [Chong2013]. This cathegorization divide preprocessing techniques in 3 groups: basic, shallow and deep. In this work ngrams are not included in the mentioned groups, but ngrams [Shannon1948] is a way to represent texts very used joined to some preprocessing techniques E.g. stopword ngrams and syntactic ngrams. Due to the variety of ngram combinations techniques and the length of some of the codes to implement it, this technique is included in a separated module called grams.

Only two extra modules are included in preprocess: data and utils. The data contains all the congifuration data, the books used in the examples and some small text to show how alignment works, and so on. The utils module contains some helper functions to reduce the complexity of the preprocessing pipeline: convert from pdf to text, align preprocessed text with the original, etc.

The main objective of this library is to obtain the original text in a suitable form for later NLP tasks like: Text Similarity, Text Classification, Automatic translation, Text Retrieval, etc.

Modules

References

Shannon1948

C.E. Shannon, A mathematical theory of communications, The Bell Systems Tech. J. 27 (1948), 379-423.

Chong2013

Chong M. M. Y. A Study on Plagiarism Detection and Plagiarism Direction Identification Using Natural Language Processing Techniques, University of Wolverhampton, 2013, PhD Thesis, 326 pags.