Preprocess: Wrangling Text

preprocess is a Python library for natural language tasks. It was written mainly to bring together NLP sub-processes that are still scattered across the Python ecosystem: stopword removal, hyphenation, POS tagging, general text normalization, and so on. Although the current trend is to convert everything to a Pandas DataFrame, this project started before that trend existed, when the Sklearn library needed Bunch objects and some of my colleagues were using Weka (Java) and the ARFF format, so comparing performance was a nightmare. To accomplish this, the library integrates methods from NLTK, Spacy and others.

Recommended Learning Path

  1. Quick Start.

  2. Bases of Text Normalization.

  3. Applying Preprocess.

  4. preprocess and NLTK Entangled (edu/entangled-nltk).

  5. Playful Programming: Filtering Important Words with Luhn.

Contributing

Interested in contributing to preprocess? preprocess is a welcoming, inclusive project and we would love to have you. We follow the Python Software Foundation Code of Conduct.

No matter your level of technical skill, you can be helpful. We appreciate bug reports, user testing, feature requests, bug fixes, product enhancements, and documentation improvements.

Check out the Contributing guide!

If you’ve signed up to do user testing, head over to collab/evaluation.

Thank you for your contributions!

Concepts & API

Preprocessing Techniques

These are wrangling techniques that convert an original text object into a modified text suitable for NLP tasks. They are usually not complicated to understand, but their diversity and the many ways they can be combined make them difficult to apply or to program, because they are spread across many libraries, or are not publicly available as open source and are only mentioned in scientific papers.

Basic Techniques

  • Normalize Module: contains what are usually called normalization techniques

  • api/basic/punctuation: set of regular expressions for handling punctuation marks

  • api/basic/hypen: changes some sign conventions that modify words, especially to identify collocations

  • api/basic/symbols: substitutes rare characters that represent symbols and frequently appear in math texts
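
To make the three basic techniques more concrete, here is a minimal sketch using only Python's standard re module. It does not show the preprocess API: the function names (normalize_punctuation, join_hyphenated, replace_symbols, basic_clean), the regular expressions and the symbol table are all assumptions made for illustration.

```python
import re

# Hypothetical helpers for illustration only; they are NOT the preprocess API.

# Assumed mapping from rare math characters to plain-text replacements.
SYMBOLS = {"≤": " <= ", "≥": " >= ", "×": " * ", "∞": " infinity "}


def normalize_punctuation(text):
    """Collapse runs of the same punctuation mark and drop spaces before marks."""
    text = re.sub(r"([!?.,;:])\1+", r"\1", text)  # "!!!" becomes "!"
    return re.sub(r"\s+([!?.,;:])", r"\1", text)  # "5 ," becomes "5,"


def join_hyphenated(text):
    """Replace hyphens between word characters so a collocation stays one token."""
    return re.sub(r"(?<=\w)-(?=\w)", "_", text)


def replace_symbols(text):
    """Substitute rare characters that stand for math symbols."""
    for char, replacement in SYMBOLS.items():
        text = text.replace(char, replacement)
    return text


def basic_clean(text):
    """Chain the three steps, then collapse any leftover whitespace."""
    text = replace_symbols(join_hyphenated(normalize_punctuation(text)))
    return re.sub(r"\s+", " ", text).strip()


print(basic_clean("Wait!!!  The state-of-the-art result is x ≤ 5 , right ??"))
# prints: Wait! The state_of_the_art result is x <= 5, right?
```

Each step is a plain function from text to text, so steps can be reordered or dropped depending on the task. Note that this simple punctuation rule also reduces an ellipsis to a single period, which may or may not be wanted.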

Future Goals

  • Implement or reuse the best-performing open source methods.

  • Implement some techniques mentioned in scientific papers but not available in popular open source Python libraries.

  • Eventually make preprocess independent from NLTK, and less complex.

  • If testing shows that Spacy or another library offers lighter, better-performing models, drop the stanfordParser models, which are very heavy and Java dependent.

  • Create a mechanism to find the best combination of preprocessing techniques for a given NLP task, reusing the ideas of the Sklearn Pipeline and GridSearch parameter tuning. If possible, integrate it with the Sklearn library as an independent module.
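
The last goal is not implemented yet, so the following is only a rough sketch of the idea and not part of preprocess. It uses the standard library alone and replaces a real grid search with an exhaustive search over subsets of steps; the names compose and search_combinations, the two example steps and the toy scoring function are all hypothetical.

```python
import re
from itertools import combinations


def compose(steps):
    """Chain text-to-text callables into a single callable, pipeline style."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run


def search_combinations(steps, score, samples):
    """Try every non-empty subset of steps, keeping their given order,
    and return the names and score of the best-scoring subset."""
    best_names, best_score = (), float("-inf")
    names = list(steps)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            pipeline = compose([steps[name] for name in subset])
            value = score([pipeline(sample) for sample in samples])
            if value > best_score:
                best_names, best_score = subset, value
    return best_names, best_score


# Toy setup (assumed): two steps, and a score that rewards a smaller vocabulary.
steps = {
    "lower": str.lower,
    "strip_punct": lambda text: re.sub(r"[^\w\s]", "", text),
}
samples = ["Hello, world!", "hello world"]


def vocabulary_score(texts):
    return -len({token for text in texts for token in text.split()})


best, value = search_combinations(steps, vocabulary_score, samples)
print(best, value)
# prints: ('lower', 'strip_punct') -2
```

In practice the score would come from the downstream NLP task, for example the cross-validated accuracy of a classifier, and the number of subsets grows exponentially with the number of steps, which is one reason to reuse an existing tuning mechanism rather than a hand-written loop.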