Utils Module

Helper functions of the preprocessing library.

This module includes some extra functions that complement the shallow and deep modules.

Module author: Abel Meneses-Abad <abelma1980@gmail.com>

preprocess.utils.pdftotxt(path: str, pages=None, out=None) -> str

Convert a PDF to plain text using the PDFMiner library.

preprocess.utils.pipeline(text: str, flow=None) -> str

A convenience function that builds a full pipeline from the subprocesses the user chooses. Consult the restriction matrix to see which sequences of subprocesses are impossible.

Parameters

text: string to parse, generally a sentence.

flow: list of strings giving the ordered sequence of subprocesses to apply.

Returns

parsed result: string output

The initial text, preprocessed with the techniques defined in the pipeline.
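
The ordered-steps idea described above can be illustrated with a minimal sketch. It is not the library's implementation: the step names, the STEPS registry, and run_pipeline are hypothetical and only show how a list of step names might be applied to a text in sequence.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical step registry; these names are illustrative only and are
# not the subprocess names used by preprocess.utils.pipeline.
STEPS: Dict[str, Callable[[str], str]] = {
    "strip": str.strip,
    "lowercase": str.lower,
    "collapse_spaces": lambda s: " ".join(s.split()),
}


def run_pipeline(text: str, flow: Optional[List[str]] = None) -> str:
    """Apply the named steps to text, in the order given by flow."""
    for name in flow or []:
        if name not in STEPS:
            raise ValueError(f"unknown step: {name!r}")
        text = STEPS[name](text)
    return text


result = run_pipeline("  Hello   World  ", ["strip", "lowercase", "collapse_spaces"])
print(result)  # hello world
```

With an empty or missing flow, the sketch returns the text unchanged.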

preprocess.utils.ptb2universal(tagged_text: list) -> list

Convert the Penn Treebank extended POS tag set (36 tags) into the universal tag set (12 tags).

Parameters

tagged_text: list

A list of (word, POS tag) pairs returned by a POS tagger in the extended form, e.g. VBD, VBG, VBN, VBP, VBZ.

Returns

new_text: list

The same list of (word, POS tag) pairs in the universal form; for example, the tags listed above are all replaced by VERB.
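
As a rough illustration of the conversion, the sketch below maps a subset of Penn Treebank tags to universal tags with a plain dictionary. The table is deliberately partial and the function name to_universal is hypothetical; the library's own mapping may differ, in particular for punctuation and rare tags, which this sketch simply sends to X.

```python
from typing import List, Tuple

# Partial Penn Treebank -> universal mapping, for illustration only.
PTB_TO_UNIVERSAL = {
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB", "MD": "VERB",
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "PRP": "PRON", "PRP$": "PRON",
    "DT": "DET", "IN": "ADP", "CC": "CONJ", "CD": "NUM",
}


def to_universal(tagged_text: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Replace each extended tag with its universal tag (X if unmapped here)."""
    return [(word, PTB_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged_text]


tagged = [("She", "PRP"), ("was", "VBD"), ("running", "VBG"), ("fast", "RB")]
print(to_universal(tagged))
# [('She', 'PRON'), ('was', 'VERB'), ('running', 'VERB'), ('fast', 'ADV')]
```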