Utils Module
Helper functions for the preprocessing library.
This module provides extra functions that complement the shallow and deep modules.
Module author: Abel Meneses-Abad <abelma1980@gmail.com>
- preprocess.utils.pdftotxt(path: str, pages=None, out=None) → str
Converts a PDF file to plain text using the PDFMiner library.
- preprocess.utils.pipeline(text: str, flow=None) → str
A convenience function that runs a full pipeline built from the subprocesses the user chooses. Consult the restriction matrix to see which sequences of subprocesses are impossible.
Parameters
- text: str
The string to parse, generally a sentence.
- flow: list of str
The ordered sequence of subprocesses to apply.
Returns
- result: str
The initial text, preprocessed with the techniques defined in the pipeline.
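The idea of applying an ordered list of named steps to a text can be sketched as follows. This is a minimal illustration, not the library's implementation: the step names (`lowercase`, `strip`, `collapse_whitespace`), the `STEPS` registry, and the function `pipeline_sketch` are all hypothetical, and the restriction matrix is not modeled.

```python
# Hypothetical registry of named steps; these are NOT the library's
# actual subprocess names, only stand-ins for illustration.
STEPS = {
    "lowercase": str.lower,
    "strip": str.strip,
    "collapse_whitespace": lambda s: " ".join(s.split()),
}


def pipeline_sketch(text, flow=None):
    """Apply each named step in `flow` to `text`, in order."""
    for name in flow or []:
        if name not in STEPS:
            raise ValueError(f"unknown step: {name!r}")
        text = STEPS[name](text)
    return text


print(pipeline_sketch("  Hello   WORLD ", ["lowercase", "collapse_whitespace"]))
# prints: hello world
```

Because each step receives the output of the previous one, the order of the list matters, which is presumably why some sequences are ruled out by the restriction matrix.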
- preprocess.utils.ptb2universal(tagged_text: list) → list
Converts the extended Penn Treebank POS tag set (36 tags) into the universal tag set (12 tags).
Parameters
- tagged_text: list
A list of (word, POS tag) pairs returned by a POS tagger using the extended tag set, e.g. VBD, VBG, VBN, VBP, VBZ.
Returns
- new_text: list
The same list of (word, POS tag) pairs in the universal form; for example, the tags above are all replaced by VERB.
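The conversion amounts to a dictionary lookup on each tag. The sketch below uses a partial, illustrative mapping and a hypothetical function name, `ptb_to_universal_sketch`. It is not the library's own table, and mapping unlisted tags to X is an assumption made here for brevity.

```python
# Partial, illustrative mapping from Penn Treebank tags to universal tags.
# The real tag set has 36 tags; only a common subset is listed here.
PTB_TO_UNIVERSAL = {
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB", "MD": "VERB",
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "DT": "DET", "IN": "ADP", "CD": "NUM", "CC": "CONJ",
}


def ptb_to_universal_sketch(tagged_text):
    """Return the (word, tag) pairs with each tag mapped to its universal form.

    Tags missing from the partial table fall back to "X" (an assumption).
    """
    return [(word, PTB_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged_text]


tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD")]
print(ptb_to_universal_sketch(tagged))
# prints: [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB')]
```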