Ngrams Module¶

Module for preprocessing techniques that split text into n-grams: syntactic n-grams, skip-grams, stopword n-grams, etc.

preprocess.grams.contextual_ngrams(text, n, multioutput='raw_value')[source]¶

Generates a special kind of n-grams, also called CTnG (contextual n-grams).

These n-grams are formed by first sorting the words, then removing stopwords and tokens of length one, stemming, and sorting the n-grams [RdguezTorrejon2010b].

References¶

RdguezTorrejon2010b: Diego A. Rodríguez Torrejon & José Manuel Martín Ramos. (2010b). Detección de plagio en documentos. Sistema externo monolingüe de altas prestaciones basado en n-gramas contextuales. Procesamiento del Lenguaje Natural, 45:49–57
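The steps described for contextual_ngrams can be sketched in plain Python. This is only an illustration, not the module's implementation: the stopword set, the `toy_stem` helper and the name `contextual_ngrams_sketch` are hypothetical, lowercasing is an added assumption, and "sorting" is interpreted here as ordering the tokens alphabetically inside each n-gram.

```python
# Hypothetical, tiny stopword list used only for this illustration.
STOPWORDS = {"the", "a", "of", "in", "is", "and", "to"}


def toy_stem(token):
    """Crude stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def contextual_ngrams_sketch(text, n):
    """Build CTnG-like n-grams: filter, stem, then sort inside each n-gram."""
    # Lowercase (assumption), drop stopwords and one-character tokens.
    tokens = [t for t in text.lower().split()
              if t not in STOPWORDS and len(t) > 1]
    stems = [toy_stem(t) for t in tokens]
    # Slide a window of size n and sort the tokens of each window.
    return [" ".join(sorted(stems[i:i + n]))
            for i in range(len(stems) - n + 1)]


print(contextual_ngrams_sketch("the cats chased a mouse in the garden", 2))
```

Because the tokens of each n-gram are sorted, two passages that use the same content words in a different order yield the same n-gram under this interpretation.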

preprocess.grams.ngrams(text, n=2, gram_type='tokens', multioutput='raw_value')[source]¶

Generate the list of n-grams.

Parameters¶

text : str

String to parse, generally a sentence.

gram_type : str

Select the type of grams. String in ['chars', 'tokens'].

multioutput : str

Format type of the output. String in ['raw_value', 'tuple_list'].

raw_value - list of n-grams in string format. E.g.: 'a b c'
tuple_list - list of n-grams in tuple format. E.g.: ('a', 'b', 'c')
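The two output formats can be illustrated with a minimal pure-Python sketch. It is not the module's code: `ngrams_sketch` is a hypothetical name, whitespace tokenization is assumed for 'tokens', and joining character grams without a separator in 'raw_value' mode is also an assumption.

```python
def ngrams_sketch(text, n=2, gram_type="tokens", multioutput="raw_value"):
    """Return the n-grams of `text` as strings or as tuples."""
    if gram_type == "tokens":
        items = text.split()          # assumption: whitespace tokenization
    elif gram_type == "chars":
        items = list(text)
    else:
        raise ValueError("gram_type must be 'chars' or 'tokens'")

    # Slide a window of size n over the items.
    grams = [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

    if multioutput == "tuple_list":
        return grams
    if multioutput == "raw_value":
        sep = " " if gram_type == "tokens" else ""   # assumption for chars
        return [sep.join(g) for g in grams]
    raise ValueError("multioutput must be 'raw_value' or 'tuple_list'")


print(ngrams_sketch("a b c d", n=3))
print(ngrams_sketch("a b c d", n=3, multioutput="tuple_list"))
```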

preprocess.grams.skipgrams(text, n, k, gram_type='tokens')[source]¶

Returns all possible skipgrams generated from a sequence of items, as an iterator. Skipgrams are n-grams that allow tokens to be skipped. Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]

Parameters

sequence (sequence or iter) – the source data to be converted into skipgrams
n (int) – the degree of the ngrams
k (int) – the skip distance

Return type

iter(tuple)
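How n and k interact can be seen in a small standalone sketch built on itertools.combinations. The name `skipgrams_sketch` is hypothetical and this is not necessarily how the module computes its result; it only reproduces the outputs shown in the examples above for token sequences.

```python
from itertools import combinations


def skipgrams_sketch(tokens, n, k):
    """Yield n-token skipgrams that skip at most k tokens in total."""
    tokens = list(tokens)
    for i in range(len(tokens)):
        head = (tokens[i],)
        # The remaining n - 1 tokens are chosen, in order, from the
        # n + k - 1 tokens that follow the head.
        window = tokens[i + 1:i + n + k]
        for tail in combinations(window, n - 1):
            yield head + tail


sent = "Insurgents killed in ongoing fighting".split()
print(list(skipgrams_sketch(sent, 2, 2)))
```

With k = 0 the window contains exactly n - 1 tokens, so the sketch reduces to ordinary n-grams.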

preprocess.grams.sngrams(st, text, N=2)[source]¶

Syntactic Ngrams

It is a novel technique that combines n-grams with dependency trees [Sidorov2012].

Parameters¶

st: tree: syntactic tree generated by the Stanford syntactic parser.
text: str: text to process
N: int: Length of the gram

Return¶

sn_grams: list: The list of syntactic dependency tuples of length N.

References¶

Sidorov2012: Grigori Sidorov et al. (2012). Syntactic N-grams as Machine Learning Features for Natural Language Processing. Journal Expert Systems with Applications, 4(3): 853-860. Elsevier.
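The idea behind syntactic n-grams, following arcs of the dependency tree instead of the surface word order, can be sketched without a parser. The input format below (a dict mapping each head word to its dependents) and the name `sngrams_sketch` are hypothetical; sngrams itself takes a tree produced by the Stanford parser, and its exact traversal may differ. Using words as keys also assumes no word is repeated in the sentence.

```python
def sngrams_sketch(children, n=2):
    """Return all head-to-dependent paths of length n in a dependency tree.

    `children` maps a head word to the list of its direct dependents.
    """
    def paths_from(node, length):
        if length == 1:
            return [(node,)]
        return [(node,) + rest
                for child in children.get(node, [])
                for rest in paths_from(child, length - 1)]

    # Every word of the tree, heads first, without duplicates.
    nodes = dict.fromkeys(
        list(children) + [c for deps in children.values() for c in deps])
    return [path for node in nodes for path in paths_from(node, n)]


# Toy tree for "I eat red apples": eat -> I, eat -> apples, apples -> red
tree = {"eat": ["I", "apples"], "apples": ["red"]}
print(sngrams_sketch(tree, 2))
print(sngrams_sketch(tree, 3))
```

In the toy sentence, "eat apples" is produced even though the two words are not adjacent in the text, which is the property that distinguishes syntactic n-grams from surface n-grams.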

preprocess.grams.stopword_ngrams(text, n, lang='en', stops_path='', multioutput='raw_value')[source]¶

N-grams obtained by filtering out all non-stopwords, also called SWNG [Stamatatos2011b].

References¶

Stamatatos2011b: Stamatatos, Efstathios (2011). Plagiarism Detection Using Stopword n-grams. Journal of the American Society for Information Science and Technology, 62(12):2512–2527.
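A minimal sketch of the stopword n-gram idea: keep only the stopwords, in their original order, and build n-grams from what remains. The stopword set and the name `stopword_ngrams_sketch` are hypothetical; the real function selects its stopword list through lang and stops_path, and lowercasing is an assumption of this sketch.

```python
# Hypothetical, tiny English stopword list used only for this illustration.
STOPWORDS = {"the", "on", "and", "it", "was", "of", "in", "a", "to", "is"}


def stopword_ngrams_sketch(text, n, stopwords=STOPWORDS):
    """Return n-grams built from the stopwords of `text` only."""
    kept = [t for t in text.lower().split() if t in stopwords]
    return [" ".join(kept[i:i + n]) for i in range(len(kept) - n + 1)]


print(stopword_ngrams_sketch("The cat sat on the mat and it was happy", 3))
```

Since content words are discarded, the resulting n-grams capture the syntactic skeleton of a passage, which tends to survive when the content words are replaced by synonyms.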

class preprocess.grams.Collocations(text, ngrams=2, stopwords=True, lang='en')[source]¶

Collocations are a kind of n-gram: a pair or group of words that are habitually juxtaposed, e.g. 'strong coffee', 'black night'.

This class contains methods that return the most important tokens based on different metrics.

Parameters¶

text: str: The text or list of text names to be processed.
ngrams: int: Number of words each collocation must have; one of [2, 3, 4].
stopwords: bool: Preprocess texts with/without stop words.
lang: str: Language of the texts [‘en’, ‘es’].

Attributes¶

list: list: Array with collocations.

Examples¶

>>> from preprocess.grams import Collocations
>>> from preprocess.demo import preProcessFlow
>>> from preprocess.data import load_culturalibre
>>> book = load_culturalibre()
>>> txt = preProcessFlow(book)
>>> collocations = Collocations(txt)

Show the first 10 collocations:

>>> collocations.head(10)
[('Cultura', 'libre'),
('disponible', 'enlace'),
('dominio', 'público'),
('Tribunal', 'Supremo'),
('propiedad', 'intelectual'),
('propiedad', 'creativa'),
('dueño', 'copyright'),
('dueños', 'copyright'),
('sentido', 'común'),
('Creative', 'Commons')]

The collocation list is easier to interpret after running the whole preprocessing pipeline.

This class internally uses the function remove_stopwords().

head(N: int)[source]¶: Show the first N elements of the collocation list, ranked by the score function 'likelihood_ratio'.

tail(N: int)[source]¶: Show the last N elements of the collocation list, ranked by the score function 'likelihood_ratio'.

write(path: str)[source]¶: Write the list of collocation tuples to a text file.
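To give an intuition for ranking by a likelihood ratio, the sketch below scores adjacent word pairs with Dunning's log-likelihood ratio (G²) computed from a 2x2 contingency table. Whether Collocations computes exactly this statistic is an assumption; `likelihood_ratio` and `rank_bigrams` are hypothetical helpers written for this illustration only.

```python
import math
from collections import Counter


def likelihood_ratio(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table."""
    total = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = [(k11, row1, col1), (k12, row1, col2),
             (k21, row2, col1), (k22, row2, col2)]
    # Empty cells contribute nothing to the sum.
    return 2.0 * sum(k * math.log(k * total / (r * c))
                     for k, r, c in cells if k > 0)


def rank_bigrams(tokens):
    """Return the distinct adjacent pairs, highest score first."""
    bigrams = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)
    second_counts = Counter(w2 for _, w2 in bigrams)
    total = len(bigrams)

    def score(pair):
        k11 = pair_counts[pair]                  # w1 followed by w2
        k12 = first_counts[pair[0]] - k11        # w1 followed by another word
        k21 = second_counts[pair[1]] - k11       # another word followed by w2
        k22 = total - k11 - k12 - k21            # neither
        return likelihood_ratio(k11, k12, k21, k22)

    return sorted(pair_counts, key=score, reverse=True)


tokens = "strong coffee and strong coffee or weak tea and strong coffee".split()
print(rank_bigrams(tokens)[:3])
```

Under this reading, head(N) corresponds to the first N entries of such a ranking and tail(N) to the last N.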
