Ngrams Module

Module for preprocessing techniques that generate n-grams: syntactic n-grams, skip-grams, stopword n-grams, etc.

preprocess.grams.contextual_ngrams(text, n, multioutput='raw_value')

Generates a special kind of n-grams, also called CTnG.

These n-grams are formed by first sorting the words, then removing stopwords and tokens of length one, stemming, and sorting the resulting n-grams [RdguezTorrejon2010b].

References

RdguezTorrejon2010b

Diego A. Rodríguez Torrejon & José Manuel Martín Ramos. (2010b). Detección de plagio en documentos. Sistema externo monolingüe de altas prestaciones basado en n-gramas contextuales. Procesamiento del Lenguaje Natural, 45:49–57
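The steps described for contextual_ngrams can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: TOY_STOPWORDS and toy_stem are hypothetical stand-ins for a real stopword list and stemmer, lowercasing and whitespace tokenisation are assumed, and the initial word sort is omitted because its exact behaviour is not specified here.

```python
# Hypothetical stand-ins: a real pipeline would use a full stopword list
# and a proper stemmer.
TOY_STOPWORDS = {"the", "a", "of", "in", "and", "is", "to"}


def toy_stem(token):
    """Very naive suffix stripper, used only for illustration."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def contextual_ngrams_sketch(text, n):
    """Build n-grams of stemmed content words, sorted inside each n-gram."""
    # Assumption: lowercase the text and split on whitespace.
    tokens = [t for t in text.lower().split()
              if t not in TOY_STOPWORDS and len(t) > 1]
    stems = [toy_stem(t) for t in tokens]
    # Sorting inside each window makes the n-gram insensitive to word order.
    return [" ".join(sorted(stems[i:i + n]))
            for i in range(len(stems) - n + 1)]


print(contextual_ngrams_sketch("The detection of copied passages in documents", 2))
```

Because each n-gram is sorted internally, two passages that use the same content words in a different order can still produce matching n-grams.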

preprocess.grams.ngrams(text, n=2, gram_type='tokens', multioutput='raw_value')

Generate the list of n-grams.

Parameters

text: str

String to parse, generally a sentence.

gram_type: str

Type of grams to generate. One of ['chars', 'tokens'].

multioutput: str

Format of the output. One of ['raw_value', 'tuple_list'].
  • raw_value - list of n-grams in string format, e.g. 'a b c'

  • tuple_list - list of n-grams in tuple format, e.g. ('a', 'b', 'c')
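To make the gram_type and multioutput options concrete, here is a minimal sketch of how such a function could behave. It is an assumption-laden illustration, not the library's code: ngrams_sketch is a hypothetical name, tokens are obtained by whitespace splitting, and character n-grams are joined without a separator in the raw_value format.

```python
def ngrams_sketch(text, n=2, gram_type="tokens", multioutput="raw_value"):
    """Return the n-grams of `text` as strings or as tuples."""
    if gram_type == "tokens":
        items = text.split()          # assumption: whitespace tokenisation
    elif gram_type == "chars":
        items = list(text)
    else:
        raise ValueError("gram_type must be 'chars' or 'tokens'")

    # Slide a window of width n over the items.
    grams = [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

    if multioutput == "tuple_list":
        return grams
    if multioutput == "raw_value":
        # Assumption: token grams are space-joined, char grams are concatenated.
        sep = " " if gram_type == "tokens" else ""
        return [sep.join(g) for g in grams]
    raise ValueError("multioutput must be 'raw_value' or 'tuple_list'")


print(ngrams_sketch("a b c d", n=3))
print(ngrams_sketch("a b c d", n=3, multioutput="tuple_list"))
```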

preprocess.grams.skipgrams(text, n, k, gram_type='tokens')

Returns all possible skipgrams generated from a sequence of items, as an iterator. Skipgrams are n-grams that allow tokens to be skipped. Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf

>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
Parameters
  • sequence (sequence or iter) – the source data to be converted into skipgrams

  • n (int) – the degree of the ngrams

  • k (int) – the skip distance

Return type

iter(tuple)
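The role of n and k can be illustrated with a small standalone generator. This is a sketch of one way to enumerate skipgrams, not necessarily how this function is implemented; skipgrams_sketch is a hypothetical name. Each item is used as the head of a gram, and the remaining n - 1 items are chosen, in order, from the next n - 1 + k positions.

```python
from itertools import combinations


def skipgrams_sketch(sequence, n, k):
    """Yield n-grams that may skip up to k items in total after the head."""
    items = list(sequence)
    for i, head in enumerate(items):
        # The tail is drawn, in order, from the next (n - 1) + k positions.
        window = items[i + 1:i + n + k]
        for tail in combinations(window, n - 1):
            yield (head,) + tail


sent = "Insurgents killed in ongoing fighting".split()
for gram in skipgrams_sketch(sent, 2, 2):
    print(gram)
```

On the example sentence above this sketch yields the same nine 2-skip bigrams and ten 2-skip trigrams as the doctest output.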

preprocess.grams.sngrams(st, text, N=2)

Syntactic Ngrams

It is a novel technique that combines n-grams with dependency trees [Sidorov2012].

Parameters

st: tree

Syntactic tree generated by the Stanford syntactic parser.

text: str

Text to process.

N: int

Length of the n-gram.

Return

sn_grams: list

List of syntactic dependency tuples of length N.

References

Sidorov2012

Grigori Sidorov et al. (2012). Syntactic N-grams as Machine Learning Features for Natural Language Processing. Journal Expert Systems with Applications, 4(3): 853-860. Elsevier.
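The idea of following a dependency tree instead of surface word order can be sketched without a parser. The example below is a simplified illustration, not the library's implementation: it assumes the tree is given as a plain dictionary from each head to its dependents (TOY_TREE is a hand-built, hypothetical parse) and treats a syntactic n-gram as a head-to-dependent path of N words.

```python
# Hypothetical hand-built dependency tree for "the cat ate fresh fish":
# each head maps to the list of its dependents.
TOY_TREE = {
    "ate": ["cat", "fish"],
    "cat": ["the"],
    "fish": ["fresh"],
}


def sn_grams_sketch(children, n=2):
    """Return every head-to-dependent path of n words (n >= 2)."""
    def paths_from(node, length):
        if length == 1:
            return [(node,)]
        return [(node,) + rest
                for child in children.get(node, [])
                for rest in paths_from(child, length - 1)]

    # Only heads can start a path of two or more words.
    return [path for head in children for path in paths_from(head, n)]


print(sn_grams_sketch(TOY_TREE, 2))
print(sn_grams_sketch(TOY_TREE, 3))
```

Note that ('ate', 'fish') is produced even though the two words are not adjacent in the sentence, which is the difference from ordinary n-grams.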

preprocess.grams.stopword_ngrams(text, n, lang='en', stops_path='', multioutput='raw_value')

N-grams obtained by filtering out all non-stopwords, also called SWNG [Stamatatos2011b].

References

Stamatatos2011b

Stamatatos, Efstathios (2011). Plagiarism Detection Using Stopword n-grams. Journal of the American Society for Information Science and Technology, 62(12):2512–2527.
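A stopword n-gram keeps only the stopwords of a text, in their original order, and builds n-grams over that reduced sequence. The sketch below illustrates this with a tiny hypothetical stopword set (TOY_STOPWORDS) and whitespace tokenisation; the actual function presumably loads a language-specific list selected via lang or stops_path.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
TOY_STOPWORDS = {"the", "on", "and", "it", "was", "of", "a", "in"}


def stopword_ngrams_sketch(text, n, stopwords=TOY_STOPWORDS):
    """Keep only stopwords (in order) and build n-grams over them."""
    kept = [t for t in text.lower().split() if t in stopwords]
    return [" ".join(kept[i:i + n]) for i in range(len(kept) - n + 1)]


print(stopword_ngrams_sketch("The cat sat on the mat and it was happy", 3))
```

Because content words are discarded, the resulting n-grams capture the syntactic skeleton of a passage, which tends to survive when the content words are replaced by synonyms.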

class preprocess.grams.Collocations(text, ngrams=2, stopwords=True, lang='en')

Collocations are a kind of n-gram: a pair or group of words that are habitually juxtaposed, e.g. 'strong coffee', 'black night'.

This class provides methods that return the most important collocations based on different metrics.

Parameters

text: str

The text or list of text names to be processed.

ngrams: int

Number of words in each collocation; one of [2, 3, 4].

stopwords: bool

Whether to preprocess the texts with or without stop words.

lang: str

Language of the texts; one of ['en', 'es'].

Attributes

list: list

Array with collocations.

Examples

>>> from preprocess.grams import Collocations
>>> from preprocess.demo import preProcessFlow
>>> from preprocess.data import load_culturalibre
>>> book = load_culturalibre()
>>> txt = preProcessFlow(book)
>>> collocations = Collocations(txt)

Show the first 10 collocations:

>>> collocations.head(10)
[('Cultura', 'libre'),
('disponible', 'enlace'),
('dominio', 'público'),
('Tribunal', 'Supremo'),
('propiedad', 'intelectual'),
('propiedad', 'creativa'),
('dueño', 'copyright'),
('dueños', 'copyright'),
('sentido', 'común'),
('Creative', 'Commons')]

The collocation list is easier to interpret after running the full preprocessing pipeline.

This class internally uses the function remove_stopwords().

head(N: int)

Show the first N elements of the collocation list, ranked by the score function "likelihood_ratio".

tail(N: int)

Show the last N elements of the collocation list, ranked by the score function "likelihood_ratio".

write(path: str)

Write the list of collocation tuples to a text file.
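For intuition about the ranking, the sketch below scores bigrams with Dunning's log-likelihood ratio, a common collocation measure. It assumes that the "likelihood_ratio" score refers to this statistic, which the documentation here does not confirm; the function names are hypothetical and the code is a standalone illustration rather than the class's implementation.

```python
import math
from collections import Counter


def likelihood_ratio(n_ab, n_a, n_b, n_total):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 bigram table."""
    observed = [
        n_ab,                        # a followed by b
        n_a - n_ab,                  # a followed by something else
        n_b - n_ab,                  # something else followed by b
        n_total - n_a - n_b + n_ab,  # neither a first nor b second
    ]
    expected = [
        n_a * n_b / n_total,
        n_a * (n_total - n_b) / n_total,
        (n_total - n_a) * n_b / n_total,
        (n_total - n_a) * (n_total - n_b) / n_total,
    ]
    # Cells with zero observations contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)


def rank_bigrams(tokens):
    """Return the distinct bigrams of `tokens`, highest score first."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first, second = Counter(), Counter()
    for (a, b), count in bigrams.items():
        first[a] += count    # how often a opens a bigram
        second[b] += count   # how often b closes a bigram
    total = sum(bigrams.values())
    scores = {bg: likelihood_ratio(c, first[bg[0]], second[bg[1]], total)
              for bg, c in bigrams.items()}
    return sorted(scores, key=scores.get, reverse=True)


tokens = "strong coffee and strong coffee or weak tea and strong coffee".split()
print(rank_bigrams(tokens)[:3])
```

Under this reading, head(N) corresponds to the start of such a ranked list and tail(N) to its end.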