Ngrams Module¶
Module for preprocessing techniques that split text into n-grams: syntactic n-grams, skip-grams, stopword n-grams, etc.
- preprocess.grams.contextual_ngrams(text, n, multioutput='raw_value')[source]¶
Generates a special kind of n-gram, also called CTnG (contextual n-grams).
These n-grams are formed by first sorting the words, then removing stopwords and tokens of length one, stemming, and sorting the n-grams [RdguezTorrejon2010b].
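The steps above can be illustrated with a minimal sketch. It is one possible reading of the description, not the library's implementation: `STOPWORDS`, `toy_stem` and `contextual_ngrams_sketch` are hypothetical names, the stopword list is a tiny placeholder, and the suffix-stripping stemmer stands in for a real stemmer.

```python
# Tiny illustrative stopword list (assumption; a real list is much longer).
STOPWORDS = {"the", "a", "of", "in", "and", "is", "to"}


def toy_stem(token):
    """Naive suffix stripping, used here only as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def contextual_ngrams_sketch(text, n):
    """Build n-grams after stopword removal, stemming and in-gram sorting."""
    # Drop stopwords and single-character tokens.
    tokens = [t for t in text.lower().split() if len(t) > 1 and t not in STOPWORDS]
    stems = [toy_stem(t) for t in tokens]
    # Sort the words inside each n-gram so that word order no longer matters.
    return [" ".join(sorted(stems[i:i + n])) for i in range(len(stems) - n + 1)]


print(contextual_ngrams_sketch("The cats chased a mouse in the garden", 2))
```

Sorting inside each n-gram makes the result insensitive to small reorderings of the same content words.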
References¶
- RdguezTorrejon2010b
Diego A. Rodríguez Torrejon & José Manuel Martín Ramos. (2010b). Detección de plagio en documentos. Sistema externo monolingüe de altas prestaciones basado en n-gramas contextuales. Procesamiento del Lenguaje Natural, 45:49–57
- preprocess.grams.ngrams(text, n=2, gram_type='tokens', multioutput='raw_value')[source]¶
Generate the list of n-grams.
Parameters¶
- text: str
String to parse, generally a sentence.
- gram_type: str
Type of grams to generate. String in ['chars', 'tokens'].
- multioutput: str
Format of the output. String in ['raw_value', 'tuple_list'].
raw_value - list of n-grams in string format. E.g.: 'a b c'
tuple_list - list of n-grams in tuple format. E.g.: ('a', 'b', 'c')
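As an illustration of how these parameters interact, here is a minimal sketch. `ngrams_sketch` is a hypothetical stand-in, not the library function; in particular, joining character n-grams without a separator in 'raw_value' mode is an assumption.

```python
def ngrams_sketch(text, n=2, gram_type="tokens", multioutput="raw_value"):
    """Slide a window of size n over the characters or tokens of text."""
    items = list(text) if gram_type == "chars" else text.split()
    grams = [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]
    if multioutput == "tuple_list":
        return grams
    # Assumption: token grams are joined with a space, char grams with nothing.
    separator = "" if gram_type == "chars" else " "
    return [separator.join(gram) for gram in grams]


print(ngrams_sketch("a b c d", n=3))
print(ngrams_sketch("a b c d", n=3, multioutput="tuple_list"))
```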
- preprocess.grams.skipgrams(text, n, k, gram_type='tokens')[source]¶
Returns all possible skipgrams generated from a sequence of items, as an iterator. Skipgrams are n-grams that allow tokens to be skipped. Refer to http://homepages.inf.ed.ac.uk/ballison/pdf/lrec_skipgrams.pdf
>>> sent = "Insurgents killed in ongoing fighting".split()
>>> list(skipgrams(sent, 2, 2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> list(skipgrams(sent, 3, 2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
- Parameters
sequence (sequence or iter) – the source data to be converted into skipgrams
n (int) – the degree of the ngrams
k (int) – the skip distance
- Return type
iter(tuple)
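The roles of n and k can be seen in a minimal sketch. `skipgrams_sketch` is a hypothetical name and this is not the library's code: each item is taken as the head of a gram, and the remaining n - 1 items are chosen, in order, from the next n + k - 1 items.

```python
from itertools import combinations


def skipgrams_sketch(tokens, n, k):
    """Yield n-item tuples that may skip up to k items in total."""
    tokens = list(tokens)
    for i in range(len(tokens)):
        # Positions that may follow the head: the next n + k - 1 items.
        window = range(i + 1, min(i + n + k, len(tokens)))
        for rest in combinations(window, n - 1):
            yield (tokens[i],) + tuple(tokens[j] for j in rest)


sent = "Insurgents killed in ongoing fighting".split()
print(list(skipgrams_sketch(sent, 2, 2)))
```

With k = 0 the window shrinks to the next n - 1 items, so the sketch reduces to ordinary n-grams.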
- preprocess.grams.sngrams(st, text, N=2)[source]¶
Syntactic Ngrams
It is a novel technique that combines n-grams with dependency trees [Sidorov2012].
Parameters¶
- st: tree
Syntactic tree generated by the Stanford syntactic parser.
- text: str
text to process
- N: int
Length of the gram
Return¶
- sn_grams: list
The list of syntactic dependency tuples of length N.
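The idea of following dependency arcs instead of surface word order can be sketched as below. The input format is an assumption: rather than a Stanford parser tree, the hypothetical `sn_grams_sketch` takes a plain dict mapping each head word to its dependents, and it assumes every word in the sentence is unique.

```python
def sn_grams_sketch(children, n=2):
    """Return every head-to-dependent path of n words in a dependency tree."""

    def paths_from(node, length):
        if length == 1:
            return [(node,)]
        return [
            (node,) + tail
            for child in children.get(node, [])
            for tail in paths_from(child, length - 1)
        ]

    # Collect every word once: heads first, then dependents not yet seen.
    nodes = list(children)
    for dependents in children.values():
        for word in dependents:
            if word not in nodes:
                nodes.append(word)

    return [path for node in nodes for path in paths_from(node, n)]


# Hypothetical dependency tree for "John likes red apples".
tree = {"likes": ["John", "apples"], "apples": ["red"]}
print(sn_grams_sketch(tree, 2))
```

Note that ('likes', 'apples') is produced even though the two words are not adjacent in the sentence, which is the point of syntactic n-grams.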
References¶
- Sidorov2012
Grigori Sidorov et al. (2012). Syntactic N-grams as Machine Learning Features for Natural Language Processing. Journal Expert Systems with Applications, 4(3): 853-860. Elsevier.
- preprocess.grams.stopword_ngrams(text, n, lang='en', stops_path='', multioutput='raw_value')[source]¶
N-grams obtained by filtering out all non-stopwords, also called SWNG [Stamatatos2011b].
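A minimal sketch of the idea, under stated assumptions: `STOPWORDS_EN` is a short placeholder list rather than the list the library loads through lang or stops_path, and `stopword_ngrams_sketch` is a hypothetical name that tokenizes on whitespace only.

```python
# Short illustrative stopword list (assumption; not the library's list).
STOPWORDS_EN = {
    "the", "of", "and", "a", "in", "to", "is", "was", "it", "for",
    "with", "he", "be", "on", "that", "by", "at", "but", "as", "not",
}


def stopword_ngrams_sketch(text, n):
    """Keep only the stopwords of text, then build n-grams over them."""
    kept = [t for t in text.lower().split() if t in STOPWORDS_EN]
    return [" ".join(kept[i:i + n]) for i in range(len(kept) - n + 1)]


print(stopword_ngrams_sketch("The cat sat on the mat and it was happy", 3))
```

Because content words are discarded, the resulting n-grams capture sentence structure rather than topic.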
References¶
- Stamatatos2011b
Stamatatos, Efstathios (2011). Plagiarism Detection Using Stopword n-grams. Journal of the American Society for Information Science and Technology, 62(12):2512–2527.
- class preprocess.grams.Collocations(text, ngrams=2, stopwords=True, lang='en')[source]¶
Collocations are a kind of n-gram: a pair or group of words that are habitually juxtaposed, e.g. 'strong coffee', 'black night'.
This class contains additional methods that return the most important tokens according to different metrics.
Parameters¶
- text: str
The text or list of text names to be processed.
- ngrams: int
Number of grams your collocation must have [2,3,4].
- stopwords: bool
Preprocess texts with/without stop words.
- lang: str
Language of the texts [‘en’, ‘es’].
Attributes¶
- list: list
Array with collocations.
Examples¶
>>> from preprocess.grams import Collocations
>>> from preprocess.demo import preProcessFlow
>>> from preprocess.data import load_culturalibre
>>> book = load_culturalibre()
>>> txt = preProcessFlow(book)
>>> collocations = Collocations(txt)
Show the first 10 collocations:
>>> collocations.head(10)
[('Cultura', 'libre'), ('disponible', 'enlace'), ('dominio', 'público'), ('Tribunal', 'Supremo'), ('propiedad', 'intelectual'), ('propiedad', 'creativa'), ('dueño', 'copyright'), ('dueños', 'copyright'), ('sentido', 'común'), ('Creative', 'Commons')]
The collocation list is easier to interpret after running the whole preprocessing pipeline.
This class internally uses the function remove_stopwords().
- head(N: int)[source]¶
Show the first N elements of the collocation list, ranked by the "likelihood_ratio" score function.
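To give an intuition for ranking by likelihood ratio, the sketch below scores adjacent word pairs with Dunning's log-likelihood ratio over a 2x2 contingency table. That this matches the "likelihood_ratio" score used by the class is an assumption, and `log_likelihood_ratio` and `head_sketch` are hypothetical names, not part of the module.

```python
import math
from collections import Counter


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G2 statistic for a 2x2 contingency table of counts."""
    total = k11 + k12 + k21 + k22
    # Each cell: (observed count, its row total, its column total).
    cells = (
        (k11, k11 + k12, k11 + k21),
        (k12, k11 + k12, k12 + k22),
        (k21, k21 + k22, k11 + k21),
        (k22, k21 + k22, k12 + k22),
    )
    score = sum(
        obs * math.log(obs * total / (row * col))
        for obs, row, col in cells
        if obs > 0
    )
    return 2.0 * score


def head_sketch(tokens, top_n):
    """Return the top_n adjacent pairs, highest likelihood ratio first."""
    bigrams = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(bigrams)
    first_counts = Counter(first for first, _ in bigrams)
    second_counts = Counter(second for _, second in bigrams)
    total = len(bigrams)
    scores = {}
    for (first, second), k11 in pair_counts.items():
        k12 = first_counts[first] - k11    # first word followed by another word
        k21 = second_counts[second] - k11  # second word preceded by another word
        k22 = total - k11 - k12 - k21      # pairs containing neither in position
        scores[(first, second)] = log_likelihood_ratio(k11, k12, k21, k22)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(head_sketch("a b a b c d".split(), 1))
```

A pair scores high when its two words appear together more often than their individual frequencies would predict; a table with no association scores zero.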