Applying Preprocess for Real¶
This tutorial shows preprocess in a real context. After the library quickstart and the basics of text normalization with Python, the next obvious step is to apply preprocessing techniques to a real NLP problem.
The selected problem is Semantic Text Similarity.
Semantic Text Similarity¶
SemEval is the International Workshop on Semantic Evaluation, currently part of the Lexical and Computational Semantics and Semantic Evaluation scientific conference. The event is divided into tasks; the task of interest here is Semantic Text Similarity, whose objective is to measure the degree of semantic equivalence between two texts. The data is composed of sentence pairs taken from previously existing paraphrase datasets [Agirre2012]_.
In the gold standard, semantic equivalence is usually scored with a float between 0 and 5.
Dataset¶
The data used for this example is a small part of SemEval 2012 Shared Task 6 Dataset, the en-en subset.
The subset is from MSR-Paraphrase, the Microsoft Research Paraphrase Corpus, and has 750 pairs of sentences.
Legal Note¶
The STS 2012 Dataset is under these licenses:
* http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/
* http://research.microsoft.com/en-us/downloads/38cf15fd-b8df-477e-a4e4-a4680caa75af/
[1]:
#import the dataset
import pandas as pd
data = pd.read_csv('data/2012SMTeuroparl.train.tsv', sep='\t')
[2]:
data.columns = ['score','s1','s2']
data.head()
[2]:
| | score | s1 | s2 |
|---|---|---|---|
| 0 | 4.25 | I know that in France they have had whole herd... | I know that in France, the principle of slaugh... |
| 1 | 4.80 | Unfortunately, the ultimate objective of a Eur... | Unfortunately the final objective of a Europea... |
| 2 | 4.80 | The right of a government arbitrarily to set a... | The right for a government to draw aside its c... |
| 3 | 4.00 | The House had also fought, however, for the re... | This Parliament has also fought for this reduc... |
| 4 | 4.80 | The right of a government arbitrarily to set a... | The right for a government to dismiss arbitrar... |
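One detail of the loading cell is worth checking on your own copy of the file: `pd.read_csv` treats the first line as a header unless told otherwise. The sketch below uses a tiny in-memory stand-in for the TSV (an assumption, since the layout of the real file is not shown here) to illustrate the difference between the default call and one that passes `header=None` with explicit column names.

```python
import io
import pandas as pd

# Toy stand-in for the TSV file: two tab-separated rows and NO header line.
# (Assumption: the layout of the real file is not shown in this tutorial.)
raw = (
    "4.25\tfirst sentence a\tfirst sentence b\n"
    "4.80\tsecond sentence a\tsecond sentence b\n"
)

# Default behaviour: the first row is consumed as the header.
implicit = pd.read_csv(io.StringIO(raw), sep="\t")

# Explicit column names: every row is kept as data.
explicit = pd.read_csv(
    io.StringIO(raw), sep="\t", header=None, names=["score", "s1", "s2"]
)

print(len(implicit), len(explicit))
```

If the real file does start with a header line, the default call is the right one; the point is only to verify which case applies before renaming the columns.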
Requirements¶
This example uses the open source library textsim, a personal project of the author. It is a library for text similarity that integrates some well-known text similarity distances, along with implementations of those distances from scipy, sklearn and other Python libraries.
[3]:
import numpy as np
import preprocess
import textsim
from copy import deepcopy
import warnings
warnings.filterwarnings('ignore')
import pickle
Preprocessing¶
[4]:
preprocess.basic.__all__
[4]:
['lowercase',
'replace_urls',
'replace_symbols',
'replace_dot_sequence',
'multipart_words',
'expand_abbrevs',
'normalize_abbrevs',
'expand_contractions',
'replace_punctuation',
'extraspace_for_endingpoints',
'add_doc_ending_point',
'del_tokens_len_one',
'hyphenation',
'del_digits']
[5]:
#You can play with the atomic steps that the preproc-text library allows
flow = ['lowercase',
        'expand_contractions',
        'replace_dot_sequence',
        'multipart_words',
        'replace_punctuation',
        'del_digits']
pdata = deepcopy(data)
#Preprocess all the sentences and keep the new values in pdata.
#Whole columns are assigned because pdata.iloc[i] returns a temporary row copy,
#so writing to pdata.iloc[i].s1 would leave pdata unchanged.
pdata['s1'] = pdata['s1'].apply(lambda s: preprocess.pipeline(s, flow=flow))
pdata['s2'] = pdata['s2'].apply(lambda s: preprocess.pipeline(s, flow=flow))
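The idea of a flow, an ordered list of step names applied one after another, can be sketched with plain functions. The three steps below are hypothetical stand-ins written only for illustration; they are not the implementations used by the preprocess library, and `run_flow` and `CONTRACTIONS` are names invented for this sketch.

```python
import re

# Hypothetical stand-ins for three steps; NOT the library's implementations.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}


def lowercase(text):
    return text.lower()


def expand_contractions(text):
    # Replace each known (lowercase) contraction; anything else is left as is.
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS))
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text)


def del_digits(text):
    # Remove digit runs, then collapse the whitespace left behind.
    return re.sub(r"\s+", " ", re.sub(r"\d+", "", text)).strip()


STEPS = {
    "lowercase": lowercase,
    "expand_contractions": expand_contractions,
    "del_digits": del_digits,
}


def run_flow(text, flow):
    """Apply the named steps in the given order."""
    for name in flow:
        text = STEPS[name](text)
    return text


sentence = "It's 2012 and we DON'T know"
result = run_flow(sentence, ["lowercase", "expand_contractions", "del_digits"])
swapped = run_flow(sentence, ["expand_contractions", "lowercase", "del_digits"])
print(result)
print(swapped)
```

The two calls differ only in the order of the first two steps. In this sketch the contraction table is lowercase, so expanding before lowercasing misses "It's" and "DON'T", which is why the order of a flow deserves some thought.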
Sentences are converted to vectors of similarity distances using textsim, which gathers sentence similarity distances from Sklearn, Scipy, Nltk, Jellyfish, etc.
Every pair of sentences is converted to one vector of float values, and the original score is taken as the target to predict. The same process is applied to the preprocessed data and to the original data, to measure the impact of preprocessing on the machine learning process.
The next cell may take some time to run, because it must perform 733 × 2 text-to-vector conversions and then 733 × 43 distance calculations.
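To make the pair-to-vector idea concrete, here is a minimal sketch that turns one sentence pair into a short feature vector. The three measures are hand-written, standard-library versions chosen for illustration; they are stand-ins and may differ from the textsim functions of the same name. `METRICS` and `pair_to_vector` are hypothetical names.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over whitespace-separated tokens."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return 1.0 - len(ta & tb) / len(union) if union else 0.0


def dice_coefficient(a, b):
    """2 * |A & B| / (|A| + |B|) over sets of character bigrams."""
    ba = {a[i:i + 2] for i in range(len(a) - 1)}
    bb = {b[i:i + 2] for i in range(len(b) - 1)}
    total = len(ba) + len(bb)
    return 2.0 * len(ba & bb) / total if total else 1.0


# Hypothetical registry, playing the role of a name -> function mapping.
METRICS = {
    "levenshtein_distance": levenshtein,
    "jaccard_distance": jaccard_distance,
    "dice_coefficient": dice_coefficient,
}


def pair_to_vector(s1, s2):
    """One float (or int) per metric, in the order of METRICS."""
    return [fn(s1, s2) for fn in METRICS.values()]


vector = pair_to_vector("the cat sat", "the cat ran")
print(dict(zip(METRICS, vector)))
```

Each row of the feature matrix built in the next cell is a vector of this kind, only with many more measures.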
[6]:
def distance_matrix(df, distances_list):
    textsimData = pd.DataFrame()
    #make textsim matrix
    for metric in distances_list:
        observations = []
        for i in range(len(df)):
            observations.append(textsim.__all_distances__[metric](df.iloc[i].s1, df.iloc[i].s2))
        textsimData[metric] = observations
    return textsimData
[7]:
pmatrix = distance_matrix(pdata, textsim.__all_distances__)
print(pmatrix.shape)
pmatrix.head()
(733, 43)
[7]:
| | binary_distance | levenshtein_distance | edit_similarity | damerau_levenshtein_distance | jaro_distance | jaro_winkler_distance | hamming_distance | match_rating_comparison | dice_coefficient | lcs_distance | ... | matching_distance | minkowski_distance | rogerstanimoto_distance | russellrao_distance | seuclidean_distance | sokalmichener_distance | sokalsneath_distance | sqeuclidean_distance | yule_distance | qgram_distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 78 | 0.462069 | 78 | 0.764647 | 0.858788 | 119 | True | 0.622222 | 77 | ... | 0.620690 | 19.0 | 0.619048 | 0.413793 | 6.000000 | 0.619048 | 0.604651 | 21.0 | 3.789474 | 0.622222 |
| 1 | 0.0 | 32 | 0.769784 | 32 | 0.787642 | 0.872585 | 121 | True | 0.631579 | 110 | ... | 0.478261 | 11.0 | 0.357143 | 0.260870 | 4.690416 | 0.357143 | 0.370370 | 11.0 | 0.380952 | 0.631579 |
| 2 | 0.0 | 38 | 0.672414 | 38 | 0.857150 | 0.914290 | 70 | True | 0.823529 | 95 | ... | 0.388889 | 7.0 | 0.105263 | 0.111111 | 3.741657 | 0.105263 | 0.111111 | 7.0 | 0.000000 | 0.823529 |
| 3 | 0.0 | 148 | 0.467626 | 148 | 0.766001 | 0.812801 | 257 | True | 0.444444 | 158 | ... | 0.674419 | 31.0 | -2.777778 | -0.162791 | 7.615773 | -2.777778 | 0.000000 | 35.0 | 0.275862 | 0.444444 |
| 4 | 0.0 | 38 | 0.672414 | 38 | 0.821830 | 0.893098 | 100 | True | 0.787879 | 92 | ... | 0.444444 | 8.0 | 0.200000 | 0.166667 | 4.000000 | 0.200000 | 0.210526 | 8.0 | 0.000000 | 0.787879 |
5 rows × 43 columns
Searching for null values.
[8]:
#Counting null (NaN) values; infinite values are counted in the next cell
null_values = pmatrix.isnull().sum()
print(null_values[null_values>0])
correlation_distance 21
seuclidean_distance 17
yule_distance 33
dtype: int64
[9]:
null_values = pmatrix[(pmatrix == -np.inf) | (pmatrix == np.inf)].count()
print(null_values[null_values>0])
kulsinski_distance 2
dtype: int64
[12]:
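#Replacing Yule distance null values
def replace_null(df, col_name, func='mean'):
    #func is kept in the signature, but only the mean is implemented
    is_inf = df[col_name] == np.inf
    is_ninf = df[col_name] == -np.inf
    #mean of the finite values (mean() skips NaN)
    col_mean = df[col_name][~is_inf & ~is_ninf].mean()
    row_mask = df[col_name].isnull()
    #.loc writes into df itself instead of a possible temporary copy
    df.loc[row_mask, col_name] = col_mean
    return df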
pmatrix = replace_null(pmatrix, 'yule_distance')
#Counting missing values
null_values = pmatrix.isnull().sum()
print(null_values[null_values>0])
correlation_distance 21
seuclidean_distance 17
dtype: int64
[13]:
#Repeating the process for 'correlation_distance' and 'seuclidean_distance'
pmatrix = replace_null(pmatrix, 'correlation_distance')
pmatrix = replace_null(pmatrix, 'seuclidean_distance')
#Counting missing values
null_values = pmatrix.isnull().sum()
print(null_values[null_values>0])
Series([], dtype: int64)
[14]:
null_values = pmatrix[(pmatrix == -np.inf) | (pmatrix == np.inf)].count()
print(null_values[null_values>0])
kulsinski_distance 2
dtype: int64
[15]:
def replace_null2(df, col_name, func='mean'):
    #Replace +inf and -inf with the mean of the finite values
    is_inf = df[col_name] == np.inf
    is_ninf = df[col_name] == -np.inf
    col_mean = df[col_name][~is_inf & ~is_ninf].mean()
    #.loc writes into df itself instead of a possible temporary copy
    df.loc[is_inf | is_ninf, col_name] = col_mean
    return df
pmatrix = replace_null2(pmatrix, 'kulsinski_distance')
[16]:
null_values = pmatrix[(pmatrix == -np.inf) | (pmatrix == np.inf)].count()
print(null_values[null_values>0])
Series([], dtype: int64)
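The cleaning steps above handle NaN and infinite values column by column. A more compact variant, sketched below on a toy frame, converts infinities to NaN first and then fills every gap with the mean of the remaining finite values of its column. `fill_non_finite_with_mean` is a hypothetical helper name, and the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN and one infinite value (invented numbers).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [2.0, np.inf, 4.0],
    "c": [5.0, 6.0, 7.0],
})


def fill_non_finite_with_mean(df):
    """Return a copy where NaN, +inf and -inf become the column's finite mean."""
    cleaned = df.replace([np.inf, -np.inf], np.nan)
    # DataFrame.mean() skips NaN, so only finite values contribute.
    return cleaned.fillna(cleaned.mean())


clean = fill_non_finite_with_mean(toy)
print(clean)
```

Unlike the functions in the cells above, this helper returns a new frame and leaves its input untouched, which makes it easier to compare the data before and after cleaning.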
Dropping non-valuable features
A simple inspection of these columns shows that ‘binary_distance’ contains only 0.0 values, ‘match_rating_comparison’ contains boolean values, and ‘damerau_levenshtein_distance’ has the same values as ‘levenshtein_distance’. These columns are therefore useless for the final calculation.
[17]:
clean_pmatrix = pmatrix.drop(['binary_distance', 'match_rating_comparison', 'damerau_levenshtein_distance'], axis=1)
pickle.dump(clean_pmatrix, open('data/ptrain.data.pkl', 'wb'))
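The columns dropped above were found by eye. Two of the three cases, constant columns and columns that repeat another column, can also be detected programmatically. The sketch below does this on a toy frame with invented values; `uninformative_columns` is a hypothetical helper, and a boolean column such as ‘match_rating_comparison’ would still need a manual decision.

```python
import pandas as pd

# Toy frame (invented values): one constant column, one exact duplicate.
toy = pd.DataFrame({
    "binary": [0.0, 0.0, 0.0, 0.0],
    "lev": [3, 1, 4, 1],
    "dam_lev": [3, 1, 4, 1],
    "jaro": [0.9, 0.7, 0.5, 0.8],
})


def uninformative_columns(df):
    """Return (constant columns, columns identical to an earlier column)."""
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    duplicated, kept = [], []
    for c in df.columns:
        if c in constant:
            continue
        # Element-wise comparison; note that NaN never compares equal to NaN.
        if any((df[c] == df[k]).all() for k in kept):
            duplicated.append(c)
        else:
            kept.append(c)
    return constant, duplicated


constant_cols, duplicate_cols = uninformative_columns(toy)
print(constant_cols, duplicate_cols)
```

The returned lists can be passed straight to `DataFrame.drop(..., axis=1)`.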
[102]:
#Load the preprocessed distance matrix in case you don't want to rerun the
#computationally expensive previous step
with open('data/ptrain.data.pkl', 'rb') as pickle_pdata:
    pmatrix = pickle.load(pickle_pdata)
[103]:
pmatrix.shape
[103]:
(733, 40)
[20]:
type(pmatrix)
[20]:
pandas.core.frame.DataFrame
To measure the influence of these preprocessing techniques on the text similarity problem, we need to compare against the original data. Like the preprocessed pdata, the original data must be transformed into a matrix of float numbers, that is, a feature matrix.
[19]:
#generating the distance matrix
matrix = distance_matrix(data, textsim.__all_distances__)
#cleaning null values
matrix = replace_null(matrix, 'yule_distance')
matrix = replace_null(matrix, 'correlation_distance')
matrix = replace_null(matrix, 'seuclidean_distance')
matrix = replace_null2(matrix, 'kulsinski_distance')
#dropping non-valuable features
matrix = matrix.drop(['binary_distance', 'match_rating_comparison', 'damerau_levenshtein_distance'], axis=1)
pickle.dump(matrix, open('data/train.data.pkl', 'wb'))
[21]:
print(matrix.shape)
matrix.head()
(733, 40)
[21]:
| | levenshtein_distance | edit_similarity | jaro_distance | jaro_winkler_distance | hamming_distance | dice_coefficient | lcs_distance | lcs_similarity | smith_waterman_distance | needleman_wunsch_distance | ... | matching_distance | minkowski_distance | rogerstanimoto_distance | russellrao_distance | seuclidean_distance | sokalmichener_distance | sokalsneath_distance | sqeuclidean_distance | yule_distance | qgram_distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 78 | 0.462069 | 0.764647 | 0.858788 | 119 | 0.622222 | 77 | 0.611111 | 90.0 | 156.0 | ... | 0.620690 | 19.0 | 0.619048 | 0.413793 | 6.000000 | 0.619048 | 0.604651 | 21.0 | 3.789474 | 0.622222 |
| 1 | 32 | 0.769784 | 0.787642 | 0.872585 | 121 | 0.631579 | 110 | 0.830189 | 186.0 | 64.0 | ... | 0.478261 | 11.0 | 0.357143 | 0.260870 | 4.690416 | 0.357143 | 0.370370 | 11.0 | 0.380952 | 0.631579 |
| 2 | 38 | 0.672414 | 0.857150 | 0.914290 | 70 | 0.823529 | 95 | 0.818966 | 152.0 | 76.0 | ... | 0.388889 | 7.0 | 0.105263 | 0.111111 | 3.741657 | 0.105263 | 0.111111 | 7.0 | 0.000000 | 0.823529 |
| 3 | 148 | 0.467626 | 0.766001 | 0.812801 | 257 | 0.444444 | 158 | 0.635815 | 203.0 | 296.0 | ... | 0.674419 | 31.0 | -2.777778 | -0.162791 | 7.615773 | -2.777778 | 0.000000 | 35.0 | 0.275862 | 0.444444 |
| 4 | 38 | 0.672414 | 0.821830 | 0.893098 | 100 | 0.787879 | 92 | 0.803493 | 141.0 | 76.0 | ... | 0.444444 | 8.0 | 0.200000 | 0.166667 | 4.000000 | 0.200000 | 0.210526 | 8.0 | 0.000000 | 0.787879 |
5 rows × 40 columns
#saving this matrix for future analysis
pickle.dump(matrix, open('data/train.data.pkl', 'wb'))
[154]:
#Loading distances matrix of original data
pickle_data = open('data/train.data.pkl', 'rb')
matrix = pickle.load(pickle_data)
Machine Learning model¶
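The model is a linear regressor trained with stochastic gradient descent (scikit-learn's SGDRegressor) on standardized features. The target is the continuous similarity score, so this is a regression problem rather than a classification one.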
Features, use textsim.calc_all
[29]:
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
y = data['score']
X = matrix
#Training without preprocessing
reg = make_pipeline(StandardScaler(),SGDRegressor(max_iter=100, tol=1e-3))
reg.fit(X,y)
[29]:
Pipeline(steps=[('standardscaler', StandardScaler()),
('sgdregressor', SGDRegressor(max_iter=100))])
How can we see the prediction for a single element?
[30]:
test = matrix.iloc[0].to_numpy()
test_element = test.reshape(1, -1)  #one row containing all the features
print('prediction:', reg.predict(test_element)[0])
print('original:', data['score'][0])
prediction: 4.480812387415777
original: 4.25
Regression Model Generation with Preprocessed Data¶
[28]:
y = data['score']
pX = pmatrix
preg = make_pipeline(StandardScaler(),SGDRegressor(max_iter=100, tol=1e-3))
preg.fit(pX,y)
[28]:
Pipeline(steps=[('standardscaler', StandardScaler()),
('sgdregressor', SGDRegressor(max_iter=100))])
Validation¶
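The difference between the results obtained with and without preprocessing is measured with the Pearson correlation between the gold-standard scores and the predictions of each model.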
[32]:
from scipy import stats
print("Pearson correlation without preprocessing")
predict = pd.DataFrame()
predict['score'] = reg.predict(X)
p = stats.pearsonr(data['score'],predict['score'])
print("Pearson coeff:",p[0], "p-value:", p[1])
print("Pearson correlation with preprocessing")
ppredict = pd.DataFrame()
ppredict['score'] = preg.predict(pX)
pp = stats.pearsonr(data['score'],ppredict['score'])
print("Pearson coeff:",pp[0], "p-value:", pp[1])
Pearson correlation without preprocessing
Pearson coeff: 0.7667865417395282 p-value: 6.989890360376282e-143
Pearson correlation with preprocessing
Pearson coeff: 0.765653267795623 p-value: 3.2564477583017092e-142
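Because `reg.predict(X)` and `preg.predict(pX)` score the same pairs the models were trained on, these correlations describe the fit rather than generalization. A held-out estimate may be a useful complement. The sketch below shows one way to obtain it with `cross_val_predict`; it runs on synthetic data (an assumption made so the example is self-contained), and `X_toy` and `y_toy` stand in for the feature matrix and the gold scores.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and the gold scores.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.5, size=200)

model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
)

# Each prediction comes from a model that never saw that row during training.
held_out = cross_val_predict(model, X_toy, y_toy, cv=5)
r, p_value = stats.pearsonr(y_toy, held_out)
print("held-out Pearson coeff:", r, "p-value:", p_value)
```

Applied to both feature matrices, the same procedure would give a comparison between raw and preprocessed data that is less affected by overfitting.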
Recommendations¶
Usually we should reduce dimensionality to make the model more interpretable and less complex, shorten training time, avoid overfitting and improve generalization.
Feature selection is not an objective of this tutorial, but comparing the lists of the most important features with and without preprocessing is recommended: because preprocessing directly affects which features are selected, the comparison can show how relevant preprocessing is for improving results.
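As a starting point for that comparison, the sketch below ranks features with scikit-learn's `SelectKBest` and `f_regression`, then compares the selected sets for two versions of a feature matrix. Everything here is synthetic and hypothetical: the feature names, the two matrices and the helper `top_k_features` are assumptions made for illustration, and univariate selection is only one of several possible techniques.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-ins for the raw and preprocessed feature matrices.
rng = np.random.default_rng(0)
names = ["dist_0", "dist_1", "dist_2", "dist_3"]  # hypothetical feature names
X_raw = rng.normal(size=(300, 4))
y = 2.0 * X_raw[:, 0] + X_raw[:, 1] + rng.normal(scale=0.1, size=300)

# Pretend that preprocessing makes the last feature highly informative.
X_pre = X_raw.copy()
X_pre[:, 3] = y + rng.normal(scale=0.1, size=300)


def top_k_features(X, y, names, k):
    """Names of the k features with the highest univariate F-score."""
    selector = SelectKBest(score_func=f_regression, k=k).fit(X, y)
    return {n for n, keep in zip(names, selector.get_support()) if keep}


raw_top = top_k_features(X_raw, y, names, k=2)
pre_top = top_k_features(X_pre, y, names, k=2)
print("only in raw:", raw_top - pre_top)
print("only in preprocessed:", pre_top - raw_top)
```

With the real data, the two calls would receive the raw and preprocessed distance matrices and their column names.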