{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Applying Preprocess for Real\n",
    "\n",
    "This tutorial intends to show ``preprocess`` in a real context. After a \n",
    "quickstart in the library, and the bases of text normalization with \n",
    "python, the next obvious step is to apply preprocessing techniques in a \n",
    "real NLP problem\n",
    "\n",
    "The selected problem is *Semantic Text Similarity*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Semantic Text Similarity\n",
    "\n",
    "SEMEVAl is an International Workshop on Semantic Evaluation, currently\n",
    "part of Lexical and Computational Semantic and Semantic Evaluation\n",
    "scientific conference. The objective of this workshop is to measure\n",
    "the degree of semantic equivalence between two texts. The data is\n",
    "composed by sentence pairs, coming from previously existing paraphrase\n",
    "datasets [Agirre2012]_. This event is divided in tasks, the task of \n",
    "interest here is [Semantic Text Similarity](http://alt.qcri.org/semeval2012/task17/)\n",
    "\n",
    "Usually in the gold standard the semantic equivalence is measured with\n",
    "a float number between [0-5]."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset\n",
    "\n",
    "The data used for this example is a small part of SemEval 2012 Shared\n",
    "[Task 6 Dataset](https://www.cs.york.ac.uk/semeval-2012/task6/index.php%3Fid=data.html), the en-en subset.\n",
    "\n",
    "The subset is from MSR-Paraphrase, [Microsoft Research Paraphrase Corpus](http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/).\n",
    "750 pairs of sentences.\n",
    "\n",
    "### Legal Note\n",
    "\n",
    "STS 2012 Dataset is under this licenses:\n",
    "* http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/\n",
    "* http://research.microsoft.com/en-us/downloads/38cf15fd-b8df-477e-a4e4-a4680caa75af/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "#import the dataset\n",
    "import pandas as pd\n",
    "data = pd.read_csv('data/2012SMTeuroparl.train.tsv', sep='\\t')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>score</th>\n",
       "      <th>s1</th>\n",
       "      <th>s2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4.25</td>\n",
       "      <td>I know that in France they have had whole herd...</td>\n",
       "      <td>I know that in France, the principle of slaugh...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.80</td>\n",
       "      <td>Unfortunately, the ultimate objective of a Eur...</td>\n",
       "      <td>Unfortunately the final objective of a Europea...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.80</td>\n",
       "      <td>The right of a government arbitrarily to set a...</td>\n",
       "      <td>The right for a government to draw aside its c...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.00</td>\n",
       "      <td>The House had also fought, however, for the re...</td>\n",
       "      <td>This Parliament has also fought for this reduc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.80</td>\n",
       "      <td>The right of a government arbitrarily to set a...</td>\n",
       "      <td>The right for a government to dismiss arbitrar...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   score                                                 s1  \\\n",
       "0   4.25  I know that in France they have had whole herd...   \n",
       "1   4.80  Unfortunately, the ultimate objective of a Eur...   \n",
       "2   4.80  The right of a government arbitrarily to set a...   \n",
       "3   4.00  The House had also fought, however, for the re...   \n",
       "4   4.80  The right of a government arbitrarily to set a...   \n",
       "\n",
       "                                                  s2  \n",
       "0  I know that in France, the principle of slaugh...  \n",
       "1  Unfortunately the final objective of a Europea...  \n",
       "2  The right for a government to draw aside its c...  \n",
       "3  This Parliament has also fought for this reduc...  \n",
       "4  The right for a government to dismiss arbitrar...  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.columns = ['score','s1','s2']\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Requirements\n",
    "\n",
    "Thise example use the open source library [textsim](https://github.com/sorice/textsim), \n",
    "a personal proyect of the author. Is a library for text similarity \n",
    "which integrates some very known text similarity distances, and some \n",
    "implementation of those distances on scipy, sklearn and other python libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import preprocess\n",
    "import textsim\n",
    "from copy import deepcopy\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "import pickle"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['lowercase',\n",
       " 'replace_urls',\n",
       " 'replace_symbols',\n",
       " 'replace_dot_sequence',\n",
       " 'multipart_words',\n",
       " 'expand_abbrevs',\n",
       " 'normalize_abbrevs',\n",
       " 'expand_contractions',\n",
       " 'replace_punctuation',\n",
       " 'extraspace_for_endingpoints',\n",
       " 'add_doc_ending_point',\n",
       " 'del_tokens_len_one',\n",
       " 'hyphenation',\n",
       " 'del_digits']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "preprocess.basic.__all__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "#You can play with the atomic steps preproc-text library allows\n",
    "flow = ['lowercase', \n",
    "        'expand_contractions', \n",
    "        'replace_dot_sequence', \n",
    "        'multipart_words', \n",
    "        'replace_punctuation', \n",
    "        'del_digits']\n",
    "\n",
    "pdata = deepcopy(data)\n",
    "\n",
    "#Preprocess all the sentences and keep the new value in pdata\n",
    "for i in range(len(pdata)):\n",
    "    pdata.iloc[i].s1 = preprocess.pipeline(pdata.iloc[i].s1, flow=flow)\n",
    "    pdata.iloc[i].s2 = preprocess.pipeline(pdata.iloc[i].s2, flow=flow)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feature Engineering\n",
    "\n",
    "Converting Sentences to Vectors of similarity distances using *textsim*, which gather the sentence similarity distances from Sklearn, Scipy, Nltk, Jellyfish, etc.\n",
    "\n",
    "Every pair of sentences will be converted to one vector of float values, and the original score will be taken as the final result to get. The same process will be done with preprocessed data and original data, to calculate de impact of preprocess in the machine learning process.\n",
    "\n",
    "The next process must take some time, because the cell must perform 733*2 text to vector conversions, and then obtain 733*43 calculations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def distance_matrix(df, distances_list):\n",
    "    textsimData =pd.DataFrame()\n",
    "    #make textsim matrix\n",
    "    for metric in distances_list:\n",
    "        observations = []\n",
    "        for i in range(len(df)):\n",
    "            observations.append(textsim.__all_distances__[metric](df.iloc[i].s1, df.iloc[i].s2))\n",
    "        textsimData[metric] = observations\n",
    "    return textsimData"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(733, 43)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>binary_distance</th>\n",
       "      <th>levenshtein_distance</th>\n",
       "      <th>edit_similarity</th>\n",
       "      <th>damerau_levenshtein_distance</th>\n",
       "      <th>jaro_distance</th>\n",
       "      <th>jaro_winkler_distance</th>\n",
       "      <th>hamming_distance</th>\n",
       "      <th>match_rating_comparison</th>\n",
       "      <th>dice_coefficient</th>\n",
       "      <th>lcs_distance</th>\n",
       "      <th>...</th>\n",
       "      <th>matching_distance</th>\n",
       "      <th>minkowski_distance</th>\n",
       "      <th>rogerstanimoto_distance</th>\n",
       "      <th>russellrao_distance</th>\n",
       "      <th>seuclidean_distance</th>\n",
       "      <th>sokalmichener_distance</th>\n",
       "      <th>sokalsneath_distance</th>\n",
       "      <th>sqeuclidean_distance</th>\n",
       "      <th>yule_distance</th>\n",
       "      <th>qgram_distance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>78</td>\n",
       "      <td>0.462069</td>\n",
       "      <td>78</td>\n",
       "      <td>0.764647</td>\n",
       "      <td>0.858788</td>\n",
       "      <td>119</td>\n",
       "      <td>True</td>\n",
       "      <td>0.622222</td>\n",
       "      <td>77</td>\n",
       "      <td>...</td>\n",
       "      <td>0.620690</td>\n",
       "      <td>19.0</td>\n",
       "      <td>0.619048</td>\n",
       "      <td>0.413793</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>0.619048</td>\n",
       "      <td>0.604651</td>\n",
       "      <td>21.0</td>\n",
       "      <td>3.789474</td>\n",
       "      <td>0.622222</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>32</td>\n",
       "      <td>0.769784</td>\n",
       "      <td>32</td>\n",
       "      <td>0.787642</td>\n",
       "      <td>0.872585</td>\n",
       "      <td>121</td>\n",
       "      <td>True</td>\n",
       "      <td>0.631579</td>\n",
       "      <td>110</td>\n",
       "      <td>...</td>\n",
       "      <td>0.478261</td>\n",
       "      <td>11.0</td>\n",
       "      <td>0.357143</td>\n",
       "      <td>0.260870</td>\n",
       "      <td>4.690416</td>\n",
       "      <td>0.357143</td>\n",
       "      <td>0.370370</td>\n",
       "      <td>11.0</td>\n",
       "      <td>0.380952</td>\n",
       "      <td>0.631579</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>38</td>\n",
       "      <td>0.672414</td>\n",
       "      <td>38</td>\n",
       "      <td>0.857150</td>\n",
       "      <td>0.914290</td>\n",
       "      <td>70</td>\n",
       "      <td>True</td>\n",
       "      <td>0.823529</td>\n",
       "      <td>95</td>\n",
       "      <td>...</td>\n",
       "      <td>0.388889</td>\n",
       "      <td>7.0</td>\n",
       "      <td>0.105263</td>\n",
       "      <td>0.111111</td>\n",
       "      <td>3.741657</td>\n",
       "      <td>0.105263</td>\n",
       "      <td>0.111111</td>\n",
       "      <td>7.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.823529</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>148</td>\n",
       "      <td>0.467626</td>\n",
       "      <td>148</td>\n",
       "      <td>0.766001</td>\n",
       "      <td>0.812801</td>\n",
       "      <td>257</td>\n",
       "      <td>True</td>\n",
       "      <td>0.444444</td>\n",
       "      <td>158</td>\n",
       "      <td>...</td>\n",
       "      <td>0.674419</td>\n",
       "      <td>31.0</td>\n",
       "      <td>-2.777778</td>\n",
       "      <td>-0.162791</td>\n",
       "      <td>7.615773</td>\n",
       "      <td>-2.777778</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>35.0</td>\n",
       "      <td>0.275862</td>\n",
       "      <td>0.444444</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>38</td>\n",
       "      <td>0.672414</td>\n",
       "      <td>38</td>\n",
       "      <td>0.821830</td>\n",
       "      <td>0.893098</td>\n",
       "      <td>100</td>\n",
       "      <td>True</td>\n",
       "      <td>0.787879</td>\n",
       "      <td>92</td>\n",
       "      <td>...</td>\n",
       "      <td>0.444444</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.200000</td>\n",
       "      <td>0.166667</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>0.200000</td>\n",
       "      <td>0.210526</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.787879</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 43 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   binary_distance  levenshtein_distance  edit_similarity  \\\n",
       "0              0.0                    78         0.462069   \n",
       "1              0.0                    32         0.769784   \n",
       "2              0.0                    38         0.672414   \n",
       "3              0.0                   148         0.467626   \n",
       "4              0.0                    38         0.672414   \n",
       "\n",
       "   damerau_levenshtein_distance  jaro_distance  jaro_winkler_distance  \\\n",
       "0                            78       0.764647               0.858788   \n",
       "1                            32       0.787642               0.872585   \n",
       "2                            38       0.857150               0.914290   \n",
       "3                           148       0.766001               0.812801   \n",
       "4                            38       0.821830               0.893098   \n",
       "\n",
       "   hamming_distance  match_rating_comparison  dice_coefficient  lcs_distance  \\\n",
       "0               119                     True          0.622222            77   \n",
       "1               121                     True          0.631579           110   \n",
       "2                70                     True          0.823529            95   \n",
       "3               257                     True          0.444444           158   \n",
       "4               100                     True          0.787879            92   \n",
       "\n",
       "   ...  matching_distance  minkowski_distance  rogerstanimoto_distance  \\\n",
       "0  ...           0.620690                19.0                 0.619048   \n",
       "1  ...           0.478261                11.0                 0.357143   \n",
       "2  ...           0.388889                 7.0                 0.105263   \n",
       "3  ...           0.674419                31.0                -2.777778   \n",
       "4  ...           0.444444                 8.0                 0.200000   \n",
       "\n",
       "   russellrao_distance  seuclidean_distance  sokalmichener_distance  \\\n",
       "0             0.413793             6.000000                0.619048   \n",
       "1             0.260870             4.690416                0.357143   \n",
       "2             0.111111             3.741657                0.105263   \n",
       "3            -0.162791             7.615773               -2.777778   \n",
       "4             0.166667             4.000000                0.200000   \n",
       "\n",
       "   sokalsneath_distance  sqeuclidean_distance  yule_distance  qgram_distance  \n",
       "0              0.604651                  21.0       3.789474        0.622222  \n",
       "1              0.370370                  11.0       0.380952        0.631579  \n",
       "2              0.111111                   7.0       0.000000        0.823529  \n",
       "3              0.000000                  35.0       0.275862        0.444444  \n",
       "4              0.210526                   8.0       0.000000        0.787879  \n",
       "\n",
       "[5 rows x 43 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pmatrix = distance_matrix(pdata, textsim.__all_distances__)\n",
    "print(pmatrix.shape)\n",
    "pmatrix.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data Cleanning\n",
    "\n",
    "Searching for null values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "correlation_distance    21\n",
      "seuclidean_distance     17\n",
      "yule_distance           33\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "#Counting null or infinite values\n",
    "null_values = pmatrix.isnull().sum()\n",
    "print(null_values[null_values>0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "kulsinski_distance    2\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "null_values = pmatrix[(pmatrix == -np.inf) | (pmatrix == np.inf)].count()\n",
    "print(null_values[null_values>0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "correlation_distance    21\n",
      "seuclidean_distance     17\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "#Replacing Yule distance null values\n",
    "def replace_null(df, col_name,func='mean'):\n",
    "    is_inf = df[col_name] == np.inf \n",
    "    is_ninf = df[col_name] == -np.inf\n",
    "    col_mean = df[col_name][~is_inf & ~is_ninf].mean()\n",
    "    row_mask = df[col_name].isnull()\n",
    "    df[col_name][row_mask] = col_mean\n",
    "    return df\n",
    "\n",
    "pmatrix = replace_null(pmatrix, 'yule_distance')\n",
    "#Counting missing values\n",
    "null_values = pmatrix.isnull().sum()\n",
    "print(null_values[null_values>0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Series([], dtype: int64)\n"
     ]
    }
   ],
   "source": [
    "#Repeating the process for 'correlation_distance' and 'seuclidean_distance'\n",
    "pmatrix = replace_null(pmatrix, 'correlation_distance')\n",
    "pmatrix = replace_null(pmatrix, 'seuclidean_distance')\n",
    "\n",
    "#Counting missing values\n",
    "null_values = pmatrix.isnull().sum()\n",
    "print(null_values[null_values>0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "kulsinski_distance    2\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "null_values = pmatrix[(pmatrix == -np.inf) | (pmatrix == np.inf)].count()\n",
    "print(null_values[null_values>0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def replace_null2(df, col_name,func='mean'):\n",
    "    is_inf = df[col_name] == np.inf \n",
    "    is_ninf = df[col_name] == -np.inf\n",
    "    col_mean = df[col_name][~is_inf & ~is_ninf].mean()\n",
    "    df[col_name][is_inf] = col_mean\n",
    "    df[col_name][is_ninf] = col_mean\n",
    "    return df\n",
    "\n",
    "pmatrix = replace_null2(pmatrix, 'kulsinski_distance')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Series([], dtype: int64)\n"
     ]
    }
   ],
   "source": [
    "null_values = pmatrix[(pmatrix == -np.inf) | (pmatrix == np.inf)].count()\n",
    "print(null_values[null_values>0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Droping not valuable features**\n",
    "\n",
    "The simple inspection of this columns series makes us to evaluate that the 'binary_distance', 'match_rating_comparison', 'damerau_levenshtein_distance' have 0.0 values, boolean values and same value than levenstein_distance respectively. So for the final calculation this columns are useless."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_pmatrix = pmatrix.drop(['binary_distance', 'match_rating_comparison', 'damerau_levenshtein_distance'], axis=1)\n",
    "pickle.dump(clean_pmatrix, open('data/ptrain.data.pkl', 'wb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Load preprocessed data for training in case you don't want to execute the\n",
    "#computational previous step\n",
    "pickle_pdata = open('data/ptrain.data.pkl', 'rb')\n",
    "pmatrix = pickle.load(pickle_pdata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(733, 40)"
      ]
     },
     "execution_count": 103,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pmatrix.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(pmatrix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feature engineering with the data without preprocess \n",
    "\n",
    "To compare the influence of this preprocess techniques in the similarity text problem we need to compare with the original data. As well as the preprocessed ``pdata`` this original ``data`` must be transformed into float numer matrices, or feature matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "#generating matrix distances\n",
    "matrix = distance_matrix(data, textsim.__all_distances__)\n",
    "#claning null\n",
    "matrix = replace_null(matrix, 'yule_distance')\n",
    "matrix = replace_null(matrix, 'correlation_distance')\n",
    "matrix = replace_null(matrix, 'seuclidean_distance')\n",
    "matrix = replace_null2(matrix, 'kulsinski_distance')\n",
    "#droping non valuable features\n",
    "matrix = matrix.drop(['binary_distance', 'match_rating_comparison', 'damerau_levenshtein_distance'], axis=1)\n",
    "pickle.dump(matrix, open('data/train.data.pkl', 'wb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(733, 40)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>levenshtein_distance</th>\n",
       "      <th>edit_similarity</th>\n",
       "      <th>jaro_distance</th>\n",
       "      <th>jaro_winkler_distance</th>\n",
       "      <th>hamming_distance</th>\n",
       "      <th>dice_coefficient</th>\n",
       "      <th>lcs_distance</th>\n",
       "      <th>lcs_similarity</th>\n",
       "      <th>smith_waterman_distance</th>\n",
       "      <th>needleman_wunsch_distance</th>\n",
       "      <th>...</th>\n",
       "      <th>matching_distance</th>\n",
       "      <th>minkowski_distance</th>\n",
       "      <th>rogerstanimoto_distance</th>\n",
       "      <th>russellrao_distance</th>\n",
       "      <th>seuclidean_distance</th>\n",
       "      <th>sokalmichener_distance</th>\n",
       "      <th>sokalsneath_distance</th>\n",
       "      <th>sqeuclidean_distance</th>\n",
       "      <th>yule_distance</th>\n",
       "      <th>qgram_distance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>78</td>\n",
       "      <td>0.462069</td>\n",
       "      <td>0.764647</td>\n",
       "      <td>0.858788</td>\n",
       "      <td>119</td>\n",
       "      <td>0.622222</td>\n",
       "      <td>77</td>\n",
       "      <td>0.611111</td>\n",
       "      <td>90.0</td>\n",
       "      <td>156.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.620690</td>\n",
       "      <td>19.0</td>\n",
       "      <td>0.619048</td>\n",
       "      <td>0.413793</td>\n",
       "      <td>6.000000</td>\n",
       "      <td>0.619048</td>\n",
       "      <td>0.604651</td>\n",
       "      <td>21.0</td>\n",
       "      <td>3.789474</td>\n",
       "      <td>0.622222</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>32</td>\n",
       "      <td>0.769784</td>\n",
       "      <td>0.787642</td>\n",
       "      <td>0.872585</td>\n",
       "      <td>121</td>\n",
       "      <td>0.631579</td>\n",
       "      <td>110</td>\n",
       "      <td>0.830189</td>\n",
       "      <td>186.0</td>\n",
       "      <td>64.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.478261</td>\n",
       "      <td>11.0</td>\n",
       "      <td>0.357143</td>\n",
       "      <td>0.260870</td>\n",
       "      <td>4.690416</td>\n",
       "      <td>0.357143</td>\n",
       "      <td>0.370370</td>\n",
       "      <td>11.0</td>\n",
       "      <td>0.380952</td>\n",
       "      <td>0.631579</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>38</td>\n",
       "      <td>0.672414</td>\n",
       "      <td>0.857150</td>\n",
       "      <td>0.914290</td>\n",
       "      <td>70</td>\n",
       "      <td>0.823529</td>\n",
       "      <td>95</td>\n",
       "      <td>0.818966</td>\n",
       "      <td>152.0</td>\n",
       "      <td>76.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.388889</td>\n",
       "      <td>7.0</td>\n",
       "      <td>0.105263</td>\n",
       "      <td>0.111111</td>\n",
       "      <td>3.741657</td>\n",
       "      <td>0.105263</td>\n",
       "      <td>0.111111</td>\n",
       "      <td>7.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.823529</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>148</td>\n",
       "      <td>0.467626</td>\n",
       "      <td>0.766001</td>\n",
       "      <td>0.812801</td>\n",
       "      <td>257</td>\n",
       "      <td>0.444444</td>\n",
       "      <td>158</td>\n",
       "      <td>0.635815</td>\n",
       "      <td>203.0</td>\n",
       "      <td>296.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.674419</td>\n",
       "      <td>31.0</td>\n",
       "      <td>-2.777778</td>\n",
       "      <td>-0.162791</td>\n",
       "      <td>7.615773</td>\n",
       "      <td>-2.777778</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>35.0</td>\n",
       "      <td>0.275862</td>\n",
       "      <td>0.444444</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>38</td>\n",
       "      <td>0.672414</td>\n",
       "      <td>0.821830</td>\n",
       "      <td>0.893098</td>\n",
       "      <td>100</td>\n",
       "      <td>0.787879</td>\n",
       "      <td>92</td>\n",
       "      <td>0.803493</td>\n",
       "      <td>141.0</td>\n",
       "      <td>76.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.444444</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.200000</td>\n",
       "      <td>0.166667</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>0.200000</td>\n",
       "      <td>0.210526</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.787879</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 40 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   levenshtein_distance  edit_similarity  jaro_distance  \\\n",
       "0                    78         0.462069       0.764647   \n",
       "1                    32         0.769784       0.787642   \n",
       "2                    38         0.672414       0.857150   \n",
       "3                   148         0.467626       0.766001   \n",
       "4                    38         0.672414       0.821830   \n",
       "\n",
       "   jaro_winkler_distance  hamming_distance  dice_coefficient  lcs_distance  \\\n",
       "0               0.858788               119          0.622222            77   \n",
       "1               0.872585               121          0.631579           110   \n",
       "2               0.914290                70          0.823529            95   \n",
       "3               0.812801               257          0.444444           158   \n",
       "4               0.893098               100          0.787879            92   \n",
       "\n",
       "   lcs_similarity  smith_waterman_distance  needleman_wunsch_distance  ...  \\\n",
       "0        0.611111                     90.0                      156.0  ...   \n",
       "1        0.830189                    186.0                       64.0  ...   \n",
       "2        0.818966                    152.0                       76.0  ...   \n",
       "3        0.635815                    203.0                      296.0  ...   \n",
       "4        0.803493                    141.0                       76.0  ...   \n",
       "\n",
       "   matching_distance  minkowski_distance  rogerstanimoto_distance  \\\n",
       "0           0.620690                19.0                 0.619048   \n",
       "1           0.478261                11.0                 0.357143   \n",
       "2           0.388889                 7.0                 0.105263   \n",
       "3           0.674419                31.0                -2.777778   \n",
       "4           0.444444                 8.0                 0.200000   \n",
       "\n",
       "   russellrao_distance  seuclidean_distance  sokalmichener_distance  \\\n",
       "0             0.413793             6.000000                0.619048   \n",
       "1             0.260870             4.690416                0.357143   \n",
       "2             0.111111             3.741657                0.105263   \n",
       "3            -0.162791             7.615773               -2.777778   \n",
       "4             0.166667             4.000000                0.200000   \n",
       "\n",
       "   sokalsneath_distance  sqeuclidean_distance  yule_distance  qgram_distance  \n",
       "0              0.604651                  21.0       3.789474        0.622222  \n",
       "1              0.370370                  11.0       0.380952        0.631579  \n",
       "2              0.111111                   7.0       0.000000        0.823529  \n",
       "3              0.000000                  35.0       0.275862        0.444444  \n",
       "4              0.210526                   8.0       0.000000        0.787879  \n",
       "\n",
       "[5 rows x 40 columns]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(matrix.shape)\n",
    "matrix.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "    #saving this matrix for future analysis\n",
    "    pickle.dump(matrix, open('data/train.data.pkl', 'wb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Loading distances matrix of original data\n",
    "pickle_data = open('data/train.data.pkl', 'rb')\n",
    "matrix = pickle.load(pickle_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Machine Learning model\n",
    "\n",
    "[Some kind of Logistic Regression for classification.]\n",
    "\n",
    "[Features, use textsim.calc_all](make a brief description here, and link with github.com/sorice/textsim)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(steps=[('standardscaler', StandardScaler()),\n",
       "                ('sgdregressor', SGDRegressor(max_iter=100))])"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "from sklearn.linear_model import SGDRegressor\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "y = data['score']\n",
    "X = matrix\n",
    "\n",
    "#Training without preprocessing\n",
    "reg = make_pipeline(StandardScaler(),SGDRegressor(max_iter=100, tol=1e-3))\n",
    "reg.fit(X,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**How to see one element prediction?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "prediction: 4.480812387415777\n",
      "original: 4.25\n"
     ]
    }
   ],
   "source": [
    "test = matrix.iloc[0].to_numpy()\n",
    "test_element = test.reshape(-1,1).reshape(1,-1)\n",
    "\n",
    "print('prediction:', reg.predict(test_element)[0])\n",
    "print('original:', data['score'][0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regression Model Generation with Preprocessed Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(steps=[('standardscaler', StandardScaler()),\n",
       "                ('sgdregressor', SGDRegressor(max_iter=100))])"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y = data['score']\n",
    "pX = pmatrix\n",
    "preg = make_pipeline(StandardScaler(),SGDRegressor(max_iter=100, tol=1e-3))\n",
    "preg.fit(pX,y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Validation\n",
    "\n",
    "[Show differences between scores obtained with/without preprocess]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pearson correlation without preprocessing\n",
      "Pearson coeff: 0.7667865417395282 p-value: 6.989890360376282e-143\n",
      "Pearson correlation with preprocessing\n",
      "Pearson coeff: 0.765653267795623 p-value: 3.2564477583017092e-142\n"
     ]
    }
   ],
   "source": [
    "from scipy import stats\n",
    "\n",
    "print(\"Pearson correlation without preprocessing\")\n",
    "predict = pd.DataFrame()\n",
    "predict['score'] = reg.predict(X)\n",
    "p = stats.pearsonr(data['score'],predict['score'])\n",
    "print(\"Pearson coeff:\",p[0], \"p-value:\", p[1])\n",
    "\n",
    "print(\"Pearson correlation with preprocessing\")\n",
    "ppredict = pd.DataFrame()\n",
    "ppredict['score'] = preg.predict(pX)\n",
    "pp = stats.pearsonr(data['score'],ppredict['score'])\n",
    "print(\"Pearson coeff:\",pp[0], \"p-value:\", pp[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recommendations\n",
    "\n",
    "* usually we must reduce dimensionality, for better interpretabillity\n",
    "  of the model, less complexity, reduce the training time, avoid \n",
    "  overfitting and gain capacity of generalization \n",
    "\n",
    "* Feature selection process is not objective of this tutorial, but it\n",
    "  is recommended that comparing the list of must important features,\n",
    "  could show how preprocess is relevant for improving results, due to\n",
    "  the straight relation between preprocess and selected features. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}