{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Applying Preprocess for Real\n", "\n", "This tutorial shows ``preprocess`` in a real context. After a \n", "quickstart with the library and the basics of text normalization with \n", "Python, the next obvious step is to apply preprocessing techniques to a \n", "real NLP problem.\n", "\n", "The selected problem is *Semantic Text Similarity*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Semantic Text Similarity\n", "\n", "SemEval is the International Workshop on Semantic Evaluation, currently\n", "part of the Lexical and Computational Semantics and Semantic Evaluation\n", "scientific conference. The workshop is divided into tasks; the task of\n", "interest here is [Semantic Text Similarity](http://alt.qcri.org/semeval2012/task17/),\n", "whose objective is to measure the degree of semantic equivalence between\n", "two texts. The data is composed of sentence pairs drawn from previously\n", "existing paraphrase datasets [Agirre2012]_.\n", "\n", "In the gold standard, semantic equivalence is usually measured with\n", "a float number in the range [0, 5]." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset\n", "\n", "The data used for this example is a small part of the SemEval 2012 Shared\n", "[Task 6 dataset](https://www.cs.york.ac.uk/semeval-2012/task6/index.php%3Fid=data.html), the en-en subset.\n", "\n", "The subset comes from MSR-Paraphrase, the [Microsoft Research Paraphrase Corpus](http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/),\n", "and contains 750 pairs of sentences.\n", "\n", "### Legal Note\n", "\n", "The STS 2012 dataset is distributed under these licenses:\n", "* http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/\n", "* http://research.microsoft.com/en-us/downloads/38cf15fd-b8df-477e-a4e4-a4680caa75af/" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Load the tab-separated dataset into a DataFrame\n", "import pandas as pd\n", "data = pd.read_csv('data/2012SMTeuroparl.train.tsv', sep='\\t')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | score | s1 | s2 |
|---|---|---|---|
| 0 | 4.25 | I know that in France they have had whole herd... | I know that in France, the principle of slaugh... |
| 1 | 4.80 | Unfortunately, the ultimate objective of a Eur... | Unfortunately the final objective of a Europea... |
| 2 | 4.80 | The right of a government arbitrarily to set a... | The right for a government to draw aside its c... |
| 3 | 4.00 | The House had also fought, however, for the re... | This Parliament has also fought for this reduc... |
| 4 | 4.80 | The right of a government arbitrarily to set a... | The right for a government to dismiss arbitrar... |
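
Before computing any features, it can help to confirm that the loaded frame has the expected shape. The following is a minimal sketch, assuming the `score`, `s1`, and `s2` columns shown above and the [0, 5] gold-standard range; the helper name `check_sts_frame` and the extra `score_norm` column are hypothetical and not part of the original notebook. The sample rows reuse the truncated sentences exactly as displayed in the table.

```python
import pandas as pd

# Column names taken from the table above.
REQUIRED_COLUMNS = ("score", "s1", "s2")


def check_sts_frame(frame, low=0.0, high=5.0):
    """Validate an STS frame and add a score rescaled to [0, 1].

    Hypothetical helper: raises ValueError if a required column is
    missing or a gold score falls outside [low, high].
    """
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    # Series.between is inclusive on both ends by default.
    in_range = frame["score"].between(low, high)
    if not in_range.all():
        raise ValueError("gold scores must lie in [%s, %s]" % (low, high))

    # Rescale the gold score so it is comparable with [0, 1] similarities.
    return frame.assign(score_norm=(frame["score"] - low) / (high - low))


# Two rows copied from the table above (sentences are truncated there too).
sample = pd.DataFrame(
    {
        "score": [4.25, 4.80],
        "s1": [
            "I know that in France they have had whole herd...",
            "Unfortunately, the ultimate objective of a Eur...",
        ],
        "s2": [
            "I know that in France, the principle of slaugh...",
            "Unfortunately the final objective of a Europea...",
        ],
    }
)
checked = check_sts_frame(sample)
```

In the notebook itself the same call would be `check_sts_frame(data)`; the rescaled column is only a convenience for comparing gold scores with similarity measures that are already bounded by 1.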
|   | binary_distance | levenshtein_distance | edit_similarity | damerau_levenshtein_distance | jaro_distance | jaro_winkler_distance | hamming_distance | match_rating_comparison | dice_coefficient | lcs_distance | ... | matching_distance | minkowski_distance | rogerstanimoto_distance | russellrao_distance | seuclidean_distance | sokalmichener_distance | sokalsneath_distance | sqeuclidean_distance | yule_distance | qgram_distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 78 | 0.462069 | 78 | 0.764647 | 0.858788 | 119 | True | 0.622222 | 77 | ... | 0.620690 | 19.0 | 0.619048 | 0.413793 | 6.000000 | 0.619048 | 0.604651 | 21.0 | 3.789474 | 0.622222 |
| 1 | 0.0 | 32 | 0.769784 | 32 | 0.787642 | 0.872585 | 121 | True | 0.631579 | 110 | ... | 0.478261 | 11.0 | 0.357143 | 0.260870 | 4.690416 | 0.357143 | 0.370370 | 11.0 | 0.380952 | 0.631579 |
| 2 | 0.0 | 38 | 0.672414 | 38 | 0.857150 | 0.914290 | 70 | True | 0.823529 | 95 | ... | 0.388889 | 7.0 | 0.105263 | 0.111111 | 3.741657 | 0.105263 | 0.111111 | 7.0 | 0.000000 | 0.823529 |
| 3 | 0.0 | 148 | 0.467626 | 148 | 0.766001 | 0.812801 | 257 | True | 0.444444 | 158 | ... | 0.674419 | 31.0 | -2.777778 | -0.162791 | 7.615773 | -2.777778 | 0.000000 | 35.0 | 0.275862 | 0.444444 |
| 4 | 0.0 | 38 | 0.672414 | 38 | 0.821830 | 0.893098 | 100 | True | 0.787879 | 92 | ... | 0.444444 | 8.0 | 0.200000 | 0.166667 | 4.000000 | 0.200000 | 0.210526 | 8.0 | 0.000000 | 0.787879 |
5 rows × 43 columns
|   | levenshtein_distance | edit_similarity | jaro_distance | jaro_winkler_distance | hamming_distance | dice_coefficient | lcs_distance | lcs_similarity | smith_waterman_distance | needleman_wunsch_distance | ... | matching_distance | minkowski_distance | rogerstanimoto_distance | russellrao_distance | seuclidean_distance | sokalmichener_distance | sokalsneath_distance | sqeuclidean_distance | yule_distance | qgram_distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 78 | 0.462069 | 0.764647 | 0.858788 | 119 | 0.622222 | 77 | 0.611111 | 90.0 | 156.0 | ... | 0.620690 | 19.0 | 0.619048 | 0.413793 | 6.000000 | 0.619048 | 0.604651 | 21.0 | 3.789474 | 0.622222 |
| 1 | 32 | 0.769784 | 0.787642 | 0.872585 | 121 | 0.631579 | 110 | 0.830189 | 186.0 | 64.0 | ... | 0.478261 | 11.0 | 0.357143 | 0.260870 | 4.690416 | 0.357143 | 0.370370 | 11.0 | 0.380952 | 0.631579 |
| 2 | 38 | 0.672414 | 0.857150 | 0.914290 | 70 | 0.823529 | 95 | 0.818966 | 152.0 | 76.0 | ... | 0.388889 | 7.0 | 0.105263 | 0.111111 | 3.741657 | 0.105263 | 0.111111 | 7.0 | 0.000000 | 0.823529 |
| 3 | 148 | 0.467626 | 0.766001 | 0.812801 | 257 | 0.444444 | 158 | 0.635815 | 203.0 | 296.0 | ... | 0.674419 | 31.0 | -2.777778 | -0.162791 | 7.615773 | -2.777778 | 0.000000 | 35.0 | 0.275862 | 0.444444 |
| 4 | 38 | 0.672414 | 0.821830 | 0.893098 | 100 | 0.787879 | 92 | 0.803493 | 141.0 | 76.0 | ... | 0.444444 | 8.0 | 0.200000 | 0.166667 | 4.000000 | 0.200000 | 0.210526 | 8.0 | 0.000000 | 0.787879 |
5 rows × 40 columns
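
The drop from 43 to 40 columns is consistent with removing features that carry no information: in the rows shown, `binary_distance` is always 0.0, `match_rating_comparison` is always True, and `damerau_levenshtein_distance` repeats `levenshtein_distance`. The filtering step actually used in the notebook is not visible here, so the following is only a sketch of one way to get such a reduction with pandas. The helper name `drop_uninformative_columns` is hypothetical, and the toy frame copies a few values from the first three rows of the table.

```python
import pandas as pd


def drop_uninformative_columns(features):
    """Drop constant columns and exact duplicates of earlier columns.

    Hypothetical helper: an assumption about how the feature table
    could be reduced, not the notebook's confirmed method.
    """
    # A column with a single distinct value cannot help a model.
    constant = [
        col for col in features.columns
        if features[col].nunique(dropna=False) <= 1
    ]
    reduced = features.drop(columns=constant)

    # Keep the first of any group of columns holding identical values.
    kept = []
    for col in reduced.columns:
        if any(reduced[col].equals(reduced[k]) for k in kept):
            continue
        kept.append(col)
    return reduced[kept]


# Values copied from rows 0-2 of the feature table above.
toy = pd.DataFrame(
    {
        "binary_distance": [0.0, 0.0, 0.0],
        "levenshtein_distance": [78, 32, 38],
        "damerau_levenshtein_distance": [78, 32, 38],
        "match_rating_comparison": [True, True, True],
        "jaro_distance": [0.764647, 0.787642, 0.857150],
    }
)
filtered = drop_uninformative_columns(toy)
```

Note that `Series.equals` also compares dtypes, so an integer column and a float column with the same numeric values would both be kept; whether that is desirable depends on how the features were computed.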