{
"cells": [
{
"cell_type": "markdown",
"id": "eb9ee8d4-afdd-400a-bae0-03ca972a2559",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"# Natural Language Processing Tutorial 1 - Static Word Embeddings with Word2Vec\n",
""
]
},
{
"cell_type": "markdown",
"id": "71725826-1cd3-4a89-9471-509ccb926af9",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## What you'll learn in this tutorial\n",
"\n",
"- We'll use the [gensim](https://radimrehurek.com/gensim/index.html) library to:\n",
" - explore pretrained word embeddings\n",
" - pretrain our own embeddings\n",
"- We will additionally:\n",
" - visualize word embeddings\n",
" - evaluate them intrisically and extrinsically"
]
},
{
"cell_type": "markdown",
"id": "ad4b05ea-4690-4368-a195-a82cbe8382d6",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"## Our schedule for today\n",
"\n",
"- Part 1: Using pretrained word embeddings with gensim\n",
" - How to download already pretrained embeddings\n",
" - Nearest neighbour similarity search \n",
" - Word embedding visualization via PCA\n",
" - Intrisic evaluation with word analogy and word similarity benchmarks\n",
" - **Task 1**\n",
"- Part 2: Pretraining your **own** embeddings\n",
" - Training choices\n",
" - Saving and loading your embeddings\n",
"- Part 3: Extrinsic evaluation of word embeddings\n",
" - Using word2vec embeddings for spam classification\n",
" - **Task 2**"
]
},
{
"cell_type": "markdown",
"id": "53576457-01be-4ec5-9404-9fdff84906cb",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## Part 1 : Using pretrained embeddings with gensim\n",
"\n",
"
"
]
},
{
"cell_type": "markdown",
"id": "263da382-d2e4-41c3-b002-ee540da2fbab",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### What is gensim?\n",
"\n",
"- Gensim is one of many core NLP libraries:\n",
" - together with [NLTK](https://www.nltk.org), [spaCy](https://spacy.io) and [HuggingFace 🤗](https://huggingface.co)\n",
" - you can find its documentation [here](https://radimrehurek.com/gensim/auto_examples/index.html#other-resources)\n",
"- It can be used to deal with corpora and perform:\n",
" - Retrieval\n",
" - Topic Modelling\n",
" - Representation Learning (**word2vec** and **doc2vec**)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "1ebdec4e",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"outputs": [],
"source": [
"# Run this cell now!\n",
"import gensim\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"import gensim.downloader as api\n",
"from gensim import utils\n",
"from gensim.models import KeyedVectors\n",
"from gensim.test.utils import datapath\n",
"from gensim.models import Word2Vec\n",
"\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.metrics import classification_report\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LogisticRegressionCV\n",
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"from scipy.stats import pearsonr, spearmanr\n",
"\n",
"import nltk\n",
"from nltk.corpus import stopwords\n",
"\n",
"import torch\n",
"import torch.nn as nn\n",
"\n",
"import plotly.express as px"
]
},
{
"cell_type": "markdown",
"id": "ee260e2b-1403-4be2-af68-86f63869a5b1",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### Let's download some embeddings!"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a2ce6ee4-c4c5-4782-b404-c15f3ba0b1ad",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Run this cell now!\n",
"word_emb = api.load('word2vec-google-news-300')"
]
},
{
"cell_type": "markdown",
"id": "941bbbf7-0af0-43b9-994f-43186b2c0bf5",
"metadata": {},
"source": [
"- The object that we get is of type [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html)\n",
"- This is simply a map $w \\rightarrow \\mathbf{e}_w \\in \\mathbb{R}^{300}$\n",
"- You can explore [here](https://github.com/RaRe-Technologies/gensim-data#models) all the possible models or simply run ```api.info()```"
]
},
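{
"cell_type": "markdown",
"id": "api-info-sketch-md",
"metadata": {},
"source": [
"- For example, a quick sketch to list a few downloadable models (```api.info()``` returns a dict with ```corpora``` and ```models``` sections):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "api-info-sketch",
"metadata": {},
"outputs": [],
"source": [
"info = api.info()\n",
"print(list(info['models'])[:10]) # names of the first few downloadable models"
]
},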
{
"cell_type": "markdown",
"id": "bf5fe072-b0ee-4e3c-81ac-9220e3d17352",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### How do these embeddings look like?"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a7261e0d-a31a-4011-92e9-1dfeeb39c034",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(300,)\n",
"[-0.06445312 -0.16015625 -0.01208496 0.13476562 -0.22949219 0.16210938\n",
" 0.3046875 -0.1796875 -0.12109375 0.25390625 -0.01428223 -0.06396484\n",
" -0.08056641 -0.05688477 -0.19628906 0.2890625 -0.05151367 0.14257812\n",
" -0.10498047 -0.04736328 -0.34765625 0.35742188 0.265625 0.00188446\n",
" -0.01586914 0.00195312 -0.35546875 0.22167969 0.05761719 0.15917969\n",
" 0.08691406 -0.0267334 -0.04785156 0.23925781 -0.05981445 0.0378418\n",
" 0.17382812 -0.41796875 0.2890625 0.32617188 0.02429199 -0.01647949\n",
" -0.06494141 -0.08886719 0.07666016 -0.15136719 0.05249023 -0.04199219\n",
" -0.05419922 0.00108337 -0.20117188 0.12304688 0.09228516 0.10449219\n",
" -0.00408936 -0.04199219 0.01409912 -0.02111816 -0.13476562 -0.24316406\n",
" 0.16015625 -0.06689453 -0.08984375 -0.07177734 -0.00595093 -0.00482178\n",
" -0.00089264 -0.30664062 -0.0625 0.07958984 -0.00909424 -0.04492188\n",
" 0.09960938 -0.33398438 -0.3984375 0.05541992 -0.06689453 -0.04467773\n",
" 0.11767578 -0.13964844 -0.26367188 0.17480469 -0.17382812 -0.40625\n",
" -0.06738281 -0.07617188 0.09423828 0.20996094 -0.16308594 -0.08691406\n",
" -0.0534668 -0.10351562 -0.07617188 -0.11083984 -0.03515625 -0.14941406\n",
" 0.0378418 0.38671875 0.14160156 -0.2890625 -0.16894531 -0.140625\n",
" -0.04174805 0.22753906 0.24023438 -0.01599121 -0.06787109 0.21875\n",
" -0.42382812 -0.5625 -0.49414062 -0.3359375 0.13378906 0.01141357\n",
" 0.13671875 0.0324707 0.06835938 -0.27539062 -0.15917969 0.00121307\n",
" 0.01208496 -0.0039978 0.00442505 -0.04541016 0.08642578 0.09960938\n",
" -0.04296875 -0.11328125 0.13867188 0.41796875 -0.28320312 -0.07373047\n",
" -0.11425781 0.08691406 -0.02148438 0.328125 -0.07373047 -0.01348877\n",
" 0.17773438 -0.02624512 0.13378906 -0.11132812 -0.12792969 -0.12792969\n",
" 0.18945312 -0.13867188 0.29882812 -0.07714844 -0.37695312 -0.10351562\n",
" 0.16992188 -0.10742188 -0.29882812 0.00866699 -0.27734375 -0.20996094\n",
" -0.1796875 -0.19628906 -0.22167969 0.08886719 -0.27734375 -0.13964844\n",
" 0.15917969 0.03637695 0.03320312 -0.08105469 0.25390625 -0.08691406\n",
" -0.21289062 -0.18945312 -0.22363281 0.06542969 -0.16601562 0.08837891\n",
" -0.359375 -0.09863281 0.35546875 -0.00741577 0.19042969 0.16992188\n",
" -0.06005859 -0.20605469 0.08105469 0.12988281 -0.01135254 0.33203125\n",
" -0.08691406 0.27539062 -0.03271484 0.12011719 -0.0625 0.1953125\n",
" -0.10986328 -0.11767578 0.20996094 0.19921875 0.02954102 -0.16015625\n",
" 0.00276184 -0.01367188 0.03442383 -0.19335938 0.00352478 -0.06542969\n",
" -0.05566406 0.09423828 0.29296875 0.04052734 -0.09326172 -0.10107422\n",
" -0.27539062 0.04394531 -0.07275391 0.13867188 0.02380371 0.13085938\n",
" 0.00236511 -0.2265625 0.34765625 0.13574219 0.05224609 0.18164062\n",
" 0.0402832 0.23730469 -0.16992188 0.10058594 0.03833008 0.10839844\n",
" -0.05615234 -0.00946045 0.14550781 -0.30078125 -0.32226562 0.18847656\n",
" -0.40234375 -0.3125 -0.08007812 -0.26757812 0.16699219 0.07324219\n",
" 0.06347656 0.06591797 0.17285156 -0.17773438 0.00276184 -0.05761719\n",
" -0.2265625 -0.19628906 0.09667969 0.13769531 -0.49414062 -0.27929688\n",
" 0.12304688 -0.30078125 0.01293945 -0.1875 -0.20898438 -0.1796875\n",
" -0.16015625 -0.03295898 0.00976562 0.25390625 -0.25195312 0.00210571\n",
" 0.04296875 0.01184082 -0.20605469 0.24804688 -0.203125 -0.17773438\n",
" 0.07275391 0.04541016 0.21679688 -0.2109375 0.14550781 -0.16210938\n",
" 0.20410156 -0.19628906 -0.35742188 0.35742188 -0.11962891 0.35742188\n",
" 0.10351562 0.07080078 -0.24707031 -0.10449219 -0.19238281 0.1484375\n",
" 0.00057983 0.296875 -0.12695312 -0.03979492 0.13183594 -0.16601562\n",
" 0.125 0.05126953 -0.14941406 0.13671875 -0.02075195 0.34375 ]\n"
]
}
],
"source": [
"# Access embeddings with word-lookup\n",
"print(word_emb[\"apple\"].shape)\n",
"print(word_emb[\"apple\"])"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "52a63f19-ac08-40a4-99c5-86b76bab8393",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 2.60009766e-02 -1.89208984e-03 1.85546875e-01 -5.17578125e-02\n",
" 5.12695312e-03 -1.09863281e-01 -8.17871094e-03 -8.83789062e-02\n",
" 9.66796875e-02 4.83398438e-02 1.10473633e-02 -3.63281250e-01\n",
" 8.20312500e-02 -2.12402344e-02 1.58203125e-01 4.41894531e-02\n",
" -1.17797852e-02 2.12890625e-01 -5.73730469e-02 5.66406250e-02\n",
" -1.07421875e-01 1.85546875e-01 7.71484375e-02 1.44958496e-04\n",
" 1.52343750e-01 -6.54296875e-02 -1.52343750e-01 2.25585938e-01\n",
" 8.10546875e-02 8.88671875e-02 7.32421875e-02 -1.03515625e-01\n",
" -6.68945312e-02 1.76757812e-01 2.12890625e-01 1.40625000e-01\n",
" -3.41796875e-02 1.78222656e-02 5.95703125e-02 2.86102295e-04\n",
" 5.88378906e-02 9.27734375e-03 1.66992188e-01 -2.70080566e-03\n",
" 1.15722656e-01 1.04492188e-01 5.37109375e-02 1.85546875e-02\n",
" 1.06445312e-01 5.05371094e-02 -1.64794922e-02 -1.27929688e-01\n",
" 2.16796875e-01 5.15136719e-02 4.78515625e-02 1.52343750e-01\n",
" 1.71875000e-01 7.86132812e-02 -5.88378906e-02 -4.29687500e-02\n",
" -7.27539062e-02 1.81640625e-01 -8.05664062e-02 -1.54296875e-01\n",
" -1.16699219e-01 8.44726562e-02 -6.17675781e-02 -4.51660156e-02\n",
" 9.21630859e-03 1.33789062e-01 1.92871094e-02 6.44531250e-02\n",
" 1.08886719e-01 1.58203125e-01 -2.35595703e-02 1.23535156e-01\n",
" 1.69921875e-01 3.49121094e-02 1.29882812e-01 2.65625000e-01\n",
" 1.93359375e-01 -8.83789062e-02 8.49609375e-02 -2.96630859e-02\n",
" 5.76171875e-02 2.51464844e-02 -1.01562500e-01 1.99218750e-01\n",
" 1.04492188e-01 -2.42919922e-02 2.01416016e-02 -3.51562500e-02\n",
" 6.64062500e-02 -6.20117188e-02 2.90527344e-02 -9.81445312e-02\n",
" -1.81640625e-01 2.14843750e-01 -5.76171875e-02 -4.51660156e-02\n",
" 4.49218750e-02 -1.95312500e-02 -2.08984375e-01 1.19628906e-01\n",
" -9.03320312e-02 5.07812500e-02 9.03320312e-03 -9.76562500e-02\n",
" -7.86132812e-02 -1.36718750e-01 -1.13769531e-01 -5.64575195e-03\n",
" -4.07714844e-02 -2.05993652e-03 -5.66406250e-02 3.64685059e-03\n",
" 8.30078125e-02 -7.08007812e-02 2.63671875e-01 1.24511719e-01\n",
" -1.61132812e-02 9.13085938e-02 -2.39257812e-01 -1.04980469e-02\n",
" -6.78710938e-02 1.40625000e-01 2.34375000e-01 -6.39648438e-02\n",
" 1.95312500e-01 5.02929688e-02 -1.25000000e-01 2.06298828e-02\n",
" -1.19140625e-01 -1.17187500e-01 -9.01222229e-05 3.68652344e-02\n",
" 1.46484375e-01 2.47802734e-02 -1.49414062e-01 3.03649902e-03\n",
" -3.10058594e-02 1.06933594e-01 2.55859375e-01 -6.00585938e-02\n",
" -2.07031250e-01 1.58203125e-01 -2.15820312e-01 -1.84570312e-01\n",
" -1.72851562e-01 7.99560547e-03 -3.03955078e-02 9.81445312e-02\n",
" 4.66918945e-03 2.57812500e-01 1.06933594e-01 1.26953125e-01\n",
" 6.34765625e-02 -1.30859375e-01 6.54296875e-02 -9.91210938e-02\n",
" 5.90820312e-02 -3.71093750e-02 1.01074219e-01 1.53320312e-01\n",
" -1.53320312e-01 -7.56835938e-02 5.85937500e-02 -5.05371094e-02\n",
" 2.08007812e-01 4.85839844e-02 -9.42382812e-02 -9.71679688e-02\n",
" -1.23046875e-01 -1.97265625e-01 -1.76757812e-01 -1.11328125e-01\n",
" 1.11328125e-01 -5.88378906e-02 2.27539062e-01 4.00390625e-02\n",
" 1.24511719e-01 1.47460938e-01 1.81884766e-02 4.05273438e-02\n",
" 1.69921875e-01 1.13769531e-01 -2.24609375e-02 6.73828125e-02\n",
" 8.59375000e-02 6.73828125e-02 2.06298828e-02 4.78515625e-02\n",
" 1.84326172e-02 2.05078125e-01 -4.68750000e-02 2.00195312e-01\n",
" -1.56250000e-02 -1.40625000e-01 1.09863281e-02 -1.73828125e-01\n",
" 4.85839844e-02 -1.58203125e-01 -1.04492188e-01 3.63769531e-02\n",
" 3.01513672e-02 1.27929688e-01 -1.14257812e-01 1.41601562e-01\n",
" 2.34375000e-01 -8.98437500e-02 -1.02996826e-03 -1.50390625e-01\n",
" 1.79687500e-01 1.35742188e-01 -2.08007812e-01 -1.27563477e-02\n",
" 1.75781250e-01 -1.39648438e-01 -2.03125000e-01 -3.00292969e-02\n",
" -2.78320312e-02 -6.50024414e-03 1.26953125e-01 -1.49414062e-01\n",
" 1.46484375e-01 -8.42285156e-03 1.12304688e-01 1.66015625e-01\n",
" -1.57470703e-02 1.23046875e-01 7.22656250e-02 -4.37011719e-02\n",
" -7.56835938e-02 -9.03320312e-02 1.01562500e-01 -1.44531250e-01\n",
" -4.00390625e-02 -1.26953125e-02 2.66113281e-02 -7.81250000e-02\n",
" 3.56445312e-02 3.49121094e-02 1.79687500e-01 -1.38671875e-01\n",
" 2.80761719e-02 -2.86865234e-02 6.78710938e-02 7.03125000e-02\n",
" 9.57031250e-02 5.00488281e-02 -2.20947266e-02 -3.00781250e-01\n",
" 1.14257812e-01 7.51953125e-02 1.26342773e-02 1.32812500e-01\n",
" 2.52685547e-02 3.63769531e-02 -2.81982422e-02 -1.36718750e-01\n",
" 1.79687500e-01 -9.27734375e-02 8.49609375e-02 1.32812500e-01\n",
" 3.97949219e-02 4.29687500e-01 -1.87988281e-02 -1.47460938e-01\n",
" 6.10351562e-02 9.03320312e-02 8.69140625e-02 -6.88476562e-02\n",
" 1.10839844e-01 9.81445312e-02 1.50390625e-01 1.61132812e-01\n",
" -8.05664062e-02 -1.74804688e-01 -3.32031250e-02 -1.28906250e-01\n",
" 1.22558594e-01 -1.44653320e-02 -1.63085938e-01 -3.58886719e-02\n",
" 2.78320312e-02 -6.34765625e-02 -7.91015625e-02 -1.14746094e-01\n",
" 1.84326172e-02 2.91748047e-02 -3.00781250e-01 -4.58984375e-02\n",
" -1.74804688e-01 2.33398438e-01 2.25830078e-02 1.10351562e-01\n",
" -1.03515625e-01 -1.21582031e-01 2.21679688e-01 -2.19726562e-02]\n"
]
}
],
"source": [
"# Access embeddings with index-lookup\n",
"print(word_emb[10])"
]
},
{
"cell_type": "markdown",
"id": "4c980361-d38a-4ef6-bb66-06e0c62c0a8d",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### Let's check the vocabulary"
]
},
{
"cell_type": "markdown",
"id": "3a19e89d-d6c2-49c5-8a8d-d71bd3561a67",
"metadata": {},
"source": [
"- Two important attributes:\n",
" - ```key_to_index``` : maps a word to its vocabulary index\n",
" - ```index_to_key``` : maps a vocabulary index to the corresponding word\n",
"- The vocabulary is usually sorted by decreasing frequency, i.e. more common words have lower indices."
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "fc7eaeac-f570-453c-b9a0-9be2cbf4f05e",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Vocabulary length 3000000\n",
"Index of cat 5947\n",
"Word at position 5947 is cat\n"
]
}
],
"source": [
"print(f\"Vocabulary length {len(word_emb.key_to_index)}\")\n",
"print(f\"Index of cat {word_emb.key_to_index['cat']}\") # from word to index\n",
"print(f\"Word at position 5947 is {word_emb.index_to_key[5947]}\") # from index to word"
]
},
{
"cell_type": "markdown",
"id": "335ab44e-e5b9-4060-bc1b-8cc12ab8878c",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### Compute similarity and distance"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "18c9ff38-21b9-469b-9942-ef71fc1236d7",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"w1 w2 cos_sim cos_dist\n",
"car minivan 0.691 0.309\n",
"car bicycle 0.536 0.464\n",
"car airplane 0.424 0.576\n",
"car cereal 0.139 0.861\n",
"car communism 0.058 0.942\n"
]
}
],
"source": [
"pairs = [\n",
" ('car', 'minivan'), \n",
" ('car', 'bicycle'), \n",
" ('car', 'airplane'), \n",
" ('car', 'cereal'), \n",
" ('car', 'communism'),\n",
"]\n",
"print(\"w1 w2 cos_sim cos_dist\")\n",
"for w1, w2 in pairs:\n",
" print(f\"{w1} {w2} {word_emb.similarity(w1, w2):.3f} {word_emb.distance(w1, w2):.3f}\")\n",
" "
]
},
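{
"cell_type": "markdown",
"id": "cosine-manual-check-md",
"metadata": {},
"source": [
"- A quick sanity check of what ```similarity``` computes, the cosine $\\frac{\\mathbf{e}_{w_1} \\cdot \\mathbf{e}_{w_2}}{\\|\\mathbf{e}_{w_1}\\| \\|\\mathbf{e}_{w_2}\\|}$, with ```distance``` being one minus that (a minimal sketch using numpy):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cosine-manual-check",
"metadata": {},
"outputs": [],
"source": [
"# recompute the first pair's similarity by hand and compare with gensim\n",
"v1, v2 = word_emb['car'], word_emb['minivan']\n",
"cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))\n",
"print(f\"manual: {cos_sim:.3f}, gensim: {word_emb.similarity('car', 'minivan'):.3f}\")"
]
},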
{
"cell_type": "markdown",
"id": "4968ae71-8c50-4f9e-9c09-24147eba5120",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### Nearest Neighbour (NN) Retrieval // Similarity Search"
]
},
{
"cell_type": "markdown",
"id": "7252e54f-35d1-424e-8bc2-a1e5d1b1452d",
"metadata": {},
"source": [
"- gensim has a ```most_similar``` function:\n",
" - however, it does not perform exhaustive nearest-neighbour search\n",
" - given a query word $w_q$ we want to find a ranked list $L_q$ of words in vocabulary $V$\n",
" in decreasing order of cosine similarity\n",
" - e.g. $w_q$ = \"joy\" then $L_q$ = [\"joy\", \"happiness\", \"smile\", ... ]\n",
" - Sometimes we don't want to look for nearest neighbours in the entire vocabulary $V$:\n",
" - we can restrict it to the top-$r$ frequent words to have a much faster search\n",
"- We can write our own function!"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "44ef69c2-0055-486b-8137-7a9cdfc1b38d",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"outputs": [],
"source": [
"def retrieve_most_similar(query_words, all_word_emb, restrict_vocab=10000):\n",
" \n",
" # Step 1: Get full or restricted vocabulary embeddings\n",
" # If restrict_vocab=None then we have exhaustive search, otherwise we restrict the vocab to the most frequent words\n",
" vocab_emb = all_word_emb.vectors[:restrict_vocab+1,:] if restrict_vocab is not None else all_word_emb.vectors # shape: |V_r| x word_emb_size\n",
" \n",
" # Step 2: get the word embeddings for the query words\n",
" query_emb = all_word_emb[query_words] # shape: |Q| x word_emb_size\n",
" \n",
" # Step 3: get cosine similarity between queries and embeddings\n",
" cos_sim = cosine_similarity(query_emb, vocab_emb) # shape: |Q| x |V_r|\n",
" \n",
" # Step 4: sort similarities in desceding orders and get indices of nearest neighbours\n",
" nn = np.argsort(-cos_sim) # shape: |Q| x |V_r|\n",
" \n",
" # Step 5: delete self-similarity, i.e. cos_sim(w,w)=1.0 \n",
" nn_filtered = nn[:, 1:] # remove self_similarity\n",
" \n",
" # Step 6: use the indices to get the words\n",
" nn_words = np.array(word_emb.index_to_key)[nn_filtered]\n",
" \n",
" return nn_words"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "85f691a6-1c91-4546-84be-4ff2439e9c77",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[['kings' 'queen' 'monarch' 'crown_prince' 'prince' 'sultan' 'ruler'\n",
" 'princes' 'throne' 'royal']\n",
" ['queens' 'princess' 'king' 'monarch' 'Queen' 'princesses' 'royal'\n",
" 'prince' 'duchess' 'Queen_Elizabeth_II']\n",
" ['french' 'Italy' 'i' 'haha' 'Cagliari' 'india' 'dont' 'thats' 'mr'\n",
" 'lol']\n",
" ['Italian' 'Sicily' 'Italians' 'ITALY' 'Spain' 'Bologna' 'Italia'\n",
" 'France' 'Milan' 'Romania']\n",
" ['registered_nurse' 'nurses' 'nurse_practitioner' 'midwife' 'Nurse'\n",
" 'nursing' 'doctor' 'medic' 'pharmacist' 'paramedic']]\n"
]
}
],
"source": [
"queries = [\"king\", \"queen\", \"italy\", \"Italy\", \"nurse\"]\n",
"res = retrieve_most_similar(queries, word_emb, restrict_vocab=100000)\n",
"top_k = 10\n",
"res_k = res[:, :top_k]\n",
"del res\n",
"print(res_k)"
]
},
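{
"cell_type": "markdown",
"id": "builtin-most-similar-md",
"metadata": {},
"source": [
"- For comparison, a sketch of the same query with gensim's built-in ```most_similar``` (it also accepts a ```restrict_vocab``` argument and returns ```(word, cosine_similarity)``` pairs):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "builtin-most-similar",
"metadata": {},
"outputs": [],
"source": [
"word_emb.most_similar('king', topn=10, restrict_vocab=100000)"
]
},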
{
"cell_type": "markdown",
"id": "bf58b008-445d-4950-a78d-4ab87fdba266",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### Dimensionality Reduction and Plotting\n",
"\n",
"- We want to plot our word embeddings\n",
"- But they ''live'' in $\\mathbb{R}^{300}$\n",
"- Let's use dimensionality reduction techniques, like PCA"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "24554147-0a3c-4bb5-b2d4-ab6cd75f0d07",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(|Q| x k) x word_emb_size\n",
"(50, 300)\n"
]
}
],
"source": [
"all_res_words = res_k.flatten()\n",
"res_word_emb = word_emb[all_res_words]\n",
"print(\"(|Q| x k) x word_emb_size\")\n",
"print(res_word_emb.shape)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "0add5d0d-ed1f-419c-a66d-b906ac96c744",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"outputs": [],
"source": [
"pca = PCA(n_components=3) #Perform 3d-PCA\n",
"word_emb_pca = pca.fit_transform(res_word_emb)"
]
},
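{
"cell_type": "markdown",
"id": "pca-variance-md",
"metadata": {},
"source": [
"- A quick check of how much of the original variance the 3 components retain, via sklearn's ```explained_variance_ratio_```:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "pca-variance",
"metadata": {},
"outputs": [],
"source": [
"print(pca.explained_variance_ratio_) # fraction of variance captured by each component\n",
"print(pca.explained_variance_ratio_.sum()) # total variance retained in 3D"
]
},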
{
"cell_type": "code",
"execution_count": 11,
"id": "806c5683-92a4-413f-b616-cbd85826852c",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" pca_x pca_y pca_z word query\n",
"0 -0.951780 -0.588461 0.546893 kings king\n",
"1 -1.366599 -0.059902 0.103550 queen king\n",
"2 -2.038808 -0.398816 -0.404128 monarch king\n",
"3 -1.730922 -0.289503 -0.157777 crown_prince king\n",
"4 -1.596841 -0.419770 -0.252166 prince king\n"
]
}
],
"source": [
"pca_df = pd.DataFrame(word_emb_pca, columns=[\"pca_x\", \"pca_y\", \"pca_z\"])\n",
"\n",
"pca_df[\"word\"] = res_k.flatten()\n",
"\n",
"labels = np.array([queries]).repeat(top_k)\n",
"pca_df[\"query\"] = labels\n",
"\n",
"print(pca_df.head())"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "d05b1a05-8dea-4e25-a469-96499d4ffeda",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.plotly.v1+json": {
"config": {
"plotlyServerURL": "https://plot.ly"
},
"data": [
{
"hovertemplate": "query=king
pca_x=%{x}
pca_y=%{y}
pca_z=%{z}
word=%{text}
pca_x=%{x}
pca_y=%{y}
pca_z=%{z}
word=%{text}
pca_x=%{x}
pca_y=%{y}
pca_z=%{z}
word=%{text}
pca_x=%{x}
pca_y=%{y}
pca_z=%{z}
word=%{text}
pca_x=%{x}
pca_y=%{y}
pca_z=%{z}
word=%{text}
\n", " | Word 1 | \n", "Word 2 | \n", "Human (mean) | \n", "
---|---|---|---|
111 | \n", "tiger | \n", "animal | \n", "7.00 | \n", "
317 | \n", "possibility | \n", "girl | \n", "1.94 | \n", "
101 | \n", "money | \n", "possession | \n", "7.29 | \n", "
72 | \n", "magician | \n", "wizard | \n", "9.02 | \n", "
54 | \n", "physics | \n", "proton | \n", "8.12 | \n", "
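{
"cell_type": "markdown",
"id": "intrinsic-eval-sketch-md",
"metadata": {},
"source": [
"- A minimal evaluation sketch, assuming gensim's bundled benchmark files (```wordsim353.tsv``` and ```questions-words.txt``` ship with gensim's test data and are reachable via ```datapath```):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "intrinsic-eval-sketch",
"metadata": {},
"outputs": [],
"source": [
"# Word similarity: correlate model similarities with human judgements\n",
"pearson, spearman, oov_ratio = word_emb.evaluate_word_pairs(datapath('wordsim353.tsv'))\n",
"print(f\"Pearson r = {pearson[0]:.3f}, Spearman rho = {spearman[0]:.3f}, OOV = {oov_ratio:.1f}%\")\n",
"\n",
"# Word analogies: accuracy on questions like man : king :: woman : ?\n",
"analogy_score, sections = word_emb.evaluate_word_analogies(datapath('questions-words.txt'))\n",
"print(f\"Analogy accuracy = {analogy_score:.3f}\")"
]
},
{
"cell_type": "markdown",
"id": "part2-intro-md",
"metadata": {},
"source": [
"## Part 2: Pretraining your **own** embeddings\n",
"\n",
"- A minimal training sketch, assuming the small ```text8``` corpus from ```gensim.downloader```; the training choices (```vector_size```, ```window```, ```min_count```, ```sg```) are the knobs worth experimenting with:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "part2-training-sketch",
"metadata": {},
"outputs": [],
"source": [
"corpus = api.load('text8') # an iterable of tokenised sentences\n",
"\n",
"my_model = Word2Vec(\n",
"    sentences=corpus,\n",
"    vector_size=100, # embedding dimension\n",
"    window=5,        # context window size\n",
"    min_count=5,     # ignore words rarer than this\n",
"    sg=1,            # 1 = skip-gram, 0 = CBOW\n",
"    workers=4,\n",
")\n",
"\n",
"# Saving and loading: keep just the KeyedVectors if you don't need to train further\n",
"my_model.wv.save('my_word2vec.kv')\n",
"my_emb = KeyedVectors.load('my_word2vec.kv')\n",
"print(my_emb.most_similar('king', topn=5))"
]
},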
\n", " | label | \n", "text | \n", "
---|---|---|
0 | \n", "ham | \n", "Go until jurong point, crazy.. Available only ... | \n", "
1 | \n", "ham | \n", "Ok lar... Joking wif u oni... | \n", "
2 | \n", "spam | \n", "Free entry in 2 a wkly comp to win FA Cup fina... | \n", "
3 | \n", "ham | \n", "U dun say so early hor... U c already then say... | \n", "
4 | \n", "ham | \n", "Nah I don't think he goes to usf, he lives aro... | \n", "
... | \n", "... | \n", "... | \n", "
5567 | \n", "spam | \n", "This is the 2nd time we have tried 2 contact u... | \n", "
5568 | \n", "ham | \n", "Will ü b going to esplanade fr home? | \n", "
5569 | \n", "ham | \n", "Pity, * was in mood for that. So...any other s... | \n", "
5570 | \n", "ham | \n", "The guy did some bitching but I acted like i'd... | \n", "
5571 | \n", "ham | \n", "Rofl. Its true to its name | \n", "
5572 rows × 2 columns
\n", "\n", " | label | \n", "text | \n", "
---|---|---|
0 | \n", "0 | \n", "Go until jurong point, crazy.. Available only ... | \n", "
1 | \n", "0 | \n", "Ok lar... Joking wif u oni... | \n", "
2 | \n", "1 | \n", "Free entry in 2 a wkly comp to win FA Cup fina... | \n", "
3 | \n", "0 | \n", "U dun say so early hor... U c already then say... | \n", "
4 | \n", "0 | \n", "Nah I don't think he goes to usf, he lives aro... | \n", "
... | \n", "... | \n", "... | \n", "
5567 | \n", "1 | \n", "This is the 2nd time we have tried 2 contact u... | \n", "
5568 | \n", "0 | \n", "Will ü b going to esplanade fr home? | \n", "
5569 | \n", "0 | \n", "Pity, * was in mood for that. So...any other s... | \n", "
5570 | \n", "0 | \n", "The guy did some bitching but I acted like i'd... | \n", "
5571 | \n", "0 | \n", "Rofl. Its true to its name | \n", "
5572 rows × 2 columns
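{
"cell_type": "markdown",
"id": "label-encode-sketch-md",
"metadata": {},
"source": [
"- A sketch of the encoding step, assuming the dataframe is called ```df```; ```LabelEncoder``` assigns indices alphabetically, hence ham → 0 and spam → 1:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "label-encode-sketch",
"metadata": {},
"outputs": [],
"source": [
"le = LabelEncoder()\n",
"df['label'] = le.fit_transform(df['label']) # ham -> 0, spam -> 1\n",
"print(le.classes_) # ['ham' 'spam']"
]
},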
\n", "\n", " | label | \n", "text | \n", "preprocessed_text | \n", "
---|---|---|---|
0 | \n", "0 | \n", "Go until jurong point, crazy.. Available only ... | \n", "[go, jurong, point, crazy, available, bugis, g... | \n", "
1 | \n", "0 | \n", "Ok lar... Joking wif u oni... | \n", "[ok, lar, joking, wif, oni] | \n", "
2 | \n", "1 | \n", "Free entry in 2 a wkly comp to win FA Cup fina... | \n", "[free, entry, wkly, comp, win, fa, cup, final,... | \n", "
3 | \n", "0 | \n", "U dun say so early hor... U c already then say... | \n", "[dun, say, early, hor, already, say] | \n", "
4 | \n", "0 | \n", "Nah I don't think he goes to usf, he lives aro... | \n", "[nah, think, goes, usf, lives, around, though] | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "
5567 | \n", "1 | \n", "This is the 2nd time we have tried 2 contact u... | \n", "[nd, time, tried, contact, pound, prize, claim... | \n", "
5568 | \n", "0 | \n", "Will ü b going to esplanade fr home? | \n", "[going, esplanade, fr, home] | \n", "
5569 | \n", "0 | \n", "Pity, * was in mood for that. So...any other s... | \n", "[pity, mood, suggestions] | \n", "
5570 | \n", "0 | \n", "The guy did some bitching but I acted like i'd... | \n", "[guy, bitching, acted, like, interested, buyin... | \n", "
5571 | \n", "0 | \n", "Rofl. Its true to its name | \n", "[rofl, true, name] | \n", "
5572 rows × 3 columns
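{
"cell_type": "markdown",
"id": "preprocess-sketch-md",
"metadata": {},
"source": [
"- A preprocessing sketch consistent with the output above, assuming ```gensim.utils.simple_preprocess``` (lowercases, tokenises, drops punctuation and digits) followed by NLTK English stopword removal:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "preprocess-sketch",
"metadata": {},
"outputs": [],
"source": [
"nltk.download('stopwords', quiet=True)\n",
"stop_words = set(stopwords.words('english'))\n",
"\n",
"def preprocess(text):\n",
"    # tokenise and lowercase, then drop stopwords\n",
"    return [tok for tok in utils.simple_preprocess(text) if tok not in stop_words]\n",
"\n",
"df['preprocessed_text'] = df['text'].apply(preprocess)"
]
},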
\n", "LogisticRegressionCV(cv=5, max_iter=1000)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LogisticRegressionCV(cv=5, max_iter=1000)
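{
"cell_type": "markdown",
"id": "spam-clf-sketch-md",
"metadata": {},
"source": [
"- A minimal end-to-end sketch, assuming each message is embedded by mean-pooling the word2vec vectors of its tokens (out-of-vocabulary tokens are skipped, empty messages get a zero vector):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "spam-clf-sketch",
"metadata": {},
"outputs": [],
"source": [
"def embed_tokens(tokens, emb):\n",
"    # average the embeddings of the in-vocabulary tokens\n",
"    vecs = [emb[tok] for tok in tokens if tok in emb.key_to_index]\n",
"    return np.mean(vecs, axis=0) if vecs else np.zeros(emb.vector_size)\n",
"\n",
"X = np.stack(df['preprocessed_text'].apply(lambda toks: embed_tokens(toks, word_emb)))\n",
"y = df['label'].values\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)\n",
"\n",
"clf = LogisticRegressionCV(cv=5, max_iter=1000).fit(X_train, y_train)\n",
"print(classification_report(y_test, clf.predict(X_test)))"
]
},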