{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "eb9ee8d4-afdd-400a-bae0-03ca972a2559",
      "metadata": {
        "tags": [],
        "id": "eb9ee8d4-afdd-400a-bae0-03ca972a2559"
      },
      "source": [
        "#Static Word Embeddings"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "71725826-1cd3-4a89-9471-509ccb926af9",
      "metadata": {
        "tags": [],
        "id": "71725826-1cd3-4a89-9471-509ccb926af9"
      },
      "source": [
        "## Static Word Embeddings\n",
        "In class you discussed models like **GloVe**, **Word2Vec** and **fastText** that, when trained on a corpus, produce a vector representation of each word based on the contexts in which it appears. These models produce what are called static embeddings: a single vector per word, which is a weighted average of all the meanings that the word can have.\n",
        "\n",
        "For example, the word **right** can have different meanings:\n",
        "\n",
        "1.   It looks **right** -> right as in correct\n",
        "2.   The phone is on your **right** -> right as in position\n",
        "\n",
        "But we will get a single vector for the word **right**, which includes both meanings.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Useful NLP libraries**\n",
        "\n",
        "There are many useful Python NLP libraries:\n",
        "\n",
        "\n",
        "* [NLTK](https://www.nltk.org), the Natural Language Toolkit, a platform with many resources for building Python programs and models that work with text. It provides datasets and libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.\n",
        "* [SpaCy](https://spacy.io), a library for building models, chatbots, and programs for document and text analysis (part-of-speech tagging, dependency parsing, text categorization, named entity recognition).\n",
        "* [HuggingFace 🤗](https://huggingface.co), a machine learning and data science platform for building, training, and deploying machine learning models.\n",
        "* [Gensim](https://radimrehurek.com/gensim/auto_examples/index.html#other-resources), a library for topic modeling, information retrieval and other natural language processing tasks. It also offers models for training word vectors, such as Word2Vec and FastText.\n",
        "\n",
        "In this lab we will work with **Gensim**, **NLTK** and **SpaCy**.\n"
      ],
      "metadata": {
        "id": "VvdXju9Iw_K9"
      },
      "id": "VvdXju9Iw_K9"
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Preparing the environment"
      ],
      "metadata": {
        "id": "ql91he8j53OK"
      },
      "id": "ql91he8j53OK"
    },
    {
      "cell_type": "code",
      "source": [
        "# to begin with we have to first install the library\n",
        "!pip install gensim\n",
        "!pip install networkx[default]"
      ],
      "metadata": {
        "id": "JP-wTMwYTIbY"
      },
      "id": "JP-wTMwYTIbY",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Since some packages were re-installed, we have to restart the session, otherwise we will get errors.\n",
        "This can be done either by going to **Runtime** in the menu and selecting **Restart session**, or by running the code in the next cell.\n"
      ],
      "metadata": {
        "id": "hn4A65U3kdaP"
      },
      "id": "hn4A65U3kdaP"
    },
    {
      "cell_type": "code",
      "source": [
        "# restart session\n",
        "import os\n",
        "#os.kill(os.getpid(), 9)"
      ],
      "metadata": {
        "id": "gWPKipYLWT3f"
      },
      "id": "gWPKipYLWT3f",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "We can now proceed with importing the libraries"
      ],
      "metadata": {
        "id": "PKCK3wKg6gsD"
      },
      "id": "PKCK3wKg6gsD"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "1ebdec4e",
      "metadata": {
        "tags": [],
        "id": "1ebdec4e"
      },
      "outputs": [],
      "source": [
        "import gensim\n",
        "# gensim.downloader is an API for downloading, getting information and loading datasets and models\n",
        "import gensim.downloader as api"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "And download the additional datasets that we're going to use"
      ],
      "metadata": {
        "id": "fJXGzKigpsnW"
      },
      "id": "fJXGzKigpsnW"
    },
    {
      "cell_type": "code",
      "source": [
        "!git clone https://github.com/elsartori/Lab_2_datasets.git"
      ],
      "metadata": {
        "id": "XWaYR-1Up2CG"
      },
      "id": "XWaYR-1Up2CG",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "id": "ee260e2b-1403-4be2-af68-86f63869a5b1",
      "metadata": {
        "tags": [],
        "id": "ee260e2b-1403-4be2-af68-86f63869a5b1"
      },
      "source": [
        "## Pre-trained embeddings\n",
        "It is not always necessary to create word embeddings from scratch: we can use pre-trained embeddings, i.e. embeddings that have already been created and that we just need to load.\n",
        "Using pre-trained embeddings allows us to work with embeddings without the need to 1) find a corpus and 2) pre-process it and train a word vector model.\n"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# let's explore which corpora and pre-trained models Gensim offers\n",
        "api.info()"
      ],
      "metadata": {
        "id": "pnwRvAoCrmNZ"
      },
      "id": "pnwRvAoCrmNZ",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The information is stored in a long dictionary of dictionaries and is quite hard to read. We can choose to display only part of it."
      ],
      "metadata": {
        "id": "JqOVYEpoCa2-"
      },
      "id": "JqOVYEpoCa2-"
    },
    {
      "cell_type": "code",
      "source": [
        "# check primary keys of dictionary\n",
        "api.info().keys()"
      ],
      "metadata": {
        "id": "fNwxb9EOY25I"
      },
      "id": "fNwxb9EOY25I",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# let's see which corpora are present\n",
        "api.info()['corpora'].keys()"
      ],
      "metadata": {
        "id": "KKZQWJEq7Nf3"
      },
      "id": "KKZQWJEq7Nf3",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# let's visualize more information about the corpus \"text8\"\n",
        "api.info('text8')"
      ],
      "metadata": {
        "id": "DEPBtPHEZWAq"
      },
      "id": "DEPBtPHEZWAq",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Pre-trained embeddings are under the key **models**."
      ],
      "metadata": {
        "id": "Fca5PLJEQpHB"
      },
      "id": "Fca5PLJEQpHB"
    },
    {
      "cell_type": "code",
      "source": [
        "# let's see which pre-trained embeddings are present\n",
        "api.info()['models'].keys()"
      ],
      "metadata": {
        "id": "cOjGQ5SRZZUq"
      },
      "id": "cOjGQ5SRZZUq",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# let's visualize more information about word vectors 'glove-wiki-gigaword-300'\n",
        "api.info('glove-wiki-gigaword-300')"
      ],
      "metadata": {
        "id": "2MXy8HNJpDPR"
      },
      "id": "2MXy8HNJpDPR",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a2ce6ee4-c4c5-4782-b404-c15f3ba0b1ad",
      "metadata": {
        "tags": [],
        "id": "a2ce6ee4-c4c5-4782-b404-c15f3ba0b1ad"
      },
      "outputs": [],
      "source": [
        "# we load the embeddings\n",
        "embeddings = api.load('glove-wiki-gigaword-300')"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings"
      ],
      "metadata": {
        "id": "JXInUNh1a6q4"
      },
      "id": "JXInUNh1a6q4",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "id": "941bbbf7-0af0-43b9-994f-43186b2c0bf5",
      "metadata": {
        "id": "941bbbf7-0af0-43b9-994f-43186b2c0bf5"
      },
      "source": [
        "We are loading a structure called [KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html) and not a model. KeyedVectors do not support additional training but require less RAM than a full model. It is essentially a mapping between keys, in this case **words**, and their **vectors**.\n",
        "\n",
        "There are two properties to access those mappings:\n",
        "* ```key_to_index```, which returns a dictionary mapping each word in the vocabulary to its index.\n",
        "* ```index_to_key```, which returns a list mapping each vocabulary index to the corresponding word.\n"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# the dictionary is usually sorted by index. The lower the index, the more common the word was in the corpus.\n",
        "embeddings.key_to_index"
      ],
      "metadata": {
        "id": "eXRsGLPbqOhD"
      },
      "id": "eXRsGLPbqOhD",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# we can check if a word is present in the vocabulary\n",
        "print(\"people\" in embeddings)\n",
        "# we can also print its shape. The length of our vector is 300\n",
        "print(embeddings['people'].shape)\n",
        "# we can check the index of a single word\n",
        "print(embeddings.key_to_index['people'])\n",
        "# we can check the word associated to a specific index\n",
        "print(embeddings.index_to_key[69])"
      ],
      "metadata": {
        "id": "NPH3CfXzzVKz"
      },
      "id": "NPH3CfXzzVKz",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "We can access vectors using either a key or an index."
      ],
      "metadata": {
        "id": "I04V7R9tLcIv"
      },
      "id": "I04V7R9tLcIv"
    },
    {
      "cell_type": "code",
      "source": [
        "# let's check the first 30 positions of the vector.\n",
        "embeddings['people'][0:30]"
      ],
      "metadata": {
        "id": "W6NVYdubtx24"
      },
      "id": "W6NVYdubtx24",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings[69][0:30]"
      ],
      "metadata": {
        "id": "BmPY5btyeDNv"
      },
      "id": "BmPY5btyeDNv",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "As you can see, the vector is stored as a NumPy array. [NumPy](https://numpy.org/doc/stable/) is a library for working with arrays. We can use it to check whether the two vectors are equal."
      ],
      "metadata": {
        "id": "Mw1NX5srPlku"
      },
      "id": "Mw1NX5srPlku"
    },
    {
      "cell_type": "code",
      "source": [
        "# import library\n",
        "import numpy as np\n",
        "# we check if two arrays have the same shape and elements\n",
        "np.array_equal(embeddings['people'],embeddings[69])"
      ],
      "metadata": {
        "id": "NroIskxELjEo"
      },
      "id": "NroIskxELjEo",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The values are the same!\n",
        "We are not limited to a single word: we can also extract multiple embeddings at once."
      ],
      "metadata": {
        "id": "rgkgkdZ4LVh_"
      },
      "id": "rgkgkdZ4LVh_"
    },
    {
      "cell_type": "code",
      "source": [
        "multiple_embeddings = embeddings['people','men', 'women']\n",
        "#check the array shape: it has 2 dimensions (n_words * vector_length)\n",
        "print(multiple_embeddings.shape)\n",
        "#the order of extraction is the same as the order of the words\n",
        "print(np.array_equal(embeddings['people'], multiple_embeddings[0]))\n",
        "print(np.array_equal(embeddings['men'], multiple_embeddings[1]))\n",
        "print(np.array_equal(embeddings['women'], multiple_embeddings[2]))"
      ],
      "metadata": {
        "id": "sucdKEQoVpYK"
      },
      "id": "sucdKEQoVpYK",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "We can check how similar two vectors are using the cosine similarity.\n",
        "Gensim offers functions to do just that:\n",
        "\n",
        "\n",
        "*   [```similarity```](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.similarity) computes the cosine similarity between two words in the vocabulary.\n",
        "*   [```cosine_similarities```](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.cosine_similarities) works on vectors, and computes the cosine similarities between one vector and an array of other vectors.\n",
        "\n",
        "Alternatively, it is possible to use the sklearn function [`cosine_similarity`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html)."
      ],
      "metadata": {
        "id": "OTtu3E5vkmpo"
      },
      "id": "OTtu3E5vkmpo"
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.metrics.pairwise import cosine_similarity\n",
        "#single words\n",
        "print(embeddings.similarity(\"women\", \"men\"))\n",
        "print(embeddings.similarity(\"people\", \"men\"))\n",
        "print(embeddings.similarity(\"people\", \"women\"))\n",
        "print(embeddings.similarity(\"people\", \"children\"))\n",
        "print(cosine_similarity([embeddings[\"people\"]], embeddings[[\"children\"]]))\n",
        "#multiple words\n",
        "print(embeddings.cosine_similarities(embeddings[\"people\"], embeddings[\"men\", \"women\", \"children\"]))\n",
        "cosine_similarity(embeddings[\"people\",\"men\"], embeddings[\"men\", \"women\", \"children\"])"
      ],
      "metadata": {
        "id": "-8XtXSQPvLe0"
      },
      "id": "-8XtXSQPvLe0",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "#the output is an array, but we can convert it into a list\n",
        "print(cosine_similarity(embeddings[\"people\", \"men\"], embeddings[\"men\", \"women\", \"children\"]).tolist())"
      ],
      "metadata": {
        "id": "jQJHQYQhxX2a"
      },
      "id": "jQJHQYQhxX2a",
      "execution_count": null,
      "outputs": []
    },
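    {
      "cell_type": "markdown",
      "source": [
        "Under the hood, the cosine similarity is just the normalized dot product, $\\cos(\\mathbf{u}, \\mathbf{v}) = \\frac{\\mathbf{u} \\cdot \\mathbf{v}}{\\|\\mathbf{u}\\| \\|\\mathbf{v}\\|}$. As a quick sanity check, here is a minimal sketch that computes it by hand with NumPy and compares the result with Gensim's ```similarity```."
      ],
      "metadata": {
        "id": "cosManualMd01"
      },
      "id": "cosManualMd01"
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "# cosine similarity by hand: dot product divided by the product of the norms\n",
        "u, v = embeddings[\"women\"], embeddings[\"men\"]\n",
        "manual = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))\n",
        "print(manual)\n",
        "# it should match Gensim's built-in similarity\n",
        "print(np.isclose(manual, embeddings.similarity(\"women\", \"men\")))"
      ],
      "metadata": {
        "id": "cosManualCode01"
      },
      "id": "cosManualCode01",
      "execution_count": null,
      "outputs": []
    },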
    {
      "cell_type": "markdown",
      "source": [
        "It can also be useful to check for possible biases learned from the training corpus."
      ],
      "metadata": {
        "id": "9IJI03Y5lIy2"
      },
      "id": "9IJI03Y5lIy2"
    },
    {
      "cell_type": "code",
      "source": [
        "print(embeddings.cosine_similarities(embeddings[\"soldier\"], embeddings[\"men\",\"women\"]))"
      ],
      "metadata": {
        "id": "bQc2rh1PiTq7"
      },
      "id": "bQc2rh1PiTq7",
      "execution_count": null,
      "outputs": []
    },
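    {
      "cell_type": "markdown",
      "source": [
        "This also illustrates the point from the introduction: since **right** gets a single static vector, that vector is similar to words from both of its senses (a rough check; the exact values depend on the embeddings)."
      ],
      "metadata": {
        "id": "rightSensesMd01"
      },
      "id": "rightSensesMd01"
    },
    {
      "cell_type": "code",
      "source": [
        "# one static vector for \"right\" mixes both of its senses\n",
        "print(embeddings.cosine_similarities(embeddings[\"right\"], embeddings[\"correct\", \"left\"]))"
      ],
      "metadata": {
        "id": "rightSensesCode01"
      },
      "id": "rightSensesCode01",
      "execution_count": null,
      "outputs": []
    },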
    {
      "cell_type": "markdown",
      "source": [
        "We can also check how dissimilar two vectors are using the cosine distance. This is equal to ```1 - cosine_similarity```.\n",
        "\n",
        "There are specific Gensim functions to achieve that:\n",
        "\n",
        "\n",
        "*   [```distance```](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.distance) computes the cosine distance between two words in the vocabulary.\n",
        "*   [```distances```](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.distances) computes the cosine distances between one word and a list of other words.\n",
        "\n",
        "Alternatively, it is possible to use the sklearn function [`cosine_distances`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html)."
      ],
      "metadata": {
        "id": "jLs1IPsAlRdE"
      },
      "id": "jLs1IPsAlRdE"
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.metrics.pairwise import cosine_distances\n",
        "#single words\n",
        "print(embeddings.distance(\"people\", \"men\"))\n",
        "print(embeddings.distance(\"people\", \"women\"))\n",
        "print(embeddings.distance(\"people\", \"children\"))\n",
        "print(cosine_distances([embeddings[\"people\"]], [embeddings[\"children\"]]))\n",
        "#multiple words\n",
        "print(embeddings.distances(\"people\", [\"men\",\"women\",\"children\"]))\n",
        "print(cosine_distances([embeddings[\"people\"]], embeddings[\"men\", \"women\", \"children\"]))"
      ],
      "metadata": {
        "id": "bzBcssQRvdqx"
      },
      "id": "bzBcssQRvdqx",
      "execution_count": null,
      "outputs": []
    },
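    {
      "cell_type": "markdown",
      "source": [
        "Since the cosine distance is defined as ```1 - cosine_similarity```, we can quickly verify that ```distance``` and ```similarity``` agree (a small sanity check):"
      ],
      "metadata": {
        "id": "distSimCheckMd01"
      },
      "id": "distSimCheckMd01"
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "# distance and similarity for the same pair should sum to 1\n",
        "d = embeddings.distance(\"people\", \"men\")\n",
        "s = embeddings.similarity(\"people\", \"men\")\n",
        "print(d, s, np.isclose(d + s, 1.0))"
      ],
      "metadata": {
        "id": "distSimCheckCode01"
      },
      "id": "distSimCheckCode01",
      "execution_count": null,
      "outputs": []
    },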
    {
      "cell_type": "markdown",
      "id": "fad14c49-be5f-43cb-bf7e-6b369e84158e",
      "metadata": {
        "tags": [],
        "id": "fad14c49-be5f-43cb-bf7e-6b369e84158e"
      },
      "source": [
        "Another useful Gensim function is [`most_similar`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar), that finds the top-n most similar keys.\n",
        "The most important arguments are:\n",
        "* ```positive [str or list]```: required, a word or list of words that contribute positively (they are summed).\n",
        "* ```negative [str or list]```: optional, a word or list of words that contribute negatively (they are subtracted).\n",
        "* ```topn [int]```: optional, returns only the n most similar words (the default is 10).\n",
        "* ```restrict_vocab [int]```: optional, restricts the search to the first n vectors in vocabulary order (i.e. the most frequent words)."
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "words_of_interest = [\"people\", \"men\", \"women\", \"children\"]\n",
        "for word in words_of_interest:\n",
        "  print(\"Most similar words to {}:\".format(word), embeddings.most_similar(positive=word))\n",
        "  #it is not necessary to write 'positive' explicitly: embeddings.most_similar(word) works too"
      ],
      "metadata": {
        "id": "jdOvP6LzvpLu"
      },
      "id": "jdOvP6LzvpLu",
      "execution_count": null,
      "outputs": []
    },
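    {
      "cell_type": "markdown",
      "source": [
        "A quick sketch of two of the optional arguments: ```topn``` limits how many neighbours are returned, while ```restrict_vocab``` searches only among the most frequent words."
      ],
      "metadata": {
        "id": "mostSimilarArgsMd01"
      },
      "id": "mostSimilarArgsMd01"
    },
    {
      "cell_type": "code",
      "source": [
        "# return only the 3 nearest neighbours\n",
        "print(embeddings.most_similar(\"people\", topn=3))\n",
        "# search only among the 10000 most frequent words\n",
        "print(embeddings.most_similar(\"people\", topn=3, restrict_vocab=10000))"
      ],
      "metadata": {
        "id": "mostSimilarArgsCode01"
      },
      "id": "mostSimilarArgsCode01",
      "execution_count": null,
      "outputs": []
    },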
    {
      "cell_type": "markdown",
      "source": [
        "##Embedding visualization\n",
        "In some cases we would like to visualize the embeddings we are working with. For example, we may want to verify that each word in **words_of_interest** is close to its most similar embeddings, but the word vectors we chose have 300 dimensions!\n",
        "\n",
        "In these cases, we can use dimensionality reduction techniques, such as PCA, to map the data into a lower-dimensional space."
      ],
      "metadata": {
        "id": "FlVGJlGZ899E"
      },
      "id": "FlVGJlGZ899E"
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.decomposition import PCA\n",
        "#pandas is a library for data analysis and manipulation, used especially for tabular data (like an Excel spreadsheet)\n",
        "import pandas as pd\n",
        "import numpy as np\n",
        "import plotly.express as px\n",
        "\n",
        "words_of_interest = [\"people\", \"men\", \"women\", \"children\"]\n",
        "#we make a single list with the top-10 most similar words to each word in words_of_interest\n",
        "clusters=[]\n",
        "for word in words_of_interest:\n",
        "  clusters.extend([k for k,v in embeddings.most_similar(word)])\n",
        "\n",
        "#we perform PCA on the embeddings of the top-10 words\n",
        "pca = PCA(n_components=3) #Perform 3d-PCA\n",
        "embeddings_pca = pca.fit_transform(embeddings[clusters])\n",
        "\n",
        "#we save the results of PCA in a dataframe (tabular format)\n",
        "pca_df = pd.DataFrame(embeddings_pca, columns=[\"pca_x\", \"pca_y\", \"pca_z\"])\n",
        "#we add a column 'word' with the top-10 word corresponding to each row of coordinates\n",
        "pca_df[\"word\"] = clusters\n",
        "#we add a column 'word_of_interest' with the word in words_of_interest that each row belongs to\n",
        "labels = np.array(words_of_interest).repeat(10)\n",
        "pca_df[\"word_of_interest\"] = labels\n",
        "\n",
        "#we visualize the first 10 rows of the dataframe\n",
        "print(pca_df.head(10))\n",
        "#we create a plot\n",
        "px.scatter_3d(pca_df, x='pca_x', y='pca_y', z='pca_z', color=\"word_of_interest\", text=\"word\", opacity=0.7, title=\"3d-PCA representation of word embeddings\")"
      ],
      "metadata": {
        "id": "-P0eLVkg5A9o"
      },
      "id": "-P0eLVkg5A9o",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "id": "7c76ab53-70dc-4324-97ca-358b4d962f5e",
      "metadata": {
        "tags": [],
        "id": "7c76ab53-70dc-4324-97ca-358b4d962f5e"
      },
      "source": [
        "## Word embedding evaluation\n",
        "\n",
        "There are two main types of evaluation for word embeddings:\n",
        "   \n",
        "   * **intrinsic evaluation**: where we evaluate word embeddings in the embedding space itself\n",
        "        * word similarity benchmarks\n",
        "        * word analogy benchmarks\n",
        "   * **extrinsic evaluation**: where we evaluate word embeddings on a downstream task, like text classification.\n",
        "\n",
        "\n",
        "We will see how intrinsic evaluation works."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "bc5e5ef0-9a9b-4952-9c74-26130634ae7b",
      "metadata": {
        "tags": [],
        "id": "bc5e5ef0-9a9b-4952-9c74-26130634ae7b"
      },
      "source": [
        "###Word Similarity benchmarks\n",
        "\n",
        "Word similarity benchmarks are datasets that contain pairs of words, each with a similarity score assigned by human judges. The higher the correlation between the human similarity scores and the actual similarity between the corresponding embeddings, the better the embeddings are.\n",
        "\n",
        "\n",
        "One of these datasets is [MEN](https://staff.fnwi.uva.nl/e.bruni/MEN), which contains a total of 3000 scored pairs.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a006fbad-3740-40ea-b2e9-9a85c92ef32e",
      "metadata": {
        "tags": [],
        "id": "a006fbad-3740-40ea-b2e9-9a85c92ef32e"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "# we read MEN as a dataframe\n",
        "ws_df = pd.read_csv('/content/Lab_2_datasets/men.csv', sep=\",\")\n",
        "# let's see a portion of the dataset\n",
        "ws_df.sample(10)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8a2d8328-745e-4afe-b192-3b71846416ca",
      "metadata": {
        "tags": [],
        "id": "8a2d8328-745e-4afe-b192-3b71846416ca"
      },
      "source": [
        "To evaluate word embeddings on a word similarity benchmark, we need to do these steps:\n",
        "\n",
        "1. For each pair of words $(w_{i_{1}}, w_{i_{2}})$ in the benchmark, we compute the cosine similarity between its word embeddings $\\cos(\\mathbf{e}_{w_{i_{1}}}, \\mathbf{e}_{w_{i_{2}}})$\n",
        "2. We compute a correlation score (like [Pearson's $r$](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) or [Spearman's $\\rho$](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)) between the human given scores and the cosine similarities between the embeddings\n",
        "    - the higher the score, the better!\n",
        "\n",
        "Gensim has a function that does all of this: [```evaluate_word_pairs```](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.evaluate_word_pairs), which computes the correlation between the model's similarities and the human judgments in a dataset."
      ]
    },
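    {
      "cell_type": "markdown",
      "source": [
        "Before calling the built-in function, here is a minimal sketch of the two steps above, computed by hand on a few illustrative pairs (the words and scores below are made up, not taken from MEN):"
      ],
      "metadata": {
        "id": "manualWsEvalMd01"
      },
      "id": "manualWsEvalMd01"
    },
    {
      "cell_type": "code",
      "source": [
        "from scipy.stats import spearmanr\n",
        "\n",
        "# illustrative (word1, word2, human score) triples\n",
        "pairs = [(\"car\", \"automobile\", 9.0), (\"car\", \"tree\", 3.0), (\"sun\", \"moon\", 6.5)]\n",
        "human_scores = [score for _, _, score in pairs]\n",
        "# step 1: cosine similarity between the embeddings of each pair\n",
        "model_scores = [embeddings.similarity(w1, w2) for w1, w2, _ in pairs]\n",
        "# step 2: Spearman's rho between human judgments and model similarities\n",
        "rho, p_value = spearmanr(human_scores, model_scores)\n",
        "print(\"Spearman's rho:\", rho)"
      ],
      "metadata": {
        "id": "manualWsEvalCode01"
      },
      "id": "manualWsEvalCode01",
      "execution_count": null,
      "outputs": []
    },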
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c827290d-3880-4919-b3f7-a87734155f92",
      "metadata": {
        "tags": [],
        "id": "c827290d-3880-4919-b3f7-a87734155f92"
      },
      "outputs": [],
      "source": [
        "embeddings.evaluate_word_pairs('/content/Lab_2_datasets/men.csv', delimiter=',')"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "0cdce175-12e1-4eaf-acac-1dfadc83ac1c",
      "metadata": {
        "tags": [],
        "id": "0cdce175-12e1-4eaf-acac-1dfadc83ac1c"
      },
      "source": [
        "### Word Analogy benchmarks\n",
        "\n",
        "Word analogy benchmarks are datasets that contain groups of four words that are related in some way. The evaluation of the embeddings is based on the idea that these relations can be predicted by arithmetic operations in the vector space.\n",
        "\n",
        "Letting $\\mathbf{e}_x$ be the embedding of word $x$, here are some examples of the relations found in these datasets and how they are verified:\n",
        "\n",
        "*    man : king = woman : queen -> $\\mathbf{e}_{king} - \\mathbf{e}_{man} + \\mathbf{e}_{woman} = \\mathbf{e}_x$ where $x$ is checked to be $queen$.\n",
        "*    miss : woman = mr : man -> $\\mathbf{e}_{woman} - \\mathbf{e}_{miss} + \\mathbf{e}_{mr} = \\mathbf{e}_x$ where $x=man$\n",
        "*    athens : greece = madrid : spain -> $\\mathbf{e}_{greece} - \\mathbf{e}_{athens} + \\mathbf{e}_{madrid} = \\mathbf{e}_x$ where $x=spain$\n",
        "\n",
        "\n",
        "To verify these relations we can use Gensim's `most_similar` function, which we already saw. The word embeddings that we sum together are those that **contribute positively**; the ones that we subtract have a **negative contribution**."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "18d28a9c-7e6d-460d-82ea-0e3c9d3df32b",
      "metadata": {
        "tags": [],
        "id": "18d28a9c-7e6d-460d-82ea-0e3c9d3df32b"
      },
      "outputs": [],
      "source": [
        "print(embeddings.most_similar(positive=[\"king\", \"woman\"], negative=[\"man\"]))\n",
        "print(embeddings.most_similar(positive=[\"woman\", \"mr\"], negative=[\"miss\"]))\n",
        "print(embeddings.most_similar(positive=[\"greece\", \"madrid\"], negative=[\"athens\"]))"
      ]
    },
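    {
      "cell_type": "markdown",
      "source": [
        "The same analogies can be reproduced with explicit vector arithmetic. Note that, unlike ```most_similar```, ```similar_by_vector``` does not exclude the input words, so the first result is typically one of them (e.g. *king*)."
      ],
      "metadata": {
        "id": "analogyManualMd01"
      },
      "id": "analogyManualMd01"
    },
    {
      "cell_type": "code",
      "source": [
        "# explicit arithmetic: e_king - e_man + e_woman\n",
        "target = embeddings[\"king\"] - embeddings[\"man\"] + embeddings[\"woman\"]\n",
        "# find the vocabulary entries closest to the resulting vector\n",
        "print(embeddings.similar_by_vector(target, topn=5))"
      ],
      "metadata": {
        "id": "analogyManualCode01"
      },
      "id": "analogyManualCode01",
      "execution_count": null,
      "outputs": []
    },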
    {
      "cell_type": "markdown",
      "source": [
        "We will use a subset of the [SemEval 2017](https://alt.qcri.org/semeval2017/task2/) benchmark that contains 4998 relations. The original dataset contains 10014 entries for the English language."
      ],
      "metadata": {
        "id": "keJLlv91D6uQ"
      },
      "id": "keJLlv91D6uQ"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "27d1d8b9-e943-4528-bf3d-ccf58da4867c",
      "metadata": {
        "tags": [],
        "id": "27d1d8b9-e943-4528-bf3d-ccf58da4867c"
      },
      "outputs": [],
      "source": [
        "#we open and visualize a portion of the dataset\n",
        "f = open('/content/Lab_2_datasets/semeval.txt')\n",
        "print(\"\".join(f.readlines()[:10]))\n",
        "f.close()"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Instead of running `most_similar` on each entry, we can use the built-in function [`evaluate_word_analogies`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.evaluate_word_analogies) which computes the performance of the model over an analogy test set."
      ],
      "metadata": {
        "id": "qjgBlhADFas6"
      },
      "id": "qjgBlhADFas6"
    },
    {
      "cell_type": "code",
      "source": [
        "accuracy, results = embeddings.evaluate_word_analogies('/content/Lab_2_datasets/semeval.txt')\n",
        "print(\"Accuracy \", accuracy)\n",
        "print(results[0].keys())\n",
        "print(\"Correct \", len(results[0]['correct']) ,results[0]['correct'][:5])\n",
        "print(\"Incorrect\" , len(results[0]['incorrect']) , results[0]['incorrect'][:5])"
      ],
      "metadata": {
        "id": "C59ELRZDFVfZ"
      },
      "id": "C59ELRZDFVfZ",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Semantic Lexicons\n",
        "\n",
        "A semantic lexicon is a database that maps words to semantic classes or relationships, and explicitly represents *word senses* (the different meanings a word can have).\n",
        "\n",
        "\n",
        "\n",
        "The most widely used collection is [**WordNet**](https://wordnet.princeton.edu/), which organizes word senses into\n",
        "*synsets* (synonym sets), each representing a distinct concept. For example, the word *bank* has at least two distinct senses: a financial institution and the edge of a river. Each sense is treated as a separate entity, with its own definition, examples and relations (synonymy, hypernymy, hyponymy, antonymy, etc).\n",
        "\n",
        "Other useful resources are:\n",
        "- **PPDB** (Paraphrase Database), containing\n",
        "  millions of paraphrase pairs extracted from bilingual parallel corpora. The intuition is that if two words in one language translate to the same word in another language, they are likely paraphrases.\n",
        "\n",
        "- [**FrameNet**](https://framenet.icsi.berkeley.edu/), a linguistic database that groups words by the conceptual *frames*\n",
        "  they evoke. A frame is an abstract situation or event, and all words that can\n",
        "  describe it are connected. For example, the frame *motion* groups *direction*, *distance*, *goal*, *path*."
      ],
      "metadata": {
        "id": "7M4OaqEfmWHo"
      },
      "id": "7M4OaqEfmWHo"
    },
    {
      "cell_type": "code",
      "source": [
        "import nltk\n",
        "from nltk.corpus import wordnet as wn\n",
        "nltk.download('wordnet')\n",
        "\n",
        "synsets = wn.synsets(\"plant\") # returns a list of all synsets\n",
        "print(\"Number of synsets:\", len(synsets))\n",
        "print(synsets)\n",
        "\n",
        "# 'plant': lemma, 'n': noun (part of speech),\n",
        "# '01': sense number (disambiguates between multiple synsets)\n",
        "\n",
        "# e.g. synset('plant.n.01'): buildings for carrying on industrial labor\n",
        "# synset('plant.n.09'): (botany) a living organism lacking the power of locomotion"
      ],
      "metadata": {
        "id": "oB-m7xs6mV8O"
      },
      "id": "oB-m7xs6mV8O",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "We can see some relations relative to the specific synsets of interest, such as *\"plant.n.01\"* and *\"plant.n.02\"*."
      ],
      "metadata": {
        "id": "qoSBCBlLb9JU"
      },
      "id": "qoSBCBlLb9JU"
    },
    {
      "cell_type": "code",
      "source": [
        "# two different senses of the word \"plant\"\n",
        "print(\"Definition 1:\", wn.synset(\"plant.n.01\").definition())\n",
        "print(\"Definition 2:\", wn.synset(\"plant.n.02\").definition())\n",
        "\n",
        "# synonyms for each sense\n",
        "print(\"\\nSynonyms 1:\", wn.synset(\"plant.n.01\").lemma_names())\n",
        "print(\"Synonyms 2:\", wn.synset(\"plant.n.02\").lemma_names())\n",
        "\n",
        "# hypernyms for each sense (it returns synsets, not strings!!)\n",
        "print(\"\\nHypernyms 1:\", wn.synset(\"plant.n.01\").hypernyms())\n",
        "print(\"Hypernyms 2:\", wn.synset(\"plant.n.02\").hypernyms())\n",
        "\n",
        "# hyponyms for each sense (it returns synsets, not strings!!)\n",
        "print(\"\\nHyponyms 1:\", wn.synset(\"plant.n.01\").hyponyms())\n",
        "print(\"Hyponyms 2:\", wn.synset(\"plant.n.02\").hyponyms())\n",
        "\n",
        "# we can compute the semantic similarity with 'path_similarity' (1.0 / (shortest path distance + 1))\n",
        "print(wn.synset(\"plant.n.01\").path_similarity(wn.synset(\"plant.n.02\")))"
      ],
      "metadata": {
        "id": "yCWp4HIanNFG"
      },
      "id": "yCWp4HIanNFG",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n",
        "If we want to access lexical information (hyponyms in this case) for all meanings of a word,\n",
        "we must iterate over all its synsets!"
      ],
      "metadata": {
        "id": "YgFt7u3Dk7zn"
      },
      "id": "YgFt7u3Dk7zn"
    },
    {
      "cell_type": "code",
      "source": [
        "for synset in wn.synsets(\"plant\"):\n",
        "  print(synset.name())\n",
        "  for hypo in synset.hyponyms():\n",
        "    print(\"-\", hypo.name(), \":\", hypo.definition())"
      ],
      "metadata": {
        "id": "d6P08mwjk6aS"
      },
      "id": "d6P08mwjk6aS",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "In WordNet, a lemma represents a specific word form within a synset.\n",
        "Unlike `.lemma_names()`, which returns a list of strings, `.lemmas()` returns Lemma objects that contain additional linguistic information, such as antonyms or frequency.\n",
        "\n",
        "Since antonyms are defined at the lemma level (not the synset level), they can only be accessed through lemma objects. To access lemma-level attributes, you must first retrieve the lemma objects via `.lemmas()`.\n",
        "\n",
        "Let's check the antonyms of the synset \"*good.a.01*\":"
      ],
      "metadata": {
        "id": "AkBW3umBciG3"
      },
      "id": "AkBW3umBciG3"
    },
    {
      "cell_type": "code",
      "source": [
        "# let's check the definition first\n",
        "print(\"Definition 1:\", wn.synset(\"good.a.01\").definition())\n",
        "\n",
        "# antonyms are defined at the lemma level, so we must iterate over lemma objects\n",
        "# of a specific synset (in this case, a specific meaning of \"good\")\n",
        "for lemma in wn.synset(\"good.a.01\").lemmas():\n",
        "\n",
        "    # lemma.antonyms() returns a list of opposite Lemma objects (if any exist)\n",
        "    if lemma.antonyms():\n",
        "\n",
        "        # we take the first antonym and print its name (string form)\n",
        "        print(\"Antonyms:\", lemma.antonyms()[0].name())"
      ],
      "metadata": {
        "id": "hYUZ-KAqcsFE"
      },
      "id": "hYUZ-KAqcsFE",
      "execution_count": null,
      "outputs": []
    }
  ],
  "metadata": {
    "celltoolbar": "Slideshow",
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.8"
    },
    "colab": {
      "provenance": []
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}