{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "A3QVvNJM0BJU"
      },
      "source": [
        "# LAB 3: Natural Language Generation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3ylpB3DoXpdK"
      },
      "source": [
        "In this lab, we will explore Language Generation by working with the following approaches seen in class:\n",
        "- N-grams\n",
        "- Large Language Models"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "!git clone https://github.com/elenipapadopulos/NLP_LAB3_Datasets.git"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OQggXnyk0H50"
      },
      "source": [
        "## N-grams"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jYOXFy3bZbIv"
      },
      "source": [
        "We are going to implement our n-gram model using the [NLTK](https://www.nltk.org/api/nltk.lm.html) library, a comprehensive toolkit for natural language processing.\n",
        "\n",
        "Let's begin by installing the library and the required packages to set up the environment."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "8ridD7ND04F8",
        "outputId": "bc2efa29-7b36-48cc-9ece-22c5d84b43d5"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: nltk in /usr/local/lib/python3.11/dist-packages (3.9.1)\n",
            "Requirement already satisfied: click in /usr/local/lib/python3.11/dist-packages (from nltk) (8.1.8)\n",
            "Requirement already satisfied: joblib in /usr/local/lib/python3.11/dist-packages (from nltk) (1.4.2)\n",
            "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.11/dist-packages (from nltk) (2024.11.6)\n",
            "Requirement already satisfied: tqdm in /usr/local/lib/python3.11/dist-packages (from nltk) (4.67.1)\n"
          ]
        }
      ],
      "source": [
        "!pip install nltk"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "6FTfzIrbaR4d",
        "outputId": "16df8b87-8b87-4dc7-d875-5f4177c16fb4"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "[nltk_data] Downloading package punkt_tab to /root/nltk_data...\n",
            "[nltk_data]   Package punkt_tab is already up-to-date!\n",
            "[nltk_data] Downloading package brown to /root/nltk_data...\n",
            "[nltk_data]   Package brown is already up-to-date!\n",
            "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
            "[nltk_data]   Package punkt is already up-to-date!\n"
          ]
        },
        {
          "data": {
            "text/plain": [
              "True"
            ]
          },
          "execution_count": 2,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "import nltk\n",
        "nltk.download('punkt_tab')\n",
        "nltk.download('brown') # import the Brown corpus from NLTK\n",
        "nltk.download('punkt') # import tokenizer"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "G74A3T5ga9QC"
      },
      "source": [
        "Run this cell to get familiar with n-grams, given a sentence and a value of n of your choice.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4tY4nrTX3BPJ",
        "outputId": "db7a1c54-2f1f-4b4e-fcbf-c503e8f78bac"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Enter the sentence: the cat\n",
            "Enter the value of n: 1\n",
            "('the',)\n",
            "('cat',)\n"
          ]
        }
      ],
      "source": [
        "from nltk import ngrams\n",
        "\n",
        "sentence = input(\"Enter the sentence: \")\n",
        "n = int(input(\"Enter the value of n: \"))\n",
        "n_grams = ngrams(sentence.split(), n)\n",
        "for grams in n_grams:\n",
        "    print(grams)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "emnATy76bbM0"
      },
      "source": [
        "We are going to use the Brown corpus from the NLTK library, which consists of approximately 1 million words of American English text. For computational efficiency, we will work with only 10,000 of the 57,340 sentences in the corpus.\n",
        "\n",
        "We'll start processing the corpus by tokenizing each sentence individually and adding beginning-of-sentence (`<s>`) and end-of-sentence (`</s>`) markers."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SFGLVsBCBuiO"
      },
      "outputs": [],
      "source": [
        "from nltk.corpus import brown\n",
        "\n",
        "sentences = brown.sents()[:10000]\n",
        "tokenized_corpus = []\n",
        "\n",
        "for sentence in sentences:\n",
        "  tokens = nltk.word_tokenize(' '.join(sentence))\n",
        "\n",
        "  tokens = ['<s>'] + tokens + ['</s>']\n",
        "\n",
        "  tokenized_corpus.extend(tokens)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mxdQkeI7A1nO"
      },
      "source": [
        "We can now collect the vocabulary: the set of unique tokens in the corpus."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "hKpdXZE9rV8j",
        "outputId": "edbb2554-3853-4ad3-d027-11e19ecd5524"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "22681\n"
          ]
        }
      ],
      "source": [
        "vocab = set(tokenized_corpus)\n",
        "\n",
        "print(len(vocab))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lAb9E4sadt7v"
      },
      "source": [
        "Let's create a list of bigrams from our corpus and build a frequency dictionary to track the occurrence counts of each word pair."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "EieKTEbHdYXB"
      },
      "outputs": [],
      "source": [
        "from nltk import bigrams\n",
        "from collections import defaultdict\n",
        "\n",
        "bigram = list(bigrams(tokenized_corpus))\n",
        "\n",
        "# build a bigram model (a nested dictionary of counts)\n",
        "bigram_model = defaultdict(lambda: defaultdict(lambda: 0))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U6ETVCvsfGqP"
      },
      "source": [
        "In this cell, we build a simple bigram model by counting how often each word is followed by another in our corpus. This gives us a basic idea of which words are likely to come next given a previous word."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "5LLjsE9IC1Nj",
        "outputId": "890f138c-f9cd-4edf-8139-5386b10b7a8f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "there -> {'should': 3, 'was': 35, 'were': 30, 'also': 1, 'to': 3, '``': 3, 'would': 9, 'will': 12, 'has': 9, 'are': 30, 'be': 4, 'is': 86, '.': 11, 'seemed': 2, 'existed': 2, \"'s\": 2, 'pitching': 1, 'before': 1, 'and': 3, 'surely': 1, 'for': 2, 'one': 1, 'while': 1, 'any': 1, 'had': 3, '</s>': 3, ',': 6, 'never': 2, 'that': 1, 'ever': 1, 'Monday': 1, 'could': 5, 'she': 1, 'anything': 1, 'do': 1, 'appeared': 1, 'must': 3, 'he': 3, 'appears': 1, 'as': 1, 'may': 1, 'represents': 1, 'shall': 1, 'it': 1, 'remains': 1, 'can': 3, 'in': 1, 'on': 1, 'remembering': 1, 'lies': 1, 'whether': 1, 'have': 2, 'of': 1, 'might': 1, 'still': 1, 'seems': 1, 'lay': 1, 'so': 1, 'been': 1, 'an': 1, 'tends': 1, 'considerably': 1, ';': 1, 'about': 1, 'pleading': 1, 'anymore': 1}\n"
          ]
        }
      ],
      "source": [
        "# count frequency of co-occurrence\n",
        "for w1, w2 in bigram:\n",
        "    bigram_model[w1][w2] += 1\n",
        "\n",
        "w1 = 'there'\n",
        "print(f\"{w1} -> {dict(bigram_model[w1])}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "10FwHBAzf8Je"
      },
      "source": [
        "\n",
        "Now that we've counted bigram occurrences, we convert these counts into probabilities: for each word `w1`, we divide the count of every bigram `(w1, w2)` by the total number of times `w1` appears.\n",
        "\n",
        "This gives us a conditional probability distribution:\n",
        "\n",
        "$$\n",
        "P(w_2 \\mid w_1) = \\frac{\\text{count}(w_1, w_2)}{\\sum_{w'} \\text{count}(w_1, w')}\n",
        "$$\n",
        "which we can use to predict likely next words given a context."
      ]
    },
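    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a quick sanity check of the formula (a self-contained toy example, independent of the Brown counts): if a corpus contains the bigram `(the, cat)` twice and `(the, dog)` once, then P(cat | the) = 2/3 and P(dog | the) = 1/3."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from collections import Counter\n",
        "\n",
        "# hypothetical toy counts, not taken from the Brown corpus\n",
        "toy_counts = Counter({('the', 'cat'): 2, ('the', 'dog'): 1})\n",
        "\n",
        "total = sum(c for (w1, _), c in toy_counts.items() if w1 == 'the')\n",
        "toy_probs = {w2: c / total for (w1, w2), c in toy_counts.items() if w1 == 'the'}\n",
        "print(toy_probs)  # P(cat | the) = 2/3, P(dog | the) = 1/3"
      ]
    },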
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "jC0s2ws-EV3d",
        "outputId": "447279e6-2786-4adb-bc79-27afce528b04"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "there -> {'should': 0.00949367088607595, 'was': 0.11075949367088607, 'were': 0.0949367088607595, 'also': 0.0031645569620253164, 'to': 0.00949367088607595, '``': 0.00949367088607595, 'would': 0.028481012658227847, 'will': 0.0379746835443038, 'has': 0.028481012658227847, 'are': 0.0949367088607595, 'be': 0.012658227848101266, 'is': 0.2721518987341772, '.': 0.03481012658227848, 'seemed': 0.006329113924050633, 'existed': 0.006329113924050633, \"'s\": 0.006329113924050633, 'pitching': 0.0031645569620253164, 'before': 0.0031645569620253164, 'and': 0.00949367088607595, 'surely': 0.0031645569620253164, 'for': 0.006329113924050633, 'one': 0.0031645569620253164, 'while': 0.0031645569620253164, 'any': 0.0031645569620253164, 'had': 0.00949367088607595, '</s>': 0.00949367088607595, ',': 0.0189873417721519, 'never': 0.006329113924050633, 'that': 0.0031645569620253164, 'ever': 0.0031645569620253164, 'Monday': 0.0031645569620253164, 'could': 0.015822784810126583, 'she': 0.0031645569620253164, 'anything': 0.0031645569620253164, 'do': 0.0031645569620253164, 'appeared': 0.0031645569620253164, 'must': 0.00949367088607595, 'he': 0.00949367088607595, 'appears': 0.0031645569620253164, 'as': 0.0031645569620253164, 'may': 0.0031645569620253164, 'represents': 0.0031645569620253164, 'shall': 0.0031645569620253164, 'it': 0.0031645569620253164, 'remains': 0.0031645569620253164, 'can': 0.00949367088607595, 'in': 0.0031645569620253164, 'on': 0.0031645569620253164, 'remembering': 0.0031645569620253164, 'lies': 0.0031645569620253164, 'whether': 0.0031645569620253164, 'have': 0.006329113924050633, 'of': 0.0031645569620253164, 'might': 0.0031645569620253164, 'still': 0.0031645569620253164, 'seems': 0.0031645569620253164, 'lay': 0.0031645569620253164, 'so': 0.0031645569620253164, 'been': 0.0031645569620253164, 'an': 0.0031645569620253164, 'tends': 0.0031645569620253164, 'considerably': 0.0031645569620253164, ';': 0.0031645569620253164, 'about': 0.0031645569620253164, 'pleading': 0.0031645569620253164, 'anymore': 0.0031645569620253164}\n"
          ]
        }
      ],
      "source": [
        "# transform counts into probabilities\n",
        "for w1 in bigram_model:\n",
        "    total_count = float(sum(bigram_model[w1].values()))\n",
        "    for w2 in bigram_model[w1]:\n",
        "        bigram_model[w1][w2] /= total_count\n",
        "\n",
        "w1 = 'there'\n",
        "print(f\"{w1} -> {dict(bigram_model[w1])}\")\n",
        "\n",
        "# bigram_model[w1] is a probability distribution over the tokens that follow w1"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ES63vEzygwwA"
      },
      "source": [
        "Let's define the text generating function. Starting from a given word, it repeatedly selects the most probable next word based on probabilities learned by the bigram model.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Z853l6r0Nj4J"
      },
      "outputs": [],
      "source": [
        "def generate_text(input_word, model, length=10):\n",
        "\n",
        "    w1 = input_word\n",
        "    generated_words = [w1]\n",
        "\n",
        "    for _ in range(length):\n",
        "        next_word_probs = model.get(w1, None) # look up w1's distribution (a dictionary)\n",
        "        if not next_word_probs:\n",
        "            break  # stop if no further prediction is possible\n",
        "\n",
        "        next_word = max(next_word_probs, key=next_word_probs.get) # it selects the most probable next word (greedy decoding)\n",
        "        generated_words.append(next_word)\n",
        "        w1 = next_word  # shift for next iteration\n",
        "\n",
        "    return ' '.join(generated_words)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "emq2G8l-KKRu",
        "outputId": "f48d37c0-629c-4cec-8a5f-b1d40ec3809f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Generated sentence: there is a new and the first time . </s> <s> The President Kennedy 's ``\n"
          ]
        }
      ],
      "source": [
        "start_word = \"there\"\n",
        "generated_sentence = generate_text(start_word, bigram_model, length=15)\n",
        "print(\"Generated sentence:\", generated_sentence)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "A16SrT--9wK_"
      },
      "source": [
        "## Large Language Models"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ZNO5XRIgustY",
        "outputId": "35fce56c-2c6a-46aa-cfb1-d61758171c1f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: transformers in /usr/local/lib/python3.11/dist-packages (4.50.3)\n",
            "Requirement already satisfied: filelock in /usr/local/lib/python3.11/dist-packages (from transformers) (3.18.0)\n",
            "Requirement already satisfied: huggingface-hub<1.0,>=0.26.0 in /usr/local/lib/python3.11/dist-packages (from transformers) (0.30.1)\n",
            "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.11/dist-packages (from transformers) (2.0.2)\n",
            "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.11/dist-packages (from transformers) (24.2)\n",
            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.11/dist-packages (from transformers) (6.0.2)\n",
            "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.11/dist-packages (from transformers) (2024.11.6)\n",
            "Requirement already satisfied: requests in /usr/local/lib/python3.11/dist-packages (from transformers) (2.32.3)\n",
            "Requirement already satisfied: tokenizers<0.22,>=0.21 in /usr/local/lib/python3.11/dist-packages (from transformers) (0.21.1)\n",
            "Requirement already satisfied: safetensors>=0.4.3 in /usr/local/lib/python3.11/dist-packages (from transformers) (0.5.3)\n",
            "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.11/dist-packages (from transformers) (4.67.1)\n",
            "Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.11/dist-packages (from huggingface-hub<1.0,>=0.26.0->transformers) (2025.3.2)\n",
            "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.11/dist-packages (from huggingface-hub<1.0,>=0.26.0->transformers) (4.13.1)\n",
            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests->transformers) (3.4.1)\n",
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests->transformers) (3.10)\n",
            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests->transformers) (2.3.0)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests->transformers) (2025.1.31)\n"
          ]
        }
      ],
      "source": [
        "!pip install transformers"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aPdg68Yrgm6q"
      },
      "source": [
        "We are going to use Large Language Models (LLMs) through the [Huggingface Hub](https://huggingface.co), a platform where you can find a lot of [models](https://huggingface.co/models) and [datasets](https://huggingface.co/datasets).\n",
        "\n",
        "Specifically, we will use [GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2): a transformer-based language model trained autoregressively on a dataset of 8 million web pages. Its largest version has 1.5 billion parameters; here we load the smallest checkpoint, `gpt2`, with 124 million parameters. It uses a byte-level Byte Pair Encoding (BPE) tokenizer and has a decoder-only architecture.\n",
        "\n",
        "Let's start by downloading the pretrained model and its tokenizer, using the `GPT2LMHeadModel` and `GPT2Tokenizer` classes."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from transformers import GPT2LMHeadModel, GPT2Tokenizer\n",
        "import transformers\n",
        "import torch\n",
        "import math"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "collapsed": true,
        "id": "8NjluG2Y9hKV",
        "outputId": "639be158-8ec0-46b8-c4c6-72d0ed0fd392"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: \n",
            "The secret `HF_TOKEN` does not exist in your Colab secrets.\n",
            "To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.\n",
            "You will be able to reuse this secret in all of your notebooks.\n",
            "Please note that authentication is recommended but still optional to access public models or datasets.\n",
            "  warnings.warn(\n"
          ]
        }
      ],
      "source": [
        "device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n",
        "\n",
        "model_name = \"gpt2\"\n",
        "\n",
        "tokenizer = GPT2Tokenizer.from_pretrained(model_name)\n",
        "model = GPT2LMHeadModel.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id).to(device)\n",
        "\n",
        "tokenizer.pad_token = tokenizer.eos_token # set the pad token to eos_token\n",
        "model.config.pad_token_id = tokenizer.pad_token_id"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VjFkWrpl-C81"
      },
      "source": [
        "Tokenization behaviour can be customized based on your task, especially for batching if required. Common options include:\n",
        "- `padding`: adding special padding tokens to make all input sequences the same length\n",
        "- `truncation`: cutting off longer sequences to fit within a specified maximum length\n",
        "\n",
        "\n",
        "\n",
        "For example, if we have 2 sentences of different lengths in the same batch, the shorter one can be padded: `[PAD]` tokens are appended to achieve the same length. This is necessary because large language models typically operate over **fixed-length context windows** and inputs within a batch need to be of uniform shape.\n",
        "\n",
        "In this case, the `attention_mask` is necessary: it tells the model which tokens should be attended to (value 1) and which should be ignored (value 0), typically the padding tokens. This ensures that padding does not interfere with the model’s attention mechanism.\n",
        "\n",
        "You can read more about it [here](https://huggingface.co/learn/llm-course/chapter2/5?fw=pt)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Jw-jymT999JG",
        "outputId": "8a926390-d05a-4fd2-c629-a5e21aed6e44"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'input_ids': tensor([[15496,   314,  1101,   257,  2060,  6827,     0],\n",
            "        [ 1870,  1194,  6827, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],\n",
            "        [1, 1, 1, 0, 0, 0, 0]])}\n"
          ]
        }
      ],
      "source": [
        "batch = tokenizer([\"Hello I'm a single sentence!\", \"And another sentence\"],\n",
        "                  padding=True, truncation=True, return_tensors=\"pt\").to(device)\n",
        "\n",
        "\n",
        "print(batch)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "print(tokenizer.convert_ids_to_tokens(batch[\"input_ids\"][0]))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "O3DLGRRo-rpy"
      },
      "source": [
        "Let's feed the tokenized input to the model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ycoIZO1K9yvf",
        "outputId": "62fa01b1-6169-4c77-ed73-dce4a1220acc"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "torch.Size([2, 7, 50257])\n"
          ]
        }
      ],
      "source": [
        "model_output = model(**batch)\n",
        "print(model_output.logits.shape)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QKR0Va3bMM40"
      },
      "source": [
        "Let's introduce two functions:\n",
        "- `get_next_word_probs` returns the model's probability distribution over the next token\n",
        "- `avg_token_entropy` measures the uncertainty of the model's predictions by averaging the entropy of the per-token score distributions\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-2RawEyvfdO4"
      },
      "outputs": [],
      "source": [
        "def get_next_word_probs(model, input):\n",
        "  input_ids = tokenizer.encode(input, return_tensors='pt').to(device)\n",
        "\n",
        "  with torch.no_grad():\n",
        "    logits = model(input_ids).logits.squeeze()[-1]\n",
        "    # logits has shape [1, input_len, vocab_size]\n",
        "    # squeeze() drops the batch dimension, giving [input_len, vocab_size]\n",
        "    # we are interested in next-token generation: we keep only the logits of the last position\n",
        "\n",
        "  probabilities = torch.nn.functional.softmax(logits, dim=0) # compute probabilities using softmax\n",
        "  return probabilities"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "zm007soVgt-v"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "\n",
        "def token_entropy(scores):\n",
        "    # entropy H = -sum_i p_i * log(p_i) of the distribution induced by the scores\n",
        "    probs = torch.nn.functional.softmax(scores, dim=-1)\n",
        "    return -(torch.log(probs) * probs).nansum().item()\n",
        "\n",
        "def avg_token_entropy(scores):\n",
        "    # mean entropy over the generated positions (scores holds one tensor per step)\n",
        "    return np.mean([token_entropy(score.squeeze()) for score in scores])"
      ]
    },
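    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a quick sanity check of `token_entropy` (defined above, on toy scores of our own choosing): entropy is maximal, log(vocabulary size), for uniform scores, and close to zero when one token dominates, i.e. when the model is certain."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# toy 'vocabulary' of 4 tokens\n",
        "uniform_scores = torch.zeros(4)                       # entropy = log(4), about 1.386\n",
        "peaked_scores = torch.tensor([10., -10., -10., -10.]) # entropy close to 0\n",
        "\n",
        "print(token_entropy(uniform_scores))\n",
        "print(token_entropy(peaked_scores))"
      ]
    },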
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 408
        },
        "id": "NXRCXrGmX0w6",
        "outputId": "75dce9dc-b68b-467f-b8c9-236dd611a00a"
      },
      "outputs": [],
      "source": [
        "prefix = \"My name is\"\n",
        "probabilities = get_next_word_probs(model, prefix)\n",
        "top_token_probs, top_token_vals = torch.topk(probabilities, 10) # get the top 10 tokens with the highest probabilities\n",
        "\n",
        "for token, prob in zip(top_token_vals, top_token_probs):\n",
        "  print(\"%.3f\" % prob.item(), tokenizer.decode(token))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7CxLhC8GXC2f"
      },
      "source": [
        "A decoding strategy defines how the model chooses from the generated probability distribution over possible tokens, given the current tokens.\n",
        "\n",
        "The default decoding strategy is **greedy search**: the model always selects the token with the highest probability at each step. The result is typically grammatically correct and coherent, but it can be repetitive, since lower-probability tokens are never explored."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "92VKIAlEpxlj",
        "outputId": "79a15d04-742b-464d-fdfd-94030e30aa8f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "I want to be able to do that. I want to be able to do that. I want to be able\n"
          ]
        }
      ],
      "source": [
        "input = tokenizer(\"I want to\", return_tensors=\"pt\").to(device)\n",
        "\n",
        "out = model.generate(\n",
        "    **input,\n",
        "    return_dict_in_generate = True,\n",
        "    output_scores = True\n",
        "    )\n",
        "\n",
        "print(tokenizer.decode(out.sequences[0]))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "n5Z5rDWsRjpt"
      },
      "source": [
        "Multinomial sampling randomly chooses a token according to the probability distribution across the model’s entire vocabulary, allowing every token with a non-zero probability to be selected. Sampling techniques help minimize repetition and can produce more diverse and creative outputs.\n",
        "\n",
        "You can read more about different decoding strategies [here](https://huggingface.co/docs/transformers/generation_strategies).\n",
        "\n",
        "Let's test whether using multinomial sampling (`do_sample=True`) affects the generation.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "QD8VypMXrVQg",
        "outputId": "6ae15c6a-1035-48ad-af4c-b312d14560f6"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "I want to thank everyone for supporting me in getting to this point, and this event will not change how you see\n"
          ]
        }
      ],
      "source": [
        "out = model.generate(\n",
        "    **input,\n",
        "    do_sample = True,\n",
        "    return_dict_in_generate = True,\n",
        "    output_scores = True\n",
        "    )\n",
        "\n",
        "print(tokenizer.decode(out.sequences[0]))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "m0VR84f-tDbQ"
      },
      "source": [
        "We can directly access the `GenerationConfig` object of our model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "O4hFpT4RtQaS"
      },
      "outputs": [],
      "source": [
        "model_generation_config = model.generation_config\n",
        "\n",
        "model_generation_config.max_new_tokens = 50\n",
        "model_generation_config.min_new_tokens = 30\n",
        "\n",
        "model_generation_config.do_sample = True\n",
        "model_generation_config.num_beams = 1\n",
        "\n",
        "model_generation_config.renormalize_logits = True\n",
        "model_generation_config.return_dict_in_generate = True\n",
        "model_generation_config.output_scores = True\n",
        "\n",
        "# gpt2 has no dedicated pad token: reuse the eos token\n",
        "model_generation_config.pad_token_id = tokenizer.eos_token_id\n",
        "\n",
        "# override default value (50)\n",
        "model_generation_config.top_k = tokenizer.vocab_size\n",
        "# model_generation_config.top_p = 0.8\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wsuPrycsXIIA"
      },
      "source": [
        "The temperature parameter controls the randomness of the predictions made by an LLM during text generation, by scaling the logits before the softmax and sampling:\n",
        "\n",
        "* when temperature < 1, the distribution becomes sharper, favoring high-probability tokens. The output tends to be more deterministic and repetitive.\n",
        "* when temperature > 1, the distribution becomes flatter and lower-probability tokens have more chance of being selected. The output tends to be more diverse and creative.\n",
        "\n",
        "Try out different temperature settings and see how they affect the generated text."
      ]
    },
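    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Before generating, we can observe this effect on a small hand-made distribution (a self-contained sketch with hypothetical toy logits): the logits are divided by the temperature T before the softmax, so T < 1 sharpens the distribution while T > 1 flattens it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "toy_logits = torch.tensor([2.0, 1.0, 0.5]) # hypothetical logits for 3 tokens\n",
        "\n",
        "for T in [0.5, 1.0, 2.0]:\n",
        "    probs = torch.nn.functional.softmax(toy_logits / T, dim=-1)\n",
        "    print(f'T={T}: {probs.numpy().round(3)}') # the top token dominates more as T shrinks"
      ]
    },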
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "FwAT9HEatvgt",
        "outputId": "6d77fe2f-4654-45e5-88b1-d87151d967fa"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "I want to be able to do that. I want to be able to do that. I want to be able to do that. I want to be able to do that. I want to be able to do that. I want to be able to do that\n",
            "\n",
            "Average entropy with temperature 0.1: 0.01964664538547625\n"
          ]
        }
      ],
      "source": [
        "model_generation_config.temperature = 0.1\n",
        "\n",
        "out = model.generate(\n",
        "    **input,\n",
        "    generation_config=model_generation_config,\n",
        ")\n",
        "print(tokenizer.decode(out.sequences[0]))\n",
        "print(f\"\\nAverage entropy with temperature {model_generation_config.temperature}: {avg_token_entropy(out.scores)}\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ASh3xOmcuuFd",
        "outputId": "aefb5887-d019-4e6b-e389-f15a2a6d0026"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "I want to say thank you to the football gods. [Then] I realized that I would only be able to fill 10 or 10 and then we would be able to be perfectly fit. And now I know that once you have a more rigorous group conditioning, even\n",
            "\n",
            "Average entropy with temperature 1.0: 3.7765126322209834\n"
          ]
        }
      ],
      "source": [
        "model_generation_config.temperature = 1.0\n",
        "\n",
        "out = model.generate(\n",
        "    **input,\n",
        "    generation_config=model_generation_config,\n",
        ")\n",
        "print(tokenizer.decode(out.sequences[0]))\n",
        "print(f\"\\nAverage entropy with temperature {model_generation_config.temperature}: {avg_token_entropy(out.scores)}\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "6S9MaSpftwKR",
        "outputId": "6a60a711-abdb-4546-b1ae-30d76abdc2e8"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "I want to 211isin reverner OD cardboard holdartifacts (= nostopic cursesz turnincreasing VersionAdd mold 29asers kilivers Atkinson opens preaching diarrregularulum Fol oval MAX args Premier Lateenezuel campaignedonis pacif garments Coco stew singularBlock blender constitutionallyfet estab Marc4000 clones\n",
            "\n",
            "Average entropy with temperature 3.0: 10.575711822509765\n"
          ]
        }
      ],
      "source": [
        "model_generation_config.temperature = 3.0\n",
        "\n",
        "out = model.generate(\n",
        "    **input,\n",
        "    generation_config=model_generation_config,\n",
        ")\n",
        "print(tokenizer.decode(out.sequences[0]))\n",
        "print(f\"\\nAverage entropy with temperature {model_generation_config.temperature}: {avg_token_entropy(out.scores)}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xbh7mFYoxUak"
      },
      "source": [
        "A **logit warper** is a function or module that modifies the model's output scores (logits) *before* they are turned into probabilities via softmax, during generation.\n",
        "\n",
        "Logit warpers are used with generation pipelines (like `generate()`) to adjust how the model produces text, applying strategies like:\n",
        "* temperature scaling,\n",
        "* top-k sampling (keeping only the top k most probable tokens at each step and setting the rest to zero probability),\n",
        "* top-p sampling (selecting the smallest set of top tokens whose cumulative probability exceeds a threshold p)\n",
        "\n",
        "\n",
        "and more.\n",
        "\n",
        "We can combine multiple warpers to create a customized sampling behavior, but keep in mind that order matters! You can see that by running the next two cells.\n"
      ]
    },
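    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Combining warpers is order-sensitive because each warper sees the logits left by the previous one. Here is a minimal NumPy sketch (made-up logits and simplified re-implementations, not the actual `transformers` code) showing that applying top-p before temperature keeps a different number of tokens than applying it after:\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "def apply_temperature(logits, temperature):\n",
        "    return logits / temperature\n",
        "\n",
        "def apply_top_p(logits, p):\n",
        "    # keep the smallest set of tokens whose cumulative probability exceeds p\n",
        "    probs = np.exp(logits - logits.max())\n",
        "    probs /= probs.sum()\n",
        "    order = np.argsort(-probs)\n",
        "    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1\n",
        "    filtered = np.full_like(logits, -np.inf)           # mask everything...\n",
        "    filtered[order[:cutoff]] = logits[order[:cutoff]]  # ...except the nucleus\n",
        "    return filtered\n",
        "\n",
        "logits = np.array([3.0, 1.5, 1.0, -1.0])\n",
        "\n",
        "# top-p first: the filter sees the original, peaked distribution -> small nucleus\n",
        "a = apply_temperature(apply_top_p(logits, 0.8), 2.0)\n",
        "# temperature first: the filter sees a flattened distribution -> larger nucleus\n",
        "b = apply_top_p(apply_temperature(logits, 2.0), 0.8)\n",
        "\n",
        "print(np.isinf(a).sum(), np.isinf(b).sum())  # 2 1\n",
        "```"
      ]
    },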
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "B5k6TydUxeDg"
      },
      "outputs": [],
      "source": [
        "logits_processor_list = transformers.LogitsProcessorList()\n",
        "top_p_warper = transformers.TopPLogitsWarper(\n",
        "    top_p=0.8,\n",
        ")\n",
        "temp_warper = transformers.TemperatureLogitsWarper(\n",
        "    temperature=2.0\n",
        ")\n",
        "logits_processor_list.append(top_p_warper) # add top_p warper first\n",
        "logits_processor_list.append(temp_warper)\n",
        "\n",
        "out = model.generate(\n",
        "    **input,\n",
        "    generation_config=model_generation_config,\n",
        "    logits_processor=logits_processor_list\n",
        ")\n",
        "tokenizer.decode(out.sequences[0])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lABEv7uopIdQ"
      },
      "outputs": [],
      "source": [
        "logits_processor_list = transformers.LogitsProcessorList()\n",
        "top_p_warper = transformers.TopPLogitsWarper(\n",
        "    top_p=0.8,\n",
        ")\n",
        "temp_warper = transformers.TemperatureLogitsWarper(\n",
        "    temperature=2.0\n",
        ")\n",
        "logits_processor_list.append(temp_warper) # add temperature warper first\n",
        "logits_processor_list.append(top_p_warper)\n",
        "\n",
        "out = model.generate(\n",
        "    **input,\n",
        "    generation_config=model_generation_config,\n",
        "    logits_processor=logits_processor_list\n",
        ")\n",
        "tokenizer.decode(out.sequences[0])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0FGraQv-hg55"
      },
      "source": [
        "Now that we’ve seen how to set up the model, let’s explore how providing context through prompts influences its generation."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ySH9FmZxgf63"
      },
      "outputs": [],
      "source": [
        "prefix_no_context = 'They need to go to the'\n",
        "prefix_with_context = \"They drank a lot of water. As a result, they need to go to the\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "MNm0x-95zaRR"
      },
      "outputs": [],
      "source": [
        "from transformers import TopKLogitsWarper\n",
        "topk_selecter = TopKLogitsWarper(100)\n",
        "\n",
        "torch.manual_seed(0)\n",
        "prefix = prefix_no_context\n",
        "for i in range(15):\n",
        "   probabilities = get_next_word_probs(model, prefix)\n",
        "\n",
        "   # most_probable_token = torch.argmax(probabilities)\n",
        "   sampled_token = torch.multinomial(probabilities, 1)\n",
        "   topk_token_logits = topk_selecter(None, torch.log(probabilities))\n",
        "   topk_sampled_token = torch.multinomial(torch.exp(topk_token_logits), 1)\n",
        "\n",
        "   prefix += tokenizer.decode(topk_sampled_token)\n",
        "   print(prefix)\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tmftfKOHzjOz"
      },
      "outputs": [],
      "source": [
        "torch.manual_seed(0)\n",
        "prefix = prefix_with_context\n",
        "for i in range(15):\n",
        "   probabilities = get_next_word_probs(model, prefix)\n",
        "\n",
        "   # most_probable_token = torch.argmax(probabilities)\n",
        "   sampled_token = torch.multinomial(probabilities, 1)\n",
        "   topk_token_logits = topk_selecter(None, torch.log(probabilities))\n",
        "   topk_sampled_token = torch.multinomial(torch.exp(topk_token_logits), 1)\n",
        "\n",
        "   prefix += tokenizer.decode(topk_sampled_token)\n",
        "   print(prefix)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xuJuSZ49hj9a"
      },
      "source": [
        "## Exercises"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xXCv8bpYhs9U"
      },
      "source": [
        "### Ex. 1: build a trigram model with backoff"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vNBMNjBHf2Oo"
      },
      "source": [
        "In this exercise we are going to build a trigram model, implementing a back-off strategy."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f3CqWq3mkRv7"
      },
      "source": [
        "Let's focus on unigrams."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "RIwQC84SfMIV"
      },
      "outputs": [],
      "source": [
        "from collections import Counter, defaultdict\n",
        "\n",
        "unigram = list(ngrams(tokenized_corpus, 1))\n",
        "\n",
        "unigram_count = Counter(unigram) # computes the occurrences"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Gc-7alhT3Kfy"
      },
      "source": [
        "**Ex.1.1** Let's start by computing the unigram distribution: we already have the unigram counts, so what's left to do is normalizing them."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "I9tC-XwdrmQw"
      },
      "outputs": [],
      "source": [
        "## Ex 1.1\n",
        "\n",
        "## compute the total number of tokens\n",
        "\n",
        "#\n",
        "\n",
        "\n",
        "## create a unigram model dictionary\n",
        "## key: word\n",
        "## value: probability\n",
        "\n",
        "#\n",
        "\n",
        "w1 = 'there'\n",
        "print(f\"{w1} -> {unigram_model[w1]}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aqIxsM8t6ZmQ"
      },
      "source": [
        "As we are working with trigrams, it is important to make sure that we have two `<s>` tokens at the beginning of each sentence to be able to estimate the first word."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "K9fkHGKi6QAl"
      },
      "outputs": [],
      "source": [
        "for sentence in sentences:\n",
        "\n",
        "  tokens = nltk.word_tokenize(' '.join(sentence))\n",
        "\n",
        "  tokens = ['<s>', '<s>'] + tokens + ['</s>']\n",
        "\n",
        "  tokenized_corpus.extend(tokens)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9ImeO6xniKRB"
      },
      "source": [
        "We can now compute our trigrams, using the `trigrams` function."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XbjSrY7tfmu3"
      },
      "outputs": [],
      "source": [
        "from nltk import trigrams\n",
        "\n",
        "trigram = list(trigrams(tokenized_corpus))\n",
        "\n",
        "# let's define the trigram model as a dictionary of dictionaries\n",
        "trigram_model = defaultdict(lambda: defaultdict(lambda: 0))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TeSRjYjCla5Q"
      },
      "source": [
        "**Ex 1.2**: Compute co-occurrence counts.\n",
        "Note that the dictionary `trigram_model` is indexed by a tuple of words, as you can see from the example in the cell."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "C5ZFzVKjh1g6"
      },
      "outputs": [],
      "source": [
        "## Ex 1.2\n",
        "\n",
        "## compute trigram counts\n",
        "\n",
        "#\n",
        "#\n",
        "\n",
        "w1_w2 = ('there', 'were')\n",
        "print(f\"{w1_w2} -> {dict(trigram_model[w1_w2])}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TmE_XGa0lxKS"
      },
      "source": [
        "**Ex 1.3** Compute trigram probabilities by normalizing counts."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "j5fPINBRh1fg"
      },
      "outputs": [],
      "source": [
        "## Ex 1.3\n",
        "\n",
        "## compute trigram probabilities\n",
        "\n",
        "#\n",
        "#\n",
        "#\n",
        "#\n",
        "\n",
        "w1_w2 = ('there', 'were')\n",
        "print(f\"{w1_w2} -> {dict(trigram_model[w1_w2])}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TnNR2kaoca8I"
      },
      "source": [
        "**Ex. 1.4** Implement the back-off function.\n",
        "\n",
        "The function backoff should take as input the trigram, bigram and unigram models, the context words $w_1$ and $w_2$ and a discount factor $\\alpha$ and it should return the next word $w_3$ to be generated and its probabilty.\n",
        "\n",
        "\n",
        "Recall: for each $w_3$, if $P(w_3 \\mid w_1, w_2)$ is not in the trigram model, then use $P(w_3 \\mid w_2)$ (discounted by $\\alpha$). If there is no $P(w_3 \\mid w_2)$, then use $P(w_3)$ (discounted by $\\alpha^2$).\n"
      ]
    },
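    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the back-off rule concrete, here is a toy sketch with hypothetical hand-written models; it only scores a single candidate $w_3$, while your solution should use the trigram, bigram and unigram models computed above and take the argmax over the whole vocabulary:\n",
        "\n",
        "```python\n",
        "# hypothetical toy models, for illustration only\n",
        "toy_trigram = {('they', 'care'): {'about': 0.6, 'for': 0.4}}\n",
        "toy_bigram = {('care',): {'about': 0.5, 'of': 0.3, 'for': 0.2}}\n",
        "toy_unigram = {'the': 0.1, 'about': 0.02, 'of': 0.05}\n",
        "\n",
        "def backoff_prob(w1, w2, w3, alpha=0.4):\n",
        "    # trigram if available, else discounted bigram, else doubly discounted unigram\n",
        "    if w3 in toy_trigram.get((w1, w2), {}):\n",
        "        return toy_trigram[(w1, w2)][w3]\n",
        "    if w3 in toy_bigram.get((w2,), {}):\n",
        "        return alpha * toy_bigram[(w2,)][w3]\n",
        "    return alpha ** 2 * toy_unigram.get(w3, 0.0)\n",
        "\n",
        "print(backoff_prob('they', 'care', 'about'))  # trigram hit: 0.6\n",
        "print(backoff_prob('they', 'care', 'of'))     # bigram back-off: alpha * 0.3\n",
        "print(backoff_prob('they', 'care', 'the'))    # unigram back-off: alpha**2 * 0.1\n",
        "```"
      ]
    },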
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-Nb83BwuFzbN"
      },
      "outputs": [],
      "source": [
        "def backoff(trigram_model, bigram_model, unigram_model, w1, w2, alpha):\n",
        "\n",
        "  # given the context w1_w2, use backoff to find word with the highest probability:\n",
        "  # for every w in vocabulary, check if p(w|w1_w2) exists, otherwise back off to bigrams\n",
        "  # if there is no p(w|w2), then back off to unigram probability\n",
        "\n",
        "  # use the computed trigram, bigram and unigram models\n",
        "\n",
        "  # greedy decoding: you should return next_word, probability\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MVJ3dlTGSwv1"
      },
      "source": [
        "The function can now be used during generation, as you can see in the following cell."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-kyvsVk0IkTG"
      },
      "outputs": [],
      "source": [
        "def backoff_generation(trigram_model, bigram_model, unigram_model, seed_words, alpha, max_length=15):\n",
        "\n",
        "  w1, w2 = seed_words\n",
        "  generated_words = [w1, w2]\n",
        "\n",
        "  log_prob_sum = 0.0\n",
        "  N = 0\n",
        "\n",
        "  for _ in range(max_length):\n",
        "\n",
        "    next_word, probability = backoff(trigram_model, bigram_model, unigram_model, w1, w2, alpha)\n",
        "\n",
        "    generated_words.append(next_word)\n",
        "    log_prob_sum += math.log(probability)\n",
        "    N += 1\n",
        "\n",
        "    w1, w2 = w2, next_word\n",
        "\n",
        "  perplexity = math.exp(-log_prob_sum / N) if N > 0 else float('inf')\n",
        "\n",
        "  return ' '.join(generated_words), perplexity"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "An_EjQ6uh8XU"
      },
      "outputs": [],
      "source": [
        "seed = (\"they\", \"care\")\n",
        "generated_sentence, perplexity = backoff_generation(trigram_model, bigram_model, unigram_model, seed, alpha)\n",
        "print(\"Generated sentence:\", generated_sentence)\n",
        "print(\"Perplexity:\", perplexity)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "iMWbwCkbh8VF"
      },
      "outputs": [],
      "source": [
        "seed = (\"there\", \"was\")\n",
        "generated_sentence, perplexity = backoff_generation(trigram_model, bigram_model, unigram_model, seed, alpha)\n",
        "print(\"Generated sentence:\", generated_sentence)\n",
        "print(\"Perplexity:\", perplexity)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "G8dfyI-lOXA8"
      },
      "source": [
        "### Ex. 2: typo detection"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Wfu66dcLdEkR"
      },
      "outputs": [],
      "source": [
        "import torch\n",
        "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
        "import matplotlib.pyplot as plt\n",
        "import math\n",
        "import pandas as pd"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rFyPASTvbgT0"
      },
      "source": [
        "In this exercise, you are asked to use a Large Language Model (GPT2) to detect typos in sentences.\n",
        "\n",
        "Let's start by loading our dataset: it is a collection of 52 sentences of various types, half of which are grammatically sound."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Nu0rFhgMb9OH"
      },
      "outputs": [],
      "source": [
        "df = pd.read_csv(\"/content/NLP_LAB3_Datasets/typo_dataset1.csv\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "r_1_EhaYcVsb"
      },
      "source": [
        "You can see that these sentences contain typos (i.e., spelling errors or missing letters) but also grammatical errors like homophone confusions."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JnpcbWfHkVdE"
      },
      "outputs": [],
      "source": [
        "print(f\"Text: {df['text'][26]} \\nLabel: {df['label'][26]} (Typo)\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "HJGkYxE4kY10"
      },
      "outputs": [],
      "source": [
        "print(f\"Text: {df['text'][21]} \\nLabel: {df['label'][21]} (Correct)\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ECDjPw5_yAVR"
      },
      "source": [
        "Let's import GPT2."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "z4NFZ4JYx_vK"
      },
      "outputs": [],
      "source": [
        "model_name = \"gpt2\"\n",
        "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
        "model = AutoModelForCausalLM.from_pretrained(model_name)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0Jn-NCTpg1Vx"
      },
      "source": [
        "We will use the token \\<s> as our marker for the beginning of a sentence. First, we need to add it to the tokenizer’s vocabulary, then set it as the bos_token, and finally, resize the model’s token embeddings.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9b0NJnzuguns"
      },
      "outputs": [],
      "source": [
        "bos_token = \"<s>\"\n",
        "tokenizer.add_tokens([bos_token])\n",
        "tokenizer.bos_token = bos_token\n",
        "model.resize_token_embeddings(len(tokenizer))\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dj2GAoOAwcw5"
      },
      "source": [
        "**Ex 2.1** Write a function that returns the log-probability assigned to each generated token.\n",
        "\n",
        "You can use the `get_next_word_probs` function as a reference, but remember that this time we’re focusing on the distribution of tokens **within** the sentence, rather than the probability distribution of the next token. You can follow the comments we left in the box to guide you in the implementation.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "VFsqrV9cwcPG"
      },
      "outputs": [],
      "source": [
        "def get_token_logprobs(sentence):\n",
        "\n",
        "    ## tokenize the sentence, compute the output and retrieve logits\n",
        "\n",
        "    #\n",
        "    #\n",
        "    #\n",
        "\n",
        "    ## remember: logits.shape is (1, sen_len, model_size])\n",
        "    ## remove the logits relative to the last token: we are not interested in next token generation\n",
        "    ## expected shape: (sen_len, model_size) (suggestion: use squeeze))\n",
        "    #\n",
        "\n",
        "    ## retrieve the indices (input_ids) of the sentence\n",
        "    ## hint: remove the input_id relative to the bos\n",
        "    ## expected shape: (sen_len)\n",
        "    indices =\n",
        "\n",
        "    ## compute log-probabilities\n",
        "    log_probs =\n",
        "\n",
        "    ## retrieve the probabilities of the tokens\n",
        "    token_logprobs = log_probs[range(len(indices)), indices]\n",
        "\n",
        "    ## convert input ids to tokens to obtain a list of tokens\n",
        "    #\n",
        "\n",
        "    return tokens, token_logprobs.tolist()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "x9vDzAHyyhHN"
      },
      "source": [
        "**Ex 2.2** Write a function that returns the cumulative log-probability **up to each token** in a sentence, using the probabilities computed before.\n",
        "\n",
        "Remember that, as we are working with log-probabilities, you should **sum** the log-probabilities of the individual tokens.\n",
        "\n",
        "Hint: you could return a list of elements like (w, cumulative_probability_up_to_w)\n"
      ]
    },
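    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "With made-up per-token log-probabilities (stand-ins for what `get_token_logprobs` returns), the cumulative value up to each token is just a running sum, e.g. via `itertools.accumulate`:\n",
        "\n",
        "```python\n",
        "from itertools import accumulate\n",
        "\n",
        "# hypothetical tokens and per-token log-probabilities\n",
        "tokens = ['They', ' drank', ' water']\n",
        "token_logprobs = [-2.0, -1.5, -0.5]\n",
        "\n",
        "# cumulative log-probability up to each token = running sum of the log-probs\n",
        "cumulative = list(zip(tokens, accumulate(token_logprobs)))\n",
        "print(cumulative)  # [('They', -2.0), (' drank', -3.5), (' water', -4.0)]\n",
        "```"
      ]
    },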
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "mMRMk3OCerCe"
      },
      "outputs": [],
      "source": [
        "def get_cumulative_token_logprobs(sentence):\n",
        "\n",
        "    tokens, token_logprobs = get_token_logprobs(sentence)\n",
        "\n",
        "    # write your code here\n",
        "\n",
        "    return cumulative_logprobs"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OrpmnW8apG83"
      },
      "source": [
        "**Ex. 2.3** Write a function that determines whether a sentence contains typos or not based on the difference of log-probabilities of consecutive tokens/words. The hypothesis is that a significant drop in probability between consecutive tokens may indicate that the model did not expect that token, potentially signaling a typo.\n",
        "\n",
        "To implement this, we can define a threshold: if the difference between the log-probabilities of consecutive tokens exceeds this threshold, we can assume the sentence likely contains a typo.\n",
        "\n",
        "The function should take as input the sentence to be analyzed and the threshold to apply for detection, and it should return 0 if the sentence is flagged as potentially containing typos, or 1 if the sentence is considered correct.\n",
        "\n",
        "In the function you should compare consecutive log-probabilities and check whether, for at least one pair, their difference is above the set threshold: in that case, classify the whole sentence as incorrect."
      ]
    },
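    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As an illustration of the detection rule only (hypothetical numbers and a hypothetical helper name; your `detect_typos` should build on `get_cumulative_token_logprobs`), a drop of more than `threshold` between two consecutive log-probabilities flags the sentence:\n",
        "\n",
        "```python\n",
        "# hypothetical per-token log-probabilities of a sentence\n",
        "logprobs = [-2.0, -2.5, -9.0, -3.0]\n",
        "\n",
        "def has_big_drop(logprobs, threshold=5):\n",
        "    # flag if any consecutive pair drops by more than the threshold\n",
        "    return any(prev - cur > threshold for prev, cur in zip(logprobs, logprobs[1:]))\n",
        "\n",
        "print(0 if has_big_drop(logprobs) else 1)  # 0: the jump from -2.5 to -9.0 exceeds 5\n",
        "```"
      ]
    },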
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "TZ7E5Ql4uvfB"
      },
      "outputs": [],
      "source": [
        "def detect_typos(sentence, threshold=5):\n",
        "\n",
        "    word_logprobs = get_cumulative_token_logprobs(sentence)\n",
        "\n",
        "    # write your code here\n",
        "\n",
        "    return label"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aYvbSYsmqK_K"
      },
      "source": [
        "**Ex 2.4** Test your function, experimenting with different thresholds.\n",
        "Select the best threshold on the whole dataset and report the accuracy on Moodle.\n",
        "\n",
        "You should achieve an accuracy higher than 0.70."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": true,
        "id": "8sBYpkpMvzTd"
      },
      "outputs": [],
      "source": [
        "correct = 0\n",
        "\n",
        "for i, row in df.iterrows():\n",
        "  sentence, label = row[1], row[2]\n",
        "  pred = detect_typos(sentence, threshold=)\n",
        "\n",
        "  if label == pred:\n",
        "    correct += 1\n",
        "\n",
        "print(f\"Accuracy: {correct/len(df)}\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AsB3Ud3joC9g"
      },
      "source": [
        "**Error Analysis**: in this exercise, we exploited the probability distribution of a Large Language Model to detect typos.\n",
        "Did you notice any linguistic or grammatical features that make detection more accurate? Do you think relying solely on threshold-based differences in log-probabilities is sufficient, or should we use a more sophisticated and complete system? Write your comments in the box below.\n",
        "\n"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "T4",
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
