{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Spam detector with Naive Bayes"
      ],
      "metadata": {
        "id": "lhPnC4l55PU6"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "I built a simple Naive Bayes model that classifies an email as spam or not spam."
      ],
      "metadata": {
        "id": "camrwHDm7hX4"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Libraries"
      ],
      "metadata": {
        "id": "VjpMMYCl6Yeh"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "from sklearn.feature_extraction.text import CountVectorizer\n",
        "from sklearn.naive_bayes import MultinomialNB"
      ],
      "metadata": {
        "id": "FyH-QzuI6cMv"
      },
      "execution_count": 1,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Training and test examples"
      ],
      "metadata": {
        "id": "0yrf6ipV6eoI"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "X_train = [\"Hi, how are you?\", #not spam -> [0]\n",
        "           \"You won a new phone!!! Click on the link to continue\", #spam -> [1]\n",
        "           \"I need your document within the day\", #[0]\n",
        "           \"Get this new vacuum cleaner for free\", #[1]\n",
        "           \"hello, how about meeting us tomorrow morning, please?\", #[0]\n",
        "           \"I discovered a method to generate free money, use the bottom link to continue\", #[1]\n",
        "           \"I don't find the file you told me yesterday\", #[0]\n",
        "           \"If we can continue our work today it would be amazing, let me know\"] #[0]\n",
        "y_train = [0, 1, 0, 1, 0, 1, 0, 0]  # 0 means \"not spam\", 1 means \"spam\"\n",
        "X_test = [\"Hi, where can I find the link you told me? I need it to continue my work\", \n",
        "          \"Hello, you won a tv, please click here to proceed\", \n",
        "          \"today I have no more free time, can we talk tomorrow?\"]"
      ],
      "metadata": {
        "id": "XwysmTt66iP5"
      },
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "In the test set, the first and third emails are not spam and the second is spam, so the model should output: [0, 1, 0]"
      ],
      "metadata": {
        "id": "VGS8Av8Bm76Q"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Clean the text data and build a document-term matrix (one row per email, one column per word)"
      ],
      "metadata": {
        "id": "lDrKVBNO6oUh"
      }
    },
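    {
      "cell_type": "code",
      "source": [
        "vectorizer = CountVectorizer()\n",
        "# Learn the vocabulary from the training emails and count tokens per email\n",
        "X_train = vectorizer.fit_transform(X_train)\n",
        "# Reuse the same vocabulary for the test emails (words not seen in training are ignored)\n",
        "X_test = vectorizer.transform(X_test)"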
      ],
      "metadata": {
        "id": "wmOK-ksm65yZ"
      },
      "execution_count": 3,
      "outputs": []
    },
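    {
      "cell_type": "markdown",
      "source": [
        "To see what the vectorizer produced, the next cell is a minimal sketch that prints the vocabulary size, the matrix shape, and the token counts of the first training email. It assumes the `vectorizer` and the transformed `X_train` from the cell above and a recent scikit-learn version that provides `get_feature_names_out`; `vocab` is a name introduced only for illustration."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Sketch: inspect the document-term matrix (assumes the previous cell has run)\n",
        "vocab = vectorizer.get_feature_names_out()  # learned tokens, one per matrix column\n",
        "print(len(vocab), 'distinct tokens')\n",
        "print(X_train.shape)  # (number of emails, vocabulary size)\n",
        "print(X_train.toarray()[0])  # token counts for the first training email"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },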
    {
      "cell_type": "markdown",
      "source": [
        "Create the model and train it"
      ],
      "metadata": {
        "id": "27W4suWd68vB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "model = MultinomialNB()\n",
        "model.fit(X_train, y_train)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "lcJp5uF17EqB",
        "outputId": "fe5ae5c6-a23d-41ef-d6b2-1af0d70d8995"
      },
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "MultinomialNB()"
            ]
          },
          "metadata": {},
          "execution_count": 4
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Predict on the test set and print the results"
      ],
      "metadata": {
        "id": "6IKJUYxu7JEC"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "predictions = model.predict(X_test)\n",
        "print(predictions)  # prints 0 or 1 (0 is \"not spam\" class and 1 is \"spam\" class)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Y3E_WiR97Poa",
        "outputId": "8e584f11-a2e8-48b4-f65f-7b30ce61242c"
      },
      "execution_count": 5,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[0 1 0]\n"
          ]
        }
      ]
    },
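    {
      "cell_type": "markdown",
      "source": [
        "The hard 0/1 labels hide how confident the model is. As a rough sketch, the next cell prints the class probabilities for each test email with `predict_proba`. It assumes the trained `model` and the vectorized `X_test` from the cells above; `probabilities` is a name introduced only for illustration."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# Sketch: class probabilities per test email (assumes the previous cells have run)\n",
        "probabilities = model.predict_proba(X_test)  # columns follow model.classes_, here [0, 1]\n",
        "for row in probabilities:\n",
        "    print('not spam: %.2f, spam: %.2f' % (row[0], row[1]))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },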
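    {
      "cell_type": "markdown",
      "source": [
        "The output [0 1 0] matches the expected labels, so the model classifies all three test emails correctly. With only eight training emails, this is a sanity check rather than a measure of real-world accuracy."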
      ],
      "metadata": {
        "id": "0fV6NKuHmweQ"
      }
    }
  ]
}