{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Spam detector with Naive Bayes" ], "metadata": { "id": "lhPnC4l55PU6" } }, { "cell_type": "markdown", "source": [ "\n", "This notebook builds a simple Naive Bayes model that classifies an email as spam or not spam." ], "metadata": { "id": "camrwHDm7hX4" } }, { "cell_type": "markdown", "source": [ "Libraries" ], "metadata": { "id": "VjpMMYCl6Yeh" } }, { "cell_type": "code", "source": [ "import numpy as np\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.naive_bayes import MultinomialNB" ], "metadata": { "id": "FyH-QzuI6cMv" }, "execution_count": 1, "outputs": [] }, { "cell_type": "markdown", "source": [ "Training examples" ], "metadata": { "id": "0yrf6ipV6eoI" } }, { "cell_type": "code", "source": [ "X_train = [\"Hi, how are you?\", #not spam -> [0]\n", " \"You won a new phone!!! Click on the link to continue\", #spam -> [1]\n", " \"I need your document within the day\", #[0]\n", " \"Get this new vacuum cleaner for free\", #[1]\n", " \"hello, how about meeting us tomorrow morning, please?\", #[0]\n", " \"I discovered a method to generate free money, use the bottom link to continue\", #[1]\n", " \"I don't find the file you told me yesterday\", #[0]\n", " \"If we can continue our work today it would be amazing, let me know\"] #[0]\n", "y_train = [0, 1, 0, 1, 0, 1, 0, 0] # 0 means \"not spam\", 1 means \"spam\"\n", "X_test = [\"Hi, where can I find the link you told me? 
I need it to continue my work\", \n", " \"Hello, you won a tv, please click here to proceed\", \n", " \"today I have no more free time, can we talk tomorrow?\"]" ], "metadata": { "id": "XwysmTt66iP5" }, "execution_count": 2, "outputs": [] }, { "cell_type": "markdown", "source": [ "In the test set, the first and third emails are not spam and the second is spam, so the model should output [0, 1, 0]." ], "metadata": { "id": "VGS8Av8Bm76Q" } }, { "cell_type": "markdown", "source": [ "Tokenize the text and build a document-term matrix of word counts" ], "metadata": { "id": "lDrKVBNO6oUh" } }, { "cell_type": "code", "source": [ "vectorizer = CountVectorizer()\n", "X_train = vectorizer.fit_transform(X_train)\n", "X_test = vectorizer.transform(X_test)" ], "metadata": { "id": "wmOK-ksm65yZ" }, "execution_count": 3, "outputs": [] }, { "cell_type": "markdown", "source": [ "Create the model and train it" ], "metadata": { "id": "27W4suWd68vB" } }, { "cell_type": "code", "source": [ "model = MultinomialNB()\n", "model.fit(X_train, y_train)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "lcJp5uF17EqB", "outputId": "fe5ae5c6-a23d-41ef-d6b2-1af0d70d8995" }, "execution_count": 4, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "MultinomialNB()" ] }, "metadata": {}, "execution_count": 4 } ] }, { "cell_type": "markdown", "source": [ "Predict on the test set and print the results" ], "metadata": { "id": "6IKJUYxu7JEC" } }, { "cell_type": "code", "source": [ "predictions = model.predict(X_test)\n", "print(predictions) # prints 0 or 1 (0 is \"not spam\" class and 1 is \"spam\" class)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Y3E_WiR97Poa", "outputId": "8e584f11-a2e8-48b4-f65f-7b30ce61242c" }, "execution_count": 5, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "[0 1 0]\n" ] } ] }, { "cell_type": "markdown", "source": [ "The output matches the expected labels, so the model classifies all three test emails correctly." 
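, "\n", "\n", "For reference, here is a sketch of the rule `MultinomialNB` applies when it predicts, assuming its default settings (smoothing `alpha=1.0`, class priors estimated from `y_train`). The symbols are notation introduced for this note, not names from the code:\n", "\n", "$$\\hat{y} = \\arg\\max_{c} \\left[ \\log P(c) + \\sum_{i=1}^{n} x_i \\log \\theta_{c,i} \\right], \\qquad \\theta_{c,i} = \\frac{N_{c,i} + \\alpha}{N_c + \\alpha n}$$\n", "\n", "Here $x_i$ is the count of vocabulary word $i$ in the email, $N_{c,i}$ is how often word $i$ appears in the training emails of class $c$, $N_c = \\sum_i N_{c,i}$, and $n$ is the vocabulary size. With the training labels above, $P(0) = 5/8$ and $P(1) = 3/8$. Words in a test email that never appeared in training (such as \"tv\") are not in the vectorizer's vocabulary, so they do not contribute to the sum."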
], "metadata": { "id": "0fV6NKuHmweQ" } } ] }