{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Kuzushiji Classification with Support Vector Machines\n", "\n", "In this notebook we are going to explore the use of Support Vector Machines (SVM) for image classification. We will use a variant of the famous MNIST dataset (the original is a dataset of handwritten digits). The version we are going to use is called Kuzushiji-MNIST or K-MNIST for short (https://github.com/rois-codh/kmnist) and is a dataset of traditional japanese handwritten kana.\n", "\n", "\n", "\n", "The dataset labels are the following:\n", "\n", "| Label | Hiragana Character | Romanji (Pronunciation) |\n", "| :-: | :-: | :-: |\n", "| 0 | お | o |\n", "| 1 | き | ki |\n", "| 2 | す | su |\n", "| 3 | つ | tsu |\n", "| 4 | な | na |\n", "| 5 | は | ha |\n", "| 6 | ま | ma |\n", "| 7 | や | ya |\n", "| 8 | れ | re |\n", "| 9 | を | wo |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TODO: Insert your surname, name and ID number\n", "\n", "Student surname:\n", "\n", "Student name:\n", " \n", "ID:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#load the required packages\n", "\n", "%matplotlib inline \n", "\n", "import numpy as np\n", "import scipy as sp\n", "import matplotlib.pyplot as plt\n", "\n", "import sklearn\n", "from sklearn.datasets import fetch_openml\n", "from sklearn.neural_network import MLPClassifier\n", "from sklearn.decomposition import PCA\n", "import sklearn.metrics as skm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# helper function to load Kuzushiji-MNIST dataset\n", "def load_mnist(path, kind='train'):\n", " import os\n", " import gzip\n", " import numpy as np\n", " labels_path = os.path.join(path, 'K%s-labels-idx1-ubyte.gz' % kind)\n", " images_path = os.path.join(path, 'K%s-images-idx3-ubyte.gz' % kind)\n", " with gzip.open(labels_path, 'rb') as lbpath:\n", " labels = np.frombuffer(lbpath.read(), dtype=np.uint8,offset=8)\n", " with gzip.open(images_path, 'rb') as imgpath:\n", " images = np.frombuffer(imgpath.read(), dtype=np.uint8,offset=16).reshape(len(labels), 784)\n", " return images, labels" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#fix your ID (\"numero di matricola\") and the seed for random generator (as usual you can try different seeds)\n", "ID = # place a random seed\n", "np.random.seed(ID)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#load the K-MNIST dataset from the 'data' folder and let's normalize the features so that each value is in [0,1] \n", "\n", "X, y = load_mnist('data', kind='train')\n", "# rescale the data\n", "X, y = X / 255., y # original pixel values are between 0 and 255\n", "print(X.shape, y.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now split into training and test. Make sure that each label is present at least 10 times\n", "in training. If it is not, then keep adding permutations to the initial data until this \n", "happens." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Random permute the data and split into training and test taking the first 600\n", "# data samples as training and 4000 samples as test\n", "permutation = np.random.permutation(X.shape[0])\n", "\n", "X = X[permutation]\n", "y = y[permutation]\n", "\n", "m_training = 600\n", "m_test = 4000\n", "\n", "X_train, X_test = X[:m_training], X[m_training:m_training+m_test:]\n", "y_train, y_test = y[:m_training], y[m_training:m_training+m_test:]\n", "\n", "labels, freqs = np.unique(y_train, return_counts=True)\n", "print(\"Labels in training dataset: \", labels)\n", "print(\"Frequencies in training dataset: \", freqs)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#function for plotting a image and printing the corresponding label\n", "def plot_input(X_matrix, labels, index):\n", " print(\"INPUT:\")\n", " plt.imshow(\n", " X_matrix[index].reshape(28,28),\n", " cmap = plt.cm.gray_r,\n", " interpolation = \"nearest\"\n", " )\n", " plt.show()\n", " print(\"LABEL: %i\"%labels[index])\n", " return" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#let's try the plotting function\n", "plot_input(X_train,y_train,5)\n", "plot_input(X_test,y_test,50)\n", "plot_input(X_test,y_test,500)\n", "plot_input(X_test,y_test,700)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TO DO 1\n", "Use a SVM classifier with cross validation to pick a model. Use a 4-fold cross-validation. Let's start with a Linear kernel:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#import SVC\n", "from sklearn.svm import SVC\n", "#import for Cross-Validation\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "# parameters for linear SVM\n", "parameters = {'C': [0.01, 0.1, 1, 10]}\n", "\n", "#train linear SVM\n", "\n", "# ADD YOUR CODE\n", "\n", "print ('RESULTS FOR LINEAR KERNEL')\n", "\n", "print(\"Best parameters set found:\")\n", "# ADD YOUR CODE\n", "\n", "print(\"Score with best parameters:\")\n", "# ADD YOUR CODE\n", "\n", "print(\"All scores on the grid:\")\n", "# ADD YOUR CODE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TO DO 2\n", "Pick a model for the Polynomial kernel with degree=2:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# parameters for poly with degree 2 kernel\n", "parameters = {'C': [0.01, 0.1, 1],'gamma':[0.01,0.1,1]}\n", "\n", "#run SVM with poly of degree 2 kernel\n", "\n", "# ADD YOUR CODE\n", "\n", "print ('RESULTS FOR POLY DEGREE=2 KERNEL')\n", "\n", "print(\"Best parameters set found:\")\n", "# ADD YOUR CODE\n", "\n", "print(\"Score with best parameters:\")\n", "# ADD YOUR CODE\n", "\n", "print(\"\\nAll scores on the grid:\")\n", "# ADD YOUR CODE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TO DO 3\n", "\n", "Now let's try a higher degree for the polynomial kernel (e.g., 3rd degree)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# parameters for poly with higher degree kernel\n", "parameters = {'C': [0.01, 0.1, 1],'gamma':[0.01,0.1,1]}\n", "\n", "#run SVM with poly of higher degree kernel\n", "degree = 3\n", "\n", "# ADD YOUR CODE\n", "\n", "print ('RESULTS FOR POLY DEGREE=', degree, ' KERNEL')\n", "\n", "print(\"Best parameters set found:\")\n", "# ADD YOUR CODE\n", "\n", "print(\"Score with best parameters:\")\n", "# ADD YOUR CODE\n", "\n", "print(\"\\nAll scores on the grid:\")\n", "# ADD YOUR CODE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TO DO 4\n", "Pick a model for the Radial Basis Function kernel:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# parameters for rbf SVM\n", "parameters = {'C': [0.1, 1, 10, 100],'gamma':[0.001, 0.01, 0.1,1]}\n", "\n", "#run SVM with rbf kernel\n", "\n", "# ADD YOUR CODE\n", "\n", "print ('RESULTS FOR rbf KERNEL')\n", "\n", "print(\"Best parameters set found:\")\n", "# ADD YOUR CODE\n", "\n", "print(\"Score with best parameters:\")\n", "# ADD YOUR CODE\n", "\n", "print(\"\\nAll scores on the grid:\")\n", "# ADD YOUR CODE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## QUESTION 1\n", "What do you observe when using linear, polynomial and RBF kernels on this dataset ?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TO DO 5\n", "Report here the best SVM kernel and parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#get training and test error for the best SVM model from CV\n", "best_SVM = # USE YOUR OPTIMAL PARAMETERS\n", "\n", "# ADD YOUR CODE\n", "\n", "# (error is 1 - svm.score)\n", "\n", "print (\"Best SVM training error: %f\" % training_error)\n", "print (\"Best SVM test error: %f\" % test_error)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TO DO 6\n", "\n", "Analyze how the gamma parameter (inversely proportional to standard deviation of Gaussian Kernel) impact the performances of the classifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Test with different values of gamma\n", "\n", "# Set gamma values\n", "gamma_values = np.logspace(-5,2,8)\n", "print(gamma_values)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Try the SVM with the previously set values of gamma\n", "# use rbf kernel and C=1\n", "\n", "train_acc_list, test_acc_list = [], []\n", "\n", " \n", "# ADD YOUR CODE TO TRAIN THE SVM MULTIPLE TIMES WITH THE DIFFERENT VALUES OF GAMMA\n", "# PLACE THE TRAIN AND TEST ACCURACY FOR EACH TEST IN THE TRAIN AND TEST ACCURACY LISTS\n", "\n", "# Plot\n", "fig, ax = plt.subplots(1,2, figsize=(15,5))\n", "\n", "ax[0].plot(gamma_values, train_acc_list)\n", "ax[0].set_xscale('log')\n", "ax[0].set_xlabel('gamma')\n", "ax[0].set_ylabel('Train accuracy')\n", "ax[0].grid(True)\n", "\n", "ax[1].plot(gamma_values, test_acc_list)\n", "ax[1].set_xscale('log')\n", "ax[1].set_xlabel('gamma')\n", "ax[1].set_ylabel('Test accuracy')\n", "ax[1].grid(True)\n", "\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## QUESTION 2\n", "How do the train and test error change when changing gamma ? Which is the best value of gamma ? \n", "Connect your answers to the discussion about the overfitting issue." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More data\n", "Now let's do the same but using more data points for training.\n", "\n", "\n", "Choose a new number of data points." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = X[permutation]\n", "y = y[permutation]\n", "\n", "m_training = 2000 # TODO number of data points, adjust depending on the capabilities of your PC\n", "\n", "X_train, X_test = X[:m_training], X[m_training:]\n", "y_train, y_test = y[:m_training], y[m_training:]\n", "\n", "labels, freqs = np.unique(y_train, return_counts=True)\n", "print(\"Labels in training dataset: \", labels)\n", "print(\"Frequencies in training dataset: \", freqs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TO DO 7\n", "\n", "Let's try to use SVM with parameters obtained from the best model for $m_{training} = 2000$. Since it may take a long time to run, you can decide to just let it run for some time and stop it if it does not complete. If you decide to do this, report it in the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "#get training and test error for the best SVM model from CV\n", "\n", "# ADD YOUR CODE\n", "\n", "print (\"Best SVM training error: %f\" % training_error)\n", "print (\"Best SVM test error: %f\" % test_error)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just for comparison, let's also use logistic regression \n", "\n", "## TO DO 8 Try first without regularization (use a very large large C)¶" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import linear_model\n", "\n", "# ADD YOUR CODE\n", "\n", "print (\"Best logistic regression training error: %f\" % training_error)\n", "print (\"Best logistic regression test error: %f\" % test_error)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TO DO 9 Try with regularization (use C=1)¶" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ADD YOUR CODE\n", "\n", "print (\"Best regularized logistic regression training error: %f\" % training_error)\n", "print (\"Best regularized logistic regression test error: %f\" % test_error)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## QUESTION 3\n", "Compare and discuss:\n", "- the results from SVM with m=600 and with m=2000 training data points. If you stopped the SVM, include such aspect in your comparison.\n", "- the results of SVM and of Logistic Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TO DO 10\n", "Plot an item of clothing that is missclassified by logistic regression and correctly classified by SVM." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "LR_prediction = # ADD CODE\n", "SVM_prediction = # ADD CODE\n", "\n", "# ADD CODE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TO DO 11\n", "Plot the confusion matrix for the SVM classifier and for logistic regression.\n", "The confusion matrix has one column for each predicted label and one row for each true label. 
\n", "It shows for each class in the corresponding row how many samples belonging to that class gets each possible output label.\n", "Notice that the diagonal contains the correctly classified samples, while the other cells correspond to errors.\n", "You can obtain it with the sklearn.metrics.confusion_matrix function (see the documentation).\n", "Try also to normalize the confusion matrix by the number of samples in each class in order to measure the accuracy on each single class.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.set_printoptions(precision=2, suppress=True) # for better aligned printing of confusion matrix use floatmode='fixed'\n", "\n", "u, counts = np.unique(y_test, return_counts=True)\n", "print(\"Labels and frequencies in test set: \", counts)\n", "\n", "confusion_SVM = # ADD CODE\n", "print(\"\\n Confusion matrix SVM \\n \\n\", confusion_SVM)\n", "print(\"\\n Confusion matrix SVM (normalized) \\n \\n\", confusion_SVM /counts[:,None] )\n", "\n", "confusion_LR = # ADD CODE\n", "print(\"\\n Confusion matrix LR \\n \\n\", confusion_LR)\n", "print(\"\\n Confusion matrix LR (normalized) \\n \\n\", confusion_LR /counts[:,None] )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ADD CODE TO NORMALIZE CONFUSION MATRIX AND PRINT THE NORMALIZED MATRIX\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## QUESTION 4\n", "Have a look at the confusion matrices and comment on the obtained accuracies. Why some classes have lower accuracies and others an higher one ? Make some guesses on the possible causes.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }