# Kuzushiji Classification with Support Vector Machines

In this notebook we explore the use of Support Vector Machines (SVMs) for image classification. We will use a variant of the famous MNIST dataset (the original is a dataset of handwritten digits). The version we use is called Kuzushiji-MNIST, or K-MNIST for short (https://github.com/rois-codh/kmnist), and is a dataset of traditional Japanese handwritten kana.



The dataset labels are the following:

| Label | Hiragana Character | Romaji (Pronunciation) |
| :-: | :-: | :-: |
| 0 | お | o |
| 1 | き | ki |
| 2 | す | su |
| 3 | つ | tsu |
| 4 | な | na |
| 5 | は | ha |
| 6 | ま | ma |
| 7 | や | ya |
| 8 | れ | re |
| 9 | を | wo |

## TODO: Insert your surname, name and ID number

Student surname:

Student name:
 
ID:

In [None]:
#load the required packages

%matplotlib inline 

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

import sklearn
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
import sklearn.metrics as skm

In [None]:
# helper function to load Kuzushiji-MNIST dataset
def load_mnist(path, kind='train'):
    import os
    import gzip
    import numpy as np
    labels_path = os.path.join(path, 'K%s-labels-idx1-ubyte.gz' % kind)
    images_path = os.path.join(path, 'K%s-images-idx3-ubyte.gz' % kind)
    # labels file: skip the 8-byte header, then one byte per label
    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8)
    # images file: skip the 16-byte header, then 784 bytes (28x28 pixels) per image
    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16).reshape(len(labels), 784)
    return images, labels

In [None]:
#fix your ID ("numero di matricola") and the seed for random generator (as usual you can try different seeds)
ID = # place a random seed
np.random.seed(ID)

In [None]:
#load the K-MNIST dataset from the 'data' folder and let's normalize the features so that each value is in [0,1] 

X, y = load_mnist('data', kind='train')
# rescale the data
X, y = X / 255., y # original pixel values are between 0 and 255
print(X.shape, y.shape)

Now split the data into training and test sets. Make sure that each label appears at least 10 times
in the training set. If it does not, keep re-permuting the data until this
happens.
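
The check described above can be sketched as a small loop. This is only an illustration on synthetic data: the helper name `split_with_min_count`, its parameters, and the arrays `X_demo` and `y_demo` are assumptions and are not part of the notebook.

```python
import numpy as np

def split_with_min_count(X, y, m_training, min_count=10, seed=0, max_tries=100):
    """Re-permute (X, y) until every label occurs at least min_count times
    among the first m_training samples. Hypothetical helper for illustration."""
    rng = np.random.default_rng(seed)
    n_labels = len(np.unique(y))
    for _ in range(max_tries):
        perm = rng.permutation(X.shape[0])
        X_p, y_p = X[perm], y[perm]
        labels, freqs = np.unique(y_p[:m_training], return_counts=True)
        # every label must be present, and each at least min_count times
        if len(labels) == n_labels and freqs.min() >= min_count:
            return X_p, y_p
    raise RuntimeError("no valid permutation found within max_tries")

# Synthetic stand-in for the K-MNIST arrays (10 classes, 16 features)
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.random((2000, 16))
y_demo = demo_rng.integers(0, 10, size=2000)

X_perm, y_perm = split_with_min_count(X_demo, y_demo, m_training=600)
train_labels, train_freqs = np.unique(y_perm[:600], return_counts=True)
print("Labels in training set:", train_labels)
print("Frequencies in training set:", train_freqs)
```

With 600 training samples and 10 roughly balanced classes, the first permutation will almost always pass the check, so the loop mainly acts as a safeguard.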

In [None]:
# Randomly permute the data and split into training and test, taking the first 600
# samples as training and the next 4000 samples as test
permutation = np.random.permutation(X.shape[0])

X = X[permutation]
y = y[permutation]

m_training = 600
m_test = 4000

X_train, X_test = X[:m_training], X[m_training:m_training+m_test]
y_train, y_test = y[:m_training], y[m_training:m_training+m_test]

labels, freqs = np.unique(y_train, return_counts=True)
print("Labels in training dataset: ", labels)
print("Frequencies in training dataset: ", freqs)


In [None]:
#function for plotting an image and printing the corresponding label
def plot_input(X_matrix, labels, index):
    print("INPUT:")
    plt.imshow(
        X_matrix[index].reshape(28,28),
        cmap = plt.cm.gray_r,
        interpolation = "nearest"
    )
    plt.show()
    print("LABEL: %i"%labels[index])
    return

In [None]:
#let's try the plotting function
plot_input(X_train,y_train,5)
plot_input(X_test,y_test,50)
plot_input(X_test,y_test,500)
plot_input(X_test,y_test,700)

## TO DO 1
Use an SVM classifier with cross-validation to pick a model. Use 4-fold cross-validation. Let's start with a linear kernel:
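
As a rough guide to the mechanics, the sketch below runs a 4-fold grid search over `C` for a linear-kernel `SVC`. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the K-MNIST split), so the scores it prints say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (not K-MNIST): 300 samples, 20 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

param_grid = {"C": [0.01, 0.1, 1, 10]}

# cv=4 requests 4-fold cross-validation for every value of C in the grid
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=4)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
print("Mean CV accuracy per grid point:", search.cv_results_["mean_test_score"])
```

The same pattern carries over to the other kernels by changing the `SVC` arguments and the parameter grid.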

In [None]:
#import SVC
from sklearn.svm import SVC
#import for Cross-Validation
from sklearn.model_selection import GridSearchCV

# parameters for linear SVM
parameters = {'C': [0.01, 0.1, 1, 10]}

#train linear SVM

# ADD YOUR CODE

print ('RESULTS FOR LINEAR KERNEL')

print("Best parameters set found:")
# ADD YOUR CODE

print("Score with best parameters:")
# ADD YOUR CODE

print("All scores on the grid:")
# ADD YOUR CODE

## TO DO 2
Pick a model for the Polynomial kernel with degree=2:

In [None]:
# parameters for poly with degree 2 kernel
parameters = {'C': [0.01, 0.1, 1],'gamma':[0.01,0.1,1]}

#run SVM with poly of degree 2 kernel

# ADD YOUR CODE

print ('RESULTS FOR POLY DEGREE=2 KERNEL')

print("Best parameters set found:")
# ADD YOUR CODE

print("Score with best parameters:")
# ADD YOUR CODE

print("\nAll scores on the grid:")
# ADD YOUR CODE

## TO DO 3

Now let's try a higher degree for the polynomial kernel (e.g., 3rd degree).

In [None]:
# parameters for poly with higher degree kernel
parameters = {'C': [0.01, 0.1, 1],'gamma':[0.01,0.1,1]}

#run SVM with poly of higher degree kernel
degree = 3

# ADD YOUR CODE

print ('RESULTS FOR POLY DEGREE=', degree, ' KERNEL')

print("Best parameters set found:")
# ADD YOUR CODE

print("Score with best parameters:")
# ADD YOUR CODE

print("\nAll scores on the grid:")
# ADD YOUR CODE

## TO DO 4
Pick a model for the Radial Basis Function kernel:

In [None]:
# parameters for rbf SVM
parameters = {'C': [0.1, 1, 10, 100],'gamma':[0.001, 0.01, 0.1,1]}

#run SVM with rbf kernel

# ADD YOUR CODE

print ('RESULTS FOR rbf KERNEL')

print("Best parameters set found:")
# ADD YOUR CODE

print("Score with best parameters:")
# ADD YOUR CODE

print("\nAll scores on the grid:")
# ADD YOUR CODE

## QUESTION 1
What do you observe when using linear, polynomial and RBF kernels on this dataset?

## TO DO 5
Report here the best SVM kernel and parameters

In [None]:
#get training and test error for the best SVM model from CV
best_SVM = # USE YOUR OPTIMAL PARAMETERS

# ADD YOUR CODE

# (error is 1 - svm.score)

print ("Best SVM training error: %f" % training_error)
print ("Best SVM test error: %f" % test_error)

## TO DO 6

Analyze how the gamma parameter (inversely proportional to the variance of the Gaussian kernel) impacts the performance of the classifier.
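
To build some intuition first, the sketch below evaluates the Gaussian kernel $k(x, z) = \exp(-\gamma \|x - z\|^2)$ for one fixed pair of points over a range of gamma values. Under this convention $\gamma = 1/(2\sigma^2)$. The function `rbf_kernel_value` and the two points are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    """Gaussian (RBF) kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.zeros(2)
z = np.array([1.0, 1.0])  # squared distance from x is 2

gamma_demo = np.logspace(-5, 2, 8)
kernel_values = np.array([rbf_kernel_value(x, z, g) for g in gamma_demo])

for g, k in zip(gamma_demo, kernel_values):
    print(f"gamma={g:g}  k(x, z)={k:.6f}")
```

For small gamma the two points look almost identical to the kernel (values near 1), while for large gamma the similarity drops to nearly 0, so each training point only influences its immediate neighbourhood.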

In [None]:
#Test with different values of gamma

# Set gamma values
gamma_values = np.logspace(-5,2,8)
print(gamma_values)


In [None]:
# Try the SVM with the previously set values of gamma
# use rbf kernel and C=1

train_acc_list, test_acc_list = [], []

 
# ADD YOUR CODE TO TRAIN THE SVM MULTIPLE TIMES WITH THE DIFFERENT VALUES OF GAMMA
# PLACE THE TRAIN AND TEST ACCURACY FOR EACH TEST IN THE TRAIN AND TEST ACCURACY LISTS

# Plot
fig, ax = plt.subplots(1,2, figsize=(15,5))

ax[0].plot(gamma_values, train_acc_list)
ax[0].set_xscale('log')
ax[0].set_xlabel('gamma')
ax[0].set_ylabel('Train accuracy')
ax[0].grid(True)

ax[1].plot(gamma_values, test_acc_list)
ax[1].set_xscale('log')
ax[1].set_xlabel('gamma')
ax[1].set_ylabel('Test accuracy')
ax[1].grid(True)


plt.show()

## QUESTION 2
How do the train and test errors change when gamma changes? Which is the best value of gamma?
Connect your answers to the discussion about overfitting.

## More data
Now let's do the same but using more data points for training.


Choose a new number of data points.

In [None]:
X = X[permutation]
y = y[permutation]

m_training = 2000 # TODO number of data points, adjust depending on the capabilities of your PC

X_train, X_test = X[:m_training], X[m_training:]
y_train, y_test = y[:m_training], y[m_training:]

labels, freqs = np.unique(y_train, return_counts=True)
print("Labels in training dataset: ", labels)
print("Frequencies in training dataset: ", freqs)

## TO DO 7

Let's train the SVM with the best parameters found above, now with $m_{training} = 2000$ training points. Since it may take a long time to run, you can let it run for some time and stop it if it does not complete. If you do this, report it in the cell below.

In [None]:
#get training and test error for the best SVM model from CV

# ADD YOUR CODE

print ("Best SVM training error: %f" % training_error)
print ("Best SVM test error: %f" % test_error)

Just for comparison, let's also use logistic regression.

## TO DO 8 Try first without regularization (use a very large C)

In [None]:
from sklearn import linear_model

# ADD YOUR CODE

print ("Best logistic regression training error: %f" % training_error)
print ("Best logistic regression test error: %f" % test_error)

## TO DO 9 Try with regularization (use C=1)

In [None]:
# ADD YOUR CODE

print ("Best regularized logistic regression training error: %f" % training_error)
print ("Best regularized logistic regression test error: %f" % test_error)

## QUESTION 3
Compare and discuss:
- the results from SVM with m=600 and with m=2000 training data points. If you stopped the SVM, include that in your comparison.
- the results of SVM and of Logistic Regression

## TO DO 10
Plot a character that is misclassified by logistic regression and correctly classified by SVM.
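
One way to locate such a sample is a boolean mask over the two prediction arrays. The sketch below uses tiny hand-made arrays (`y_true_demo`, `lr_pred_demo`, `svm_pred_demo` are hypothetical stand-ins for the real labels and predictions).

```python
import numpy as np

# Hypothetical labels and predictions on a tiny test set
y_true_demo = np.array([0, 1, 2, 3, 4, 5])
lr_pred_demo = np.array([0, 2, 2, 1, 4, 0])
svm_pred_demo = np.array([0, 1, 2, 3, 0, 0])

# True where logistic regression is wrong AND the SVM is right
mask = (lr_pred_demo != y_true_demo) & (svm_pred_demo == y_true_demo)
candidates = np.flatnonzero(mask)
print(candidates)  # → [1 3]
```

Any index in `candidates` could then be passed to a plotting function such as `plot_input`.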

In [None]:
LR_prediction = # ADD CODE
SVM_prediction = # ADD CODE

# ADD CODE

## TO DO 11
Plot the confusion matrix for the SVM classifier and for logistic regression.
The confusion matrix has one column for each predicted label and one row for each true label. 
For each class (row), it shows how many samples of that class are assigned to each possible output label.
Notice that the diagonal contains the correctly classified samples, while the other cells correspond to errors.
You can obtain it with the sklearn.metrics.confusion_matrix function (see the documentation).
Try also to normalize the confusion matrix by the number of samples in each class in order to measure the accuracy on each single class.
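
The row normalization described above can be sketched on a toy example. The label arrays below are made up for illustration; only `sklearn.metrics.confusion_matrix` comes from the text.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for three classes
y_true_demo = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([0, 0, 1, 1, 1, 2, 2, 0, 1])

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Divide each row by the number of samples of that true class
cm_normalized = cm / cm.sum(axis=1, keepdims=True)

# The diagonal of the normalized matrix is the accuracy on each class
per_class_accuracy = np.diag(cm_normalized)

print(cm)
print(cm_normalized)
print(per_class_accuracy)
```

Dividing by the row sums gives the same result as dividing by the per-class counts from `np.unique(y_true_demo, return_counts=True)`, as long as every class appears in the true labels.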


In [None]:
np.set_printoptions(precision=2, suppress=True) # for better aligned printing of confusion matrix use floatmode='fixed'

u, counts = np.unique(y_test, return_counts=True)
print("Labels and frequencies in test set: ", u, counts)

confusion_SVM = # ADD CODE
print("\n Confusion matrix SVM \n \n", confusion_SVM)
print("\n Confusion matrix SVM (normalized) \n \n", confusion_SVM /counts[:,None] )

confusion_LR = # ADD CODE
print("\n Confusion matrix LR \n \n", confusion_LR)
print("\n Confusion matrix LR (normalized) \n \n", confusion_LR /counts[:,None] )

In [None]:
# ADD CODE TO NORMALIZE CONFUSION MATRIX AND PRINT THE NORMALIZED MATRIX


## QUESTION 4
Have a look at the confusion matrices and comment on the obtained accuracies. Why do some classes have lower accuracies and others higher ones? Make some guesses about the possible causes.
