{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ " # Machine Learning LAB 1 (course 2022/23, F. Chiariotti, A.A. Deshpande) \n", "\n", "The notebook contains some simple tasks to be performed about classification and regression. Complete all the required code sections and answer to all the questions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1) Classification of NBA players role\n", "\n", "## IMPORTANT: make sure to rerun all the code from the beginning to obtain the results for the final version of your notebook, since this is the way we will do it before evaluating your notebook!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Place your name and ID number. Also recall to save the file as Surname_Name_LAB1.ipynb\n", "\n", "Student name: mario rossi
\n", "ID Number: 34335" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset description\n", "\n", "We will be working with a dataset of NBA basketball players data (you can get from https://www.kaggle.com/jacobbaruch/nba-player-of-the-week the full dataset).\n", "\n", "The provided data is a subset of the Kaggle dataset containing the players that have the role of Center and of Point Guard. For each player the dataset contains 3 features, the height, the weight and the age.\n", "\n", "From Wikipedia (if you are not a basketball fan!!):\n", "\n", "The Center (C), also known as the five, or the big man, is one of the five positions in a regular basketball game. The center is normally the tallest player on the team, and often has a great deal of strength and body mass as well. In the NBA, the center is usually 6' 10\" (2.08 m) or taller and usually weighs 240 lbs (109 kg) or more. \n", "\n", "Point Guards (PG, a.k.a. as \"play maker\") are expected to run the team's offense by controlling the ball and making sure that it gets to the right players at the right time. In the NBA, point guards are usually about 6' 3\" (1.93 m) or shorter, and average about 6' 2\" (1.88 m). Having above-average size (height, muscle) is considered advantageous, although size is secondary to situational awareness, speed, quickness, and ball handling skills. Shorter players tend to be better dribblers since they are closer to the floor, and thus have better control of the ball while dribbling. \n", "\n", "\n", "As it is clear from the description, the height and weight of the player are good hints to predict their role and in this lab we will exploit this features to estimate the role.\n", "\n", "\n", "### Three features are considered for this dataset:\n", "\n", "\n", "1) Height in cm\n", "\n", "2) Weight in kg\n", "\n", "3) Age in years" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first import all the packages that are needed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import csv\n", "\n", "import numpy as np\n", "import scipy as sp\n", "import sklearn as sl\n", "from scipy import stats\n", "from sklearn import datasets\n", "from sklearn import linear_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Perceptron\n", "Firstly we will implement the perceptron algorithm and use it to learn a halfspace." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TO DO** Set the random seed (you can use your ID (matricola) or any other number!)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IDnumber = 3553534 #YOUR_ID , try also to change the seed to see the impact of random initialization on the results\n", "np.random.seed(IDnumber)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the dataset and then split in training set and test set (the training set is typically larger, you can use a 70% tranining 30% test split) after applying a random permutation to the datset." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A) Load dataset and perform permutation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#load the dataset\n", "filename = 'data/NBA.csv'\n", "NBA = csv.reader(open(filename, newline=''), delimiter=',')\n", "\n", "header = next(NBA) #skip first line\n", "print(header)\n", "\n", "dataset = list(NBA)\n", "for i in range(len(dataset)):\n", " dataset[i] = [int(x) for x in dataset[i]]\n", " \n", "dataset = np.asarray(dataset)\n", "\n", "X = dataset[:,1:4] #columns 1,2,3 contain the features\n", "Y = dataset[:,0] # column 0: labels\n", "\n", "Y = Y*2-1 # set labels to -1, 1 as required by perceptron implementation\n", "\n", "m = dataset.shape[0]\n", "print(m)\n", "permutation = np.random.permutation(m) # random permurtation\n", "\n", "X = X[permutation]\n", "Y = Y[permutation]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to classify class \"1\" (Center) vs class \"-1\" (Point Guard)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "B) **TO DO** Divide the data into training set and test set (70% of the data in the first set, 30% in the second one)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Divide in training and test: make sure that your training set\n", "#contains at least 10 elements from class 1 and at least 10 elements\n", "#from class -1! If it does not, modify the code so to apply more random\n", "#permutations (or the same permutation multiple times) until this happens.\n", "#IMPORTANT: do not change the random seed.\n", "\n", "#m_training needs to be the number of samples in the training set\n", "m_training = # PLACE YOUR CODE\n", "\n", "#m_test needs to be the number of samples in the test set\n", "m_test = # PLACE YOUR CODE\n", "\n", "#X_training = instances for training set\n", "X_training = # PLACE YOUR CODE\n", "#Y_training = labels for the training set\n", "Y_training = # PLACE YOUR CODE\n", "\n", "#X_test = instances for test set\n", "X_test = # PLACE YOUR CODE\n", "#Y_test = labels for the test set\n", "Y_test = # PLACE YOUR CODE\n", "\n", "print(Y_training) #to make sure that Y_training contains both 1 and -1\n", "print(m_test)\n", "\n", "print(\"Shape of training set: \" + str(X_training.shape))\n", "print(\"Shape of test set: \" + str(X_test.shape))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We add a 1 in front of each sample so that we can use a vector in homogeneous coordinates to describe all the coefficients of the model. This can be done with the function $hstack$ in $numpy$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#add a 1 to each sample (homogeneous coordinates)\n", "X_training = np.hstack((np.ones((m_training,1)),X_training))\n", "X_test = np.hstack((np.ones((m_test,1)),X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TO DO** Now complete the function *perceptron*. Since the perceptron does not terminate if the data is not linearly separable, your implementation should return the desired output (see below) if it reached the termination condition seen in class or if a maximum number of iterations have already been run, where one iteration corresponds to one update of the perceptron weights. 
In case the termination is reached because the maximum number of iterations has been completed, the implementation should return **the best model** seen so far.\n", "\n", "The input parameters to pass are:\n", "- $X$: the matrix of input features, one row for each sample\n", "- $Y$: the vector of labels for the input features matrix X\n", "- $max\\_num\\_iterations$: the maximum number of iterations for running the perceptron\n", "\n", "The output values are:\n", "- $best\\_w$: the vector with the coefficients of the best model\n", "- $best\\_error$: the *fraction* of misclassified samples for the best model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A template is provided, but feel free to build a different implementation\n", "\n", "def perceptron_update(current_w, x, y):\n", "    # Place in this function the update rule of the perceptron algorithm\n", "\n", "    # PLACE YOUR CODE\n", "\n", "    return new_w\n", "\n", "def perceptron(X, Y, max_num_iterations):\n", "    # Place in this function the main section of the perceptron algorithm\n", "\n", "    # init the algorithm with w=0, use a best_w variable to keep track of the best solution\n", "    curr_w = # PLACE YOUR CODE\n", "    best_w = # PLACE YOUR CODE\n", "    num_samples = # PLACE YOUR CODE\n", "    best_error = # PLACE YOUR CODE\n", "\n", "    index_misclassified = # will be overwritten\n", "    num_misclassified = # will be overwritten\n", "\n", "    # main loop: continue until all samples are correctly classified or the max number of iterations is reached\n", "    num_iter = 1\n", "\n", "    while ((index_misclassified != -1) and (num_iter < max_num_iterations)):\n", "\n", "        index_misclassified = -1\n", "        num_misclassified = 0\n", "\n", "        # avoid always working on the same sample: you can use a random permutation or randomize the choice of the misclassified sample\n", "\n", "        # PLACE YOUR CODE TO RANDOMIZE\n", "\n", "        for i in range(num_samples):\n", "\n", "            # PLACE YOUR CODE\n", "\n", "            # check if the i-th randomly selected sample is misclassified;\n", "            # store the number of misclassified samples and the index of at least one of them\n", "\n", "        # update error count, keep track of best solution\n", "\n", "        # PLACE YOUR CODE\n", "\n", "        num_iter += 1\n", "\n", "        # call the update function using a misclassified sample\n", "\n", "    best_error = # PLACE YOUR CODE\n", "\n", "    return best_w, best_error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we use the perceptron implementation above to learn a model from the training data using 100 iterations and print the error of the best model we have found." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#now run the perceptron for 100 iterations\n", "w_found, error = perceptron(X_training,Y_training, 100)\n", "print(\"Training Error of perceptron (100 iterations): \" + str(error))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TO DO** use the best model $w\\_found$ to predict the labels for the test dataset and print the fraction of misclassified samples in the test set (the test error, which is an estimate of the true loss)."
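] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a hint, with samples in homogeneous coordinates a linear classifier predicts the label of each row from the sign of the dot product with the weight vector. The next cell is a small self-contained sketch on made-up arrays (the names $toy\\_X$, $toy\\_Y$ and $toy\\_w$ are hypothetical, not the lab variables):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch (not the lab solution): predictions and error fraction for a linear classifier\n", "import numpy as np\n", "\n", "toy_X = np.array([[1., 2., 0.], [1., -1., 3.], [1., 0.5, -2.]])  # samples in homogeneous coordinates (made-up)\n", "toy_Y = np.array([1, -1, 1])                                     # made-up true labels\n", "toy_w = np.array([0.1, 1.0, -0.5])                               # some weight vector\n", "\n", "pred = np.sign(toy_X @ toy_w)            # predicted labels in {-1, +1}\n", "error_fraction = np.mean(pred != toy_Y)  # fraction of misclassified samples\n", "print(\"fraction of misclassified samples:\", error_fraction)"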
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#now use the w_found to make predictions on the test dataset\n", "\n", "num_errors = 0\n", "\n", "# PLACE YOUR CODE to compute the number of errors\n", "\n", "true_loss_estimate = num_errors/m_test # error rate on the test set\n", "#NOTE: you can avoid using num_errors if you prefer, as long as true_loss_estimate is correct\n", "print(\"Test Error of perceptron (100 iterations): \" + str(true_loss_estimate))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TO DO** **[Answer the following]** What about the difference between the training error and the test error (in terms of fraction of misclassified samples)? Explain what you observe. [Write the answer in this cell]\n", "\n", "**ANSWER QUESTION 1**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TO DO** Copy the code from the last 2 cells above into the cell below and repeat the training with 3000 iterations. Then print the error on the training set and the estimate of the true loss obtained from the test set." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "#now run the perceptron for 3000 iterations here!\n", "\n", "# PLACE YOUR CODE\n", "\n", "print(\"Training Error of perceptron (3000 iterations): \" + str(error))\n", "\n", "# PLACE YOUR CODE\n", "\n", "print(\"Test Error of perceptron (3000 iterations): \" + str(true_loss_estimate))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TO DO** [Answer the following] What about the difference between the training error and the test error (in terms of fraction of misclassified samples) when running for a larger number of iterations? Explain what you observe and compare with the previous case. [Write the answer in this cell]\n", "\n", "**ANSWER QUESTION 2**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Logistic Regression\n", "Now we use logistic regression, exploiting the implementation in Scikit-learn, to predict labels. We will also plot the decision region of logistic regression.\n", "\n", "We first load the dataset again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filename = 'data/NBA.csv'\n", "NBA = csv.reader(open(filename, newline=''), delimiter=',')\n", "\n", "header = next(NBA)\n", "print(header)\n", "\n", "dataset = list(NBA)\n", "for i in range(len(dataset)):\n", "    dataset[i] = [int(x) for x in dataset[i]]\n", "\n", "dataset = np.asarray(dataset)\n", "\n", "X = dataset[:,1:]\n", "Y = dataset[:,0]\n", "\n", "Y = Y*2-1 # set labels to -1, 1\n", "\n", "m = dataset.shape[0]\n", "permutation = np.random.permutation(m)\n", "\n", "X = X[permutation]\n", "Y = Y[permutation]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TO DO** As for the previous part, divide the data into training and test (70%-30%) and add a 1 as first component to each sample." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Divide in training and test: make sure that your training set\n", "#contains at least 10 elements from class 1 and at least 10 elements\n", "#from class -1! 
If it does not, modify the code so as to apply more random\n", "#permutations (or the same permutation multiple times) until this happens.\n", "#IMPORTANT: do not change the random seed.\n", "\n", "m_training = # PLACE YOUR CODE\n", "m_test = # PLACE YOUR CODE\n", "\n", "X_training = # PLACE YOUR CODE\n", "Y_training = # PLACE YOUR CODE\n", "\n", "X_test = # PLACE YOUR CODE\n", "Y_test = # PLACE YOUR CODE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To define a logistic regression model in Scikit-learn use the instruction\n", "\n", "$linear\\_model.LogisticRegression(C=1e5)$\n", "\n", "($C$ is a parameter related to *regularization*, a technique that\n", "we will see later in the course. Setting it to a high value is almost\n", "the same as ignoring regularization, so the instruction above corresponds to the\n", "logistic regression you have seen in class.)\n", "\n", "To learn the model you need to use the $fit(...)$ instruction and to predict you need to use the $predict(...)$ function. See the Scikit-learn documentation for how to use them.\n", "\n", "**TO DO** Define the logistic regression model, then learn the model using the training set and predict on the test set. Then print the fraction of samples misclassified in the training set and in the test set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#part on logistic regression for 2 classes\n", "logreg = linear_model.LogisticRegression(C=1e5) #a large C disables regularization\n", "\n", "#learn from training set\n", "\n", "# PLACE YOUR CODE\n", "\n", "#predict on training set\n", "\n", "# PLACE YOUR CODE\n", "\n", "#print the error rate = fraction of misclassified samples\n", "error_rate_training = 0\n", "\n", "# PLACE YOUR CODE\n", "\n", "print(\"Error rate on training set: \" + str(error_rate_training))\n", "\n", "#predict on test set\n", "\n", "# PLACE YOUR CODE\n", "\n", "#print the error rate = fraction of misclassified samples\n", "error_rate_test = 0\n", "\n", "# PLACE YOUR CODE\n", "\n", "print(\"Error rate on test set: \" + str(error_rate_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TO DO** Now pick two features and restrict the dataset to include only those two features, whose indices are specified in the $features$ vector below. Then split into training and test. Which features are you going to select?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to make the plot we need to reduce the data to 2D, so we choose two features\n", "features_list = ['height', 'weight', 'age']\n", "labels_list = ['Center', 'Point guard']\n", "\n", "# select a pair of features\n", "index_feature1 = 0 # we choose height\n", "index_feature2 = 1 # and weight (age of course is not very meaningful)\n", "features = [index_feature1, index_feature2]\n", "\n", "feature_name0 = features_list[features[0]]\n", "feature_name1 = features_list[features[1]]\n", "\n", "X_reduced = X[:,features]\n", "\n", "X_training = # PLACE YOUR CODE\n", "Y_training = # PLACE YOUR CODE\n", "\n", "X_test = # PLACE YOUR CODE\n", "Y_test = # PLACE YOUR CODE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now learn a model using the training data and measure its performance."
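] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before filling in the cell below, here is a minimal, self-contained sketch of the scikit-learn fit/predict workflow described above, run on synthetic data (the arrays $demo\\_X$ and $demo\\_Y$ are made up for illustration; they are not the lab data and this is not the lab solution):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative sketch of the scikit-learn fit / predict workflow (not the lab solution)\n", "import numpy as np\n", "from sklearn import linear_model\n", "\n", "demo_X = np.array([[180., 80.], [210., 120.], [185., 85.], [205., 115.]])  # made-up features\n", "demo_Y = np.array([-1, 1, -1, 1])                                          # made-up labels\n", "\n", "demo_model = linear_model.LogisticRegression(C=1e5)  # large C: (almost) no regularization\n", "demo_model.fit(demo_X, demo_Y)                       # learn the coefficients\n", "\n", "demo_pred = demo_model.predict(demo_X)               # predicted labels\n", "print(\"error rate:\", np.mean(demo_pred != demo_Y))   # fraction of misclassified samples"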
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# learning from training data\n", "\n", "# PLACE YOUR CODE\n", "\n", "#print the error rate = fraction of misclassified samples\n", "error_rate_test = 0\n", "\n", "# PLACE YOUR CODE\n", "\n", "print(\"Error rate on test set: \" + str(error_rate_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**TO DO** [Answer the following] Which features did you select and why? Compare the performance with the case that uses all 3 features and comment on the results. [Write the answer in this cell]\n", "\n", "**ANSWER QUESTION 3**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If everything is ok, the code below uses the model in $logreg$ to plot the decision region for the two features chosen above, with colors denoting the predicted value. It also plots the points (with correct labels) in the training set. It makes a similar plot for the test set." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Plot the decision boundary. For that, we will assign a color to each\n", "# point in the mesh [x_min, x_max]x[y_min, y_max].\n", "\n", "# NOTICE: This visualization code has been developed for a \"standard\" solution of the notebook,\n", "# it could be necessary to make some fixes to adapt it to your implementation\n", "\n", "h = .02 # step size in the mesh\n", "x_min, x_max = X_reduced[:, 0].min() - .5, X_reduced[:, 0].max() + .5\n", "y_min, y_max = X_reduced[:, 1].min() - .5, X_reduced[:, 1].max() + .5\n", "xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n", "\n", "Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])\n", "\n", "# Put the result into a color plot\n", "Z = Z.reshape(xx.shape)\n", "\n", "plt.figure(1, figsize=(4, 3))\n", "plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)\n", "\n", "# Plot also the training points\n", "plt.scatter(X_training[:, 0], X_training[:, 1], c=Y_training, edgecolors='k', cmap=plt.cm.Paired)\n", "plt.xlabel(feature_name0)\n", "plt.ylabel(feature_name1)\n", "\n", "plt.xlim(xx.min(), xx.max())\n", "plt.ylim(yy.min(), yy.max())\n", "plt.xticks(())\n", "plt.yticks(())\n", "plt.title('Training set')\n", "\n", "plt.show()\n", "\n", "# Put the result into a color plot\n", "Z = Z.reshape(xx.shape)\n", "plt.figure(1, figsize=(4, 3))\n", "plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)\n", "\n", "# Plot also the test points\n", "plt.scatter(X_test[:, 0], X_test[:, 1], c=Y_test, edgecolors='k', cmap=plt.cm.Paired, marker='s')\n", "plt.xlabel(feature_name0)\n", "plt.ylabel(feature_name1)\n", "\n", "plt.xlim(xx.min(), xx.max())\n", "plt.ylim(yy.min(), yy.max())\n", "plt.xticks(())\n", "plt.yticks(())\n", "plt.title('Test set')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2) Linear Regression on the Boston House Price dataset\n", "\n", "Dataset description:
\n", "The Boston House Price Dataset involves the prediction of a house price in thousands of dollars given details about the house and its neighborhood.\n", "\n", "The dataset contains a total of 500 observations, which relate 13 input features to an output variable (house price).\n", "\n", "The variable names are as follows:\n", "\n", "CRIM: per capita crime rate by town.\n", "\n", "ZN: proportion of residential land zoned for lots over 25,000 sq.ft.\n", "\n", "INDUS: proportion of nonretail business acres per town.\n", "\n", "CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).\n", "\n", "NOX: nitric oxides concentration (parts per 10 million).\n", "\n", "RM: average number of rooms per dwelling.\n", "\n", "AGE: proportion of owner-occupied units built prior to 1940.\n", "\n", "DIS: weighted distances to five Boston employment centers.\n", "\n", "RAD: index of accessibility to radial highways.\n", "\n", "TAX: full-value property-tax rate per $10,000.\n", "\n", "PTRATIO: pupil-teacher ratio by town.\n", "\n", "B: 1000*(Bk – 0.63)2 where Bk is the proportion of blacks by town.\n", "\n", "LSTAT: % lower status of the population.\n", "\n", "MEDV: Median value of owner-occupied homes in $1000s.\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#needed if you get the IPython/javascript error on the in-line plots\n", "%matplotlib nbagg \n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import scipy as sp\n", "from scipy import stats" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Import Data: Load the data from a .csv file\n", "\n", "filename = \"data/house.csv\"\n", "Data = np.genfromtxt(filename, delimiter=';',skip_header=1)\n", "\n", "#A quick overview of data, to inspect the data you can use the method describe()\n", "\n", "dataDescription = stats.describe(Data)\n", "print(dataDescription)\n", "print (\"Shape of data array: \" + str(Data.shape))\n", "\n", "\n", "#for more interesting visualization: use Panda!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Split data in training, validation and test sets\n", "\n", "\n", "\n", "Given $m$ total data, denote with $m_{tv}$ the part used for training and validation. Keep $m_t$ data as training data, $m_{val}:=m_{tv}-m_t$ as validation data and $m_{test}:=m - m_{val} - m_t = m-m_{tv}$. For instance one can take $m_t=0.6m$ of the data as training, $m_{val}=0.2m$ validation and $m_{test}=0.2m$ as testing. Let us define as define\n", "\n", "$\\bullet$ $S_{t}$ the training data set\n", "\n", "$\\bullet$ $S_{val}$ the validation data set\n", "\n", "$\\bullet$ $S_{test}$ the testing data set\n", "\n", "\n", "The reason for this splitting is as follows:\n", "\n", "TRAINING DATA: The training data are used to compute the empirical loss\n", "$$\n", "L_S(h) = \\frac{1}{m_t} \\sum_{z_i \\in S_{t}} \\ell(h,z_i)\n", "$$\n", "which is used to estimate $h$ in a given model class ${\\cal H}$.\n", "i.e. \n", "$$\n", "\\hat{h} = {\\rm arg\\; min}_{h \\in {\\cal H}} \\, L_S(h)\n", "$$\n", "\n", "VALIDATION DATA: When different model classes are present (e.g. of different complexity such as linear regression which uses a different number $d_j$ of regressors $x_1$,...$x_{d_j}$), one has to choose which one is the \"best\" complexity. 
In this simple example the validation set is not needed, but it is better to get used to it.\n", "Let ${\\cal H}_{d_j}$ be the space of models as a function of the complexity $d_j$ and let \n", "$$\n", "\\hat h_{d_j} = {\\rm arg\\; min}_{h \\in {\\cal H}_{d_j}} \\, L_S(h)\n", "$$\n", "\n", "One can estimate the generalization error for model $\\hat h_{d_j}$ as follows:\n", "$$\n", "L_{{\\cal D}}(\\hat h_{d_j}) \\simeq \\frac{1}{m_{val}} \\sum_{ z_i \\in S_{val}} \\ell(\\hat h_{d_j},z_i)\n", "$$\n", "and then choose the complexity which achieves the best estimate of the generalization error\n", "$$\n", "\\hat d_j := {\\rm arg\\; min}_{d_j} \\,\\frac{1}{m_{val}} \\sum_{ z_i \\in S_{val}} \\ell(\\hat h_{d_j},z_i)\n", "$$\n", "\n", "TESTING DATA: Last, the test data set can be used to estimate the performance of the final estimated model\n", "$\\hat h_{\\hat d_j}$ using:\n", "$$\n", "L_{{\\cal D}}(\\hat h_{\\hat d_j}) \\simeq \\frac{1}{m_{test}} \\sum_{ z_i \\in S_{test}} \\ell(\\hat h_{\\hat d_j},z_i)\n", "$$\n", "\n", "\n", "**TO DO**: split the data into training, validation and test sets (60%-20%-20%)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#get number of total samples\n", "num_total_samples = Data.shape[0]\n", "\n", "print (\"Total number of samples: \", num_total_samples)\n", "\n", "#size of each chunk of data (1/5 each): 3 of them for training, 1 for validation, 1 for testing\n", "size_chunk = # PLACE YOUR CODE\n", "\n", "print (\"Size of each chunk of data: \", size_chunk)\n", "\n", "#shuffle the data\n", "np.random.shuffle(Data)\n", "\n", "#training data\n", "\n", "X_training = # PLACE YOUR CODE\n", "Y_training = # PLACE YOUR CODE\n", "print (\"Training input data size: \", X_training.shape)\n", "print (\"Training output data size: \", Y_training.shape)\n", "\n", "#validation data, to be used to choose among different models\n", "X_validation = # PLACE YOUR CODE\n", "Y_validation = # PLACE YOUR CODE\n", "print (\"Validation input data size: \", X_validation.shape)\n", "print (\"Validation output data size: \", Y_validation.shape)\n", "\n", "#test data, to be used to estimate the true loss of the final model(s)\n", "X_test = # PLACE YOUR CODE\n", "Y_test = # PLACE YOUR CODE\n", "print (\"Test input data size: \", X_test.shape)\n", "print (\"Test output data size: \", Y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Normalization\n", "It is common practice in Statistics and Machine Learning to scale the data (= each variable) so that it is centered (zero mean) and has standard deviation equal to 1. This helps in terms of numerical conditioning of the (inverse) problem of estimating the model (the coefficients of the linear regression in this case), as well as giving the same scale to all the coefficients."
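] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In formulas, each input variable $x_j$ is replaced by\n", "$$\n", "x_j' = \\frac{x_j - \\mu_j}{\\sigma_j}\n", "$$\n", "where $\\mu_j$ and $\\sigma_j$ are the mean and the standard deviation of the $j$-th variable computed on the training set only; the same transformation is then applied to the validation and test sets (this is what the $StandardScaler$ does in the cell below)."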
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#scale the data\n", "\n", "# standardize the input matrix\n", "from sklearn import preprocessing\n", "# the transformation is computed on the training data and then used on all 3 sets\n", "scaler = preprocessing.StandardScaler().fit(X_training)\n", "\n", "X_training = scaler.transform(X_training)\n", "print (\"Mean of the training input data:\", X_training.mean(axis=0))\n", "print (\"Std of the training input data:\", X_training.std(axis=0))\n", "\n", "X_validation = scaler.transform(X_validation) # use the same transformation on validation data\n", "print (\"Mean of the validation input data:\", X_validation.mean(axis=0))\n", "print (\"Std of the validation input data:\", X_validation.std(axis=0))\n", "\n", "X_test = scaler.transform(X_test) # use the same transformation on test data\n", "print (\"Mean of the test input data:\", X_test.mean(axis=0))\n", "print (\"Std of the test input data:\", X_test.std(axis=0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Training \n", "\n", "The model is trained (= estimated) by minimizing the empirical error\n", "$$\n", "L_S(h) := \\frac{1}{m_t} \\sum_{z_i \\in S_{t}} \\ell(h,z_i)\n", "$$\n", "When the loss function is the quadratic loss\n", "$$\n", "\\ell(h,z) := (y - h(x))^2\n", "$$\n", "we define the Residual Sum of Squares (RSS) as\n", "$$\n", "RSS(h):= \\sum_{z_i \\in S_{t}} \\ell(h,z_i) = \\sum_{z_i \\in S_{t}} (y_i - h(x_i))^2\n", "$$ so that the training error becomes\n", "$$\n", "L_S(h) = \\frac{RSS(h)}{m_t}\n", "$$\n", "\n", "We recall that for linear models we have $h(x) = x^\\top w$, and the empirical error $L_S(h)$ can be written\n", "in terms of the vector of parameters $w$ in the form\n", "$$\n", "L_S(w) = \\frac{1}{m_t} \\|Y - X w\\|^2\n", "$$\n", "where $Y$ and $X$ are the matrices whose $i$-th rows are, respectively, the output data $y_i$ and the input vector $x_i^\\top$.\n", "\n", "\n", " **TO DO:** compute the linear regression coefficients using np.linalg.lstsq from numpy \n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#compute linear regression coefficients for training data\n", "\n", "\n", "#add a 1 at the beginning of each sample for training, validation, and testing (use homogeneous coordinates)\n", "m_training = X_training.shape[0]\n", "X_trainingH = np.hstack((np.ones((m_training,1)),X_training)) # H: in homogeneous coordinates\n", "#print X_training[0,:]\n", "\n", "m_validation = X_validation.shape[0]\n", "X_validationH = np.hstack((np.ones((m_validation,1)),X_validation)) # H: in homogeneous coordinates\n", "#print X_validation[0,:]\n", "\n", "m_test = X_test.shape[0]\n", "X_testH = np.hstack((np.ones((m_test,1)),X_test)) # H: in homogeneous coordinates\n", "#print X_test[0,:]\n", "\n", "# Compute the least-squares coefficients using np.linalg.lstsq\n", "w_np, RSStr_np, rank_Xtr, sv_Xtr = # PLACE YOUR CODE\n", "print(\"LS coefficients with numpy lstsq:\", w_np)\n", "\n", "# compute the Residual Sum of Squares (by hand)\n", "\n", "RSStr_hand = # PLACE YOUR CODE\n", "\n", "print(\"RSS with numpy lstsq: \", RSStr_np)\n", "print(\"Empirical risk with numpy lstsq:\", RSStr_np/m_training)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data prediction \n", "\n", "Compute the output predictions on both the training and validation sets and compute the Residual Sum of Squares (RSS). 
\n", "\n", "**TO DO**: Compute these quantities on training, validation and test sets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#compute predictions on training and validation\n", "\n", "#prediction_training \n", "prediction_training = # PLACE YOUR CODE\n", "prediction_validation = # PLACE YOUR CODE\n", "prediction_test = # PLACE YOUR CODE\n", "\n", "#what about the loss for points in the validation data?\n", "RSS_validation = # PLACE YOUR CODE\n", "RSS_test = # PLACE YOUR CODE\n", "\n", "print(\"RSS on validation data:\", RSS_validation)\n", "print(\"Loss estimated from validation data:\", RSS_validation/m_validation)\n", "\n", "print(\"RSS on test data:\", RSS_test)\n", "print(\"Loss estimated from test data:\", RSS_test/m_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### QUESTION 4: Comment on the results you get and on the difference between the train, validation and test errors.\n", "\n", "Insert your answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ordinary Least-Squares using scikit-learn\n", "Another fast way to compute the LS estimate is through sklearn.linear_model (for this function homogeneous coordinates are not needed)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import linear_model\n", "\n", "# build the LinearRegression() model and train it\n", "LinReg = # PLACE YOUR CODE\n", "\n", "# PLACE YOUR CODE\n", "\n", "print(\"Intercept:\", LinReg.intercept_)\n", "print(\"Least-Squares Coefficients:\", LinReg.coef_)\n", "\n", "# predict output values on training and test sets\n", "\n", "# PLACE YOUR CODE\n", "\n", "# return a prediction score based on the coefficient of determination\n", "print(\"Measure on training data:\", 1-LinReg.score(X_training, Y_training))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }