{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Application of Principal Component Analysis to the Iris dataset" ], "metadata": { "id": "b94slaKNpac5" } }, { "cell_type": "markdown", "source": [ "I apply PCA to the Iris dataset with different numbers of components, noting the effect on accuracy." ], "metadata": { "id": "MW-CAN33pxyR" } }, { "cell_type": "markdown", "source": [ "First of all, I import the necessary libraries." ], "metadata": { "id": "e4y6CSXhqCtg" } }, { "cell_type": "code", "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import StandardScaler\n", "import pandas as pd\n", "from sklearn.datasets import load_iris\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.model_selection import cross_val_score\n", "from sklearn.linear_model import LogisticRegression" ], "metadata": { "id": "wGyrwmhbq5v5" }, "execution_count": 13, "outputs": [] }, { "cell_type": "markdown", "source": [ "I load the Iris dataset and show its first rows." ], "metadata": { "id": "Wl4yTdyOqQEn" } }, { "cell_type": "code", "source": [ "iris = load_iris()\n", "X = iris['data']\n", "y = iris['target']\n", "\n", "df = pd.DataFrame(iris['data'], columns=iris['feature_names'])\n", "df.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "_GpYFu3Rq57_", "outputId": "c74fb4a2-53c2-4879-c964-daea9d2a9007" }, "execution_count": 14, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)\n", "0                5.1               3.5                1.4               0.2\n", "1                4.9               3.0                1.4               0.2\n", "2                4.7               3.2                1.3               0.2\n", "3                4.6               3.1                1.5               0.2\n", "4                5.0               3.6                1.4               0.2" ], "text/html": [ "\n", "
\n", "  " ] }, "metadata": {}, "execution_count": 14 } ] }, { "cell_type": "markdown", "source": [ "I apply PCA and calculate the cross-validated accuracy for each number of components (from one to four) as follows:\n", "\n", "\n", "* create a pipeline that performs data standardization and PCA\n", "* run cross-validation with a logistic regression on the transformed data\n", "* print the mean accuracy\n", "\n", "This procedure is repeated four times, once for each number of components." ], "metadata": { "id": "Fw-x0gA-rB7w" } }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "t98efWm5lg1h", "outputId": "e3de2437-0664-4e86-b847-47330cffa564" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Number of components: 1. mean accuracy: 0.92000\n", "Number of components: 2. mean accuracy: 0.91333\n", "Number of components: 3. mean accuracy: 0.96000\n", "Number of components: 4. mean accuracy: 0.96000\n" ] } ], "source": [ "# Standardize the data, reduce it to n principal components, then score a\n", "# logistic regression on the reduced data with 5-fold cross-validation.\n", "for number_components in range(1, 5):\n", "    model = Pipeline([\n", "        ('scaler', StandardScaler()),\n", "        ('pca', PCA(n_components=number_components)),\n", "    ])\n", "\n", "    scores = cross_val_score(LogisticRegression(solver='lbfgs'), model.fit_transform(X), y, cv=5)\n", "\n", "    print(f\"Number of components: {number_components}. mean accuracy: {scores.mean():.5f}\")\n" ] } ] }