{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluate, optimize, and fit a classifier \n", "\n", "* **[Sign up to the DEA Sandbox](https://app.sandbox.dea.ga.gov.au/)** to run this notebook interactively from a browser\n", "* **Compatibility:** Notebook currently compatible with the `DEA Sandbox` environment\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background\n", "\n", "Now that we've extracted training data from the ODC, and inspected it to ensure the features we selected are appropriate and useful, we can train a machine learning model. The first step is to decide which machine learning model to use. Deciding which one to pick depends on the classification task at-hand. The table below provides a useful summary of the pros and cons of different models (all of which are available through [scikit-Learn](https://scikit-learn.org/stable/)). This sckit-learn [cheat sheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) may also help.\n", "\n", "_Table 1: Some of the pros and cons of different classifiers available through scikit-learn_\n", "\n", "\n", "\n", "The approach to evaluating, optimizing, and training the supervised machine learning model demonstrated in this notebook has been developed to suit the default training dataset provided. The training dataset is small, contains geospatial data, and contains only two classes (crop and non-crop). \n", "\n", "* Because the dataset is relatively small (`n=430`) as shown in the [Extract_training_data](1_Extract_training_data.ipynb) notebook), splitting the data into a training and testing set, and only training the model on the smaller training set would likely substantially degrade the quality of the model. Thus we will fit the final model on _all_ the training data.\n", "* Because we are fitting the model on all the data, we won't have a testing set left over to estimate the model's prediction accuracy. We therefore rely on a method called **nested k-fold cross-validation** to estimate the prediction ability of our model. This method is described further in the markdown before the code.\n", "* And because we are generating a binary prediction (crop/non-crop), the metrics used to evaluate the classifier are those which are well suited to binary classifications.\n", "\n", "While the approach described above works well for the default training data provided, it may not suit your own classification problem. It is advisable to research the different methods for evaluating and training a model to determine which approach is best for you.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Description\n", "\n", "This notebook runs through evaluating, optimizing, and fitting a machine learning classifier (in the default example, a Random Forest model is used). Under each of the sub-headings you will find more information on how and why each method is used. The steps are as follows:\n", "\n", "1. Demonstrate how to group the training data into spatial clusters to assist with Spatial K-fold Cross Validation (SKCV)\n", "2. Calculate an unbiased performance estimate via nested cross-validation\n", "3. Optimize the hyperparameters of the model\n", "4. Fit a model to all the training data using the best hyperparameters identified in the previous step\n", "5. 
Save the model to disk for use in the subsequent notebook, `4_Classify_satellite_data.ipynb`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "## Getting started\n", "\n", "To run this analysis, run all the cells in the notebook, starting with the \"Load packages\" cell. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# -- scikit-learn classifiers, uncomment the one of interest----\n", "\n", "# from sklearn.svm import SVC\n", "# from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "# from sklearn.naive_bayes import GaussianNB\n", "# from sklearn.linear_model import LogisticRegression\n", "# from sklearn.neighbors import KNeighborsClassifier\n", "\n", "import os\n", "import sys\n", "import joblib\n", "import numpy as np\n", "import pandas as pd\n", "from joblib import dump\n", "import subprocess as sp\n", "import dask.array as da\n", "from pprint import pprint\n", "import matplotlib.pyplot as plt\n", "from odc.io.cgroups import get_cpu_quota\n", "from sklearn.model_selection import GridSearchCV, ShuffleSplit, KFold\n", "from sklearn.metrics import roc_curve, auc, balanced_accuracy_score, f1_score\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis Parameters\n", "\n", "* `training_data`: Name and location of the training data `.txt` file output from running `1_Extract_training_data.ipynb`\n", "* `Classifier`: The scikit-learn classification model to use. First uncomment the classifier of interest in the `Load packages` section, then enter the function name into this parameter, e.g. `Classifier = SVC`\n", "* `metric` : A single string that denotes the scoring metric used to find the best hyperparameters when refitting the estimator, and to evaluate the predictions on the test set. See the scoring parameter page [here](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) for a pre-defined list of options, e.g. `metric='balanced_accuracy'`" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "training_data = \"results/test_training_data.txt\"\n", "\n", "Classifier = RandomForestClassifier\n", "\n", "metric = 'balanced_accuracy'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### K-Fold Cross Validation Analysis Parameters\n", "\n", "\n", "* `outer_cv_splits` : The number of cross validation splits to use for the outer loop of the nested CV. These splits are used to estimate the accuracy of the classifier. A good default number is 5-10.\n", "* `inner_cv_splits` : The number of cross validation splits to use for the inner loop of the nested CV - the inner loop splits are used for optimizing the hyperparameters. A good default number is 5.\n", "* `test_size` : This will determine what fraction of the dataset will be set aside as the testing dataset. There is a trade-off here between having a larger test set that will help us better determine the quality of our classifier, and leaving enough data to train the classifier. 
A good default is to set aside 10-20% of your dataset for testing purposes.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "inner_cv_splits = 5\n", "\n", "outer_cv_splits = 5\n", "\n", "test_size = 0.20" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Find the number of CPUs" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ncpus = 2\n" ] } ], "source": [ "ncpus = round(get_cpu_quota())\n", "print('ncpus = ' + str(ncpus))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import training data" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# load the data\n", "model_input = np.loadtxt(training_data)\n", "\n", "# load the column names from the header row\n", "with open(training_data, 'r') as file:\n", "    header = file.readline()\n", "\n", "column_names = header.split()[1:]\n", "\n", "# Extract relevant indices from training data\n", "model_col_indices = [column_names.index(var_name) for var_name in column_names[1:]]\n", "\n", "# Convert variable names into scikit-learn nomenclature\n", "X = model_input[:, model_col_indices]\n", "y = model_input[:, 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate an unbiased performance estimate via nested cross-validation\n", "\n", "K-fold [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) is a statistical method used to estimate the performance of machine learning models when making predictions on data not used during training. It is a popular method because it is conceptually straightforward and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.\n", "\n", "This procedure can be used both when optimizing the hyperparameters of a model on a dataset, and when comparing and selecting a model for the dataset. However, when the same cross-validation procedure and dataset are used to both tune and select a model, it is likely to lead to an optimistically biased evaluation of the model performance.\n", "\n", "One approach to overcoming this bias is to nest the hyperparameter optimization procedure under the model selection procedure. This is called **nested cross-validation**. The paper [here](https://jmlr.csail.mit.edu/papers/v11/cawley10a.html) provides more context to this issue. The image below depicts how the nested cross-validation works.\n", "\n", "\n", "\n", "The result of our nested cross-validation will be a set of accuracy scores that show how well our classifier is doing at recognising unseen data points. The default example is set up to show the `balanced_accuracy` score, the `f1` score, and the `Receiver Operating Characteristic - Area Under the Curve (ROC-AUC)` score. This latter metric is a robust measure of a classifier's prediction ability.\n",
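"\n", "As a quick, self-contained illustration of how these three metrics behave, the sketch below scores a handful of made-up labels and predictions (purely illustrative, and not part of this workflow):\n", "\n", "```python\n", "from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score\n", "\n", "y_true = [0, 0, 1, 1]           # hypothetical true labels\n", "y_pred = [0, 0, 0, 1]           # hypothetical predicted labels\n", "y_prob = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities of the positive class\n", "\n", "print(balanced_accuracy_score(y_true, y_pred))  # 0.75\n", "print(f1_score(y_true, y_pred))                 # ~0.67\n", "print(roc_auc_score(y_true, y_prob))            # 0.75\n", "```\n",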
"\n", "[This article](https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5) has a good explanation of ROC-AUC, which is a common machine learning metric.\n", "\n", "All measures return a value between 0 and 1, with a value of 1 indicating a perfect score.\n", "\n", "To conduct the nested cross-validation, we first need to define a grid of parameters to be used in the optimization:\n", "* `param_grid`: a dictionary of model-specific parameters to search through during hyperparameter optimization.\n", "\n", "> **Note**: the parameters in the `param_grid` object depend on the classifier being used. The default example is set up for a Random Forest classifier; to adjust the parameters to suit a different classifier, look up the important parameters under the relevant [sklearn documentation](https://scikit-learn.org/stable/supervised_learning.html). " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Create the parameter grid to search during hyperparameter optimization\n", "param_grid = {\n", "    'class_weight': ['balanced', None],\n", "    'max_features': ['auto', 'log2', None],\n", "    'n_estimators': [100, 200, 300],\n", "    'criterion': ['gini', 'entropy']\n", "}" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Working on 5/5 outer CV split\r" ] } ], "source": [ "outer_cv = KFold(n_splits=outer_cv_splits, shuffle=True,\n", "                 random_state=0)\n", "\n", "# lists to store results of CV testing\n", "acc = []\n", "f1 = []\n", "roc_auc = []\n", "i = 1\n", "\n", "for train_index, test_index in outer_cv.split(X, y):\n", "    print(f\"Working on {i}/{outer_cv_splits} outer CV split\", end='\\r')\n", "    model = Classifier(random_state=1)\n", "\n", "    # Index training and testing data\n", "    X_tr, X_tt = X[train_index, :], X[test_index, :]\n", "    y_tr, y_tt = y[train_index], y[test_index]\n", "\n", "    # Inner split on data within outer split\n", "    inner_cv = KFold(n_splits=inner_cv_splits,\n", "                     shuffle=True,\n", "                     random_state=0)\n", "\n", "    clf = GridSearchCV(\n", "        estimator=model,\n", "        param_grid=param_grid,\n", "        scoring=metric,\n", "        n_jobs=ncpus,\n", "        refit=True,\n", "        cv=inner_cv.split(X_tr, y_tr),\n", "    )\n", "\n", "    clf.fit(X_tr, y_tr)\n", "\n", "    # Predict using the best model\n", "    best_model = clf.best_estimator_\n", "    pred = best_model.predict(X_tt)\n", "\n", "    # Evaluate model w/ multiple metrics\n", "    # ROC AUC\n", "    probs = best_model.predict_proba(X_tt)\n", "    probs = probs[:, 1]\n", "    fpr, tpr, thresholds = roc_curve(y_tt, probs)\n", "    auc_ = auc(fpr, tpr)\n", "    roc_auc.append(auc_)\n", "\n", "    # Balanced accuracy\n", "    ac = balanced_accuracy_score(y_tt, pred)\n", "    acc.append(ac)\n", "\n", "    # F1 scores\n", "    f1_ = f1_score(y_tt, pred)\n", "    f1.append(f1_)\n", "    i += 1\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Print the results of our model evaluation" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== Nested K-Fold Cross-Validation Scores ===\n", "Mean balanced accuracy: 0.97\n", "Std balanced accuracy: 0.02\n", "\n", "\n", "Mean F1: 0.98\n", "Std F1: 0.01\n", "\n", "\n", "Mean roc_auc: 0.992\n", "Std roc_auc: 0.01\n", "=============================================\n" ] } ], "source": [ "print(\"=== Nested K-Fold Cross-Validation Scores ===\")\n", "print(\"Mean balanced accuracy: \"+ str(round(np.mean(acc), 2)))\n", "print(\"Std balanced 
accuracy: \"+ str(round(np.std(acc), 2)))\n", "print('\\n')\n", "print(\"Mean F1: \"+ str(round(np.mean(f1), 2)))\n", "print(\"Std F1: \"+ str(round(np.std(f1), 2)))\n", "print('\\n')\n", "print(\"Mean roc_auc: \"+ str(round(np.mean(roc_auc), 3)))\n", "print(\"Std roc_auc: \"+ str(round(np.std(roc_auc), 2)))\n", "print('=============================================')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These scores represent a robust estimate of the accuracy of our classifier. However, because we are using only a subset of data to fit and optimize the models, and the total amount of training data we have is small, it is reasonable to expect these scores are a modest under-estimate of the final model's accuracy. \n", "\n", "Also, its possible the _map_ accuracy will differ from the accuracies reported here since training data is not always a perfect representation of the data in the real world. For example, we may have purposively over-sampled from hard-to-classify regions, or the proportions of classes in our dataset may not match the proportions in the real world. This point underscores the importance of conducting a rigorous and independent map validation, rather than relying on cross-validation scores. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optimize hyperparameters\n", "\n", "Machine learning models require certain 'hyperparameters': model parameters that can be tuned to increase the prediction ability of a model. Finding the best values for these parameters is a 'hyperparameter search' or an 'hyperparameter optimization'.\n", "\n", "To optimize the parameters in our model, we use [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to exhaustively search through a set of parameters and determine the combination that will result in the highest accuracy based upon the accuracy metric defined.\n", "\n", "We'll search the same set of parameters that we definied earlier, `param_grid`.\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Generate n_splits of train-test_split\n", "rs = ShuffleSplit(n_splits=outer_cv_splits, test_size=test_size, random_state=0)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 5 folds for each of 36 candidates, totalling 180 fits\n", "\n", "\n", "The most accurate combination of tested parameters is: \n", "{'class_weight': 'balanced',\n", " 'criterion': 'gini',\n", " 'max_features': 'auto',\n", " 'n_estimators': 100}\n", "\n", "\n", "The balanced_accuracy score using these parameters is: \n", "0.98\n" ] } ], "source": [ "# Instatiate a gridsearchCV\n", "clf = GridSearchCV(Classifier(),\n", " param_grid,\n", " scoring=metric,\n", " verbose=1,\n", " cv=rs.split(X, y),\n", " n_jobs=ncpus)\n", "\n", "clf.fit(X, y)\n", "\n", "print('\\n')\n", "print(\"The most accurate combination of tested parameters is: \")\n", "pprint(clf.best_params_)\n", "print('\\n')\n", "print(f\"The {metric} score using these parameters is: \")\n", "print(round(clf.best_score_, 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fit a model\n", "\n", "Using the best parameters from our hyperparmeter optimization search, we now fit our model on all the data." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier(class_weight='balanced', n_jobs=2, random_state=1)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a new model\n", "new_model = Classifier(**clf.best_params_, random_state=1, n_jobs=ncpus)\n", "new_model.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save the model\n", "\n", "Running this cell will export the classifier as a binary`.joblib` file. This will allow for importing the model in the subsequent script, `4_Classify_satellite_data.ipynb` \n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['results/ml_model.joblib']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dump(new_model, 'results/ml_model.joblib')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recommended next steps\n", "\n", "To continue working through the notebooks in this `Scalable Machine Learning on the ODC` workflow, go to the next notebook `4_Classify_satellite_data.ipynb.ipynb`.\n", "\n", "1. [Extracting training data from the ODC](1_Extract_training_data.ipynb) \n", "2. [Inspecting training data](2_Inspect_training_data.ipynb)\n", "3. **Evaluate, optimize, and fit a classifier (this notebook)**\n", "4. [Classifying satellite data](4_Classify_satellite_data.ipynb)\n", "5. [Object-based filtering of pixel classifications](5_Object-based_filtering.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "## Additional information\n", "\n", "**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). \n", "Digital Earth Australia data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.\n", "\n", "**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).\n", "If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com/GeoscienceAustralia/dea-notebooks).\n", "\n", "**Last modified:** December 2023\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tags\n", "" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "**Tags**: :index:`machine learning`, :index:`SKCV`, :index:`clustering`, :index:`hyperparameters`, :index:`Random Forest`" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }