Evaluate, optimize, and fit a classifier 1c8913bdd2974da29b5fe6f8ca0f5e66

  • Sign up to the DEA Sandbox to run this notebook interactively from a browser

  • Compatibility: Notebook currently compatible with the DEA Sandbox environment

Background

Now that we’ve extracted training data from the ODC, and inspected it to ensure the features we selected are appropriate and useful, we can train a machine learning model. The first step is to decide which machine learning model to use. Deciding which one to pick depends on the classification task at-hand. The table below provides a useful summary of the pros and cons of different models (all of which are available through scikit-Learn). This sckit-learn cheat sheet may also help.

Table 1: Some of the pros and cons of different classifiers available through scikit-learn

3114f8e7bd814dd793d4fc383697f25e

The approach to evaluating, optimizing, and training the supervised machine learning model demonstrated in this notebook has been developed to suit the default training dataset provided. The training dataset is small, contains geospatial data, and contains only two classes (crop and non-crop).

  • Because the dataset is relatively small (n=430) as shown in the Extract_training_data notebook), splitting the data into a training and testing set, and only training the model on the smaller training set would likely substantially degrade the quality of the model. Thus we will fit the final model on all the training data.

  • Because we are fitting the model on all the data, we won’t have a testing set left over to estimate the model’s prediction accuracy. We therefore rely on a method called nested k-fold cross-validation to estimate the prediction ability of our model. This method is described further in the markdown before the code.

  • Because we are dealing with geospatial datasets, we conduct our nested k-fold cross validation using a method called spatial k-fold cross validation (SKCV), where the data is first grouped into spatial clusters before the training and testing splits are conducted. This ensures that the testing samples do not come from the same spatial group as the training samples, and therefore controls for spatial auto-correlation (the principle that things closer together are more correlated than things further apart). Again, more information on this is provided below.

  • And because we are generating a binary prediction (crop/non-crop), the metrics used to evaluate the classifier are those which are well suited to binary classifications.

While the approach described above works well for the default training data provided, it may not suit your own classification problem. It is advisable to research the different methods for evaluating and training a model to determine which approach is best for you.

Description

This notebook runs through evaluating, optimizing, and fitting a machine learning classifier (in the default example, a Random Forest model is used). Under each of the sub-headings you will find more information on how and why each method is used. The steps are as follows:

  1. Demonstrate how to group the training data into spatial clusters to assist with Spatial K-fold Cross Validation (SKCV)

  2. Calculate an unbiased performance estimate via nested cross-validation

  3. Optimize the hyperparameters of the model

  4. Fit a model to all the training data using the best hyperparameters identified in the previous step

  5. Save the model to disk for use in the subsequent notebook, 4_Classify_satellite_data.ipynb


Getting started

To run this analysis, run all the cells in the notebook, starting with the “Load packages” cell.

Load packages

[1]:
# -- scikit-learn classifiers, uncomment the one of interest----

# from sklearn.svm import SVC
# from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# from sklearn.naive_bayes import GaussianNB
# from sklearn.linear_model import LogisticRegression
# from sklearn.neighbors import KNeighborsClassifier

import os
import sys
import joblib
import numpy as np
import pandas as pd
from joblib import dump
import subprocess as sp
import dask.array as da
from pprint import pprint
import matplotlib.pyplot as plt
from odc.io.cgroups import get_cpu_quota
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve, auc, balanced_accuracy_score

sys.path.append('../../Scripts')
from dea_plotting import map_shapefile
from dea_bandindices import calculate_indices
from dea_classificationtools import spatial_clusters, SKCV, HiddenPrints
/env/lib/python3.6/site-packages/geopandas/_compat.py:88: UserWarning: The Shapely GEOS version (3.7.2-CAPI-1.11.0 ) is incompatible with the GEOS version PyGEOS was compiled with (3.9.0-CAPI-1.16.2). Conversions between both will be slow.
  shapely_geos_version, geos_capi_version_string
/env/lib/python3.6/site-packages/datacube/storage/masking.py:8: DeprecationWarning: datacube.storage.masking has moved to datacube.utils.masking
  category=DeprecationWarning)

Analysis Parameters

  • training_data: Name and location of the training data .txt file output from runnning 1_Extract_training_data.ipynb

  • coordinate_data: Name and location of the coordinates data .txt file output from runnning 1_Extract_training_data.ipynb

  • Classifier: This parameter refers to the scikit-learn classification model to use, first uncomment the classifier of interest in the Load Packages section and then enter the function name into this parameter e.g. Classifier = SVC

  • metric : A single string that denotes the scorer used to find the best parameters for refitting the estimator to evaluate the predictions on the test set. See the scoring parameter page here for a pre-defined list of options. e.g. metric='balanced_accuracy'

  • outer_cv_splits : The number of cross validation splits to use for the outer loop of the nested SKCV. These splits are used to estimate the accuracy of the classifier. A good default number is 5-10

  • inner_cv_splits : The number of cross validation splits to use for the inner loop of the nested SKCV - the inner loop splits are used for optimizing the hyperparameters. A good default number is 5.

[2]:
training_data = "results/test_training_data.txt"

coordinate_data = "results/training_data_coordinates.txt"

Classifier = RandomForestClassifier

metric = 'balanced_accuracy'

inner_cv_splits = 5

outer_cv_splits = 5

Spatial K-Fold Cross Validation Analysis Parameters

The DEA function SKCV (spatial k-fold cross validation) is a custom function similar to sklearn.model_selection.KFold but instead works on spatial coordinate data. Grouping the training data by spatial clusters is preferred over plain random splits for spatial data to avoid overestimating validation scores due to the inherent autocorrelation of spatial data. To learn more about why we need to account for spatial autocorrelation, check out this paper.

Coordinate data is grouped according to either a Gaussian Mixture, KMeans, or Hierarchical clustering algorithm. To use SKCV we need to set more parameters:

  • test_size : This will determine what fraction of the dataset will be set aside as the testing dataset. There is a trade-off here between having a larger test set that will help us better determine the quality of our classifier, and leaving enough data to train the classifier. A good deafult is to set 10-20 % of your dataset aside for testing purposes.

  • cluster_method : Which algorithm to use to create spatial groups, either 'Hierarchical', 'GMM' or 'KMeans'. Key word arguments for these algorithms can also be passed to the function if non-default behaviour is required. See the docs for GMM, Kmeans and Hierarchical for options.

  • max_distance: This parameter is used when using the ‘Hierarchical’ clustering method. The maximum distance describes the maximum Euclidean distances between all observations in a cluster. The units of distance depend on the map projection of the coordinate data. e.g if using a UTM projection, then the max_distance is in metres, so to set the maximum distance of each cluster to 200 km, max_distance=200000.

  • n_clusters : Number of spatial clusters to create using the coordinate values of the training data samples. This option only applies if using the ‘GMM’ or ‘Kmeans’ clustering methods, if using the ‘Hierarchical’ method then the number of clusters is determined automatically using the maximum distance threshold.

  • kfold_method : Which strategy to use to split to the data. One of either 'SpatialShuffleSplit' or 'SpatialKFold'.

  • balance : if setting kfold_method to 'SpatialShuffleSplit' this should be an integer (10 is a good default) that represents the number of splits generated per iteration to try to balance the amount of data in each set so that test_size and train_size are respected. If setting kfold_method to 'SpatialKFold' then this value should be either True or False. If False, each fold will have the same number of clusters (which can have different number of data points in them).

[3]:
test_size = 0.15

cluster_method = 'Hierarchical'

max_distance = 750000  # = 750 km

n_clusters = None

kfold_method = 'SpatialShuffleSplit'

balance = 10

Find the number of cpus

[4]:
ncpus = round(get_cpu_quota())
print('ncpus = ' + str(ncpus))
ncpus = 7

Import training and coordinate data

[5]:
# load the data
model_input = np.loadtxt(training_data)
coordinates = np.loadtxt(coordinate_data)

# load the column_names
with open(training_data, 'r') as file:
    header = file.readline()

column_names = header.split()[1:]

# Extract relevant indices from training data
model_col_indices = [
    column_names.index(var_name) for var_name in column_names[1:]
]

#convert variable names into sci-kit learn nomenclature
X = model_input[:, model_col_indices]
y = model_input[:, 0]

Group training data into spatial clusters

After setting the parameters, let’s first generate spatial clusters to visualize how our data will be grouped for the SKCV. You may want to refine the parameters to achieve a grouping that works for your dataset by resetting the parameters above.

[6]:
# Create clusters
spatial_groups = spatial_clusters(coordinates=coordinates,
                                  method=cluster_method,
                                  max_distance=max_distance,
                                  n_groups=n_clusters,
                                  verbose=True)
n clusters = 9

In the plot below, points that are coloured the same are assigned to the same group; training and testing points will not fall within the same group during the k-fold cross validation.

[7]:
# Plot
plt.figure(figsize=(4, 5))
plt.scatter(coordinates[:, 0],
            coordinates[:, 1],
            c=spatial_groups,
            s=50,
            cmap='viridis')
plt.title('Spatial clusters of training data')
plt.ylabel('y')
plt.xlabel('x');
../../../_images/notebooks_Real_world_examples_Scalable_machine_learning_3_Evaluate_optimize_fit_classifier_18_0.png

Calculate an unbiased performance estimate via nested cross-validation

K-fold cross-validation is a statistical method used to estimate the performance of machine learning models when making predictions on data not used during training. It is a popular method because it is conceptually straightforward and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

This procedure can be used both when optimizing the hyperparameters of a model on a dataset, and when comparing and selecting a model for the dataset. However, when the same cross-validation procedure and dataset are used to both tune and select a model, it is likely to lead to an optimistically biased evaluation of the model performance.

One approach to overcoming this bias is to nest the hyperparameter optimization procedure under the model selection procedure. This is called nested cross-validation. The paper here provides more context to this issue. The image below depicts how the nested cross-validation works.

d4b3cdfeab7e4a1893a7321deddefcca

The result of our nested cross-validation will be a set of accuracy scores that show how well our classifier is doing at recognising unseen data points. The default example is set up to show the balanced_accuracy score, and the Receiver-Operating Curve, Area Under the Curve (ROC-AUC). This latter metric is a robust measure of a classifier’s prediction ability. This article has a good explanation on ROC-AUC, which is a common machine learning metric.

Both measures return a value between 0 and 1, with a value of 1 indicating a perfect score.

To conduct the nested cross-validation, we first need to define a grid of parameters to be used in the optimization: * param_grid: a dictionary of model specific parameters to search through during hyperparameter optimization.

Note: the parameters in the param_grid object depend on the classifier being used. The default example is set up for a Random Forest classifier, to adjust the parameters to suit a different classifier, look up the important parameters under the relevant sklearn documentation.

[8]:
# Create the parameter grid based on the results of random search
param_grid = {
    'class_weight': ['balanced', None],
    'max_features': ['auto', 'log2', None],
    'n_estimators': [200, 300, 400],
    'criterion': ['gini', 'entropy']
}
[9]:
# Create outer k-fold splits
outer_cv = SKCV(
    coordinates=coordinates,
    max_distance=max_distance,
    n_splits=outer_cv_splits,
    cluster_method=cluster_method,
    kfold_method=kfold_method,
    test_size=test_size,
    balance=balance,
)

# Lists to store results of CV testing
acc = []
roc_auc = []

# Loop through outer splits and test predictions
i = 1
for train_index, test_index in outer_cv.split(coordinates):
    print('working on ' + str(i) + '/' + str(outer_cv_splits) +
          ' outer cv split',
          end='\r')
    model = Classifier(random_state=1)

    # Index training, testing, and coordinate data
    X_tr, X_tt = X[train_index, :], X[test_index, :]
    y_tr, y_tt = y[train_index], y[test_index]
    coords = coordinates[train_index]

    # Inner split on data within outer split
    inner_cv = SKCV(
        coordinates=coords,
        max_distance=max_distance,
        n_splits=inner_cv_splits,
        cluster_method=cluster_method,
        kfold_method=kfold_method,
        test_size=test_size,
        balance=balance,
    )

    # Optimize hyperparameters using innner splits
    clf = GridSearchCV(
        estimator=model,
        param_grid=param_grid,
        scoring=metric,
        n_jobs=ncpus,
        cv=inner_cv.split(coords),
        refit=True,
    )

    clf.fit(X_tr, y_tr)

    # Predict using the best model
    best_model = clf.best_estimator_
    pred = best_model.predict(X_tt)

    # Evaluate model w/ multiple metrics
    # ROC AUC
    probs = best_model.predict_proba(X_tt)
    probs = probs[:, 1]
    fpr, tpr, thresholds = roc_curve(y_tt, probs)
    auc_ = auc(fpr, tpr)
    roc_auc.append(auc_)

    # Overall accuracy
    ac = balanced_accuracy_score(y_tt, pred)
    acc.append(ac)

    # Loop counter
    i += 1
working on 5/5 outer cv split
[10]:
print("=== Nested Spatial K-Fold Cross-Validation Scores ===")
print("Mean balanced accuracy: " + str(round(np.mean(acc), 2)))
print("Std balanced accuracy: " + str(round(np.std(acc), 2)))
print('\n')
print("Mean roc_auc: " + str(round(np.mean(roc_auc), 3)))
print("Std roc_auc: " + str(round(np.std(roc_auc), 2)))
print('=======')
=== Nested Spatial K-Fold Cross-Validation Scores ===
Mean balanced accuracy: 0.93
Std balanced accuracy: 0.02


Mean roc_auc: 0.982
Std roc_auc: 0.01
=======

These scores represent a robust estimate of the accuracy of our classifier. However, because we are using only a subset of data to fit and optimize the models, and the total amount of training data we have is small, it is reasonable to expect these scores are a modest under-estimate of the final model’s accuracy.

Also, its possible the map accuracy will differ from the accuracies reported here since training data is not always a perfect representation of the data in the real world. For example, we may have purposively over-sampled from hard-to-classify regions, or the proportions of classes in our dataset may not match the proportions in the real world. This point underscores the importance of conducting a rigorous and independent map validation, rather than relying on cross-validation scores.

Optimize hyperparameters

Machine learning models require certain ‘hyperparameters’: model parameters that can be tuned to increase the prediction ability of a model. Finding the best values for these parameters is a ‘hyperparameter search’ or an ‘hyperparameter optimization’.

To optimize the parameters in our model, we use GridSearchCV to exhaustively search through a set of parameters and determine the combination that will result in the highest accuracy based upon the accuracy metric defined.

We’ll search the same set of parameters that we definied earlier, param_grid.

[11]:
# Generate n_splits of train-test_split
ss = SKCV(coordinates=coordinates,
          max_distance=max_distance,
          n_groups=n_clusters,
          n_splits=outer_cv_splits,
          cluster_method=cluster_method,
          kfold_method=kfold_method,
          test_size=test_size,
          balance=balance)
[12]:
# Instatiate a gridsearchCV
clf = GridSearchCV(Classifier(),
                   param_grid,
                   scoring=metric,
                   verbose=1,
                   cv=ss.split(coordinates),
                   n_jobs=ncpus)

clf.fit(X, y)

print('\n')
print("The most accurate combination of tested parameters is: ")
pprint(clf.best_params_)
print('\n')
print("The " + metric + " score using these parameters is: ")
print(round(clf.best_score_, 2))
Fitting 5 folds for each of 36 candidates, totalling 180 fits


The most accurate combination of tested parameters is:
{'class_weight': None,
 'criterion': 'gini',
 'max_features': None,
 'n_estimators': 400}


The balanced_accuracy score using these parameters is:
0.94

Fit a model

Using the best parameters from our hyperparmeter optimization search, we now fit our model on all the data.

[13]:
# Create a new model
new_model = Classifier(**clf.best_params_, random_state=1, n_jobs=ncpus)
new_model.fit(X, y)
[13]:
RandomForestClassifier(max_features=None, n_estimators=400, n_jobs=7,
                       random_state=1)

Save the model

Running this cell will export the classifier as a binary.joblib file. This will allow for importing the model in the subsequent script, 4_Classify_satellite_data.ipynb

[14]:
dump(new_model, 'results/ml_model.joblib')
[14]:
['results/ml_model.joblib']

Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Australia data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on Github.

Last modified: March 2021

Tags

Browse all available tags on the DEA User Guide’s Tags Index

Tags: machine learning, SKCV, clustering, hyperparameters, Random Forest