Random forest classification

What does this notebook do?

This notebook classifies remote sensing data using a random forest classifier. Key features include efficient import of training data from a set of point, line or polygon shapefiles (data is extracted from each shapefile separately to avoid slow dc.load calls over large areas), and flexible, consistent selection of training and analysis data via shared import functions (ensuring training data is consistent with analysis data). The notebook exports GeoTIFF classification outputs and a series of evaluation figures to help fine-tune the classifier.

Requirements:

You need to run the following commands from the command line prior to launching Jupyter from the same terminal, so that the required libraries and paths are set:

module use /g/data/v10/public/modules/modulefiles

module load dea

This notebook also uses four external functions: randomforest_train, randomforest_classify, randomforest_eval and load_nbarx. These functions are available in the 10_Scripts folder of the dea-notebooks GitHub repository. To ensure these functions and all other file paths load correctly, it is recommended that you download or clone the entire dea-notebooks repository onto your computer, then launch this notebook from inside the cloned directory; please see the repository readme for instructions.

Note that these functions have been developed by DEA users, not the DEA development team, and are provided without warranty. If you find an error or bug in the functions or notebook, please either create an 'Issue' in the GitHub repository, or fix it yourself and create a 'Pull request' to contribute the updated notebook back into the repository (see the repository README for instructions on creating a Pull request).

Date: August 2018

Author: Robbi Bishop-Taylor

Tags: image_classification, machine_learning, random_forest, Landsat8, mangroves, tasseled_cap, Scripts, water_classification

Import modules and define functions

[1]:
# Load modules
import datacube
import os
import sys
import warnings
import pandas as pd
import xarray as xr
import numpy as np
import matplotlib.image as mpimg
from datacube.utils import geometry
from datacube.utils.geometry import CRS
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz
from datacube_stats.statistics import GeoMedian

# Import external functions from dea-notebooks using relative link to 10_Scripts
sys.path.append('../10_Scripts')
from ClassificationTools import randomforest_train
from ClassificationTools import randomforest_classify
from ClassificationTools import randomforest_eval
from DEADataHandling import load_nbarx

# Set up datacube instance
dc = datacube.Datacube(app = 'Random forest classification')

Function used to import training and analysis data

This function imports datacube data using a query, and returns an xarray dataset with a single time-step, multiple bands/variables, and crs and affine attributes. This format is required as an input to both randomforest_train and randomforest_classify, ensuring that training and analysis data are consistent. The function can be replaced with a custom function of your own, as long as it returns data in the same format (one time-step, multiple bands, crs and affine attributes); a geomedian-based example is sketched after the cell below.

[3]:

def nbart_import(query):

    '''
    Takes median of a set of nbart Landsat data; could be replaced with geomedian or any other
    temporal aggregation method

    :param query: query for datacube call; for training, supply only
    non-spatial queries, as spatial queries are generated from the training data
    :returns: xarray dataset with 'crs' and 'affine' attributes
    '''

    # Import cleaned Landsat bands data
    nbart_data, _, _ = load_nbarx(dc, sensor='ls8', query=query,
                                  bands_of_interest=['red', 'green', 'blue', 'nir', 'swir1', 'swir2'])

    # Combine into one temporally aggregated layer
    aggregated_data = nbart_data.median(dim = 'time', keep_attrs = True)

    return aggregated_data
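The cell below sketches one such alternative, substituting the GeoMedian statistic (already imported above from datacube_stats) for the simple median. This is a hedged example rather than a tested recipe: it assumes the GeoMedian class exposes a compute() method that accepts a time-stacked xarray dataset, and it copies the attributes across manually so 'crs' and 'affine' survive the aggregation.

[ ]:

def nbart_geomedian_import(query):

    '''
    Alternative data function that aggregates nbart Landsat data using a
    geometric median instead of a simple per-band median. Example sketch only:
    assumes GeoMedian().compute() accepts a time-stacked xarray dataset.

    :param query: query for datacube call; for training, supply only
    non-spatial queries, as spatial queries are generated from the training data
    :returns: xarray dataset with 'crs' and 'affine' attributes
    '''

    # Import cleaned Landsat bands data
    nbart_data, _, _ = load_nbarx(dc, sensor='ls8', query=query,
                                  bands_of_interest=['red', 'green', 'blue', 'nir', 'swir1', 'swir2'])

    # Combine into one temporally aggregated layer using a geometric median
    aggregated_data = GeoMedian().compute(nbart_data)

    # Copy attributes across so the output keeps 'crs' and 'affine'
    aggregated_data.attrs = nbart_data.attrs

    return aggregated_data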

Set up analysis

Defines parameters used for analysis. This notebook demonstrates the classification of water vs non-water using median Landsat nbart data.

[4]:
# List of training files to import. Shapefiles can be either points, lines or polygons,
# but must be in the same projection system as the remote sensing dataset being analysed.
# Each file should cover a small enough spatial area so as not to slow the dc.load
# function excessively (e.g. 30 x 30 km max). The class of each feature should be
# given in a shapefile column (e.g. `class`) and specified in `randomforest_train`
# (i.e. `train_field = 'class'`). A quick CRS check is sketched after this cell.
train_shps = ['../Supplementary_data/Files/training_data_waternonwater.shp',
              '../Supplementary_data/Files/training_data_waternonwater2.shp']

# Output path for classified geotiff
classification_output = 'classification_dc_waternonwater.tif'

# Optional dict to re-map training shapefile classes; useful for combining classes
# (e.g. `train_reclass = {3:2}` re-maps class 3 to class 2)
train_reclass = None

# Names of output classification classes
classification_names = ['water', 'nonwater']

# Set data function used to import data and optional parameters (e.g. time for temporal data).
# For example, 'tc_import' needs an additional 'time' query as it draws on Landsat time-series
# data, while 'hltc_import' uses high-low tide composites that have no temporal dimension
data_func = nbart_import
data_func_params = {'time': ('2017-03-01', '2017-06-28')}
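Because each training shapefile must share the projection of the analysis data, it can be worth confirming this before training. The cell below is a minimal sketch that assumes geopandas is available in the environment (it is not imported above):

[ ]:

import geopandas as gpd

# Print the coordinate reference system of each training shapefile so it can
# be compared against the projection of the remote sensing data being analysed
for shp in train_shps:
    print('{}: {}'.format(shp, gpd.read_file(shp).crs))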

Import training data and fit model

Uses randomforest_train to extract training data from one or more training shapefiles, and returns a trained classifier (and, optionally, arrays of training labels and training samples).

[5]:
# Dict of classifier parameters
classifier_params = {'n_jobs': -1,
                     'n_estimators': 100,
                     'max_features': 'auto',
                     'min_samples_leaf': 1,
                     'oob_score': True }

# Extract training data for each training shapefile and train classifier
classifier, train_lab, train_samp = randomforest_train(train_shps = train_shps,
                                                       train_field = 'class',
                                                       data_func = data_func,
                                                       data_func_params = data_func_params,
                                                       classifier_params = classifier_params,
                                                       train_reclass = train_reclass)

Importing training data from ../Supplementary_data/Files/training_data_waternonwater.shp:
{'x': (130.80269773885783, 131.04984673894958), 'y': (-12.550884427041208, -12.351964574328044), 'crs': 'epsg:4326', 'time': ('2017-03-01', '2017-06-28')}
Loading ls8_nbart_albers
Loaded ls8_nbart_albers
Generating mask ls8_pq_albers

Training random forest classifier...
Model trained on 6 bands and 88469 training samples
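Because oob_score = True was included in classifier_params, the trained model carries an out-of-bag accuracy estimate (computed on the samples left out of each bootstrap), which gives a quick sanity check before classifying a full scene:

[ ]:

# Print the out-of-bag accuracy estimate of the trained classifier
print('OOB score: {:.3f}'.format(classifier.oob_score_))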

Import analysis data and classify

Classifies and exports an analysis dataset using a previously trained random forest classifier, provided the dataset has the same number of bands/variables as the data used to train the classifier. Using the same data function that trained the classifier (e.g. data_func, previously defined as nbart_import) will ensure this is the case. By setting class_prob = True, you can optionally export a GeoTIFF of predicted class probabilities in addition to the classification output; a sketch that uses these probabilities to screen out uncertain pixels follows the outputs below.

[6]:
# Set up analysis data query
lat_point, lon_point, buffer = -12.5798399926, 130.782907919, 30000
x, y = geometry.point(lon_point, lat_point, CRS('WGS84')).to_crs(CRS('EPSG:3577')).points[0]
query = {'x': (x - buffer, x + buffer),
         'y': (y - buffer, y + buffer),
         'crs': 'EPSG:3577',
          **data_func_params}

# Load data from datacube
analysis_xarray = data_func(query)

# Run classification and export to file
class_array, prob_array = randomforest_classify(classifier = classifier,
                                                analysis_data = analysis_xarray,
                                                classification_output = classification_output,
                                                class_prob = True)

# Plot output classification
class_xarray = xr.DataArray(class_array,
                   coords = [analysis_xarray.y, analysis_xarray.x],
                   dims = ['y', 'x'],
                   name = 'Classification output')
class_xarray.plot(levels = list(np.unique(class_array)) + [np.max(class_array) + 1],
                  figsize = (8, 8))

# Plot predicted class probability (proportion of trees agreeing with classification)
prob_xarray = xr.DataArray(prob_array,
                           coords = [analysis_xarray.y, analysis_xarray.x],
                           dims = ['y', 'x'],
                           name = 'Predicted class probability')
prob_xarray.plot(cmap = 'plasma_r',
                 vmin = np.percentile(prob_array, 3),
                 vmax = np.percentile(prob_array, 97),
                 figsize = (8, 8))

Loading ls8_nbart_albers
Loaded ls8_nbart_albers
Generating mask ls8_pq_albers
Data to classify:
  Rows: 2401
  Columns: 2401
  Bands: 6

Classification processing...
  9467 nodata cells removed
  Classification exported

Class probability processing...
  Class probability exported
[6]:
<matplotlib.collections.QuadMesh at 0x7fcf399be780>
[Output figures: classification map and predicted class probability map]
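The class probability layer can also be used to screen out uncertain pixels. The cell below is a rough sketch; the 60% threshold is arbitrary, and it assumes prob_array reports the percentage of trees that voted for the winning class:

[ ]:

# Mask out pixels where fewer than 60% of trees agreed on the winning class
# (assumes prob_array is expressed as a percentage; adjust if it ranges 0-1)
confident_class = np.where(prob_array >= 60, class_array, 0)
xr.DataArray(confident_class,
             coords = [analysis_xarray.y, analysis_xarray.x],
             dims = ['y', 'x'],
             name = 'Confident classification').plot(figsize = (8, 8))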

Feature/band/variable importance

Extracts the classifier's estimates of the relative importance of each band/variable to the trained model. This is useful for selecting a subset of input bands/variables for model training/classification (i.e. optimising the feature space); see the sketch after the output below.

[7]:
#  Extract feature importances from trained classifier
importance = classifier.feature_importances_
importance_df = pd.DataFrame({'variable': list(analysis_xarray.data_vars),
                              'importance': importance})
importance_df.set_index("variable", inplace = True)
importance_df.plot.bar(title = "Variable importance (global)")
display(importance_df)

variable    importance
red           0.007803
green         0.024393
blue          0.019284
nir           0.420742
swir1         0.345617
swir2         0.182160
[Output figure: variable importance bar chart]
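These importances can feed directly into feature selection. For example, the short sketch below ranks the bands and keeps the three most informative (the cut-off of three is arbitrary); a reduced band list could then be passed to a custom data function via bands_of_interest:

[ ]:

# Rank bands by importance and keep the three most informative
top_bands = importance_df.sort_values('importance', ascending = False).head(3)
print('Most informative bands: ' + ', '.join(top_bands.index))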

Plot performance of model by parameter values

Random forest classifiers contain many modifiable parameters that can strongly affect the performance of the model. This section evaluates the effect of these parameters by plotting out-of-bag (OOB) error for a set of classifier parameter scenarios, and exports the resulting plots to file.

Setting max_estimators = 300 is usually a good starting point, but this may be slow for large training sets.

[ ]:
# Test effect of max features
classifier_scenario1 = [("max_features = 'sqrt'",
                        RandomForestClassifier(warm_start = True,
                                               oob_score = True,
                                               max_features = "sqrt")),
                       ("max_features = 'log2'",
                        RandomForestClassifier(warm_start = True,
                                               oob_score = True,
                                               max_features = "log2")),

                       ("max_features = '0.1'",
                        RandomForestClassifier(warm_start = True,
                                               max_features = 0.1,
                                               oob_score = True)),

                       ("max_features = '0.5'",
                        RandomForestClassifier(warm_start = True,
                                               max_features = 0.5,
                                               oob_score = True)),
                       ("max_features = None",
                        RandomForestClassifier(warm_start = True,
                                               max_features = None,
                                               oob_score = True))]

# Test effect of minimum samples per leaf
classifier_scenario2 = [("Leaf = 1",
                         RandomForestClassifier(warm_start = True,
                                                min_samples_leaf = 1,
                                                oob_score = True)),
                       ("Leaf = 10",
                        RandomForestClassifier(warm_start = True,
                                               min_samples_leaf = 10,
                                               oob_score = True)),
                       ("Leaf = 20",
                        RandomForestClassifier(warm_start = True,
                                               min_samples_leaf = 20,
                                               oob_score = True)),
                       ("Leaf = 40",
                        RandomForestClassifier(warm_start = True,
                                               min_samples_leaf = 40,
                                               oob_score = True))]

# Test effect of max depth
classifier_scenario3 = [("Max depth = 1",
                         RandomForestClassifier(warm_start = True,
                                                max_depth = 1,
                                                oob_score = True)),
                       ("Max depth = 2",
                        RandomForestClassifier(warm_start = True,
                                               max_depth = 2,
                                               oob_score = True)),
                       ("Max depth = 5",
                        RandomForestClassifier(warm_start = True,
                                               max_depth = 5,
                                               oob_score = True)),
                       ("Max depth = 10",
                        RandomForestClassifier(warm_start = True,
                                               max_depth = 10,
                                               oob_score = True))]

# Test effect of max leaf node
classifier_scenario4 = [("Max leaf node = 5",
                         RandomForestClassifier(warm_start = True,
                                                max_leaf_nodes = 5,
                                                oob_score = True)),
                       ("Max leaf node = 10",
                        RandomForestClassifier(warm_start = True,
                                               max_leaf_nodes = 10,
                                               oob_score = True)),
                       ("Max leaf node = 20",
                        RandomForestClassifier(warm_start = True,
                                               max_leaf_nodes = 20,
                                               oob_score = True)),
                       ("Max leaf node = 40",
                        RandomForestClassifier(warm_start = True,
                                               max_leaf_nodes = 40,
                                               oob_score = True))]

# Test effect of minimum samples per split
classifier_scenario5 = [("Min samples split = 5",
                         RandomForestClassifier(warm_start = True,
                                                min_samples_split = 5,
                                                oob_score = True)),
                       ("Min samples split = 10",
                        RandomForestClassifier(warm_start = True,
                                               min_samples_split = 10,
                                               oob_score = True)),
                       ("Min samples split = 20",
                        RandomForestClassifier(warm_start = True,
                                               min_samples_split = 20,
                                               oob_score = True)),
                       ("Min samples split = 40",
                        RandomForestClassifier(warm_start = True,
                                               min_samples_split = 40,
                                               oob_score = True))]

# Produce figures and export plots for each set of classification scenarios
for i, classifier_scenario in enumerate([classifier_scenario1,
                                         classifier_scenario2,
                                         classifier_scenario3,
                                         classifier_scenario4,
                                         classifier_scenario5]):

    # Plot OOB error by classifier scenario
    print('Processing: ' + ', '.join([item[0] for item in classifier_scenario]))
    randomforest_eval(training_labels = train_lab,
                      training_samples = train_samp,
                      classifier_scenario = classifier_scenario,
                      output_path = "figures/random_forest_params_{}.png".format(i + 1),
                      max_estimators = 50)

Export tree diagrams

Export .png plots of each decision tree in the random forest ensemble. Useful for inspecting the splits used by the classifier to classify the data.

[ ]:
# Plot output random forest trees to file
for n, tree_in_forest in enumerate(classifier.estimators_):

    # Create graph and save to dot file
    export_graphviz(tree_in_forest,
                    out_file = "figures/tree_graphs/tree.dot",
                    feature_names = list(analysis_xarray.data_vars),
                    class_names = classification_names,
                    filled = True,
                    rounded = True)

    # Plot as figure
    os.system('dot -Tpng figures/tree_graphs/tree.dot -o ' + \
              'figures/tree_graphs/tree' + str(n + 1) + '.png')

# Plot first resulting tree
img = mpimg.imread('figures/tree_graphs/tree1.png')
plt.figure(figsize = (15, 12))
plt.imshow(img, interpolation = "bilinear")
plt.show()
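Full-depth trees from a large forest are often too complex to read. One workaround, sketched below, is to fit a deliberately shallow forest on the same training data purely for visualisation (the choice of max_depth = 3 and n_estimators = 10 is arbitrary):

[ ]:

# Fit a shallow forest so the exported tree diagram stays readable
shallow_classifier = RandomForestClassifier(n_estimators = 10, max_depth = 3)
shallow_classifier.fit(train_samp, train_lab)

# Export the first shallow tree using the same graphviz workflow as above
export_graphviz(shallow_classifier.estimators_[0],
                out_file = "figures/tree_graphs/tree_shallow.dot",
                feature_names = list(analysis_xarray.data_vars),
                class_names = classification_names,
                filled = True,
                rounded = True)
os.system('dot -Tpng figures/tree_graphs/tree_shallow.dot ' + \
          '-o figures/tree_graphs/tree_shallow.png')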

[ ]: