Machine learning with the Open Data Cube

  • Compatability: Notebook currently compatible with both the NCI and DEA Sandbox environments

  • Products used: ls8_nbart_geomedian_annual

  • Special requirements: A shapefile of labelled data in shapefile format is required to use this notebook. An example dataset is provided.

  • Prerequisites: A basic understanding of supervised learning techniques is required. Introduction to statistical learning is a useful resource to begin with - it can be downloaded for free here. The Scikit-learn documentation provides information on the available models and their parameters.

Description

This notebook demonstrates a potential workflow using functions from the dea_classificationtools script to implement a supervised learning landcover classifier within the ODC (Open Data Cube) framework.

For larger model training and prediction implementations this notebook can be adapted into a Python file and run in a distributed fashion.

This example predicts a single class of cultivated / agricultural areas. The notebook demonstrates how to:

  1. Extract the desired ODC data for each labelled area (this becomes our training dataset).

  2. Train a simple decision tree model and adjust parameters.

  3. Predict landcover using trained model on new data.

  4. Evaluate the output of the classification using quantitative metrics and qualitative tools.


Getting started

To run this analysis, run all the cells in the notebook, starting with the “Load packages” cell.

Load packages

Import Python packages that are used for the analysis.

[1]:
%matplotlib inline

import sys
import shapely
import rasterio
import datacube
import matplotlib
import pydotplus
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
from io import StringIO
from sklearn import tree
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from IPython.display import Image
from datacube.utils import geometry
from datacube.helpers import write_geotiff

sys.path.append('../Scripts')
from dea_plotting import map_shapefile
from dea_classificationtools import predict_xr
from dea_classificationtools import get_training_data_for_shp

Connect to the datacube

Connect to the datacube so we can access DEA data.

[2]:
dc = datacube.Datacube(app='Machine_learning_with_ODC')

Analysis parameters

  • path: The path to the input shapefile

  • field: This is the name of column in your shapefile attribute table that contains the class labels

  • product: The name of the product to extract. This method works on DEA Landsat Collection 3 ARD Landsat data and annual geomedian composites (ls8_nbart_geomedian_annual is a good start)

  • year: The year you wish to extract data for, typically the same year the labels were created

  • feature_stats: This is an option to calculate the mean of the values within the feature. It is useful for reducing noise and simplifying the training data

[3]:
path = '../Supplementary_data/Machine_learning_with_ODC/example_training_data.shp'
field = 'classnum'
product = 'ls8_nbart_geomedian_annual'
year = 2015
feature_stats = 'mean'

Preview input data and study area

We can load and preview our input data shapefile using geopandas. The shapefile should contain a column with class labels (e.g. classnum below). These labels will be used to train our model.

[4]:
# Load input data shapefile
input_data = gpd.read_file(path)

# Plot first five rows
input_data.head()
[4]:
classnum geometry
0 112 POLYGON ((-1521875.000 -3801925.000, -1521900....
1 111 POLYGON ((-1557925.000 -3801125.000, -1557950....
2 111 POLYGON ((-1555325.000 -3800000.000, -1555200....
3 111 POLYGON ((-1552925.000 -3800950.000, -1552925....
4 111 POLYGON ((-1545475.000 -3800000.000, -1544325....

The data can also be explored using the interactive map below. Hover over each individual feature to see a print-out of its unique class label number above the map.

[5]:
# Plot training data in an interactive map
map_shapefile(input_data, attribute=field)
/usr/local/lib/python3.6/dist-packages/pyproj/crs.py:77: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method.
  return _prepare_from_string(" ".join(pjargs))

Extract training data using a shapefile

To train our model, we need to obtain satellite data that corresponds with the labelled input data locations above. The function below takes our shapefile containing class labels and extracts the specified product within these areas into a single array.

The following cell can take several minutes to run. The class labels will be contained in the first column of the output array.

[6]:
# Empty list which we will populate with data
out = []

# Call the extraction function
column_names = get_training_data_for_shp(path=path,
                                         out=out,
                                         product=product,
                                         time=(f'{year}-01-01',
                                               f'{year}-12-31'),
                                         crs='EPSG:3577',
                                         field=field,
                                         calc_indices=None,
                                         feature_stats=feature_stats)

# Stack the extracted training data for each feature into a single array
model_input = np.vstack(out)
print(f'\nOutput training data has shape {model_input.shape}')
Loading data...
Rasterizing features and extracting data...
 Feature 0217/0217
Output training data has shape (217, 7)

Preprocessing

scitkit-learn models cannot accept training data with NaNs (“not a number”). This preprocessing step removes any potential rows in the training array with NaNs.

[7]:
# Remove any potential nans
model_input = model_input[~np.isnan(model_input).any(axis=1)]
print("Cleaned input shape:", model_input.shape)
Cleaned input shape: (217, 7)

Our training data has multiple classes in it. However, we are only trying to predict one class (i.e. class label 111, Cultivated Terrestrial Vegetated) with this model. We therefore remove other classes from our training data by setting the label value for all other classes to 0.

These entries provide counter-examples to help the model distinguish the landcover classes from each other.

[8]:
# Modify the input training data for single class labels
model_input[:,0] = np.where(model_input[:,0] == 111, 1, 0)

So that we can access the accuracy of our classification, we split our data into training and testing data. 80% is used for training with 20% held back for testing. When splitting our data, we stratify the training data by the distributions of class membership. This sampling method leads to a similar distribution of class membership in the training data.

[9]:
# Split into training and testing data
model_train, model_test = model_selection.train_test_split(model_input,
                                                           stratify=model_input[:, 0],
                                                           train_size=0.8,
                                                           random_state=0)
print("Train shape:", model_train.shape)
print("Test shape:", model_test.shape)
Train shape: (173, 7)
Test shape: (44, 7)

Model preparation

This section automatically creates a list of varaible names and their respective indices for each of the training data variables.

Note: To use a custom subset of the satellite bands loaded above to train our data, you can replace column_names[1:] with a list of selected band names (e.g. ['red', 'green', 'blue'])

[10]:
# Select the variables we want to use to train our model
model_variables = column_names[1:]

# Extract relevant indices from the processed shapefile
model_col_indices = [column_names.index(var_name) for var_name in model_variables]

A decision tree model is chosen as it is one of the simplest supervised machine learning models we can implement.

Its strengths are its explainability and cheap computational cost.

Parameter tuning can be conducted in the model initialisation below - details on how the different parameters will affect the model are here.

[11]:
# Initialise model
model = tree.DecisionTreeClassifier()

Train model

The model is fitted / trained using the prepared training data. The fitting process uses the decision tree approach to create a generalised representation of reality based on the training data. This fitted / trained model can then be used to predict which class new data belongs to.

[12]:
# Train model
model.fit(model_train[:, model_col_indices], model_train[:, 0])
[12]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

Analyse results

Feature importance

The decision tree classifier allows us to inspect the feature importance of each input variable. Feature importance represents the relative contribution of each variable in predicting the desired landcover class. When summed, the importance of all variables should add up to 1.0.

[13]:
# This shows the feature importance of the input features for predicting the class labels provided
plt.bar(x=model_variables, height=model.feature_importances_)
plt.gca().set_ylabel('Importance', labelpad=10)
plt.gca().set_xlabel('Variable', labelpad=10);
../../_images/notebooks_Frequently_used_code_Machine_learning_with_ODC_33_0.png

This decision tree representation visualises the trained model. Here we can see that the model decides which landcover class to assign based on the value of the important variables in the plot above.

The gini value shown in the tree represents the decrease in node impurity. This can also be understood as how heterogeneous the labels are (small values indicating better results). This metric is used by the decision tree to determine how to split the data into smaller groups.

[14]:
# Prepare a dictionary of class names
class_names = {1: 'Cultivated Terrestrial Vegetated',
               0: 'Not Cultivated Terrestrial Vegetated'}

# Get list of unique classes in model
class_codes = np.unique(model_train[:, 0])
class_names_in_model = [class_names[k] for k in class_codes]

# Plot decision tree
dot_data = StringIO()
tree.export_graphviz(model,
                     out_file=dot_data,
                     feature_names=model_variables,
                     class_names=class_names_in_model,
                     filled=True,
                     rounded=True,
                     special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
[14]:
../../_images/notebooks_Frequently_used_code_Machine_learning_with_ODC_35_0.png

Accuracy

We can use the 20% sample of test data we partitioned earlier to test the accuracy of the trained model on this new, “unseen” data.

An accuracy value of 1.0 indicates that the model was able to correctly predict 100% of the classes in the test data.

[15]:
predictions = model.predict(model_test[:, model_col_indices])
accuracy_score(predictions, model_test[:, 0])
[15]:
0.9545454545454546

Prediction

Now that we have a trained model, we can load new data and use the predict_xr function to predict landcover classes.

The trained model can technically be used to classify any dataset or product with the same bands as the data originally used to train the data. However, it is typically highly advisable to classify data from the same product that the data was originally trained on (e.g. 'ls8_nbart_geomedian_annual' below).

[16]:
# Get extent from input shapefile
xmin, ymin, xmax, ymax = input_data.unary_union.bounds

# Set up the query parameters
query = {'time': (f'{year}-01-01', f'{year}-12-31'),
         'x': (xmin, xmax),
         'y': (ymin, ymax),
         'crs': 'EPSG:3577',
         'resolution': (-25, 25)}

# Load new input data
geomedian_data = dc.load(product='ls8_nbart_geomedian_annual',
                         group_by='solar_day',
                         **query)

Once the data has been loaded, we can classify it using predict_xr:

[17]:
# Predict landcover using the trained model
predicted = predict_xr(model, geomedian_data, progress=True)

Plotting

To qualitatively evaluate how well the classification performed, we can plot the classifed/predicted data next to our input satellite imagery.

Note: The output below is unlikely to be optimal the first time the classification is run. The model training process is one of experimentation and assumption checking that occurs in an iterative cycle - you may need to revisit the steps above and make changes to model parameters or input training data until you achieve a satisfactory result.

[18]:
# Set up plot
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot classified image
predicted.plot(ax=axes[0],
               cmap='Greens',
               add_labels=False,
               add_colorbar=False)

# Plot true colour image
(geomedian_data[['red', 'green', 'blue']]
 .squeeze('time')
 .to_array()
 .plot.imshow(ax=axes[1], robust=True, add_labels=False))

# Remove axis on right plot
axes[1].get_yaxis().set_visible(False)

# Add plot titles
axes[0].set_title('Classified data')
axes[1].set_title('True colour image');
../../_images/notebooks_Frequently_used_code_Machine_learning_with_ODC_46_0.png

Exporting classification

We can now export the predicted landcover out to a GeoTIFF .tif file. This file can be loaded into GIS software (e.g. QGIS, ArcMap) to be inspected more closely.

[19]:
# Write the predicted data out to a GeoTIFF
predicted = predicted.to_dataset(name="predicted")
predicted = predicted.isel(time=0)
predicted.attrs['crs'] = geometry.CRS("EPSG:3577")
write_geotiff('predicted.tif', predicted)

Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Australia data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on Github.

Last modified: February 2020

Compatible datacube version:

[20]:
print(datacube.__version__)
1.7+253.ga031f3f4.dirty

Tags

Browse all available tags on the DEA User Guide’s Tags Index

Tags: NCI compatible, sandbox compatible, landsat 8, annual geomedian, dea_plotting, map_shapefile, dea_classificationtools, predict_xr, get_training_data_for_shp, machine learning, decision tree