Machine learning with the Open Data Cube dce7eb92ebb74dc0a8be1ff18601d58f

  • Sign up to the DEA Sandbox to run this notebook interactively from a browser

  • Compatibility: Notebook currently compatible with both the NCI and DEA Sandbox environments

  • Products used: ls8_nbart_geomedian_annual

  • Special requirements: A shapefile of labelled data in shapefile format is required to use this notebook. An example dataset is provided.

  • Prerequisites: A basic understanding of supervised learning techniques is required. Introduction to statistical learning is a useful resource to begin with - it can be downloaded for free here. The Scikit-learn documentation provides information on the available models and their parameters.


This notebook demonstrates a potential workflow using functions from the dea_classificationtools script to implement a supervised learning landcover classifier within the ODC (Open Data Cube) framework.

For larger model training and prediction implementations this notebook can be adapted into a Python file and run in a distributed fashion.

This example predicts a single class of cultivated / agricultural areas. The notebook demonstrates how to:

  1. Extract the desired ODC data for each labelled area (this becomes our training dataset).

  2. Train a simple decision tree model and adjust parameters.

  3. Predict landcover using trained model on new data.

  4. Evaluate the output of the classification using quantitative metrics and qualitative tools.

Getting started

To run this analysis, run all the cells in the notebook, starting with the “Load packages” cell.

Load packages

Import Python packages that are used for the analysis.

%matplotlib inline
import subprocess as sp
import sys
import shapely

import rasterio
import datacube
import matplotlib
import pydotplus
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
from io import StringIO
from sklearn import tree
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from IPython.display import Image
from datacube.utils import geometry
from datacube.utils.cog import write_cog

from dea_classificationtools import collect_training_data, predict_xr
from dea_plotting import map_shapefile

import warnings
/env/lib/python3.6/site-packages/datacube/storage/ DeprecationWarning: has moved to datacube.utils.masking
/env/lib/python3.6/site-packages/cligj/ FutureWarning: cligj 1.0.0 will require Python >= 3.7
  warn("cligj 1.0.0 will require Python >= 3.7", FutureWarning)

Connect to the datacube

Connect to the datacube so we can access DEA data.

dc = datacube.Datacube(app='Machine_learning_with_ODC')

Analysis parameters

  • path: The path to the input shapefile. A default shapefile is provided.

  • field: This is the name of column in your shapefile attribute table that contains the class labels

  • products: The name of the product to extract. In this example we use a geomedian composite from 2018, 'ls8_nbart_geomedian_annual'

  • time: The time range you wish to extract data for, typically the same date the labels were created.

  • reduce_func: This will determine the statistic to reduce the time dimension of the loaded datasets (if loading multiple time steps). Options include 'mean', 'median', 'std' or 'geomedian'.

  • band_indices: a list of band indices to calculate for the training data see here for possible options.

  • drop: If calculating band indices, we can optionally choose to drop the spectral bands loaded with the product, if 'drop=True', then only the band indices will be used for the training data.

  • zonal_stats: This is an option to calculate the 'mean', 'median', or 'std' of the pixel values within each polygon feature, setting it to None will result in all pixels being extracted.

  • resolution: The spatial resolution, in metres, to resample the satellite data too e.g. if working with Landsat data, then this should be (-30,30)

  • ncpus: Set this value to > 1 to parallize the collection of training data. eg. npus=8

If running the notebook for the first time, keep the default settings below. This will demonstrate how the analysis works and provide meaningful results.

path = '../Supplementary_data/Machine_learning_with_ODC/example_training_data.shp'
field = 'classnum'
products = ['ls8_nbart_geomedian_annual']
time = ('2015')
reduce_func = None #'median'
custom_func = None
band_indices = None #['NDVI']
drop = False
zonal_stats = 'median'
resolution =  (-25, 25)

Automatically detect the number of cpus

    ncpus = int(float(sp.getoutput('env | grep CPU')[-4:]))
    ncpus = int(round(float(sp.getoutput('env | grep CPU')[-3:])))

print('ncpus = '+str(ncpus))
ncpus = 2

Preview input data and study area

We can load and preview our input data shapefile using geopandas. The shapefile should contain a column with class labels (e.g. classnum below). These labels will be used to train our model.

# Load input data shapefile
input_data = gpd.read_file(path)

# Plot first five rows
classnum geometry
0 112 POLYGON ((-1521875.000 -3801925.000, -1521900....
1 111 POLYGON ((-1557925.000 -3801125.000, -1557950....
2 111 POLYGON ((-1555325.000 -3800000.000, -1555200....
3 111 POLYGON ((-1552925.000 -3800950.000, -1552925....
4 111 POLYGON ((-1545475.000 -3800000.000, -1544325....

The data can also be explored using the interactive map below. Hover over each individual feature to see a print-out of its unique class label number above the map.

# Plot training data in an interactive map
map_shapefile(input_data, attribute=field)

Extract training data using a shapefile

To train our model, we need to obtain satellite data that corresponds with the labelled input data locations above. The function below takes our shapefile containing class labels and extracts the specified product within these areas into a single array.

The following cell can take several minutes to run. The class labels will be contained in the first column of the output array.

#generate a datacube query object
query = {
    'time': time,
    'measurements': ['blue',
    'resolution': resolution,
    'group_by' :'solar_day',

column_names, model_input = collect_training_data(
Taking zonal statistic: median
Collecting training data in parallel mode
100%|██████████| 217/217 [00:54<00:00,  3.99it/s]
Percentage of possible fails after run 1 = 0.0 %
Removed 0 rows wth NaNs &/or Infs
Output shape:  (217, 7)


Our training data has multiple classes in it. However, we are only trying to predict one class (i.e. class label 111, Cultivated Terrestrial Vegetated) with this model. We therefore remove other classes from our training data by setting the label value for all other classes to 0.

These entries provide counter-examples to help the model distinguish the landcover classes from each other.

# Modify the input training data for single class labels
model_input[:,0] = np.where(model_input[:,0] == 111, 1, 0)

So that we can access the accuracy of our classification, we split our data into training and testing data. 80% is used for training with 20% held back for testing. When splitting our data, we stratify the training data by the distributions of class membership. This sampling method leads to a similar distribution of class membership in the training data.

# Split into training and testing data
model_train, model_test = model_selection.train_test_split(model_input,
                                                           stratify=model_input[:, 0],
print("Train shape:", model_train.shape)
print("Test shape:", model_test.shape)
Train shape: (173, 7)
Test shape: (44, 7)

Model preparation

This section automatically creates a list of varaible names and their respective indices for each of the training data variables.

Note: To use a custom subset of the satellite bands loaded above to train our data, you can replace column_names[1:] with a list of selected band names (e.g. ['red', 'green', 'blue'])

# Select the variables we want to use to train our model
model_variables = column_names[1:]

# Extract relevant indices from the processed shapefile
model_col_indices = [column_names.index(var_name) for var_name in model_variables]

A decision tree model is chosen as it is one of the simplest supervised machine learning models we can implement.

Its strengths are its explainability and cheap computational cost.

Parameter tuning can be conducted in the model initialisation below - details on how the different parameters will affect the model are here.

# Initialise model
model = tree.DecisionTreeClassifier(max_depth=10)

Train model

The model is fitted / trained using the prepared training data. The fitting process uses the decision tree approach to create a generalised representation of reality based on the training data. This fitted / trained model can then be used to predict which class new data belongs to.

# Train model[:, model_col_indices], model_train[:, 0])

Analyse results

Feature importance

The decision tree classifier allows us to inspect the feature importance of each input variable. Feature importance represents the relative contribution of each variable in predicting the desired landcover class. When summed, the importance of all variables should add up to 1.0.

# This shows the feature importance of the input features for predicting the class labels provided, height=model.feature_importances_)
plt.gca().set_ylabel('Importance', labelpad=10)
plt.gca().set_xlabel('Variable', labelpad=10);

This decision tree representation visualises the trained model. Here we can see that the model decides which landcover class to assign based on the value of the important variables in the plot above.

The gini value shown in the tree represents the decrease in node impurity. This can also be understood as how heterogeneous the labels are (small values indicating better results). This metric is used by the decision tree to determine how to split the data into smaller groups.

# Prepare a dictionary of class names
class_names = {1: 'Cultivated Terrestrial Vegetated',
               0: 'Not Cultivated Terrestrial Vegetated'}

# Get list of unique classes in model
class_codes = np.unique(model_train[:, 0])
class_names_in_model = [class_names[k] for k in class_codes]

# Plot decision tree
dot_data = StringIO()

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())


We can use the 20% sample of test data we partitioned earlier to test the accuracy of the trained model on this new, “unseen” data.

An accuracy value of 1.0 indicates that the model was able to correctly predict 100% of the classes in the test data.

predictions = model.predict(model_test[:, model_col_indices])
accuracy_score(predictions, model_test[:, 0])


Now that we have a trained model, we can load new data and use the predict_xr function to predict landcover classes.

The trained model can technically be used to classify any dataset or product with the same bands as the data originally used to train the data. However, it is typically highly advisable to classify data from the same product that the data was originally trained on (e.g. 'ls8_nbart_geomedian_annual' below).

# Get extent from input shapefile
xmin, ymin, xmax, ymax = input_data.unary_union.bounds

# Set up the query parameters
query = {'time': time,
         'x': (xmin, xmax),
         'y': (ymin, ymax),
         'crs': 'EPSG:3577',
         'resolution': (-25, 25)}

# Load new input data
geomedian_data = dc.load(product='ls8_nbart_geomedian_annual',

Once the data has been loaded, we can classify it using DEA predict_xr function:

# Predict landcover using the trained model
predicted = predict_xr(model,


To qualitatively evaluate how well the classification performed, we can plot the classifed/predicted data next to our input satellite imagery.

Note: The output below is unlikely to be optimal the first time the classification is run. The model training process is one of experimentation and assumption checking that occurs in an iterative cycle - you may need to revisit the steps above and make changes to model parameters or input training data until you achieve a satisfactory result.

# Set up plot
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot classified image

# Plot true colour image
(geomedian_data[['red', 'green', 'blue']]
 .plot.imshow(ax=axes[1], robust=True, add_labels=False))

# Remove axis on right plot

# Add plot titles
axes[0].set_title('Classified data')
axes[1].set_title('True colour image');

Exporting classification

We can now export the predicted landcover out to a GeoTIFF .tif file. This file can be loaded into GIS software (e.g. QGIS, ArcMap) to be inspected more closely.

# Write the predicted data out to a GeoTIFF


Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Australia data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on Github.

Last modified: July 2020

Compatible datacube version:



Browse all available tags on the DEA User Guide’s Tags Index

Tags: NCI compatible, sandbox compatible, landsat 8, annual geomedian, dea_plotting, map_shapefile, dea_classificationtools, predict_xr, get_training_data_for_shp, machine learning, decision tree