Extracting training data from the ODC 9ae98354f0af441581e374c858424062


Training data is the most important part of any supervised machine learning workflow. The quality of the training data has a greater impact on the classification than the algorithm used. Large and accurate training data sets are preferable: increasing the training sample size results in increased classification accuracy (Maxell et al 2018). A review of training data methods in the context of Earth Observation is available here

When creating training labels, be sure to capture the spectral variability of the class, and to use imagery from the time period you want to classify (rather than relying on basemap composites). Another common problem with training data is class imbalance. This can occur when one of your classes is relatively rare and therefore the rare class will comprise a smaller proportion of the training set. When imbalanced data is used, it is common that the final classification will under-predict less abundant classes relative to their true proportion.

There are many platforms to use for gathering training labels, the best one to use depends on your application. GIS platforms are great for collection training data as they are highly flexible and mature platforms; Geo-Wiki and Collect Earth Online are two open-source websites that may also be useful depending on the reference data strategy employed. Alternatively, there are many pre-existing training datasets on the web that may be useful, e.g. Radiant Earth manages a growing number of reference datasets for use by anyone.


This notebook will extract training data (feature layers, in machine learning parlance) from the open-data-cube using labelled geometries within a geojson. The default example will use the crop/non-crop labels within the 'data/crop_training_WA.geojson' file. This reference data was acquired and pre-processed from the USGS’s Global Food Security Analysis Data portal here and here.

To do this, we rely on a custom dea-notebooks function called collect_training_data, contained within the dea_tools.classification script. The principal goal of this notebook is to familarise users with this function so they can extract the appropriate data for their use-case. The default example also highlights extracting a set of useful feature layers for generating a cropland mask forWA.

  1. Preview the polygons in our training data by plotting them on a basemap

  2. Extract training data from the datacube using collect_training_data’s inbuilt feature layer parameters

  3. Extract training data from the datacube using a custom defined feature layer function that we can pass to collect_training_data

  4. Export the training data to disk for use in subsequent scripts

Getting started

To run this analysis, run all the cells in the notebook, starting with the “Load packages” cell.

Load packages

%matplotlib inline

import os
import sys
import datacube
import numpy as np
import xarray as xr
import subprocess as sp
import geopandas as gpd
from odc.io.cgroups import get_cpu_quota
from datacube.utils.geometry import assign_crs

from dea_plotting import map_shapefile
from dea_bandindices import calculate_indices
from dea_classificationtools import collect_training_data

import warnings
/env/lib/python3.6/site-packages/geopandas/_compat.py:88: UserWarning: The Shapely GEOS version (3.7.2-CAPI-1.11.0 ) is incompatible with the GEOS version PyGEOS was compiled with (3.9.0-CAPI-1.16.2). Conversions between both will be slow.
  shapely_geos_version, geos_capi_version_string
/env/lib/python3.6/site-packages/datacube/storage/masking.py:8: DeprecationWarning: datacube.storage.masking has moved to datacube.utils.masking

Analysis parameters

  • path: The path to the input vector file from which we will extract training data. A default geojson is provided.

  • field: This is the name of column in your shapefile attribute table that contains the class labels. The class labels must be integers

path = 'data/crop_training_WA.geojson'
field = 'class'

Find the number of CPUs

ncpus = round(get_cpu_quota())
print('ncpus = ' + str(ncpus))
ncpus = 7

Preview input data

We can load and preview our input data shapefile using geopandas. The shapefile should contain a column with class labels (e.g. ‘class’). These labels will be used to train our model.

Remember, the class labels must be represented by integers.

# Load input data shapefile
input_data = gpd.read_file(path)

# Plot first five rows
class geometry
0 1 POINT (116.60407 -31.46883)
1 1 POINT (117.03464 -32.40830)
2 1 POINT (117.30838 -32.33747)
3 1 POINT (116.74607 -31.63750)
4 1 POINT (116.85817 -33.00131)
# Plot training data in an interactive map
map_shapefile(input_data, attribute=field)

Extracting training data

The function collect_training_data takes our geojson containing class labels and extracts training data (features) from the datacube over the locations specified by the input geometries. The function will also pre-process our training data by stacking the arrays into a useful format and removing any NaN or inf values.

Collect_training_data has the ability to generate many different types of feature layers. Relatively simple layers can be calculated using pre-defined parameters within the function, while more complex layers can be computed by passing in a custom_func. To begin with, let’s try generating feature layers using the pre-defined methods.

The in-built feature layer parameters are described below:

  • product: The name of the product to extract from the datacube. In this example we use a Landsat 8 geomedian composite from 2019, 'ls8_nbart_geomedian_annual'

  • time: The time range from which to extract data

  • calc_indices: This parameter provides a method for calculating a number of remote sensing indices (e.g. ['NDWI', 'NDVI']). Any of the indices found in the dea_tools.bandindices script can be used here

  • drop: If this variable is set to True, and ‘calc_indices’ are supplied, the spectral bands will be dropped from the dataset leaving only the band indices as data variables in the dataset.

  • reduce_func: The classification models we’re applying here require our training data to be in two dimensions (ie. x & y). If our data has a time-dimension (e.g. if we load in an annual time-series of satellite images) then we need to collapse the time dimension. reduce_func is simply the summary statistic used to collapse the temporal dimension. Options are ‘mean’, ‘median’, ‘std’, ‘max’, ‘min’, and ‘geomedian’. In the default example we are loading a geomedian composite, so there is no time dimension to reduce.

  • zonal_stats: An optional string giving the names of zonal statistics to calculate across each polygon. Default is None (all pixel values are returned). Supported values are ‘mean’, ‘median’, ‘max’, and ‘min’.

  • return_coords : If True, then the training data will contain two extra columns ‘x_coord’ and ‘y_coord’ corresponding to the x,y coordinate of each sample. This variable can be useful for handling spatial autocorrelation between samples later on in the ML workflow when we conduct k-fold cross validation.

Note: collect_training_data also has a number of additional parameters for handling ODC I/O read failures, where polygons that return an excessive number of null values can be resubmitted to the multiprocessing queue. Check out the docs to learn more.

In addition to the parameters required for collect_training_data, we also need to set up a few parameters for the Open Data Cube query, such as measurements (the bands to load from the satellite), the resolution (the cell size), and the output_crs (the output projection).

# Set up our inputs to collect_training_data
products = ['ls8_nbart_geomedian_annual']
time = ('2014')
reduce_func = None
calc_indices = ['NDVI', 'MNDWI']
drop = False
zonal_stats = 'median'
return_coords = True

# Set up the inputs for the ODC query
measurements = ['blue', 'green', 'red', 'nir', 'swir1', 'swir2']
resolution = (-30, 30)
output_crs = 'epsg:3577'

Generate a datacube query object from the parameters above:

query = {
    'time': time,
    'measurements': measurements,
    'resolution': resolution,
    'output_crs': output_crs,
    'group_by': 'solar_day',

Now let’s run the collect_training_data function. We will limit this run to only a subset of all samples (first 100) as here we are only demonstrating the use of the function. Futher on in the notebook we will rerun this function but with all the polygons in the training data.

Note: With supervised classification, its common to have many, many labelled geometries in the training data. collect_training_data can parallelize across the geometries in order to speed up the extracting of training data. Setting ncpus>1 will automatically trigger the parallelization. However, its best to set ncpus=1 to begin with to assist with debugging before triggering the parallelization. You can also limit the number of polygons to run when checking code. For example, passing in gdf=input_data[0:5] will only run the code over the first 5 polygons.

column_names, model_input = collect_training_data(gdf=input_data[0:100],
Calculating indices: ['NDVI', 'MNDWI']
Taking zonal statistic: median
Collecting training data in parallel mode
100%|██████████| 100/100 [00:10<00:00,  9.36it/s]
Percentage of possible fails after run 1 = 0.0 %
Removed 0 rows wth NaNs &/or Infs
Output shape:  (100, 11)

The function returns two numpy arrays, the first (column_names) contains a list of the names of the feature layers we’ve computed:

['class', 'blue', 'green', 'red', 'nir', 'swir1', 'swir2', 'NDVI', 'MNDWI', 'x_coord', 'y_coord']

The second array (model_input) contains the data from our labelled geometries. The first item in the array is the class integer (e.g. in the default example 1. ‘crop’, or 0. ‘noncrop’), the second set of items are the values for each feature layer we computed:

print(np.array_str(model_input, precision=2, suppress_small=True))
[[       1.        809.       1249.   ...       -0.45 -1447515.
  -3510225.  ]
 [       1.        950.       1506.   ...       -0.4  -1430025.
  -3532245.  ]
 [       1.       1089.       1526.   ...       -0.45 -1368555.
  -3603855.  ]
 [       1.        843.       1171.   ...       -0.47 -1300185.
  -3646395.  ]
 [       1.        827.       1120.   ...       -0.52 -1544385.
  -3460635.  ]
 [       1.        816.       1087.   ...       -0.46 -1305465.
  -3660765.  ]]

Custom feature layers

The feature layers that are most relevant for discriminating the classes of your classification problem may be more complicated than those provided in the collect_training_data function. In this case, we can pass a custom feature layer function through the custom_func parameter. Below, we will use a custom function to recollect training data (overwriting the previous example above).

  • custom_func: A custom function for generating feature layers. If this parameter is set, all other options (excluding ‘zonal_stats’), will be ignored. The result of the ‘custom_func’ must be a single xarray dataset containing 2D coordinates (i.e x and y with no time dimension). The custom function has access to the datacube dataset extracted using the dc_query params. To load other datasets, you can use the like=ds.geobox parameter in dc.load

First, lets define a custom feature layer function. This function is fairly basic and replicates some of what the collect_training_data function can do, but you can build these custom functions as complex as you like. We will calculate some band indices on the Landsat 8 geomedian, append the ternary median aboslute deviation dataset from the same year: ls8_nbart_tmad_annual, and append fractional cover percentiles for the photosynthetic vegetation band, also from the same year: fc_percentile_albers_annual.

def custom_reduce_function(ds):

    # Calculate some band indices
    da = calculate_indices(ds,
                           index=['NDVI', 'LAI', 'MNDWI'],

    # Connect to datacube to add TMADs product
    dc = datacube.Datacube(app='custom_feature_layers')

    # Add TMADs dataset
    tmad = dc.load(product='ls8_nbart_tmad_annual',
                   like=ds.geobox, #will match geomedian extent
                   time='2014' #same as geomedian

    # Add Fractional cover percentiles
    fc = dc.load(product='fc_percentile_albers_annual',
                   measurements=['PV_PC_10','PV_PC_50','PV_PC_90'], #only the PV band
                   like=ds.geobox, #will match geomedian extent
                   time='2014' #same as geomedian

    # Merge results into single dataset
    result = xr.merge([da, tmad, fc],compat='override')

    return result.squeeze()

Now, we can pass this function to collect_training_data. We will redefine our intial parameters to align with the new custom function. Remember, passing in a custom_func to collect_training_data means many of the other feature layer parameters are ignored.

# Set up our inputs to collect_training_data
products = ['ls8_nbart_geomedian_annual']
time = ('2014')
zonal_stats = 'median'
return_coords = True

# Set up the inputs for the ODC query
measurements = ['blue', 'green', 'red', 'nir', 'swir1', 'swir2']

resolution = (-30, 30)
output_crs = 'epsg:3577'
# Generate a new datacube query object
query = {
    'time': time,
    'measurements': measurements,
    'resolution': resolution,
    'output_crs': output_crs,
    'group_by': 'solar_day',

Below we collect training data from the datacube using the custom function. This will take around 5-6 minutes to run all 430 samples on the default sandbox as it only has two cpus.

column_names, model_input = collect_training_data(
Reducing data using user supplied custom function
Taking zonal statistic: median
Collecting training data in parallel mode
100%|██████████| 430/430 [01:50<00:00,  3.91it/s]
Percentage of possible fails after run 1 = 0.0 %
Removed 0 rows wth NaNs &/or Infs
Output shape:  (430, 18)
CPU times: user 1.04 s, sys: 195 ms, total: 1.23 s
Wall time: 1min 50s

print(np.array_str(model_input, precision=2, suppress_small=True))
['class', 'blue', 'green', 'red', 'nir', 'swir1', 'swir2', 'NDVI', 'LAI', 'MNDWI', 'sdev', 'edev', 'bcdev', 'PV_PC_10', 'PV_PC_50', 'PV_PC_90', 'x_coord', 'y_coord']

[[       1.     1005.     1464. ...       68. -1393035. -3614685.]
 [       1.      952.     1407. ...       71. -1319175. -3597135.]
 [       1.     1089.     1526. ...       69. -1368555. -3603855.]
 [       0.      579.     1070. ...       11. -1690725. -2826555.]
 [       0.      667.     1049. ...       22.  -516825. -3463935.]
 [       0.      519.      744. ...       48.  -698085. -1657005.]]

Separate coordinate data

By setting return_coords=True in the collect_training_data function, our training data now has two extra columns called x_coord and y_coord. We need to separate these from our training dataset as they will not be used to train the machine learning model. Instead, these variables will be used to help conduct Spatial K-fold Cross validation (SKVC) in the notebook 3_Evaluate_optimize_fit_classifier. For more information on why this is important, see this article.

# Select the variables we want to use to train our model
coord_variables = ['x_coord', 'y_coord']

# Extract relevant indices from the processed shapefile
model_col_indices = [column_names.index(var_name) for var_name in coord_variables]

# Export to coordinates to file
np.savetxt("results/training_data_coordinates.txt", model_input[:, model_col_indices])

Export training data

Once we’ve collected all the training data we require, we can write the data to disk. This will allow us to import the data in the next step(s) of the workflow.

# Set the name and location of the output file
output_file = "results/test_training_data.txt"
# Grab all columns except the x-y coords
model_col_indices = [column_names.index(var_name) for var_name in column_names[0:-2]]

# Export files to disk
np.savetxt(output_file, model_input[:, model_col_indices], header=" ".join(column_names[0:-2]), fmt="%4f")

Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Australia data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on Github.

Last modified: March 2021

Compatible datacube version:



Browse all available tags on the DEA User Guide’s Tags Index

Tags Landsat 8 geomedian, Landsat 8 TMAD, machine learning, collect_training_data, Fractional Cover