Machine learning with the Open Data Cube

  • Sign up to the DEA Sandbox to run this notebook interactively from a browser

  • Compatibility: Notebook currently compatible with both the NCI and DEA Sandbox environments

  • Products used: ls8_nbart_geomedian_annual and ls8_nbart_tmad_annual

  • Special requirements: A shapefile of labelled data is required to use this notebook. An example dataset is provided.

  • Prerequisites: A basic understanding of supervised learning techniques is required. Introduction to statistical learning is a useful resource to begin with - it can be downloaded for free here. The Scikit-learn documentation provides information on the available models and their parameters.

Description

This notebook demonstrates a potential workflow using functions from the dea_tools.classification script to implement a supervised learning landcover classifier within the ODC (Open Data Cube) framework.

For larger model training and prediction implementations, this notebook can be adapted into a Python file and run in a distributed fashion.

This example predicts a single class of cultivated / agricultural areas. The notebook demonstrates how to:

  1. Extract the desired ODC data for each labelled area (this becomes our training dataset).

  2. Train a simple decision tree model and adjust parameters (a standalone preview of this step is sketched after this list).

  3. Predict landcover using the trained model on new data.

  4. Evaluate the output of the classification using quantitative metrics and qualitative tools.
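As a brief taste of step 2, a minimal decision tree fit in scikit-learn looks like the sketch below. This is a standalone illustration on synthetic data, not the notebook's actual training step:

# Standalone sketch of step 2 on synthetic data (assumption: 100 samples
# with 6 band values each and binary class labels)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(100, 6)        # feature matrix: 100 samples x 6 bands
y = np.random.randint(0, 2, 100)  # binary class labels

model = DecisionTreeClassifier(max_depth=5)
model.fit(X, y)
print(f'Training accuracy: {model.score(X, y):.2f}')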

This is a quick reference for machine learning on the ODC, captured in a single notebook. For a more in-depth exploration, please see the Scalable Machine Learning series of notebooks.

Getting started

To run this analysis, run all the cells in the notebook, starting with the “Load packages” cell.

Load packages

Import Python packages that are used for the analysis.

[1]:
%matplotlib inline

import subprocess as sp
import shapely
import xarray as xr
import rasterio
import datacube
import matplotlib
import pydotplus
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
from io import StringIO
from odc.io.cgroups import get_cpu_quota
from sklearn import tree
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from IPython.display import Image
from datacube.utils import geometry
from datacube.utils.cog import write_cog

import sys
sys.path.insert(1, '../Tools/')
from dea_tools.classification import collect_training_data, predict_xr

import warnings
warnings.filterwarnings("ignore")

Connect to the datacube

Connect to the datacube so we can access DEA data.

[2]:
dc = datacube.Datacube(app='Machine_learning_with_ODC')

Analysis parameters

  • path: The path to the input shapefile. A default shapefile is provided.

  • field: The name of the column in your shapefile's attribute table that contains the class labels.

  • time: The time range you wish to extract data for, typically the same period in which the labels were created.

  • zonal_stats: An option to calculate the 'mean', 'median', or 'std' of the pixel values within each polygon feature; setting it to None will result in all pixels being extracted.

  • resolution: The spatial resolution, in metres, to resample the satellite data to, e.g. if working with Landsat data, this should be (-30, 30).

  • output_crs: The coordinate reference system for the data you are querying.

  • ncpus: Set this value to > 1 to parallelize the collection of training data, e.g. ncpus=8.

If running the notebook for the first time, keep the default settings below. This will demonstrate how the analysis works and provide meaningful results.

[3]:
path = '../Supplementary_data/Machine_learning_with_ODC/example_training_data.shp'
field = 'classnum'
time = ('2015')
zonal_stats = 'median'
resolution = (-25, 25)
output_crs = 'epsg:3577'
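
These parameters feed into the data extraction steps later in the notebook. As a rough illustration only (a sketch; the actual query construction happens further on, though the key names follow standard datacube.Datacube.load conventions), they typically form part of a datacube query dictionary:

# Illustrative only: the parameters above typically become part of a
# datacube query dictionary like this (actual construction occurs later)
query = {
    'time': time,
    'resolution': resolution,
    'output_crs': output_crs,
}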

Automatically detect the number of cpus

[4]:
ncpus = round(get_cpu_quota())
print(ncpus)
2
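
Note that get_cpu_quota() returns None when no cgroup CPU quota is set (for example, on some local machines outside containerised environments like the Sandbox), in which case round() would fail. A defensive variant, sketched below under that assumption, falls back to the machine's total CPU count:

# Defensive sketch: fall back to the total CPU count when no cgroup
# quota is available (get_cpu_quota() returns None in that case)
from multiprocessing import cpu_count

quota = get_cpu_quota()
ncpus = round(quota) if quota is not None else cpu_count()
print(ncpus)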

Preview input data and study area

We can load and preview our input data shapefile using geopandas. The shapefile should contain a column with class labels (e.g. classnum below). These labels will be used to train our model.

[5]:
# Load input data shapefile
input_data = gpd.read_file(path)

# Plot first five rows
input_data.head()
[5]:
classnum geometry
0 112 POLYGON ((-1521875.000 -3801925.000, -1521900....
1 111 POLYGON ((-1557925.000 -3801125.000, -1557950....
2 111 POLYGON ((-1555325.000 -3800000.000, -1555200....
3 111 POLYGON ((-1552925.000 -3800950.000, -1552925....
4 111 POLYGON ((-1545475.000 -3800000.000, -1544325....
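
Before extracting training data, it can also be useful to check how many polygons exist per class, since heavily imbalanced classes can bias a classifier. A quick check using standard pandas functionality (a sketch):

# Count the number of training polygons per class label
input_data[field].value_counts()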

The data can also be explored using the interactive map below. Hover over each individual feature to see a print-out of its unique class label number above the map.

[6]:
# Plot training data in an interactive map
input_data.explore(column=field, legend=False)
[6]:
(Interactive map output: training data polygons coloured by class label)
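
If the interactive map does not render (for example, in a static or untrusted notebook), a static plot gives a similar overview. A minimal sketch using geopandas' built-in plotting (plt is already imported above):

# Static fallback: plot the training polygons coloured by class label
input_data.plot(column=field, categorical=True, legend=True, figsize=(8, 8))
plt.title('Training data polygons by class')
plt.show()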