{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extracting training data from the ODC \n", "\n", "* **[Sign up to the DEA Sandbox](https://app.sandbox.dea.ga.gov.au/)** to run this notebook interactively from a browser\n", "* **Compatibility:** Notebook currently compatible with the `DEA Sandbox` environment\n", "* **Products used:** \n", "[ga_ls8c_nbart_gm_cyear_3](https://explorer.dea.ga.gov.au/products/ga_ls8c_nbart_gm_cyear_3),\n", "[ga_ls_fc_pc_cyear_3](https://explorer.dea.ga.gov.au/products/ga_ls_fc_pc_cyear_3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background\n", "\n", "**Training data** is the most important part of any supervised machine learning workflow. The quality of the training data has a greater impact on the classification than the algorithm used. Large and accurate training data sets are preferable: increasing the training sample size results in increased classification accuracy ([Maxwell et al. 2018](https://www.tandfonline.com/doi/full/10.1080/01431161.2018.1433343)). A review of training data methods in the context of Earth Observation is available [here](https://www.mdpi.com/2072-4292/12/6/1034).\n", "\n", "When creating training labels, be sure to capture the **spectral variability** of the class, and to use imagery from the time period you want to classify (rather than relying on basemap composites). Another common problem with training data is **class imbalance**: this can occur when one of your classes is relatively rare and therefore comprises a smaller proportion of the training set. When imbalanced data is used, the final classification will often under-predict less abundant classes relative to their true proportion.\n", "\n", "There are many platforms for gathering training labels; the best one to use depends on your application. 
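\n",
"\n",
"One quick way to spot the **class imbalance** problem described above is to compute the class proportions of your labels before training. The snippet below is a minimal, hypothetical sketch using toy labels in plain Python (not the notebook's real dataset):\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Toy labels (hypothetical): 1 = crop, 0 = non-crop\n",
"labels = [1] * 8 + [0] * 2\n",
"props = {k: v / len(labels) for k, v in Counter(labels).items()}\n",
"print(props)  # a heavily skewed result flags class imbalance\n",
"```\n",
"\n",
"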
GIS platforms are great for collecting training data as they are highly flexible and mature; [Geo-Wiki](https://www.geo-wiki.org/) and [Collect Earth Online](https://collect.earth) are two open-source websites that may also be useful depending on the reference data strategy employed. Alternatively, there are many pre-existing training datasets on the web that may be useful, e.g. [Radiant Earth](https://www.radiant.earth/) manages a growing number of reference datasets for use by anyone.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Description\n", "\n", "This notebook will extract training data (feature layers, in machine learning parlance) from the `open-data-cube` using labelled geometries within a geojson. The default example will use the crop/non-crop labels within the `'data/crop_training_WA.geojson'` file. This reference data was acquired and pre-processed from the USGS's Global Food Security Analysis Data portal [here](https://croplands.org/app/data/search?page=1&page_size=200) and [here](https://e4ftl01.cr.usgs.gov/MEASURES/GFSAD30VAL.001/2008.01.01/).\n", "\n", "To do this, we rely on a custom `dea-notebooks` function called `collect_training_data`, contained within the [dea_tools.classification](../Tools/dea_tools/classification.py) script. The principal goal of this notebook is to familiarise users with this function so they can extract the appropriate data for their use-case. The default example also highlights extracting a set of useful feature layers for generating a cropland mask for WA.\n", "\n", "1. Preview the polygons in our training data by plotting them on a basemap\n", "2. Define a feature layer function to pass to `collect_training_data`\n", "3. Extract training data from the datacube using `collect_training_data`\n", "4. 
Export the training data to disk for use in subsequent scripts\n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting started\n", "\n", "To run this analysis, run all the cells in the notebook, starting with the \"Load packages\" cell. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load packages\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import os\n", "import datacube\n", "import numpy as np\n", "import xarray as xr\n", "import subprocess as sp\n", "import geopandas as gpd\n", "from odc.io.cgroups import get_cpu_quota\n", "from datacube.utils.geometry import assign_crs\n", "\n", "import sys\n", "sys.path.insert(1, '../../Tools/')\n", "from dea_tools.bandindices import calculate_indices\n", "from dea_tools.classification import collect_training_data\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis parameters\n", "\n", "* `path`: The path to the input vector file from which we will extract training data. A default geojson is provided.\n", "* `field`: The name of the column in your shapefile attribute table that contains the class labels. **The class labels must be integers.**\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "path = 'data/crop_training_WA.geojson' \n", "field = 'class'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Find the number of CPUs" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ncpus = 2\n" ] } ], "source": [ "ncpus = round(get_cpu_quota())\n", "print('ncpus = ' + str(ncpus))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preview input data\n", "\n", "We can load and preview our input data shapefile using `geopandas`. 
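\n",
"\n",
"Because the class labels feed directly into model training, it can be worth asserting that they are integers as soon as the file is loaded. A minimal sketch, using a toy pandas DataFrame as a hypothetical stand-in for the real attribute table:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical stand-in for the loaded attribute table\n",
"gdf = pd.DataFrame({'class': [1, 1, 0]})\n",
"assert pd.api.types.is_integer_dtype(gdf['class']), 'class labels must be integers'\n",
"```\n",
"\n",
"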
The shapefile should contain a column with class labels (e.g. 'class'). These labels will be used to train our model. \n", "\n", "> Remember, the class labels **must** be represented by `integers`.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>class</th>\n",
"      <th>geometry</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>1</td>\n",
"      <td>POINT (116.60407 -31.46883)</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>1</td>\n",
"      <td>POINT (117.03464 -32.40830)</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>1</td>\n",
"      <td>POINT (117.30838 -32.33747)</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>1</td>\n",
"      <td>POINT (116.74607 -31.63750)</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>1</td>\n",
"      <td>POINT (116.85817 -33.00131)</td>\n",
"    </tr>\n",
"