dea_tools.datahandling

Loading and manipulating Digital Earth Australia products and data using the Open Data Cube and xarray.

License: The code in this notebook is licensed under the Apache License, Version 2.0 (https://www.apache.org/licenses/LICENSE-2.0). Digital Earth Australia data is licensed under the Creative Commons by Attribution 4.0 license (https://creativecommons.org/licenses/by/4.0/).

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel (http://slack.opendatacube.org/) or on the GIS Stack Exchange (https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the open-data-cube tag (you can view previously asked questions here: https://gis.stackexchange.com/questions/tagged/open-data-cube).

If you would like to report an issue with this script, you can file one on GitHub (GeoscienceAustralia/dea-notebooks#new).

Last modified: Feb 2024

Functions

dilate(array[, dilation, invert])

Dilate a binary array by a specified number of pixels using a disk-like radial dilation.

download_unzip(url[, output_dir, remove_zip])

Downloads and unzips a .zip file from an external URL to a local directory.

first(array, dim[, index_name])

Finds the first occurring non-null value along the given dimension.

last(array, dim[, index_name])

Finds the last occurring non-null value along the given dimension.

load_ard(dc[, products, cloud_mask, ...])

Load multiple Geoscience Australia Landsat or Sentinel 2 Collection 3 products (e.g. Landsat 5, 7, 8, 9; Sentinel 2A and 2B), optionally apply pixel quality/cloud masking and contiguity masks, and drop time steps that contain less than a minimum proportion of good quality (e.g. non-cloudy or non-shadowed) pixels.

load_reproject(path, how[, resolution, ...])

Load and reproject part of a raster dataset into a given GeoBox or custom CRS/resolution.

mostcommon_crs(dc, product, query)

Takes a given query and returns the most common CRS for observations returned for that spatial extent.

nearest(array, dim, target[, index_name])

Finds the nearest values to a target label along the given dimension, for all other dimensions.

parallel_apply(ds, dim, func[, use_threads])

Applies a custom function in parallel along the dimension of an xarray.Dataset or xarray.DataArray.

paths_to_datetimeindex(paths[, string_slice])

Helper function to generate a pandas DatetimeIndex object from dates contained in a file path string.

wofs_fuser(dest, src)

Fuse two WOfS water measurements represented as ``ndarray``s.

xr_pansharpen(ds, transform[, pan_band, ...])

Apply pan-sharpening to multispectral satellite data with one or more timesteps.

dea_tools.datahandling.dilate(array, dilation=10, invert=True)[source]

Dilate a binary array by a specified number of pixels using a disk-like radial dilation.

By default, invalid (e.g. False or 0) values are dilated. This is suitable for applications such as cloud masking (e.g. creating a buffer around cloudy or shadowed pixels). This functionality can be reversed by specifying invert=False.

Parameters:
  • array (array) – The binary array to dilate.

  • dilation (int, optional) – An optional integer specifying the number of pixels to dilate by. Defaults to 10, which will dilate array by 10 pixels.

  • invert (bool, optional) – An optional boolean specifying whether to invert the binary array prior to dilation. The default is True, which dilates the invalid values in the array (e.g. False or 0 values).

Returns:

  • An array of the same shape as array, with valid data pixels

  • dilated by the number of pixels specified by dilation.
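The disk-like dilation described above can be sketched in pure NumPy. This is an illustrative, brute-force version (dea_tools uses an optimised morphology implementation internally, and its exact inversion behaviour may differ); `disk_dilate` is a hypothetical helper name:

```python
import numpy as np

def disk_dilate(array, dilation=10, invert=True):
    """Illustrative disk-like radial dilation: marks every pixel within
    `dilation` pixels (Euclidean distance) of a True pixel."""
    array = np.asarray(array, dtype=bool)
    if invert:
        array = ~array  # dilate the invalid (False/0) values by default

    # Build a disk-shaped structuring element of radius `dilation`
    y, x = np.ogrid[-dilation:dilation + 1, -dilation:dilation + 1]
    disk = (x**2 + y**2) <= dilation**2

    # Brute-force binary dilation: OR together shifted copies of the array
    h, w = array.shape
    pad = np.pad(array, dilation)
    out = np.zeros_like(array)
    for ky, kx in zip(*np.nonzero(disk)):
        out |= pad[ky:ky + h, kx:kx + w]
    return out
```

For example, dilating a single True pixel with `dilation=2` and `invert=False` produces a 13-pixel disk centred on that pixel.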

dea_tools.datahandling.download_unzip(url, output_dir=None, remove_zip=True)[source]

Downloads and unzips a .zip file from an external URL to a local directory.

Parameters:
  • url (str) – A string giving a URL path to the zip file you wish to download and unzip

  • output_dir (str, optional) – An optional string giving the directory to unzip files into. Defaults to None, which will unzip files in the current working directory

  • remove_zip (bool, optional) – An optional boolean indicating whether to remove the downloaded .zip file after files are unzipped. Defaults to True, which will delete the .zip file.
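The download-and-unzip pattern can be sketched with only the standard library. This is a minimal equivalent, not the dea_tools implementation:

```python
import os
import zipfile
from urllib.request import urlretrieve

def download_unzip(url, output_dir=None, remove_zip=True):
    """Minimal stdlib sketch: download a .zip from a URL, extract it,
    and optionally delete the downloaded archive."""
    output_dir = output_dir or os.getcwd()
    zip_path = os.path.join(output_dir, os.path.basename(url))
    urlretrieve(url, zip_path)            # download the .zip file
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(output_dir)         # unzip into output_dir
    if remove_zip:
        os.remove(zip_path)               # clean up the archive
```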

dea_tools.datahandling.first(array: xarray.DataArray, dim: str, index_name: str = None) xarray.DataArray[source]

Finds the first occurring non-null value along the given dimension.

Parameters:
  • array (xr.DataArray) – The array to search.

  • dim (str) – The name of the dimension to reduce by finding the first non-null value.

  • index_name (str, optional) – If given, the name of a coordinate to be added containing the index of where on the dimension the first non-null value was found.

Returns:

reduced – An array of the first non-null values. The dim dimension will be removed, and replaced with a coord of the same name, containing the value of that dimension where the first value was found.

Return type:

xr.DataArray

dea_tools.datahandling.last(array: xarray.DataArray, dim: str, index_name: str = None) xarray.DataArray[source]

Finds the last occurring non-null value along the given dimension.

Parameters:
  • array (xr.DataArray) – The array to search.

  • dim (str) – The name of the dimension to reduce by finding the last non-null value.

  • index_name (str, optional) – If given, the name of a coordinate to be added containing the index of where on the dimension the last non-null value was found.

Returns:

reduced – An array of the last non-null values. The dim dimension will be removed, and replaced with a coord of the same name, containing the value of that dimension where the last value was found.

Return type:

xr.DataArray
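The `first`/`last` reductions can be sketched on a plain NumPy array (dea_tools implements them on xarray.DataArray objects; `last` is the same idea with the dimension reversed). `first_non_null` is a hypothetical helper name:

```python
import numpy as np

def first_non_null(values, coords, axis=0):
    """For each pixel, return the value and coordinate label of the
    first non-NaN entry along `axis`."""
    valid = ~np.isnan(values)
    idx = np.argmax(valid, axis=axis)               # first True along axis
    out = np.take_along_axis(values, np.expand_dims(idx, axis), axis=axis)
    out = np.squeeze(out, axis=axis)
    out[~valid.any(axis=axis)] = np.nan             # all-NaN pixels stay NaN
    return out, np.asarray(coords)[idx]
```

For a stack with coords [10, 20, 30] and values [[nan, 1], [2, 3], [4, nan]], this returns values [2, 1] with coordinate labels [20, 10].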

dea_tools.datahandling.load_ard(dc, products=None, cloud_mask='fmask', min_gooddata=0.0, mask_pixel_quality=True, mask_filters=None, mask_contiguity=False, fmask_categories=['valid', 'snow', 'water'], s2cloudless_categories=['valid'], ls7_slc_off=True, dtype='auto', predicate=None, **kwargs)[source]

Load multiple Geoscience Australia Landsat or Sentinel 2 Collection 3 products (e.g. Landsat 5, 7, 8, 9; Sentinel 2A and 2B), optionally apply pixel quality/cloud masking and contiguity masks, and drop time steps that contain less than a minimum proportion of good quality (e.g. non-cloudy or non-shadowed) pixels.

The function supports loading the following Landsat products:
  • ga_ls5t_ard_3

  • ga_ls7e_ard_3

  • ga_ls8c_ard_3

  • ga_ls9c_ard_3

And Sentinel-2 products:
  • ga_s2am_ard_3

  • ga_s2bm_ard_3

Cloud masking can be performed using the Fmask (Function of Mask) cloud mask for Landsat and Sentinel-2, and the s2cloudless (Sentinel Hub cloud detector for Sentinel-2 imagery) cloud mask for Sentinel-2.

Last modified: June 2023

Parameters:
  • dc (datacube Datacube object) – The Datacube to connect to, i.e. dc = datacube.Datacube(). This allows you to also use development datacubes if required.

  • products (list) – A list of product names to load. Valid options are [‘ga_ls5t_ard_3’, ‘ga_ls7e_ard_3’, ‘ga_ls8c_ard_3’, ‘ga_ls9c_ard_3’] for Landsat, [‘ga_s2am_ard_3’, ‘ga_s2bm_ard_3’] for Sentinel 2.

  • cloud_mask (string, optional) –

    The cloud mask used by the function. This is used for both masking out poor quality pixels (e.g. clouds) if mask_pixel_quality=True, and for calculating the min_gooddata percentage when dropping cloudy or low quality satellite observations. Two cloud masks are supported:

    • ’fmask’: (default; available for Landsat, Sentinel-2)

    • ’s2cloudless’ (Sentinel-2 only)

  • min_gooddata (float, optional) – The minimum percentage of good quality pixels required for a satellite observation to be loaded. Defaults to 0.00 which will return all observations regardless of pixel quality (set to e.g. 0.99 to return only observations with more than 99% good quality pixels).

  • mask_pixel_quality (str or bool, optional) – Whether to mask out poor quality (e.g. cloudy) pixels by setting them as nodata. Depending on the choice of cloud mask, the function will identify good quality pixels using the categories passed to the fmask_categories or s2cloudless_categories params. Set to False to turn off pixel quality masking completely. Poor quality pixels will be set to NaN (and convert all data to float32) if dtype='auto', or be set to the data’s native nodata value (usually -999) if dtype='native' (see ‘dtype’ below for more details).

  • mask_filters (iterable of tuples, optional) –

    An iterable of tuples of morphological operations ("<operation>", <radius>) to apply to the inverted pixel quality mask, where operation is a string naming one of these morphological operations:

    • 'dilation' = Expands poor quality pixels/clouds outwards

    • 'erosion' = Shrinks poor quality pixels/clouds inwards

    • 'closing' = Remove small holes in clouds by expanding then shrinking poor quality pixels

    • 'opening' = Remove small or narrow clouds by shrinking then expanding poor quality pixels

    and radius is an integer radius in pixels, e.g. mask_filters=[('erosion', 5), ('opening', 2), ('dilation', 2)]

  • mask_contiguity (str or bool, optional) –

    Whether to mask out pixels that are missing data in any band (i.e. “non-contiguous” pixels). This can be important for generating clean composite datasets. The default of False will not apply any contiguity mask. If loading NBART data, set:

    • mask_contiguity='nbart' (or mask_contiguity=True)

    If loading NBAR data, specify:

    • mask_contiguity='nbar'

    Non-contiguous pixels will be set to NaN if dtype=’auto’, or set to the data’s native nodata value if dtype=’native’ (see ‘dtype’ below).

  • fmask_categories (list, optional) – A list of Fmask cloud mask categories to consider as good quality pixels when calculating min_gooddata and when masking data by pixel quality if mask_pixel_quality=True. The default is ['valid', 'snow', 'water']; all other Fmask categories (‘cloud’, ‘shadow’, ‘nodata’) will be treated as low quality pixels. Choose from: ‘nodata’, ‘valid’, ‘cloud’, ‘shadow’, ‘snow’, and ‘water’.

  • s2cloudless_categories (list, optional) – A list of s2cloudless cloud mask categories to consider as good quality pixels when calculating min_gooddata and when masking data by pixel quality if mask_pixel_quality=True. The default is [‘valid’]; all other s2cloudless categories (‘cloud’, ‘nodata’) will be treated as low quality pixels. Choose from: ‘nodata’, ‘valid’, or ‘cloud’.

  • ls7_slc_off (bool, optional) – An optional boolean indicating whether to include data from after the Landsat 7 SLC failure (i.e. SLC-off). Defaults to True, which keeps all Landsat 7 observations > May 31 2003.

  • dtype (string, optional) – Controls the data type/dtype that layers are coerced to after loading. Valid values: ‘native’, ‘auto’, and ‘float{16|32|64}’. When ‘auto’ is used, the data will be converted to float32 if masking is used, otherwise data will be returned in the native data type of the data. Be aware that if data is loaded in its native dtype, nodata and masked pixels will be returned with the data’s native nodata value (typically -999), not NaN.

  • predicate (function, optional) – DEPRECATED: Please use dataset_predicate instead. An optional function that can be passed in to restrict the datasets that are loaded. A predicate function should take a datacube.model.Dataset object as an input (i.e. as returned from dc.find_datasets), and return a boolean. For example, a predicate function could be used to return True for only datasets acquired in January: dataset.time.begin.month == 1

  • **kwargs – A set of keyword arguments to dc.load that define the spatiotemporal query and load parameters used to extract data. Keyword arguments can either be listed directly in the load_ard call like any other parameter (e.g. measurements=[‘nbart_red’]), or by passing in a query kwarg dictionary (e.g. **query). Keywords can include measurements, x, y, time, resolution, resampling, group_by, crs; see the dc.load documentation for all possible options: https://datacube-core.readthedocs.io/en/latest/api/indexed-data/generate/datacube.Datacube.load.html

Returns:

combined_ds – An xarray.Dataset containing only satellite observations with a proportion of good quality pixels greater than min_gooddata.

Return type:

xarray.Dataset

Notes

The load_ard function builds on the Open Data Cube’s native dc.load function by adding the ability to load multiple satellite data products at once, and automatically apply cloud masking and filtering. For loading non-satellite data products (e.g. DEA Water Observations), use dc.load instead.
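The min_gooddata filtering step described above can be sketched as: compute the fraction of good quality pixels per time step from the cloud mask, then keep only observations above the threshold. This is a conceptual NumPy sketch with a hypothetical helper name, not the dea_tools internals:

```python
import numpy as np

def filter_gooddata(data, good_mask, min_gooddata=0.99):
    """Sketch of load_ard's min_gooddata filtering.

    data:      (time, y, x) array of pixel values
    good_mask: (time, y, x) boolean array, True where the cloud mask
               reports a good quality pixel
    Keeps only time steps whose proportion of good pixels exceeds
    min_gooddata."""
    frac_good = good_mask.mean(axis=(1, 2))  # good-pixel fraction per step
    keep = frac_good > min_gooddata          # threshold the observations
    return data[keep], frac_good
```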

dea_tools.datahandling.load_reproject(path, how, resolution='auto', tight=False, resampling='nearest', chunks={'x': 2048, 'y': 2048}, bands=None, masked=True, reproject_kwds=None, **kwargs)[source]

Load and reproject part of a raster dataset into a given GeoBox or custom CRS/resolution.

Parameters:
  • path (str) – Path to the raster dataset to be loaded and reprojected.

  • how (GeoBox, str or int) – How to reproject the raster. Can be a GeoBox or a CRS (e.g. an "EPSG:XXXX" string or integer EPSG code).

  • resolution (str or int, optional) –

    The resolution to reproject the raster dataset into if how is a CRS, by default “auto”. Supports:

    • "same": use exactly the same resolution as the input raster

    • "fit": use the center pixel to determine the required scale change

    • "auto": uses the same resolution as the input if CRS units are the same between source and destination; otherwise "fit"

    • Otherwise, a specific resolution in the units of the output CRS

  • tight (bool, optional) – By default the output pixel grid is adjusted to align pixel edges to the X/Y axes; supplying tight=True produces an unaligned geobox.

  • resampling (str, optional) – Resampling method to use when reprojecting data, by default “nearest”, supports all standard GDAL options (“average”, “bilinear”, “min”, “max”, “cubic” etc).

  • chunks (dict, optional) – The size of the Dask chunks to load the data with, by default {“x”: 2048, “y”: 2048}.

  • bands (str or list, optional) – Bands to optionally filter to when loading data.

  • masked (bool, optional) – Whether to mask the data by its nodata value, by default True.

  • reproject_kwds (dict, optional) – Additional keyword arguments to pass to the .odc.reproject() method, by default None.

  • **kwargs (dict) – Additional keyword arguments to be passed to the rioxarray.open_rasterio function.

Returns:

The reprojected raster dataset.

Return type:

xarray.Dataset

dea_tools.datahandling.mostcommon_crs(dc, product, query)[source]

Takes a given query and returns the most common CRS for observations returned for that spatial extent. This can be useful when your study area lies on the boundary of two UTM zones, forcing you to decide which CRS to use for your output_crs in dc.load.

Parameters:
  • dc (datacube Datacube object) – The Datacube to connect to, i.e. dc = datacube.Datacube(). This allows you to also use development datacubes if required.

  • product (str) – A product name (or list of product names) to load CRSs from.

  • query (dict) – A datacube query including x, y and time range to assess for the most common CRS

Returns:

epsg_string – An EPSG string giving the most common CRS from all datasets returned by the query above

Return type:

str
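The core of the most-common-CRS logic is a simple frequency count over the CRS of every dataset matched by the query. A minimal sketch of that counting step (assuming the dataset CRSs have already been collected into a list; `most_common_crs` is a hypothetical helper name):

```python
from collections import Counter

def most_common_crs(crs_list):
    """Given the CRS string of every dataset returned by a query,
    return the CRS that occurs most often."""
    return Counter(crs_list).most_common(1)[0][0]
```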

dea_tools.datahandling.nearest(array: xarray.DataArray, dim: str, target, index_name: str = None) xarray.DataArray[source]

Finds the nearest values to a target label along the given dimension, for all other dimensions.

E.g. for a DataArray with dimensions ('time', 'x', 'y'):

nearest_array = nearest(array, 'time', '2017-03-12')

will return an array with the dimensions ('x', 'y'), containing, for each (x, y) pixel, the non-null value found closest along the time dimension to the target label.

The returned array will include the 'time' coordinate at which the nearest value was found for each (x, y) pixel.

Parameters:
  • array (xr.DataArray) – The array to search.

  • dim (str) – The name of the dimension to look for the target label.

  • target (same type as array[dim]) – The value to look up along the given dimension.

  • index_name (str, optional) – If given, the name of a coordinate to be added containing the index of where on the dimension the nearest value was found.

Returns:

nearest_array – An array of the nearest non-null values to the target label. The dim dimension will be removed, and replaced with a coord of the same name, containing the value of that dimension closest to the given target label.

Return type:

xr.DataArray
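The `nearest` reduction can be sketched on a plain (time, ...) NumPy array: order the slices by distance from the target label, then take the first non-NaN value per pixel. A conceptual sketch with a hypothetical helper name (dea_tools operates on labelled xarray objects instead):

```python
import numpy as np

def nearest_non_null(values, coords, target):
    """For each pixel, return the non-NaN value whose `coords` label
    lies closest to `target`, plus that label."""
    order = np.argsort(np.abs(np.asarray(coords) - target))
    reordered = values[order]                 # closest label first
    valid = ~np.isnan(reordered)
    idx = np.argmax(valid, axis=0)            # first non-NaN per pixel
    out = np.take_along_axis(reordered, idx[None], axis=0)[0]
    out[~valid.any(axis=0)] = np.nan          # all-NaN pixels stay NaN
    return out, np.asarray(coords)[order][idx]
```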

dea_tools.datahandling.parallel_apply(ds, dim, func, use_threads=False, *args, **kwargs)[source]

Applies a custom function in parallel along the dimension of an xarray.Dataset or xarray.DataArray.

The function can be any function that can be applied to an individual xarray.Dataset or xarray.DataArray (e.g. data for a single timestep). The function should also return data in xarray.Dataset or xarray.DataArray format.

This function is useful as a simple method for parallelising code that cannot easily be parallelised using Dask.

Parameters:
  • ds (xarray.Dataset or xarray.DataArray) – xarray data with a dimension dim to apply the custom function along.

  • dim (string) – The dimension along which the custom function will be applied.

  • func (function) – The function that will be applied in parallel to each array along dimension dim. The first argument passed to this function should be the array along dim.

  • use_threads (bool, optional) – Whether to use threads instead of processes for parallelisation. Defaults to False, which uses multiprocessing. In brief, threads share memory, while processes have separate memory.

  • *args – Any number of arguments that will be passed to func.

  • **kwargs – Any number of keyword arguments that will be passed to func.

Returns:

A concatenated dataset containing an output for each array along the input dim dimension.

Return type:

xarray.Dataset
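The parallel_apply pattern can be sketched with concurrent.futures: map the function over the slices taken along the dimension, then combine the results. This thread-based sketch returns a plain list (dea_tools concatenates xarray objects and can also use separate processes); `parallel_map` is a hypothetical helper name:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(slices, func, *args):
    """Apply `func` to each slice in parallel using a thread pool.
    (A process pool would be used instead when the work is CPU-bound
    and `func` is picklable.)"""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda s: func(s, *args), slices))
```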

dea_tools.datahandling.paths_to_datetimeindex(paths, string_slice=(0, 10))[source]

Helper function to generate a pandas DatetimeIndex object from dates contained in a file path string.

Parameters:
  • paths (list of strings) – A list of file path strings that will be used to extract times

  • string_slice (tuple) – An optional tuple giving the start and stop position that contains the time information in the provided paths. These are applied to the basename (i.e. file name) in each path, not the path itself. Defaults to (0, 10).

Returns:

datetime – A pandas.DatetimeIndex object containing a ‘datetime64[ns]’ derived from the file paths provided by paths.

Return type:

pandas.DatetimeIndex
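The path-to-datetime logic can be sketched with only the standard library: slice the date string out of each file's basename, then parse it. This sketch assumes a YYYY-MM-DD date format and returns a list of datetimes (the dea_tools function infers the format and returns a pandas.DatetimeIndex); `paths_to_datetimes` is a hypothetical helper name:

```python
import os
from datetime import datetime

def paths_to_datetimes(paths, string_slice=(0, 10)):
    """Parse dates from file basenames using the given slice."""
    start, stop = string_slice
    return [datetime.strptime(os.path.basename(p)[start:stop], "%Y-%m-%d")
            for p in paths]
```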

dea_tools.datahandling.wofs_fuser(dest, src)[source]

Fuse two WOfS water measurements represented as ``ndarray``s.

Note: this is a copy of the function located here: GeoscienceAustralia/digitalearthau
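The datacube fuser contract this function follows can be illustrated generically: a fuser mutates `dest` in place, filling pixels flagged as nodata with the overlapping values from `src`. This sketch is not the real WOfS bitfield logic; the nodata flag value is an assumption for illustration:

```python
import numpy as np

def simple_fuser(dest, src, nodata=1):
    """Generic fuser sketch: fill nodata pixels in `dest` from `src`,
    modifying `dest` in place (the contract dc.load fusers follow)."""
    empty = dest == nodata
    dest[empty] = src[empty]
```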

dea_tools.datahandling.xr_pansharpen(ds, transform, pan_band='nbart_panchromatic', return_pan=False, output_dtype=None, parallelise=False, band_weights={'nbart_blue': 0.2, 'nbart_green': 0.4, 'nbart_red': 0.4}, pca_rescaling='histogram')[source]

Apply pan-sharpening to multispectral satellite data with one or more timesteps. The following pansharpening transforms are currently supported:

  • Brovey (“brovey”), with optional band weighting

  • ESRI (“esri”), with optional band weighting

  • Simple mean (“simple mean”)

  • PCA (“pca”)

  • HSV (“hsv”), similar to IHS

Note: Pan-sharpening transforms do not necessarily maintain the spectral integrity of the input satellite data, and may be more suitable for visualisation than quantitative work.

Parameters:
  • ds (xarray.Dataset) – An xarray dataset containing the three input multispectral bands, and a panchromatic band. This dataset should have already been resampled to the spatial resolution of the panchromatic band (15 m for Landsat). Due to differences in the electromagnetic spectrum covered by the panchromatic band, Landsat 8 and 9 data should be supplied with ‘blue’, ‘green’, and ‘red’ multispectral bands, while Landsat 7 should be supplied with ‘green’, ‘red’ and ‘NIR’.

  • transform (string) – The pansharpening transform to apply to the data. Valid options include “brovey”, “esri”, “simple mean”, “pca”, “hsv”.

  • pan_band (string, optional) – The name of the panchromatic band that will be used to pansharpen the multispectral data.

  • return_pan (bool, optional) – Whether to return the panchromatic band in the output dataset. Defaults to False.

  • output_dtype (string or numpy.dtype, optional) – The dtype used for the output values. Defaults to the input dtype of the multispectral bands in ds.

  • parallelise (bool, optional) – Whether to parallelise transformations across multiple cores. Used for PCA and HSV transforms that are applied to each timestep in ds individually; defaults to False.

  • band_weights (dict, optional) – Used for the Brovey and ESRI transforms. Mapping of band names to weights to be applied to each band when calculating the sum (Brovey) or mean (ESRI) of all multispectral bands. The keys of the dictionary should be the names of the bands, and the values should be the weights to apply to each band, e.g.: {"nbart_red": 0.4, "nbart_green": 0.4, "nbart_blue": 0.2}. The default accounts for Landsat 8 and 9’s pan band only partially overlapping with the blue band; this may not be suitable for all applications. Setting band_weights=None will use a simple unweighted sum (for the Brovey transform) or unweighted mean (for the ESRI transform).

  • pca_rescaling (str, optional) – Used for the PCA transform. The method to use for rescaling pan band to more closely match the distribution of values in the first PCA component. “simple” scales the pan band values to more closely match the first PCA component by subtracting the mean of the pan band values from each value, scaling the resulting values by the ratio of the standard deviations of the first PCA component and the pan band, and adding back the mean of the first PCA component. “histogram” uses a histogram matching technique to adjust the pan band values so that the resulting histogram more closely matches the histogram of the first PCA component.

Returns:

ds_pansharpened – An xarray dataset containing the three pansharpened input multispectral bands and, optionally, the panchromatic band (if return_pan=True).

Return type:

xarray.Dataset
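The weighted Brovey transform described above can be sketched in NumPy: each multispectral band is scaled by the ratio of the panchromatic band to a weighted sum of all bands. The weights below mirror the docstring's Landsat 8/9 defaults (red=0.4, green=0.4, blue=0.2); `brovey_pansharpen` is a hypothetical helper name, not the dea_tools implementation:

```python
import numpy as np

def brovey_pansharpen(red, green, blue, pan, weights=(0.4, 0.4, 0.2)):
    """Sketch of the weighted Brovey transform: scale each band by
    pan / (weighted sum of all multispectral bands)."""
    wr, wg, wb = weights
    total = wr * red + wg * green + wb * blue
    ratio = pan / total                      # per-pixel sharpening ratio
    return red * ratio, green * ratio, blue * ratio
```

For example, with uniform bands red=green=0.4, blue=0.2 and pan=0.72, the weighted sum is 0.36, the ratio is 2, and the sharpened bands are 0.8, 0.8 and 0.4.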