Exporting compliant NetCDF datasets for publication

What does this notebook do?

This notebook demonstrates how to create a NetCDF output file that is compliant with the requirements of the NCI and international metadata standards. This is specifically intended to assist with products that were produced outside of standard datacube workflows: for example, where data exists in array or raster form rather than having been automatically generated using a datacube-stats workflow. The notebook demonstrates how to use a selection of existing datacube functions to create the NetCDF, as these automatically create much of the required metadata.

(This is an in-progress document which is likely to be updated regularly as this procedure is streamlined. Please raise an issue on the dea-notebooks repository or contact the author if you have any suggestions for improving the workflow.)

Requirements:

You need to run the following commands from the command line prior to launching Jupyter from the same terminal, so that the required libraries and paths are set:

module use /g/data/v10/public/modules/modulefiles

module load dea

This notebook also uses sample data from the National Intertidal Digital Elevation Model (NIDEM) located at /g/data/r78/rt1527/nidem/output_data/.

Date: August 2018

Author: Robbi Bishop-Taylor

Tags: NetCDF, Compliant files, CF, ACDD, Publication, NCI, Outputting data

[4]:
# Import modules
import rasterio
import numpy as np
from datacube.drivers.netcdf import Variable, create_netcdf_storage_unit, writer as netcdf_writer
from datacube.utils.geometry import Coordinate
from datacube.utils.geometry import CRS

Import sample data

For this example, we will create a compliant NetCDF file that contains three separate input raster datasets. First, import the sample datasets:

[6]:
# Use rasterio to load the input rasters
nidem_path = '/g/data/r78/rt1527/nidem/output_data/geotiff/'
nidem_dem = rasterio.open(nidem_path + 'nidem/NIDEM_33_130.91_-12.26.tif')
nidem_unfiltered = rasterio.open(nidem_path + 'nidem_unfiltered/NIDEM_unfiltered_33_130.91_-12.26.tif')
nidem_mask = rasterio.open(nidem_path + 'nidem_mask/NIDEM_mask_33_130.91_-12.26.tif')

# Read data as numpy arrays
nidem_dem_array = nidem_dem.read(1)
nidem_unfiltered_array = nidem_unfiltered.read(1)
nidem_mask_array = nidem_mask.read(1)
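
Since all three layers will be written to a single set of NetCDF dimensions, it is worth confirming that the input rasters are defined on an identical grid before combining them. A minimal sanity check using the rasterio dataset properties:

# Check that all three rasters share the same dimensions, transform and CRS
assert nidem_dem.shape == nidem_unfiltered.shape == nidem_mask.shape
assert nidem_dem.transform == nidem_unfiltered.transform == nidem_mask.transform
assert nidem_dem.crs == nidem_unfiltered.crs == nidem_mask.crs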

Create new NetCDF

  • Here we create a new NetCDF storage unit using the create_netcdf_storage_unit function from datacube.drivers.netcdf

  • This lets us set up the dataset’s coordinate reference system and dimensions (e.g. x and y coordinates), and create empty variables tied to these dimensions with manually defined nodata values, units and dtypes.

  • Using the CRS, Coordinate and Variable classes automatically generates much of the required metadata, including geospatial_bounds, geospatial_bounds_crs, geospatial_lat_* and geospatial_lon_*.

[8]:
# Calculate coordinates for each pixel in the x and y directions based on the input dataset
x_coords = [nidem_dem.xy(row=0, col=x_ind)[0] for x_ind in range(0, nidem_dem.width)]
y_coords = [nidem_dem.xy(row=y_ind, col=0)[1] for y_ind in range(0, nidem_dem.height)]

# Use netcdfy_coord to produce compliant coordinates
x_coords = netcdf_writer.netcdfy_coord(np.array(x_coords))
y_coords = netcdf_writer.netcdfy_coord(np.array(y_coords))

# Create new dataset
output_netcdf = create_netcdf_storage_unit(filename='sample.nc',

               # Set coordinate reference system
               crs=CRS('EPSG:3577'),

               # Set up dimensions using x and y coordinates computed above
               coordinates={'x': Coordinate(x_coords, 'metres'),
                            'y': Coordinate(y_coords, 'metres')},

               # Set up empty variables for each layer, specifying the units and nodata
               variables={'dem': Variable(dtype=np.dtype('float32'),
                                          nodata=-9999,
                                          dims=('y', 'x'),
                                          units='metres'),
                          'dem_unfiltered': Variable(dtype=np.dtype('float32'),
                                                     nodata=-9999,
                                                     dims=('y', 'x'),
                                                     units='metres'),
                          'mask': Variable(dtype=np.dtype('int16'),
                                           nodata=-9999,
                                           dims=('y', 'x'),
                                           units='metres')},

               # This can be used to specify optional NetCDF creation options like compression
               variable_params={'dem': {}})

print(output_netcdf)
<class 'datacube.drivers.netcdf._safestrings.SafeStringsDataset'>
root group (NETCDF4 data model, file format HDF5):
    date_created: 2019-05-22T17:27:44.267255
    Conventions: CF-1.6, ACDD-1.3
    history: NetCDF-CF file created by datacube version '1.6.2+398.g0e94625d' at 20190522.
    geospatial_bounds: POLYGON ((130.601383569373 -11.9746230018132,130.595069713574 -12.5618627239134,130.588698595575 -13.1477581546625,131.319765093712 -13.1538034581567,131.322836107244 -12.5678942916324,131.325879516635 -11.9806414811064,130.601383569373 -11.9746230018132))
    geospatial_bounds_crs: EPSG:4326
    geospatial_lat_min: -13.153803458156746
    geospatial_lat_max: -11.974623001813235
    geospatial_lat_units: degrees_north
    geospatial_lon_min: 130.58869859557532
    geospatial_lon_max: 131.32587951663467
    geospatial_lon_units: degrees_east
    dimensions(sizes): x(3219), y(5102)
    variables(dimensions): float64 x(x), float64 y(y), int32 crs(), float32 dem(y,x), float32 dem_unfiltered(y,x), int16 mask(y,x)
    groups:
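
The variable_params argument above was left empty, but it can pass per-variable NetCDF creation options through to the underlying netCDF4 createVariable call. A hedged sketch enabling compression and chunking for the dem layer (zlib, complevel and chunksizes are standard netCDF4 keywords; whether they are passed through unchanged may depend on your datacube version):

# Hypothetical example only: enable compression and chunking for 'dem'
variable_params={'dem': {'zlib': True,              # gzip-based compression
                         'complevel': 4,            # compression level (1-9)
                         'chunksizes': (500, 500)}} # chunk shape as (y, x)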

Assign data and attributes to each variable

So far, the created NetCDF contains no data. The next step is to assign each of our data arrays to the previously created variables, and to set metadata for each variable:

  • Using the netcdfy_data function ensures the data types are compliant.

  • Include a descriptive summary of the variable for long_name.

  • If applicable, use valid_range to set the valid numeric range for the data.

  • For most datasets, coverage_content_type can be set to ‘modelResult’ (valid options are ‘image’, ‘thematicClassification’, ‘physicalMeasurement’, ‘auxiliaryInformation’, ‘qualityInformation’, ‘referenceInformation’, ‘modelResult’ and ‘coordinate’).

  • The standard_name attribute links the data to a pre-defined type of data from the Climate and Forecast (CF) Metadata Conventions. If applicable, select an option from the CF standard name table (http://cfconventions.org/Data/cf-standard-names/55/build/cf-standard-name-table.html); e.g. ‘height_above_mean_sea_level’ below. If none of the standard_name options fit your data, it is best to leave the attribute out, even though this will cause the NetCDF file to fail one of the ACDD tests.

[9]:
# dem: assign data and set variable attributes
output_netcdf['dem'][:] = netcdf_writer.netcdfy_data(nidem_dem_array)
output_netcdf['dem'].valid_range = [-25.0, 25.0]
output_netcdf['dem'].standard_name = 'height_above_mean_sea_level'
output_netcdf['dem'].coverage_content_type = 'modelResult'
output_netcdf['dem'].long_name = 'NIDEM filtered by ITEM confidence (< 0.25 NDWI SD), ' \
                                 'bathymetry (> -25 m) and elevation (< 25 m)'

# dem_unfiltered: assign data and set variable attributes
output_netcdf['dem_unfiltered'][:] = netcdf_writer.netcdfy_data(nidem_unfiltered_array)
output_netcdf['dem_unfiltered'].standard_name = 'height_above_mean_sea_level'
output_netcdf['dem_unfiltered'].coverage_content_type = 'modelResult'
output_netcdf['dem_unfiltered'].long_name = 'NIDEM unfiltered data'

# mask: assign data and set variable attributes
output_netcdf['mask'][:] = netcdf_writer.netcdfy_data(nidem_mask_array)
output_netcdf['mask'].valid_range = [1, 3]
output_netcdf['mask'].coverage_content_type = 'qualityInformation'
output_netcdf['mask'].long_name = 'NIDEM mask flagging cells with elevations greater ' \
                                  'than 25 m (value = 1), less than -25 m (value = 2), ' \
                                  'and ITEM confidence NDWI standard deviation greater ' \
                                  'than 0.25 (value = 3)'

# The `variables` attribute lets you verify the data has been included in the dataset:
print(output_netcdf.variables)

OrderedDict([('x', <class 'netCDF4._netCDF4.Variable'>
float64 x(x)
    units: metres
    standard_name: projection_x_coordinate
    long_name: x coordinate of projection
unlimited dimensions:
current shape = (3219,)
filling on, default _FillValue of 9.969209968386869e+36 used
), ('y', <class 'netCDF4._netCDF4.Variable'>
float64 y(y)
    units: metres
    standard_name: projection_y_coordinate
    long_name: y coordinate of projection
unlimited dimensions:
current shape = (5102,)
filling on, default _FillValue of 9.969209968386869e+36 used
), ('crs', <class 'netCDF4._netCDF4.Variable'>
int32 crs()
    grid_mapping_name: albers_conical_equal_area
    standard_parallel: [-18. -36.]
    longitude_of_central_meridian: 132.0
    latitude_of_projection_origin: 0.0
    false_easting: 0.0
    false_northing: 0.0
    long_name: GDA94 / Australian Albers
    semi_major_axis: 6378137.0
    semi_minor_axis: 6356752.314140356
    inverse_flattening: 298.257222101
    crs_wkt: PROJCS["GDA94 / Australian Albers",GEOGCS["GDA94",DATUM["Geocentric_Datum_of_Australia_1994",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],TOWGS84[0,0,0,0,0,0,0],AUTHORITY["EPSG","6283"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4283"]],PROJECTION["Albers_Conic_Equal_Area"],PARAMETER["standard_parallel_1",-18],PARAMETER["standard_parallel_2",-36],PARAMETER["latitude_of_center",0],PARAMETER["longitude_of_center",132],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","3577"]]
    spatial_ref: PROJCS["GDA94 / Australian Albers",GEOGCS["GDA94",DATUM["Geocentric_Datum_of_Australia_1994",SPHEROID["GRS 1980",6378137,298.257222101,AUTHORITY["EPSG","7019"]],TOWGS84[0,0,0,0,0,0,0],AUTHORITY["EPSG","6283"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4283"]],PROJECTION["Albers_Conic_Equal_Area"],PARAMETER["standard_parallel_1",-18],PARAMETER["standard_parallel_2",-36],PARAMETER["latitude_of_center",0],PARAMETER["longitude_of_center",132],PARAMETER["false_easting",0],PARAMETER["false_northing",0],UNIT["metre",1,AUTHORITY["EPSG","9001"]],AXIS["Easting",EAST],AXIS["Northing",NORTH],AUTHORITY["EPSG","3577"]]
    GeoTransform: [-1.553500e+05  2.500000e+01  0.000000e+00 -1.262375e+06  0.000000e+00
 -2.500000e+01]
unlimited dimensions:
current shape = ()
filling on, default _FillValue of -2147483647 used
), ('dem', <class 'netCDF4._netCDF4.Variable'>
float32 dem(y, x)
    _FillValue: -9999.0
    grid_mapping: crs
    units: metres
    valid_range: [-25.  25.]
    standard_name: height_above_mean_sea_level
    coverage_content_type: modelResult
    long_name: NIDEM filtered by ITEM confidence (< 0.25 NDWI SD), bathymetry (> -25 m) and elevation (< 25 m)
unlimited dimensions:
current shape = (5102, 3219)
filling on), ('dem_unfiltered', <class 'netCDF4._netCDF4.Variable'>
float32 dem_unfiltered(y, x)
    _FillValue: -9999.0
    grid_mapping: crs
    units: metres
    standard_name: height_above_mean_sea_level
    coverage_content_type: modelResult
    long_name: NIDEM unfiltered data
unlimited dimensions:
current shape = (5102, 3219)
filling on), ('mask', <class 'netCDF4._netCDF4.Variable'>
int16 mask(y, x)
    _FillValue: -9999
    grid_mapping: crs
    units: metres
    valid_range: [1 3]
    coverage_content_type: qualityInformation
    long_name: NIDEM mask flagging cells with elevations greater than 25 m (value = 1), less than -25 m (value = 2), and ITEM confidence NDWI standard deviation greater than 0.25 (value = 3)
unlimited dimensions:
current shape = (5102, 3219)
filling on)])
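
For categorical layers like mask, the CF conventions also define flag_values and flag_meanings attributes for documenting what each integer value represents. These were not set in this workflow, but could be added in the same way as the attributes above; a hedged sketch, with meanings paraphrased from the long_name:

# Optional: describe the mask categories using standard CF flag attributes
output_netcdf['mask'].flag_values = np.array([1, 2, 3], dtype='int16')
output_netcdf['mask'].flag_meanings = 'elevation_above_25m bathymetry_below_minus25m ndwi_standard_deviation_above_0.25'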

Add global metadata

The next step is to add global attributes which provide metadata for the entire NetCDF dataset, including all variables:

  • This link (https://geo-ide.noaa.gov/wiki/index.php?title=NetCDF_Attribute_Convention_for_Dataset_Discovery) provides guidance on the kinds of information that can be included. Aim to include as many of the ‘Highly Recommended’ and ‘Recommended’ attributes as possible.

  • Many of the ‘Recommended’ attributes will be automatically generated by datacube functions.

[10]:
# Add global attributes
output_netcdf.title = 'National Intertidal Digital Elevation Model (NIDEM) 25m v 0.1.0'
output_netcdf.institution = 'Commonwealth of Australia (Geoscience Australia)'
output_netcdf.product_version = '0.1.0'
output_netcdf.license = 'CC BY Attribution 4.0 International License'
output_netcdf.time_coverage_start = '1986-01-01'
output_netcdf.time_coverage_end = '2016-10-31'
output_netcdf.cdm_data_type = 'Grid'
output_netcdf.contact = 'clientservices@ga.gov.au'
output_netcdf.publisher_email = 'earth.observation@ga.gov.au'
output_netcdf.source = 'OTPS TPX08 Atlas'
output_netcdf.keywords = 'Tidal, Topography, Landsat, Elevation, Intertidal, MSL, ITEM, NIDEM, DEM'
output_netcdf.summary = "Insert a detailed multiparagraph description of the dataset here"

# When you print the dataset now, the new global attributes should appear:
print(output_netcdf)

<class 'datacube.drivers.netcdf._safestrings.SafeStringsDataset'>
root group (NETCDF4 data model, file format HDF5):
    date_created: 2019-05-22T17:27:44.267255
    Conventions: CF-1.6, ACDD-1.3
    history: NetCDF-CF file created by datacube version '1.6.2+398.g0e94625d' at 20190522.
    geospatial_bounds: POLYGON ((130.601383569373 -11.9746230018132,130.595069713574 -12.5618627239134,130.588698595575 -13.1477581546625,131.319765093712 -13.1538034581567,131.322836107244 -12.5678942916324,131.325879516635 -11.9806414811064,130.601383569373 -11.9746230018132))
    geospatial_bounds_crs: EPSG:4326
    geospatial_lat_min: -13.153803458156746
    geospatial_lat_max: -11.974623001813235
    geospatial_lat_units: degrees_north
    geospatial_lon_min: 130.58869859557532
    geospatial_lon_max: 131.32587951663467
    geospatial_lon_units: degrees_east
    title: National Intertidal Digital Elevation Model (NIDEM) 25m v 0.1.0
    institution: Commonwealth of Australia (Geoscience Australia)
    product_version: 0.1.0
    license: CC BY Attribution 4.0 International License
    time_coverage_start: 1986-01-01
    time_coverage_end: 2016-10-31
    cdm_data_type: Grid
    contact: clientservices@ga.gov.au
    publisher_email: earth.observation@ga.gov.au
    source: OTPS TPX08 Atlas
    keywords: Tidal, Topography, Landsat, Elevation, Intertidal, MSL, ITEM, NIDEM, DEM
    summary: Insert a detailed multiparagraph description of the dataset here
    dimensions(sizes): x(3219), y(5102)
    variables(dimensions): float64 x(x), float64 y(y), int32 crs(), float32 dem(y,x), float32 dem_unfiltered(y,x), int16 mask(y,x)
    groups:

Close the dataset

When you have finished adding data and metadata, close the dataset to write the output to file:

[11]:
# Close dataset
output_netcdf.close()
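
To confirm the file was written successfully, you can re-open it read-only using the netCDF4 library (a quick sanity check; assumes the netCDF4 package is available, as it is in the dea module):

import netCDF4

# Re-open the file read-only and confirm the global attributes and variables exist
with netCDF4.Dataset('sample.nc', mode='r') as ds:
    print(ds.title)
    print(list(ds.variables.keys()))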

Testing compliance

At minimum, the NCI requires compliance with the Climate and Forecast Metadata Standard (CF) and the Attribute Convention for Data Discovery (ACDD). To test whether the dataset is compliant, we can run several checks:

  • Climate and Forecast Metadata Standard (CF) compliance

  • Attribute Convention for Data Discovery (ACDD) compliance

  • Run the compliance-checker tool loaded by the dea module on the NCI: compliance-checker test_file.nc

  • This is a detailed check which will give a total score for the dataset and an explanation for each failed test.

  • Every effort should be made to make all ‘High Priority’ tests pass, but there is some flexibility (e.g. varattr may give a score of less than 100% if attributes that are not applicable to a given variable, such as standard_name, are not set).

  • You do not need to pass all ‘Medium Priority’ tests, but try to complete as many as practically possible.

[12]:
!compliance-checker sample.nc
Running Compliance Checker on the datasets from: ['sample.nc']


--------------------------------------------------------------------------------
                         IOOS Compliance Checker Report
                                    acdd:1.3
http://wiki.esipfed.org/index.php?title=Category:Attribute_Conventions_Dataset_Discovery
--------------------------------------------------------------------------------
                               Corrective Actions
sample.nc has 5 potential issues


                               Highly Recommended
--------------------------------------------------------------------------------
variable "mask" missing the following attributes:
* standard_name


                                  Recommended
--------------------------------------------------------------------------------
Global Attributes
* acknowledgment/acknowledgement not present
* comment not present
* creator_name not present
* creator_url not present
* creator_email not present
* geospatial_vertical_min not present
* geospatial_vertical_max not present
* geospatial_vertical_positive not present
* geospatial_bounds_vertical_crs not present
* id not present
* naming_authority not present
* project not present
* processing_level not present
* publisher_name not present
* publisher_url not present
* standard_name_vocabulary not present
* time_coverage_duration not present
* time_coverage_resolution not present

geospatial_lat_extents_match
* Could not find lat variable to test extent of geospatial_lat_min/max, see CF-1.6 spec chapter 4.1

geospatial_lon_extents_match
* Could not find lon variable to test extent of geospatial_lon_min/max, see CF-1.6 spec chapter 4.2

time_coverage_extents_match
* Could not find time variable to test extent of time_coverage_start/time_coverage_end, see CF-1.6 spec chapter 4.4
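
Most of the ‘Recommended’ attributes flagged above can be addressed by setting additional global attributes in the same way as earlier, before closing the dataset. A hedged sketch with placeholder values only (substitute details appropriate to your own product):

# Placeholder values only: replace with details for your own product
output_netcdf.creator_name = 'Your name here'
output_netcdf.creator_email = 'your.email@example.com'
output_netcdf.publisher_name = 'Your organisation here'
output_netcdf.publisher_url = 'http://www.example.com'
output_netcdf.naming_authority = 'your.naming.authority'
output_netcdf.processing_level = 'Describe the processing applied here'
output_netcdf.standard_name_vocabulary = 'CF Standard Name Table v55'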

Check file opens correctly in QGIS

If your file contains spatial data, open the NetCDF in QGIS to verify it plots in the correct location, and that nodata values are correctly accounted for.
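
As a complementary check, GDAL-based tools can read individual NetCDF variables directly using GDAL’s NETCDF:file:variable subdataset syntax. A hedged sketch using rasterio (assumes GDAL’s netCDF driver is available):

# Open a single variable from the NetCDF via GDAL's subdataset syntax
with rasterio.open('NETCDF:sample.nc:dem') as src:
    print(src.crs)     # should report EPSG:3577
    print(src.bounds)  # should match the extent of the input rasters
    print(src.nodata)  # should be -9999.0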