DEA Code Functionality Examples

Summary

A great deal of work has been done by AGDC partners to outline the key requirements and technical approach for the next version of the AGDC (Version 2). For example, the previous Technical Working Group held a workshop that resulted in a draft Preliminary Design Report, which has subsequently been translated to the AGDC wiki (http://datacube.org.au). Similarly, prototype functionality has been built to test and verify key concepts for a range of components, including the netCDF-4 based Storage Units managed under NCI's NERDIP data management, publishing and data access protocols; a Generalised Data Framework (GDF) for accessing the multidimensional Storage Units; and the AGDC Analytic Engine, which adds support for interactive Exploratory Data Analysis (EDA).

The following requirements for Version 2 are based largely on this thinking and documentation. It is important to note that these requirements apply to the current development effort running through to June 30, 2016. They are designed to guide and define the next step in the development of the AGDC, not the final destination.

Approved Requirements for AGDC Version 2

These requirements were formally approved by the AGDC Programme Board at its meeting on 10 December 2015.

Patterns of Use:

AGDC Version 2 will support the following patterns of use.

[ ]:
%matplotlib inline
from matplotlib import pyplot as plt
import datacube
from datacube.model import Range
from datetime import datetime
from datacube.utils import geometry
from datacube.storage import masking
from datacube.storage.masking import mask_invalid_data
import pandas
import xarray
import numpy
[ ]:
import folium
from IPython.display import display
import geopandas
from shapely.geometry import mapping
from shapely.geometry import MultiPolygon
import rasterio
import shapely.geometry
import shapely.ops
from functools import partial
import pyproj
# from datacube.utils.geometry import CRS
import time
from dateutil import tz
[ ]:
notebook_start_dt = datetime.now()
[ ]:
dc = datacube.Datacube(app='requirements-met')
[ ]:
datacube.__version__
[ ]:
datacube.__path__
[ ]:
!module list
[ ]:
!datacube system check

Remote Datacube

[ ]:
from datacube import Datacube
remotedc = Datacube(config='/home/547/jps547/.aws_datacube.conf')
[ ]:
from geopy.geocoders import Nominatim
geolocator = Nominatim()
location = geolocator.geocode("Launceston, Tasmania")
[ ]:
location
[ ]:
query = {
    'lat': (-41.82, -41.81),
    'lon': (146.71, 146.72),
}

wofs = remotedc.load(product='wofs_modified_albers', group_by='solar_day', **query)

wofs.water.flags_definition
pandas.DataFrame.from_dict(masking.get_flags_def(wofs), orient='index')
[ ]:
# Throughout this notebook we call print() rather than returning objects as cell output.
# This minimises memory usage by avoiding references to data arrays in Jupyter's output history.
[ ]:
print(wofs)
[ ]:
wofs.water.plot(col='time', col_wrap=5, robust=True)
[ ]:
remotedc.list_products()
[ ]:
water = masking.make_mask(wofs, result='clear_wet')

water.water.plot(col='time', col_wrap=5, robust=True, cmap='gray')
[ ]:
water.water.sum('time').plot()
[ ]:
results = remotedc.index.datasets.search_eager(product='wofs_modified_albers')
[ ]:
len(results)
[ ]:
results[0].metadata_doc
[ ]:
results = None
water = None
wofs = None

Test a native projection query on a non-standard rainfall product

[ ]:
query = {
    'product': 'bom_rainfall_grids',
    'crs': 'EPSG:3577',
    'time': ('1987-10-01', '1990-10-01'),
    'x': (349388.9787330463, 358497.9246628304),
    'y': (-2379960.5883129314, -2375926.544118764),
    'output_crs': 'EPSG:3577',
    'resolution': (-25, 25),
    'resampling': 'cubic',
}

rainfall_projected = dc.load(**query)
[ ]:
print(rainfall_projected)
[ ]:
rainfall_projected = None

Search for dataset by location example

[ ]:
# Was replaced by stacked file
# ds_uri = 'file:///g/data/rs0/datacube/002/LS5_TM_NBART/-7_-30/LS5_TM_NBART_3577_-7_-30_20030711011529000000_v1488078639.nc'

ds_uri = 'file:///g/data/rs0/datacube/002/LS5_TM_NBAR/13_-35/LS5_TM_NBAR_3577_13_-35_2009_v1496827052.nc'
datasets_in_stack = list(dc.index.datasets.get_datasets_for_location(uri=ds_uri))

# If restacked, there will be a newer file (with higher vXXX version suffix)
# So this one shouldn't change:
assert len(datasets_in_stack) == 74
assert set(d.type.name for d in datasets_in_stack) == {'ls5_nbar_albers'}
print(repr(datasets_in_stack[0]))

datasets_in_stack = None
[ ]:
def chop_to_globe(geom):
    to_crs = geometry.CRS("EPSG:4326")
    left_of_dt = geometry.line(
        [
            (180 - 1.0e-8, -90),
            (180 - 1.0e-8, 90)
        ],
        crs=to_crs
    ).to_crs(geom.crs)

    right_of_dt = geometry.line(
        [
            (-180 + 1.0e-8, 90),
            (-180 + 1.0e-8, -90)
        ],
        crs=to_crs
    ).to_crs(geom.crs)

    chopper = geometry.polygon(
        left_of_dt.points + right_of_dt.points + [left_of_dt.points[0]],
        crs=geom.crs
    )
    return geom.intersection(chopper)

def datasets_union_clip180(dss):
    thing = geometry.unary_union(ds.extent for ds in dss)
    return chop_to_globe(thing).to_crs(geometry.CRS('EPSG:4326'))
[ ]:
def datasets_union(dss):
    thing = geometry.unary_union(ds.extent for ds in dss)
    print(thing.crs)
    return thing.to_crs(geometry.CRS('EPSG:4326'))
[ ]:
import random
def plot_folium(shapes):

    mapa = folium.Map(location=[-30,150], zoom_start=4)
    colors=['#00ff00', '#ff0000', '#00ffff', '#ffffff', '#000000', '#ff00ff']
    for shape in shapes:
        style_function = lambda x: {'fillColor': '#000000' if x['type'] == 'Polygon' else '#00ff00',
                                    'color' : random.choice(colors)}
        poly = folium.features.GeoJson(mapping(shape), style_function=style_function)
        mapa.add_child(poly)
    display(mapa)
[ ]:
def plot_rgb(image, fake_saturation):
    image = mask_invalid_data(image)
    rgb = image.to_array(dim='color')
    rgb = rgb.transpose(*(rgb.dims[1:] + rgb.dims[:1]))  # make 'color' the last dimension
    rgb = rgb.where((rgb <= fake_saturation).all(dim='color'))  # mask out pixels where any band is 'saturated'
    rgb /= fake_saturation  # scale to [0, 1] range for imshow

    rgb.plot.imshow(
        x=image.crs.dimensions[1],
        y=image.crs.dimensions[0],
        col='time',
        col_wrap=5,
        add_colorbar=False
    )

Plot WOfS

[ ]:
plot_folium([
    datasets_union(
        remotedc.index.datasets.search_eager(product='ls7_wofs_pq_scene', )
    )
])

Confirm Dashboard Works

[ ]:
# Dashboard view of current production status per product with Spatial, Tabular and Reference views
#!/g/data/v10/public/run-dash.sh
#!firefox http://127.0.0.1:8080/

Search with sources

[ ]:
(
     dc.index.datasets.get('11228944-42f6-4b9f-a434-73e0c7f9bde2', include_sources=True)
).metadata_doc['lineage']['source_datasets']['0']['lineage']['source_datasets']['level1']['gqa']

Refine search using sources

[ ]:
from datacube.model import Range
results=[]
for product in ['ls5_level1_scene']:
    results.append(dc.index.datasets.search_eager(
        product=product,
        sat_path = Range(91,92),
        sat_row = Range(88,90),
    ))
print(len(results))
print(str(results[0][0].local_path))
[ ]:
from datacube.model import Range
time_range = Range(datetime(2013, 10, 1), datetime(2014, 1, 1))
results = dc.index.datasets.search_eager(
    product='ls8_nbar_albers',
    time=time_range,
    source_filter=dict(
        product='ls8_level1_scene',
        sat_path = Range(87,116),
        sat_row = Range(67,91),
    )
)
print(len(results))
print(results[0])
[ ]:
results[0].metadata_doc
[ ]:
results = None

Grid Cell count of timeslices

[ ]:
product = 'ls8_nbar_albers'

gw = datacube.api.GridWorkflow(dc.index, product=product)
cells = gw.list_cells(product=product, time=('2016-01-01', '2017-01-01'))

for idx, data in cells.items():
    print(idx, data.shape[0])

cells = None

Query products by time range - return count

Count for single query

[ ]:
# Count of single query
product = 'ls7_nbar_scene'
start_time = time.time()
count = dc.index.datasets.count(
    product=product,
    time=Range(datetime(2000, 1, 1), datetime(2001, 1, 1))
)
print("Single product {} count of {} in {} seconds ---".format(product, count, time.time() - start_time))

Count products through time

[ ]:
start_time = time.time()

results = dc.index.datasets.count_by_product_through_time(
    '1 year',
    platform='LANDSAT_7',
    time=Range(
        datetime(2000, 1, 1),
        datetime(2003, 1, 1), #, tzinfo=tz.tzutc()),
    )
)

start_product_time = time.time()
for product, series in results:
    print('{}: {} seconds'.format(product.name, time.time() - start_product_time))
    for timerange, count in series:
        print('\t{}: {}'.format(timerange[0], count))
    start_product_time = time.time()

print("--- %s seconds ---" % (time.time() - start_time))

Data load examples

[ ]:
dc.list_measurements()

Load SRTM

[ ]:
query = {
    'lat': (-20.61, -20.66),
    'lon': (147.0, 147.05)
}

srtm_dem1sv1_0 = dc.load(product='srtm_dem1sv1_0', **query)
[ ]:
srtm_dem1sv1_0.dem.plot()
[ ]:
srtm_dem1sv1_0 = None

Load Radiometrics

[ ]:
query = {
    'lat': (-20.61, -20.66),
    'lon': (147.0, 147.05),
    'measurements':['rad_k_equiv_conc_filtered','rad_th_equiv_conc_filtered','rad_u_equiv_conc_filtered']
}

# Gamma ray no longer available at indexed location?
try:
    radiometrics = dc.load(product='gamma_ray', **query)

    # Previously: array(3.2456936836242676)
    print(radiometrics.rad_u_equiv_conc_filtered.max())


    print(radiometrics)
    plot_rgb(radiometrics, 20)

except Exception as e:
    print(repr(e))

[ ]:
radiometrics = None

Load NBAR

[ ]:
nbar = dc.load(
    product='ls5_nbart_albers',
    y=(-18.7896944682,-18.7896944684),
    x=(146.073501475,146.073501476)
)
[ ]:
print(nbar)
[ ]:
nbar = None

MODIS Landsat time series

[ ]:
modis = dc.load(
    product='modis_mcd43a4_tile',
    longitude=132.1,
    latitude=-27.5,
    time=('2000-1-1', '2000-5-1'),
    resolution=(-500,500),
    # measurements=('Nadir_Reflectance_Band1','Nadir_Reflectance_Band4','Nadir_Reflectance_Band3'),
)
[ ]:
# Josh archived all of this product.
#modis.Nadir_Reflectance_Band1.min()
assert len(modis) == 0

print(modis)
[ ]:
modis = None

Load rainfall

[ ]:
rain = dc.load(product='bom_rainfall_grids', longitude=132.1, latitude=-27.5, time=('2000-1-1', '2001-1-1'))
[ ]:
rain.rainfall.sel(longitude=132.1, latitude=-27.500001, method='nearest').plot()
[ ]:
rain = None

1. Routine national scale product generation. Specifically, Version 2 will include national collections of:

Water Observations from Space.

[ ]:
dc.index.datasets.search_eager(product='srtm_dem1sv1_0')
[ ]:
rain_ds = dc.index.datasets.search_eager(product='bom_rainfall_grids', time=Range(datetime(2001, 1, 1), datetime(2002, 1, 1)))

datasets_union_clip180(rain_ds).to_crs(geometry.CRS('EPSG:3577'))
[ ]:
rain_ds = None
[ ]:
plot_folium([
    datasets_union_clip180(
        dc.index.datasets.search_eager(
            product='bom_rainfall_grids',
            time=Range(datetime(2001, 1, 1), datetime(2002, 1, 1))
        )
    )
])
[ ]:
plot_folium([
    datasets_union(
        dc.index.datasets.search_eager(
            product='srtm_dem1sv1_0',
            time=Range(datetime(2001, 1, 1), datetime(2020, 1, 1))
        )
    )
])
[ ]:
# Josh archived all of this
modis_mcd43a4 = dc.load(
    product='modis_mcd43a4_tile',
    x=(145,145.1),
    y=(-38.0,-38.1),
    resolution=(-500,500),
    time=('2000-01-01', '2001-01-01'),
    measurements=(
        'Nadir_Reflectance_Band1',
        'Nadir_Reflectance_Band4',
        'Nadir_Reflectance_Band3'
    )
)
[ ]:
print(modis_mcd43a4)
[ ]:
modis_mcd43a4 = None

Multiband file or subdataset support example

[ ]:
%%script false
# Modis disabled. Josh archived all of this

modis_mcd43a1 = dc.load(
    product='modis_mcd43a1_tile',
    x=(145,145.1),
    y=(-38.0,-38.1),
    resolution=(-500,500),
    time=('2001-01-01', '2001-01-01'),
    measurements=(
        'BRDF_Albedo_Parameters_Band1',
        'BRDF_Albedo_Parameters_Band2',
        'BRDF_Albedo_Parameters_Band3'
    )
)
[ ]:

# modis_mcd43a1.BRDF_Albedo_Parameters_Band1[0][1]
[ ]:
# plot_rgb(modis_mcd43a1, 1000)

Plot MODIS timeseries

[ ]:
# plot_rgb(modis_mcd43a4, 3000)
[ ]:
# plot_folium([datasets_union(dc.index.datasets.search_eager(product='ls5_nbart_albers', time=Range(datetime(1986, 1, 1), datetime(1988, 1, 1))))])

Query, load and plot NBART example

[ ]:
query = {
    'time': ('1987-01-01', '1988-01-01'),
    'lat': (-35.2, -35.4),
    'lon': (149.0, 149.2),
}

nbart = dc.load(
    product='ls5_nbart_albers',
    measurements=['swir2', 'nir', 'red'],
    group_by='solar_day' ,
    **query
)

plot_rgb(nbart, 3500)
[ ]:
nbart = None

Query, load and plot WoFS example

Construction of the Burdekin Dam was completed in 1987.

[ ]:
%%script false
# Disabled: WOfS has been removed.

query = {
    'time': ('1986-01-01', '1988-01-01'),
    'lat': (-20.61, -20.67),
    'lon': (147.0, 147.15),
}

wofs = dc.load(product='wofs_albers', group_by='solar_day', **query)
wofs
[ ]:
%%script false
wofs.water.flags_definition
pandas.DataFrame.from_dict(masking.get_flags_def(wofs), orient='index')
[ ]:
%%script false

water = masking.make_mask(wofs, result='clear_wet')
water.water.plot(col='time', col_wrap=5, robust=True, cmap='gray')
[ ]:
%%script false
# Number of clear water pixels - no group by solar day to exclude scene overlap

water_sum = water.water.sum('time').where(water.water.sum('time')!= 0)
water_sum.plot(cmap='rainbow', robust=True)
[ ]:
wofs = None
water = None
water_sum = None

Intertidal Characterisation.

[ ]:
# incomplete for v2 - water classification only - confidence metrics continue to be a work in progress

Landsat Fractional Cover.

[ ]:
plot_folium([
    datasets_union(
        dc.index.datasets.search_eager(
            product='ls5_fc_albers',
            time=Range(datetime(1987, 1, 1), datetime(1988, 1, 1))
        )
    )
])
[ ]:
# Query, load and plot Fractional Cover example

query = {
    'time': ('1987-01-01', '1988-01-01'),
    'lat': (-35.2, -35.4),
    'lon': (149.0, 149.2),
}

fractional_cover = dc.load(
    product='ls5_fc_albers',
    measurements=['BS', 'PV', 'NPV'],
    group_by='solar_day' ,
    **query
)

plot_rgb(fractional_cover, 100)
[ ]:
fractional_cover = None

Pixel Quality

[ ]:
plot_folium([
    datasets_union(
        dc.index.datasets.search_eager(
            product='ls5_pq_albers',
            time=Range(datetime(1987, 1, 1), datetime(1988, 1, 1))
        )
    )
])
[ ]:
plot_folium([
    datasets_union(
        dc.index.datasets.search_eager(
            product='ls7_nbar_albers',
           time=Range(datetime(2004,2, 1), datetime(2004, 2, 16))
        )
    )
])
[ ]:
pq = dc.load(product='ls5_pq_albers', group_by='solar_day', **query)
pq.pixelquality.plot(col='time', col_wrap=5, robust=True)
[ ]:
# Number of clear pixels based on ga_good_pixels default - no group by solar day to exclude scene overlap
clear_pixels = masking.make_mask(pq, ga_good_pixel=True)
clear_pixels.pixelquality.sum('time').plot(cmap='rainbow')
[ ]:
clear_pixels = None

NDVI

[ ]:
plot_folium([
    datasets_union(
        dc.index.datasets.search_eager(
            product='ls5_ndvi_albers',
            time=Range(datetime(1987,1, 1), datetime(1988, 1, 1))
        )
    )
])
[ ]:
#ndvi = dc.load(product='ls5_ndvi_albers', group_by='solar_day', **query )
#ndvi = mask_invalid_data(ndvi)
good_pixels = masking.make_mask(pq, ga_good_pixel=True)
#ndvi = ndvi.where(good_pixels.pixelquality)
#ndvi.ndvi.plot(col='time', col_wrap=5, robust=True, cmap='Greens')
print(good_pixels)
[ ]:

Landsat Surface reflectance statistical summaries:

Seasonal medians; and
[ ]:
#ndvi.ndvi.median(dim='time').plot(robust=True, cmap='YlGn')

Most-up-to-date observation.

[ ]:
# Most recent pixel mosaic function from cloud screen NBAR
#%run /g/data/v10/public/agdcv2_requirements/latest_pixel/agdc_pixel/pixel_test.py '60/(15,-40)'
[ ]:
#my_data
[ ]:
#my_data[15, -40].swir1.plot(cmap='rainbow', robust=True)

In some cases, these collections may be virtual, i.e. they are not pre-computed but rather computed as they are needed.

[ ]:
nbar = dc.load(product='ls5_nbar_albers', group_by='solar_day', **query )
nbar = nbar.where(good_pixels.pixelquality)
ndvi_on_the_fly = (nbar.nir-nbar.red)/(nbar.nir+nbar.red)*10000
ndvi_on_the_fly.median(dim='time').plot(robust=True, cmap='YlGn')
[ ]:
# diff = (ndvi_on_the_fly.median(dim='time') - ndvi.ndvi.median(dim='time'))
[ ]:
# diff.plot(robust=True) #error at level of noise due to *10000 scaling of floats

2. A user should be able to interact with these collections through a web browser, including:

Clicking on a pixel and displaying a time series (i.e., pixel drill).

[ ]:
# See Steven Ackerley / Andrew Hicks
# NCML through THREDDS
#!/g/data/v10/public/agdcv2-pixeldrill/pixeldrill -p ls5_nbart_albers

#Webpage to define it
!firefox http://dapds00.nci.org.au/thredds/ncss/grid/uc0/rs0_dev/all_the_ncmls/LS5_TM_FC/ncml/LS5_TM_FC_13_-43.ncml/pointDataset.html
[ ]:
#CSV URL for a drill
!firefox http://dapds00.nci.org.au/thredds/ncss/uc0/rs0_dev/all_the_ncmls/LS5_TM_FC/ncml/LS5_TM_FC_13_-43.ncml?var=BS&var=NPV&var=PV&latitude=-38.13&longitude=147.17&time_start=1986-08-21T23%3A18%3A40Z&time_end=2011-11-14T23%3A45%3A07Z&accept=csv

Spatio-temporal (statistical) summaries that would allow users to easily answer questions such as:

How frequently was water observed over catchment y during time period x; and

What was the surface reflectance for area x at time y?

[ ]:
# not this time - you mean wofs right? - see the sketch below for when the data is available

Link to stats within a polygon

3. Earth Observation (EO) scientists and allied domain specialists will be able to undertake exploratory data analysis. In general, this means a user will be able to easily retrieve, investigate and visualise data, develop algorithms, test, iterate, visualise results and interpret them in the context of other spatio-temporal datasets.

A key demonstration of this capability will be the availability of functions and data structures to enable Landsat/MODIS blending.

Input data

The AGDC will use the following data collections: Landsat (TM, ETM+ and OLI/TIRS); and MODIS (Collection 6 MOD09 granules and MCD43 sinusoidal tiles), which will provide the variables necessary for the Landsat-MODIS blending algorithm.

[ ]:
dc.list_products()

It will also include the SRTM 3 second DSM and 1 and 3 second DEMs.

[ ]:
dc.list_measurements()

At a minimum, the Australian implementation of the AGDC will cover all of continental Australia plus a one tile buffer and the Great Barrier Reef. However, the Board's preference would be for Version 2 to also cover all Commonwealth Marine Reserves.

[ ]:
plot_folium([
    datasets_union(
        dc.index.datasets.search_eager(
            product='ls8_nbar_albers',
            time=Range(datetime(2014,1, 1), datetime(2015, 1, 1))
        )
    )
])

All data collections that are included in the Australian implementation of the AGDC will: have a CC BY Attribution 3.0 or CC BY Attribution 4.0 licence.

[ ]:
!ncdump -h '/g/data/rs0/datacube/002/LS5_TM_NBAR/14_-40/LS5_TM_NBAR_3577_14_-40_19860821231816000000.nc' | grep license

The use of a collection with a different licence will require approval of the AGDC Programme Board; and be in netCDF-4 format, complying with relevant CF conventions.

[ ]:
import compliance_checker
[ ]:
compliance_checker.__version__
[ ]:
from __future__ import print_function
import argparse
import sys

import cf_units
from compliance_checker.runner import ComplianceChecker, CheckSuite
from compliance_checker import __version__

ds_loc = '/g/data/rs0/datacube/002/LS5_TM_NBAR/14_-40/LS5_TM_NBAR_3577_14_-40_19860821231816000000.nc'

cs = CheckSuite()
cs.load_all_available_checkers()

was_success, errors = ComplianceChecker.run_checker(
    ds_loc=ds_loc,
    checker_names=['cf'], # 'acdd'?
    verbose=0,
    criteria='normal',
    output_filename='-',
    # Skipping:
    # - check_dimension_order:
    #      Our files don't contain all the lats/lons as an auxiliary
    #      coordinate var as it's unnecessary for any software we've
    #      tried.
    #      It may be added at some point in the future, and this check
    #      should be re-enabled.
    skip_checks=['check_dimension_order']
)
# missing source attribute: skippable according to Damien
# assert was_success, "Compliance failed."
[ ]:
help(ComplianceChecker.run_checker)

Output Products

By June 30, 2016 the Australian implementation of the AGDC will ensure that the products being produced are supported by and hosted on the RDS.

[ ]:
!ls /g/data/rs0/datacube/002
[ ]:
!ls /g/data/fk4/datacube/002

Technical Requirements

In line with the use-case patterns outlined above, Version 2 will support: data fusion and analysis across heterogeneous gridded data collections from different domains.

[ ]:
albers_grid = dc.load(
    product='ls5_nbart_albers',
    group_by='solar_day',
    measurements=['swir2', 'nir', 'blue'],
    **query
)
[ ]:
sinusoidal_grid = dc.load(
    product='ls5_nbar_albers',
    group_by='solar_day',
    measurements=['swir2', 'nir', 'blue'],
    output_crs='PROJCS["unnamed",GEOGCS["Unknown datum based upon the custom spheroid",\
                DATUM["Not specified (based on custom spheroid)",\
                SPHEROID["Custom spheroid",6371007.181,0]],\
                PRIMEM["Greenwich",0],\
                UNIT["degree",0.0174532925199433]],\
                PROJECTION["Sinusoidal"],\
                PARAMETER["longitude_of_center",0],\
                PARAMETER["false_easting",0],\
                PARAMETER["false_northing",0],\
                UNIT["Meter",1]]',
    resolution=(-250,250),
    **query
)
[ ]:
plot_rgb(albers_grid, 3500)
albers_grid = None
[ ]:
plot_rgb(sinusoidal_grid, 3500)
sinusoidal_grid = None

Tuneable configuration, at ingest, of multidimensional files (e.g. chunking, compression type, dimension depth).

[ ]:
!datacube product add --help

A data retrieval mechanism that provides the ability to:

Obtain seamless subsets of data across storage unit boundaries;

[ ]:
seamless_query = {
    'time': ('2011-6-1', '2011-6-30'),
    'lat': (-20, -30),
    'lon': (132.0, 132.01),
}
[ ]:
seamless_subset = dc.load(
    product='ls5_nbar_albers',
    measurements=['swir2', 'nir', 'blue'],
    group_by='solar_day',
    **seamless_query
)
[ ]:
seamless_subset.y.max() - seamless_subset.y.min() #distance in metres
[ ]:
seamless_subset.x.max() - seamless_subset.x.min() #distance in metres
[ ]:
seamless_subset.extent.points
[ ]:
plot_rgb(seamless_subset, 3000)
[ ]:
# With the Grid Workflow class demonstrate seamless spatial query across tile boundaries
from datacube.api import GridWorkflow
[ ]:
gw = GridWorkflow(dc.index, product='ls5_nbar_albers')
[ ]:
# The query straddles multiple tiles - indexes shown below
gw.list_cells(
    product = 'ls5_nbar_albers',
    **seamless_query
).keys()

Filter data based on observation attribute (for example, pixel quality);

[ ]:

# Load some NDVI
nbar = dc.load(product='ls5_nbar_albers', group_by='solar_day', **query)
ndvi = (nbar.nir - nbar.red) / (nbar.nir + nbar.red) * 10000
nbar = None

# Mask the valid data
ndvi = mask_invalid_data(ndvi)
# Create a mask for the data with artefacts
good_pixels = masking.make_mask(pq, ga_good_pixel=True)
# Apply artefact mask
ndvi = ndvi.where(good_pixels.pixelquality)

# Plot the result
ndvi.plot(col='time', col_wrap=5, robust=True, cmap='YlGn')
ndvi = None

pandas.DataFrame.from_dict(
    masking.get_flags_def(pq),
    orient='index'
)

Define the spatio-temporal range of interest independent of data storage unit; and

Define the specific sensor or combination of sensor data to be analysed.

[ ]:
multi_sensor_query = {
    'time': ('2013-06-01', '2014-01-01'),
    'lat': (-35.2, -35.4),
    'lon': (149.0, 149.2),
}

products = ['ls7_nbar_albers', 'ls8_nbar_albers']

# Find measurements common to all products
measurements = set(dc.index.products.get_by_name(products[0]).measurements.keys())
for prod in products[1:]:
    measurements &= set(dc.index.products.get_by_name(prod).measurements.keys())

datasets = []
for prod in products:
    ds = dc.load(product=prod, measurements=list(measurements), **multi_sensor_query)
    ds['product'] = ('time', numpy.repeat(prod, ds.time.size))
    datasets.append(ds)

combined = xarray.concat(datasets, dim='time')
combined = combined.isel(time=combined.time.argsort())  # sort along time dim
[ ]:
combined.swir1.median(dim='time').plot(robust=True, cmap='rainbow')

The API will provide a simplified conceptual model for data query and analysis based on an n-dimensional array abstraction;

[ ]:
combined.data_vars
[ ]:
combined = None

During EDA, lazy evaluation of calculations so only those results that are in use are computed; and

[ ]:
# CSIRO Peter Wang - link to notebook

Support for calculations on arrays that are larger than core memory.

[ ]:
# Dask Example
lazydata = dc.load(
    product='ls7_nbart_albers',
    longitude=(132.0, 137.0),
    latitude=(-20, -25),
    time=('2011-6-1', '2011-6-30'),
    dask_chunks={'x': 200, 'y': 200, 'time': 5}
)
[ ]:
from dask.dot import dot_graph

dot_graph(lazydata)
[ ]:
print(lazydata)
[ ]:
# If larger than memory will be chunked and run
print(lazydata.mean(dim='time'))

[ ]:
# Visualise doesn't seem to exist anymore?
# lazydata.blue.visualise()

Continental scale product generation will be based on the continental workflows from the current AGDC v1 API; however, they will be modified to use the Version 2 data retrieval mechanism.

[ ]:
# See stats, wofs and ingestion

The ability to manage results of calculations as a temporary/private data cube for further analysis.

[ ]:
# See remote cube example above

Web based delivery of products through WMS, WCS, CS/W, OpenDAP services.

[ ]:
!firefox http://dapds00.nci.org.au/thredds/catalogs/rs0/catalog.html

Basic provenance that records information about an analysis result/product, such as what datasets, software version, ancillary data and algorithm were used to produce the product. Wherever possible, Version 2 will adopt and adapt existing software, services and standards.

[ ]:
#combined.variables
[ ]:
# See sources example above

Other Requirements

The Project Plan will be supported by:

A transition plan and timetable for moving AGDC production from Version 1 to Version 2 of the AGDC (including controlled updates to data collections that are already accessible via RDS);

[ ]:
!firefox http://dapds00.nci.org.au/thredds/catalogs/rs0/catalog.html

A software release and management plan;

Approved by Robert Woodcock, Matt Paget and Simon Oliver 19/01/2016

  1. Develop branch - the aim is to keep this branch unbroken; integration tests should enforce the expected functionality on this branch.

     1. Integration tests are trying to enforce this.

     2. Develop is the main collaboration branch.

     3. Keeping feature branches separate for too long is discouraged - small incremental improvements are better.

  2. Master to be replaced with git tagged releases (these should not deviate greatly from develop). The process to get to a release could be:

     1. Decide when develop is feature complete.

     2. Branch it to a release candidate.

     3. Test the release candidate (manual - apply patches to the branch as needed).

     4. Tag the release.

  3. Continuous integration tests are run on every check-in to check core functionality - this currently includes all of the ingestion code and is being extended to include storage access.

     1. To merge in a feature branch, it should include any relevant tests.

  4. A persistent demo environment is to be set up on Raijin with major changes going in - accessible to NCI, CSIRO and GA - and only updated when we are going to do a release, i.e. against the release candidate (as a final check).

     1. Aim to have push-button releases to raijin (nightly build).

  5. We should also be doing user acceptance tests and need to agree on what these look like prior to approving a release.

A plan for improving the data management of the collections in the RDS Landsat (rs0) and WOfS (fk4) projects, such as establishing data layout, appropriate access controls (including read/write permissions) and alignment to existing organisational data libraries;

Reformatting of all data in projects rs0 (Landsat) and fk4 (WOfS) that is currently stored in GeoTIFF into netCDF4-CF;

[ ]:

An AGDC database API that can provide a base for further AGDC developments and can accommodate updates to the internal structure of netCDF4-CF files;

[ ]:
# Looks like a good base? Not sure what update to internal structure in NetCDF actually means...

Upgrading all documentation (including Data Management Plans, Product Specifications, etc.) and ensuring that any Metadata is compliant with the requirements of the Australian Government Spatial Data Policies and Directives, the Australian National Data Service and data.gov.au; and

[ ]:
#

œ œ

Benchmarks and Quality Assurance tests to validate the quality of access required by the agreed use cases.

[ ]:
notebook_end_dt = datetime.now()
'Finished in {}'.format(notebook_end_dt - notebook_start_dt)
[ ]: