Surface area duration plots from DEA Waterbodies data 49bdac276568469093e89ba01d490481

  • **Sign up to the DEA Sandbox** to run this notebook interactively from a browser

  • Compatibility: Notebook currently compatible with both the NCI and DEA Sandbox environments

  • Products used: DEA Waterbodies time series data (available online)

Background

A “surface area duration” (SAD) curve depicts the cumulative percentage of time that a waterbody had a larger extent than a given percentage of its maximum extent—in other words, the total amount of time that a waterbody was at least a certain amount full. These are similar to flow duration curves, and describe the long-term behaviour of a waterbody. SAD curves may be useful for identifying different types of waterbodies, which helps us to understand how water is used throughout Australia.

Digital Earth Australia use case

The DEA Waterbodies product uses Geoscience Australia’s archive of over 30 years of Landsat satellite imagery to identify where almost 300,000 waterbodies are in the Australian landscape and tells us how the wet surface area within those waterbodies changes over time. These data can be analysed to obtain insights into the duration and temporal dynamics of inundation for any mapped waterbody in Australia.

Description

This notebook plots “surface area duration” (SAD) curves, which show for how long a waterbody is filled at a given surface area. Such a curve can be generated for any DEA Waterbody. We also generate a time-dependent version of this plot called a “short-time surface area duration” (STSAD) curve, which shows how the surface area duration curve varies over time.

This analysis should work for any time series stored in a similar format.


Getting started

Run the first cell, which loads all modules needed for this notebook. Then edit the configuration to match what you want the notebook to output.

Load modules

[1]:
%matplotlib inline

import numpy as np
import pandas as pd
import scipy.signal
from pathlib import Path
from matplotlib import pyplot as plt

# Import DEA modules.
import sys
sys.path.insert(1, '../Scripts')
from dea_temporaltools import calculate_vector_stat

Configuration

To generate a plot for a waterbody with a given geohash, specify the geohash here:

[2]:
geohash = "r3f225n9h"  # Weereewa/Lake George

Also specify the window for the STSAD plot. Use ‘hann’ if you have no opinions on this, or ‘boxcar’ if you want a window that is easy to think about. You can see a list of possible windows here. Hann is fairly standard for short-time transforms like this and will give a smooth STSAD:

Hann window

Boxcar is a sliding window:

Boxcar window

[3]:
window = "hann"

Finally, specify the path to the waterbodies CSVs:

[4]:
waterbody_csv_path = "https://data.dea.ga.gov.au/projects/WaterBodies/timeseries"

Load DEA Waterbodies data

The DEA Waterbodies time series are stored as CSV files. Each waterbody is labelled by a geohash, e.g. Weereewa is r3f225n9h. They are stored online (on Amazon S3) in a folder named after the first four characters of the geohash, and the filename is the geohash, e.g. Weereewa is at https://data.dea.ga.gov.au/projects/WaterBodies/timeseries/r3f2/r3f225n9h.csv. Each CSV has three columns: Observation Date, Wet pixel percentage, Wet pixel count (n = ?) where ? is the total number of observations. An example is:

Observation Date,Wet pixel percentage,Wet pixel count (n = 230894)
1987-05-29T23:14:29Z,,
1987-07-16T23:15:29Z,,
1987-09-02T23:16:50Z,,
1987-09-18T23:17:13Z,19.9,45926

First we will read the CSV containing the surface area vs time observations data directly from the URL path using pandas. We will rename the Observation Date, Wet pixel percentage, Wet pixel count (n = ?) columns to more consistent and easier to access names: date, pc_wet, px_wet

We also ensure that the ‘date’ column is parsed as a datetime, and convert the data percentages to decimals:

[5]:
# Set path to the CSV file
csv_path = f"{waterbody_csv_path}/{geohash[:4]}/{geohash}.csv"

# Load the data using `pandas`:
time_series = pd.read_csv(csv_path,
                          header=0,
                          names=["date", "pc_wet", "px_wet"],
                          parse_dates=["date"],
                          index_col="date",
                         )

# Convert percentages into a float between 0 and 1.
time_series.pc_wet /= 100

We can now inspect the first few rows of the data:

[6]:
time_series.head()
[6]:
pc_wet px_wet
date
1987-05-29 23:14:29+00:00 NaN NaN
1987-07-16 23:15:29+00:00 NaN NaN
1987-09-02 23:16:50+00:00 NaN NaN
1987-09-18 23:17:13+00:00 0.199 45926.0
1987-10-04 23:17:34+00:00 NaN NaN

Interpolate data to daily values

DEA Waterbodies data is stored with one row per satellite observation. To make our data easier to analyse by time, we can use interpolation the data to estimate the percentage coverage of water for every individual day in our time series.

[7]:
# Round dates in the time series dataset  to the nearest whole day
time_series.index = time_series.index.round(freq="d")

# Create a new `datetime` index with one value for every individual day
# between the first and last DEA Waterbodies observation
dt_index = pd.date_range(start=time_series.index[0],
                         end=time_series.index[-1], freq="d")

# Use this new index to modify the original time series so that it
# has a row for every individual day
time_series = time_series.reindex(dt_index)
assert len(time_series) == len(dt_index)

Then interpolate (linearly, since it’s the least-information thing other than constant).

[8]:
# Replace NaNs with a linear interpolation of the time series.
time_series.pc_wet.interpolate(inplace=True, limit_direction="both")

If we inspect the first rows in the dataset again, we can see that we now have one row for every day (e.g. 1987-05-30, 1987-05-31) instead of a row for each satellite observation:

[9]:
time_series.head()
[9]:
pc_wet px_wet
1987-05-30 00:00:00+00:00 0.199 NaN
1987-05-31 00:00:00+00:00 0.199 NaN
1987-06-01 00:00:00+00:00 0.199 NaN
1987-06-02 00:00:00+00:00 0.199 NaN
1987-06-03 00:00:00+00:00 0.199 NaN

Calculate SADs

For each rolling time step, calculate the SADs.

[10]:
def calculate_sad(vec):
    """Calculates the surface area duration curve for a given vector of heights.

    Parameters
    ----------
    vec : d-dimensional np.ndarray
        Vector of heights over time.

    Returns
    -------
    d-dimensional np.ndarray
        Surface area duration curve vector over the same time scale.
    """
    return np.sort(vec)[::-1]


def calculate_stsad(vec, window_size=365, step=10, progress=None, window="hann"):
    """Calculates the short-time surface area duration curve for a given vector of heights.

    Parameters
    ----------
    vec : d-dimensional np.ndarray
        Vector of heights over time.
    window_size : int
        Sliding window size (default 365).
    step : int
        Step size (default 10).
    progress : iterator -> iterator
        Optional progress decorator, e.g. tqdm.notebook.tqdm. Default None.
    window : str
        What kind of window function to use. Default 'hann', but you might
        also want to use 'boxcar'. Any scipy window
        function is allowed (see documentation for scipy.signal.get_window
        for more information).

    Returns
    -------
    (d / step)-dimensional np.ndarray
        y values (the time axis)
    t-dimensional np.ndarray
        x values (the statistic axis)
    (d / step) x t-dimensional np.ndarray
        The short-time surface area duration curve array.
    """
    return calculate_vector_stat(
        vec,
        calculate_sad,
        window_size=window_size,
        step=step,
        target_dim=window_size,
        progress=progress,
        window=window,
    )

Calculate the STSAD.

[11]:
sad_y, sad_x, sads = calculate_stsad(
    time_series.pc_wet, window_size=365 * 2, window=window, step=10
)

Convert the time axis into integers for calculating tick locations.

[12]:
year_starts = (time_series.index.month == 1) & (time_series.index.day == 1)
year_start_days = (time_series.index - time_series.index[0])[year_starts].days
year_start_years = time_series.index.year[year_starts]

Calculate the overall SAD.

[13]:
sad = calculate_sad(time_series.pc_wet)

Calculate the SAD for the Millenium drought.

[14]:
# Select observations between the start and end of the Millenium drought.
millenium_drought_start = pd.to_datetime('2001-01-01', utc=True)
millenium_drought_end = pd.to_datetime('2009-12-12', utc=True)
millenium_drought = (time_series.index > millenium_drought_start) & (time_series.index < millenium_drought_end)

millenium_sad = calculate_sad(time_series.pc_wet[millenium_drought])

Plot SADs

We can then plot the SAD.

[15]:
plt.figure(figsize=(8, 6))
xs = np.linspace(0, 100, len(sad))
plt.plot(xs, sad, label='All-time')
xs = np.linspace(0, 100, len(millenium_sad))
plt.plot(xs, millenium_sad, label='Millenium drought')
plt.xlabel("Percentage of time")
plt.ylabel("Surface area (% maximum extent)")
plt.legend();
../../_images/notebooks_Real_world_examples_Surface_area_duration_31_0.png

From this plot of the SAD for Weereewa/Lake George, we can tell that the lake was at least partly-inundated for 25 of the last 33 years. The lake was rarely full. The slope of the curve indicates how the lake varied over time, but is hard to interpret without comparison to another SAD curve.

Comparing the all-time curve to the SAD during the Millenium drought shows that the lake was significantly drier during the drought, never exceeding 50% capacity. The very sharp drop at the beginning of the SAD curve indicates rapid drying when the lake was near its peak, suggesting either that the lake bed is not steep at the point where it is 50% full, or that the only significant influx of water into the lake was during a period of rapid drying.

[16]:
plt.figure(figsize=(10, 10))
plt.pcolormesh(sad_x, sad_y, sads)
plt.xlabel("Days in two-year period")
plt.yticks(year_start_days[::2], year_start_years[::2])
plt.ylabel("Date")
plt.colorbar(label="Surface area (% maximum extent)")
plt.title("Short-time surface area duration");
../../_images/notebooks_Real_world_examples_Surface_area_duration_33_0.png

The STSAD plot helps visualise the change in Weereewa/Lake George over time. Here, the Millenium drought is clearly visible between 2000 and 2009 as a period of very little wet extent. The lake was mostly full during the 1990s, and it filled and dried fairly consistently over time, indicating that it is unregulated. It was stably partly-filled for significant parts of 1988, during the 2010–12 La Niña, and 2016–17.


Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Australia data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on Github.

Last modified: September 2020

Tags

Browse all available tags on the DEA User Guide’s Tags Index.

Tags: NCI compatible, DEA Waterbodies, data visualisation, time series, water