WOfS Imputation

  • Compatibility: Notebook currently compatible with both the NCI and DEA Sandbox environments.

  • Products used: ga_ls7e_ard_3

Description

Our time series of Australian waterbodies contain plenty of missing data, not least because Landsat 7's scan line corrector failed in May 2003 (see below for an image of missing data in WOfS). In this notebook we impute missing data on a scene-by-scene basis in a few different ways.

[Image: example of missing data in WOfS caused by the Landsat 7 scan line corrector failure]

We will run WOfS ourselves, without the terrain mask: we already know where the waterbody is, so terrain masking can be ignored for this analysis.

Getting started

To run this analysis, work through this notebook starting with the “Load packages” cell.

Load packages

Import Python packages that are used for the analysis.

Analysis parameters

This section lets you set parameters for the analysis.

Choose a waterbody

Specify the geohash of a waterbody here:

[3]:
geohash = "r282x882w"  # Lagoon of Islands

Estimate WOfS for the full time stack

Load all data for this waterbody and estimate WOfS.

Load Landsat

Fetch the waterbody polygon extents:

[4]:
wb = get_waterbody(geohash)
[5]:
xlim = wb.total_bounds[::2]
ylim = wb.total_bounds[1::2]
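The slicing above relies on geopandas ordering `total_bounds` as (minx, miny, maxx, maxy), so the even-indexed entries are the x limits and the odd-indexed entries are the y limits. A minimal sketch with a plain array standing in for `wb.total_bounds` (the coordinate values are hypothetical):

```python
import numpy as np

# total_bounds is ordered (minx, miny, maxx, maxy).
total_bounds = np.array([492100.0, -4662000.0, 496300.0, -4659000.0])

xlim = total_bounds[::2]   # every second entry from 0: (minx, maxx)
ylim = total_bounds[1::2]  # every second entry from 1: (miny, maxy)
```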

Set up the datacube:

[6]:
# Set up the datacube to get DEA data.
dc = datacube.Datacube(app="WOfSImputation")

# Some query parameters.
dask_chunks = {"x": 3000, "y": 3000, "time": 1}

Then load Landsat 7:

[7]:
# To load in native resolution and projection we need to find the native CRS.
output_crs = mostcommon_crs(
    dc, product="ga_ls7e_ard_3", query=dict(x=xlim, y=ylim, crs="EPSG:3577")
)
[8]:
query = dict(
    x=xlim,
    y=ylim,
    crs="EPSG:3577",
    output_crs=output_crs,  # Native CRS
    align=(
        15,
        15,
    ),  # Required for loading native resolution and CRS Landsat Collection 3
    resolution=(-30, 30),  # Native resolution
    dask_chunks=dask_chunks,
    group_by="solar_day",
    measurements=[
        "nbart_blue",
        "nbart_green",
        "nbart_red",
        "nbart_nir",
        "nbart_swir_1",
        "nbart_swir_2",
        "oa_fmask",
        "oa_nbart_contiguity",
    ],
)
[9]:
ls7 = dc.load("ga_ls7e_ard_3", **query)

Check that everything will fit into memory before we load it:

[10]:
n_px = ls7.sizes["time"] * ls7.sizes["x"] * ls7.sizes["y"]
est = (6 * n_px * 16 + 2 * n_px * 8) // 8 / 1e9
print("Estimated:", est, "GB")

assert est < 40
Estimated: 0.187979792 GB
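The estimate counts 6 surface-reflectance bands stored as int16 (16 bits per pixel) plus 2 quality bands stored as uint8 (8 bits per pixel), converts bits to bytes, then to gigabytes. A sketch reproducing the printed figure from the dataset dimensions shown below (776 × 143 × 121):

```python
# time * x * y pixels, taken from the loaded dataset's dimensions.
n_px = 776 * 143 * 121

# 6 int16 bands at 16 bits each plus 2 uint8 bands at 8 bits each.
bits_per_px = 6 * 16 + 2 * 8            # 112 bits per pixel
est_gb = n_px * bits_per_px // 8 / 1e9  # bits -> bytes -> GB
```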

Then compute the dask arrays and load the data into memory:

[11]:
ls7.load()
[11]:
<xarray.Dataset>
Dimensions:              (time: 776, x: 143, y: 121)
Coordinates:
  * time                 (time) datetime64[ns] 1999-08-10T23:45:11.304012 ......
  * y                    (y) float64 -4.659e+06 -4.659e+06 ... -4.662e+06
  * x                    (x) float64 4.921e+05 4.921e+05 ... 4.963e+05 4.963e+05
    spatial_ref          int32 32655
Data variables:
    nbart_blue           (time, y, x) int16 53 53 52 1 101 ... 147 192 194 173
    nbart_green          (time, y, x) int16 170 358 288 218 ... 311 312 289 318
    nbart_red            (time, y, x) int16 401 347 338 374 ... 385 407 345 394
    nbart_nir            (time, y, x) int16 1727 1796 1679 ... 1496 1412 1475
    nbart_swir_1         (time, y, x) int16 1114 1114 1019 ... 1286 1325 1227
    nbart_swir_2         (time, y, x) int16 579 578 492 470 ... 672 761 709 778
    oa_fmask             (time, y, x) uint8 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
    oa_nbart_contiguity  (time, y, x) uint8 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
Attributes:
    crs:           epsg:32655
    grid_mapping:  spatial_ref

We then combine several masks: fmask, band contiguity, valid data, and the waterbody maximum extent:

[12]:
wb_raster = xr_rasterize(wb, ls7)
[13]:
mask = (ls7.oa_fmask == 1) | (ls7.oa_fmask == 4) | (ls7.oa_fmask == 5)
mask &= ls7.oa_nbart_contiguity == 1
mask &= (ls7 > -999).to_array().all(axis=0)
mask &= wb_raster == 1
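The fmask band labels each pixel with a category code; the mask above keeps the categories that are usable observations and then requires band contiguity and membership in the rasterised waterbody. A numpy sketch of the same boolean logic, assuming the usual Fmask convention of 0 = nodata, 1 = valid, 2 = cloud, 3 = shadow, 4 = snow, 5 = water (treat the exact labels as an assumption):

```python
import numpy as np

fmask = np.array([0, 1, 2, 3, 4, 5])       # one pixel per category code
contiguity = np.array([1, 1, 1, 1, 0, 1])  # 1 = all bands contiguous
in_waterbody = np.array([1, 1, 1, 1, 1, 1])

# Keep clear (1), snow (4), and water (5) pixels ...
mask = (fmask == 1) | (fmask == 4) | (fmask == 5)
# ... that are contiguous across bands and inside the waterbody.
mask &= contiguity == 1
mask &= in_waterbody == 1
```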

Convert the cube into a format that WOfS will accept.

[14]:
ls7_cube = ls7[
    [
        "nbart_blue",
        "nbart_green",
        "nbart_red",
        "nbart_nir",
        "nbart_swir_1",
        "nbart_swir_2",
    ]
].to_array(dim="band")

Finally we can go ahead and estimate WOfS.

[15]:
wofs = fuzzy_wofs.wofs.predict(ls7_cube)

We can reproduce the WOfS all-time summary by taking the mean, which will let us confirm that everything worked OK:

[16]:
plt.imshow(np.nanmean(np.where(mask, wofs, np.nan), axis=0), cmap="jet_r")
/env/lib/python3.6/site-packages/ipykernel_launcher.py:1: RuntimeWarning: Mean of empty slice
  """Entry point for launching an IPython kernel.
[16]:
<matplotlib.image.AxesImage at 0x7ff672120c88>
../../../../_images/notebooks_Scientific_workflows_DEAWaterbodies_DEAWaterbodiesToolkit_WOfSImputation_30_2.png

That looks like it matches the all-time summary (see here on DEA Maps), and the islands in the middle of the lagoon are clearly visible.

Baseline comparison: Time series without imputing WOfS

First we will try matching DEA Waterbodies by discarding all time steps with fewer than 90% valid pixels:

[17]:
wb_area_px2 = wb_raster.sum()
[18]:
ok = mask.sum(axis=(1, 2)) >= wb_area_px2 * 0.9
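The `ok` flag marks scenes in which at least 90% of the waterbody's pixels survived masking. A sketch of the same thresholding with hypothetical per-scene counts:

```python
import numpy as np

wb_area_px2 = 100                          # waterbody size in pixels (hypothetical)
valid_counts = np.array([95, 60, 92, 89])  # valid pixels per time step

# A scene is "ok" when at least 90% of the waterbody was observed.
ok = valid_counts >= wb_area_px2 * 0.9
```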

Then just sum up the pixel counts.

[19]:
ts_baseline = pd.Series(
    np.nansum(np.where(mask, wofs, np.nan), axis=(1, 2)), index=ls7.time.values
)
ts_baseline = ts_baseline.mask(~ok)
[20]:
n_nodata = pd.Series((~mask & (wb_raster == 1)).sum(axis=(1, 2)), index=ls7.time.values)
pc_nodata = n_nodata / int(wb_area_px2)

We can plot this with linear interpolation to join the water surface area time series:

[21]:
fig, ax = plt.subplots()
ts_baseline.plot(ax=ax, linewidth=0, marker="x", c="C0")
ts_baseline.interpolate(method="time").plot(ax=ax, linewidth=1, c="C0")
ax.set_xlabel("Time")
ax.set_ylabel("Area (px$^2$)");
../../../../_images/notebooks_Scientific_workflows_DEAWaterbodies_DEAWaterbodiesToolkit_WOfSImputation_39_0.png

Different thresholds for “good data” can change this substantially. Let’s compare the 90% threshold above to a 75% threshold:

[22]:
ok_low = mask.sum(axis=(1, 2)) >= wb_area_px2 * 0.75
[23]:
ts_baseline_low = pd.Series(
    np.nansum(np.where(mask, wofs, np.nan), axis=(1, 2)), index=ls7.time.values
)
ts_baseline_low = ts_baseline_low.mask(~ok_low)
[24]:
fig, ax = plt.subplots()
ts_baseline.plot(ax=ax, linewidth=0, marker="x", c="C0", label="_nolegend_")
h1 = ts_baseline.interpolate(method="time").plot(
    ax=ax, linewidth=1, c="C0", label="90% OK"
)
ts_baseline_low.plot(ax=ax, linewidth=0, marker="x", c="C1", label="_nolegend_")
h2 = ts_baseline_low.interpolate(method="time").plot(
    ax=ax, linewidth=1, c="C1", label="75% OK"
)
plt.legend()
ax.set_xlabel("Time")
ax.set_ylabel("Area (px$^2$)");
../../../../_images/notebooks_Scientific_workflows_DEAWaterbodies_DEAWaterbodiesToolkit_WOfSImputation_43_0.png

It’s useful here to estimate how far out we might be, because this is really a lower bound on how many wet pixels there are — we’re essentially assuming all missing pixels are dry, so why not assume all missing pixels are wet? That would instead give us an upper bound.
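Treating every missing pixel as dry gives a lower bound on the wet area; treating every missing pixel as wet gives an upper bound. A tiny sketch with hypothetical per-scene counts:

```python
import numpy as np

observed_wet = np.array([40, 10, 55])  # wet pixels actually observed per scene
n_missing = np.array([5, 30, 2])       # masked-out pixels per scene

lower = observed_wet               # assume every missing pixel is dry
upper = observed_wet + n_missing   # assume every missing pixel is wet
```

The gap between the bounds grows with the number of missing pixels, which is why the two thresholds above diverge most on poorly observed scenes.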

[25]:
ts_baseline_upper = pd.Series(
    np.where(ok, wofs.sum(axis=(1, 2)) + n_nodata, np.nan), index=ls7.time.values
)
ts_baseline_low_upper = pd.Series(
    np.where(ok_low, wofs.sum(axis=(1, 2)) + n_nodata, np.nan), index=ls7.time.values
)

We can then plot the lower and upper bounds:

[26]:
fig, ax = plt.subplots()

as_bars = False

if as_bars:
    ax.errorbar(
        ts_baseline_low.index,
        ts_baseline_low,
        yerr=np.stack(
            [np.zeros(len(ts_baseline_low)), ts_baseline_low_upper - ts_baseline_low]
        ),
        label="75% OK data",
        c="C1",
    )
    ax.errorbar(
        ts_baseline.index,
        ts_baseline,
        yerr=np.stack([np.zeros(len(ts_baseline)), ts_baseline_upper - ts_baseline]),
        label="90% OK data",
        c="C0",
    )
else:
    is_nan = pd.isnull(ts_baseline_low)
    ax.fill_between(
        ts_baseline_low.index[~is_nan],
        ts_baseline_low[~is_nan],
        ts_baseline_low_upper[~is_nan],
        label="75% OK data",
        color="C1",
        zorder=1,
        alpha=0.5,
    )
    is_nan = pd.isnull(ts_baseline)
    ax.fill_between(
        ts_baseline.index[~is_nan],
        ts_baseline[~is_nan],
        ts_baseline_upper[~is_nan],
        label="90% OK data",
        color="C0",
        alpha=0.5,
        zorder=2,
    )
plt.legend()
ax.set_xlabel("Time")
ax.set_ylabel("Area (px$^2$)");
../../../../_images/notebooks_Scientific_workflows_DEAWaterbodies_DEAWaterbodiesToolkit_WOfSImputation_47_0.png

Nearest observation imputation

If a pixel has no observed WOfS value on some date, we can fill it with the last available WOfS value for that pixel. The time series can then be computed as before, with no missing pixels, as they will already have been imputed.
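Per-pixel forward filling is just `ffill` down the time axis of the (time × pixel) dataframe. A one-pixel sketch with hypothetical dates:

```python
import numpy as np
import pandas as pd

times = pd.to_datetime(["2000-01-01", "2000-01-17", "2000-02-02", "2000-02-18"])
# One pixel's WOfS values; NaN marks a masked (unobserved) time step.
pixel = pd.Series([1.0, np.nan, np.nan, 0.0], index=times)

filled = pixel.ffill()  # each gap takes the last observed value
```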

[27]:
# Convert into a dataframe so we can use pandas.
df = pd.DataFrame(wofs.reshape(wofs.shape[0], -1), index=ls7.time.values)
[28]:
# Mask the masked values using pandas.
df_masked = df.mask(~mask.values.reshape(mask.shape[0], -1))
[29]:
# Forward-fill masked values.
df_ffilled = df_masked.ffill(axis=0)
[30]:
# Calculate the surface area time series.
ts_ffill = df_ffilled.sum(axis=1)
[31]:
fig, ax = plt.subplots()
ax.errorbar(
    ts_baseline_low.index,
    ts_baseline_low,
    yerr=np.stack(
        [np.zeros(len(ts_baseline_low)), ts_baseline_low_upper - ts_baseline_low]
    ),
    label="75% OK data",
    c="C1",
)
ax.errorbar(
    ts_baseline.index,
    ts_baseline,
    yerr=np.stack([np.zeros(len(ts_baseline)), ts_baseline_upper - ts_baseline]),
    label="90% OK data",
    c="C0",
)
ts_ffill.plot(ax=ax, c="C2", label="ffill")
plt.legend()
ax.set_xlabel("Time")
ax.set_ylabel("Area (px$^2$)");
../../../../_images/notebooks_Scientific_workflows_DEAWaterbodies_DEAWaterbodiesToolkit_WOfSImputation_53_0.png

An equally meaningful approach would be to take the next observation instead.

[32]:
df_bfilled = df_masked.bfill(axis=0)
[33]:
ts_bfill = df_bfilled.sum(axis=1)
[34]:
fig, ax = plt.subplots()
ts_ffill.plot(ax=ax, c="C2", label="ffill")
ts_bfill.plot(ax=ax, c="purple", label="bfill", linestyle="--")
plt.legend()
ax.set_xlabel("Time")
ax.set_ylabel("Area (px$^2$)");
../../../../_images/notebooks_Scientific_workflows_DEAWaterbodies_DEAWaterbodiesToolkit_WOfSImputation_57_0.png

We can also average these (on a per-pixel level!), which means we’re taking the average of the previous and next WOfS values:

[35]:
ts_avgfill = (df_ffilled + df_bfilled).sum(axis=1) / 2
[36]:
fig, ax = plt.subplots()
ts_baseline.interpolate(method="time").plot(ax=ax, c="C0", label="90% OK")
ts_baseline_low.interpolate(method="time").plot(ax=ax, c="C1", label="75% OK")
ts_avgfill.plot(ax=ax, c="C3", label="avgfill")
plt.legend()
ax.set_xlabel("Time")
ax.set_ylabel("Area (px$^2$)");
../../../../_images/notebooks_Scientific_workflows_DEAWaterbodies_DEAWaterbodiesToolkit_WOfSImputation_60_0.png

Finally, we could linearly interpolate each pixel. This will be very similar to the average fill, except where there is a long run of unobserved times.
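Across a single missing step, time interpolation and the ffill/bfill average agree; across a long gap, interpolation ramps linearly between the endpoints while the average sits flat at the midpoint. A one-pixel sketch with hypothetical daily dates:

```python
import numpy as np
import pandas as pd

times = pd.to_datetime(
    ["2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04", "2000-01-05"]
)
pixel = pd.Series([0.0, np.nan, np.nan, np.nan, 1.0], index=times)

linfill = pixel.interpolate(method="time")     # ramps through the gap
avgfill = (pixel.ffill() + pixel.bfill()) / 2  # flat midpoint across the gap
```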

[37]:
df_linfilled = df_masked.interpolate(method="time", axis=0)
[38]:
ts_linfilled = df_linfilled.sum(axis=1)
[39]:
fig, ax = plt.subplots()
ts_baseline.interpolate(method="time").plot(ax=ax, c="C0", label="90% OK")
ts_baseline_low.interpolate(method="time").plot(ax=ax, c="C1", label="75% OK")
ts_avgfill.plot(ax=ax, c="C3", label="avgfill")
ts_linfilled.plot(ax=ax, c="C4", label="linfill")
plt.legend()
[39]:
<matplotlib.legend.Legend at 0x7ff6bc1c8208>
../../../../_images/notebooks_Scientific_workflows_DEAWaterbodies_DEAWaterbodiesToolkit_WOfSImputation_64_1.png

Mean observation imputation

Take the all-time mean as the value for each missing pixel. This is a fairly standard method for imputing values in prediction tasks, but it is very rough.
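Mean imputation is `fillna` with the per-pixel (column-wise) mean of the masked dataframe. A two-pixel sketch (column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Rows are time steps, columns are pixels; NaN marks masked observations.
df_masked = pd.DataFrame({"px0": [1.0, np.nan, 0.0], "px1": [np.nan, 1.0, 1.0]})

mean_values = df_masked.mean(axis=0)  # per-pixel all-time mean, NaNs skipped
df_meanfill = df_masked.fillna(value=mean_values)
```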

[40]:
mean_values = df_masked.mean(axis=0)
[41]:
df_meanfill = df_masked.fillna(value=mean_values)
[42]:
ts_meanfill = df_meanfill.sum(axis=1)
[43]:
fig, ax = plt.subplots()
ts_baseline.interpolate(method="time").plot(ax=ax, c="C0", label="90% OK")
ts_baseline_low.interpolate(method="time").plot(ax=ax, c="C1", label="75% OK")
ts_avgfill.plot(ax=ax, c="C3", label="avgfill")
ts_meanfill.plot(ax=ax, c="magenta", label="meanfill")
plt.legend()
[43]:
<matplotlib.legend.Legend at 0x7ff671ca3be0>
../../../../_images/notebooks_Scientific_workflows_DEAWaterbodies_DEAWaterbodiesToolkit_WOfSImputation_69_1.png

A very clear pattern is visible: on days with few observations, the imputed values are dominated by the mean, so the waterbody appears to spend most of its time at the mean water level.

Conclusion

Setting missing values to the per-pixel average of their previous and next valid observations provides a simple and efficient way to impute WOfS across missing data, and appears to add less bias toward dry values than the alternatives. Forward- or backward-filling would also be acceptable, but averaging has the advantage of being time-symmetric. In the imputed time series, seasonal patterns of filling and emptying are visible, whereas these patterns are less clear or missing in the non-imputed series.


Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Australia data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on GitHub.

Last modified: November 2020

Compatible datacube version:

[61]:
datacube.__version__
[61]:
'1.8.3'

Tags

Tags: DEA Waterbodies, WOfS, water