# Extracting waterbodies from Sentinel-2

• **Sign up to the DEA Sandbox** to run this notebook interactively from a browser

• Compatibility: Notebook currently compatible with both the NCI and DEA Sandbox environments

## Background

DEA Waterbodies is a time-series data product that summarises the surface area of open waterbodies in Australia using Water Observations from Space (WOfS). WOfS classifies Landsat pixels as wet or dry. Landsat data have a resolution of 25 m, but Sentinel 2 data, with their 10 m resolution, would make it possible to distinguish neighbouring waterbodies that are blurred together in Landsat (and hence in DEA Waterbodies). However, there is not yet a WOfS product for Sentinel 2.

One alternative to WOfS that we could evaluate for Sentinel 2 is the Modified Normalised Difference Water Index (MNDWI), which can be calculated directly from surface reflectance. An MNDWI greater than zero is indicative of water. It is not as accurate as WOfS, but it is easier to obtain.
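As a quick illustration of the index (the reflectance values below are made up for demonstration, not DEA data):

```python
import numpy as np

# Toy surface reflectance values for three pixels: open water,
# vegetation, and bare soil (values are illustrative only).
green = np.array([0.06, 0.10, 0.15])
swir = np.array([0.02, 0.20, 0.25])

# MNDWI = (green - SWIR) / (green + SWIR); water comes out positive
# because water reflects more green light than shortwave infrared.
mndwi = (green - swir) / (green + swir)
water = mndwi > 0

print(water)
```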

## Description

This notebook compares DEA Waterbodies derived from WOfS to an analogous product derived from MNDWI. There are two main factors we can vary: how to derive our polygons, and how to derive our time series. Both polygons and time series can be generated from Landsat WOfS, from Landsat MNDWI, or from Sentinel 2 MNDWI. We can choose to combine any polygon method and any time series method. This gives us a grid of possible evaluations:

| | Landsat WOfS polygons | Landsat MNDWI polygons | Sentinel 2 MNDWI polygons |
| --- | --- | --- | --- |
| **Landsat WOfS time series** | DEA Waterbodies | Polygon proxy quality | Good differentiation for merged waterbodies |
| **Landsat MNDWI time series** | Time series proxy quality | Total proxy quality | |
| **Sentinel 2 MNDWI time series** | Sentinel 2 proxy quality | | Highest resolution |

The upper left corner is the existing DEA Waterbodies. Polygons derived from Sentinel 2 would be great for differentiating neighbouring waterbodies, even with lower-resolution surface area data. Landsat WOfS polygons with MNDWI time series tell us how well MNDWI approximates the surface area time series. Similarly, Landsat MNDWI polygons with WOfS time series tell us how well the MNDWI polygons approximate those we obtain from WOfS. The highest attainable resolution comes from deriving both parts of the product from Sentinel 2.

## Getting started

Edit the analysis parameters and run all the cells in this notebook to analyse a region of Australia.

Import Python packages that are used for the analysis.

[1]:

import sys
import datacube
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
import datacube.virtual as virtual
import geopandas as gpd

sys.path.append("../Scripts")
from dea_plotting import display_map
from dea_spatialtools import xr_rasterize, xr_vectorize
from dea_datahandling import wofs_fuser
from dea_dask import create_local_dask_cluster



### Connect to the datacube

Connect to the datacube so we can access DEA data. The app parameter is a unique name for the analysis which is based on the notebook file name.

[2]:

dc = datacube.Datacube(app="Sentinel_2_waterbodies")


[3]:

create_local_dask_cluster()


### Cluster

• Workers: 1
• Cores: 2
• Memory: 14.18 GB

### Analysis parameters

Choose which area of Australia to analyse:

[4]:

# Lake Boort and Lake Lyndger
xlim = (143.70141, 143.76180)
ylim = (-36.14517, -36.09688)


### Check the analysis area

Display a map of the area we plan to analyse.

[5]:

display_map(x=xlim, y=ylim)

[5]:


## Build a virtual product for the MNDWI calculation

The MNDWI is given by

$$\mathrm{MNDWI} = \frac{\mathrm{green} - \mathrm{SWIR}}{\mathrm{green} + \mathrm{SWIR}}.$$

We’ll build a virtual product that calculates this without having to load all of the channels ourselves. A virtual product is just like a regular DEA product, except it has some transformations applied before we see it. We can define one with YAML that applies cloud masking and calculates the MNDWI.

[6]:

cat_yaml = """
products:
    ls8_MNDWI:
        recipe:
            transform: expressions
            output:
                MNDWI:
                    formula: (green - swir1) / (green + swir1)
                    dtype: float32
            input:
                transform: expressions
                output:
                    nodata: False
                    green: green
                    swir1: swir1
                input:
                    product: ga_ls8c_ard_3
                    output_crs: EPSG:3577
                    resolution: [-30, 30]
                    resampling:
                        '*': average
    s2a_MNDWI:
        recipe:
            transform: expressions
            output:
                MNDWI:
                    formula: (nbart_green - nbart_swir_2) / (nbart_green + nbart_swir_2)
                    dtype: float32
            input:
                transform: expressions
                output:
                    nodata: False
                    nbart_green: nbart_green
                    nbart_swir_2: nbart_swir_2
                input:
                    collate:
                      - product: s2b_ard_granule
                        output_crs: EPSG:3577
                        resolution: [-10, 10]
                        resampling:
                            '*': average
                      - product: s2a_ard_granule
                        output_crs: EPSG:3577
                        resolution: [-10, 10]
                        resampling:
                            '*': average
    wofs:
        recipe:
            transform: expressions
            output:
                water:
                    formula: water
                    nodata: -1
            input:
                transform: expressions
                output:
                    clear:
                        formula: (water == 0) | (water == 128)
                        nodata: False
                    water:
                        formula: water
                        nodata: -1
                input:
                    product: wofs_albers
                    measurements: [water]
                    output_crs: EPSG:3577
                    resolution: [-25, 25]
"""


Then we can convert the YAML into a catalogue containing our custom virtual products, which we can use to load data just like a datacube:

[7]:

cat = virtual.catalog_from_yaml(cat_yaml)


Now we can use this catalogue to load our MNDWI virtual products for Sentinel 2 and Landsat 8. Use Dask to avoid loading everything into memory at once if the data are too big, or otherwise call `.load()` here to load the data immediately.

Note: This step can take several minutes to load.

[8]:

# Set up spatio-temporal query used to load data for both products
query = {
    "x": xlim,
    "y": ylim,
    "time": ("2016-01", "2018-12"),
    "group_by": "solar_day",
    "dask_chunks": {"x": 3000, "y": 3000, "time": 1},
}

# Load Landsat 8 MNDWI data
ls8_mndwi = cat["ls8_MNDWI"].load(dc, **query)

# Load Sentinel 2 MNDWI data
s2a_mndwi = cat["s2a_MNDWI"].load(dc, **query)

Finally, grab WOfS for the same time period as our Landsat observations.

[9]:

# Load WOfS data for the same spatio-temporal query
wofs = dc.load(product="wofs_albers", fuse_func=wofs_fuser, **query)

# Set nodata to NaN, and wet pixels (value 128) to 1
wofs = wofs.where((wofs == 0) | (wofs == 128))
wofs = xr.where(wofs == 128, 1, wofs)


## Calculate an all-time average

In analogy to the WOfS all-time summary, calculate an all-time average for water derived from MNDWI. First, let’s threshold each MNDWI frame so we can find the water. What would be a good threshold? This may vary between satellites.

[10]:

# Find all the dates common to both datasets so we can visualise them
# with a fair comparison.
common_dates = sorted(
    set(list(s2a_mndwi.time.astype("datetime64[D]").values))
    & set(list(ls8_mndwi.time.astype("datetime64[D]").values))
)

[11]:

# Define MNDWI water index thresholds
s2_threshold = 0
ls8_threshold = 0

# Plot a single date for both Sentinel 2 and Landsat 8 data
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
(s2a_mndwi.MNDWI.sel(time=common_dates[3], method="nearest") >
 s2_threshold).plot.imshow(ax=axs[0])
(ls8_mndwi.MNDWI.sel(time=common_dates[3], method="nearest") >
 ls8_threshold).plot.imshow(ax=axs[1])
axs[0].set_title("Sentinel 2 > threshold")
axs[1].set_title("LS8 > threshold")

[11]:

Text(0.5, 1.0, 'LS8 > threshold')


A variable threshold like local Otsu thresholding might be useful in future.
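Global Otsu thresholding, the building block of that local variant, is easy to sketch. Everything below is illustrative rather than part of this notebook's workflow: the `otsu_threshold` helper is our own minimal NumPy implementation, and the MNDWI values are synthetic. A local variant would apply the same criterion within a moving window.

```python
import numpy as np

def otsu_threshold(values, nbins=256):
    # Search histogram bin centres for the threshold that maximises
    # the between-class variance (Otsu's criterion).
    hist, edges = np.histogram(values, bins=nbins)
    hist = hist.astype(float)
    centres = (edges[:-1] + edges[1:]) / 2
    weight1 = np.cumsum(hist)                  # pixels at or below each bin
    weight2 = weight1[-1] - weight1            # pixels above each bin
    cumsum_vals = np.cumsum(hist * centres)
    mean1 = cumsum_vals / np.where(weight1 == 0, 1, weight1)
    mean2 = (cumsum_vals[-1] - cumsum_vals) / np.where(weight2 == 0, 1, weight2)
    between_class = weight1 * weight2 * (mean1 - mean2) ** 2
    return centres[np.argmax(between_class)]

# Synthetic bimodal MNDWI distribution: a large dry mode near -0.4
# and a smaller wet mode near 0.4.
rng = np.random.default_rng(42)
mndwi = np.concatenate([
    rng.normal(-0.4, 0.05, 1000),
    rng.normal(0.4, 0.05, 200),
])
threshold = otsu_threshold(mndwi)
print(threshold)
```

The attraction is that the threshold adapts to the data rather than being fixed at zero, which is exactly what the per-satellite tuning above does manually.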

Next, summarise these thresholded scenes and also generate a summary for WOfS by calculating how often each pixel is wet in the data.

Here we make the summaries and plot them. We have chosen thresholds for Sentinel 2 and Landsat 8 that seem to give the best results, and plotted these as contours. In reality we would have no manual oversight of this step, so this is something of a best-case attempt. To match the methodology of DEA Waterbodies, we need two thresholds: one for identifying waterbodies and one for finding their maximum extent. In WOfS (DEA Waterbodies) these are 10% and 5% respectively.

[12]:

# Threshold MNDWI by the MNDWI threshold defined above, then compute
# a summary frequency layer by taking the mean along the time dimension
s2a_mndwi_summary = (s2a_mndwi >= s2_threshold).MNDWI.mean(dim='time')
ls8_mndwi_summary = (ls8_mndwi >= ls8_threshold).MNDWI.mean(dim='time')
wofs_summary = wofs.water.mean(axis=0)

[13]:

# Summary layer thresholds
s2a_thresholds = [0.03, 0.08]
ls8_thresholds = [0.05, 0.10]
wofs_thresholds = [0.05, 0.10]

[14]:

fig, axs = plt.subplots(1, 3, figsize=(15, 5))
colours = ["black", "white"]
s2a_mndwi_summary.plot.imshow(ax=axs[0])
ls8_mndwi_summary.plot.imshow(ax=axs[1])
s2a_mndwi_summary.plot.contour(levels=s2a_thresholds, ax=axs[0], colors=colours)
ls8_mndwi_summary.plot.contour(levels=ls8_thresholds, ax=axs[1], colors=colours)
wofs_summary.plot.contour(levels=wofs_thresholds, ax=axs[2], colors=colours)
wofs_summary.plot.imshow(ax=axs[2])
axs[0].set_title("Sentinel 2 MNDWI average")
axs[1].set_title("Landsat 8 MNDWI average")
axs[2].set_title("WOfS water frequency")

[14]:

Text(0.5, 1.0, 'WOfS water frequency')


## Generating polygons

We can generate polygons by judicious choice of threshold for each of these.

The existing method for DEA Waterbodies is to generate polygons at both the 10% and the 5% inundation frequency levels, then use the 10% polygons to detect waterbodies and the 5% polygons to get their maximum extent. We will use the thresholds from above, which differ between Sentinel 2 and Landsat.
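The notebook implements this rule below with vectorised polygons and a spatial join; the same detect/extent logic can be sketched at the raster level on a toy inundation-frequency grid (all numbers here are made up):

```python
import numpy as np
from scipy import ndimage

# Toy inundation-frequency grid: one strong waterbody plus a fringe
# below the detection threshold, and one isolated low-frequency patch.
freq = np.zeros((8, 8))
freq[1:4, 1:4] = 0.5    # core waterbody, well above 10%
freq[1:4, 4] = 0.07     # fringe of the same waterbody, between 5% and 10%
freq[6, 6] = 0.07       # isolated patch that never reaches 10%

detect, extent = 0.10, 0.05

# Label connected regions of the low (extent) threshold mask, then keep
# only the regions that contain at least one detection (>= 10%) pixel.
labels, n = ndimage.label(freq >= extent)
keep = np.unique(labels[freq >= detect])
waterbodies = np.isin(labels, keep) & (labels > 0)
```

The fringe pixels survive because they connect to a detected core, while the isolated patch is dropped: that is the "lower boundaries, only if they contain an upper polygon" rule in miniature.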

[15]:

def mask_to_polygons(mask):
    # Vectorised polygons will have their attribute column
    # set based on the pixel values of the mask.
    gdf = xr_vectorize(mask, crs="EPSG:3577")
    # This is a boolean mask, so we can just check
    # for attribute == 1.
    return gdf[gdf.attribute == 1]

[16]:

# Mask Sentinel 2 summary by both thresholds and convert to polygons
s2a_lower_wbs = mask_to_polygons(s2a_mndwi_summary >= s2a_thresholds[0])
s2a_upper_wbs = mask_to_polygons(s2a_mndwi_summary >= s2a_thresholds[1])


[17]:

# Mask Landsat 8 summary by both thresholds and convert to polygons
ls8_lower_wbs = mask_to_polygons(ls8_mndwi_summary >= ls8_thresholds[0])
ls8_upper_wbs = mask_to_polygons(ls8_mndwi_summary >= ls8_thresholds[1])

[18]:

# Mask WOfS summary by both thresholds and convert to polygons
wofs_lower_wbs = mask_to_polygons(wofs_summary >= wofs_thresholds[0])
wofs_upper_wbs = mask_to_polygons(wofs_summary >= wofs_thresholds[1])


Keep the lower-threshold polygons, but only those that contain an upper-threshold polygon.

[19]:

s2a_wbs = s2a_lower_wbs.loc[
    gpd.sjoin(s2a_lower_wbs, s2a_upper_wbs, how="right").index_left
]
ls8_wbs = ls8_lower_wbs.loc[
    gpd.sjoin(ls8_lower_wbs, ls8_upper_wbs, how="right").index_left
]
wofs_wbs = wofs_lower_wbs.loc[
    gpd.sjoin(wofs_lower_wbs, wofs_upper_wbs, how="right").index_left
]


Let’s discard everything less than 5 Landsat pixels in area: 5 × (30 m)$^2$ = 4500 m$^2$ for Landsat Collection 3. This matches the methodology of DEA Waterbodies.

[20]:

# Set an area threshold in square metres
area_threshold = 4500

# Apply this threshold to remove smaller waterbodies
s2a_wbs = s2a_wbs[s2a_wbs.area >= area_threshold]
ls8_wbs = ls8_wbs[ls8_wbs.area >= area_threshold]
wofs_wbs = wofs_wbs[wofs_wbs.area >= area_threshold]

[21]:

# Plot the resulting waterbodies for Sentinel 2, Landsat 8 and WOfS
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
for wbs, ax, t in zip(
    [s2a_wbs, ls8_wbs, wofs_wbs], axs, ["Sentinel 2 MNDWI", "Landsat 8 MNDWI", "WOfS"]
):
    wbs.plot(ax=ax)
    ax.set_title(t)


We end up with a few polygons that are duplicated or overlapping. Combine these by doing a unary union and then split back into polygons again.

[22]:

s2a_wbs = gpd.GeoDataFrame(
    geometry=[poly for poly in s2a_wbs.unary_union], crs="EPSG:3577"
)
ls8_wbs = gpd.GeoDataFrame(
    geometry=[poly for poly in ls8_wbs.unary_union], crs="EPSG:3577"
)
wofs_wbs = gpd.GeoDataFrame(
    geometry=[poly for poly in wofs_wbs.unary_union], crs="EPSG:3577"
)


Finally give everything a new ID. We can use this later to generate our masks.

[23]:

s2a_wbs["ID"] = np.arange(1, len(s2a_wbs) + 1).astype(int)
ls8_wbs["ID"] = np.arange(1, len(ls8_wbs) + 1).astype(int)
wofs_wbs["ID"] = np.arange(1, len(wofs_wbs) + 1).astype(int)


Set the ID to be the index so we can query by it. Don’t drop the ID so we can use it as a column for later rasterising.

[24]:

s2a_wbs.set_index("ID", drop=False, inplace=True)
ls8_wbs.set_index("ID", drop=False, inplace=True)
wofs_wbs.set_index("ID", drop=False, inplace=True)


## Surface area time series comparison

Let’s choose a waterbody and compare time series derived from different methods.

Select a polygon in the Sentinel 2 waterbodies:

[25]:

s2a_wb = s2a_wbs.iloc[0]

[26]:

s2a_wb.geometry

[26]:


Then find its equivalent in Landsat MNDWI and WOfS polygons by finding the polygon with the largest intersection.

[27]:

ls8_wb = ls8_wbs.iloc[ls8_wbs.intersection(s2a_wb.geometry).area.argmax()]
wofs_wb = wofs_wbs.iloc[wofs_wbs.intersection(s2a_wb.geometry).area.argmax()]

[28]:

ls8_wb.geometry

[28]:

[29]:

wofs_wb.geometry

[29]:


Calculate masks that match the waterbodies to each data grid. There are nine masks, corresponding to the three polygon sets crossed with the three time series.

[30]:

polygon_sets = {
    "wofs": wofs_wbs,
    "ls8": ls8_wbs,
    "s2a": s2a_wbs,
}

ts_dataarrays = {
    "wofs": wofs.water,
    "ls8": ls8_mndwi.MNDWI >= ls8_threshold,
    "s2a": s2a_mndwi.MNDWI >= s2_threshold,
}

ids = {
    "wofs": wofs_wb.ID,
    "ls8": ls8_wb.ID,
    "s2a": s2a_wb.ID,
}

area_per_px = {
    "wofs": 25 ** 2,
    "ls8": 30 ** 2,
    "s2a": 10 ** 2,
}

[31]:

masks = {}  # (polygons, time series)
for poly_name, poly_set in polygon_sets.items():
    for ts_name, ts_dataarray in ts_dataarrays.items():
        masks[poly_name, ts_name] = xr_rasterize(
            poly_set, ts_dataarray, attribute_col="ID"
        )

Rasterizing to match xarray.DataArray dimensions (234, 237)
Rasterizing to match xarray.DataArray dimensions (196, 198)
Rasterizing to match xarray.DataArray dimensions (584, 592)
Rasterizing to match xarray.DataArray dimensions (234, 237)
Rasterizing to match xarray.DataArray dimensions (196, 198)
Rasterizing to match xarray.DataArray dimensions (584, 592)
Rasterizing to match xarray.DataArray dimensions (234, 237)
Rasterizing to match xarray.DataArray dimensions (196, 198)
Rasterizing to match xarray.DataArray dimensions (584, 592)


Each mask has pixels set to the ID value of the polygon that contains those pixels.

[32]:

masks["ls8", "ls8"].plot(cmap="rainbow", add_colorbar=False)

[32]:

<matplotlib.collections.QuadMesh at 0x7fb2e1a3a4e0>


Use the masks to extract pixel values for each time. We also want to extract how many pixels are invalid, so we can figure out which days have good observations.

[33]:

ts_wet = {}
ts_invalid = {}
for ts_name, ts_dataarray in ts_dataarrays.items():
    for poly_name, poly_set in polygon_sets.items():
        # Pixels belonging to the selected waterbody polygon.
        mask = masks[poly_name, ts_name] == ids[poly_name]
        # Wet surface area inside the polygon at each time step.
        ts_wet[poly_name, ts_name] = (
            ts_dataarray.where(mask).sum(dim=["x", "y"]) * area_per_px[ts_name]
        )
        # Area of invalid (NaN) pixels inside the polygon at each time step.
        ts_invalid[poly_name, ts_name] = (
            (ts_dataarray.isnull() & mask).sum(dim=["x", "y"]) * area_per_px[ts_name]
        )


DEA Waterbodies considers an observation valid if at least 90% of its pixels are valid.

[34]:

ts_ok = {}
for (poly_name, ts_name), invalid_px in ts_invalid.items():
    max_extent = polygon_sets[poly_name].area.loc[ids[poly_name]]
    ts_ok[poly_name, ts_name] = invalid_px < 0.1 * max_extent


We can plot all of these time series now. This step may take a while if you didn’t preload the data, as it’s where Dask has to finally load it all.

[35]:

fig, axs = plt.subplots(3, 3, figsize=(15, 15))

# Our datasets have different date ranges, so set these manually for display.
wofs_xlimits = (wofs.time.values.min(), wofs.time.values.max())
xticks = np.arange(np.datetime64("2016", "Y"), np.datetime64("2020", "Y"))

for x, poly_name in enumerate(polygon_sets):
    for y, ts_name in enumerate(ts_dataarrays):
        ax = axs[y, x]

        if x == 0:
            # Label the vertical axis of subplots.
            ax.annotate(
                s=ts_name + " time series",
                xy=(-0.1, 0.5),
                xycoords="axes fraction",
                textcoords="offset points",
                size="large",
                ha="center",
                va="center",
                rotation="vertical",
            )

        if y == 0:
            # Label the horizontal axis of subplots.
            ax.annotate(
                s=poly_name + " polygons",
                xy=(0.5, 1),
                xycoords="axes fraction",
                textcoords="offset points",
                size="large",
                ha="center",
                va="center",
                rotation="horizontal",
            )

        # Plot valid dates of the time series.
        ok = ts_ok[poly_name, ts_name]
        ts = ts_wet[poly_name, ts_name]
        ax.plot(ts.time[ok], ts[ok], marker="x", linestyle="None")

        ax.set_xlim(wofs_xlimits)
        ax.set_xticks(xticks)
        ax.set_xticklabels(xticks)


The MNDWI series are noticeably noisier, with lots of no-water or low-water observations on dates that are wet in WOfS. However, the general pattern matches WOfS, and in particular seems to bound it from below. Perhaps MNDWI could be treated as a lower bound for wetness.

We can also estimate a mean-squared difference between each time series and the WOfS time series with WOfS polygons (= DEA Waterbodies). This assumes Gaussian error, but for sufficiently large numbers of pixels the binomial distribution we actually have should be well-approximated by a Gaussian. By calculating such a difference, we can summarise the discrepancies between the above plots as a single number per plot.
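The Gaussian approximation step is easy to check numerically; the pixel count and wet probability below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A waterbody observed as n pixels, each wet with probability p
# (both values are hypothetical).
n, p = 10_000, 0.3
wet_counts = rng.binomial(n, p, size=100_000)

# The Gaussian approximation predicts mean n*p and std sqrt(n*p*(1-p));
# for thousands of pixels the simulated binomial counts match closely.
print(wet_counts.mean(), n * p)
print(wet_counts.std(), np.sqrt(n * p * (1 - p)))
```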

[36]:

# Make a pandas series of valid data for comparison.
# We resample to daily to avoid time rounding issues.
ok = ts_ok["wofs", "wofs"]
ts = ts_wet["wofs", "wofs"]
comparison_series = (
    pd.Series(ts[ok], index=ts.time[ok].values, name="comparison").resample("D").mean()
)

differences = {}
for poly_name in polygon_sets:
    for ts_name in ts_dataarrays:

        # Make a pandas series of valid data.
        ok = ts_ok[poly_name, ts_name]
        ts = ts_wet[poly_name, ts_name]
        series = (
            pd.Series(ts[ok], index=ts.time[ok].values, name="test")
            .resample("D")
            .mean()
        )

        # Join this with the comparison series to have a unified time index.
        joint_series = pd.merge(
            series, comparison_series, left_index=True, right_index=True, how="outer"
        )

        # Interpolate and compute the difference.
        joint_series.interpolate(inplace=True)
        difference = joint_series.test - joint_series.comparison

        # The mean-squared error is the difference we seek.
        differences[poly_name, ts_name] = (difference ** 2).mean()


Now we can look at these differences as a table.

[37]:

fig = plt.figure(figsize=(8, 8))
grid = np.zeros((3, 3))
for x, poly_name in enumerate(sorted(polygon_sets)):
    for y, ts_name in enumerate(sorted(ts_dataarrays)):
        # pcolor puts the first array axis on the vertical axis, so index
        # as [y, x] to match the tick labels below (polygons horizontal,
        # time series vertical).
        grid[y, x] = differences[poly_name, ts_name]
plt.pcolor(grid, cmap="Reds")
plt.colorbar(label="Difference")
plt.xticks(np.arange(3) + 0.5, [i + " polygons" for i in sorted(polygon_sets)])
plt.yticks(np.arange(3) + 0.5, [i + " time series" for i in sorted(ts_dataarrays)]);


The polygons dominate the differences between the time series: the choice of polygon set matters far more here than the low-water observations do. To see the remaining structure more clearly, we can normalise the differences within each polygon set before plotting:

[38]:

fig = plt.figure(figsize=(8, 8))
plt.pcolor(grid / grid.max(axis=0), cmap="Reds")
plt.colorbar(label="Normalised difference")
plt.xticks(np.arange(3) + 0.5, [i + " polygons" for i in sorted(polygon_sets)])
plt.yticks(np.arange(3) + 0.5, [i + " time series" for i in sorted(ts_dataarrays)]);


S2 and LS8 MNDWI are equally good when applied to non-WOfS polygons. S2 time series deviate appreciably when applied to WOfS polygons, though. This makes sense because S2 will include lots of dry areas considered part of the waterbody by the much larger WOfS pixels.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on GitHub.

Compatible datacube version:

[39]:

print(datacube.__version__)

1.8.3


## Tags

Browse all available tags on the DEA User Guide’s Tags Index

Tags: NCI compatible, sandbox compatible, sentinel 2, landsat 8, water, waterbodies, MNDWI, time series, dea_plotting, dea_spatialtools, dea_dask, dea_datahandling, display_map, xr_rasterize, xr_vectorize, create_local_dask_cluster, wofs_fuser, virtual products