Turn water observations into waterbody polygons

  • Compatibility: This notebook is currently compatible with both the NCI and DEA Sandbox environments, provided you set your filepaths to the required datasets.

  • Products used: wofs_summary

  • Special requirements: This notebook requires the python-geohash library. If you are using the default dea environment, this package may not be available. You can install it locally using pip install --user python-geohash.

  • Prerequisites:

    • NetCDF files with WOfS outputs that will be used to define the persistent water body polygons

      • Variable name: TileFolder

      • This folder can contain either a custom datacube-stats extraction, or the WOfS summary tiles for all of Australia (see here for further information).

    • A coastline polygon to filter out polygons generated from ocean pixels.

  • Optional prerequisites:

    • River line dataset for filtering out polygons comprised of river segments.

      • Variable name: MajorRiversDataset

      • The option to filter out major rivers is provided, so this dataset is only required if FilterOutRivers = True.

      • Here we use the Bureau of Meteorology’s Geofabric v 3.0.5 Beta (Surface Hydrology Network), filtered to only keep features tagged as major rivers.

      • There are some identified issues with this data layer that make filtering based on it inconsistent (see the discussion here).

      • We therefore turn this off during the production of the water bodies shapefile.

    • Urban high rise polygon dataset

      • Variable name: UrbanMaskFile, but this is optional and can be skipped by setting UrbanMask = False.

      • WOfS has a known limitation, where deep shadows thrown by tall CBD buildings are misclassified as water. This results in ‘waterbodies’ around these misclassified shadows in capital cities. If you are not using WOfS for your analysis, you may choose to set UrbanMask = False.

Background

On average, the Australian Government invests around half a billion dollars a year in monitoring, protecting and enhancing Australia’s land, coasts and oceans. DEA provides near real-time satellite information which can be used by government to better target these investments.

Water is among the most precious natural resources and is essential for the survival of life on Earth. Within Australia, the scarcity of water is both an economic and a social issue. Water is required not only for consumption, but also for industries and environmental ecosystems to function and flourish.

With the demand for water increasing, there is a need to better understand our water availability to ensure we are managing our water resources effectively and efficiently.

Digital Earth Australia (DEA)’s Water Observations from Space (WOfS) dataset provides a water-classified image of Australia approximately every 16 days. These individual water observations have been combined into a WOfS summary product, which calculates the frequency of wet observations (compared against all clear observations of that pixel) over the full 30 year satellite archive.

The WOfS summary product provides valuable insights into the persistence of water across the Australian landscape on a pixel by pixel basis. While knowing the wet history of a single pixel within a waterbody is useful, it is more useful to be able to map the whole waterbody as a single object.

This notebook demonstrates a workflow for mapping waterbodies across Australia as polygon objects. This workflow has been used to produce DEA Waterbodies.

Description

The code follows this workflow:

  • Load the required python packages

  • Load the required functions

  • Set your chosen analysis parameters

    • minimum number of valid observations

    • wetness threshold/s

    • min/max waterbody size

    • optional flag to filter out waterbodies that intersect with major rivers

      • if you set this flag you will need to provide a dataset to do the filtering

    • set the analysis region

    • set the input files for the analysis

    • read in a land/sea mask

    • read in an urban mask

  • Generate a list of netCDF files within a specified folder location

  • Open each netCDF file and:

    • Keep only pixels observed at least x times

    • Keep only pixels identified as wet at least x% of the time

      • Here the code can take in two wetness thresholds, to produce two initial temporary polygon files.

    • Convert the raster data into polygons

  • Append the polygon set to a temporary shapefile

  • Remove artificial polygon borders created at tile boundaries by merging polygons that intersect across Albers Tile boundaries

  • Filter the combined polygon dataset (note that this step happens after the merging of Albers tile boundary polygons to ensure that artifacts are not created by part of a polygon being filtered out, while the remainder of the polygon that sits on a separate tile is treated differently).

    • Filter the polygons based on size

    • Remove polygons that intersect with Australia’s coastline

    • Remove erroneous ‘water’ polygons within high-rise CBD areas

    • Combine the two generated wetness thresholds (optional)

    • Optional filtering for proximity to major rivers (as identified by the Geofabric dataset)

  • Save out the final polygon set to a shapefile

Getting started

To run this analysis, work through this notebook starting with the “Load packages” cell.

Load packages

Import Python packages that are used for the analysis.

Set up the functions for this script

Analysis parameters

The following section walks you through the analysis parameters you will need to set for this workflow. Each section describes the parameter, how it is used, and what value was used for the DEA Waterbodies product.

How frequently wet does a pixel need to be to be included?

The value/s set here will be the minimum frequency (as a decimal between 0 and 1) that water must be detected across all analysis years before a pixel is included.

E.g. If this was set to 0.10, any pixels that are wet at least 10% of the time across all valid observations will be included. If you don’t want to use this filter, set this value to 0.

Following the exploration of an appropriate wetness threshold for DEA Waterbodies (see here), we choose to set two thresholds here. The code is set up to loop through both wetness thresholds, and to write out two temporary shapefiles. These two shapefiles with two separate thresholds are then used together to combine polygons from both thresholds later on in the workflow.

Polygons identified by the secondary threshold that intersect with the polygons generated by the primary threshold will be extracted, and included in the final polygon dataset. This means that the location of polygons is set by the primary threshold, but the shape of these polygons is set by the secondary threshold.

Threshold values need to be provided as a list of either one or two floating point numbers. If one number is provided, then this will be used to generate the initial polygon dataset. If two thresholds are entered, the first number becomes the secondary threshold, and the second number becomes the primary threshold. If more than two numbers are entered, the code will generate an error below.
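For example, a hybrid set-up with a supplementary threshold of 0.05 and a primary threshold of 0.1 (the combination referenced by the output folder names used later in this notebook) would be entered as shown below; note that the example run shown later in this notebook used a single threshold instead.

# Wetness thresholds as a list: [supplementary, primary]
# A single-element list (e.g. [0.95]) skips the hybrid option
AtLeastThisWet = [0.05, 0.1]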

MinSize

The minimum area (in m²) a polygon must cover to be included. E.g. a minimum size of 6250 m² means that polygons need to cover at least 10 pixels to be included. If you don’t want to use this filter, set this value to 0.

MaxSize

The maximum area (in m²) a polygon can cover and still be included. E.g. a maximum size of 1 000 000 m² means that you only want to consider polygons smaller than 1 km². If you don’t want to use this filter, set this value to math.inf.

NOTE: if you are doing this analysis for all of Australia, very large polygons will be generated offshore, in the steps prior to filtering by the Australian coastline. For this reason, we have used a ``MaxSize`` = Area of Kati Thanda-Lake Eyre. This will remove the huge ocean polygons, but keep large inland waterbodies that we want to map.
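As a sketch of how these two size parameters might be set (the MaxSize figure below is only an illustrative stand-in for the area of Kati Thanda-Lake Eyre; substitute whatever limit suits your analysis):

# Minimum polygon area in m² (6250 m² = 10 Landsat pixels); 0 disables the filter
MinSize = 6250

# Maximum polygon area in m²; math.inf disables the filter.
# Illustrative value of roughly 9,500 km² (approximately Kati Thanda-Lake Eyre) in m²
MaxSize = 9500000000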

The total number of valid WOfS observations for each pixel varies depending on the frequency of clouds and cloud shadow, the proximity to high slope and terrain shadow, and the seasonal change in solar angle.

The count_clear parameter within the `wofs_summary <https://explorer.sandbox.dea.ga.gov.au/wofs_summary>`__ data provides a count of the number of valid observations each pixel recorded over the analysis period. We can use this parameter to mask out pixels that were infrequently observed. If this mask is not applied, pixels that were observed only once could be included if that observation was wet (i.e. a single wet observation means the calculation of the frequency statistic would be (1 wet observation) / (1 total observation) = 100% frequency of wet observations).

Here we set the minimum number of observations to be 128 (roughly 4 per year over our 32 year analysis). Note that this parameter does not specify the timing of these observations, but rather just the total number of valid observations (observed at any time of the year, in any year).
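Based on the reasoning above, this parameter is simply set as:

# Minimum number of clear observations a pixel needs before it is considered
# (roughly four observations per year over the 32 year archive)
MinimumValidObs = 128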

The Bureau of Meteorology’s Geofabric v 3.0.5 Beta (Surface Hydrology Network) can be used to filter out polygons that intersect with major rivers. This is done to remove river segments from the polygon dataset. The SH_Network AHGFNetworkStream layer within the SH_Network_GDB_V2_1_1.zip geodatabase can be used. You may choose to filter the rivers dataset to only keep rivers tagged as major, as the full rivers dataset contains a lot of higher-order streams and can remove smaller waterbodies situated on upland streams.

If you have an alternative dataset you wish to use in place of the Bureau of Meteorology Geofabric, you can set the filepath to this file in the MajorRiversDataset variable. Any alternative dataset needs to be a vector dataset that can be read by the fiona python library.

Note that we reproject this dataset to EPSG:3577 (Australian Albers) to match the coordinate reference system of the WOfS data we use. If this is not correct for your analysis, you can change this in the cell below. A list of EPSG codes can be found here.

If you don’t want to filter out polygons that intersect with rivers, set this parameter to False.

Note that for the Water Body Polygon dataset, we set this filter to False (``FilterOutRivers = False``)

The option to filter out rivers was switched off for the production of DEA Waterbodies.

Note that the Geofabric continues the streamline through on-river dams, which means these polygons are filtered out. This may not be the desired result.

Stream and Dam intersection

[6]:
FilterOutRivers = False
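
Note that the MajorRivers GeoDataFrame referenced in the filtering step later in this notebook is only needed when FilterOutRivers = True. A minimal sketch of how it could be loaded is below; the file path is hypothetical and should point to your copy of the Geofabric (or alternative) rivers layer.

if FilterOutRivers:
    # Hypothetical path to a vector rivers dataset readable by geopandas/fiona
    MajorRiversDataset = '/g/data/r78/cek156/ShapeFiles/GeofabricMajorRivers.shp'

    MajorRivers = gp.read_file(MajorRiversDataset)
    # Reproject to Australian Albers (EPSG:3577) to match the WOfS data
    MajorRivers = MajorRivers.to_crs({'init': 'epsg:3577'})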

This section of code allows you to choose whether to use the WOfS summary data, or your own custom analysis. There are a number of options available to you here:

  • All of Australia WOfS analysis

    • Set AllOfAustraliaAllTime = True

    • Set CustomData = False

    • Set AutoGenerateTileList = False

  • Some of Australia WOfS analysis

    • Set AllOfAustraliaAllTime = False

    • Set CustomData = False

    • Set AutoGenerateTileList = False

    • You will then need to input a list of Albers Equal Area tiles over which you would like to perform your analysis in the ListofAlbersTiles variable.

  • Custom analysis for any spatial extent

    • Set AllOfAustraliaAllTime = False

    • Set CustomData = True

    • Provide a path to where the files are sitting. Note that this code assumes the files are netCDF.

    • Set AutoGenerateTileList = True/False

      • If you want to analyse all of the tiles in the custom folder, set this to True.

      • If you want to analyse a subset of the tiles in the custom folder, set this to False, and provide a list of tiles to the ListofAlbersTiles variable.

All of Australia analysis

If you would like to perform the analysis for all of Australia, using the published WOfS all time summaries, set AllOfAustraliaAllTime = True.

The WOfS all time summaries NetCDF files used are located in /g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/. These files contain the following three variables: count_wet, count_clear and frequency.

Custom Data option

This code is set up to allow you to input your own set of custom data statistics. You can generate your own custom statistics using the datacube-stats code repository. For example, you may wish to calculate WOfS summary statistics over a specified period, rather than over the full 1987 to 2018 period provided in the WOfS summary product. Datacube-stats allows you to specify the parameters for generating statistical summaries. You will need to ensure the output format is set to netCDF to make it compatible with the code here.

If CustomData = True, you will need to specify the location of the data you would like to use for this analysis by setting TileFolder under the if CustomData code section below.

If CustomData = False, the code will automatically look at the published WOfS all time summaries.

Autogeneration of tile list

AutoGenerateTileList will only be used if AllOfAustraliaAllTime = False. We only want to generate a list of tiles to iterate through if it will be a subset of all of the available data.

If you would like to automatically generate a list of tiles using the outputs of an analysis (e.g. if you have run a custom datacube-stats analysis over just NSW), set AutoGenerateTileList = True and update the location of the output file directory. This will generate a tile list consisting of every available tile within that folder. Note that this option currently assumes a particular filename format. If you experience an error when running this step, you may need to modify the Generate_list_of_albers_tiles function loaded above.

If you would like to manually feed in a list of albers tiles (i.e. run a subset of the tiles available within a chosen folder), set AutoGenerateTileList = False, and feed in a list of tiles in the format:

ListofAlbersTiles = ['7_-34', '10_-40', '16_-34']

For testing and debugging, set CustomData = True and AutoGenerateTileList = False, then enter a list of tiles to run using the ListofAlbersTiles described above.
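As a rough illustration of what the autogeneration step does (the notebook itself uses the Generate_list_of_albers_tiles function loaded above), a tile list could be built from filenames following the 'WOFS_3577_<x>_<y>_summary.nc' pattern seen in the WOfS summary folder:

# Sketch only: assumes filenames like 'WOFS_3577_8_-32_summary.nc'
def sketch_generate_tile_list(folder):
    tiles = set()
    for path in glob.glob(f'{folder}*.nc'):
        parts = os.path.basename(path).split('_')
        # e.g. ['WOFS', '3577', '8', '-32', 'summary.nc'] -> '8_-32'
        tiles.add(f'{parts[2]}_{parts[3]}')
    return sorted(tiles)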

[8]:
AllOfAustraliaAllTime = False

CustomData = False
AutoGenerateTileList = False
[9]:
if CustomData:
    # Path to the files you would like to use for the analysis
    TileFolder = '/g/data/r78/cek156/datacube_stats/WOFSDamsAllTimeNSWMDB/'
else:
    # Default path to the WOfS summary product
    TileFolder = '/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/'

A high tide coastline for Australia is used to mask out ocean polygons. You can choose which land/sea mask you would like to use, depending on how much coastal water you would like in the final product.

We use a coastline generated using the Intertidal Extents Model (ITEM) v2 dataset. This particular coastline creates a mask by identifying any water pixels that are continuously connected to the ocean, or an estuary. Any polygons that intersect with this mask are filtered out. I.e. if a polygon identified within our workflow overlaps with this coastal mask by even a single pixel, it will be discarded. We chose this very severe ocean mask as the aim of DEA Waterbodies is not to map coastal waterbodies, but just inland ones. For a detailed description of how this coastline was created, see this notebook.

Note that the mask we use here sets ocean = 1, land = NaN. If your mask has land = 1 instead, you can either invert it, or change the code in the `Filter merged polygons by: <#Filtering>`__ code section below.

[11]:
# Path to the coastal mask you would like to use.
LandSeaMaskFile = '/g/data/r78/cek156/ShapeFiles/ITEMv2Coastline/ITEM_Ocean_Polygon.shp'

Coastline = gp.read_file(LandSeaMaskFile)
Coastline = Coastline.to_crs({'init': 'epsg:3577'})

Read in a mask for high-rise CBDs

WOfS has a known limitation, where deep shadows thrown by tall CBD buildings are misclassified as water. This results in ‘waterbodies’ around these misclassified shadows in capital cities.

To address this problem, we use the Australian Bureau of Statistics Statistical Area 3 shapefile (2016) to define a spatial footprint for Australia’s CBD areas.

We use the following polygons as our CBD filter:

SA3_CODE16   SA3_NAME16
11703        Sydney Inner City
20604        Melbourne City
30501        Brisbane Inner
30901        Broadbeach - Burleigh
30910        Surfers Paradise
40101        Adelaide City
50302        Perth City

If you are not using WOfS for your analysis, you may choose to set UrbanMask = False.

[ ]:
UrbanMask = True
[12]:
if UrbanMask:
    UrbanMaskFile = '/g/data/r78/cek156/ShapeFiles/ABS_1270055001_sa3_2016_aust_shape/HighRiseCBD_ABS_sa3.shp'

    CBDs = gp.read_file(UrbanMaskFile)
    CBDs = CBDs.to_crs({'init': 'epsg:3577'})

Generate the first temporary polygon dataset

This code section:

  1. Checks that the AtLeastThisWet threshold has been correctly entered above

  2. Sets up a for loop that allows the user to input multiple temporal datasets (see below)

  3. Generates a list of netCDF files to loop through

  4. Sets up a for loop over that list of files. Here we have separate data for each Landsat tile, so this loops through the list of tile files

  5. Opens the netCDF frequency data and removes the time dimension (which in this case is only of size 1)

  6. Opens the netCDF count_clear data and removes the time dimension (which in this case is only of size 1)

  7. Removes any pixels not observed at least `MinimumValidObs times <#valid>`__

  8. Sets up a for loop for the entered `AtLeastThisWet thresholds <#wetnessThreshold>`__

  9. Masks out any data that does not meet the wetness threshold

  10. Converts the data to a Boolean array, with included pixels == 1

  11. Converts the raster array to a polygon dataset

  12. Cleans up the polygon dataset

  13. Resets the geometry to a shapely geometry

  14. Merges any overlapping polygons

  15. Converts the output of the merging back into a geopandas dataframe

  16. Calculates the area of each polygon

  17. Saves the results to a shapefile

Within this section you need to set up:

  • WaterBodiesShp: The name and filepath of the intermediate output polygon set

  • WOFSshpMerged: The filepath for the location of temp files during the code run

  • WOFSshpFiltered: The name and filepath of the outputs following the filtering steps

  • FinalName: The name and file path of the final, completed waterbodies shapefile

  • years to analyse: for year in range(x, y) - note that the last year is NOT included in the analysis. This for loop is set up to allow you to loop through multiple datasets to create multiple polygon outputs. If you only have one input dataset, set this to range(<year of the analysis>, <year of the analysis + 1>)

[13]:
## Set up some file names for the inputs and outputs
# The name and filepath of the intermediate output polygon set
WaterBodiesShp = f'/g/data/r78/cek156/dea-notebooks/DEAWaterbodies/AusAllTime01-005HybridWaterbodies/Temp'

# The filepath for the location of temp files during the code run
WOFSshpMerged = f'/g/data/r78/cek156/dea-notebooks/DEAWaterbodies/AusAllTime01-005HybridWaterbodies/'
# The name and filepath of the outputs following the filtering steps
WOFSshpFiltered = '/g/data/r78/cek156/dea-notebooks/DEAWaterbodies/AusAllTime01-005HybridWaterbodies/AusWaterBodiesFiltered.shp'

# Final shapefile name
FinalName = '/g/data/r78/cek156/dea-notebooks/DEAWaterbodies/AusAllTime01-005HybridWaterbodies/AusWaterBodies.shp'
[ ]:
# First, test whether the wetness threshold has been correctly set
if len(AtLeastThisWet) == 2:
    print(
        f'We will be running a hybrid wetness threshold. Please ensure that the major threshold is \n'
        f'listed second, with the supplementary threshold entered first.'
        f'**You have set {AtLeastThisWet[-1]} as the primary threshold,** \n'
        f'**with {AtLeastThisWet[0]} set as the supplementary threshold.**')
elif len(AtLeastThisWet) == 1:
    print(
        f'You have not set up the hybrid threshold option. If you meant to use this option, please \n'
        f'set this option by including two wetness thresholds in the `AtLeastThisWet` variable above. \n'
        f'The wetness threshold we will use is {AtLeastThisWet}.')
else:
    raise ValueError(
        f'There is something wrong with your entered wetness threshold. Please enter a list \n'
        f'of either one or two numbers. You have entered {AtLeastThisWet}. \n'
        f'See above for more information')

# Now perform the analysis to generate the first iteration of polygons
for year in range(1980, 1981):

    ### Get the list of netcdf file names to loop through
    if AllOfAustraliaAllTime:
        # Grab everything from the published WOfS all time summaries
        Alltiles = glob.glob(f'{TileFolder}*.nc')
    else:
        Alltiles = Generate_list_of_tile_datasets(ListofAlbersTiles, year,
                                                  TileFolder, CustomData)

    for WOFSfile in Alltiles:
        try:
            # Read in the data
            # Note that the netCDF files we are using here contain a variable called 'frequency',
            # which is what we are using to define our water polygons.
            # If you use a different netCDF input source, you may need to change this variable name here
            WOFSnetCDFData = xr.open_rasterio(f'NETCDF:{WOFSfile}:frequency')
            # Remove the superfluous time dimension
            WOFSnetCDFData = WOFSnetCDFData.squeeze()

            # Open the clear count variable to generate the minimum observation mask
            # If you use a different netCDF input source, you may need to change this variable name here
            WOFSvalidcount = xr.open_rasterio(f'NETCDF:{WOFSfile}:count_clear')
            WOFSvalidcount = WOFSvalidcount.squeeze()

            # Filter our WOfS classified data layer to remove noise
            # Remove any pixels not observed at least MinimumValidObs times
            WOFSValidFiltered = WOFSvalidcount >= MinimumValidObs

            for Thresholds in AtLeastThisWet:
                # Remove any pixels that are wet < AtLeastThisWet% of the time
                WOFSfiltered = WOFSnetCDFData > Thresholds

                # Now find pixels that meet both the MinimumValidObs and AtLeastThisWet criteria
                # Change all zeros to NaN to create a nan/1 mask layer
                # Pixels == 1 now represent our water bodies
                WOFSfiltered = WOFSfiltered.where((WOFSfiltered != 0) &
                                                  (WOFSValidFiltered != 0))

                # Convert the raster to polygons
                # We use a mask of '1' to only generate polygons around values of '1' (not NaNs)
                WOFSpolygons = rasterio.features.shapes(
                    WOFSfiltered.data.astype('float32'),
                    mask=WOFSfiltered.data.astype('float32') == 1,
                    transform=WOFSnetCDFData.transform)
                # The rasterio.features.shapes returns a tuple. We only want to keep the geometry portion,
                # not the value of each polygon (which here is just 1 for everything)
                WOFSbreaktuple = (a for a, b in WOFSpolygons)

                # Put our polygons into a geopandas geodataframe
                PolygonGP = gp.GeoDataFrame(list(WOFSbreaktuple))

                # Convert each GeoJSON-like mapping into a shapely geometry
                # so we can quickly calculate the area of each polygon
                PolygonGP['geometry'] = PolygonGP.apply(
                    lambda row: shape({'type': row['type'],
                                       'coordinates': row['coordinates']}),
                    axis=1)

                # Set the geometry of the dataframe to be the shapely geometry we just created
                PolygonGP = PolygonGP.set_geometry('geometry')
                # We need to add the crs back onto the dataframe
                PolygonGP.crs = {'init': 'epsg:3577'}

                # Combine any overlapping polygons
                MergedPolygonsGeoms = unary_union(PolygonGP['geometry'])

                # Turn the combined multipolygon back into a geodataframe
                MergedPolygonsGPD = gp.GeoDataFrame(
                    [poly for poly in MergedPolygonsGeoms])
                # Rename the geometry column
                MergedPolygonsGPD.columns = ['geometry']
                # We need to add the crs back onto the dataframe
                MergedPolygonsGPD.crs = {'init': 'epsg:3577'}

                # Calculate the area of each polygon again now that overlapping polygons
                # have been merged
                MergedPolygonsGPD['area'] = MergedPolygonsGPD['geometry'].area

                # Save the polygons to a shapefile
                schema = {
                    'geometry': 'Polygon',
                    'properties': {
                        'area': 'float'
                    }
                }

                # Generate our dynamic filename
                FileName = f'{WaterBodiesShp}_{Thresholds}.shp'
                # Append the file name to the list so we can call it later on

                if os.path.isfile(FileName):
                    with fiona.open(FileName,
                                    "a",
                                    crs=from_epsg(3577),
                                    driver='ESRI Shapefile',
                                    schema=schema) as output:
                        for ix, poly in MergedPolygonsGPD.iterrows():
                            output.write(({
                                'properties': {
                                    'area': poly['area']
                                },
                                'geometry': mapping(shape(poly['geometry']))
                            }))
                else:
                    with fiona.open(FileName,
                                    "w",
                                    crs=from_epsg(3577),
                                    driver='ESRI Shapefile',
                                    schema=schema) as output:
                        for ix, poly in MergedPolygonsGPD.iterrows():
                            output.write(({
                                'properties': {
                                    'area': poly['area']
                                },
                                'geometry': mapping(shape(poly['geometry']))
                            }))

        except:
            print(
                f'{WOFSfile} did not run. \n'
                f'This is probably because there are no waterbodies present in this tile.'
            )
You have not set up the hybrid threshold option. If you meant to use this option, please
set this option by including two wetness thresholds in the `AtLeastThisWet` variable above.
The wetness threshold we will use is [0.95].
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_8_-32_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_19_-40_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_14_-31_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_10_-37_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_9_-34_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_20_-39_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_6_-35_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_9_-32_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_7_-36_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_18_-30_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_8_-41_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_14_-28_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_8_-34_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_6_-37_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_14_-33_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.
/g/data/fk4/datacube/002/WOfS/WOfS_Stats_25_2_1/netcdf/WOFS_3577_6_-36_summary.nc did not run.
This is probably because there are no waterbodies present in this tile.

Merge polygons that have an edge at a tile boundary

Now that we have all of the polygons across our whole region of interest, we need to check for artifacts in the data caused by tile boundaries.

We have created a shapefile that consists of the Albers tile boundaries, plus a 1 pixel (25 m) buffer. This shapefile will help us to find any polygons that have a boundary at the edge of an Albers tile. We can then find where polygons touch across this boundary, and join them up.

Within this section you need to set up:

  • AlbersBuffer: The file location of a shapefile that is a 1 pixel buffer around the Albers tile boundaries

NOTE: for the Australia-wide analysis, the number and size of polygons means that this cell cannot be run in this notebook. Instead, we ran this cell on raijin using the following PBS job script:

#!/bin/bash
#PBS -P r78
#PBS -q hugemem
#PBS -l walltime=96:00:00
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l ncpus=7
#PBS -l wd
#PBS -lother=gdata1a

module use /g/data/v10/public/modules/modulefiles/
module load dea

PYTHONPATH=$PYTHONPATH:/g/data/r78/cek156/dea-notebooks
[ ]:
AlbersBuffer = gp.read_file('/g/data/r78/cek156/ShapeFiles/AlbersBuffer25m.shp')

for Threshold in AtLeastThisWet:
    print(f'Working on {Threshold} shapefile')
    # We are using the more severe wetness threshold as the main polygon dataset.
    # Note that this assumes that the thresholds have been correctly entered into the 'AtLeastThisWet'
    # variable, with the higher threshold listed second.
    WaterPolygons = gp.read_file(f'{WaterBodiesShp}_{Threshold}.shp')

    # Find where the albers polygon overlaps with our dam polygons
    BoundaryMergedDams, IntersectIndexes, NotBoundaryDams = Filter_shapefile_by_intersection(
        WaterPolygons, AlbersBuffer, invertMask=False, returnInverse=True)

    # Now combine overlapping polygons in `BoundaryMergedDams`
    UnionBoundaryDams = BoundaryMergedDams.unary_union

    # `Explode` the multipolygon back out into individual polygons
    UnionGDF = gp.GeoDataFrame(crs=WaterPolygons.crs,
                               geometry=[UnionBoundaryDams])
    MergedDams = UnionGDF.explode()

    # Then combine our new merged polygons with the `NotBoundaryDams`
    # Combine New merged polygons with the remaining polygons that are not near the tile boundary
    AllTogether = gp.GeoDataFrame(
        pd.concat([NotBoundaryDams, MergedDams], ignore_index=True,
                  sort=True)).set_geometry('geometry')

    # Calculate the area of each polygon
    AllTogether['area'] = AllTogether.area

    # Check for nans
    AllTogether.dropna(inplace=True)

    schema = {'geometry': 'Polygon', 'properties': {'area': 'float'}}

    print(f'Writing out {Threshold} shapefile')

    with fiona.open(f'{WOFSshpMerged}Union_{Threshold}.shp',
                    "w",
                    crs=from_epsg(3577),
                    driver='ESRI Shapefile',
                    schema=schema) as output:
        for ix, poly in AllTogether.iterrows():
            output.write(({
                'properties': {
                    'area': poly['area']
                },
                'geometry': mapping(shape(poly['geometry']))
            }))

Filter the merged polygons by:

  • Area: Based on the MinSize and MaxSize parameters set here.

  • Coastline: Using the Coastline dataset loaded here.

  • CBD location (optional): Using the CBDs dataset loaded here.

  • Wetness thresholds: Here we apply the hybrid threshold described here

  • Intersection with rivers (optional): Using the MajorRivers dataset loaded here

NOTE: for the Australia-wide analysis, the number and size of polygons means that this cell cannot be run in this notebook. Instead, we ran this cell on raijin using the following PBS job script:

#!/bin/bash
#PBS -P r78
#PBS -q hugemem
#PBS -l walltime=96:00:00
#PBS -l mem=500GB
#PBS -l jobfs=200GB
#PBS -l ncpus=7
#PBS -l wd
#PBS -lother=gdata1a

module use /g/data/v10/public/modules/modulefiles/
module load dea

PYTHONPATH=$PYTHONPATH:/g/data/r78/cek156/dea-notebooks
[19]:
try:
    AllTogether = gp.read_file(f'{WOFSshpMerged}Temp_{AtLeastThisWet[1]}.shp')
except IndexError:
    AllTogether = gp.read_file(f'{WOFSshpMerged}Temp_{AtLeastThisWet[0]}.shp')
AllTogether['area'] = pd.to_numeric(AllTogether.area)

# Filter out any polygons smaller than MinSize, and greater than MaxSize
WaterBodiesBig = AllTogether.loc[((AllTogether['area'] > MinSize) &
                                  (AllTogether['area'] <= MaxSize))]

# Filter out any ocean in the pixel
WaterBodiesLand, IntersectIndexes = Filter_shapefile_by_intersection(
    WaterBodiesBig, Coastline, invertMask=True)

# WOfS has a known bug where deep shadows from high-rise CBD buildings are misclassified
# as water. We will use the ABS sa3 dataset to filter out Brisbane, Gold Coast, Sydney,
# Melbourne, Adelaide and Perth CBDs.
# If you have chosen to set UrbanMask = False, this step will be skipped.
if UrbanMask:
    NotCities, IntersectIndexes = Filter_shapefile_by_intersection(
        WaterBodiesLand, CBDs)
else:
    print(
        'You have chosen not to filter out waterbodies within CBDs. If you meant to use this option, please \n'
        'set `UrbanMask = True` variable above, and set the path to your urban filter shapefile'
    )
    NotCities = WaterBodiesLand

# Check for hybrid wetness thresholds
if len(AtLeastThisWet) == 2:
    # Note that this assumes that the thresholds have been correctly entered into the 'AtLeastThisWet'
    # variable, with the supplementary threshold listed first.
    LowerThreshold = gp.read_file(
        f'{WOFSshpMerged}Union_{AtLeastThisWet[0]}.shp')
    LowerThreshold['area'] = pd.to_numeric(LowerThreshold.area)
    # Filter out those pesky huge polygons
    LowerThreshold = LowerThreshold.loc[(LowerThreshold['area'] <= MaxSize)]
    # Find where the albers polygon overlaps with our dam polygons
    BoundaryMergedDams, IntersectIndexes = Filter_shapefile_by_intersection(
        LowerThreshold, NotCities)
    # Pull out the polygons from the supplementary shapefile that intersect with the primary shapefile
    LowerThresholdToUse = LowerThreshold.loc[LowerThreshold.index.isin(
        IntersectIndexes)]
    # Concat the two polygon sets together
    CombinedPolygons = gp.GeoDataFrame(
        pd.concat([LowerThresholdToUse, NotCities], ignore_index=True))
    # Merge overlapping polygons
    CombinedPolygonsUnion = CombinedPolygons.unary_union
    # `Explode` the multipolygon back out into individual polygons
    UnionGDF = gp.GeoDataFrame(crs=LowerThreshold.crs,
                               geometry=[CombinedPolygonsUnion])
    HybridDams = UnionGDF.explode()
else:
    print(
        'You have not set up the hybrid threshold option. If you meant to use this option, please \n'
        'set this option by including two wetness thresholds in the `AtLeastThisWet` variable above'
    )
    HybridDams = NotCities

# Here is where we do the river filtering (if FilterOutRivers == True)
if FilterOutRivers:
    WaterBodiesBigRiverFiltered, IntersectIndexes = Filter_shapefile_by_intersection(
        HybridDams, MajorRivers)
else:
    # If river filtering is turned off, then we just keep all the same polygons
    WaterBodiesBigRiverFiltered = HybridDams

# We need to add the crs back onto the dataframe
WaterBodiesBigRiverFiltered.crs = {'init': 'epsg:3577'}

# Calculate the area and perimeter of each polygon again now that overlapping polygons
# have been merged
WaterBodiesBigRiverFiltered['area'] = WaterBodiesBigRiverFiltered[
    'geometry'].area
WaterBodiesBigRiverFiltered['perimeter'] = WaterBodiesBigRiverFiltered[
    'geometry'].length

# Calculate the Polsby-Popper value (see below), and write out too
WaterBodiesBigRiverFiltered['PPtest'] = (
    (WaterBodiesBigRiverFiltered['area'] * 4 * math.pi) /
    (WaterBodiesBigRiverFiltered['perimeter']**2))

# Save the polygons to a shapefile
schema = {
    'geometry': 'Polygon',
    'properties': {
        'area': 'float',
        'perimeter': 'float',
        'PPtest': 'float'
    }
}

with fiona.open(WOFSshpFiltered,
                "w",
                crs=from_epsg(3577),
                driver='ESRI Shapefile',
                schema=schema) as output:
    for ix, poly in WaterBodiesBigRiverFiltered.iterrows():
        output.write(({
            'properties': {
                'area': poly['area'],
                'perimeter': poly['perimeter'],
                'PPtest': poly['PPtest']
            },
            'geometry': mapping(shape(poly['geometry']))
        }))
You have not set up the hybrid threshold option. If you meant to use this option, please
set this option by including two wetness thresholds in the `AtLeastThisWet` variable above

Dividing up very large polygons

The size of polygons is determined by the contiguity of waterbody pixels through the landscape. This can result in very large polygons, e.g. where rivers are wide and unobscured by trees, or where waterbodies are connected to rivers or neighbouring waterbodies. The image below shows this for the Menindee Lakes, NSW. The relatively flat terrain in this part of Australia means that the 0.05 wetness threshold results in the connection of a large stretch of river and the individual lakes into a single large polygon that spans 154 km. This polygon is too large to provide useful insights into the changing water surface area of the Menindee Lakes, and needs to be broken into smaller, more useful polygons.

Menindee Lakes original polygon

We do this by applying the Polsby-Popper test (1991). The Polsby-Popper test is an assessment of the ‘compactness’ of a polygon. This method was originally developed to test the shape of congressional and state legislative districts, to prevent gerrymandering.

The Polsby-Popper test examines the ratio between the area of a polygon and the area of a circle whose circumference is equal to the perimeter of that polygon. The result falls between 0 and 1, with values closer to 1 being assessed as more compact.

\begin{align*} PPtest = \frac{polygon\ area * 4\pi}{polygon\ perimeter^2} \end{align*}

The Menindee Lakes polygon above has a PPtest value of approximately 0.00.

We selected all polygons with a PPtest value <=0.005. This resulted in a subset of 186 polygons.

Polygons with a Polsby-Popper test score of less than 0.005
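To illustrate the behaviour of the test, the sketch below (using shapely directly, with made-up geometries) shows that a near-circular polygon scores close to 1, while a long, thin, river-like strip falls well under the 0.005 cut-off used here:

import math
from shapely.geometry import Point, box

def polsby_popper(poly):
    # Polsby-Popper compactness: (4 * pi * area) / perimeter**2
    return (4 * math.pi * poly.area) / (poly.length ** 2)

circle_like = Point(0, 0).buffer(1000)   # a polygonal approximation of a circle
river_like = box(0, 0, 50000, 25)        # a 50 km x 25 m strip

print(polsby_popper(circle_like))   # ~1.0: compact
print(polsby_popper(river_like))    # ~0.0016: well below the 0.005 cut-off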

The 186 polygons were buffered with a -50 metre (2 pixel) buffer to separate the polygons where they are connected by two pixels or less. This allows us to split up these very large polygons at natural thinning points. The resulting negatively buffered polygons were run through the multipart to singlepart tool in QGIS, to give the now-separated polygons unique IDs.

These polygons were then buffered with a +50 meter buffer to return the polygons to approximately their original size. These final polygons were used to separate the 186 original polygons identified above.
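The splitting itself was done manually in QGIS, but the negative/positive buffering sequence could be sketched in geopandas as follows (assuming large_polygons is a GeoDataFrame holding the 186 selected polygons in EPSG:3577, so distances are in metres):

# Sketch only, not the exact workflow used
shrunk = large_polygons.buffer(-50)                  # -50 m (2 pixel) negative buffer
shrunk = shrunk[~shrunk.is_empty]                    # drop pieces that vanish entirely
parts = gp.GeoDataFrame(geometry=shrunk).explode()   # multipart to singlepart
parts = parts.reset_index(drop=True)
parts['SplitID'] = parts.index                       # unique ID for each separated piece
parts['geometry'] = parts.buffer(50)                 # +50 m buffer back to ~original size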

The process for dividing up the identified very large polygons varied depending on the polygon in question. Where large waterbodies (like the Menindee Lakes) were connected, the buffered polygons were used to determine the cut points in the original polygons. Where additional breaks were required, the Bureau of Meteorology’s Geofabric v 3.0.5 Beta (Surface Hydrology Network) waterbodies dataset was used as an additional source of information for breaking up connected segments.

The buffering method didn’t work on large segments of river, which became a series of disconnected pieces when negatively and positively buffered. Instead, we used a combination of tributaries and man-made features such as bridges and weirs to segment these river sections.

Final checks and recalculation of attributes

[11]:
WaterBodiesBigRiverFiltered = gp.read_file(WOFSshpFiltered)
[12]:
# Recalculate the area and perimeter of each polygon again following the manual checking
# step performed above
WaterBodiesBigRiverFiltered['area'] = WaterBodiesBigRiverFiltered[
    'geometry'].area
WaterBodiesBigRiverFiltered['perimeter'] = WaterBodiesBigRiverFiltered[
    'geometry'].length
[13]:
# Remove the PPtest column, since we don't really want this as an attribute of the final shapefile
WaterBodiesBigRiverFiltered.drop(labels='PPtest', axis=1, inplace=True)
[14]:
# Reapply the size filtering, just to check that all of the split and filtered waterbodies are
# still in the size range we want
DoubleCheckArea = WaterBodiesBigRiverFiltered.loc[(
    (WaterBodiesBigRiverFiltered['area'] > MinSize) &
    (WaterBodiesBigRiverFiltered['area'] <= MaxSize))]

Generate a unique ID for each polygon

A unique identifier is required for every polygon to allow it to be referenced. The naming convention for generating unique IDs here is the geohash.

A Geohash is a geocoding system used to generate short unique identifiers based on latitude/longitude coordinates. It is a short combination of letters and numbers, with the length of the string a function of the precision of the location. The methods for generating a geohash are outlined here - yes, the official documentation is a wikipedia article.

Here we use the python package python-geohash to generate a geohash unique identifier for each polygon. We use precision = 9 geohash characters, which represents an on-the-ground accuracy of <20 metres. This ensures that the precision is high enough to differentiate between waterbodies located next to each other.
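As a quick illustration (the coordinates below are arbitrary), the encode function returns a string whose length matches the requested precision:

import geohash as gh  # provided by the python-geohash package

# Arbitrary example coordinates, roughly central Australia
example_uid = gh.encode(-25.0, 135.0, precision=9)
print(example_uid, len(example_uid))   # a 9-character geohash string, and 9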

[15]:
# We need to convert from Albers coordinates to lat/lon, in order to generate the geohash
GetUniqueID = DoubleCheckArea.to_crs(epsg=4326)

# Generate a geohash for the centroid of each polygon
GetUniqueID['UID'] = GetUniqueID.apply(lambda x: gh.encode(
    x.geometry.centroid.y, x.geometry.centroid.x, precision=9),
                                       axis=1)

# Check that our unique ID is in fact unique
assert GetUniqueID['UID'].is_unique

# Make an arbitrary numerical ID for each polygon. We will first sort the dataframe by geohash
# so that polygons close to each other are numbered similarly
SortedData = GetUniqueID.sort_values(by=['UID']).reset_index()
SortedData['WB_ID'] = SortedData.index
[30]:
# The step above creates an 'index' column, which we don't actually want, so drop it.
SortedData.drop(labels='index', axis=1, inplace=True)

Write out the final results to a shapefile

[32]:
BackToAlbers = SortedData.to_crs(epsg=3577)
BackToAlbers.to_file(FinalName, driver='ESRI Shapefile')

Some extra curation

Following the development of timeseries for each individual polygon, it was determined that a number of polygons do not produce complete timeseries.

Splitting polygons that cross swath boundaries

Three large polygons were identified that straddle Landsat swath boundaries. This is problematic, as the whole polygon will never be observed on a single day, which violates the requirement that at least 90% of a polygon be observed for an observation to be considered valid.

There are two options for dealing with this issue:

  • Splitting the polygons using the swath boundaries, so that each half of the polygon will be observed in a single day. This will retain information as to the exact timing of observations.

  • Creating time-averaged timeseries, which would group observations into monthly blocks and provide a value for each month. This would provide information for the whole polygon, but would lose the specific timing information.

We chose to go with the first option to keep the high fidelity timing information for each polygon. Three polygons were split using the swath boundaries as a guide. The split polygons were given a new WB_ID, and a new geohash was calculated for each new polygon.

[ ]:
WaterBodiesSplit = gp.read_file(
    '/g/data/r78/cek156/dea-notebooks/DEAWaterbodies/AusAllTime01-005HybridWaterbodies/AusWaterBodiesSplitEliminate.shp'
)

# We need to convert from Albers coordinates to lat/lon, in order to generate the geohash
GetUniqueID = WaterBodiesSplit.to_crs(epsg=4326)

# Only recalculate the geohash for the polygons that have changed:
ChangedWB_ID = [145126, 66034, 146567, 295902, 295903, 295904, 295905]

for ix, rowz in GetUniqueID.iterrows():
    if rowz['WB_ID'] in ChangedWB_ID:
        # Generate a new geohash (UID) for the centroid of each changed polygon
        GetUniqueID.loc[ix, 'UID'] = gh.encode(rowz.geometry.centroid.y,
                                               rowz.geometry.centroid.x,
                                               precision=9)
        print('Changing geohash')

# Check that our unique ID is in fact unique
assert GetUniqueID['UID'].is_unique

Save the final version of the polygons!

[ ]:
BackToAlbers = GetUniqueID.to_crs(epsg=3577)
BackToAlbers.to_file(
    '/g/data/r78/cek156/dea-notebooks/DEAWaterbodies/AusAllTime01-005HybridWaterbodies/AusWaterBodiesFINAL.shp',
    driver='ESRI Shapefile')

Additional information

License: The code in this notebook is licensed under the Apache License, Version 2.0. Digital Earth Australia data is licensed under the Creative Commons by Attribution 4.0 license.

Contact: If you need assistance, please post a question on the Open Data Cube Slack channel or on the GIS Stack Exchange using the open-data-cube tag (you can view previously asked questions here). If you would like to report an issue with this notebook, you can file one on Github.

Last modified: December 2019. Peer Code Quality Check Performed, March 2019

Compatible datacube version: A full list of python packages used to produce DEA Waterbodies is available here.

Tags

Tags: fiona, geopandas, Geotiff, masking, rasterio, shapefile, WOfS, WOFL, shapely, raster to polygons, polygons, vectorise, no_testing