Collection Management

The DEA repository contains a set of tools for managing the collections on disk at NCI.

See each tool's --help output for specific information:

# Update the index, find problems with datasets.
dea-sync --help
dea-coherence --help
dea-duplicates --help

# Move datasets between disks safely and incrementally (updating the index).
dea-move --help

# Trash archived datasets
dea-clean --help

# Submit a sync job to PBS
dea-submit-sync --help

Note that many of these operate on collections, not products. To perform a move or sync on a new product, you may first need to add it as a collection.

Defining Collections

The Open Data Cube core keeps track of where each individual dataset in a product is located, but not where the datasets as a whole should be stored (for example, on which filesystems).

Knowing “where they should be” is currently handled in this DEA repository as the list of collections.

A collection defines:

  • datacube query arguments and folder patterns that should contain the same set of datasets. The sync tool, for example, can then iterate over both to find mismatches in either direction.

  • how datasets in the collection should be treated: is an unindexed file found on disk corrupt, or newly arrived?
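The two points above can be illustrated with a minimal sketch. This is not the DEA implementation: the `Collection` class, its field names, the `find_mismatches` helper, and the example paths are all hypothetical, chosen only to show how a query and a set of folder patterns can be compared in both directions.

```python
import fnmatch
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class Collection:
    """Hypothetical stand-in for a collection definition (not the DEA class)."""
    name: str
    # Datacube query arguments that select the collection's datasets in the index.
    query: Dict[str, object]
    # Glob patterns describing where those datasets should live on disk.
    file_patterns: Tuple[str, ...]
    # Treatment: is an unindexed file on disk corrupt (True) or newly arrived (False)?
    unindexed_means_corrupt: bool

    def matches_path(self, path: str) -> bool:
        """Return True if the path falls under any of the collection's patterns."""
        return any(fnmatch.fnmatch(path, pattern) for pattern in self.file_patterns)


def find_mismatches(
    collection: Collection,
    indexed_paths: Iterable[str],
    disk_paths: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Compare index and disk in both directions.

    Returns (missing_from_disk, missing_from_index).
    `indexed_paths` is assumed to be the result of running `collection.query`
    against the index; `disk_paths` is assumed to be a listing of files on disk.
    """
    in_index = set(indexed_paths)
    on_disk = {p for p in disk_paths if collection.matches_path(p)}
    return in_index - on_disk, on_disk - in_index
```

In this sketch, paths that are indexed but absent from disk and paths that are on disk but unindexed are reported separately, so a caller could apply the collection's treatment rule (corrupt versus newly arrived) only to the second set.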

The set of NCI DEA collections is currently in


        query={'product': ['ls8_level1_scene', 'ls8_level1_oli_scene']},

# Telemetry collection
        query={'metadata_type': 'telemetry'},
        # If something is archived, how many days before we can delete it? None means never
        # Who do we trust in a sync if there's a mismatch?
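As a rough illustration of the "days before we can delete it" setting described in the comment above, the check could look like the following. The function name `can_delete` and the parameter `delete_archived_after_days` are assumptions made for this sketch, not names confirmed by the DEA code; only the "None means never" rule comes from the comment.

```python
from datetime import datetime, timedelta
from typing import Optional


def can_delete(
    archived_at: datetime,
    delete_archived_after_days: Optional[int],
    now: datetime,
) -> bool:
    """Decide whether an archived dataset is old enough to be deleted.

    `delete_archived_after_days` is a hypothetical setting name:
    None means the dataset is never deleted.
    """
    if delete_archived_after_days is None:
        return False
    return now - archived_at >= timedelta(days=delete_archived_after_days)
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check easy to test.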