Extending Datacube

Beyond the configuration available in ODC, there are three extension points provided for implementing different types of data storage and indexing.

  • Drivers for Reading Data
  • Drivers for Writing Data
  • Alternative types of Index

Support for Plug-in drivers

A lightweight driver loading system is implemented in datacube/drivers/driver_cache.py. It uses the setuptools dynamic service and plugin discovery mechanism (entry points) to name and define the available drivers. The code caches the drivers available in the current environment, loads them on demand, and handles failures caused by missing dependencies or other environment issues.
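The loading behaviour described above can be sketched roughly as follows. This is not the code in driver_cache.py; it is a minimal illustration that assumes the standard library's importlib.metadata instead of the setuptools API, and loads eagerly for brevity. The names load_drivers, _entry_points_for and the candidates parameter are hypothetical.

```python
import warnings
from importlib.metadata import entry_points


def _entry_points_for(group):
    """Return the entry points registered under `group` in this environment."""
    eps = entry_points()
    if hasattr(eps, "select"):            # Python 3.10+ selection API
        return list(eps.select(group=group))
    return list(eps.get(group, []))       # older dict-style return value


def load_drivers(group, candidates=None):
    """Build a {name: driver} cache, skipping any plug-in that fails to load.

    `candidates` (hypothetical, mainly for testing) is any iterable of objects
    with a `name` attribute and a `load()` method, like an EntryPoint.
    """
    if candidates is None:
        candidates = _entry_points_for(group)

    cache = {}
    for ep in candidates:
        try:
            driver_init = ep.load()       # import the module, fetch the init function
            driver = driver_init()        # init functions take no arguments
        except Exception as exc:          # e.g. ImportError from a missing dependency
            warnings.warn(f"Failed to load driver {ep.name!r} from {group!r}: {exc}")
            continue
        if driver is not None:
            cache[ep.name] = driver
    return cache
```

A broken plug-in produces a warning and is left out of the cache, so one missing dependency does not prevent the remaining drivers from being used.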

Data Read Plug-ins

Entry point group:
 datacube.plugins.io.read

Read plug-ins declare the URI protocols and formats they support, both of which are fields available on existing Datasets.

A ReadDriver returns a DataSource implementation, which is chosen based on:

  • Dataset URI protocol, e.g. s3://
  • Dataset format, as stored in the Data Cube Dataset
  • Current system settings
  • Available IO plugins

If no specific DataSource can be found, a default datacube.storage.storage.RasterDatasetDataSource is returned, which uses rasterio to read from the local file system or a network resource.
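The selection rule above can be illustrated with a small sketch. All class names here are hypothetical stand-ins, and the band is assumed to be a plain dict with uri and format keys; the real function receives a BandInfo and returns a RasterDatasetDataSource as the fallback.

```python
from urllib.parse import urlparse


class DefaultDataSource:
    """Stand-in for the rasterio-backed default data source (hypothetical)."""

    def __init__(self, band):
        self.band = band


class PickleDataSource(DefaultDataSource):
    """Stand-in for a data source supplied by a plug-in (hypothetical)."""


class PickleReaderDriver:
    """Toy reader driver that only handles local pickle files (hypothetical)."""

    def supports(self, protocol, fmt):
        return protocol == "file" and fmt == "pickle"

    def new_datasource(self, band):
        return PickleDataSource(band)


def choose_datasource(band, drivers):
    """Return a data source from the first driver that supports this band."""
    # A bare path has no scheme, so treat it as a local file.
    protocol = urlparse(band["uri"]).scheme or "file"
    for driver in drivers:
        if driver.supports(protocol, band["format"]):
            return driver.new_datasource(band)
    return DefaultDataSource(band)        # nothing more specific matched
```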

The DataSource maintains the same interface as before, which works at the individual dataset+time+band level for loading data. This is something to be addressed in the future.

Example code to implement a reader driver

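from datacube.drivers.datasource import DataSource
from datacube.storage import BandInfo

def init_reader_driver():
    return AbstractReaderDriver()

class AbstractReaderDriver(object):
    def supports(self, protocol: str, fmt: str) -> bool:
        # Return True if this driver can read the given URI protocol and format
        return False

    def new_datasource(self, band: BandInfo) -> DataSource:
        return AbstractDataSource(band)

class AbstractDataSource(object):  # Same interface as before
    ...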

S3 Driver

URI Protocol: s3://
Dataset Format: aio
Implementation location: datacube/drivers/s3/driver.py

Example Pickle Based Driver

Available in /examples/io_plugin. Includes an example setup.py as well as example Read and Write Drivers.

Data Write Plug-ins

Entry point group:
 datacube.plugins.io.write

Write plug-ins are selected by name. The storage.driver field in the ingestion configuration file specifies the name of the write driver to use. Drivers can specify a list of names that they can be known by. They can also publicly declare their output format, but the ingester does not use the format to decide which driver to use. Not specifying a driver is an error; there is no default.
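As a rough illustration of name-based selection, the sketch below registers each driver under every name it answers to and then resolves the storage.driver value from an ingestion configuration. The JsonWriterDriver class, the helper functions and the dict-shaped configuration are all hypothetical; only the aliases and format properties and the storage.driver field come from the text above.

```python
class JsonWriterDriver:
    """Toy writer used only for illustration (hypothetical)."""

    @property
    def aliases(self):
        return ["json", "JSON lines"]     # names this writer answers to

    @property
    def format(self):
        return "JSON"                     # declared, but not used for selection


def build_writer_lookup(drivers):
    """Map every alias of every driver to that driver instance."""
    lookup = {}
    for driver in drivers:
        for alias in driver.aliases:
            lookup[alias] = driver
    return lookup


def writer_for_config(config, lookup):
    """Pick the writer named by `storage.driver`; a missing name is an error."""
    name = config.get("storage", {}).get("driver")
    if name is None:
        raise ValueError("storage.driver must be set: there is no default writer")
    if name not in lookup:
        raise ValueError(f"No write driver named {name!r}; available: {sorted(lookup)}")
    return lookup[name]
```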

No public API has been decided on at this stage. The write_dataset_to_storage() method implemented in each driver is the closest equivalent, and the ingester uses it to write data.

Example code to implement a writer driver

def init_writer_driver():
    return AbstractWriterDriver()

class AbstractWriterDriver(object):
    @property
    def aliases(self):
        return []  # List of names this writer answers to

    @property
    def format(self):
        return ''  # Format that this writer supports

    def write_dataset_to_storage(self, dataset, filename,
                                 global_attributes=None,
                                 variable_params=None,
                                 storage_config=None,
                                 **kwargs):
        ...
        return {}  # Can return extra metadata to be saved in the index with the dataset

NetCDF Writer Driver

Name: netcdf, NetCDF CF
Format: NetCDF
Implementation: datacube.drivers.netcdf.driver.NetcdfWriterDriver

S3 Writer Driver

Name: s3aio
Protocol: s3
Format: aio
Implementation: datacube.drivers.s3.driver.S3WriterDriver

Index Plug-ins

Entry point group:
 datacube.plugins.index

A connection to an Index is required to find data in the Data Cube. Environments, already implemented in the develop branch, are named sets of configuration parameters used to connect to an Index. This PR extends them with an index_driver parameter, which specifies the name of the Index Driver to use. If the parameter is missing, the default PostgreSQL Index is used.
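The fallback can be sketched as below. The environment is assumed to be a plain dict, the driver registry a dict of driver objects, and "default" is an assumed name for the PostgreSQL driver; none of these details are confirmed by the text above.

```python
DEFAULT_INDEX_DRIVER = "default"   # assumed name of the PostgreSQL index driver


def index_driver_name(environment_config):
    """Return the configured index_driver, or the default when it is missing."""
    return environment_config.get("index_driver") or DEFAULT_INDEX_DRIVER


def connect(environment_config, index_drivers, application_name=None):
    """Look up the index driver for an environment and open a connection."""
    name = index_driver_name(environment_config)
    try:
        driver = index_drivers[name]
    except KeyError:
        raise ValueError(f"No index driver named {name!r}") from None
    return driver.connect_to_index(environment_config,
                                   application_name=application_name)
```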

Example code to implement an index driver

def index_driver_init():
    return AbstractIndexDriver()

class AbstractIndexDriver(object):
    @staticmethod
    def connect_to_index(config, application_name=None, validate_connection=True):
        return Index.from_config(config, application_name, validate_connection)

Default Implementation

The default Index uses a PostgreSQL database for all storage and retrieval.

S3 Extensions

The datacube.drivers.s3aio_index.S3AIOIndex driver subclasses the default PostgreSQL Index with support for saving additional data about the size and shape of chunks stored in S3 objects. As such, it implements an identical interface, while overriding the dataset.add() method to save the additional data.
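The subclass-and-override pattern described here looks roughly like the following. The classes and the chunks key are hypothetical stand-ins chosen for illustration; they are not the real Index or S3AIOIndex classes, and the real driver stores its extra data in the database instead of in memory.

```python
class DatasetResource:
    """Stand-in for the default index's dataset API (hypothetical)."""

    def __init__(self):
        self.records = {}

    def add(self, dataset):
        self.records[dataset["id"]] = dataset
        return dataset


class ChunkAwareDatasetResource(DatasetResource):
    """Same interface, but add() also records chunk size and shape details."""

    def __init__(self):
        super().__init__()
        self.chunk_info = {}

    def add(self, dataset):
        saved = super().add(dataset)          # reuse the default behaviour
        chunks = dataset.get("chunks")        # hypothetical key for chunk details
        if chunks is not None:
            self.chunk_info[dataset["id"]] = chunks
        return saved
```

Because only add() is overridden, code written against the default interface keeps working with the extended class.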

Drivers Plugin Management Module

Drivers are registered under entry_points in setup.py:

entry_points={
    'datacube.plugins.io.read': [
        's3aio = datacube.drivers.s3.driver:reader_driver_init'
    ],
    'datacube.plugins.io.write': [
        'netcdf = datacube.drivers.netcdf.driver:writer_driver_init',
        's3aio = datacube.drivers.s3.driver:writer_driver_init',
        's3aio_test = datacube.drivers.s3.driver:writer_test_driver_init',
    ]
}
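A third-party package could register its own drivers in the same way. The sketch below uses a hypothetical package my_plugin with hypothetical init functions, and adds a small helper showing how a "name = module:function" specification splits into its parts.

```python
# Hypothetical entry_points mapping for a third-party plug-in package.
# In a real setup.py it would be passed as setup(..., entry_points=ENTRY_POINTS).
ENTRY_POINTS = {
    "datacube.plugins.io.read": [
        "pickle = my_plugin.driver:reader_driver_init",
    ],
    "datacube.plugins.io.write": [
        "pickle = my_plugin.driver:writer_driver_init",
    ],
}


def parse_entry_point(spec):
    """Split 'name = package.module:function' into (name, module, function)."""
    name, _, target = spec.partition("=")
    module, _, function = target.strip().partition(":")
    return name.strip(), module, function
```

The name on the left of the equals sign is the name the driver is registered under, and the right-hand side points at the init function that returns the driver instance.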

Data Cube Drivers API

This module implements a simple plugin manager for storage and index drivers.

datacube.drivers.new_datasource(band: datacube.storage._base.BandInfo) → Optional[datacube.drivers.datasource.DataSource]

Returns a newly constructed data source to read dataset band data.

An appropriate DataSource implementation is chosen based on:

  • Dataset URI (protocol part)
  • Dataset format
  • Current system settings
  • Available IO plugins

This function will return the default RasterDatasetDataSource if no more specific DataSource can be found.

Parameters:
  • band (BandInfo) – the dataset band to read.

datacube.drivers.storage_writer_by_name(name)

Look up a writer driver by name

Returns: Initialised writer driver instance, or None if no driver with this name exists

datacube.drivers.index_driver_by_name(name)

Look up an index driver by name

Returns: Initialised index driver instance, or None if no driver with this name exists

datacube.drivers.index_drivers()

Returns a list of driver names

datacube.drivers.reader_drivers() → List[str]

Returns a list of driver names

datacube.drivers.writer_drivers() → List[str]

Returns a list of driver names
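The listing and lookup functions above could be combined as in this sketch. The helpers summarise_drivers and writer_or_fail are hypothetical; they accept any object exposing the functions documented above, so they can be tried without datacube installed.

```python
def summarise_drivers(drivers_module):
    """Collect the driver names reported by a datacube.drivers-like module."""
    return {
        "index": list(drivers_module.index_drivers()),
        "read": list(drivers_module.reader_drivers()),
        "write": list(drivers_module.writer_drivers()),
    }


def writer_or_fail(drivers_module, name):
    """Look up a writer, turning the documented None result into an error."""
    writer = drivers_module.storage_writer_by_name(name)
    if writer is None:
        raise LookupError(f"No writer driver named {name!r}")
    return writer
```

With datacube installed, passing the datacube.drivers module itself to summarise_drivers should list whatever drivers are registered in the current environment.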

References and History