Dataset
- class earth_data_kit.stitching.Dataset(name, source, engine, clean=True)
The Dataset class is the main class implemented by the stitching module. It acts as a dataset wrapper and maps to a single remote dataset. A remote dataset can contain multiple files.
Initializes a new dataset instance.
- Parameters:
name (str) – Unique identifier for the dataset.
source (str) – Source identifier (S3 URI or Earth Engine collection ID).
engine (str) – Data source engine, either 's3' or 'earth_engine'.
clean (bool, optional) – Whether to clean temporary files before processing. Defaults to True.
- Raises:
Exception – If the provided engine is not supported.
Example
>>> from earth_data_kit.stitching.classes.dataset import Dataset
>>> ds = Dataset("example_dataset", "LANDSAT/LC08/C01/T1_SR", "earth_engine")
>>> # Or with S3
>>> ds = Dataset("example_dataset", "s3://your-bucket/path", "s3")
- set_timebounds(start, end, resolution=None)
Sets the time bounds for which data will be downloaded.
- Parameters:
start (datetime) – Start date.
end (datetime) – End date, inclusive.
resolution (str, optional) – Temporal resolution for combining images. Options include ‘daily’.
Example
>>> import datetime
>>> from earth_data_kit.stitching import Dataset
>>> ds = Dataset("example_dataset", "LANDSAT/LC08/C01/T1_SR", "earth_engine", clean=True)
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 12, 31))
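When a source produces multiple images per day, the optional resolution argument controls how they are combined. A minimal sketch using the 'daily' option named above; the set_timebounds call is shown commented because it requires an initialized Dataset:

```python
import datetime

# Hypothetical one-month window combined at daily resolution.
start = datetime.datetime(2020, 1, 1)
end = datetime.datetime(2020, 1, 31)  # end date is inclusive
# ds.set_timebounds(start, end, resolution="daily")  # on an initialized Dataset
n_days = (end - start).days + 1  # number of daily composites expected
```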
- set_src_options(options)
Sets options for the source dataset.
This method allows setting various options for the source dataset, including source nodata values that can be used during processing.
- Parameters:
options (dict) – A dictionary of source options. Can include '-srcnodata', which may be either:
  - A single value applied to all bands
  - An array of values, one for each band in the dataset
Example
>>> ds = Dataset("example", "path/to/data", "file_system")
>>> # Set a single nodata value for all bands
>>> ds.set_src_options({"-srcnodata": "-9999"})
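The parameter description also allows one nodata value per band. A sketch of that form, where the specific values and the space-separated representation (as accepted by gdalwarp's -srcnodata) are assumptions:

```python
# One nodata value per band, in band order (hypothetical values).
per_band_nodata = ["-9999", "0", "255"]
# gdalwarp-style space-separated form for multi-band nodata.
options = {"-srcnodata": " ".join(per_band_nodata)}
# ds.set_src_options(options)  # applied to an initialized Dataset
```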
- set_target_options(options)
Sets options for the target dataset.
This method allows setting various GDAL options for the output dataset, which influence the gdalwarp process during VRT creation.
- Parameters:
options (dict) – A dictionary of GDAL options. Common options include:
  - '-t_srs': Target spatial reference system (projection)
  - '-tr': Target resolution (x y, in target projection units)
  - '-r': Resampling method (nearest, bilinear, cubic, etc.)
Example
>>> ds = Dataset("example", "path/to/data", "file_system")
>>> # Set target projection, resolution and resampling method
>>> ds.set_target_options({"-t_srs": "EPSG:3857", "-tr": "30 30", "-r": "bilinear"})
- set_spacebounds(bbox, grid_dataframe=None)
Configure spatial constraints for the dataset using a bounding box and, optionally, a grid dataframe.
This method sets up the spatial filtering parameters by specifying a bounding box defined by four coordinates in EPSG:4326. Additionally, if a grid dataframe is provided, the method uses it to pinpoint the scene files to download based on the spatial variables in the source path.
- Parameters:
bbox (tuple[float, float, float, float]) – A tuple of four coordinates in the order (min_longitude, min_latitude, max_longitude, max_latitude)/(xmin, ymin, xmax, ymax) defining the spatial extent.
grid_dataframe (geopandas.GeoDataFrame, optional) – A GeoDataFrame containing grid cells with columns that match the spatial variables in the source path (e.g., ‘h’, ‘v’ for MODIS grid). Each row should have a geometry column defining the spatial extent of the grid cell.
Example
>>> import earth_data_kit as edk
>>> import geopandas as gpd
>>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket/path/{h}/{v}/B01.TIF", "s3")
>>>
>>> # Setting spatial bounds using a bounding box:
>>> ds.set_spacebounds((19.3044861183, 39.624997667, 21.0200403175, 42.6882473822))
>>>
>>> # Setting spatial bounds with a grid dataframe:
>>> gdf = gpd.GeoDataFrame()
>>> # Assume gdf has columns 'h', 'v' that match the spatial variables in the source path
>>> # and a 'geometry' column with the spatial extent of each grid cell
>>> ds.set_spacebounds((19.3044861183, 39.624997667, 21.0200403175, 42.6882473822), grid_dataframe=gdf)
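The bounding box must be ordered (xmin, ymin, xmax, ymax) in EPSG:4326. A small sanity check for that ordering; this helper is a convenience sketch, not part of the library:

```python
def validate_bbox(bbox):
    """Check that bbox is (xmin, ymin, xmax, ymax) in EPSG:4326 with valid ranges."""
    xmin, ymin, xmax, ymax = bbox
    if not (-180 <= xmin < xmax <= 180):
        raise ValueError("longitudes must satisfy -180 <= xmin < xmax <= 180")
    if not (-90 <= ymin < ymax <= 90):
        raise ValueError("latitudes must satisfy -90 <= ymin < ymax <= 90")
    return bbox

# Valid bbox passes through unchanged; a swapped pair raises ValueError.
validate_bbox((19.3044861183, 39.624997667, 21.0200403175, 42.6882473822))
```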
- discover()
Scans the dataset source to identify, catalog, and save the intersecting tiles based on provided time and spatial constraints.
- This method follows a multi-step workflow:
  1. Invokes the engine's scan method to retrieve a dataframe of available tile metadata matching the time and spatial options.
  2. Concurrently retrieves detailed metadata for each tile by constructing Tile objects with a ThreadPoolExecutor.
  3. Converts the user-specified bounding box into a Shapely polygon (in EPSG:4326) and filters the tiles by comparing each tile's extent (also converted to EPSG:4326) to the bounding box using an intersection test.
  4. Saves the catalog of the intersecting tiles as a CSV file at the location specified by self.catalog_path.
  5. Discovers the overview information for the dataset.
- Returns:
None
- Raises:
Exception – Propagates any exceptions encountered during scanning, metadata retrieval, spatial filtering, or catalog saving.
Example
>>> import datetime
>>> import earth_data_kit as edk
>>> import geopandas as gpd
>>> ds = edk.stitching.Dataset(
...     "modis-pds",
...     "s3://modis-pds/MCD43A4.006/{h}/{v}/%Y%j/*_B0?.TIF",
...     "s3",
...     True
... )
>>> ds.set_timebounds(datetime.datetime(2017, 1, 1), datetime.datetime(2017, 1, 2))
>>> # Load grid dataframe
>>> gdf = gpd.read_file("tests/fixtures/modis.kml")
>>> gdf['h'] = gdf['Name'].str.split(' ').str[0].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> gdf['v'] = gdf['Name'].str.split(' ').str[1].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> ds.set_spacebounds((19.30, 39.62, 21.02, 42.69), grid_dataframe=gdf)
>>> ds.discover()  # Scans the dataset and saves the catalog of intersecting tiles at self.catalog_path
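For axis-aligned extents, the tile-filtering intersection test used by discover() reduces to a simple overlap check. A pure-Python sketch of the idea with made-up tile extents; the module itself compares Shapely geometries in EPSG:4326:

```python
def extents_intersect(a, b):
    """True if two (xmin, ymin, xmax, ymax) extents in EPSG:4326 overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

bbox = (19.30, 39.62, 21.02, 42.69)
tiles = {
    "t1": (18.0, 38.0, 20.0, 41.0),   # overlaps the bbox
    "t2": (30.0, 10.0, 32.0, 12.0),   # far away, filtered out
}
kept = [name for name, ext in tiles.items() if extents_intersect(ext, bbox)]
```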
- get_bands()
Retrieve unique band configurations from tile metadata.
Aggregates metadata from each tile by extracting attributes such as resolution (x_res, y_res) and coordinate reference system (crs). The data is then grouped by columns: band index inside tile (source_idx), band description, data type (dtype), x_res, y_res, and crs.
- Returns:
- A DataFrame with unique band configurations, where each row represents a unique band configuration with the following columns:
  - source_idx: Band index within the source files
  - description: Band description
  - dtype: Data type of the band
  - x_res: X resolution
  - y_res: Y resolution
  - crs: Coordinate reference system
  - tiles: List of Tile objects that contain this band configuration
- Return type:
pd.DataFrame
Example
>>> import datetime
>>> import earth_data_kit as edk
>>> import geopandas as gpd
>>> # Initialize the dataset
>>> ds = edk.stitching.Dataset("modis-pds", "s3://modis-pds/MCD43A4.006/{h}/{v}/%Y%j/*_B0?.TIF", "s3", True)
>>> ds.set_timebounds(datetime.datetime(2017, 1, 1), datetime.datetime(2017, 1, 2))
>>> # Load grid dataframe
>>> gdf = gpd.read_file("tests/fixtures/modis.kml")
>>> gdf['h'] = gdf['Name'].str.split(' ').str[0].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> gdf['v'] = gdf['Name'].str.split(' ').str[1].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> ds.set_spacebounds((19.30, 39.62, 21.02, 42.69), grid_dataframe=gdf)
>>> ds.discover()
>>> bands_df = ds.get_bands()
>>> print(bands_df.head())
   source_idx              description   dtype  x_res  y_res        crs                                                   tiles
0           1  Nadir_Reflectance_Band1  uint16   30.0   30.0  EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...
1           1  Nadir_Reflectance_Band2  uint16   30.0   30.0  EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...
2           1  Nadir_Reflectance_Band3  uint16   30.0   30.0  EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...
Notes
The ‘source_idx’ column typically represents the band index within the source files. In some cases, this value will be 1 for all bands, especially when each band is stored in a separate file.
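The aggregation described above can be sketched with plain pandas on illustrative metadata rows. The column names follow the docstring; the values and tile identifiers are made up for the example:

```python
import pandas as pd

# Per-tile band metadata rows (illustrative values).
rows = pd.DataFrame([
    {"source_idx": 1, "description": "red", "dtype": "uint16",
     "x_res": 30.0, "y_res": 30.0, "crs": "EPSG:4326", "tile": "t1"},
    {"source_idx": 1, "description": "red", "dtype": "uint16",
     "x_res": 30.0, "y_res": 30.0, "crs": "EPSG:4326", "tile": "t2"},
    {"source_idx": 1, "description": "nir", "dtype": "uint16",
     "x_res": 30.0, "y_res": 30.0, "crs": "EPSG:4326", "tile": "t1"},
])

# Group by the configuration columns and collect the tiles sharing each one.
bands = (
    rows.groupby(["source_idx", "description", "dtype", "x_res", "y_res", "crs"])["tile"]
    .apply(list)
    .reset_index(name="tiles")
)
```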
- mosaic(bands)
Stitches the scene files together into VRTs based on the ordered band arrangement provided. For each unique date, this function extracts the required bands from the tile metadata and creates individual single-band VRTs that are reprojected to EPSG:3857 by default (if no target spatial reference is specified). These single-band VRTs are then mosaiced per band and finally stacked into a multi-band VRT.
GDAL options influencing the gdalwarp process can be configured using the edk.stitching.Dataset.set_target_options() function. Also see edk.stitching.Dataset.set_src_options() for source dataset options influencing the gdalwarp process, e.g. source nodata values.
- Parameters:
bands (list[str]) – Ordered list of band descriptions to output as VRTs.
Example
>>> import datetime
>>> import earth_data_kit as edk
>>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket-name/path/to/data", "s3")
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 1, 31))
>>> ds.discover()  # Discover available scene files before stitching
>>> bands = ["red", "green", "blue"]
>>> ds.mosaic(bands)
>>> ds.save()  # Save the output VRTs to a JSON file
- save()
Saves the mosaiced VRTs into a combined JSON file.
This method should be called after the mosaic() method to save the generated VRTs. The resulting JSON path is stored in the json_path attribute.
- Returns:
None
- to_dataarray()
Converts the dataset to an xarray DataArray.
This method opens the JSON file created by save() using xarray with the ‘edk_dataset’ engine and returns the DataArray corresponding to this dataset.
- Returns:
A DataArray containing the dataset’s data with dimensions for time, bands, and spatial coordinates.
- Return type:
xarray.DataArray
Example
>>> import earth_data_kit as edk
>>> import datetime
>>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket/path", "s3")
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 1, 31))
>>> ds.discover()
>>> ds.mosaic(bands=["red", "green", "blue"])
>>> ds.save()
>>> data_array = ds.to_dataarray()
Note
This method requires that mosaic() and save() have been called first to generate the JSON file.
- static Dataset.dataarray_from_file(json_path)
Creates an xarray DataArray from a JSON file created by the save() method.
Automatically determines optimal chunking based on the underlying raster block size.
- Parameters:
json_path (str) – Path to the JSON file containing dataset information.
- Returns:
DataArray with dimensions for time, bands, and spatial coordinates.
- Return type:
xarray.DataArray
Example
>>> import earth_data_kit as edk
>>> data_array = edk.stitching.Dataset.dataarray_from_file("path/to/dataset.json")
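The chunking rule described above is not specified in detail here; one plausible heuristic is to pick the largest multiple of the raster block size that does not exceed a target chunk dimension. The following is a hypothetical sketch of such a rule, not the library's actual implementation:

```python
def chunk_dim(block_size, target=2048):
    """Largest multiple of block_size not exceeding target (at least one block).

    Hypothetical heuristic: aligning dask-style chunks to raster block
    boundaries avoids reading partial blocks from disk.
    """
    return max(block_size, (target // block_size) * block_size)
```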
Note
Loads a previously saved dataset without needing to recreate the Dataset object.