Dataset
- class earth_data_kit.stitching.Dataset(name, source, engine, clean=True)
The Dataset class is the main class implemented by the stitching module. It acts as a dataset wrapper and maps to a single remote dataset. A remote dataset can contain multiple files.
Initializes a new dataset instance.
- Parameters:
name (str) – Unique identifier for the dataset.
source (str) – Source identifier (S3 URI or Earth Engine collection ID).
engine (str) – Data source engine, either 's3' or 'earth_engine'.
clean (bool, optional) – Whether to clean temporary files before processing. Defaults to True.
- Raises:
Exception – If the provided engine is not supported.
Example
>>> from earth_data_kit.stitching.classes.dataset import Dataset
>>> ds = Dataset("example_dataset", "LANDSAT/LC08/C01/T1_SR", "earth_engine")
>>> # Or with S3
>>> ds = Dataset("example_dataset", "s3://your-bucket/path", "s3")
- set_timebounds(start, end, resolution=None)
Sets the time bounds for which data will be downloaded.
- Parameters:
start (datetime) – Start date.
end (datetime) – End date, inclusive.
resolution (str, optional) – Temporal resolution for combining images. Options include ‘daily’.
Example
>>> import datetime
>>> from earth_data_kit.stitching import Dataset
>>> ds = Dataset("example_dataset", "LANDSAT/LC08/C01/T1_SR", "earth_engine", clean=True)
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 12, 31))
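When a source produces multiple images per day, the optional resolution argument controls how they are combined. A minimal sketch using the 'daily' option named above; the set_timebounds call is shown commented because it requires an initialized Dataset:

```python
import datetime

# Hypothetical one-month window combined at daily resolution.
start = datetime.datetime(2020, 1, 1)
end = datetime.datetime(2020, 1, 31)  # end date is inclusive
# ds.set_timebounds(start, end, resolution="daily")  # on an initialized Dataset
n_days = (end - start).days + 1  # number of daily composites expected
```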
- set_src_options(options)
Sets options for the source dataset.
This method allows setting various options for the source dataset, including source nodata values that can be used during processing.
- Parameters:
options (dict) – A dictionary of source options. Can include '-srcnodata', which may be either:
  - A single value applied to all bands
  - An array of values, one for each band in the dataset
Example
>>> ds = Dataset("example", "path/to/data", "file_system")
>>> # Set a single nodata value for all bands
>>> ds.set_src_options({"-srcnodata": "-9999"})
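The parameter description also allows one nodata value per band. A sketch of that form, where the specific values and the space-separated representation (as accepted by gdalwarp's -srcnodata) are assumptions:

```python
# One nodata value per band, in band order (hypothetical values).
per_band_nodata = ["-9999", "0", "255"]
# gdalwarp-style space-separated form for multi-band nodata.
options = {"-srcnodata": " ".join(per_band_nodata)}
# ds.set_src_options(options)  # applied to an initialized Dataset
```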
- set_target_options(options)
Sets options for the target dataset.
This method allows setting various GDAL options for the output dataset, which influence the gdalwarp process during VRT creation.
- Parameters:
options (dict) – A dictionary of GDAL options. Common options include:
  - '-t_srs': Target spatial reference system (projection)
  - '-tr': Target resolution (x y, in target projection units)
  - '-r': Resampling method (nearest, bilinear, cubic, etc.)
Example
>>> ds = Dataset("example", "path/to/data", "file_system")
>>> # Set target projection, resolution and resampling method
>>> ds.set_target_options({"-t_srs": "EPSG:3857", "-tr": "30 30", "-r": "bilinear"})
- set_spacebounds(bbox, grid_dataframe=None)
Configure spatial constraints for the dataset using a bounding box and, optionally, a grid dataframe.
This method sets up the spatial filtering parameters by specifying a bounding box defined by four coordinates in EPSG:4326. Additionally, if a grid dataframe is provided, the method uses it to pinpoint the scene files to download based on the spatial variables in the source path.
- Parameters:
bbox (tuple[float, float, float, float]) – A tuple of four coordinates in the order (min_longitude, min_latitude, max_longitude, max_latitude)/(xmin, ymin, xmax, ymax) defining the spatial extent.
grid_dataframe (geopandas.GeoDataFrame, optional) – A GeoDataFrame containing grid cells with columns that match the spatial variables in the source path (e.g., ‘h’, ‘v’ for MODIS grid). Each row should have a geometry column defining the spatial extent of the grid cell.
Example
>>> import earth_data_kit as edk
>>> import geopandas as gpd
>>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket/path/{h}/{v}/B01.TIF", "s3")
>>>
>>> # Setting spatial bounds using a bounding box:
>>> ds.set_spacebounds((19.3044861183, 39.624997667, 21.0200403175, 42.6882473822))
>>>
>>> # Setting spatial bounds with a grid dataframe:
>>> gdf = gpd.GeoDataFrame()
>>> # Assume gdf has columns 'h', 'v' that match the spatial variables in the source path
>>> # and a 'geometry' column with the spatial extent of each grid cell
>>> ds.set_spacebounds((19.3044861183, 39.624997667, 21.0200403175, 42.6882473822), grid_dataframe=gdf)
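The bounding box must be ordered (xmin, ymin, xmax, ymax) in EPSG:4326. A small sanity check for that ordering; this helper is a convenience sketch, not part of the library:

```python
def validate_bbox(bbox):
    """Check that bbox is (xmin, ymin, xmax, ymax) in EPSG:4326 with valid ranges."""
    xmin, ymin, xmax, ymax = bbox
    if not (-180 <= xmin < xmax <= 180):
        raise ValueError("longitudes must satisfy -180 <= xmin < xmax <= 180")
    if not (-90 <= ymin < ymax <= 90):
        raise ValueError("latitudes must satisfy -90 <= ymin < ymax <= 90")
    return bbox

# Valid bbox passes through unchanged; a swapped pair raises ValueError.
validate_bbox((19.3044861183, 39.624997667, 21.0200403175, 42.6882473822))
```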
- discover()
Scans the dataset source to identify, catalog, and save the intersecting tiles based on provided time and spatial constraints.
- This method follows a multi-step workflow:
  1. Invokes the engine's scan method to retrieve a dataframe of available tile metadata matching the time and spatial options.
  2. Concurrently retrieves detailed metadata for each tile by constructing Tile objects with a ThreadPoolExecutor.
  3. Converts the user-specified bounding box into a Shapely polygon (in EPSG:4326) and filters the tiles by comparing each tile's extent (also converted to EPSG:4326) to the bounding box using an intersection test.
  4. Saves the catalog of the intersecting tiles as a CSV file at the location specified by self.catalog_path.
  5. Discovers the overview information for the dataset.
- Returns:
None
- Raises:
Exception – Propagates any exceptions encountered during scanning, metadata retrieval, spatial filtering, or catalog saving.
Example
>>> import datetime
>>> import earth_data_kit as edk
>>> import geopandas as gpd
>>> ds = edk.stitching.Dataset(
...     "modis-pds",
...     "s3://modis-pds/MCD43A4.006/{h}/{v}/%Y%j/*_B0?.TIF",
...     "s3",
...     True
... )
>>> ds.set_timebounds(datetime.datetime(2017, 1, 1), datetime.datetime(2017, 1, 2))
>>> # Load grid dataframe
>>> gdf = gpd.read_file("tests/fixtures/modis.kml")
>>> gdf['h'] = gdf['Name'].str.split(' ').str[0].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> gdf['v'] = gdf['Name'].str.split(' ').str[1].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> ds.set_spacebounds((19.30, 39.62, 21.02, 42.69), grid_dataframe=gdf)
>>> ds.discover()  # Scans the dataset and saves the catalog of intersecting tiles at self.catalog_path
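For axis-aligned extents, the tile-filtering intersection test used by discover() reduces to a simple overlap check. A pure-Python sketch of the idea with made-up tile extents; the module itself compares Shapely geometries in EPSG:4326:

```python
def extents_intersect(a, b):
    """True if two (xmin, ymin, xmax, ymax) extents in EPSG:4326 overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

bbox = (19.30, 39.62, 21.02, 42.69)
tiles = {
    "t1": (18.0, 38.0, 20.0, 41.0),   # overlaps the bbox
    "t2": (30.0, 10.0, 32.0, 12.0),   # far away, filtered out
}
kept = [name for name, ext in tiles.items() if extents_intersect(ext, bbox)]
```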
- get_bands()
Retrieve unique band configurations from tile metadata.
Aggregates metadata from each tile by extracting attributes such as resolution (x_res, y_res) and coordinate reference system (crs). The data is then grouped by columns: band index inside tile (source_idx), band description, data type (dtype), x_res, y_res, and crs.
- Returns:
- A DataFrame with unique band configurations, where each row represents a unique band configuration with the following columns:
  - source_idx: Band index within the source files
  - description: Band description
  - dtype: Data type of the band
  - x_res: X resolution
  - y_res: Y resolution
  - crs: Coordinate reference system
  - tiles: List of Tile objects that contain this band configuration
- Return type:
pd.DataFrame
Example
>>> import datetime
>>> import earth_data_kit as edk
>>> import geopandas as gpd
>>> # Initialize the dataset
>>> ds = edk.stitching.Dataset("modis-pds", "s3://modis-pds/MCD43A4.006/{h}/{v}/%Y%j/*_B0?.TIF", "s3", True)
>>> ds.set_timebounds(datetime.datetime(2017, 1, 1), datetime.datetime(2017, 1, 2))
>>> # Load grid dataframe
>>> gdf = gpd.read_file("tests/fixtures/modis.kml")
>>> gdf['h'] = gdf['Name'].str.split(' ').str[0].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> gdf['v'] = gdf['Name'].str.split(' ').str[1].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
>>> ds.set_spacebounds((19.30, 39.62, 21.02, 42.69), grid_dataframe=gdf)
>>> ds.discover()
>>> bands_df = ds.get_bands()
>>> print(bands_df.head())
   source_idx              description   dtype  x_res  y_res        crs                                                   tiles
0           1  Nadir_Reflectance_Band1  uint16   30.0   30.0  EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...
1           1  Nadir_Reflectance_Band2  uint16   30.0   30.0  EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...
2           1  Nadir_Reflectance_Band3  uint16   30.0   30.0  EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...
Notes
The ‘source_idx’ column typically represents the band index within the source files. In some cases, this value will be 1 for all bands, especially when each band is stored in a separate file.
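The aggregation described above can be sketched with plain pandas on illustrative metadata rows. The column names follow the docstring; the values and tile identifiers are made up for the example:

```python
import pandas as pd

# Per-tile band metadata rows (illustrative values).
rows = pd.DataFrame([
    {"source_idx": 1, "description": "red", "dtype": "uint16",
     "x_res": 30.0, "y_res": 30.0, "crs": "EPSG:4326", "tile": "t1"},
    {"source_idx": 1, "description": "red", "dtype": "uint16",
     "x_res": 30.0, "y_res": 30.0, "crs": "EPSG:4326", "tile": "t2"},
    {"source_idx": 1, "description": "nir", "dtype": "uint16",
     "x_res": 30.0, "y_res": 30.0, "crs": "EPSG:4326", "tile": "t1"},
])

# Group by the configuration columns and collect the tiles sharing each one.
bands = (
    rows.groupby(["source_idx", "description", "dtype", "x_res", "y_res", "crs"])["tile"]
    .apply(list)
    .reset_index(name="tiles")
)
```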
- mosaic(bands)
Stitches the scene files together into VRTs based on the ordered band arrangement provided. For each unique date, this function extracts the required bands from the tile metadata and creates individual single-band VRTs that are reprojected to EPSG:3857 by default (if no target spatial reference is specified). These single-band VRTs are then mosaiced per band and finally stacked into a multi-band VRT.
GDAL options influencing the gdalwarp process can be configured using the edk.stitching.Dataset.set_target_options() function. Also see edk.stitching.Dataset.set_src_options() for source dataset options influencing the gdalwarp process, e.g. source nodata values.
- Parameters:
bands (list[str]) – Ordered list of band descriptions to output as VRTs.
Example
>>> import datetime
>>> import earth_data_kit as edk
>>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket-name/path/to/data", "s3")
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 1, 31))
>>> ds.discover()  # Discover available scene files before stitching
>>> bands = ["red", "green", "blue"]
>>> ds.mosaic(bands)
>>> ds.save()  # Save the output VRTs to a JSON file
- save()
Saves the mosaiced VRTs into a combined JSON file.
This method should be called after the mosaic() method to save the generated VRTs. The resulting JSON path is stored in the json_path attribute.
- Returns:
None
- to_dataarray()
Converts the dataset to an xarray DataArray.
This method opens the JSON file created by save() using xarray with the ‘edk_dataset’ engine and returns the DataArray corresponding to this dataset.
- Returns:
A DataArray containing the dataset’s data with dimensions for time, bands, and spatial coordinates.
- Return type:
xarray.DataArray
Example
>>> import earth_data_kit as edk
>>> import datetime
>>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket/path", "s3")
>>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 1, 31))
>>> ds.discover()
>>> ds.mosaic(bands=["red", "green", "blue"])
>>> ds.save()
>>> data_array = ds.to_dataarray()
Note
This method requires that mosaic() and save() have been called first to generate the JSON file.
- static Dataset.dataarray_from_file(json_path)
Creates an xarray DataArray from a JSON file created by the save() method.
Automatically determines optimal chunking based on the underlying raster block size.
- Parameters:
json_path (str) – Path to the JSON file containing dataset information.
- Returns:
DataArray with dimensions for time, bands, and spatial coordinates.
- Return type:
xarray.DataArray
Example
>>> import earth_data_kit as edk
>>> data_array = edk.stitching.Dataset.dataarray_from_file("path/to/dataset.json")
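The chunking rule described above is not specified in detail here; one plausible heuristic is to pick the largest multiple of the raster block size that does not exceed a target chunk dimension. The following is a hypothetical sketch of such a rule, not the library's actual implementation:

```python
def chunk_dim(block_size, target=2048):
    """Largest multiple of block_size not exceeding target (at least one block).

    Hypothetical heuristic: aligning dask-style chunks to raster block
    boundaries avoids reading partial blocks from disk.
    """
    return max(block_size, (target // block_size) * block_size)
```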
Note
Loads a previously saved dataset without needing to recreate the Dataset object.