Dataset¶
- class earth_data_kit.stitching.Dataset(name, source, engine, format, clean=True)
- The Dataset class is the main class implemented by the stitching module. It acts as a dataset wrapper and maps to a single remote dataset. A remote dataset can contain multiple files.
- Initialize a new dataset instance.
- Parameters:
- name (str) – Unique identifier for the dataset 
- source (str) – Source identifier (S3 URI or Earth Engine collection ID) 
- engine (str) – Data source engine. Valid options are s3 or earth_engine 
- clean (bool, optional) – Whether to clean temporary files before processing. Defaults to True 
 
- Raises:
- Exception – If the provided engine is not supported 
 - Example

   >>> from earth_data_kit.stitching.classes.dataset import Dataset
   >>> ds = Dataset("example_dataset", "LANDSAT/LC08/C01/T1_SR", "earth_engine")
   >>> # Or with S3
   >>> ds = Dataset("example_dataset", "s3://your-bucket/path", "s3")

 - set_timebounds(start, end, resolution=None)
- Set time bounds for data download and optional temporal resolution for combining images.
- Parameters:
- start (datetime) – Start date 
- end (datetime) – End date (inclusive) 
- resolution (str, optional) – Temporal resolution (e.g., ‘D’ for daily, ‘W’ for weekly, ‘M’ for monthly). See pandas offset aliases for the full list: https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases 
 
 - Example

   >>> import datetime
   >>> from earth_data_kit.stitching import Dataset
   >>> ds = Dataset("example_dataset", "LANDSAT/LC08/C01/T1_SR", "earth_engine", clean=True)
   >>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 12, 31))
   >>> # Set daily resolution
   >>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 12, 31), resolution='D')
   >>> # Set monthly resolution
   >>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 12, 31), resolution='M')
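 Conceptually, a temporal resolution tells the library which images fall into the same period and should be combined. A minimal plain-Python sketch of that bucketing (not earth_data_kit internals, just the idea behind resolution='M'):

   ```python
   # Illustrative sketch only: group a daily date range into monthly
   # buckets, which is conceptually what resolution='M' does when
   # deciding which images to combine into one mosaic.
   import datetime
   from collections import defaultdict

   def group_by_month(dates):
       """Group datetime objects into (year, month) buckets."""
       buckets = defaultdict(list)
       for d in dates:
           buckets[(d.year, d.month)].append(d)
       return dict(buckets)

   # A daily range spanning January and February 2020
   start = datetime.datetime(2020, 1, 1)
   dates = [start + datetime.timedelta(days=i) for i in range(60)]

   buckets = group_by_month(dates)
   print(sorted(buckets))           # [(2020, 1), (2020, 2)]
   print(len(buckets[(2020, 1)]))   # 31 daily images fall into the January bucket
   ```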
 - set_spacebounds(bbox, grid_dataframe=None)
- Configure spatial constraints for the dataset using a bounding box and, optionally, a grid dataframe.
- This method sets up the spatial filtering parameters by specifying a bounding box defined by four coordinates in EPSG:4326. Additionally, if a grid dataframe is provided, the method uses it to accurately pinpoint the scene files to download based on the spatial variables in the source path.
- Parameters:
- bbox (tuple[float, float, float, float]) – A tuple of four coordinates in the order (min_longitude, min_latitude, max_longitude, max_latitude), i.e., (xmin, ymin, xmax, ymax), defining the spatial extent. 
- grid_dataframe (geopandas.GeoDataFrame, optional) – A GeoDataFrame containing grid cells with columns that match the spatial variables in the source path (e.g., ‘h’, ‘v’ for MODIS grid). Each row should have a geometry column defining the spatial extent of the grid cell. 
 
 - Example

   >>> import earth_data_kit as edk
   >>> import geopandas as gpd
   >>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket/path/{h}/{v}/B01.TIF", "s3")
   >>>
   >>> # Setting spatial bounds using a bounding box:
   >>> ds.set_spacebounds((19.3044861183, 39.624997667, 21.0200403175, 42.6882473822))
   >>>
   >>> # Setting spatial bounds with a grid dataframe:
   >>> gdf = gpd.GeoDataFrame()
   >>> # Assume gdf has columns 'h', 'v' that match the spatial variables in the source path
   >>> # and a 'geometry' column with the spatial extent of each grid cell
   >>> ds.set_spacebounds((19.3044861183, 39.624997667, 21.0200403175, 42.6882473822), grid_dataframe=gdf)
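 The role of the spatial variables can be sketched in plain Python: once the grid dataframe identifies which cells intersect the bounding box, their column values fill the placeholders in the source path. The bucket path and cell values below are illustrative, not real data:

   ```python
   # Illustrative sketch: substituting grid-cell column values ('h', 'v')
   # into the spatial variables of a source path. The bucket and cell
   # values are invented for the example.
   source = "s3://your-bucket/path/{h}/{v}/B01.TIF"

   # Suppose these two grid cells intersect the requested bounding box
   cells = [{"h": "18", "v": "04"}, {"h": "19", "v": "04"}]

   paths = [source.format(**cell) for cell in cells]
   print(paths[0])  # s3://your-bucket/path/18/04/B01.TIF
   ```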
 - discover(band_locator='description')
- Scans the dataset source to identify, catalog, and save the intersecting tiles based on the provided time and spatial constraints.
- This method follows a multi-step workflow:
- Invokes the engine’s scan method to retrieve a dataframe of available tile metadata that match the time and spatial options. 
- Handles any subdatasets found in the scan results. 
- Concurrently retrieves detailed metadata for each tile by constructing Tile objects using a ThreadPoolExecutor. 
- Converts the user-specified bounding box into a Shapely polygon (in EPSG:4326) and filters the tiles by comparing each tile’s extent (also converted to EPSG:4326) to the bounding box using an intersection test. 
- Saves the catalog of the intersecting tiles as a CSV file at the location specified by self.catalog_path. 
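- The intersection test in the workflow above can be sketched in plain Python, with axis-aligned bounding boxes standing in for the Shapely polygons the library actually uses (a simplification; real extents are arbitrary polygons reprojected to EPSG:4326):

  ```python
  # Simplified sketch of the tile-filtering step: keep only tiles whose
  # extent overlaps the query bounding box. Boxes are
  # (xmin, ymin, xmax, ymax) in EPSG:4326; tile extents are invented.
  def bboxes_intersect(a, b):
      """Return True if two (xmin, ymin, xmax, ymax) boxes overlap."""
      return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

  query = (19.30, 39.62, 21.02, 42.69)
  tile_extents = {
      "tile_a": (18.0, 38.0, 20.0, 40.0),   # overlaps the query box
      "tile_b": (25.0, 45.0, 30.0, 50.0),   # far outside
  }
  kept = [name for name, ext in tile_extents.items() if bboxes_intersect(query, ext)]
  print(kept)  # ['tile_a']
  ```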
 
 - Parameters:
- band_locator (str, optional) – Specifies how to locate bands in the dataset. Defaults to “description”. Valid options are “description”, “color_interp”, “filename”. 
- Returns:
- None 
- Raises:
- Exception – Propagates any exceptions encountered during scanning, metadata retrieval, spatial filtering, or catalog saving. 
 - Example

   >>> import datetime
   >>> import earth_data_kit as edk
   >>> import geopandas as gpd
   >>> ds = edk.stitching.Dataset(
   ...     "modis-pds",
   ...     "s3://modis-pds/MCD43A4.006/{h}/{v}/%Y%j/*_B0?.TIF",
   ...     "s3",
   ...     True
   ... )
   >>> ds.set_timebounds(datetime.datetime(2017, 1, 1), datetime.datetime(2017, 1, 2))
   >>> # Load grid dataframe
   >>> gdf = gpd.read_file("tests/fixtures/modis.kml")
   >>> gdf['h'] = gdf['Name'].str.split(' ').str[0].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
   >>> gdf['v'] = gdf['Name'].str.split(' ').str[1].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
   >>> ds.set_spacebounds((19.30, 39.62, 21.02, 42.69), grid_dataframe=gdf)
   >>> ds.discover()  # Scans the dataset and saves the catalog of intersecting tiles
 - get_bands()
- Retrieve unique band configurations from tile metadata.
- Aggregates metadata from each tile by extracting attributes such as resolution (x_res, y_res) and coordinate reference system (crs). The data is then grouped by the following columns: band index within the tile (source_idx), band description, data type (dtype), x_res, y_res, and crs.
- Returns:
- A DataFrame with unique band configurations, where each row represents a unique band configuration with the following columns:
  - source_idx: Band index within the source files
  - description: Band description
  - dtype: Data type of the band
  - x_res: X resolution
  - y_res: Y resolution
  - crs: Coordinate reference system
  - tiles: List of Tile objects that contain this band configuration
 
- Return type:
- pd.DataFrame 
 - Example

   >>> import datetime
   >>> import earth_data_kit as edk
   >>> import geopandas as gpd
   >>> # Initialize the dataset
   >>> ds = edk.stitching.Dataset("modis-pds", "s3://modis-pds/MCD43A4.006/{h}/{v}/%Y%j/*_B0?.TIF", "s3", True)
   >>> ds.set_timebounds(datetime.datetime(2017, 1, 1), datetime.datetime(2017, 1, 2))
   >>> # Load grid dataframe
   >>> gdf = gpd.read_file("tests/fixtures/modis.kml")
   >>> gdf['h'] = gdf['Name'].str.split(' ').str[0].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
   >>> gdf['v'] = gdf['Name'].str.split(' ').str[1].str.split(':').str[1].astype(int).astype(str).str.zfill(2)
   >>> ds.set_spacebounds((19.30, 39.62, 21.02, 42.69), grid_dataframe=gdf)
   >>> ds.discover()
   >>> bands_df = ds.get_bands()
   >>> print(bands_df.head())
      source_idx              description   dtype  x_res  y_res        crs                                              tiles
   0           1  Nadir_Reflectance_Band1  uint16   30.0   30.0  EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...
   1           1  Nadir_Reflectance_Band2  uint16   30.0   30.0  EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...
   2           1  Nadir_Reflectance_Band3  uint16   30.0   30.0  EPSG:4326  [<earth_data_kit.stitching.classes.tile.Tile object...

 - Notes

   The ‘source_idx’ column typically represents the band index within the source files. In some cases, this value will be 1 for all bands, especially when each band is stored in a separate file.
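 The grouping that get_bands() performs can be sketched in plain Python; the library itself groups a pandas DataFrame over these columns, but the idea is simply "one row per unique (source_idx, description, dtype, x_res, y_res, crs) tuple, carrying the tiles that share it". The tile names and metadata below are invented:

   ```python
   # Simplified sketch of the grouping behind get_bands(): collect tiles
   # per unique band configuration. All values here are invented.
   from collections import defaultdict

   # Per-tile band metadata: (source_idx, description, dtype, x_res, y_res, crs)
   tile_bands = [
       ("tile_a", (1, "red", "uint16", 30.0, 30.0, "EPSG:4326")),
       ("tile_b", (1, "red", "uint16", 30.0, 30.0, "EPSG:4326")),
       ("tile_a", (2, "nir", "uint16", 30.0, 30.0, "EPSG:4326")),
   ]

   groups = defaultdict(list)
   for tile, config in tile_bands:
       groups[config].append(tile)  # one entry per unique band configuration

   print(len(groups))  # 2 unique configurations
   print(groups[(1, "red", "uint16", 30.0, 30.0, "EPSG:4326")])  # ['tile_a', 'tile_b']
   ```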
 - mosaic(bands, sync=False, overwrite=False, resolution=None, dtype=None, crs=None)
- Identifies and extracts the required bands from the tile metadata for each unique date. For each band, it creates single-band VRTs that are mosaiced together. These individual band mosaics are finally stacked into a multi-band VRT according to the ordered band arrangement provided.
- Parameters:
- bands (list[str]) – Ordered list of band descriptions to output as VRTs. 
 - Example

   >>> import datetime
   >>> import earth_data_kit as edk
   >>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket-name/path/to/data", "s3")
   >>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 1, 31))
   >>> ds.discover()  # Discover available scene files before stitching
   >>> bands = ["red", "green", "blue"]
   >>> ds.mosaic(bands)  # Use mosaic instead of to_vrts
   >>> ds.save()  # Save the output VRTs to a JSON file
 - save()
- Saves the mosaiced VRTs into a combined JSON file.
- This method should be called after the mosaic() method to save the generated VRTs. The resulting JSON path is stored in the json_path attribute.
- Returns:
- None 
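- As a rough mental model, the saved file is a small JSON document pointing each date at its multi-band VRT. The field names and paths below are hypothetical, not the library's actual schema:

  ```python
  # Hypothetical sketch only -- the real schema written by save() is an
  # earth_data_kit implementation detail; these field names and paths
  # are invented for illustration.
  import json

  catalog = {
      "name": "example_dataset",
      "vrts": {
          "2020-01-01": "/tmp/example_dataset/2020-01-01.vrt",
          "2020-01-02": "/tmp/example_dataset/2020-01-02.vrt",
      },
  }
  text = json.dumps(catalog, indent=2)   # what would be written to json_path
  restored = json.loads(text)            # what a reader would get back
  print(restored["vrts"]["2020-01-01"])  # /tmp/example_dataset/2020-01-01.vrt
  ```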
 
 - to_dataarray()
- Converts the dataset to an xarray DataArray.
- This method opens the JSON file created by save() using xarray with the ‘edk_dataset’ engine and returns the DataArray corresponding to this dataset.
- Returns:
- A DataArray containing the dataset’s data with dimensions for time, bands, and spatial coordinates. 
- Return type:
- xarray.DataArray 
 - Example

   >>> import earth_data_kit as edk
   >>> import datetime
   >>> ds = edk.stitching.Dataset("example_dataset", "s3://your-bucket/path", "s3")
   >>> ds.set_timebounds(datetime.datetime(2020, 1, 1), datetime.datetime(2020, 1, 31))
   >>> ds.discover()
   >>> ds.mosaic(bands=["red", "green", "blue"])
   >>> ds.save()
   >>> data_array = ds.to_dataarray()

 - Note

   This method requires that mosaic() and save() have been called first to generate the JSON file.
 
- static Dataset.dataarray_from_file(json_path)
- Creates an xarray DataArray from a JSON file created by the save() method.
- Automatically determines optimal chunking based on the underlying raster block size.
- Parameters:
- json_path (str) – Path to the JSON file containing dataset information. 
- Returns:
- DataArray with dimensions for time, bands, and spatial coordinates. 
- Return type:
- xarray.DataArray 
 - Example

   >>> import earth_data_kit as edk
   >>> data_array = edk.stitching.Dataset.dataarray_from_file("path/to/dataset.json")

 - Note

   Loads a previously saved dataset without needing to recreate the Dataset object.
- static Dataset.combine(ref_da, das, method=None)
- Combine a list of DataArrays by interpolating each to the grid of the reference DataArray, using the specified interpolation method for each DataArray.
- The reference DataArray (ref_da) and the DataArrays in das are typically returned by the .to_dataarray() method, and are expected to have the dimensions “time”, “band”, “x”, and “y”.
- Parameters:
- ref_da (xarray.DataArray) – The reference DataArray whose grid will be used for interpolation. 
- das (list of xarray.DataArray) – List of DataArrays to combine (excluding the reference DataArray). 
- method (str or list of str, optional) – Interpolation method(s) to use for each DataArray in das. If a single string is provided, it is used for all DataArrays. If a list is provided, it must be the same length as das. Default is “linear” for all. 
 
- Returns:
- Concatenated DataArray with a new ‘band’ dimension, with the reference DataArray as the first band. 
- Return type:
- xarray.DataArray
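
 - Example

   Real usage goes through xarray DataArrays returned by to_dataarray(); the plain-Python sketch below only illustrates the underlying idea of "linear" combination: resample a second array onto the reference grid, then stack the two with the reference first. All grids and values are invented:

   ```python
   # Conceptual sketch of combine(): linearly interpolate a second series
   # onto the reference grid, then stack the results as "bands". The real
   # method operates on xarray DataArrays with time/band/x/y dimensions.
   def interp_linear(xs, ys, x):
       """Piecewise-linear interpolation of the samples (xs, ys) at point x."""
       for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
           if x0 <= x <= x1:
               t = (x - x0) / (x1 - x0)
               return y0 + t * (y1 - y0)
       raise ValueError("x outside the sampled range")

   ref_grid = [0.0, 1.0, 2.0]      # grid of the reference array
   ref_vals = [10.0, 11.0, 12.0]   # reference values, already on ref_grid

   other_grid = [0.0, 2.0]         # coarser grid of a second array
   other_vals = [0.0, 4.0]
   other_on_ref = [interp_linear(other_grid, other_vals, x) for x in ref_grid]

   stacked = [ref_vals, other_on_ref]  # reference first, as combine() orders bands
   print(other_on_ref)  # [0.0, 2.0, 4.0]
   ```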