- Efficiently reads remote gridded data for an Area of Interest (AOI) into xarray.Dataset objects, using dask.distributed for parallelization.
- Transform data for your needs: resample the grid, resample along a time dimension, convert timezone, etc.
- Extract time series data at coordinates and save it to a tabular file (.xlsx, .csv, or .parquet) for use in physical or machine learning models.
- Extendable/modular package architecture supporting open-source contributions, and connections to more datasets/sources.
- Start by cloning this repository locally.
- Next, within a conda terminal, navigate to the local repository location, then create and activate our conda virtual environment using environment_demo.yml. Note that environment_dev.yml currently does not recognize the 'jupyter lab' command.
# mock conda terminal
(base) C://User: cd Path/To/Xarray-DataAccessor
(base) C://User/Path/To/Xarray-DataAccessor conda env create -f environment_demo.yml
...
(base) C://User/Path/To/Xarray-DataAccessor conda activate data_accessor_full
(data_accessor_full) C://User/Path/To/Xarray-DataAccessor
- (optional) If you plan to use the CDSDataAccessor, follow the CDS API setup instructions to allow your computer to interact with the CDS API. Basically, you must manually make a .cdsapirc text file (no extension!) where cdsapi.Client() expects it to be.
- Use the conda-develop develop command pointed at the /src/ directory to make the repo importable.
# mock conda terminal with the env activated
(data_accessor_full) C://User/Path/To/Xarray-DataAccessor conda develop src
# at this point you are ready to open an IDE/Notebook of your choice to run your code!
# For example:
(data_accessor_full) C://User/Path/To/Xarray-DataAccessor jupyter lab
- Finally, import the library into your workflow:
import xarray_data_accessor
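For the optional CDS step above, the .cdsapirc file is a short text file holding a url line and a key line. The sketch below writes one with the standard library. The helper name write_cdsapirc, the URL, and the key are all placeholders, not part of this library; take the real values from your CDS account page. It also assumes cdsapi looks for the file in your home directory by default, so confirm the expected location against the CDS instructions before relying on it.

```python
import tempfile
from pathlib import Path


def write_cdsapirc(url: str, key: str, path: Path) -> Path:
    """Write a two-line CDS API config file (url + key) to `path`.

    Hypothetical helper for illustration; not part of xarray_data_accessor.
    """
    path.write_text(f"url: {url}\nkey: {key}\n", encoding="utf-8")
    return path


# Demo: write into a temporary directory so no real config is overwritten.
demo_dir = Path(tempfile.mkdtemp())
demo_rc = write_cdsapirc(
    url="https://example.invalid/api",  # placeholder, NOT the real CDS URL
    key="YOUR-API-KEY",                 # placeholder, use your own key
    path=demo_dir / ".cdsapirc",
)
demo_text = demo_rc.read_text(encoding="utf-8")
print(demo_text)
```

To create the real file, one would pass something like path=Path.home() / ".cdsapirc" (assuming the default location), taking care not to commit the key to version control.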
All data one can retrieve with this library is organized in a three-tier hierarchy:
- A "data accessor" is a python class that interacts with a given data source.
- Each data accessor can retrieve data from any number of specific datasets.
- For example: CDSDataAccessor accesses the CDS API and can currently be used to access a few ERA5 datasets.
- A specific dataset may be something like "reanalysis-era5-single-levels". Note that the same dataset may be accessible through more than one data accessor.
- Each dataset will contain one or more variables.
To allow this library to be extendable, the "data accessors", the datasets they can access, and the variables that exist in each dataset are not hardcoded anywhere in the repo.
Therefore, to explore what is available, one can use the following xarray_data_accessor.DataAccessorFactory class functions:
from xarray_data_accessor import DataAccessorFactory
# to return a list of all data accessor names
DataAccessorFactory.data_accessor_names()
# to return a dictionary with data accessor names as keys and their respective objects as values
DataAccessorFactory.data_accessor_objects()
# to return a dictionary with data accessor names as keys, and their supported dataset names as values
DataAccessorFactory.supported_datasets()
# to return a list of variable names for a specific data accessor - dataset combination
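# (both arguments are strings; the values below are examples)
DataAccessorFactory.supported_variables(
    data_accessor_name='AWSDataAccessor',
    dataset_name='reanalysis-era5-single-levels',
)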
We also intend to keep documentation about data accessors and their respective datasets updated here.
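The three exploration calls above can be combined to print the whole accessor / dataset / variable hierarchy in one go. The sketch below is only an illustration: build_catalog and DemoFactory are hypothetical names, and DemoFactory is a stand-in that mimics the return shapes described above (a dictionary of dataset names per accessor, and a list of variable names) so the example runs without the library installed.

```python
from typing import Dict, List


def build_catalog(factory) -> Dict[str, Dict[str, List[str]]]:
    """Return {accessor_name: {dataset_name: [variable_names]}}.

    `factory` is any object exposing supported_datasets() and
    supported_variables(data_accessor_name=..., dataset_name=...).
    """
    catalog: Dict[str, Dict[str, List[str]]] = {}
    for accessor_name, dataset_names in factory.supported_datasets().items():
        catalog[accessor_name] = {
            dataset_name: list(
                factory.supported_variables(
                    data_accessor_name=accessor_name,
                    dataset_name=dataset_name,
                )
            )
            for dataset_name in dataset_names
        }
    return catalog


class DemoFactory:
    """Hypothetical stand-in for DataAccessorFactory (assumed return shapes)."""

    @staticmethod
    def supported_datasets() -> Dict[str, List[str]]:
        return {'AWSDataAccessor': ['reanalysis-era5-single-levels']}

    @staticmethod
    def supported_variables(data_accessor_name: str, dataset_name: str) -> List[str]:
        return ['air_temperature_at_2_metres', 'eastward_wind_at_100_metres']


catalog = build_catalog(DemoFactory)
for accessor, datasets in catalog.items():
    for dataset, variables in datasets.items():
        print(accessor, dataset, variables)
```

With the library installed, passing DataAccessorFactory in place of DemoFactory should work the same way, provided its return types match the descriptions above.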
To get data, one can use the get_xarray_dataset() function after specifying a temporal and spatial AOI.
The spatial AOI can be specified with a shapefile, a raster, a list of lat/lon coordinate tuples, or a csv with lat/lon as columns.
The temporal AOI can be specified as a string or a datetime object. Additionally, one can specify a timezone using the timezone parameter.
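To make the coordinate-list option concrete, here is a minimal sketch of how a list of (lat, lon) tuples can be reduced to a rectangular extent. This is not the library's implementation; the function name bounding_box and the north/south/west/east keys are assumptions chosen for illustration, and the simple min/max approach does not handle extents that cross the antimeridian.

```python
from typing import Dict, Iterable, Tuple


def bounding_box(coords: Iterable[Tuple[float, float]]) -> Dict[str, float]:
    """Return the rectangular extent enclosing a set of (lat, lon) points."""
    points = list(coords)
    if not points:
        raise ValueError('At least one (lat, lon) coordinate is required.')
    lats, lons = zip(*points)
    return {
        'north': max(lats),
        'south': min(lats),
        'west': min(lons),
        'east': max(lons),
    }


# Three example points; values are illustrative only.
bbox = bounding_box([(40.0, -105.3), (39.5, -104.9), (40.2, -105.0)])
print(bbox)
```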
In the example below we fetch ERA5 data from AWS for a shapefile defined extent.
import xarray_data_accessor
dataset = xarray_data_accessor.get_xarray_dataset(
    data_accessor_name='AWSDataAccessor',
    dataset_name='reanalysis-era5-single-levels',
    variables=[
        'air_temperature_at_2_metres',
        'eastward_wind_at_100_metres',
    ],
    start_time='2019-01-30',
    end_time='2019-02-02',
    shapefile='path/to/shapefile.shp',
)
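Since ERA5 timestamps are in UTC, a start or end time given in local time has to be shifted before it can be matched against the data. The sketch below shows one way a string or datetime input plus a timezone could be normalized to UTC with the standard library. It is an illustration only, not how this library handles its timezone parameter; to_utc is a hypothetical helper, and a fixed UTC-7 offset is used so the example runs without a timezone database.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Union


def to_utc(time: Union[str, datetime], tz: tzinfo = timezone.utc) -> datetime:
    """Parse `time` if it is an ISO string and return it as a UTC datetime.

    Naive inputs (no tzinfo) are interpreted as local time in `tz`.
    """
    if isinstance(time, str):
        # Date-only strings such as '2019-01-30' parse to midnight.
        time = datetime.fromisoformat(time)
    if time.tzinfo is None:
        time = time.replace(tzinfo=tz)
    return time.astimezone(timezone.utc)


# Fixed offset used for the demo; zoneinfo.ZoneInfo('America/Denver') could be
# passed instead where a timezone database is available.
mountain_standard = timezone(timedelta(hours=-7))
start_utc = to_utc('2019-01-30', mountain_standard)
end_utc = to_utc(datetime(2019, 2, 2, 12, tzinfo=timezone.utc))
print(start_utc.isoformat(), end_utc.isoformat())
```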
Functionality has not been thoroughly tested; documentation is pending.
- Build out base architecture and library design.
- Build CDSDataAccessor to retrieve ERA5 hourly data from the CDS API.
- Build AWSDataAccessor to retrieve ERA5 hourly data from the Planet OS S3 bucket.
- Build a function to spatially resample data.
- Build a function to convert data timezones.
- Build a function to sample data across the time dimension and export to a table file.
- Build a pytest test suite for the two ERA5 data accessors as well as the DataAccessorFactory class functions.
- Set up documentation structure.
- Build a DataAccessorBase implementation to NASA LP-DAAC data (elevation and land cover).
- Build a function to temporally resample data.
- Build a pytest test suite for all the data transformation functions.
- Build a DataAccessorBase implementation to fetch soils data (type and moisture).
- Make it so the library writes a cdsapirc file, then set repo secrets, and enable automated testing.
- Make the package pip installable.