
Preparation of v0.2 #20

Merged: 356 commits merged into master from the v0.2 branch on Jan 21, 2021

Conversation

@coroa (Member) commented Jun 12, 2019

The first official version of atlite is getting closer and there is a bunch of good stuff coming.

Changes

The main change is that a cutout corresponds to a single NetCDF file for the whole cutout period, which is fully accessible as an xarray Dataset at cutout.data (a short usage sketch follows the list below).

  1. This makes it possible to iterate over the data in customizable slices: cutout.wind(shapes=countries, turbine="Vestas_V90_3MW") uses months as in the previous version, while e.g. cutout.wind(..., windows='Y') uses years. windows can be anything that pd.Grouper understands, i.e. 'D', 'M', 'Y' or even '2D'. windows=False makes it possible to apply the conversion function to the data as a whole. It should also be possible to choose windows compatible with a particular time zone, to avoid the re-averaging that was necessary for heat_demand.

  2. The data for cutouts is now grouped into different features (from the ERA-5 dataset):

    features = {
        'height': ['height'],
        'wind': ['wnd100m', 'roughness'],
        'influx': ['influx_toa', 'influx_direct', 'influx_diffuse', 'influx', 'albedo'],
        'temperature': ['temperature', 'soil_temperature'],
        'runoff': ['runoff']
    }

    It's possible to prepare a cutout only for a subset of the available features: cutout.prepare(['runoff', 'wind']). One can always extend the cutout by running prepare again.

  3. One can load a cutout fully into memory using cutout.data.load() (or cutout.data.wnd100m.load() and cutout.data.roughness.load() for wind only), which should fully supersede @euronion's caching from Introduce dataset caching and outsource wind speed extrapolation (#9).

  4. It's easy to get a subset of a cutout via atlite's own sel function, which forwards the selection to the underlying xarray data: cutout.sel(time="2012-01") or cutout.sel(time="2012-07", bounds=german_shape.bounds).
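
A minimal sketch tying these points together (the file name, bounds and exact values are illustrative assumptions based on this discussion, not prescribed by this PR):

import atlite

# A cutout is now a single NetCDF file covering the whole period (ERA5 module).
cutout = atlite.Cutout("germany-2012", module="era5",
                       bounds=(5.0, 47.0, 16.0, 56.0), time="2012")

# Prepare only a subset of the available features; rerunning prepare extends the cutout.
cutout.prepare(["wind", "runoff"])

# Convert wind speeds in yearly windows instead of the default monthly slices.
cf_wind = cutout.wind(turbine="Vestas_V90_3MW", windows="Y")

# Load everything into memory and take a temporal subset.
cutout.data.load()
january = cutout.sel(time="2012-01")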

Open questions

  • config.py has been completely removed; instead, one has to provide the necessary paths explicitly when creating new cutouts. In addition we could allow reading in a config file like ~/.atlite.config or some such?
  • Should data cleaning methods be moved into datasets (i.e. surface roughness <= 0. → 0.002; see the sketch after this list)? I think that would be a good idea! Are there objections?
  • When data is read in as dask arrays, it is not mutable in the conversion functions, leading to exceptions. We can either change everything to copy-on-change (i.e. use clip) or catch the error and raise a more helpful message telling the user to prepare/load the dataset first? Related to Parallelised calculations using dask. #30. To be conservative, we load dask arrays before they are passed to the conversion functions.
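
For the data-cleaning question, a minimal copy-on-change sketch (the function name and placement are assumptions, not a decided design):

import xarray as xr

def clean_roughness(ds: xr.Dataset) -> xr.Dataset:
    # Replace non-positive surface roughness by 0.002 without mutating the
    # original (possibly dask-backed) arrays: .where returns a new array.
    roughness = ds["roughness"].where(ds["roughness"] > 0.0, 0.002)
    return ds.assign(roughness=roughness)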

Remaining TODOs

  • Add a .sel method to produce a view on a subset of the data
  • Mailing list for announcements
  • Release notes (done in documentation branch)
  • Migration instructions
  • Examples should be set up to show warnings: [v0.2] Deprecation warnings are ignored by default #27. (done in documentation branch)
  • Other datasets:
    • sarah
    • cordex
    • ncep
    • efas
  • Merge documentation branch

I'm happy for everyone who wants to test the new version, provide feedback, or help with the documentation or the remaining todos! @leonsn @nworbmot @FabianHofmann @schlott @fneum

@coroa (Member, Author) commented Jun 18, 2019

@euronion : Do you have time to test whether this branch works for you?

The easiest invocation to get a cutout for wind generation covering Germany now is something along the lines of:

germanshapefile = ...
cutout = atlite.Cutout("<cutoutname>", bounds=germanshapefile.buffer(0.2).bounds, time="2012", module="era5")
cutout.prepare(["wind"])

This generates a <cutoutname>.nc file in the current directory containing wind speed and surface roughness in hourly resolution. This file is opened and fully accessible as cutout.data. You can load everything into memory using cutout.data.load().

The previously used wind generation function cutout.wind (like the other conversion functions) now additionally understands a windows argument. Using windows=False applies the conversion to the whole time series at once, the default windows='M' works on monthly slices, and windows='Y' works on yearly slices.
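
For illustration, the three modes side by side (turbine name reused from earlier in this thread; a sketch rather than a definitive API reference):

cf_monthly = cutout.wind(turbine="Vestas_V90_3MW")                 # default: windows='M'
cf_yearly  = cutout.wind(turbine="Vestas_V90_3MW", windows='Y')    # yearly slices
cf_full    = cutout.wind(turbine="Vestas_V90_3MW", windows=False)  # whole time series at once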

@euronion (Collaborator) commented Jun 18, 2019

I'd love to, but I can't spare sufficient resources for it at the moment.
I estimate I can look at it in ~2 weeks at the earliest and then do some testing work.

At the same time I would work on

  • an update of the documentation for the official upcoming version, including
  • some examples to follow along.

I really hope you can wait that long :)

@coroa (Member, Author) commented Jun 18, 2019

@euronion We're not under any particular time pressure right now. I'd like to merge this around mid-July. Would be great if you can integrate your changes. I think it would be ideal to work on pull requests against this branch!

@euronion (Collaborator) commented:

@coroa Yes, of course, only PRs against this branch. Mid-July sounds good.

@coroa (Member, Author) commented Jun 18, 2019

Atlite's current version 0.0.2 is now available from PyPI as well as conda-forge. The version prepared in this branch will be tagged as 0.1 (sic!) as soon as I merge this branch (the branch name will stay v0.2 for the time being).

@coroa changed the title from "Preparation of v0.2" to "Preparation of v0.1" on Jun 18, 2019
@euronion (Collaborator) commented:

Hi @coroa,
other things on my side took longer than expected. I started working on the changes this week;
expect some PRs from me at the end of this week or the start of next week :)

@euronion (Collaborator) commented:

Cross-checking:
Dropping support for Python 2.X is understandable, but was it intended to also drop support for <3.6?
I came across the use of f-strings which are not supported on lower Python versions.

In any case, I think best practice is to include the Python version requirement in setup.py, e.g. with

python_requires='>=3.6'
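
A minimal setup.py sketch with that requirement (the remaining metadata here is illustrative, not atlite's actual setup.py):

from setuptools import setup, find_packages

setup(
    name="atlite",
    packages=find_packages(),
    python_requires=">=3.6",  # f-strings need Python 3.6 or newer
)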

@coroa (Member, Author) commented Jul 23, 2019

Semi-intended. I don't want to miss out on f-strings anymore, and even Debian stable these days already ships with Python 3.7, so requiring 3.6 seems fair to me. Is there any reason to support Python 3.5 or lower?

Making the requirement explicit in setup.py and the conda-forge recipe should be done, true.

@euronion (Collaborator) commented Jul 23, 2019

Mainly just asking.
The only ad-hoc reason would be that you might want to apply the same restrictions on the Python version for atlite as for e.g. pypsa.

  • Add python version requirement >= 3.6 to setup.py.

@euronion (Collaborator) commented Jul 29, 2019

The windows=... option does not return an error after changing base_string to str and seems to work as intended (good).
But I do not understand its purpose; I would need another explanation / example / documentation for it.

@coroa (Member, Author) commented Jul 30, 2019

windows is an argument available to any function wrapped by a @requires_windowed decorator. It allows you to choose how to traverse the cutout.data dataset.

Omitting it is equivalent to passing windows="M". This will break cutout.data into monthly chunks, so that the for-loop in cutout.convert_and_aggregate iterates over slices of a month. Internally, the windows argument is converted to an iterator by code similar to the following:

windows = xr.core.groupby.DatasetGroupBy(self.data, self.data['time'], grouper=pd.Grouper(freq="M"))._iter_grouped()

over which the convert_and_aggregate function then iterates and calls the conversion functions.

Other allowed strings are "2M" for slices of two months, "D" for days, and "Y" for years. If you supply an integer, it is used to feed the bins argument of xr.core.groupby.DatasetGroupBy; for instance windows=2 splits the time axis in two. We will have to experiment with what good choices are.
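
In terms of the public xarray API, the frequency-based windows roughly correspond to iterating over a resampled dataset (a sketch of the idea only, not atlite's internal code):

# Roughly what windows="M" does: iterate over monthly slices of cutout.data
for label, month in cutout.data.resample(time="M"):
    # a conversion function would be applied to each monthly slice `month`
    print(label, dict(month.sizes))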

This grouping mechanism can be switched off using windows=False. For a regular dataset as loaded from a cutout NetCDF file, xarray will then try to convert the whole data in one go. If your memory is big enough and you want to do a lot of repetitions it's probably best to preload the wind data

for feature in cutout.dataset_module.features["wind"]:
    cutout.data[feature].load()

and convert it in one go

cutout.wind(turbine=..., windows=False)

What needs to be investigated a bit further is the possibility to use dask automatically for the heavy lifting:
cutout.data.chunk(time=744) will split the dataset into approximately monthly chunks, and then cutout.wind(turbine=..., windows=False) will use dask to do the wind conversion for these monthly chunks. With the right configuration of dask this should even happen in parallel on multiple processors.
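
A sketch of that dask-driven variant (the dict form of .chunk is used for compatibility with older xarray versions, and assigning the chunked dataset back onto the cutout is my assumption, not a confirmed part of the API):

# 31 days * 24 h = 744 hourly steps, i.e. roughly monthly chunks
cutout.data = cutout.data.chunk({"time": 744})

# windows=False then lets dask drive the conversion chunk by chunk,
# potentially in parallel with a suitable dask scheduler
generation = cutout.wind(turbine="Vestas_V90_3MW", windows=False)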

@euronion (Collaborator) commented:

What the windows argument does is becoming somewhat clear.
Still, I do not see much benefit in or reason for it.
Except for cosmetic changes (the resolution of the progress bar) and (maybe a slight) performance difference, it does not do much?

Preloading features has become really easy and convenient with the change.
xarray and dask also do a nice job of keeping parts of cutout.data loaded automatically (not all of it though, so preloading still gives a few seconds of extra performance for repeated calculations).

As a universal and really simple solution for the chunk sizes, using the new auto keyword from dask could work.
I.e. what I (successfully) tried was

cutout.data.chunk({dim:'auto' for dim in cutout.data.dims})

I doubt we will find a better universal solution for the chunk sizes. This has been a long-standing issue for xarray and dask and depends on the cutout size (spatial and temporal) as well as configuration, hardware setup and use case.
But that should not come as a surprise; it is always an issue when optimising parallelisation.

E.g.: I was playing around with it a bit today, and the best I could do was 8 s/iteration without dask, ~18 s with dask and chunk={'time': 2500}, and ~11 s with the auto-chunk above.

@coroa (Member, Author) commented Jul 30, 2019

The windows machinery enabled switching out the data backend completely, while keeping compatibility with non-dask-ready conversion functions.

@coroa (Member, Author) commented Jul 30, 2019

I'd be open to throwing it out in a separate PR in which we transition everything to dask and don't incur huge performance losses in the process.

The bottleneck will be the conversion of the pv module. When I originally tried to implement it using dask only, a couple of years ago, I regularly broke dask in the process, which is why the original atlite version finally iterated through separate files. It's possible that dask has improved enough by now, but we will have to clock and measure it; until then we'll need windows.

@euronion (Collaborator) commented:

No, let's have the windows feature included in the upcoming version.
The idea is powerful and should be the way to go in the future.

Not knowing what went wrong back then with dask, I'd say we could try again. For this use-case dask does not always seem to be the most performant choice. As you say: if we try it, we should also clock and measure it.

@coroa changed the title from "Preparation of v0.1" to "Preparation of v0.2" on Aug 8, 2019
FabianHofmann and others added 27 commits June 16, 2020 15:37
* ci: use mamba instead of conda

* follow up, add comment [skip travis]

* follow up

* follow up, fix conda activate

* ci: playaround, remove conda specifications
* irradiation.py: replace .clip by .where due to new numpy/dask incompatibility

* follow up, only apply .where where necessary
* enable ci on windows

* data.py: use TemporaryDirectory instead of mkdtemp

* data.py revert last commit, try now with wrapper

* fix travis env for windows machines

* follow up: write pip and pytest dependencies in env file

* env: add libspatialindex to requirements

* travis: reintroduce strict channel order due to installation problems on windows
* introduce Cutout.grid
make Cutout.grid_cells and Cutout.grid_coordinates deprecated

* follow up

* adjust plotting example

* update release notes

* test_creation.py adjust test

* test: tiny fix up

* add crs to Cutout.grid

* follow up: add comment [skip travis]

* release notes: fix typo [skip travis]
* Rename projection to crs

Follows pyproj in nomenclature. See https://pyproj4.github.io/pyproj/stable/gotchas.html#upgrading-to-pyproj-2-from-pyproj-1 .

* environment: Remove channel pinning

Channel pinning has been superseded by strict channel_priority as
proposed at https://conda-forge.org/docs/user/tipsandtricks.html.

* gis: Add grid_cell_areas function to compute areas of grid cells

* cutout: Fix forgotten conversion

* gis: Improve grid_cell_areas

* remove area calculation due to geopandas implementation

* update release notes

* gis.py: revise imports

Co-authored-by: Fabian <[email protected]>
* gebco: Extract and resample data from GEBCO using rasterio

* tiny fixup of inversed y-axis and data array accessing

* fix numeric tags

Co-authored-by: Fabian <[email protected]>
* * add warning for ignoring cutoutparams if cutout already exists
* reintroduce Cutout.prepared

* follow up
cutout.py make prepared features more secure
* cutout.py add merge function
pytest add merge test

* cutout.py: when data is passed and path is non-existent, write out file
path in cutout.merge and cutout.sel has to be non-existent

* adjust docstrings

* revert second last commit, add cutout.to_file function

* revert unneeded assert

* follow up: update docstrings [skip travis]
solar_position.py: safer/cleaner approach for chunking
* convert.py catch case of no layout given

* convert.py: restructure convert_and_aggregate for correctly handling all input combinations

* test: pv add rounding to assert
@FabianHofmann merged commit 12846ce into master on Jan 21, 2021
@FabianHofmann (Contributor) commented:

finally :)

@FabianHofmann deleted the v0.2 branch on March 3, 2021 20:46