Speed up get_formatted_array #170
Here's one measurement: for my 502 locs, 12 techs, 2208 timesteps model, this is crashing runs on the cluster as I'm hitting memory limits. It's especially unfortunate as the model is already solved at that point in time. Furthermore, postprocessing now takes 1/3 or 1/4 of the entire computation time for my models (2-4h), a good part of which may be due to `get_formatted_array`. I will try and see whether I can improve the code. The cleanest solution would probably be to avoid the `::`-concatenated strings in the first place.
Memo to myself: instead of the problematic line, a DataArray can be built directly from the MultiIndexed DataFrame:

```python
updated_data_var = xr.DataArray(
    data_var_df.values,
    [("loc_techs_carriers", data_var_df.index), ("timesteps", data_var_df.columns)]
)
updated_data_var.sel(locs="my-loc")  # example use
```

It takes milliseconds to execute and hardly any RAM. EDIT: That is, it's not the string formatting. No wonder, there are only about 1e3 strings to format in my case. Still, should all of this work, this could be a great general solution for avoiding the memory blow-up.
@timtroendle, I actually just switched off postprocessing on the cluster in my runs, due to the same issue... Anyway, I had some stuff waiting to go on this; see PR #231 for a working branch that you could test with. It may still blow up on unstacking the MultiIndex (but my memory profiling suggests a much lower memory use than the previous incarnation of `get_formatted_array`). I'm a bit confused by your solution, though: how does it go from a MultiIndex to letting you select individual locs and techs?
@brynpickering I do not quite understand your solution and why it is so great, but it is so great! Memory and runtime are acceptable for my results: I think I saw 7 GB spikes and had a few seconds runtime. Now it sits there with about 4 GB RAM. My solution was using a MultiIndex without unstacking it, which leads to much, much smaller arrays: considering they are as sparse as mine are, around three orders of magnitude. You can select locations and technologies as I show above, but you cannot do everything the unstacked form allows. I'll need to explore your solution a bit more, but for now it looks good for my issues at least. Maybe I'll use my own routine with the MultiIndex as well, if I can work around the issues it has. And if that appears useful, we can think of adding it to the core as an option.
And should there be support for sparse matrices in xarray at some point, we can have the benefits of both approaches combined. BTW, I didn't know you could switch off postprocessing -- was that a hack, or is there an option?
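The stacked-vs-unstacked size difference can be sketched with a toy pandas example (all names and numbers here are illustrative, not from the models above):

```python
import pandas as pd

# Hypothetical sparse system: each location runs a single, different tech,
# so only 3 of the 9 possible (loc, tech) pairs actually exist.
idx = pd.MultiIndex.from_tuples(
    [("r1", "pv"), ("r2", "wind"), ("r3", "ccgt")], names=["locs", "techs"]
)
stacked = pd.Series([1.0, 2.0, 3.0], index=idx)

stacked.size            # 3 values stored while the MultiIndex is kept
stacked.unstack().size  # 9 values once unstacked; 6 of them are NaN padding
```

With realistic loc/tech counts the dense form grows as the product of the dimension sizes, which is where the orders-of-magnitude difference comes from.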
Great! Previously, it was turning the DataArray into a pandas DataFrame, splitting the loc_tech index string, then turning it back into a DataArray. Now it is just turning the loc_tech index into a pandas index, doing a string operation to create a MultiIndex of locs and techs, replacing the loc_tech dimension with the new MultiIndex, and unstacking that MultiIndex. So, if you stopped this script just before that last step, you'd get the loc_tech (or loc_tech_carrier) index as a MultiIndex in the returned DataArray. I'd be happy enough to have the option of returning just the MultiIndexed DataArray, given that the unstacking causes a small memory increase. Perhaps we can expose that as an option.
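As a rough sketch of that pipeline (variable names and the toy data are illustrative, not Calliope's actual internals):

```python
import numpy as np
import pandas as pd
import xarray as xr

# A toy stand-in for a model variable indexed by concatenated "loc::tech" strings.
da = xr.DataArray(
    np.arange(4.0),
    dims=["loc_techs"],
    coords={"loc_techs": ["r1::pv", "r1::wind", "r2::pv", "r2::wind"]},
)

# String-split the concatenated index into separate locs/techs coordinates ...
locs, techs = zip(*(v.split("::") for v in da.loc_techs.values))
da = da.assign_coords(locs=("loc_techs", list(locs)), techs=("loc_techs", list(techs)))

# ... combine them into a MultiIndex on the loc_techs dimension ...
da = da.set_index(loc_techs=["locs", "techs"])

# ... and, as the final step, unstack into a dense 2-D (locs x techs) array.
unstacked = da.unstack("loc_techs")
```

Stopping just before the final `unstack` call leaves the MultiIndexed DataArray, which is the lower-memory form discussed above.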
Very much a hack...

Good idea.

Some ideas in here for further improvements to how we deal with arrays.
Problem description

Using `get_formatted_array` splits the `loc::techs` and `loc::tech::carriers` string sets and interacts between xarray and pandas to produce a sparse matrix for easier indexing (e.g. summing over a single `tech`). This can take a very long time for large DataArrays, and has been recorded as hitting memory limits on some devices. So, it should be made more efficient. This could be a matter of defining `loc::techs` etc. as tuples instead of `::`-concatenated strings. Then they automagically are parsed as a MultiIndex, instead of needing to apply string operations.

Calliope version

0.6.3