Better Garbage Collection when Loading hdf5 Data #439

Closed
zeyueN opened this issue Jul 18, 2022 · 3 comments · Fixed by #451

zeyueN commented Jul 18, 2022

Describe the bug
Currently, running SimulationData.from_file to read an hdf5 file consumes about 3 times the RAM it should. This causes OOM errors when loading large files, especially on our cloud instances where RAM is limited (10GB max) and loading ~4GB of data already fails.

To Reproduce
Load a simulation data hdf5 and monitor the RAM usage with a profiler/debugger/htop.
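For example, a minimal repro sketch using psutil to track resident memory (the file name sim_data.hdf5 is a placeholder for any large SimulationData file):

```python
import os

import psutil
import tidy3d as td

process = psutil.Process(os.getpid())
rss_before = process.memory_info().rss

# Load a large SimulationData hdf5 file (placeholder path).
sim_data = td.SimulationData.from_file("sim_data.hdf5")

rss_after = process.memory_info().rss
file_size = os.path.getsize("sim_data.hdf5")

# Observed behavior per this issue: the RSS increase is roughly 3x the file size.
print(f"file size:    {file_size / 1e9:.2f} GB")
print(f"RSS increase: {(rss_after - rss_before) / 1e9:.2f} GB")
```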

Expected behavior
RAM usage roughly equal to the size of the data.

Desktop:

  • OS: Linux
  • Version: v1.4.1

Potential fixes
There seem to be two places causing this excess RAM usage:

  1. monitor_data_dict hanging around after being used instead of being garbage collected, here.
  2. .normalize() making a copy of itself, here.

1 is fairly easy to fix: either explicitly garbage collect the monitor_data_dict variable, or skip assigning it to a variable and pass it directly to the cls() call below.
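As an illustration, here is a minimal self-contained sketch of that pattern; FakeSimulationData and load_monitor_data are hypothetical stand-ins for the tidy3d internals, not the actual code:

```python
import gc

# Hypothetical stand-ins for the tidy3d internals referenced above.
def load_monitor_data(fname):
    return {"monitor_name": bytearray(10**6)}  # pretend this is GBs of data

class FakeSimulationData:
    def __init__(self, monitor_data):
        self.monitor_data = dict(monitor_data)  # pydantic-style defensive copy

def from_file_before(fname):
    monitor_data_dict = load_monitor_data(fname)
    sim_data = FakeSimulationData(monitor_data=monitor_data_dict)
    # Any further work here (e.g. normalization) runs with BOTH the original
    # dict and the constructor's copy alive, since the local name holds on.
    return sim_data

def from_file_after(fname):
    # Fix: pass the loaded dict straight to the constructor, so no extra
    # reference keeps the temporary alive once the copy has been made.
    sim_data = FakeSimulationData(monitor_data=load_monitor_data(fname))
    gc.collect()  # optionally nudge the collector right away
    return sim_data
```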

2 perhaps involves more changes design-wise, but a short-term fix would be great, e.g. an option to skip the copy.
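For instance, a hedged sketch of what such an option could look like (MonitorDataSketch and its normalize are illustrative, not tidy3d's actual API):

```python
import numpy as np

class MonitorDataSketch:
    """Illustrative only; assumes float-valued data arrays."""

    def __init__(self, values: np.ndarray):
        self.values = values

    def normalize(self, source_spectrum: np.ndarray, copy: bool = True):
        """Divide out the source spectrum. With copy=False, mutate in place
        so two full-size arrays never coexist."""
        if copy:
            return MonitorDataSketch(self.values / source_spectrum)
        self.values /= source_spectrum  # in-place, no second allocation
        return self
```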

@momchil-flex (Collaborator)

Thanks. We're just wrapping up a big data-structure reorganization; we'll check whether this is still an issue and fix it in the next release.

@tylerflex (Collaborator)

Hi @zeyueN, we just merged our very comprehensive refactor of the tidy3d data structures #425 into the develop branch. I believe it should mitigate some of the memory issues here, as we no longer normalize the SimulationData when loading from file. Instead, a normalized copy of the individual monitor data is returned when accessed via sim_data[monitor_name]. We will do a bit of internal testing and let you know how the memory handling is when working with data around 4GB in size.
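In other words, the access pattern looks roughly like this (an illustrative sketch of the behavior described, not the actual tidy3d implementation):

```python
class SimulationDataSketch:
    def __init__(self, monitor_data: dict):
        self.monitor_data = monitor_data  # stored as loaded, not normalized

    def __getitem__(self, monitor_name: str):
        # Normalization is deferred to access time, so only the requested
        # monitor's data is copied -- loading never duplicates everything.
        raw = self.monitor_data[monitor_name]
        return raw.normalize()  # hypothetical per-monitor normalize
```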

@momchil-flex (Collaborator)

@tylerflex note, however, that if a user has one large monitor's data and does something like

```python
sim_data = td.SimulationData.from_file("sim_data.hdf5")
mon_data = sim_data["monitor_name"]
```

they will still have two copies of the data in memory. If we want to fix that, I think we should just normalize the data when loading from file rather than returning a normalized copy. Similarly, we should make sure that apply_symmetry only returns a copy if symmetry was actually applied, and otherwise returns a view into the data array.
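A sketch of that suggested apply_symmetry behavior, under the assumption that symmetry is a per-axis flag (the names and mirroring logic here are illustrative, not tidy3d's actual code):

```python
import numpy as np

class SymmetryDataSketch:
    def __init__(self, values: np.ndarray, symmetry=(0, 0, 0)):
        self.values = values
        self.symmetry = symmetry

    def apply_symmetry(self):
        # Only pay for a copy when symmetry actually expands the data;
        # otherwise hand back the existing object with no new allocation.
        if all(dim == 0 for dim in self.symmetry):
            return self
        expanded = self.values
        for axis, dim in enumerate(self.symmetry):
            if dim != 0:
                # Illustrative expansion: mirror the data along this axis.
                mirrored = np.flip(expanded, axis=axis)
                expanded = np.concatenate([mirrored, expanded], axis=axis)
        return SymmetryDataSketch(expanded, symmetry=(0, 0, 0))
```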

tylerflex linked a pull request on Aug 4, 2022 that will close this issue