Better Garbage Collection when Loading hdf5 Data #439

Closed
zeyueN opened this issue Jul 18, 2022 · 3 comments · Fixed by #451

zeyueN commented Jul 18, 2022

Describe the bug
Currently, running SimulationData.from_file to read an hdf5 file consumes about 3 times the RAM it should. This causes OOM errors when loading large files, especially on our cloud instances where RAM is limited (10GB max) and loading ~4GB of data already fails.

To Reproduce
Load a simulation data hdf5 and monitor the RAM usage with a profiler/debugger/htop.
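For example, a minimal repro sketch using psutil to track resident memory (the file name sim_data.hdf5 is a placeholder for any large SimulationData file):

```python
import os

import psutil
import tidy3d as td

process = psutil.Process(os.getpid())
rss_before = process.memory_info().rss

# Load a large SimulationData hdf5 file (placeholder path).
sim_data = td.SimulationData.from_file("sim_data.hdf5")

rss_after = process.memory_info().rss
file_size = os.path.getsize("sim_data.hdf5")

# Observed behavior per this issue: the RSS increase is roughly 3x the file size.
print(f"file size:    {file_size / 1e9:.2f} GB")
print(f"RSS increase: {(rss_after - rss_before) / 1e9:.2f} GB")
```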

Expected behavior
RAM usage roughly equal to the size of the data.

Desktop:

  • OS: Linux
  • Version: v1.4.1

Potential fixes
There seem to be two places causing this excess RAM usage:

  1. monitor_data_dict hanging around after being used instead of being garbage collected, here.
  2. .normalize() making a copy of itself, here.

1 is fairly easy to fix: either explicitly garbage collect the monitor_data_dict variable, or skip assigning it to a variable and pass it directly to the cls() call below.
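As an illustration, here is a minimal self-contained sketch of that pattern; FakeSimulationData and load_monitor_data are hypothetical stand-ins for the tidy3d internals, not the actual code:

```python
import gc

# Hypothetical stand-ins for the tidy3d internals referenced above.
def load_monitor_data(fname):
    return {"monitor_name": bytearray(10**6)}  # pretend this is GBs of data

class FakeSimulationData:
    def __init__(self, monitor_data):
        self.monitor_data = dict(monitor_data)  # pydantic-style defensive copy

def from_file_before(fname):
    monitor_data_dict = load_monitor_data(fname)
    sim_data = FakeSimulationData(monitor_data=monitor_data_dict)
    # Any further work here (e.g. normalization) runs with BOTH the original
    # dict and the constructor's copy alive, since the local name holds on.
    return sim_data

def from_file_after(fname):
    # Fix: pass the loaded dict straight to the constructor, so no extra
    # reference keeps the temporary alive once the copy has been made.
    sim_data = FakeSimulationData(monitor_data=load_monitor_data(fname))
    gc.collect()  # optionally nudge the collector right away
    return sim_data
```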

2 perhaps involves more changes design-wise, but a short-term fix would be great, e.g. an option to skip the copy.
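For instance, a hedged sketch of what such an option could look like (MonitorDataSketch and its normalize are illustrative, not tidy3d's actual API):

```python
import numpy as np

class MonitorDataSketch:
    """Illustrative only; assumes float-valued data arrays."""

    def __init__(self, values: np.ndarray):
        self.values = values

    def normalize(self, source_spectrum: np.ndarray, copy: bool = True):
        """Divide out the source spectrum. With copy=False, mutate in place
        so two full-size arrays never coexist."""
        if copy:
            return MonitorDataSketch(self.values / source_spectrum)
        self.values /= source_spectrum  # in-place, no second allocation
        return self
```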

@momchil-flex (Collaborator)

Thanks. We're just wrapping up a big data-structure reorganization; we'll check whether this is still an issue and fix it in the next release.

@tylerflex (Collaborator)

Hi @zeyueN, we just merged our very comprehensive refactor of the tidy3d data structures #425 into the develop branch. I believe it should mitigate some of the memory issues here, as we no longer normalize the SimulationData when loading from file. Instead, a normalized copy of the individual monitor data is returned when accessed via sim_data[monitor_name]. We will do a bit of internal testing and let you know how the memory handling is when working with data around 4GB in size.
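In other words, the access pattern looks roughly like this (an illustrative sketch of the behavior described, not the actual tidy3d implementation):

```python
class SimulationDataSketch:
    def __init__(self, monitor_data: dict):
        self.monitor_data = monitor_data  # stored as loaded, not normalized

    def __getitem__(self, monitor_name: str):
        # Normalization is deferred to access time, so only the requested
        # monitor's data is copied -- loading never duplicates everything.
        raw = self.monitor_data[monitor_name]
        return raw.normalize()  # hypothetical per-monitor normalize
```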

@momchil-flex (Collaborator)

@tylerflex note, however, that if a user has one large monitor's data and does something like

```python
sim_data = td.SimulationData.from_file("sim_data.hdf5")
mon_data = sim_data["monitor_name"]
```

they will still have two copies of the data in memory. If we want to fix that, I think we should just normalize the data when loading from file rather than returning a normalized copy. Similarly, we should make sure that apply_symmetry only returns a copy if symmetry was actually applied, and otherwise returns a view into the data array.
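A sketch of that suggested apply_symmetry behavior, under the assumption that symmetry is a per-axis flag (the names and mirroring logic here are illustrative, not tidy3d's actual code):

```python
import numpy as np

class SymmetryDataSketch:
    def __init__(self, values: np.ndarray, symmetry=(0, 0, 0)):
        self.values = values
        self.symmetry = symmetry

    def apply_symmetry(self):
        # Only pay for a copy when symmetry actually expands the data;
        # otherwise hand back the existing object with no new allocation.
        if all(dim == 0 for dim in self.symmetry):
            return self
        expanded = self.values
        for axis, dim in enumerate(self.symmetry):
            if dim != 0:
                # Illustrative expansion: mirror the data along this axis.
                mirrored = np.flip(expanded, axis=axis)
                expanded = np.concatenate([mirrored, expanded], axis=axis)
        return SymmetryDataSketch(expanded, symmetry=(0, 0, 0))
```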

tylerflex linked a pull request on Aug 4, 2022 that will close this issue