
performance improvements for saving / loading .hdf5 #451

Merged (2 commits into develop, Aug 5, 2022)

Conversation

@tylerflex (Collaborator) commented Aug 3, 2022:

For a SimulationData.hdf5 file of size 4 GB on disk:

  • Requires 4.21 GB of memory to read and write (measured with the memory_profiler python package).
  • About 11 seconds of load time on my machine; a few seconds to write.

@tylerflex changed the title from "added memory test for saving / loading .hdf5" to "performance improvements for saving / loading .hdf5" on Aug 3, 2022
@tylerflex force-pushed the tyler/hdf5_memory branch 3 times, most recently from ea968e7 to 6c33f02 on August 3, 2022 12:22
@@ -307,6 +307,9 @@ def unpack_dataset(dataset: h5py.Dataset) -> Any: # pylint:disable=too-many-ret
return [val.decode("utf-8") for val in value]
if value.dtype == bool:
return value.astype(bool)
# handle xarray datasets implicitly (retain np.ndarray type)
if len(value.shape) >= 4:
@momchil-flex (Collaborator):

What is this about and why 4?

@tylerflex (Collaborator, Author):

So, admittedly this is a bit hacky, but here's the explanation: this function loads the value out of an hdf5 dataset to be placed in a dictionary that is eventually fed to cls.parse_raw(). Oftentimes the data is of type np.ndarray, including things like size, center, vertices, and DataArray data (or values). .from_file() was slow before because it converted all np.ndarray values with .tolist(), since Tidy3d doesn't know how to handle np.ndarray. However, for large DataArray objects this was slow and unnecessary, because xarray then needs to convert that list back to np.ndarray. So this condition checks whether the numpy array has 4 or more dimensions (a scalar field data array) and, if so, just keeps the type as np.ndarray. Maybe we could make this more explicit and set a flag in the dataset indicating whether to load as array or list?
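
For reference, here is a minimal sketch of the branch under discussion (the other return branches of the real unpack_dataset are elided, and the dimension names in the comment are an assumption):

    import h5py
    import numpy as np

    def unpack_dataset(dataset: h5py.Dataset):
        """Get the value contained in a dataset in a form ready to insert into the final dict."""
        value = dataset[()]  # read the whole dataset into memory
        # ... string / bool / scalar branches of the real function elided ...
        if isinstance(value, np.ndarray):
            # handle xarray datasets implicitly (retain np.ndarray type):
            # only large scalar field data has 4+ dimensions (e.g. x, y, z, f),
            # so skip the expensive .tolist() round trip for those
            if len(value.shape) >= 4:
                return value
            return value.tolist()
        return value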

@momchil-flex (Collaborator):

What I don't understand is: why is this OK? Why is the .tolist() conversion needed in the first place, and why doesn't something break when we don't convert certain arrays to list? The function docstring says "Gets the value contained in a dataset in a form ready to insert into final dict." So is an ndarray OK or not? It seems like either we should be converting all numpy arrays to list or none of them. Or an intermediate option might be if len(value.shape) > 1, if we are iterating through arrays and only want to convert the innermost one to list?

@tylerflex (Collaborator, Author):

np.ndarray is only acceptable if it is a DataArray. For everything else, it's not OK. For example, make a box with

td.Box(size=np.array([1, 2, 3]))

and it will fail, because Tidy3dBaseModel doesn't recognize np.ndarray.

So the idea is that we just want to return whatever type we can feed to initialize the object. Does that make sense?
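
Concretely (assuming the usual import tidy3d as td convention; the values are made up), the fix at load time is just to hand Tidy3d a plain list:

    import numpy as np
    import tidy3d as td

    value = np.array([1.0, 2.0, 3.0])  # e.g. a "size" read back from an hdf5 file

    # td.Box(size=value)               # fails: Tidy3dBaseModel doesn't accept np.ndarray
    box = td.Box(size=value.tolist())  # works: a plain list/tuple is accepted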

@momchil-flex (Collaborator):

I think I see. But, for example, all the coords of the xarray DataArray will still be converted to list? Generally it seems quite sketchy (what if we introduce 3D scalar data in the future?), but I'm fine with this if it noticeably improves things and there's no easy but better fix.

@tylerflex (Collaborator, Author):

Yeah, the way I have it right now, coords and n-dimensional scalar data (where n < 4) will be converted to lists and back. Maybe the easiest thing to do is to keep np.ndarray only in the xarray bits and save everything else as list. I'll give it a try.

@tylerflex linked an issue on Aug 4, 2022 that may be closed by this pull request
@momchil-flex (Collaborator):

Note also my last comments in the linked issue. However, with respect to avoiding a MonitorData copy during __getitem__, I can't really come up with a good way to do that while keeping the option for the user to change the normalize_index (the only way the user can do that is something like SimulationData.copy(update=dict(normalize_index=1)), right?).

Well, one way I could come up with is this: self.monitor_data is exactly what is written in the hdf5 file, and we bring back a normalize(new_normalize_index) method that returns a copy of the SimulationData with all of the contained data first renormalized by 1 / spectrum(old_normalize_index) (if old_normalize_index is not None), and then normalized by spectrum(new_normalize_index). Note that the 1 / should not pose any problems, because the data normalize function already does something like 1 / spectrum.

This may sound a bit ugly, but it does solve some problems:

  • Avoids the confusion users may have about why the data in the file is different from the data in python.
  • Gives us a way to write the normalized data to file for the post-run viz (currently there is no easy way, I think).
  • Avoids making a copy when you do sim_data["monitor_name"].

Basically, we will normalize by source index 0 on the backend already, to get the same default behavior as now, but with the .hdf5 file matching what's returned by sim_data["monitor_name"]. So on the backend, to write the final file, we do sim_data.normalize(normalize_index=0).to_file("simulation_data.hdf5"). This is what users download. Then they can modify as they wish through sim_data.normalize. The renormalization does seem a bit odd, but I like how it solves a number of issues we have right now.
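
A rough sketch of the proposed scheme (everything here is hypothetical, not an existing API; spectrum() stands in for however the source spectrum is computed, and normalize_index is assumed to be stored on the model):

    # hypothetical sketch of the proposed SimulationData.normalize
    def normalize(sim_data, new_normalize_index: int):
        """Return a copy of sim_data with all monitor data normalized by the new source index."""

        def renormalize(data):
            if sim_data.normalize_index is not None:
                # undo the old normalization, i.e. "normalize by 1 / spectrum(old_index)"
                data = data * sim_data.spectrum(sim_data.normalize_index)
            # apply the new normalization (the existing data normalize
            # function already does something like 1 / spectrum)
            return data / sim_data.spectrum(new_normalize_index)

        new_monitor_data = {name: renormalize(dat) for name, dat in sim_data.monitor_data.items()}
        return sim_data.copy(
            update=dict(monitor_data=new_monitor_data, normalize_index=new_normalize_index)
        )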

@tylerflex (Collaborator, Author):

> Well, one way I could come up with is this: self.monitor_data is exactly what is written in the hdf5 file, and we bring back a normalize(new_normalize_index) method that returns a copy of the SimulationData with all of the contained data first renormalized by 1 / spectrum(old_normalize_index) (if old_normalize_index is not None), and then normalized by spectrum(new_normalize_index).

I'm confused though: wouldn't this run into the same issue, because you'd then have two copies of the sim_data in memory? For example, would the following require twice the memory of sim_data?

sim_data = sim_data.normalize(1)

@tylerflex (Collaborator, Author) commented Aug 5, 2022:

I added a commit that speeds up loading a 4 GB hdf5 from 14 s to 4 s. A bit of background explanation:

If I have a value (1, 2, 3) that gets saved to an hdf5 file, it gets loaded back as np.ndarray and is indistinguishable (as far as I can tell) from an np.array([1, 2, 3]) that was saved to the file. So there is no way to tell whether the data "should" be loaded as np.ndarray or tuple. The ideal solution would be to load everything as np.ndarray and then be able to feed these arrays to Tidy3d, which would call .tolist() on them wherever np.ndarray is not an acceptable type (for example, for center or vertices).
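
A quick demonstration of the ambiguity (the file name is arbitrary):

    import h5py
    import numpy as np

    with h5py.File("demo.hdf5", "w") as fhandle:
        fhandle.create_dataset("as_tuple", data=(1, 2, 3))
        fhandle.create_dataset("as_array", data=np.array([1, 2, 3]))

    with h5py.File("demo.hdf5", "r") as fhandle:
        print(type(fhandle["as_tuple"][()]))  # <class 'numpy.ndarray'>
        print(type(fhandle["as_array"][()]))  # <class 'numpy.ndarray'>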

For now (until we implement something like that, which is a bit tricky), I pushed a commit that makes the distinction explicit and also improves the performance further.

The two main changes:

  1. Added a keep_numpy : bool = False flag that gets passed around (and stored in the file) and explicitly tells whether to keep a given object as numpy or not (see the sketch after this list).
  2. Removed unnecessary np.array() conversions in the DataArray validators, which the profiler showed were slowing things down.
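
A minimal sketch of how change (1) could look on the hdf5 side (the function names and the attrs-based storage are assumptions; the actual commit may store the flag differently):

    import h5py
    import numpy as np

    def pack_dataset(group: h5py.Group, name: str, value, keep_numpy: bool = False) -> None:
        """Write a value and record whether it should be loaded back as np.ndarray."""
        dataset = group.create_dataset(name, data=value)
        dataset.attrs["keep_numpy"] = keep_numpy  # flag stored in the file

    def unpack_dataset(dataset: h5py.Dataset):
        """Load a value, keeping np.ndarray only where the stored flag says so."""
        value = dataset[()]
        if dataset.attrs.get("keep_numpy", False):
            return value  # e.g. DataArray values: skip the .tolist() round trip
        return value.tolist() if isinstance(value, np.ndarray) else value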

@momchil-flex (Collaborator):

Ah, this looks much better now, thanks!

@momchil-flex merged commit 2ae68d4 into develop on Aug 5, 2022
@momchil-flex deleted the tyler/hdf5_memory branch on August 5, 2022 22:03
Development

Successfully merging this pull request may close this issue:

  • Better Garbage Collection when Loading hdf5 Data