
performance improvements for saving / loading .hdf5 #451

Merged (2 commits into develop, Aug 5, 2022)

Conversation

@tylerflex (Collaborator) commented Aug 3, 2022:

For a SimulationData.hdf5 file of size 4 GB on disk:

  • Requires 4.21 GB of memory to read and write (measured with the memory_profiler python package).
  • About 11 seconds of load time on my machine; a few seconds to write.

@tylerflex changed the title from "added memory test for saving / loading .hdf5" to "performance improvements for saving / loading .hdf5" on Aug 3, 2022
@tylerflex force-pushed the tyler/hdf5_memory branch 3 times, most recently from ea968e7 to 6c33f02 on August 3, 2022 12:22
@@ -307,6 +307,9 @@ def unpack_dataset(dataset: h5py.Dataset) -> Any: # pylint:disable=too-many-ret
return [val.decode("utf-8") for val in value]
if value.dtype == bool:
return value.astype(bool)
# handle xarray datasets implicitly (retain np.ndarray type)
if len(value.shape) >= 4:
@momchil-flex (Collaborator):

What is this about and why 4?

@tylerflex (Collaborator, Author):

So, admittedly this is a bit hacky, but here's the explanation: this function loads the value out of an hdf5 dataset to be placed in a dictionary that is eventually fed to cls.parse_raw(). Oftentimes the data is of type np.ndarray, including things like size, center, vertices, and DataArray data (or values). .from_file() was slow before because it converted all np.ndarray values with .tolist(), since Tidy3d doesn't know how to handle np.ndarray. However, for large DataArray objects this was slow and unnecessary, because xarray then needs to convert that list back to np.ndarray. So this condition checks whether the numpy array has 4 or more dimensions (a scalar field data array) and, if so, just keeps the type as np.ndarray. Maybe we could make this more explicit and set a flag in the dataset indicating whether to load as array or list?
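
For reference, here is a minimal sketch of the branch under discussion (the other return branches of the real unpack_dataset are elided, and the dimension names in the comment are an assumption):

    import h5py
    import numpy as np

    def unpack_dataset(dataset: h5py.Dataset):
        """Get the value contained in a dataset in a form ready to insert into the final dict."""
        value = dataset[()]  # read the whole dataset into memory
        # ... string / bool / scalar branches of the real function elided ...
        if isinstance(value, np.ndarray):
            # handle xarray datasets implicitly (retain np.ndarray type):
            # only large scalar field data has 4+ dimensions (e.g. x, y, z, f),
            # so skip the expensive .tolist() round trip for those
            if len(value.shape) >= 4:
                return value
            return value.tolist()
        return value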

@momchil-flex (Collaborator):

What I don't understand is: why is this OK? Why is the .tolist() conversion needed in the first place, and why doesn't something break when we don't convert certain arrays to list? The function docstring says "Gets the value contained in a dataset in a form ready to insert into final dict." So is an ndarray OK or not? It seems like either we should be converting all numpy arrays to list or none of them. Or an intermediate option might be if len(value.shape) > 1, if we are iterating through arrays and only want to convert the innermost one to list?

@tylerflex (Collaborator, Author):

np.ndarray is only acceptable if it is a DataArray. For everything else, it's not OK. For example, make a box with

td.Box(size=np.array([1, 2, 3]))

and it will fail, because Tidy3dBaseModel doesn't recognize np.ndarray.

So the idea is that we just want to return whatever type we can feed to initialize the object. Does that make sense?
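
Concretely (assuming the usual import tidy3d as td convention; the values are made up), the fix at load time is just to hand Tidy3d a plain list:

    import numpy as np
    import tidy3d as td

    value = np.array([1.0, 2.0, 3.0])  # e.g. a "size" read back from an hdf5 file

    # td.Box(size=value)               # fails: Tidy3dBaseModel doesn't accept np.ndarray
    box = td.Box(size=value.tolist())  # works: a plain list/tuple is accepted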

@momchil-flex (Collaborator):

I think I see. But, for example, all the coords of the xarray DataArray will still be converted to list? Generally it seems quite sketchy (what if we introduce 3D scalar data in the future?), but I'm fine with this if it noticeably improves things and there's no easy but better fix.

@tylerflex (Collaborator, Author):

Yeah, the way I have it right now, coords and n-dimensional scalar data (where n < 4) will be converted to lists and back. Maybe the easiest thing to do is to keep np.ndarray only in the xarray bits and save everything else as list. I'll give it a try.

@tylerflex linked an issue on Aug 4, 2022 that may be closed by this pull request
@momchil-flex (Collaborator):

Note also my last comments in the linked issue. However, with respect to avoiding a MonitorData copy during __getitem__, I can't really come up with a good way to do that while keeping the option for the user to change the normalize_index (the only way the user can do that is something like SimulationData.copy(update=dict(normalize_index=1)), right?).

Well, one way I could come up with is this: self.monitor_data is exactly what is written in the hdf5 file, and we bring back a normalize(new_normalize_index) method that returns a copy of the SimulationData with all of the contained data first renormalized by 1 / spectrum(old_normalize_index) (if old_normalize_index is not None), and then normalized by spectrum(new_normalize_index). Note that the 1 / should not pose any problems, because the data normalize function already does something like 1 / spectrum.

This may sound a bit ugly, but it does solve some problems:

  • Avoids the confusion users may have about why the data in the file is different from the data in python.
  • Gives us a way to write the normalized data to file for the post-run viz (currently there is no easy way, I think).
  • Avoids making a copy when you do sim_data["monitor_name"].

Basically, we will normalize by source index 0 on the backend already, to get the same default behavior as now, but with the .hdf5 file matching what's returned by sim_data["monitor_name"]. So on the backend, to write the final file, we do sim_data.normalize(normalize_index=0).to_file("simulation_data.hdf5"). This is what users download. Then they can modify as they wish through sim_data.normalize. The renormalization does seem a bit odd, but I like how it solves a number of issues we have right now.
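
A rough sketch of the proposed scheme (everything here is hypothetical, not an existing API; spectrum() stands in for however the source spectrum is computed, and normalize_index is assumed to be stored on the model):

    # hypothetical sketch of the proposed SimulationData.normalize
    def normalize(sim_data, new_normalize_index: int):
        """Return a copy of sim_data with all monitor data normalized by the new source index."""

        def renormalize(data):
            if sim_data.normalize_index is not None:
                # undo the old normalization, i.e. "normalize by 1 / spectrum(old_index)"
                data = data * sim_data.spectrum(sim_data.normalize_index)
            # apply the new normalization (the existing data normalize
            # function already does something like 1 / spectrum)
            return data / sim_data.spectrum(new_normalize_index)

        new_monitor_data = {name: renormalize(dat) for name, dat in sim_data.monitor_data.items()}
        return sim_data.copy(
            update=dict(monitor_data=new_monitor_data, normalize_index=new_normalize_index)
        )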

@tylerflex (Collaborator, Author):

> Well, one way I could come up with is this: self.monitor_data is exactly what is written in the hdf5 file, and we bring back a normalize(new_normalize_index) method that returns a copy of the SimulationData with all of the contained data first renormalized by 1 / spectrum(old_normalize_index) (if old_normalize_index is not None), and then normalized by spectrum(new_normalize_index).

I'm confused though: wouldn't this run into the same issue, because you'd then have two copies of the sim_data in memory? For example, would the following require twice the memory of sim_data?

sim_data = sim_data.normalize(1)

@tylerflex (Collaborator, Author) commented Aug 5, 2022:

I added a commit that speeds up loading a 4 GB hdf5 from 14 s to 4 s. A bit of background explanation:

If I have a value (1, 2, 3) that gets saved to an hdf5 file, it gets loaded back as np.ndarray and is indistinguishable (as far as I can tell) from an np.array([1, 2, 3]) that was saved to the file. So there is no way to tell whether the data "should" be loaded as np.ndarray or tuple. The ideal solution would be to load everything as np.ndarray and then be able to feed these arrays to Tidy3d, which would call .tolist() on them wherever np.ndarray is not an acceptable type (for example, for center or vertices).
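
A quick demonstration of the ambiguity (the file name is arbitrary):

    import h5py
    import numpy as np

    with h5py.File("demo.hdf5", "w") as fhandle:
        fhandle.create_dataset("as_tuple", data=(1, 2, 3))
        fhandle.create_dataset("as_array", data=np.array([1, 2, 3]))

    with h5py.File("demo.hdf5", "r") as fhandle:
        print(type(fhandle["as_tuple"][()]))  # <class 'numpy.ndarray'>
        print(type(fhandle["as_array"][()]))  # <class 'numpy.ndarray'>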

For now (until we implement something like that, which is a bit tricky), I pushed a commit that makes the distinction explicit and also improves the performance further.

The two main changes:

  1. Added a keep_numpy : bool = False flag that gets passed around (and stored in the file) and explicitly tells whether to keep a given object as numpy or not (see the sketch after this list).
  2. Removed unnecessary np.array() conversions in the DataArray validators, which the profiler showed were slowing things down.
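
A minimal sketch of how change (1) could look on the hdf5 side (the function names and the attrs-based storage are assumptions; the actual commit may store the flag differently):

    import h5py
    import numpy as np

    def pack_dataset(group: h5py.Group, name: str, value, keep_numpy: bool = False) -> None:
        """Write a value and record whether it should be loaded back as np.ndarray."""
        dataset = group.create_dataset(name, data=value)
        dataset.attrs["keep_numpy"] = keep_numpy  # flag stored in the file

    def unpack_dataset(dataset: h5py.Dataset):
        """Load a value, keeping np.ndarray only where the stored flag says so."""
        value = dataset[()]
        if dataset.attrs.get("keep_numpy", False):
            return value  # e.g. DataArray values: skip the .tolist() round trip
        return value.tolist() if isinstance(value, np.ndarray) else value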

@momchil-flex (Collaborator):

Ah, this looks much better now, thanks!

@momchil-flex merged commit 2ae68d4 into develop on Aug 5, 2022
@momchil-flex deleted the tyler/hdf5_memory branch on August 5, 2022 22:03
Development

Successfully merging this pull request may close this issue:

  • Better Garbage Collection when Loading hdf5 Data