performance improvements for saving / loading .hdf5 #451
Conversation
Force-pushed from ea968e7 to 6c33f02
tidy3d/components/base.py (Outdated)

```diff
@@ -307,6 +307,9 @@ def unpack_dataset(dataset: h5py.Dataset) -> Any:  # pylint:disable=too-many-ret
         return [val.decode("utf-8") for val in value]
     if value.dtype == bool:
         return value.astype(bool)
+    # handle xarray datasets implicitly (retain np.ndarray type)
+    if len(value.shape) >= 4:
+        return value
```
What is this about and why 4?
So, admittedly this is a bit hacky, but here's the explanation: this function loads the value out of an hdf5 dataset to be placed in a dictionary that is eventually fed to `cls.parse_raw()`. Oftentimes, the data is of type `np.ndarray`, including things like `size`, `center`, `vertices`, and `DataArray` data (or values). `.from_file()` was slow before because it converted all `np.ndarray` values via `.tolist()`, as Tidy3d doesn't know how to handle `np.ndarray`. However, for large `DataArray` objects, this was slow and unnecessary because `xarray` needs to convert that list back to `np.ndarray`. So this condition checks whether the numpy array has 4 or more dimensions (a scalar field data array) and, if so, just keeps the type as `np.ndarray`. Maybe we could make this more explicit and set a flag in the dataset indicating whether to load as array or list?
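In other words, a minimal sketch of the heuristic (the helper name here is hypothetical, standing in for the relevant branch of `unpack_dataset`):

```python
import numpy as np

def unpack_array_value(value: np.ndarray):
    """Sketch of the shape-based heuristic discussed above (hypothetical helper)."""
    # 4D+ arrays are assumed to be DataArray values: keep them as np.ndarray
    # so xarray doesn't have to rebuild the array from a nested list later
    if len(value.shape) >= 4:
        return value
    # everything else (size, center, vertices, ...) becomes a plain list,
    # since the pydantic models don't accept raw np.ndarray inputs
    return value.tolist()
```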
What I don't understand is, why is this OK? Why is the `.tolist()` conversion needed in the first place, and why doesn't something break by the fact that we don't convert certain arrays to list? The function docstring says "Gets the value contained in a dataset in a form ready to insert into final dict." - is an ndarray OK or not? It just seems like either we should be converting all numpy arrays to list or none of them. Or an intermediate situation may be `if len(value.shape) > 1`, if we are iteratively going through arrays and only want to convert the innermost one to list?
`np.ndarray` is only acceptable if it is a `DataArray` value. For everything else, it's not OK. For example, make a box with `td.Box(size=np.array([1, 2, 3]))` and it will fail because `Tidy3dBaseModel` doesn't recognize `np.ndarray`. So the idea is that we just want to return whatever type we can feed to initialize the object. Does that make sense?
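A concrete illustration of the failure mode described above (behavior as described in this thread; the exact error depends on the Tidy3d version):

```python
import numpy as np
import tidy3d as td

# fine: a plain tuple or list is validated by the pydantic model
box_ok = td.Box(size=(1, 2, 3))

# expected to raise a validation error per the discussion above, because
# Tidy3dBaseModel doesn't know how to handle a raw np.ndarray
box_bad = td.Box(size=np.array([1, 2, 3]))
```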
I think I see. But, for example, all the coords of the xarray `DataArray` will still be converted to list? Generally it seems quite sketchy (what if we introduce 3D scalar data in the future?), but I'm fine with this if it noticeably improves things and there's no easy but better fix.
Yea, the way I have it right now, coords and nD scalar data (where n < 4) will be converted to lists and back. Maybe the easiest thing to do is just save things as `np.ndarray` only in the xarray bits, and everything else could be saved as list. I'll give it a try.
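For context, a small sketch of the list-to-array round-trip being avoided (the shapes and coords here are made up for illustration):

```python
import numpy as np
import xarray as xr

values = np.random.rand(2, 2, 2, 3)  # stand-in for a 4D scalar field
coords = {
    "x": [0.0, 1.0],
    "y": [0.0, 1.0],
    "z": [0.0, 1.0],
    "f": [1e14, 2e14, 3e14],
}

# feeding a nested list forces xarray to convert it back into an ndarray,
# which is the wasted round-trip described above
da_slow = xr.DataArray(values.tolist(), coords=coords, dims=("x", "y", "z", "f"))

# feeding the ndarray directly avoids that extra conversion and copy
da_fast = xr.DataArray(values, coords=coords, dims=("x", "y", "z", "f"))
```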
Note also my last comments in the linked issue. However, with respect to avoiding a `MonitorData` copy during `__getitem__`, I can't really come up with a good way to do that while keeping the option for the user to change the normalization. One way I could come up with may sound a bit ugly, but it does solve some problems. Basically, we will normalize by source index 0 on the backend already, to get the same current default behavior, but with the .hdf5 file matching what's returned.
I'm confused though, wouldn't this run into the same issue, because you'd then have two copies of the sim_data in memory? For example, would this store twice the memory required for `sim_data = sim_data.normalize(1)`?
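A sketch of the memory concern (the file name is hypothetical; the `normalize` usage is as discussed in this thread):

```python
import tidy3d as td

# loading a large SimulationData holds its ~4GB of arrays in memory
sim_data = td.SimulationData.from_file("simulation_data.hdf5")

# normalize() returns a new SimulationData, so until the old object is
# garbage-collected, both copies exist and peak memory roughly doubles
sim_data = sim_data.normalize(1)
```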
Force-pushed from 67e60f1 to dbe288b
I added a commit that speeds things up from 14s to 4s for loading a 4GB hdf5. For now (until we implement something like this, which is a bit tricky), I pushed a commit that does this explicitly and also improves the performance further.
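As a sketch of what "doing this explicitly" might look like (the flag name and file layout here are hypothetical, not the actual commit): tag each saved dataset with an attribute telling the loader whether to keep it as an array or convert it to a list:

```python
import h5py
import numpy as np

# hypothetical writer: mark DataArray values so the loader keeps them as arrays
with h5py.File("data.hdf5", "w") as fhandle:
    dset = fhandle.create_dataset("field_values", data=np.random.rand(2, 2, 2, 3))
    dset.attrs["keep_as_array"] = True  # hypothetical flag name

# hypothetical loader: branch on the flag instead of on the array's shape
with h5py.File("data.hdf5", "r") as fhandle:
    dset = fhandle["field_values"]
    value = dset[()]
    unpacked = value if dset.attrs.get("keep_as_array", False) else value.tolist()
```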
Force-pushed from dbe288b to 77ee8de
Ah, this looks much better now, thanks!
For a `SimulationData.hdf5` file of size 4GB on disk, memory usage was measured with the `memory_profiler` Python package.
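For reference, a minimal sketch of how such a measurement can be taken (the actual benchmark script isn't shown in this thread; the file path is hypothetical):

```python
from memory_profiler import memory_usage

import tidy3d as td

def load_sim_data():
    return td.SimulationData.from_file("simulation_data.hdf5")

# sample memory every 0.5s while the load runs, then report the peak
samples = memory_usage((load_sim_data, (), {}), interval=0.5)
print(f"peak memory: {max(samples):.0f} MiB")
```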