Encoded waveforms load slower than compressed waveforms #77
Forgot to mention that these files are the same size, so there is no storage benefit to using encoding over gzipping.
@gipert suggested trying this on a hard drive, so I tested it on NERSC. @gipert also suggested checking other compression filters, which I tested as well.
Well, this cannot be...
Yes, I have also noticed that ZZ is significantly slower than David's code. I would have originally preferred to use only Sigcompress, but it was then decided not to rescale the pre-summed waveform (i.e. its samples are of uint32 type), and Sigcompress only works with 16-bit integers.
Another thing: do you make sure to run the test scripts multiple times? After the first run the read speed will be much higher, since the file is cached by the filesystem. If you have admin rights, you can manually clear the filesystem cache to ensure reproducibility (I forgot the command). Also make sure that the Numba cache is enabled, so it does not need to recompile after the first run!
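On Linux the page cache can typically be dropped via /proc/sys/vm/drop_caches (root required). A minimal sketch of a cold-vs-warm read check, with hypothetical file and dataset names:

```python
import subprocess
import time

import h5py  # LH5 files are plain HDF5, so h5py suffices for a raw throughput check

FILE = "raw_waveforms.lh5"                # hypothetical test file
DATASET = "ch000/raw/waveform/values"     # hypothetical dataset path

def drop_page_cache():
    """Evict the Linux page cache (needs root); otherwise repeated reads hit RAM, not disk."""
    subprocess.run(["sync"], check=True)
    subprocess.run(["sudo", "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"], check=True)

for i in range(3):
    drop_page_cache()  # comment this out to measure the warm (cached) read speed instead
    t0 = time.perf_counter()
    with h5py.File(FILE, "r") as f:
        data = f[DATASET][...]
    print(f"run {i}: {time.perf_counter() - t0:.3f} s")
```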
If we confirm that LZF performance is this good, it could be a safe choice for the next re-processing of the raw tier. It's a well-known algorithm, so I'm sure it's well supported on the Julia side too.
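For reference, LZF ships with h5py out of the box. A minimal sketch, with synthetic data and a hypothetical file name, of writing the same waveforms with the two filters:

```python
import h5py
import numpy as np

# fake 16-bit waveforms, just to exercise the filters
wfs = np.random.randint(0, 2**16, size=(1000, 8192), dtype=np.uint16)

with h5py.File("filter_test.h5", "w") as f:
    # GZIP: levels 0-9, h5py's default level is 4
    f.create_dataset("wf_gzip", data=wfs, compression="gzip", compression_opts=4)
    # LZF: built into h5py, no tuning options; typically faster to decompress, slightly larger files
    f.create_dataset("wf_lzf", data=wfs, compression="lzf")
```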
I retried this and it looks like I screwed up the test I made of … Here is my code: waveform_tests.txt. The default level of compression for …
I'm not sure that there is actually any caching occurring, or whether something else is going on that I don't understand. Maybe the cache is getting cleared faster than I can read the files? This is the output from repeating the test three times in a simple loop; the read speeds of each loop are nearly identical. It does look like the test I made of … I checked this by introducing a … I can perform some more rigorous tests if there are suggestions.
Is there something special I need to do to enable this? I am not very familiar with Numba; I thought this would be handled by whatever functions we have that use Numba. Results from …
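For reference, Numba only caches compiled functions across runs if cache=True is passed to the decorator; it is off by default. A minimal, self-contained sketch with a toy kernel (not the actual library code):

```python
from numba import njit

# cache=True persists the compiled machine code to disk (in __pycache__),
# so later runs and processes skip the JIT compilation step.
@njit(cache=True)
def unpack(encoded, out):
    # toy stand-in for a decoding kernel, not the legend-pydataobj implementation
    for i in range(encoded.size):
        out[i] = encoded[i]
    return encoded.size
```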
Results from …
The caching is always occurring, and the data is kept in the page cache until the cache fills up (or, I guess, a significant amount of time has passed). Since the cache is shared among all processes and users, the time it takes for a file to be evicted can vary a lot, but I'd say it's not particularly short.
Numba caching is enabled by default in legend-pydataobj, unless overridden by the user. To conclude: let's re-discuss all of this when we are again close to re-generating the raw tier. We had to cut the discussion short last time because of time constraints, and we clearly did not study this well enough.
See https://legend-exp.atlassian.net/wiki/spaces/LEGEND/pages/991953524/Investigation+of+LH5+File+Structure+and+I+O.
After some profiling, I think the culprit here is awkward array. Here's a test of reading in a raw file:
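The original profiling output is not reproduced here. As a rough sketch of how such a read could be profiled, assuming the lgdo.lh5.read convenience function and hypothetical file and group names:

```python
import cProfile
import pstats

from lgdo import lh5  # legend-pydataobj; lh5.read assumed here, adjust to the installed API

def read_raw(path):
    # hypothetical channel/group name for a raw-tier file
    return lh5.read("ch000/raw", path)

cProfile.run("read_raw('raw_file.lh5')", "read.prof")
pstats.Stats("read.prof").sort_stats("cumulative").print_stats(20)
```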
It looks like the slow part is this line in …
This line is responsible for padding each vector to the same length and copying it into a numpy array. The problem is that it seems to iterate through the arrays in a slow, python-y way. I think it could be made a lot quicker by constructing the numpy array first and then copying into it; doing that with Numba would be very fast. Awkward array may also have a much more optimized way of doing this; see the sketch below.
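A sketch of the "preallocate, then fill" idea, as a hypothetical helper working from a flattened sample array plus cumulative offsets rather than the actual VectorOfVectors internals:

```python
import numpy as np
from numba import njit

@njit(cache=True)
def _fill_rows(flat, offsets, out):
    # copy each jagged vector into its row of the preallocated 2D output
    for i in range(len(offsets) - 1):
        start, stop = offsets[i], offsets[i + 1]
        out[i, : stop - start] = flat[start:stop]

def to_padded(flat, offsets, fill_value=0):
    """Pad a jagged array (flat samples + cumulative offsets, offsets[0] == 0) into a 2D array."""
    lengths = np.diff(offsets)
    out = np.full((lengths.size, lengths.max()), fill_value, dtype=flat.dtype)
    _fill_rows(flat, offsets, out)
    return out
```

Alternatively, Awkward Array's ak.pad_none followed by ak.fill_none and ak.to_numpy might provide an equally vectorized route, if a custom kernel is undesirable.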
Encoded waveforms load 2-3x slower than gzipped waveforms. If I convert waveforms from encoded to gzipped (see #76), they load 2-3x faster.
This gives …
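The benchmark script and its output are not preserved above. Purely as an illustration of the kind of comparison meant, a sketch assuming two hypothetical files containing the same waveforms (one codec-encoded, one gzip-compressed) and the lgdo.lh5.read convenience function:

```python
import time

from lgdo import lh5  # legend-pydataobj; lh5.read assumed, adjust to the installed API

# hypothetical file names: same raw data, written once with the waveform codec
# and once with HDF5 gzip compression (see #76)
for label, path in [("encoded", "raw_encoded.lh5"), ("gzipped", "raw_gzipped.lh5")]:
    t0 = time.perf_counter()
    wfs = lh5.read("ch000/raw/waveform", path)
    print(f"{label}: read in {time.perf_counter() - t0:.2f} s")
```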