Maximum Record Length #13
A while back @krischer wrote a nice paragraph describing the downsides of huge records. I can't find it at the moment; perhaps he can dig it up. It was a good summary of reasons not to use huge records. Dave Ketchum mentioned the problem of sample time drift over long arrays, and Lion had other points as well. I do not think we are trying to compete with SAC; the goals are very different. My 2 cents. Also worth considering: if we blocked/chunked the data payload so that the header does not need to contain the length, then, heck, there are no limits and you can make records as big as you want.
I feel the opposite: it is painful to have to deal with multiple file formats for saving seismic data. Everybody has to read miniSEED, but then once they have done processing, save in a different format? All I am asking for here is 2 extra bytes in the record size header. I agree a datalogger or a data center should not write really long records for raw data, but it is very useful for the end user to save one big float or int array instead of being forced to break it up after having already decided that the timing is good enough. The SAC file format is painful for lots of other reasons, and with these 2 extra bytes miniSEED could quickly become the only file format processing systems need to support. That would be a real benefit to seismology, for the cost of only 2 measly bytes.
In draft 20170708 there is no hard limit on record length. There remains a 65k limit for a single data block, but there is no limit on the number of data blocks that can be included in a record.
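To illustrate the blocked/chunked idea from the earlier comment, here is a minimal sketch of a reader for a record body made of length-prefixed data blocks. The 2-byte little-endian length prefix, the zero-length terminator, and the function name are assumptions chosen for illustration, not the layout actually specified in draft 20170708:

import struct

def read_data_blocks(fh):
    """Yield the data blocks of one record from an open binary file.

    Hypothetical layout for illustration only: each block is preceded by a
    2-byte little-endian byte count (so a block can be at most ~65k), and a
    count of zero ends the record body. With such a scheme the fixed header
    never needs to store the total record length.
    """
    while True:
        prefix = fh.read(2)
        if len(prefix) < 2:
            raise ValueError("truncated record: missing block length")
        (block_len,) = struct.unpack("<H", prefix)
        if block_len == 0:       # terminator: end of this record's payload
            return
        block = fh.read(block_len)
        if len(block) != block_len:
            raise ValueError("truncated record: short data block")
        yield block

A writer would simply emit each encoded block with its prefix and finish with a zero length, so an arbitrarily long series never runs into a record-size field limit.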
I can see that argument, but I don't see why it is a big problem. Some library will perform the record split so it is invisible to users. Another thing to keep in mind is that very large records will make it a lot harder to split up MiniSEED files, and they will also make the checksum at the end more expensive to compute. Additionally, the checksum becomes less meaningful: a single check now covers a potentially very large record, and a single bit flip invalidates the whole record without any chance of figuring out where it went wrong. Moreover, it becomes technically more challenging, because the checksum calculation requires access to the whole record after it has been encoded. Since libraries cannot simply demand a potential extra 4 GB of memory to write into, they would need an awkward flip-flop of writing everything to disc -> reading it again to calculate the checksum -> writing the checksum. This is IMHO a fairly realistic concern.
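A minimal sketch of that write -> re-read -> checksum flip-flop, assuming a plain CRC-32 via zlib.crc32 and a hypothetical convention of appending the 4-byte checksum to the end of the record; the draft's actual checksum algorithm and placement may differ:

import zlib

def write_checksum_after_the_fact(path, chunk_size=1 << 20):
    """Re-read an already-written record body in chunks, compute a CRC-32
    incrementally, and append it to the file.

    Illustration of the write -> re-read -> checksum flow only; the
    checksum algorithm (plain CRC-32 here) and appending it at the end of
    the file are assumptions, not the draft specification.
    """
    crc = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)  # incremental update, no giant buffer
    with open(path, "ab") as fh:
        fh.write(crc.to_bytes(4, "little"))
    return crc

Even with the incremental update avoiding a giant in-memory buffer, the whole encoded payload still has to be passed over a second time, which is exactly the awkwardness described above.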
I also cannot find it right now, but a 32-bit sampling rate is not accurate enough to correctly determine the times of later samples - though this only becomes important for fairly large sample counts, so I'm not sure it is a problem in practice. I found this script on my machine which demonstrates the problem - I guess I initially wrote it for some related discussion. While it is indeed a bit contrived, it is the equivalent of a 124-day recording at 200 Hz. The data format should IMHO not allow something that is wrong, and if MiniSEED allows something, people will definitely do it. Three possible ways around this:
So while I can understand @crotwell's arguments, there are a lot of downsides to large records and I feel they are not worth the trade-off. Also, MiniSEED 2 is already used as a processing format, so I don't see why that should no longer be the case with MiniSEED 3. I'm reopening this for further discussion.

from decimal import Decimal as D
import numpy as np
starttime_in_ns = 124734934578
# Awkward sampling rate to force floating point errors.
sampling_rate_in_sec = 201.12345678
# Max number of samples for 4 bytes record length field. Assumes 2 bytes per
# sample which is very achievable with compression.
samples = 4294967295 // 2
endtime_in_ns_d = \
    D(starttime_in_ns) + D("1000000000") / D(sampling_rate_in_sec) * D(samples)
# Use a single precision sampling rate.
endtime_in_ns = \
    starttime_in_ns + 1000000000 / np.float32(sampling_rate_in_sec) * samples
print("Endtime in ns - accurate: ", int(endtime_in_ns_d))
print("Endtime in ns - floating point:", int(endtime_in_ns))
diff = abs(int(endtime_in_ns_d) - int(endtime_in_ns))
print("Difference in ns: ", diff)
print("Difference as a factor of dt: ",
(diff / 1E9) / (1.0 / sampling_rate_in_sec)) output:
If we are going to limit the data size in a record to 16 bits (~64K), then there is no need for a UINT32 number of samples. Basically, the arguments above about timing and large records are really about large numbers of samples, even if they compress really well. I still like the idea of large single records, but if we are going to disallow them by limiting the record size to UINT16, then we should also limit the number of samples to UINT16. Or, flipping it around, if we allow UINT32 samples, we should allow UINT32 bytes to put them in.
The maximum number of samples in a 64-byte Steim2 frame is 105. There is a bit of overhead taking up a few more bytes, depending on whether or not it is the first frame, but it is still more than 64 samples in a 64-byte frame. Since we can have more than 1 sample per byte, we need a sample count field larger than the byte count field.
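Some rough numbers to make that concrete (a back-of-the-envelope sketch that ignores the record header and per-frame overhead, not the actual limits from the draft):

# How many Steim2 samples could, in the best case, fit into a payload whose
# byte length is capped at UINT16? Header and frame overhead are ignored.
max_payload_bytes = 2**16 - 1          # 65535, the UINT16 byte-length cap
frame_size = 64                        # bytes per Steim2 frame
best_samples_per_frame = 105           # best case mentioned above

frames = max_payload_bytes // frame_size             # 1023 frames
best_case_samples = frames * best_samples_per_frame  # 107,415 samples

print(best_case_samples, "samples vs UINT16 max of", 2**16 - 1)
# 107,415 > 65,535: a fully packed UINT16-sized record can hold more samples
# than a UINT16 sample count can represent.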
I still don't get it. You argued that records should not be too large due to sample drift, but now you are OK with a single record containing a huge number of samples as long as they compress really well? In other words, is allowing 2^17 samples to be packed into a 2^16-byte record really that much of a benefit over forcing them to be split into 2 records? Is it worth the extra 2 bytes that will be zeros >99.99% of the time? I feel this edge case of packing a maximally sized record to capacity with highly compressible data is not worth it. Records should either be limited to be "small" in both senses, or should be allowed to be large in both senses.
My thinking was to provide a single limiter (length) instead of two (length and sample count), which is a bit more complex and a (minor) wrinkle for record creators who try to create maximum-size records. The original Strawman had the two-limiter issue and it was commented on. I do not feel strongly about this and would be fine with a UINT16, maximum 2^16 samples. Does anyone else have strong feelings?
Discussion branched off #2. Concerns DRAFT20170622. @crotwell