total size of record should be easy to calculate based on fixed offsets #25
I totally agree in terms of simplicity, but I feel like this goes against the current draft to some extent. It would make the termination block, for example, somewhat pointless (except maybe the CRC) - if you already know all that information at the time the fixed header is written, it could just be written there.
Yep, and that, I think, is my argument for separating the archival record format from the on-the-wire low-latency protocol. The receiver can almost as easily update values in the fixed header as it can append data and termination blocks, so why are we doing this? The complexity over the wire makes sense to me, but not once it is dumped to disk.
While I agree that a simple, up-front record length is nice, I also think skipping through blocks to determine the total size is not so bad (you would have to do that to read the record anyway), if that is one of a few concessions that allows us to have a format that does both archiving and streaming. As a reader you can, and should, have a maximum record length you're willing to read, and fail gracefully when a record goes beyond that. Then again, I haven't yet written code to do that, so maybe I'll feel differently afterwards... I wouldn't be totally opposed to an up-front record length, but it does affect flexibility a bit. I can imagine we may want to use a future encoding to send 1-second data blocks, and this encoding (compressed) may not be very predictable in terms of size. Maybe we need another issue for discussing "should we have separate archive and streaming formats"?
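The skip-through-blocks approach with a reader-side maximum length could be sketched as follows. Note the block layout assumed here (a 1-byte type and a 4-byte little-endian length preceding each block's payload, with type 0 as the termination block) is purely illustrative and not taken from the draft spec:

```python
import struct

MAX_RECORD_LEN = 1 << 20  # reader's own sanity limit (1 MiB), not part of the format

def record_length(buf, header_len=40):
    """Walk the variable-length blocks after a hypothetical fixed header to
    find where the record ends. Each block is assumed to be a 1-byte type,
    a 4-byte little-endian payload length, then the payload; type 0 marks
    the termination block. All of these details are illustrative only."""
    offset = header_len
    while offset + 5 <= len(buf):
        btype, blen = struct.unpack_from("<BI", buf, offset)
        offset += 5 + blen  # skip block header plus payload
        if offset > MAX_RECORD_LEN:
            raise ValueError("record exceeds reader's maximum length")
        if btype == 0:  # hypothetical termination block reached
            return offset
    raise ValueError("no termination block found")
```

The key point of the discussion is visible in the loop: each block's offset depends on the length field of the previous one, so the total size is only known after the whole chain has been walked.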
Easy solution: fixed record size.
@andres-h That implies padding, and potentially splitting records when adding an extra header or chunk if there is not enough space. I feel the splitting is particularly a problem when the added item strongly relates to the data in the record but doesn't fit. A particularly bad instance of this was old mseed2 data that lacked a blockette 1000 and did not have room to insert it. Hopefully mseed3 never suffers from such a glaring problem, but that past experience makes me very wary of fixed record sizes.
On 07/10/2017 09:23 PM, Philip Crotwell wrote:
@andres-h <https://github.com/andres-h> That implies padding and
potentially splitting records when adding a extra header or chunk if
there is not enough space. I feel the splitting is particularly a
problem when the added item strongly relates to the data in the record
but doesn't fit.
A particularly bad instance of this was old mseed2 data that lacked a
blockette 1000 and did not have room to insert it. Hopefully mseed3
never suffers from such a glaring problem, but that past experience
makes me very wary of fixed record sizes.
In this case, the record size can be enlarged. I don't mean that
virtually *all* records have to be the same size, just that records in a
single file should preferably be the same size. So when you add another
chunk, you make all records larger by that size (if there is not
enough free space already).
One thing to possibly consider is skipping the padding during transfer
if a protocol such as SeedLink is used. It wouldn't affect CRC in my
version, because CRC is calculated for data up to and *not* including
the CRC chunk, which is normally the last chunk (unless SHA or something
is added after it).
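The CRC scheme described above, where the CRC covers every byte up to but not including the CRC chunk at the end of the record, could be verified roughly like this. The chunk layout (a trailing 8-byte CRC chunk: 4 bytes of chunk header followed by a little-endian CRC value) and the use of CRC-32 are assumptions for illustration; the draft may specify a different chunk layout or polynomial:

```python
import struct
import zlib

def verify_crc(record):
    """Verify a record whose last chunk is assumed to be an 8-byte CRC
    chunk: 4 bytes of chunk header followed by a little-endian CRC-32.
    The CRC covers every byte before the CRC chunk, matching the scheme
    described in the discussion. Layout and polynomial are hypothetical."""
    covered, crc_chunk = record[:-8], record[-8:]
    stored = struct.unpack("<I", crc_chunk[4:])[0]
    return (zlib.crc32(covered) & 0xFFFFFFFF) == stored
```

Because the covered span ends before the CRC chunk, appending or stripping anything after that chunk (such as transfer padding) leaves the stored CRC valid, which is the property being argued for.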
So if I have a 10 MB mseed file with 20,000 mseed records in it, and I need to add a small thing to one record that doesn't have enough free space, I should expand all 20,000 records to be the same size? That seems very wasteful.
On 07/11/2017 06:12 PM, Philip Crotwell wrote:
So if I have a 10Mb mseed file with a 20,000 mseed records in it, and I
need to add a small thing to one record that doesn't have enough free
space, I should expand all 20,000 records to be the same size? That
seems very wasteful.
Usually you want to add the thing to all records anyway. You wouldn't
add blockette 1000 to one record only...
Except when you want to add event detection records or other things that occur sparsely in time. I agree with @crotwell, this is a good example of a problem with fixed record lengths that we have suffered from for a long time. Chances are high that this problem will be exacerbated when we allow more header additions. |
On 07/11/2017 06:48 PM, Chad Trabant wrote:
Usually you want to add the thing to all records anyway. You wouldn't
add blockette 1000 to one record only...
Except when you want to add event detection records or other things that
occur sparsely in time.
Event detections go to separate records usually.
Anyway, having fixed-length records is just a recommendation. I'm too
lazy to implement an index for random access :)
I've seen otherwise and there is no restriction in the format, i.e. any combination is allowed.
You're welcome to use this :) I still have to update the code from the currently documented schema and will do so soon, so it's a bit out of sync.
A common need is to be able to easily read an entire record into memory, or to skip to the start of the next record - for example, when looking for the first record that contains a given time.
The 0708 spec makes this hard because the overall length is
<header size> + sum( <data block size> ) + <termination block size>
meaning that to find the start of the next record, or to fully read a single record into memory, you must do multiple reads, with the offset of each read depending on the value of the previous one. I do not think this is very efficient, and I think it argues that the fixed header should contain a single field giving the total size of the record, instead of having the total size be the sum of the sizes of various sub-record blocks. My feeling is that the best values to store in the fixed header would be the total size and the data size, with the extra header size either also being stored, or perhaps documented as total size minus data size. And identifier size also needs to be in there, I suppose.
Regardless of how it is done, I feel a fundamental design goal should be that you can read a small, fixed number of bytes, i.e. the size of the fixed header, and know that you can find the total size of the record without additional reads. We do not want to recreate the painful situation of having to search for a blockette 1000 to find the overall record size.
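That design goal, a single read of the fixed header yielding the total record size, could look like the sketch below. The header length and the offset of the total-size field are hypothetical placeholders, not values from the draft spec:

```python
import struct

FIXED_HEADER_LEN = 40    # hypothetical fixed header length
TOTAL_SIZE_OFFSET = 12   # hypothetical offset of the total-size field

def next_record_offset(f):
    """Read only the fixed header, pull the total record size from a
    fixed offset within it, and return the file position of the next
    record. One read, no dependent offsets; layout is illustrative."""
    start = f.tell()
    header = f.read(FIXED_HEADER_LEN)
    if len(header) < FIXED_HEADER_LEN:
        raise EOFError("truncated record")
    total = struct.unpack_from("<I", header, TOTAL_SIZE_OFFSET)[0]
    return start + total
```

Contrast this with the block-walking approach: here a reader can seek from record to record, or find the first record containing a given time, touching only one fixed-size header per record.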