total size of record should be easy to calculate based on fixed offsets #25

crotwell · 2017-07-10T14:31:38Z

A common need is to be able to easily read an entire record into memory or skip to the start of the next record, for example when looking for the first record that contains a given time.

The 0708 spec makes this hard because the overall length is
<header size> + sum( <data block size> ) + <termination block size>, meaning that to find the start of the next record, or to read a single record fully into memory, you must do multiple reads, with the offset of each read depending on the value returned by the previous one. I do not think this is very efficient, and I think it argues that the fixed header should contain a single field that gives the total size of the record, instead of having the total size be the sum of the sizes of various sub-record blocks.

My feeling is that the best values to store in the fixed header would be the total size and the data size, with the extra header size either also being stored or being documented as total size minus data size. The identifier size also needs to be in there, I suppose.

Regardless of how it is done, I feel a fundamental design goal should be that you can read a small, fixed number of bytes, i.e. the size of the fixed header, and find the total size of the record without additional reads. We do not want to recreate the painful situation of having to search for a blockette 1000 to find the overall record size.
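A minimal sketch of the difference described above, assuming two hypothetical layouts invented purely for illustration (neither is the 0708 draft): a "chained" record whose length is only known after walking every block, and a "single field" record whose first four bytes give the total size. The field widths, the `type 0` termination block, and the function names are all assumptions.

```python
import io
import struct

# Hypothetical layouts, invented for illustration only (not the draft spec).
# Chained: [u16 header_size][rest of header], then blocks of
#          [u8 type][u16 size][payload], ending with a block of type 0.
# Single:  [u32 total_size][rest of record].

def next_offset_chained(f, start):
    """Return (offset of next record, number of reads) by walking the blocks."""
    f.seek(start)
    (header_size,) = struct.unpack("<H", f.read(2))
    reads = 1
    pos = start + header_size  # header_size counts the size field itself
    while True:
        f.seek(pos)
        block_type, block_size = struct.unpack("<BH", f.read(3))
        reads += 1
        pos += 3 + block_size  # each read position depends on the previous read
        if block_type == 0:    # termination block ends the record
            return pos, reads

def next_offset_single(f, start):
    """Return (offset of next record, number of reads) from one fixed field."""
    f.seek(start)
    (total_size,) = struct.unpack("<I", f.read(4))
    return start + total_size, 1

def _block(block_type, payload):
    return struct.pack("<BH", block_type, len(payload)) + payload

# One 31-byte record in each layout.
chained = (struct.pack("<H", 10) + bytes(8)
           + _block(1, b"x" * 5) + _block(1, b"y" * 3) + _block(0, b"z" * 4))
single = struct.pack("<I", 31) + bytes(27)

print(next_offset_chained(io.BytesIO(chained), 0))  # (31, 4)
print(next_offset_single(io.BytesIO(single), 0))    # (31, 1)
```

In the chained case the number of reads grows with the number of blocks; in the single-field case it is always one, regardless of what the record contains.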


krischer · 2017-07-10T15:15:03Z

I totally agree in terms of simplicity, but I feel like this goes against the current draft to some extent. It would, for example, make the termination block a bit pointless (except maybe the CRC): if you already know all that information at the time the fixed header is written, it could just be written there.

crotwell · 2017-07-10T17:11:08Z

Yep, and that is, I think, my argument for separating the archival record format from the on-the-wire, low-latency protocol. The receiver can almost as easily update values in the fixed header as it can append data and termination blocks, so why are we doing this? The complexity over the wire makes sense to me, but not once the data is dumped to disk.

chad-earthscope · 2017-07-10T17:34:38Z

While I agree that a simple, up-front record length is nice, I also think skipping through blocks to determine the total size is not so bad (you would have to do that to read the record anyway) if that is one of a few concessions that allows us to have a format that does both archiving and streaming. As a reader you can and should have a maximum record length you're willing to read, and fail gracefully when a record goes beyond that. Then again, I haven't yet written code to do that, so maybe I'll feel differently after that...

I wouldn't be totally opposed to an up-front record length, but it does affect flexibility a bit. I can imagine we will want to use a future encoding to send 1-second data blocks, and the size of such (compressed) blocks may not be very predictable.

Maybe we need another issue for discussion of "should we have separate archive and streaming formats"?
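The "maximum record length, fail gracefully" idea could look roughly like the sketch below. It assumes a hypothetical block layout of `[u8 type][u16 size][payload]` with a `type 0` block terminating the record; the layout, the 4096-byte limit, and the names `read_record` and `RecordTooLong` are illustrative assumptions, not anything from the draft.

```python
import io
import struct

MAX_RECORD_LENGTH = 4096  # the reader's own limit, chosen arbitrarily here

class RecordTooLong(Exception):
    """Raised when a record would exceed the reader's limit."""

def read_record(f, max_length=MAX_RECORD_LENGTH):
    """Read one record block by block, refusing to go past max_length bytes.

    Hypothetical layout: blocks of [u8 type][u16 size][payload],
    with a block of type 0 terminating the record.
    """
    record = bytearray()
    while True:
        head = f.read(3)
        if len(head) < 3:
            raise EOFError("stream ended inside a record")
        block_type, block_size = struct.unpack("<BH", head)
        # Check the limit before reading the payload, so an oversized
        # or corrupt size field never triggers a large read.
        if len(record) + 3 + block_size > max_length:
            raise RecordTooLong(f"record exceeds {max_length} bytes")
        payload = f.read(block_size)
        if len(payload) < block_size:
            raise EOFError("stream ended inside a block")
        record += head + payload
        if block_type == 0:
            return bytes(record)

stream = struct.pack("<BH", 1, 10) + b"d" * 10 + struct.pack("<BH", 0, 0)
print(len(read_record(io.BytesIO(stream))))  # 16
```

Calling `read_record(io.BytesIO(stream), max_length=8)` on the same bytes raises `RecordTooLong` instead of reading further.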

andres-h · 2017-07-10T17:54:15Z

Easy solution: fixed record size.

crotwell · 2017-07-10T19:23:40Z

@andres-h That implies padding, and potentially splitting records when adding an extra header or chunk if there is not enough space. I feel the splitting is particularly a problem when the added item strongly relates to the data in the record but doesn't fit.

A particularly bad instance of this was old mseed2 data that lacked a blockette 1000 and did not have room to insert it. Hopefully mseed3 never suffers from such a glaring problem, but that past experience makes me very wary of fixed record sizes.

andres-h · 2017-07-10T19:36:35Z

On 07/10/2017 09:23 PM, Philip Crotwell wrote:

> @andres-h That implies padding and potentially splitting records when adding an extra header or chunk if there is not enough space. I feel the splitting is particularly a problem when the added item strongly relates to the data in the record but doesn't fit.
>
> A particularly bad instance of this was old mseed2 data that lacked a blockette 1000 and did not have room to insert it. Hopefully mseed3 never suffers from such a glaring problem, but that past experience makes me very wary of fixed record sizes.

In this case, the record size can be enlarged. I don't mean that virtually *all* records have to be the same size, just that records in a single file should preferably be the same size. So when you add another chunk, you just make all records larger by that size (if there is not enough free space already). One thing to possibly consider is skipping the padding during transfer if a protocol such as SeedLink is used. It wouldn't affect the CRC in my version, because the CRC is calculated for data up to and *not* including the CRC chunk, which is normally the last chunk (unless SHA or something is added after it).
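A rough sketch of why stripping padding would leave such a CRC intact, and of the random access that fixed-size records allow. The layout here is a hypothetical one made up for the example (`[u32 payload length][payload][u32 CRC][zero padding]`), and `zlib.crc32` stands in for whatever checksum the format would actually use.

```python
import struct
import zlib

def build_record(payload: bytes, record_size: int) -> bytes:
    """Hypothetical layout: [u32 length][payload][u32 CRC][zero padding]."""
    body = struct.pack("<I", len(payload)) + payload
    crc = struct.pack("<I", zlib.crc32(body))  # CRC covers bytes before the CRC chunk
    unpadded = body + crc
    if len(unpadded) > record_size:
        raise ValueError("payload does not fit in the fixed record size")
    return unpadded + bytes(record_size - len(unpadded))

def strip_padding(record: bytes) -> bytes:
    """Drop the trailing padding, e.g. before sending over the wire."""
    (n,) = struct.unpack_from("<I", record, 0)
    return record[: 4 + n + 4]

def crc_ok(record: bytes) -> bool:
    """Verify the CRC; works on padded and unpadded records alike."""
    (n,) = struct.unpack_from("<I", record, 0)
    (stored,) = struct.unpack_from("<I", record, 4 + n)
    return zlib.crc32(record[: 4 + n]) == stored

def record_offset(index: int, record_size: int) -> int:
    """With fixed-size records, the n-th record starts at n * record_size."""
    return index * record_size

rec = build_record(b"abc", 64)
print(len(rec), len(strip_padding(rec)))       # 64 11
print(crc_ok(rec), crc_ok(strip_padding(rec)))  # True True
print(record_offset(3, 64))                     # 192
```

Because the padding sits after the CRC chunk and outside the checksummed bytes, the same stored CRC validates both forms of the record.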

chad-earthscope · 2017-07-10T19:38:22Z

@crotwell 👍

crotwell · 2017-07-11T16:12:29Z

@andres-h

> In this case, the record size can be enlarged. I don't mean that virtually all records have to be the same size. Just records in a single file should be preferably same size. So when you add another chunk, you just make all records larger by that size (if there is not enough free space already).

So if I have a 10 MB mseed file with 20,000 mseed records in it, and I need to add a small thing to one record that doesn't have enough free space, I should expand all 20,000 records to be the same size? That seems very wasteful.
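To put rough numbers on that, here is a back-of-the-envelope calculation. It assumes 10 MB means 10,000,000 bytes and that the added item is 8 bytes; the 8-byte figure is purely hypothetical.

```python
file_size = 10_000_000  # assumed: 10 MB taken as 10,000,000 bytes
record_count = 20_000
extra_bytes = 8         # hypothetical size of the small added item

average_record_size = file_size // record_count  # bytes per record
growth_all_records = record_count * extra_bytes  # every record enlarged
growth_one_record = extra_bytes                  # only the affected record enlarged

print(average_record_size)  # 500
print(growth_all_records)   # 160000
print(growth_one_record)    # 8
```

Beyond the extra bytes themselves, enlarging every record also means rewriting the whole file, whereas enlarging one variable-length record only shifts the bytes that follow it.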

andres-h · 2017-07-11T16:36:57Z

On 07/11/2017 06:12 PM, Philip Crotwell wrote:

> So if I have a 10Mb mseed file with a 20,000 mseed records in it, and I need to add a small thing to one record that doesn't have enough free space, I should expand all 20,000 records to be the same size? That seems very wasteful.

Usually you want to add the thing to all records anyway. You wouldn't add blockette 1000 to one record only...

chad-earthscope · 2017-07-11T16:48:14Z

> Usually you want to add the thing to all records anyway. You wouldn't add blockette 1000 to one record only...

Except when you want to add event detection records or other things that occur sparsely in time.

I agree with @crotwell, this is a good example of a problem with fixed record lengths that we have suffered from for a long time. Chances are high that this problem will be exacerbated when we allow more header additions.

andres-h · 2017-07-11T17:07:15Z

On 07/11/2017 06:48 PM, Chad Trabant wrote:

> > Usually you want to add the thing to all records anyway. You wouldn't add blockette 1000 to one record only...
>
> Except when you want to add event detection records or other things that occur sparsely in time.

Event detections usually go to separate records. Anyway, having fixed-length records is just a recommendation. I'm too lazy to implement an index for random access :)
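For comparison, a simple in-memory index for variable-length records is not much code, provided each record states its own total size up front. The sketch below assumes a hypothetical layout of `[u32 total_size][payload]`; the layout and the names `build_index` and `read_nth` are made up for illustration and do not describe any existing tool.

```python
import io
import struct

def build_index(f):
    """Scan a file of [u32 total_size][payload] records; return start offsets."""
    offsets = []
    pos = 0
    while True:
        f.seek(pos)
        head = f.read(4)
        if len(head) < 4:  # end of file
            break
        (total_size,) = struct.unpack("<I", head)
        if total_size < 4:
            raise ValueError(f"bad record size at offset {pos}")
        offsets.append(pos)
        pos += total_size  # one small read per record, no block walking
    return offsets

def read_nth(f, offsets, n):
    """Random access: return the payload of the n-th record."""
    f.seek(offsets[n])
    (total_size,) = struct.unpack("<I", f.read(4))
    return f.read(total_size - 4)

def _record(payload):
    return struct.pack("<I", 4 + len(payload)) + payload

data = io.BytesIO(_record(b"a") + _record(b"bbbb") + _record(b"cc"))
index = build_index(data)
print(index)                     # [0, 5, 13]
print(read_nth(data, index, 1))  # b'bbbb'
```

The scan is linear in the number of records, but it only has to be done once per file, after which any record can be reached with a single seek.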

chad-earthscope · 2017-07-11T17:14:25Z

> Event detections go to separate records usually.

I've seen otherwise and there is no restriction in the format, i.e. any combination is allowed.

> Anyway, having fixed length records is just a recommendation. I'm too lazy to implement index for random access :)

You're welcome to use this :)
https://github.com/iris-edu/mseedindex

I still have to update the code from the currently documented schema and will do so soon, so it's a bit out of sync.

crotwell mentioned this issue Jul 10, 2017

Chunks format #14

Open

krischer mentioned this issue Jul 10, 2017

Should we have separate archiving and streaming formats? #26

Open
