total size of record should be easy to calculate based on fixed offsets #25

Open
crotwell opened this issue Jul 10, 2017 · 12 comments

Comments

@crotwell
Collaborator

A common need is to be able to easily read an entire record into memory or skip to the start of the next record, for example when looking for the first record that contains a given time.

The 0708 spec makes this hard because the overall length is
<header size> + sum( <data block size> ) + <termination block size>, meaning that to find the start of the next record, or to read a single record fully into memory, you must do multiple reads, with the offset of each read depending on the value returned by the previous one. I do not think this is very efficient, and I think it argues that the fixed header should contain a single field giving the total size of the record, instead of the total size being the sum of the sizes of various sub-record blocks.
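To make the cost concrete, here is a minimal Python sketch of that chained lookup. The byte layout is invented for illustration (a 2-byte header size, then blocks that each start with a 1-byte type and a 2-byte size); it is not the layout in the 0708 draft, and `build_record` and `chained_record_length` are hypothetical names.

```python
import io
import struct

# Hypothetical toy layout, NOT the 0708 draft:
#   header = uint16 header_size (includes itself) + header bytes
#   block  = uint8 type + uint16 block_size (includes this prefix) + payload
BLOCK_PREFIX = struct.Struct("<BH")
TERMINATION = 0  # assumed type code for the termination block
DATA = 1         # assumed type code for a data block


def build_record(header_extra, payloads):
    """Build a toy record: header, data blocks, then a termination block."""
    header = struct.pack("<H", 2 + len(header_extra)) + header_extra
    blocks = b"".join(
        BLOCK_PREFIX.pack(DATA, BLOCK_PREFIX.size + len(p)) + p for p in payloads
    )
    return header + blocks + BLOCK_PREFIX.pack(TERMINATION, BLOCK_PREFIX.size)


def chained_record_length(stream):
    """Return (total record size, number of reads needed to find it)."""
    start = stream.tell()
    (header_size,) = struct.unpack("<H", stream.read(2))
    reads = 1
    offset = start + header_size
    while True:
        # Each seek target depends on the size returned by the previous read.
        stream.seek(offset)
        block_type, block_size = BLOCK_PREFIX.unpack(stream.read(BLOCK_PREFIX.size))
        reads += 1
        offset += block_size
        if block_type == TERMINATION:
            return offset - start, reads


record = build_record(b"station-id", [b"a" * 100, b"b" * 50, b"c" * 7])
total, reads = chained_record_length(io.BytesIO(record))
print(total, reads)
```

In this toy layout a record with three data blocks needs five dependent reads before its length is known, and the count grows with the number of blocks.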

My feeling is that the best values to store in the fixed header would be the total size and the data size, with the extra header size either also being stored or perhaps being documented as total size minus data size. The identifier size also needs to be in there, I suppose.

Regardless of how it is done, I feel a fundamental design goal should be that you can read a small, fixed number of bytes, i.e. the size of the fixed header, and know the total size of the record without additional reads. We do not want to recreate the painful situation of having to search for a blockette 1000 to find the overall record size.
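A rough sketch of what that design goal buys a reader, in Python. The fixed header here is an assumption made up for the example (total size, data size, start and end time as epoch seconds); the field order, widths, and the names `make_record` and `find_record_containing` are hypothetical and do not come from any draft.

```python
import io
import struct

# Hypothetical fixed header: uint32 total_size, uint32 data_size,
# float64 start_time, float64 end_time (little-endian, 24 bytes).
FIXED_HEADER = struct.Struct("<IIdd")


def make_record(start, end, data, extra=b""):
    """Build a toy record: fixed header, extra headers, then data."""
    total = FIXED_HEADER.size + len(extra) + len(data)
    return FIXED_HEADER.pack(total, len(data), start, end) + extra + data


def find_record_containing(stream, when):
    """Return the byte offset of the first record whose span contains `when`."""
    offset = 0
    while True:
        stream.seek(offset)
        raw = stream.read(FIXED_HEADER.size)  # one small read per record
        if len(raw) < FIXED_HEADER.size:
            return None  # ran off the end without a match
        total_size, _data_size, start, end = FIXED_HEADER.unpack(raw)
        if total_size < FIXED_HEADER.size:
            raise ValueError(f"corrupt total_size at offset {offset}")
        if start <= when < end:
            return offset
        offset += total_size  # jump straight to the next record


records = [
    make_record(0.0, 10.0, b"x" * 40),
    make_record(10.0, 20.0, b"y" * 300, extra=b"event-flag"),
    make_record(20.0, 30.0, b"z" * 5),
]
stream = io.BytesIO(b"".join(records))
print(find_record_containing(stream, 25.0))
```

Each record costs exactly one fixed-size read regardless of how many blocks or extra headers it carries, and the extra header size falls out as total size minus data size minus the fixed header size.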

@krischer
Collaborator

I totally agree in terms of simplicity but I feel like this goes against the current draft to some extent. This would make the termination block for example a bit pointless (except maybe the CRC) - if you already know all that information at the time the fixed header is written it could just be written there.

@crotwell
Collaborator Author

Yep, and that is, I think, my argument for separating the archival record format from the on-the-wire, low-latency protocol. The receiver can almost as easily update values in the fixed header as it can append data and termination blocks, so why are we doing this? The complexity over the wire makes sense to me, but not once it is dumped to disk.

@chad-earthscope

While I agree that a simple, up-front record length is nice, I also think skipping through blocks to determine the total size is not so bad (you would have to do that to read the record anyway) if that is one of a few concessions that allows us to have a format that does both archiving and streaming. As a reader you can and should have a maximum record length you're willing to read, and fail gracefully when a record goes beyond that. Then again, I haven't yet written code to do that, so maybe I'll feel differently after that...
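A minimal Python sketch of such a capped reader follows. It assumes an invented block layout (1-byte type, 2-byte size including the prefix, type 0 as termination) and an arbitrary 4096-byte limit; `RecordTooLong`, `read_record_capped`, and `make_block` are hypothetical names, not part of any existing library.

```python
import io
import struct

BLOCK_PREFIX = struct.Struct("<BH")  # hypothetical: block type, block size
TERMINATION = 0                      # assumed type code for the final block
MAX_RECORD_BYTES = 4096              # arbitrary limit chosen by this reader


class RecordTooLong(Exception):
    """Raised when a record exceeds the size this reader will accept."""


def make_block(block_type, payload=b""):
    return BLOCK_PREFIX.pack(block_type, BLOCK_PREFIX.size + len(payload)) + payload


def read_record_capped(stream, max_bytes=MAX_RECORD_BYTES):
    """Read blocks up to the termination block, refusing oversized records."""
    chunks = []
    size = 0
    while True:
        prefix = stream.read(BLOCK_PREFIX.size)
        if len(prefix) < BLOCK_PREFIX.size:
            raise EOFError("stream ended before a termination block")
        block_type, block_size = BLOCK_PREFIX.unpack(prefix)
        if block_size < BLOCK_PREFIX.size:
            raise ValueError("block size smaller than its own prefix")
        size += block_size
        if size > max_bytes:
            # Stop before reading the payload that would exceed the limit.
            raise RecordTooLong(f"record exceeds {max_bytes} bytes")
        body = stream.read(block_size - BLOCK_PREFIX.size)
        if len(body) < block_size - BLOCK_PREFIX.size:
            raise EOFError("stream ended inside a block")
        chunks.append(prefix + body)
        if block_type == TERMINATION:
            return b"".join(chunks)


small = make_block(1, b"d" * 10) + make_block(TERMINATION)
big = b"".join(make_block(1, b"d" * 1000) for _ in range(5)) + make_block(TERMINATION)

print(len(read_record_capped(io.BytesIO(small))))
try:
    read_record_capped(io.BytesIO(big))
except RecordTooLong as exc:
    print("rejected:", exc)
```

The check happens before each payload is read, so an oversized or runaway record never consumes more than the limit in memory.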

I wouldn't be totally opposed to an up-front record length, but it does affect flexibility a bit. I can imagine we will want to use a future encoding to send 1-second data blocks, and the size of such a (compressed) encoding may not be very predictable.

Maybe we need another issue for discussion of "should we have separate archive and streaming formats"?

@andres-h
Collaborator

Easy solution: fixed record size.

@crotwell
Collaborator Author

@andres-h That implies padding, and potentially splitting records when adding an extra header or chunk if there is not enough space. I feel the splitting is particularly a problem when the added item strongly relates to the data in the record but doesn't fit.

A particularly bad instance of this was old mseed2 data that lacked a blockette 1000 and did not have room to insert it. Hopefully mseed3 never suffers from such a glaring problem, but that past experience makes me very wary of fixed record sizes.

@andres-h
Collaborator

andres-h commented Jul 10, 2017 via email

@chad-earthscope

@crotwell 👍

@crotwell
Collaborator Author

@andres-h

> In this case, the record size can be enlarged. I don't mean that virtually all records have to be the same size. Just that records in a single file should preferably be the same size. So when you add another chunk, you just make all records larger by that size (if there is not enough free space already).

So if I have a 10 MB mseed file with 20,000 mseed records in it, and I need to add a small thing to one record that doesn't have enough free space, I should expand all 20,000 records to be the same size? That seems very wasteful.
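A back-of-the-envelope version of that arithmetic in Python. The 512-byte record size (which makes 20,000 records come to roughly 10 MB) and the 64-byte addition are assumed numbers picked for illustration, not figures from the thread.

```python
def growth_fixed(n_records, needed, free=0):
    """Extra bytes when every record must grow so all stay the same size."""
    return n_records * max(0, needed - free)


def growth_variable(needed, free=0):
    """Extra bytes when only the one affected record grows."""
    return max(0, needed - free)


N_RECORDS = 20_000
RECORD_SIZE = 512  # assumed: 20,000 x 512 bytes is about 10 MB
NEEDED = 64        # assumed size of the small item being added

file_size = N_RECORDS * RECORD_SIZE
fixed = growth_fixed(N_RECORDS, NEEDED)
variable = growth_variable(NEEDED)

print(f"file size:        {file_size:,} bytes")
print(f"fixed-size cost:  {fixed:,} bytes ({fixed / file_size:.1%} of the file)")
print(f"variable cost:    {variable:,} bytes")
```

Under these assumptions, keeping all records the same size costs 1,280,000 extra bytes (12.5% of the file) to store a single 64-byte item.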

@andres-h
Collaborator

andres-h commented Jul 11, 2017 via email

@chad-earthscope

> Usually you want to add the thing to all records anyway. You wouldn't add blockette 1000 to one record only...

Except when you want to add event detection records or other things that occur sparsely in time.

I agree with @crotwell, this is a good example of a problem with fixed record lengths that we have suffered from for a long time. Chances are high that this problem will be exacerbated when we allow more header additions.

@andres-h
Collaborator

andres-h commented Jul 11, 2017 via email

@chad-earthscope

> Event detections go to separate records usually.

I've seen otherwise and there is no restriction in the format, i.e. any combination is allowed.

> Anyway, having fixed-length records is just a recommendation. I'm too lazy to implement an index for random access :)

You're welcome to use this :)
https://github.com/iris-edu/mseedindex

I still have to update the code from the currently documented schema and will do so soon, so it's a bit out of sync.
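For readers wondering what an index for random access over variable-length records involves, here is a minimal Python sketch. It is not how mseedindex works and does not use its schema; it assumes a made-up format in which every record starts with a 4-byte total length, and `build_offset_index` and `read_record` are hypothetical names.

```python
import io
import struct

LENGTH_PREFIX = struct.Struct("<I")  # hypothetical: uint32 total record size


def make_record(payload):
    return LENGTH_PREFIX.pack(LENGTH_PREFIX.size + len(payload)) + payload


def build_offset_index(stream):
    """Scan once and return a list of (offset, total_size) per record."""
    index = []
    offset = 0
    while True:
        stream.seek(offset)
        raw = stream.read(LENGTH_PREFIX.size)
        if not raw:
            return index  # clean end of file
        if len(raw) < LENGTH_PREFIX.size:
            raise EOFError("truncated length prefix")
        (total,) = LENGTH_PREFIX.unpack(raw)
        if total < LENGTH_PREFIX.size:
            raise ValueError(f"corrupt length at offset {offset}")
        index.append((offset, total))
        offset += total


def read_record(stream, index, i):
    """Fetch record i directly using the index, with one seek and one read."""
    offset, total = index[i]
    stream.seek(offset)
    return stream.read(total)


records = [make_record(b"hello"), make_record(b"x"), make_record(b"seismic")]
stream = io.BytesIO(b"".join(records))
index = build_offset_index(stream)
print(index)
print(read_record(stream, index, 2) == records[2])
```

With fixed-size records the offset of record i is simply i times the record size; with variable sizes the same lookup needs a one-time scan like this, or an external index kept alongside the file.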


4 participants