Should we have separate archiving and streaming formats? #26
I'd like MiniSEED to be able to do reasonably low-latency streaming, such as sending data in 64-byte-payload data blocks. An advantage (which I already mentioned) is checksumming the archived data right at the digitizer.
I agree that the main purpose of miniSEED is archiving and exchange and, I would add, being easy to read (i.e. simple). Variable record lengths and small records get miniSEED pretty far toward reasonably low latency. So is this good enough? It may be. Earlier I was hopeful that there was a solution that would satisfy both archive/exchange and streaming with very little impact on the more important archive/exchange and simple reading of the data. As the details have been worked through I'm less and less optimistic.
This can be done regardless; even the very first Strawman had a CRC that could be set by a data generator. Maybe explain how you see the advantage with regard to streaming versus not?
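For readers outside the thread: the "CRC set by a data generator" idea amounts to a checksum computed over the whole record with the CRC field zeroed, so the value can be embedded at the digitizer and verified unchanged at the archive, provided the record is never rewritten. A minimal sketch, assuming a 4-byte CRC field at a known offset and using zlib's generic CRC-32 as a stand-in (the actual polynomial, e.g. CRC-32C, is a detail of the eventual spec):

```python
import zlib

def record_crc(record: bytes, crc_offset: int) -> int:
    """CRC over a full record with its own CRC field zeroed, so the
    generator can embed the checksum in the record it produces and any
    later reader can verify it, as long as the record is unmodified."""
    zeroed = record[:crc_offset] + b"\x00\x00\x00\x00" + record[crc_offset + 4:]
    return zlib.crc32(zeroed) & 0xFFFFFFFF
```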
On 07/10/2017 09:34 PM, Chad Trabant wrote:
> I agree that the main purpose of miniSEED is archiving and exchange and, I would add, easy to read (i.e. simple) as a goal. Variable record lengths and small records get miniSEED pretty far into the capability of reasonably low latency. So is this good enough? It may be. Earlier I was hopeful that there was a solution that would satisfy both archive/exchange and streaming with very little impact on the more important archive/exchange and simple reading of the data. As the details have been worked through I'm less and less optimistic.

You know that I have a solution.

> @andres-h
>> An advantage (which I already mentioned) is checksumming the archived data right at the digitizer.
>
> This can be done regardless, even the very first Strawman had a CRC that could be set by a data generator. Maybe explain how you see the advantage with regards to streaming versus not?

If even the "DA-blocks" are dropped, you have 2 options:

1. Use extremely short records, for example records with a 64-byte data payload. In this case most of the content is headers. Yes, you can add a checksum, but garbage like that is not suitable for archiving, so the records have to be reformatted, which makes the checksum invalid.
2. Use a different (proprietary) real-time transfer protocol, such as the Q330 protocol. Again, the data has to be converted to mseed, which makes the original Q330 checksums invalid.

Yes, you can add another checksum at the data center, but then I cannot be sure that the data is authentic anymore and that there haven't been any conversion errors.
@andres-h Can you say a bit more about your 64-byte needs? Do you mean you want to send a record with only 64 bytes of data? Or that the header plus data < 64 bytes? The existing fixed header comes in at a bit over 50 bytes depending on the length of the channel identifier, and with variable-length records you could send a 64-byte record that only had a few data points in it. Using Steim compression wouldn't work, but it wouldn't save much space for a small number of samples and would likely make it larger anyway. So sending the data encoded as 16-bit ints if they fit, or 32-bit ints if not, would work? I can envision a single record with, say, 10 samples? Or what the heck, a record with just one sample? Lots of overhead, but maybe it doesn't really matter? I presume the receiver might wish to combine and recompress records after receipt and before archiving in order to reduce the space for the repeated headers, but at least for over-the-wire sending, maybe full variable-length records are sufficient?
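The overhead trade-off described here is easy to quantify. A back-of-the-envelope sketch, assuming the ~52-byte fixed header size mentioned above and uncompressed 32-bit integer samples (both figures are assumptions for illustration):

```python
FIXED_HEADER = 52  # assumed: "a bit over 50 bytes" per the comment above

for nsamples in (1, 10, 100):
    payload = nsamples * 4                 # 32-bit ints, uncompressed
    record = FIXED_HEADER + payload
    print(f"{nsamples:3d} samples -> {record:3d}-byte record, "
          f"{FIXED_HEADER / record:.0%} header overhead")
# 1 sample   ->  56-byte record, 93% header overhead
# 10 samples ->  92-byte record, 57% header overhead
# 100 samples -> 452-byte record, 12% header overhead
```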
On 07/10/2017 09:56 PM, Philip Crotwell wrote:
> @andres-h Can you say a bit more about your 64-byte needs? Do you mean you want to send a record with only 64 bytes of data? Or that the header plus data < 64 bytes?

I think a reasonable payload for a data block would be 64 bytes, which is one Steim frame. Of course, shorter data blocks, down to 1 sample, would be possible with integer encoding, but that is too inefficient in practice.

So yes, I meant a record with 64 bytes of data.

> I presume the receiver might wish to combine and recompress records after receipt and before archiving in order to reduce the space for the repeated headers

That's one of the main problems (in addition to network bandwidth).
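For context on why 64 bytes is the natural unit: a Steim frame is 16 four-byte words, with word 0 holding the control nibbles, so how many samples it carries depends on how well the first differences compress. A rough sketch of the best and worst cases, assuming Steim-1 (the first frame of a record additionally spends two words on integration constants):

```python
WORDS = 16           # one 64-byte Steim-1 frame = 16 x 4-byte words
CONTROL = 1          # word 0 holds the 2-bit nibble codes
data_words = WORDS - CONTROL

best = data_words * 4    # every difference fits in 1 byte: 4 per word
worst = data_words * 1   # every difference needs 4 bytes: 1 per word
print(f"one frame holds roughly {worst}..{best} samples")  # 15..60
```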
@andres-h Do you see a problem with sending a single record with a single 64-byte Steim frame and just keeping the whole record as is? It ends up being a bit less than 128 bytes, I suspect. And if you are paranoid enough that you want the original CRC, and I mean that in a good way :), then wouldn't you also want the original start time and original sample rate of the segment? Those are more likely to be corrupted in some sense by the repackaging process than the data bits, since you have to do math on them. If you are saving all of that, it is really not that many extra bytes to just save the entire original record. Am I missing something?
I humbly submit that you do not know what I know ;)
On 07/10/2017 10:27 PM, Philip Crotwell wrote:
> @andres-h Do you see a problem with sending a single record with a single 64-byte Steim frame and just keeping the whole record as is? Ends up being a bit less than 128 bytes I suspect.

OK, we can assume that storage and network bandwidth are becoming cheap, but still having at least 2 times more data (not counting extra headers) without any added value is not what I want.

And we are talking only about the standard use case.

> And if you are paranoid enough that you want the original CRC, and I mean that in a good way :), then wouldn't you also want the original start time and original sample rate of the segment?

Sure, I don't want any modifications.

> Those are more likely to be corrupted in some sense by the repackaging process than the data bits since you have to do math on them. If you are saving all of that, it is really not that many extra bytes to just save the entire original record. Am I missing something?

I don't want to waste network bandwidth and storage space.

If there aren't any improvements with mseed3 (you can "solve" the latency problem with 128-byte records in mseed2), maybe FDSN should go for option #2 (blockette 1002) instead?
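The size penalty being argued over follows directly from the numbers in the thread. A sketch, assuming a ~52-byte header (an assumption carried over from earlier) and comparing one 64-byte frame per record against the same frames repackaged eight to a record for the archive:

```python
HEADER, FRAME = 52, 64                 # assumed sizes from the thread

streamed = HEADER + FRAME              # one frame per record: 116 bytes/frame
archived = (HEADER + 8 * FRAME) / 8    # eight frames per record: ~70.5 bytes/frame
print(f"streamed: {streamed} B/frame, archived: {archived:.1f} B/frame, "
      f"ratio: {streamed / archived:.2f}x")  # ~1.65x if original records are kept
```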
Do not mix record/chunk size with latency. We are talking about a data format, not a transfer protocol. If data is uncompressed, some (at least one ;-) proposals allow incremental transfer as well as tentative reading and use of the data before the entire record is transferred. (You rely on interpreting entire records only if you insist on having a checksum. However, I am not exactly sure what a checksum can tell you about a single sample.)
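To make the incremental-reading point concrete: with uncompressed data, nothing stops a reader from using samples as they arrive and deferring any checksum validation to the end of the record. A hypothetical sketch (the 52-byte header and the trailing 4-byte payload-length field are invented for illustration, not any proposal's actual layout):

```python
import socket
import struct

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes or raise if the link drops mid-record."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("link dropped mid-record")
        buf += chunk
    return buf

def stream_samples(sock: socket.socket):
    """Yield 32-bit samples as they arrive, before the record is complete.
    Hypothetical layout: a fixed header ending in a payload byte count,
    followed by uncompressed big-endian 32-bit integers."""
    header = recv_exact(sock, 52)                    # assumed header size
    (npayload,) = struct.unpack(">I", header[-4:])   # hypothetical field
    for _ in range(npayload // 4):
        yield struct.unpack(">i", recv_exact(sock, 4))[0]
    # a trailing CRC, if the format has one, is only checkable here
```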
@kaestli I think I agree with you, but I am not sure if I am missing what you are saying. Can you elaborate on what you mean by
> I think using variable length records makes
Thinking more deeply about low latency and data streaming, I am becoming more convinced that a "partial record" like in 0708 is a bad idea. If you only send whole records, maybe small ones, then you have a more or less atomic operation: either the record makes it to the other end or it doesn't. Any problems are dealt with by the sender and/or receiver deciding to do a retransmission. Simple, and it works over both a TCP and a UDP type link as long as the individual records fit into a single MTU.

But a partial record, where the station sends a header, then repeatedly sends little segments of data and finally sends a terminating footer, adds a lot of potential problems. The probability of a link dropping in the middle of the record goes way up, as it is open for a much longer period. And it is less clear how the receiver should handle a partial record. If it does not receive a terminating footer within a timeout interval, does it toss the record or append a footer that it makes up based on the data so far? And if after some time the link comes back up and the sender finishes the record, the receiver has to decide how to deal with the new data given that it may already have closed and archived or tossed the record. Does a receiver have to store all open records on restart because more data may arrive when it comes back online? Does the logger (usually a CPU- and memory-limited system) have to keep track of both which records have been sent and which parts of each record have been sent? And out-of-order data segments are an even bigger problem; that certainly would not work over UDP.

Maybe there are ways to deal with all this, but I fear that the simple partial records we have been discussing open up an ugly can of worms that we really don't want to deal with, and don't have a mandate to deal with. If the idea is that the sender sends only complete records that may be small, and the receiver is allowed to either keep them as is or combine them into fewer larger records, then there is less opportunity for problems, I feel. Maybe it wastes some bytes over the wire, but at least it is simple and robust. And this does not preclude a protocol that does something more advanced, but I don't think such a protocol belongs in the mseed spec. So I propose modifying 0708 to return to a single data block, and moving the num samples and CRC back into the fixed header.
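The atomicity argument here is essentially "one complete record per datagram". A sketch of the sending side, assuming a conservative datagram size so each record fits in a single MTU (the host, port, and size limit are placeholders, not anything from a spec):

```python
import socket

MTU_SAFE = 1200  # placeholder: payload size safely below typical path MTUs

def send_record(sock: socket.socket, addr, record: bytes) -> None:
    """One complete record per datagram: the receiver gets a whole,
    CRC-checkable record or nothing, and any loss is handled by simply
    deciding whether to retransmit that record."""
    if len(record) > MTU_SAFE:
        raise ValueError("record too large for one datagram")
    sock.sendto(record, addr)

# usage sketch (hypothetical endpoint):
# sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# send_record(sock, ("receiver.example.org", 18000), record_bytes)
```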
@crotwell "Lossless changes of record alignment and length": |
@crotwell The answers to the questions you ask (handling of partially transferred records) are application specific. If somebody is interested in low latency, he may accept some of the potential troubles of not waiting for the footer. If he is interested in simple and correct transfer, he will only act after having received the footer, and discard all non-closed records.
@kaestli While I agree that knowing the actual time a clock locked and unlocked is more useful than a generic "clock is locked" bit, the fact is we need to convert old mseed2 records that will have a "clock locked" bit set, and so we need to be able to convert those ambiguous values. I am willing to accept the ambiguity of that bit applying somehow to the whole record and splitting being more difficult, as the alternative is that 1 bit gets mapped to 12 or maybe 24 bytes. Storage is getting cheaper all the time, but it is still not free. Moreover, any attempt to map the "clock is locked" bit in mseed2 to a mseed3 value with its own timestamp is just guessing, as all you have is the bit.

The idea of "forward readability of records" is fine, and I think it is satisfied by the IRIS proposals. Do you see any places where it is not? Andreas's chunks can have that problem, as there is no order to the chunks, i.e. data could come before the sample rate and encoding, but I will let him address that.

I still feel that encouraging or designing for partial records adds complexity for archiving while not actually being a good way to solve the low latency problem.
Follow-up to #16, #22, and #25.
My 2 cents (but I have no real experience with this, so I might be wrong): MiniSEED is already kind of workable for reasonable-delay streaming. For very low-latency applications (we should consider per-sample streaming here) some other mechanism is likely needed in any case, and this goal to some extent goes against MiniSEED, whose main purpose IMHO is standardized long-term data archival and exchange.