Requirement: Alongside the new format, a protocol for (near) real-time data exchange needs to be specified #25

Open
krischer opened this issue Jan 3, 2018 · 40 comments

@krischer
Contributor

krischer commented Jan 3, 2018

Alongside the new format, a protocol for (near) real-time data exchange needs to be specified.

@andres-h

andres-h commented Jan 6, 2018

How is this an NGF requirement? Are we talking about a protocol for (near) real-time data exchange that uses NGF for the data payload?

The requirement should be clearer -- use NGF for (near) real-time data exchange or not.

@chad-earthscope
Member

The requirement should be clearer -- use NGF for (near) real-time data exchange or not.

Agreed. I think the requirement for NGF should be more along the lines of "usable for (near) real-time data", which is a really easy requirement to meet, to be honest. More important is the implied guideline that format designs should consider real-time streaming as an important use case.

For what it's worth, this has been a bullet point in the requirements document the working group has reviewed since Nov 13, and it was also in the much earlier white paper. I believe it is the point covering the general "it should work for real-time streaming uses" concern that was voiced very early in this process.

@andres-h

andres-h commented Jan 8, 2018

I'm in favor of designing NGF with (near) real time data exchange in mind as well, which of course leads to latency/overhead requirements.

@crotwell

crotwell commented Jan 8, 2018

I am not convinced that a really good near real-time protocol would want to use a single-channel-per-record format: the latency is artificially high due to waiting to collect enough samples to justify sending a record, and the things that are good for storage, like every record knowing its channel, sample rate, etc., add a lot of overhead.

A system that batched samples from many channels at one point in time into a single NRT record could be much more efficient. This might be converted to NGF on the receiving end, but would bear little relationship to it over the wire.
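
As a rough sketch of what such a batched NRT record could look like (the layout and field names below are purely illustrative assumptions, not anything proposed in this thread or in the IRIS draft):

```python
import struct
import time

# Hypothetical multiplexed NRT record, for illustration only: one shared
# timestamp and channel count, then (channel_id, nsamples, samples) per
# channel, so per-record metadata is paid once instead of once per channel.
def pack_nrt_record(batch):
    """batch: list of (channel_id, samples) pairs for the same time slice."""
    parts = [struct.pack(">QH", time.time_ns(), len(batch))]
    for channel_id, samples in batch:
        parts.append(struct.pack(">HH", channel_id, len(samples)))
        parts.append(struct.pack(f">{len(samples)}i", *samples))
    return b"".join(parts)

# e.g. 10 channels x 10 new samples in one small record, instead of waiting
# to fill a 512-byte single-channel record for each of the 10 channels
record = pack_nrt_record([(ch, list(range(10))) for ch in range(10)])
```

Static metadata (sample rate, location, etc.) would then live in a session handshake, and the receiver would convert the stream back to a per-channel format like NGF for archiving.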

I am not opposed to limited consideration of latency in the design of NGF, but would prefer it to be very secondary to more traditional use cases as I suspect that eventually a completely separate protocol would be needed, having little relation to NGF. I feel the two problems are both important but orthogonal in their needs.

@andres-h

andres-h commented Jan 8, 2018

I am not convinced that a really good near real-time protocol would want to use a single-channel-per-record format: the latency is artificially high due to waiting to collect enough samples to justify sending a record

That's why I suggested the so-called "sub-record streaming", which means you don't have to wait until the record is complete before starting to send data. (The NRT protocol would provide a mechanism for multiplexing the fragments.)
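
A minimal sketch of the multiplexing mechanism (the framing below is an invented illustration, not the actual scheme, which is described in the wiki page linked later in this thread):

```python
import struct

# Invented fragment framing, for illustration only: each fragment carries a
# stream ID so fragments of records from many channels can be interleaved,
# and a flag marks the fragment that completes a record.
FLAG_LAST = 0x01

def pack_fragment(stream_id, payload, last=False):
    return struct.pack(">HBH", stream_id, FLAG_LAST if last else 0,
                       len(payload)) + payload

class Reassembler:
    """Receiver side: collect fragments until a record is complete."""
    def __init__(self):
        self.partial = {}  # stream_id -> payload chunks received so far

    def feed(self, frag):
        stream_id, flags, length = struct.unpack(">HBH", frag[:5])
        self.partial.setdefault(stream_id, []).append(frag[5:5 + length])
        if flags & FLAG_LAST:  # record complete: hand it to the archive
            return stream_id, b"".join(self.partial.pop(stream_id))
        return None  # record still in flight, keep waiting
```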

the things that are good for storage, like every record knowing its channel, sample rate, etc., add a lot of overhead.

Indeed, every miniSEED record is self-contained. I suppose this is one of the requirements? If so, small records are inefficient because of all the metadata you have to repeat in each and every record.

I am not opposed to limited consideration of latency in the design of NGF, but would prefer it to be very secondary to more traditional use cases as I suspect that eventually a completely separate protocol would be needed, having little relation to NGF.

OK, suppose another format would be designed for (near) real-time transfer. Next generation SeedLink would then use this other format. What would be the distinctive properties of NGF that make me want to use NGF instead of that other format?

@chad-earthscope
Member

@andres-h:
I'm in favor of designing NGF with (near) real time data exchange in mind as well, which of course leads to latency/overhead requirements.

Regarding the latency/overhead requirements, I think the NGF should be required to "support streaming latency as good as supported by miniSEED 2, with similar overhead." Ideally, NGF would make some improvements in this regard, but that should not be a requirement.

I am not opposed to limited consideration of latency in the design of NGF, but would prefer it to be very secondary to more traditional use cases as I suspect that eventually a completely separate protocol would be needed, having little relation to NGF. I feel the two problems are both important but orthogonal in their needs.

I agree with what @crotwell wrote above. The primary use case for miniSEED/NGF is permanent archival, exchange, and subsetting/selection of data. We should absolutely make NGF as amenable as possible to real-time streaming (with considerations for low latency), but not to the degree that features made for real-time streaming make the primary use case more complex. The data will be read (for exchange, conversion, processing, etc.) much more often after real time.

As @crotwell writes, if we really wanted to design something for low latency we can do a lot better than single-channel miniSEED records. Any such new wire protocol should be convertible to NGF.

For the record, here is the discussion we had in July 2017 about this topic.

@andres-h

andres-h commented Jan 8, 2018

I think the NGF should be required to "support streaming latency as good as supported by miniSEED 2, with similar overhead." Ideally, NGF would make some improvements in this regard, but that should not be a requirement.

If we do the voting, why not add it as a requirement? Everyone can just vote NO...

If we really wanted to design something for low latency we can do a lot better than single-channel miniSEED records. Any such new wire protocol should be convertible to NGF.

OK, I'm a seismologist who needs low-latency data. How do I get it from IRIS, GEOFON, etc.?

@crotwell

crotwell commented Jan 8, 2018

Maybe we are talking about different things; I thought this meant low-latency data from an instrument into the network operator. Low latency from a data center out to a seismologist is a different issue and probably more than we want to tackle, at least beyond the idea of NGF being as usable as miniSEED in something like SeedLink.

@andres-h

andres-h commented Jan 8, 2018

Hmm, what's the point of transferring low-latency data from the instrument to the network operator if nobody can use the data anyway?

@chad-earthscope
Member

OK, I'm a seismologist who needs low-latency data. How do I get it from IRIS, GEOFON, etc.?

Similar/same as now. Next generation SeedLink or other protocol that streams (near) real-time NGF. The latency is relatively low with the already existing services and data format, which, for a lot of research/monitoring usage, is good enough. So it depends on what you mean by "low-latency"; taken to the extreme that could mean transmitting each sample as it is recorded, which is a non-trivial task that could not be justified in a lot of operational data centers.

So the question is what is "good enough". My answer for NGF is "same general capability as miniSEED 2, and try to improve it as much as possible without detrimentally affecting the primary usage".

@jmsaurel

I tend to agree with the idea of having NGF support low latencies similar to miniSEED 2.4, and, as @crotwell says, low latency seems to me to mainly involve communication between the digitizer and the data acquisition/processing center.

While you can have some control over that latency because, among other things, you can control the path of the data between the digitizer and your center (by using leased lines, dedicated WiFi links, ...), there is practically no chance that a data center can guarantee or have any control over the latency introduced by the internet between it and the user.

So, yes,

same general capability as miniSEED 2, and try to improve it as much as possible without detrimentally affecting the primary usage

seems enough to me.

And I don't think the NGF should be designed with low-latency applications (such as EEW) in mind. This would probably be better covered with a speciality format and protocol.

@andres-h

While you can have some control over that latency because, among other things, you can control the path of the data between the digitizer and your center (by using leased lines, dedicated WiFi links, ...), there is practically no chance that a data center can guarantee or have any control over the latency introduced by the internet between it and the user.

True, but many digitizers support lower latencies than is possible with SeedLink. You can use dedicated lines, but as soon as there is SeedLink anywhere in the chain, you must expect much larger latencies.

I remember Chad told me about a network that modified SeedLink to use record sizes smaller than 512 bytes to cope with the problem. Is using ultra-small record sizes, with their huge header overhead, really the proper solution?

If NGF has the same capabilities as miniSEED 2, it does not sound like "next generation" to me. With the current IRIS proposal I don't see any features that make me say "wow, NGF is much better than miniSEED 2, I want to use it". Probably we will not adopt NGF and will continue to use miniSEED 2.

@jmsaurel

Probably we will not adopt NGF and will continue to use miniSEED 2

For the same usage as miniSEED2 (permanent classic networks, sparse station density, small number of channels on each station), yes, I agree.
And this is why miniSEED2 should continue to be supported.

If I remember correctly, NGF is primarily needed for high-density deployments and a large number of temporary deployments.
There are not so many things to change in miniSEED2 for current permanent, classic networks.

many digitizers support lower latencies than is possible with SeedLink

The drawback being that they use proprietary protocols or formats to achieve that, which leads people to prefer using SeedLink and tweaking it.

It goes outside of this issue and discussion, but I think, ideally, it would be good to have a set of solutions from the FDSN:

  • miniSEED2 and SeedLink for classic legacy permanent networks
  • NGF and SeedLink for high-density deployments and temporary experiments
  • format X and protocol X for ultra-low-latency applications (which could be a sub-version of NGF and SeedLink)

@jsaul

jsaul commented Jan 11, 2018

IMHO ultra-low latency requires (and deserves!) a dedicated format/protocol that goes far beyond the scope of both miniSEED and NGF. It is also beyond the scope of the FDSN and its data centers.

The latency is relatively low with the already existing services and data format, which, for a lot of research/monitoring usage, is good enough

+1

For practically any use except EEW the latencies are already low enough.

@andres-h

IMHO ultra-low latency requires (and deserves!) a dedicated format/protocol that goes far beyond the scope of both miniSEED and NGF. It is also beyond the scope of the FDSN and its data centers.

I'm not speaking about ultra-low latencies, just about making it possible to transfer each 64-byte frame individually, which gives 7 times lower latency than 512-byte records (or 63 times lower latency than 4096-byte records) without extra overhead, and is easy to implement.
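
The arithmetic behind those factors, assuming a 64-byte fixed header so that a 512-byte record carries 7 data frames and a 4096-byte record carries 63:

```python
FRAME = 64  # Steim frame size in bytes

for record_size in (512, 4096):
    data_frames = record_size // FRAME - 1  # one frame's worth goes to the header
    # Filling the whole record before sending costs `data_frames` frame-times;
    # sending each frame as soon as it is full costs one frame-time.
    print(f"{record_size:5d}-byte record: {data_frames:2d} data frames "
          f"-> {data_frames}x the latency of per-frame transfer")
```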

For practically any use except EEW the latencies are already low enough.

So is miniSEED 2.

@jsaul

jsaul commented Jan 11, 2018

For practically any use except EEW the latencies are already low enough.

So is miniSEED 2.

Indeed!

@crotwell

I'm not speaking about ultra-low latencies, just about making it possible to transfer each 64-byte frame individually, which gives 7 times lower latency than 512-byte records (or 63 times lower latency than 4096-byte records) without extra overhead, and is easy to implement.

@andres-h Don't variable-length records (and not just powers of 2), as in #15, do exactly this?

@andres-h

Don't variable-length records (and not just powers of 2), as in #15, do exactly this?

Not exactly. Using #15 you'd have to attach a full NGF header to each 64-byte data block, which adds another 64 bytes at least (50% overhead). The overhead can be much higher with "extra headers".
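
The overhead figure follows directly; with the optional "extra headers" the header portion only grows (the 128-byte case below is an arbitrary example):

```python
BLOCK = 64  # one 64-byte data block (e.g. one Steim frame)

for header in (64, 128):  # minimal fixed header, or with ~64 bytes of extra headers
    overhead = header / (header + BLOCK)
    print(f"{header}-byte header per {BLOCK}-byte block: {overhead:.0%} overhead")
# -> 50% and 67%
```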

@krischer
Contributor Author

Moving the CRC and the number of samples to a footer as has been proposed a number of times would make this possible, right? The only downside I see is slightly more complicated and slightly more expensive parsing.

@andres-h

Moving the CRC and the number of samples to a footer as has been proposed a number of times would make this possible, right?

In short: yes, but you have to know the length of the record beforehand.

I've summarized the options in the document that I linked above (https://github.com/iris-edu/mseed3-evaluation/wiki/Chunks#sub-record-streaming).

@krischer
Contributor Author

In short: yes, but you have to know the length of the record beforehand.
I've summarized the options in the document that I linked above (https://github.com/iris-edu/mseed3-evaluation/wiki/Chunks#sub-record-streaming).

Thanks for this comprehensive summary! So adding a footer + record length in the header would be (at least from a high level) similar to a sub-record with an always-present archive record header? In my opinion, always adding the header would be an acceptable compromise, and it would also make the work of loggers/archivers somewhat similar. Super-low-latency systems will likely always need some custom solution.

@chad-earthscope
Member

chad-earthscope commented Jan 17, 2018

The concept of header + data blocks + footer was discussed in a specification last July here:
iris-edu/mseed3-evaluation#21

This was the merger of the IRIS proposal (at the time), the concept by @kaestli (of ETHZ) presented at the early 2017 meeting, and refinements from the July 2017 discussion.

Ultimately I came to regard this solution as non-ideal, as it traded a relatively large amount of complexity for the single goal of being able to read data blocks as they were produced. The question is how much complexity is worth adding for low-latency usage. In my opinion, very little. If we want a solution for low-latency data flow, beyond allowing variable-length and small records, we should do it in a different format.

Edit: Was referring to a meeting in 2017 and not 2016, my bad. This conversation has now spanned 3 years!

@crotwell

I agree with @chad-iris. The footer idea adds significant complexity for the data center and end-user use cases without really fully addressing the EEW low-latency real-time needs. I feel that variable-length records go far enough toward medium-latency real-time data without compromising the other uses.

@kaestli

kaestli commented Jan 26, 2018

This requirement, as well as (partly) the discussion, mixes up data format and transfer protocol. Transfer protocols of different complexity are in place -- from serial to TCP/IP -- including (or not including) solutions for packetizing, routing, ordering, completeness and corruption control, etc. No need to re-invent those.

The requirement on the data format is that it not add to the drawbacks. If real-time transmission is the requirement, and samples become available from the measuring device in a time-sequential manner, the strongest possible requirement for real-time applications is that you can transfer, at least tentatively, a single sample (reducing time delays to those implied by the transfer protocol).

For the format, this means that

  1. it should allow samples to be written one by one, and read one by one
  2. all meta-information needed to interpret the samples should be at the beginning of a record, and all meta-information known only upon completion of the measurement of the last sample should be positioned (transferred) after the samples.

Point 1, as far as I know about compression techniques, implies uncompressed data (at least in one flavour of the format). This is not a strong requirement: if you are interested in RT transfers, you would dimension your bandwidth for uncompressed data anyway, as the worst case of compressed data is uncompressible data.

Point 2 implies that a record header defines the (intended) length (in bytes, not samples) of the data section (to allow distinguishing the last data byte from the first byte of the footer while reading forward), while the number of valid samples and the CRC should go into the footer. (Note that, this way, a streaming client would in the worst case need to discard padding bytes a posteriori at the end of the record, in those rare cases when a record was not completed as intended when its header was written.) Otherwise, data transfer and interpretation could only be started at the point in time when the final number of samples is written into the header of a record.
Note that the CRC of the data format has no relevance for streaming applications (as integrity can be controlled by the transfer protocol). It is used to detect data corruption due to upstream processing steps or long-term storage.
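
A minimal sketch of the layout described above (field widths and order are assumptions for illustration, not the draft NGF layout): the header states the intended byte length of the data section, and the valid sample count plus CRC follow the data:

```python
import io
import struct
import zlib

# Illustrative forward-writable record: header = intended data length in
# bytes; footer = valid sample count + CRC, known only after the last sample.
def write_record(stream, samples, intended_bytes):
    stream.write(struct.pack(">I", intended_bytes))        # header, known up front
    data = struct.pack(f">{len(samples)}i", *samples)      # uncompressed samples
    stream.write(data.ljust(intended_bytes, b"\x00"))      # pad if cut short
    stream.write(struct.pack(">II", len(samples), zlib.crc32(data)))  # footer

def read_record(stream):
    (intended_bytes,) = struct.unpack(">I", stream.read(4))
    data = stream.read(intended_bytes)   # can be consumed while still streaming in
    nsamp, _crc = struct.unpack(">II", stream.read(8))
    return struct.unpack(f">{nsamp}i", data[:nsamp * 4])   # drop padding a posteriori

buf = io.BytesIO()
write_record(buf, list(range(90)), intended_bytes=400)  # record cut short: 90 of 100 samples
buf.seek(0)
samples = read_record(buf)  # -> the 90 valid samples, padding discarded
```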

@krischer
Contributor Author

krischer commented Jan 29, 2018

Summary

(Please let me know if I missed a point or misunderstood something)

Some clarifications to the questions are available here: #25 (comment)

This is a long discussion with no clear solution. Thus I'll simplify it a bit for now. Please vote on:

  1. Should suitability for real-time transfer be a major design goal of the new format? Otherwise it would be designed to be as suitable for real-time transfer as reasonable without compromising other use cases. (Yes/No)
  2. Assuming a no on (1): Should some fixed header fields be moved to a footer? (Yes/No)
  3. Should a different approach like the mentioned "sub record streaming" be investigated in more detail? (Yes/No)

@chad-earthscope
Member

  1. No. Real-time streaming with features such as sub-record write and readability prior to record completion is not a primary goal.
  2. No. A footer adds complexity to the vastly dominant use case with NGF, which is reading and writing in non real-time.
  3. Probably, in a different format though.

@ketchum-usgs

  1. No, real time is a very different problem
  2. No, everything in the header
  3. No

@crotwell

1 no
2 no
3 not within this NGF discussion/specification

@kaestli

kaestli commented Jan 30, 2018

  1. YES - it can easily be covered by records which are forward writable and forward readable
  2. YES - a well defined footer position implies no overhead compared to a well defined header position.
  3. YES

(Note that the questions actually do not refer to the title of the issue [protocol], but rather to the adequacy of the format for such a protocol. Which I think is completely fine.)

@ozym

ozym commented Jan 30, 2018

  1. No
  2. No
  3. No

@claudiodsf

claudiodsf commented Jan 31, 2018

  1. Should suitability for real-time transfer be a major design goal of the new format? Otherwise it would be designed to be as suitable for real-time transfer as reasonable without compromising other use cases. (Yes/No)

No

  2. Assuming a no on (1): Should some fixed header fields be moved to a footer? (Yes/No)

No

  3. Should a different approach like the mentioned "sub record streaming" be investigated in more detail? (Yes/No)

Seems irrelevant for the discussion here: this question concerns the streaming protocol, not the NGF format

@ihenson-bsl

  1. Yes
  2. No
  3. Yes

@jmsaurel

jmsaurel commented Feb 1, 2018

@krischer, I'm afraid it may be too late given that several people have already voted, but reviewing your summary yesterday with the RESIF colleagues, we found the first question a little bit confusing.

If I remember the discussion, there was some kind of consensus that NGF would provide the same real-time capabilities (in terms of latency) as the current miniSEED together with SeedLink.

  1. Should suitability for real-time transfer be a major design goal of the new format? Otherwise it would be designed to be as suitable for real-time transfer as reasonable without compromising other use cases. (Yes/No)

I understand your first question as asking whether or not some effort needs to be put toward a very low-latency protocol, whereas my colleagues (and I think this is the sense of @claudiodsf's answer) understood it as asking whether or not NGF should provide basic real-time capabilities at the same level as the current miniSEED.

  2. Assuming a no on (1): Should some fixed header fields be moved to a footer? (Yes/No)

I'm more confused by this one, as I thought the footer was a possible solution to achieve lower latency and thus, as I understand the first question, it should apply only if Yes on (1).

Could you please clarify the questions a little bit so that we (RESIF and our representative, @claudiodsf) can be sure our answers correctly reflect our understanding of the problem?

@krischer
Contributor Author

krischer commented Feb 1, 2018

If I remember the discussion, there was some kind of consensus that NGF would provide the same real-time capabilities (in terms of latency) as the current miniSEED together with SeedLink.

    1. Should suitability for real-time transfer be a major design goal of the new format? Otherwise it would be designed to be as suitable for real-time transfer as reasonable without compromising other use cases. (Yes/No)

I understand your first question as asking whether or not some effort needs to be put toward a very low-latency protocol, whereas my colleagues (and I think this is the sense of @claudiodsf's answer) understood it as asking whether or not NGF should provide basic real-time capabilities at the same level as the current miniSEED.

I intended it the way you understand it. All the discussions we had during the last weeks led to the conclusion that NGF will be at least as suitable for near real-time applications as the current miniSEED. This question thus asks whether NGF should be designed with even lower-latency applications in mind, even if that leads to compromises for some other use cases.

  2. Assuming a no on (1): Should some fixed header fields be moved to a footer? (Yes/No)

I'm more confused by this one, as I thought the footer was a possible solution to achieve lower latency and thus, as I understand the first question, it should apply only if Yes on (1).

A yes to (1) would directly imply something like (2) as very low latency is otherwise not achievable. We directly discussed the potential of a footer in this thread and so I thought it might be worthwhile to ask if this would be a compromise the community wants. But you are right - it is slightly confusing.

I'll add a link to this clarification to the above summary.

@chad-iris @ketchum-usgs @crotwell @kaestli @ozym @claudiodsf @ihenson-bsl Can you please review your answers in case our understanding of my questions differed?

@chad-earthscope
Member

@krischer, @jmsaurel Thanks for clarifying. My votes are not changed.

By voting "No" for 1 I mean exactly "that NGF will be at least as suitable for near real-time applications as the current miniSEED" and NGF should not be designed for lower latency applications that require features that compromise the primary use cases.

@ozym

ozym commented Feb 2, 2018

No change here.

@crotwell

crotwell commented Feb 2, 2018

What @chad-iris said.
No vote change for me either.

@claudiodsf

I therefore changed my vote.

@ValleeMartin

  1. No
  2. No
  3. Not within this NGF discussion/specification

@JoseAntonioJara

1 No
2 No
3 Why not, but not within this NGF discussion/specification
