miniSEED3 draft 20170622 specification target #2
Some thoughts/comments on DRAFT20170622.
Notion of “backwards-compatibility” in data formats: This is really tricky as semantic versioning as applied to software cannot be applied to data formats. The only thing that could safely be considered backwards compatible in the sense that old software can read new versions of the format are completely optional additions that do not change the semantics of the other data, i.e. it must be completely safe to ignore. I wonder what that would be in a minimal-by-design data format like the new miniSEED. The conclusion to that would be to get rid of a major/minor version number but just have a monotonically increasing integer version number. Or do I miss something here?
Fixed Header
Section 4: FDSN Identifiers
Section 5: Definition of transient data payload sub-header
Is this really necessary? This should IMHO be delegated to some lower level protocol. TCP can do multiplexing, as can HTTP/2, and who knows what will be out in 10 years.
Section 6: Definition of channel codes
I would like to define the
Section 7: Data encoding codes
Why keep the legacy codes?
Section 8: Definition of reserved, extra header fields
|
On Thursday 2017-06-29 23:10, Lion Krischer wrote:
Notion of “backwards-compatibility” in data formats: This is really tricky as semantic versioning as
applied to software cannot be applied to data formats. The only thing that could safely be considered
backwards compatible in the sense that old software can read new versions of the format are completely
optional additions that do not change the semantics of the other data, i.e. it must be completely safe to
ignore. I wonder what that would be in a minimal-by-design data format like the new miniSEED. The
conclusion to that would be to get rid of a major/minor version number but just have a monotonically
increasing integer version number. Or do I miss something here?
It is hard to predict what will be needed in the forthcoming decades.
The new SEED format should be used even for the archival of
non-seismologic data. Who knows what other communities might need.
Some things are specific to manufacturers. For example, there have been
complaints that percentage-based timing quality is useless. Manufacturers
could add their own specific timing quality info.
Besides, blockettes are IMO what make SEED SEED. If blockettes are
replaced by another extension mechanism, which is even inferior
(keys-values), the format should not be called SEED anymore.
If blockettes or similar are used, then a version number would basically
not be needed, because new revisions of the standard just add new
blockettes. This also follows the principles of object-oriented design
(e.g., the open/closed principle).
Why keep the legacy codes?
It would be nice to have a possibility to include MS2 data without
modification. We should IMO get rid of the byteorder bit, though, which
is ambiguous and has caused so much pain in MS2. Use a fixed byteorder
in the header and different encoding types for big-endian and
little-endian variants of data encodings.
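A minimal sketch of the idea, in Python: the byte order is implied by the encoding code itself, so no separate byte-order flag is needed. The code numbers `3` and `43` and the function name `decode_samples` are purely illustrative assumptions, not values from any draft.

```python
import struct

# Hypothetical encoding codes (assumptions for illustration only):
# each code fixes both the sample type and its byte order.
ENCODINGS = {
    3: ">i",   # assumed: 32-bit signed integer, big-endian
    43: "<i",  # assumed: 32-bit signed integer, little-endian
}


def decode_samples(encoding: int, payload: bytes) -> list[int]:
    """Decode a payload of 32-bit integers using the byte order
    implied by the encoding code, with no byte-order flag involved."""
    fmt = ENCODINGS[encoding]
    # iter_unpack raises struct.error if the payload length is not
    # a multiple of the sample size.
    return [value for (value,) in struct.iter_unpack(fmt, payload)]
```

A reader then only needs the encoding code to interpret the payload, and encodings for which byte order is meaningless simply do not come in two variants.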
|
Some things are specific to manufacturers. For example, there have been complaints
that percentage-based timing quality is useless. Manufacturers could add their own
specific timing quality info.
One more thing that would be nice to have is an optional "instrument ID"
that unambiguously identifies the response. Could have separate IDs for
sensor and datalogger that are manufacturer-specific or refer to NRL.
|
Your logic is what I tried to capture in the field description, and that means the minor version would only be updated when new reserved extra headers are added or maybe additions of data payload encodings or maybe new namespaces for identifiers. I changed what was a monotonically increasing integer to a major.minor to allow for those cases and what @andres-h said:
A future major version may have more reasons for minor versioning. Of course, this decision also has effects for the software ecosystem supporting the format. Updating major versions will very likely break any software downstream of a producer, which would be a big ripple and probably mean we do not do a major version update often (a good thing from a format perspective). A minor version allows, for example, adding a general compressor encoding in 3.1 while allowing 3.0 readers to continue to read what they are able to read, and provides time for updating downstream software that would not immediately see the new additions anyway. It is a concession though, in that it reduces the major versioning from ~253 to 23 versions. I think there are legitimate arguments either way and, while I lean toward the minor version addition at the moment, would go with this group if any consensus emerges.
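To make the major.minor argument concrete, here is a small Python sketch of how a reader might decide what to do with a record. The names `FormatVersion` and `can_read` and the three result strings are assumptions for illustration; the draft does not define such an API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatVersion:
    major: int
    minor: int


def can_read(reader: FormatVersion, record: FormatVersion) -> str:
    """Classify how well a reader built for `reader` can handle a
    record written as `record` (hypothetical policy, not from the draft)."""
    if record.major != reader.major:
        # A major bump may change layout or semantics: do not guess.
        return "unsupported"
    if record.minor > reader.minor:
        # Newer minor: only optional additions (e.g. a new encoding or
        # reserved extra header), so read what is understood.
        return "partial"
    return "full"
```

Under this policy a 3.0 reader given a 3.1 record reports `"partial"`: it can still use everything it knows about and skips, for example, a payload encoding it does not recognize.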
The extra header "TimeLeapSecond" is exactly what you describe. If you mean a bit flag in the fixed header, then I am strongly against it. That's what we had before and it is a total and utter mess because when it's in the fixed header it's a required bit and in the vast majority of cases it's set incorrectly. I never believe this bit. The solution is to make it something that is proactively added by a data generator.
Unless there is a reason for not using 0, I think we should stick with it for two reasons: a) it's what we use now and b) NaN makes usage just a tad harder, as programmers then need to specify NaN in whatever language, whereas testing for zero is dead simple in every language. An aside on where I'm coming from: over the last couple of years spent thinking about the next generation of miniSEED I have come around to a philosophy that it should be as complex as it needs to be and not any more. Put another way, keep every aspect as absolutely simple as possible to achieve the real needs. The motivating factors are usability as broadly across computing environments as possible and ensuring future use. When you see someone implementing a miniSEED parser in JavaScript (haha, that was a funny notion 10 years ago) you see all the barriers, small but present, that get in the way. Probably most would agree with the general statement; I write it so you know where I'm coming from when I say things like stick with zeros instead of NaNs.
Hhmm, interesting. It could make for cluttered looking IDs, but I see the use. Curious what others think, I'll ask around the DMC.
Practical reasons. These identifiers are in each record, lots of redundancy. In our domain, folks are used to "reading" or "saying" the identifiers, where concise is valuable. The current lengths are designed to fit the future needs we can see: 8 character networks can be much more meaningful (can include 4 character years for temporary networks), station code hasn't been a problem, location holds what is needed to identify nodes in arrays of 100,000s of sensors and channel now has room for a lot more instrument identifiers. In a way, it's now double the address space. Also, in the future we can decide on a new namespace for identifiers, e.g. "FDSN2:", and come up with expanded or completely new schemes. This variable length identifier concept is one of the best changes we've discussed in next generation miniSEED in my opinion. Provides a lot of future-proofness. This question needs to be turned around: why should we give the identifiers more space?
Yes, and I think we should define at least some convention but maybe even a required use semantics. The specification is probably the right place for this. Suggestions welcome.
Perhaps there is some confusion? I did not mean the "FDSN:" namespace identifier, I meant the literal "urn:" that is part of a formal URN. Similar to how you would need to add a "doi:" prefix to a value in a field that is already identified as a DOI.
Yeah, I agree with you and @andres-h, this was a misfire. I was attempting to provide a way to uniformly multiplex record fragments over any general communications link, but I'm happy to relegate it to the transmission protocol.
Sounds like a decent proposal to me. Using an X band eliminates the ability to denote the band, but it's a coarse definition anyway. Since that is a completely new channel definition I suggest this goes to FDSN WG II as a proposal and not something we conflate with the format specification. There is already a lot of new format layout conflated with new format semantics, I suggest separating what we can.
What @andres-h said, for forward compatibility without re-encoding. At the DMC we have converted almost all of the data in those legacy encodings to Steim# as a step towards getting rid of them, but providing the path forward is still needed. Perhaps the wording can be stronger, instead of "not recommended", it could be "deprecated, do not use for new data".
Agreed. I'll remove the bit about treating them case insensitively, such that they need to match exactly.
Adds a bit of bloat to them. Perhaps we should require that any non-reserved headers include a namespace? And let the default be FDSN reserved. It would be a step towards keeping potential conflicts at bay. To avoid conflicts completely we'd probably need a registry of namespaces managed by the FDSN. Thoughts? |
Good idea @andres-h. Some people may be concerned that the fixed order is not ideal given that architectures (embedded, etc.) vary, but the clarity is probably worth it. Got a preferred byte order for the values in a fixed order? |
Good idea @andres-h. Some people may be concerned that the fixed order is not ideal given that
architectures (embedded, etc.) vary, but the clarity is probably worth it.
Got a preferred byte order for the values in a fixed order?
Normally I would prefer big-endian, because that is the canonical
"network byte order", however, standard varint is AFAIK little-endian,
so if we use varints, we should consider using little-endian everywhere.
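For reference, the "standard varint" meant here is presumably the base-128 scheme used by Protocol Buffers (LEB128 for unsigned values), where the least significant 7-bit group is written first. A minimal Python sketch, with function names chosen only for illustration:

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint,
    least significant 7-bit group first."""
    if value < 0:
        raise ValueError("varint sketch handles non-negative integers only")
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)  # high bit set: more groups follow
        else:
            out.append(group)         # high bit clear: last group
            return bytes(out)


def decode_varint(data: bytes) -> tuple[int, int]:
    """Decode one varint from the start of `data`.
    Returns (value, number of bytes consumed)."""
    result = 0
    shift = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, index + 1
        shift += 7
    raise ValueError("truncated varint")
```

Here `encode_varint(300)` gives `b"\xac\x02"`: the low-order group comes first, which is the sense in which the encoding is little-endian.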
|
Hi all
Header
I am a little confused about your byte-order discussion? Do you mean that we pick a single byte order for the header and eliminate bit 0 from field 3 (flags)? Or are you talking about separating header byte order from data byte order? I do like the idea of the encoding including the byte order where it might not be the same as the header, so 3 is big endian 32 bit integer and 43 or something else is little endian 32 bit integer. Dealing with the bit flags separately is a pain in the rumpus. And byte order might not make sense for new compression types that might be added later, for example ascii or an encoding that itself includes byte order information. If we pick one order, then I tend to like big endian.
Field 7 should probably also say set to 0 if no data payload like Field 6 and 9.
Field 8: Maybe reserve values < 10 for raw data and qc types of things and values >= 10 for user modified data. The dividing line is whether the metadata still applies, so below 10, the response is still the response. But once the version is 10 or above, be careful as the response may have already been applied or the data modified to the extent that it no longer can be. In other words, below 10 users can proceed normally, at 10 or above "here be dragons" and you better know the history. Is 10 large enough?
Field 9, consider UINT32. It is really nice for processing data to be able to store a long continuous time series as a single record like SAC and 65K is kind of small for that. I have no problem with a recommendation that data loggers only generate small records (~512 or 4096 bytes) or that data centers choose a maximum for acceptance or internal storage. The header allows UINT32 samples but not enough bytes to put them in.
Section 6: Definition of channel codes
In the Water Current section, add a sentence to say that water current channels must NOT use SOH or LOG to avoid conflicting with existing soh and log channel names.
For synthetic data, we now have the option of longer codes, so "real" data channel codes should be limited to 3-4 characters, but synthetic or other can be longer, prefixed with X, so XBHZ or XLSN. Then even a new instrument code that was "JK" could be synthetic, with BJKN mapping to XBJKN? This kind of matches Lion's idea, except make it explicit that a band of X means that it is synthetic and that the rest of the channel code can be interpreted along standard channel naming conventions, or is undefined by the spec? The restriction of short codes maybe makes sense for "real" data, but maybe should be relaxed for synthetic or highly processed data, thinking of miniseed of stacked data for example.
The example on page 15 for 2 char instrument codes sets a bad precedent as it makes it seem as if the 2 char WU in LWUS is a subtype of instruments of type W. But I think at least part of the reason for expanding the code is to allow for completely new types, so WU might be a "foobar meter" and have nothing to do with a "wind speed" instrument. In any case, instrument codes should be "as specified by fdsn" and not user definable? I worry that someone will look at the example and decide to create BH1Z and BH2Z channels because they want to have 2 BHZ channels.
Along those lines, could we just have a "location code" and call all 4 code things "codes" instead of network, station and channel "codes" and location "identifier"?
Section 8 Extra Header
Make any identifier that starts with an upper case ASCII letter be reserved to be defined by FDSN. Anything that starts with a lower case or other UTF character is user-defined? All of your existing words already meet that requirement, and it provides an easy way to separate fdsn from other without prefix-bloat.
+1 on a standard key-value to identify logger and sensor type and serial numbers.
Although getting the sensor correct is unlikely to be automatic as the logger that builds the records may only know that it has voltages on pins and have no idea of the sensor connected to it. You probably want a "model" and a "serial number" as model is often sufficient to get nominal response while serial helps with calibrated response and with inventory control.
Leap Seconds: may not want to limit it to "during" this record. If we accept that leap seconds can only happen at month ends and have always been at the end of June or December, then it may be beneficial to add the "leap second happened" header to records that are near but not overlapping the actual leap second. For example a record starting 20150701T00:00:01 could have the leap second extra header to let the system know that this time included the leap second applied in the previous seconds even though it doesn't actually overlap. Putting the value into records preceding the leap second would also allow systems to warn "hey, leap second coming up" before the record with the leap in it actually happens. I would say the flag should be recommended to be set in records that overlap or are within a small time interval (minutes?) of the actual leap second? The meaning is then "leap second occurs near" instead of "leap second occurs inside" this record?
I have more thoughts on structure in extra headers, but will post to that issue. |
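The "leap second occurs near this record" rule suggested in the comment above could be sketched in Python as below. The function name, the 5 minute default margin, and representing the leap second by the UTC instant that immediately follows it are all assumptions for illustration; the comment only proposes "minutes?" as the interval.

```python
from datetime import datetime, timedelta, timezone


def near_leap_second(
    record_start: datetime,
    record_end: datetime,
    leap_boundary: datetime,
    margin: timedelta = timedelta(minutes=5),  # assumed value
) -> bool:
    """True if the record overlaps the leap second boundary or lies
    within `margin` of it on either side."""
    return (
        record_start <= leap_boundary + margin
        and record_end >= leap_boundary - margin
    )


# The 2015-06-30 leap second, represented by the instant right after it.
boundary = datetime(2015, 7, 1, 0, 0, 0, tzinfo=timezone.utc)

# The record from the example: starts one second after the boundary.
start = datetime(2015, 7, 1, 0, 0, 1, tzinfo=timezone.utc)
end = start + timedelta(seconds=10)
flag_record = near_leap_second(start, end, boundary)
```

With these inputs `flag_record` is true even though the record does not overlap the leap second itself, which is the "near" rather than "inside" meaning proposed above.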
On 07/03/2017 06:06 PM, Philip Crotwell wrote:
I am a little confused about your byte-order discussion? Do you mean
that we pick a single byte order for the header and eliminate bit 0 from
field 3 (flags)?
Yes.
I do like the idea of the encoding including the
byte order where it might not be the same as the header, so 3 is big
endian 32 bit integer and 43 or something else is little endian 32 bit
integer.
Exactly.
+1 on a standard key-value to identify logger and sensor type and serial
numbers. Although getting the sensor correct is unlikely to be automatic
as the logger that builds the records may only know that it has voltages
on pins and have no idea of the sensor connected to it.
There can be a plug-and-play protocol, like VGA monitors were recognized
by the graphics card using I2C.
You probably
want a "model" and a "serial number" as model is often sufficient to get
nominal response while serial helps with calibrated response and with
inventory control.
Indeed, I forgot the serial number.
|
I opened a bunch of new issues with some of the discussions which makes it easier (at least for me) to follow them all. We can also close them once we've reached consensus. |
Fair enough and I think I agree.
👍 |
Draft 20170708 supersedes this draft. |
Attached is a draft of a miniSEED 3 specification that combines and hopefully addresses all of the feedback received from the straw man and the discussions at the "future of miniSEED" meeting in the Netherlands in February 2017.
What is drafted is a complete standard where references to the previous standard are informational only, i.e., it is stand-alone. This encourages consideration of all/most aspects; it should not be treated like a proposal, at least not until we have some agreement. Some of the included documentation, such as SEED identifiers, would necessarily be shared with an FDSN StationXML revision where these changes are incorporated.
The largest changes from the original straw man discussed last year are from the February "future of miniSEED" meeting and include:
a) Replacement of traditional identifiers (network, station, location, channel) with a URN with flexibility in length. The specification contains a definition of how the traditional identifiers could be mapped to/from a URN.
b) Re-arrangement to allow streaming of a record during generation with as much flexibility as possible. In this construction, a record is composed of a header, data payload and footer.
The changes to allow stream-ability only require definition of the length of the data payload prior to it being sent. The entire record length does not need to be known, which allows shipment of the data payload and addition of "extra" headers such as event flags in the footer.
I also added the definition of a transient sub-header for transmitting the data payload in chunks, allowing multiple channels to be transmitted over the same stream-based communication link (TCP/IP). It would be a non-trivial thing to decode on the receiving end as the payload would have to be searched for the signature of the sub-header. But without something like this I don't know how multiple channels could be transmitted in a multiplexed fashion on the same communication link. The complexity added to the record structure to facilitate streaming only a single channel does not seem worth it as that would be a very limited scenario. Other ideas, of course, are welcome.
Of course, details are all up for discussion. This is an attempt to provide a target. We should try to agree on a target ASAP in order to have some time for implementation and evaluation.
miniSEED3-DRAFT20170622.pdf