miniSEED3 DRAFT 20170708 specification #21
First read I think this is very good. I feel like we are converging on something very workable and I am happy with the structure. @chad-iris In your copious free time (ha ha), can you close the issues that you think are "resolved" by this new version? If any of the rest of us objects to the resolution, then we should open a new issue w.r.t. this new document and have it link back to the old issue. Otherwise our issue list just keeps growing without any closure. Seem reasonable?
I don't care, but there are colleagues who are religious about this...
👍
OK if that is the FDSN naming schema. Multiple identifiers should be allowed, though. Currently the FDSN identifier must be dropped when a different URN is used, which is maybe not in the interest of FDSN.
OK
👍
OK. BTW, is Steim3 used anywhere!?
👍
This is a restricted special case of my chunks concept. The record now consists of the following "chunks": HEADER(MS)
Instead of numeric chunk IDs, two-letter codes and implicit length are used. The blocks again suffer from "over-lumping", e.g., sample rate/period and encoding format are only applicable if data blocks are present, and sample rate/period is not applicable to all encodings.
What to do if microsecond resolution turns out to be insufficient in the future? (Maybe in other domains than seismology?)
At least there is some extensibility -- new block types can be added. Maybe "SE" for sensor ID (VID, PID, serial, preset) and "DL" for datalogger ID (VID, PID, serial, preset)... I would also add an optional gain or "GA" block, in case there is gain reduction between the sensor and datalogger, such as used with some of our EarthData units...
PS. I haven't been able to read the comments on the white paper yet, because I don't have a Google account. I hope Angelo will send me a current copy of the white paper on Monday.
Not that I have ever seen. The Q330s apparently use some variation of it in their transfer protocol, maybe internally too.
True. But the vast majority of records will use those fields for known and anticipated uses. This is about finding the right balance for the number one goal of time series data for FDSN members while allowing as much flexibility for other uses as possible without making it worse for the number one goal. In this case, is having a different block with small but non-zero overhead worth it compared to the few cases where those fields are empty?
Change the format and increment the version. Alternatively, add more resolution as an extra header (not my favorite).
There is a huge amount of extensibility in the extra headers, and I think those kinds of things belong in the extra headers by default. We'd need a very good reason to create another block type in my opinion.
On Sunday 2017-07-09 20:22, Chad Trabant wrote:
>> The blocks again suffer from "over-lumping", eg., sample rate/period and encoding format are
>> only applicable if data blocks are present, and sample rate/period is not applicable to all
>> encodings.
> True. But, the vast majority of records will use those fields for known and anticipated uses. This is
> about finding right balance for the number one goal of time series data for FDSN members while allowing
> as much flexibility for other uses as possible without making it worse for the number one goal.
I think chunks add a *huge* amount of flexibility *without* making it
any worse for the number one goal. I'm even more confident after
implementing the format in both Python and Javascript.
>> What to do if microsecond resolution turns out to be insufficient in the future? (Maybe in
>> other domains than seismology?)
> Change the format and increment the version.
Fantastic. Then there will be at least another chance to get things
right.
>> At least there is some extensibility -- new block types can be added. Maybe "SE" for sensor
>> ID (VID, PID, serial, preset) and "DL" for datalogger ID (VID, PID, serial, preset)...
>> I would also add an optional gain or "GA" block, in case there is gain reduction between the
>> sensor and datalogger, such as used with some of our EarthData units...
> There is a huge amount of extensibility in the extra headers, and I think those kinds of things belong in
> the extra headers by default. We'd need a very good reason to create another block type in my opinion.
I don't see any reason for wasting space with extra headers. I'm sure
many users agree and you can expect lots of different blocks to be used
soon. There will be just no controlled way for allocating the IDs.
The draft seems like quite some progress! What exactly is the thinking behind the opaque data encoding? Is this needed for some actual use case? Also should the text encoding for data encoding
Hi,
Record header block indicator and version: collapse to one field. Different formats typically have different structures, so the version ends up being stored somewhere else anyway...
FLAGS: drop flags. They have an unclear time reference: they refer either to a point in time or to a time interval, but neither is explicitly expressed. Recoding or realigning information in different records changes the interpretation and leads to information loss. This is a very fundamental flaw in a data format and should not happen in the 20th century... The information contained in the flags needs to be stored in waveform quality metadata; we have better concepts for things like that in WFparam or Mustang.
RECORD START TIME: precision is not sufficient. We sample at 50 MHz even in seismology (analysis of rock samples), and even today. This is not future proof for a generic time series format.
Sampling rate: precision not sufficient.
Data version: different versions of data should have different stream IDs, resolving to different metadata (which tells how the data was treated to make it a different version). Data version as a separate data field without a link to metadata is pointless in data exchange. Furthermore, if it is not part of a globally unique ID, the same version tags may be used in different places for different things.
Data blocks: I do not see the point of having multiple blocks. With no CRC per block, you have to wait for the termination block anyway to see whether you got everything right. Tentative forward readability is also given without sub-blocks if the compression format allows it (you know about potential inherent block structures from the encoding field). With one data block present per header block, the number of samples goes to the header, the length of the data payload is derived from the overall length minus header and footer, and the indication is not required (it is implicit from the position after the header). What remains is pure data ...
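The start-time precision concern above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the 50 MHz rate and the microsecond resolution come from the discussion, while the variable names and the truncation rule are assumptions made for the example.

```python
# Illustrative arithmetic only; names and the truncation rule are assumptions.
SAMPLE_RATE_HZ = 50_000_000   # 50 MHz, as in the rock-sample example
RESOLUTION_NS = 1_000         # microsecond resolution of the start time

# Sample period in nanoseconds (exact for this rate).
period_ns = 10**9 // SAMPLE_RATE_HZ

# True times of the first 100 samples, in nanoseconds.
sample_times_ns = [n * period_ns for n in range(100)]

# Truncate each time to whole microseconds, as a microsecond field would.
truncated_us = [t // RESOLUTION_NS for t in sample_times_ns]

# Number of consecutive samples that share the first microsecond tick.
samples_per_tick = truncated_us.count(0)

print(f"sample period: {period_ns} ns")
print(f"samples sharing one microsecond tick: {samples_per_tick}")
```

Under these assumptions, a record could begin on any of several dozen consecutive samples and still carry the same start time.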
Btw, multiple data records make reading/searching quite inefficient: you have to jump to each record in order to figure out the position of the footer, and to go to the next header. Footer: On the new FDSN identifiers:
Agree on multiple data blocks. Disagree on extra headers. I feel a core requirement is that we be able to migrate mseed2 in a way that is lossless. Dropping extra headers makes this impossible. I am NOT in favor of opaque URI identifiers. Network, station and channel are too fundamental.
To take the role of mseed2 blockette 2000. I do not know of a specific use case currently, but it provides a general way to package and transport (presumably time series) data in a payload that is not expected to be generally known. Of course the risk is accumulating unusable data records. Perhaps there could be a requirement to set an OpaquePayload extra header describing what it is for any records with this encoding. I believe mseed2 blockette 2000 was originally created for packing GPS BINEX data blocks into miniSEED so that it could be transported in a system designed for miniSEED. As far as I know there are none of those around as the miniSEED packaging was subsequently stripped off at the data center. In this scenario, going through an approval process to define a new encoding is not worth it.
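The suggested requirement could be enforced with a check along these lines. This is a minimal sketch, not part of the draft: the numeric value of `OPAQUE_ENCODING`, the function name `check_opaque_record`, and the representation of extra headers as a Python dict are all assumptions; only the `OpaquePayload` header name comes from the comment above.

```python
# Sketch only: OPAQUE_ENCODING and the dict layout of the extra headers are
# assumptions, not values taken from the draft.
OPAQUE_ENCODING = 100  # hypothetical code for the opaque data encoding


def check_opaque_record(encoding: int, extra_headers: dict) -> list:
    """Return a list of problems; the list is empty if the record passes."""
    if encoding != OPAQUE_ENCODING:
        return []  # the requirement only applies to opaque payloads
    description = extra_headers.get("OpaquePayload")
    if not isinstance(description, str) or not description.strip():
        return ["opaque record has no OpaquePayload description"]
    return []


# A described payload passes; an undescribed one is reported.
print(check_opaque_record(OPAQUE_ENCODING, {"OpaquePayload": "GPS BINEX"}))
print(check_opaque_record(OPAQUE_ENCODING, {}))
```

A check like this would flag undescribed opaque records at the point where they enter an archive, which addresses the risk of accumulating unusable data.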
Thanks for the comments.
Makes it simpler 👍
We need a way to map mseed2 data to mseed3. Losing information is a non-starter for many. If you have an alternative way to incorporate mseed2 information that would not immediately be legacy cruft please describe it. This whole process would be a lot easier if we could dream up the best new format for current and future needs; but that is not the case, we must provide a transition path for a lot of old data. Extra fields also provide a mechanism for data generators (operators, equipment manufacturers, etc.) to put their own values into the header. This has been requested many times over the years. What you see in the reserved extra headers now is just the mseed2 flags/blockettes, but the real value is a flexible extra header structure for things to come, future protection.
The version in the URN is an interesting idea. It would need to be optional so that data could be referenced, for example in a request, without a version because the default for nearly every data center is "the latest".
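One way to picture an optional version is sketched below. Everything here is hypothetical: the `@` separator, the example identifier string, and the names `DataRef`, `parse_ref`, and `resolve_version` are invented for illustration and are not proposed syntax.

```python
from typing import NamedTuple, Optional


class DataRef(NamedTuple):
    identifier: str
    version: Optional[int]  # None means "no version requested"


def parse_ref(text: str) -> DataRef:
    """Split a reference into identifier and optional version.

    Hypothetical syntax: an integer version may follow a trailing '@'.
    """
    base, sep, ver = text.rpartition("@")
    if not sep:
        return DataRef(text, None)
    if not base or not (ver.isascii() and ver.isdigit()):
        raise ValueError(f"malformed reference: {text!r}")
    return DataRef(base, int(ver))


def resolve_version(ref: DataRef, available: list) -> int:
    """Pick the requested version, defaulting to the latest available."""
    if ref.version is None:
        return max(available)
    if ref.version not in available:
        raise LookupError(f"version {ref.version} not available")
    return ref.version


# Made-up identifier; a request without a version resolves to the latest.
print(resolve_version(parse_ref("FDSN:XX_TEST__BHZ"), [1, 2, 3]))
print(resolve_version(parse_ref("FDSN:XX_TEST__BHZ@2"), [1, 2, 3]))
```

Keeping the version outside the base identifier in this way lets a request omit it, matching the "latest by default" behavior described above.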
If defined as relative to a data center, it has value to know if an extracted copy is the current version later in time. It is not nearly as valuable as a link to metadata, but that requires much more change to be realizable, I would think. Could you expand on specifics of what you think would be required to actually have versioned identifiers and metadata?
Agreed, that's a general problem with an arbitrary blocking style structure: you have to walk the blocks to find anything not in the first block.
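The cost of walking blocks can be sketched as follows. The layout is an assumption made purely for illustration (a two-letter ASCII code, then a 2-byte little-endian payload length, then the payload); the draft discussed here uses implicit lengths, and the codes "DA" and "EN" as well as the helper names are hypothetical.

```python
import struct


def walk_blocks(record: bytes):
    """Yield (code, payload) for each block in a record.

    Hypothetical layout: 2-byte ASCII code, 2-byte little-endian
    payload length, then the payload itself.
    """
    offset = 0
    while offset < len(record):
        if offset + 4 > len(record):
            raise ValueError("truncated block header")
        code = record[offset:offset + 2].decode("ascii")
        (length,) = struct.unpack_from("<H", record, offset + 2)
        start, end = offset + 4, offset + 4 + length
        if end > len(record):
            raise ValueError("truncated block payload")
        yield code, record[start:end]
        offset = end


def find_block(record: bytes, wanted: str):
    """Return (payload, blocks_visited) for the first block with that code."""
    visited = 0
    for code, payload in walk_blocks(record):
        visited += 1
        if code == wanted:
            return payload, visited
    return None, visited


def make_block(code: str, payload: bytes) -> bytes:
    """Build one block in the hypothetical layout (for the demo only)."""
    return code.encode("ascii") + struct.pack("<H", len(payload)) + payload


record = (make_block("MS", b"header") + make_block("DA", b"\x01\x02")
          + make_block("DA", b"\x03") + make_block("EN", b"crc!"))
payload, visited = find_block(record, "EN")
print(payload, visited)
```

In this sketch, reaching the final block requires visiting every block before it, which is the inefficiency raised in the discussion.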
We have a large legacy of identifiers that cannot be ignored. Moving to an opaque system would be a big mistake, as many aspects of the current one, e.g. easy network identification, are extremely useful.
The size of each code can be discussed and I'm sure will be. Justification will be needed in any case as there are impacts in the real world for "slightly larger". The namespace of "FDSN" makes it future proof in the most important way: in the future the FDSN could create a new namespace for a new identifier. If someone wants to use the format for time series not defined by the FDSN they can use another identifier. Can you explain how this is not future proof in a way that matters?
In the attached drafts I have tried to incorporate all the items that we have discussed and according to what appeared to be consensus.
The specification of FDSN identifiers has been split off from miniSEED 3 so that it can be treated separately and, ultimately, so the documentation of identifiers may be shared with StationXML.
Here are the other changes:
My interpretation: I do not see why both underscores and dashes are needed, and underscores are ugly (my bias). Also, they cannot be used in channels as each letter is prescribed. I'm also guessing there would be pushback against adding dash/underscore/etc. to network codes, and that didn't seem to be a use case. If you want changes included, speak up and justify your case.
Note: I did not add encodings for big endian fundamental types (int, float), I'd rather convert those types when converting from mseed2.
I tried to evenly apply the changes we all agreed on, or at least those with two positive votes and no pushback. Please speak up if you think I got something wrong. Perhaps I'm in the fog of many hours of refactoring/combining everything, but I'm optimistic that this demonstrates a good amount of convergence.
Things that still need treatment:
miniSEED3-DRAFT20170708.pdf
FDSNIdentifiers-DRAFT20170708.pdf