Chunks format #14
The schema language is of course independent of the data format. See a simplistic schema in Python here. The final standard should have language-independent schema language, maybe based on JSON, but it can be very simple. Specification of the schema language could be included in the appendix. |
It doesn't seem like repeating
The old blockettes all suffer from one problem or another: wasted "reserved" bytes, over-lumping or over-requirements. If this is just for a demo, fine. But we will miss a huge opportunity to create better extra headers in miniSEED if we just embed them; they should be mapped to something better when converting. |
On 07/06/2017 02:37 AM, Chad Trabant wrote:
WFDATA = ChunkType("WFDATA", 20, "<fBH", 7,
                   "sample_rate",
                   "encoding",
                   "number_of_samples",
                   "data")
It doesn't seem like repeating |sample_rate| and |encoding| in each
waveform chunk is needed. I wouldn't think those should change within a
record.
Agreed. I did not want to put |sample_rate| and |encoding| into the
fixed header, because these are not applicable to all kinds of data, but
there could be a separate block for those.
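To make the quoted definition concrete, here is a minimal sketch of how the fixed part of such a waveform chunk could be packed and unpacked. Only the format string `"<fBH"` and the field names come from the quoted code; the helper names `pack_wfdata` and `unpack_wfdata` are hypothetical, and it is an assumption that the literal `7` in the definition is the size of the fixed part.

```python
import struct

# Format string from the quoted definition: little-endian float32
# (sample_rate), uint8 (encoding), uint16 (number_of_samples), no padding.
WFDATA_FMT = "<fBH"
WFDATA_FIXED_SIZE = struct.calcsize(WFDATA_FMT)  # 4 + 1 + 2 bytes


def pack_wfdata(sample_rate, encoding, number_of_samples, data):
    """Hypothetical helper: fixed fields followed by the raw sample bytes."""
    fixed = struct.pack(WFDATA_FMT, sample_rate, encoding, number_of_samples)
    return fixed + data


def unpack_wfdata(payload):
    """Hypothetical helper: split a payload into fixed fields and sample bytes."""
    sample_rate, encoding, number_of_samples = struct.unpack_from(
        WFDATA_FMT, payload, 0
    )
    return {
        "sample_rate": sample_rate,
        "encoding": encoding,
        "number_of_samples": number_of_samples,
        "data": payload[WFDATA_FIXED_SIZE:],
    }
```

With this layout every waveform chunk repeats 5 bytes of sample rate and encoding, which is the overhead the comment above questions.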
MS2BLK = {
100: MS2BLK100,
200: MS2BLK200,
201: MS2BLK201,
300: MS2BLK300,
310: MS2BLK310,
320: MS2BLK320,
390: MS2BLK390,
395: MS2BLK395,
400: MS2BLK400,
405: MS2BLK405,
500: MS2BLK500
}
The old blockettes all suffer from one problem or another: wasted
"reserved" bytes, over-lumping or over-requirements.
If this is just for a demo, fine. But we will miss a huge opportunity to
create better extra headers in miniSEED if we just embed them; they
should be mapped to something better when converting.
True, but if you add *all* this info to MS3, you create lots of extra
headers or blocks that will be rarely if ever used.
If you drop something, some guy who does use obscure MS2 blockettes will
complain.
Same with MS2 flags. There should be a way to include all of those flags
in MS3, even if they will never be used. I don't like discarding any
information, especially if you physically convert the archive.
|
I think everything is now clear in my head, so I'm ready to write a full draft. If @chad-iris wants to do another draft first, it would be OK with me, though. |
Turns out it's simply not that much, look at any of the posted drafts. So even if they are rarely used who cares, they are there for conversion of data from mseed2.
Nothing was dropped in the 20170622 draft so I'm not sure what you mean here. There were some structural problems, as you pointed out multiple event detection "occurrences" could not be grouped very nicely, but there was no loss of information from mseed2. |
It is a little hard to evaluate as code, would be easier to talk about if we had a document, but this looks like there would be a huge bloat factor due to storing the key and size for everything. This is really flexible, but the flexibility comes at a cost of wasted bytes and parsing complexity. For example when parsing, there is no order enforcement that I can see, so I may have to parse most of the record just to find the start time or identifier? It is much easier in a fixed case to know you can get the identifier at offset 25 bytes. I think that all of the fixed header as we have defined it is stuff that really always needs to be there, and this idea of making everything key-length-value is too much flexibility and comes at too great a cost. Everything is about trade-offs. |
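The trade-off described in the comment above can be sketched as follows. This is an illustration only: the 1-byte key and 1-byte length framing, the 8-byte identifier width and all function names are assumptions, not part of any posted draft. Only the offset of 25 bytes is taken from the comment.

```python
import struct

FIXED_ID_OFFSET = 25  # offset mentioned in the comment above
FIXED_ID_LENGTH = 8   # hypothetical field width


def read_id_fixed(record):
    """Fixed header: the identifier is a plain slice at a known offset."""
    return record[FIXED_ID_OFFSET:FIXED_ID_OFFSET + FIXED_ID_LENGTH]


def iter_chunks(record):
    """Yield (key, value) pairs from a hypothetical key-length-value chain."""
    pos = 0
    while pos + 2 <= len(record):
        key, length = struct.unpack_from("<BB", record, pos)
        pos += 2
        yield key, record[pos:pos + length]
        pos += length


def read_id_klv(record, id_key):
    """Chunk chain: walk the chunks until the identifier key shows up.

    Returns the value (or None) and the number of chunks inspected.
    """
    inspected = 0
    for key, value in iter_chunks(record):
        inspected += 1
        if key == id_key:
            return value, inspected
    return None, inspected
```

Without order enforcement the identifier can be the last chunk, so the reader may walk the whole chain; with a fixed header the cost is one slice regardless of what else the record carries.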
Also relevant to many chunks each with its own size, see issue #25. Total record size should be easily calculated without reading/parsing many chunks. Specifically, in this you have to search for the I think it is really important to be able to load an entire record into memory or skip to the next record without having to do many reads to sum up sizes. |
It's option
Much less than with JSON.
Don't agree. Minimal fixed header would be OK, but only for stuff that is fundamental.
It's not possible with Chad's proposal either.
Totally agreed. That's why I prefer fixed length records. See comments here. |
I started to write better documentation here. |
Nobody is suggesting storing the header as json. Most records would have very minimal extra headers. That is kind of the point, fix things that have to be there, use flexible storage for things that are optional or new without a format revision. Have you done a byte size calculation for converting an average mseed2 record to your style?
What exactly do you think is not fundamental in the latest fixed header? Maybe "Flags" or "data version". I do not see the point of a miniseed replacement where the sample rate and encoding format are optional. That would just repeat the blockette1000 problem. |
Yes, as written in the first post of this thread, the size jumped from 512 to 522 bytes, but that was with sensor and datalogger ID included. Otherwise the size would have been shorter than mseed2.
Flags I have a multi-header concept now. It's all documented here. |
Small thing, but a serial number is often not an actual number, very often string mix of letters and numbers. Keeping a registry for vendor/product codes feels like more work than the fdsn is capable of providing on an ongoing basis. Maybe just strings for all of these would be more usable. USB collects fees from big companies, fdsn is effectively all volunteer. I do like the idea of the logger being able to tag data with its serial number and such. Might be good to be as compatible as possible with what is in stationxml, given byte size limitations. |
Thanks for feedback :)
Small thing, but a serial number is often not an actual number, very often string mix of letters and numbers.
Yes, I was thinking about that, but I'd like to not waste too much
space. It is unlikely that more than 65536 items of a single model are
produced, so the instrument should somehow be able to put its actual
serial number there.
Keeping a registry for vendor/product codes feels like more work than
the fdsn is capable of providing on an ongoing basis. Maybe just strings
for all of these would be more usable.
I think strings are not usable and can be even dangerous. For example,
if you put Guralp something there, there might be multiple versions of
Guralp something with different gains and filters.
USB collects fees from big
companies, fdsn is effectively all volunteer.
Yes, it's possible that the idea will never be realized :(
|
The actual serial number is not always digits. |
Kind of the same issue with the chunk id number registry. I just don't see the FDSN being able to support that long term. I guess I just don't understand, why design a format that depends on something external that even you don't think is likely to exist? |
Kind of the same issue with the chunk id number registry. I just don't
see the FDSN being able to support that long term.
I guess I just don't understand, why design a format that depends on
something external that even you don't think is likely to exist?
What's the problem with chunk registry? All FDSN chunks are documented
in the standard. If there are other chunks, the corresponding
organization or manufacturer must make the schema available. Can be a
single URL from where you can download the schema.
I guess this should be clarified in the white paper.
|
There has to be a registry for the "Allocation of blockette types", right? I can't just choose a random number and write chunks, can I? If I find a chunk with ID 123456789, how do I find the URL that goes with it? |
There has to be a registry for the "Allocation of blockette types",
right? I can't just choose a random number and write chunks, can I?
If I find a chunk with ID 123456789, how do I find the URL that goes
with it?
Indeed there has to be a "master table" that links ID ranges to URLs
from where you can download the schema. I don't think implementing this
would be a problem for IRIS. We at GEOFON DC can do this for free :)
The master table would not change often (only when a new organization or
manufacturer is added or a URL is changed), so it would be feasible to
ship it with software. Offline copy of the schemata can be included too.
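A minimal sketch of such a master table, small enough to ship with software. The two ID ranges are the ones quoted from the draft later in this thread; the URLs are placeholders and the function name `lookup_schema` is hypothetical.

```python
import bisect

# (first_id, last_id, owner, schema_url), sorted by first_id.
# The URLs are placeholders, not real schema locations.
MASTER_TABLE = [
    (100000, 199999, "IRIS", "https://example.org/iris/schema.json"),
    (200000, 299999, "EIDA", "https://example.org/eida/schema.json"),
]
_RANGE_STARTS = [row[0] for row in MASTER_TABLE]


def lookup_schema(chunk_id):
    """Return (owner, schema_url) for a chunk ID, or None if unallocated."""
    index = bisect.bisect_right(_RANGE_STARTS, chunk_id) - 1
    if index >= 0:
        first_id, last_id, owner, url = MASTER_TABLE[index]
        if first_id <= chunk_id <= last_id:
            return owner, url
    return None
```

An ID outside every allocated range, such as the 123456789 from the question above, returns `None`, so a reader can at least report that the chunk is unknown instead of guessing at its structure.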
|
Thanks, this is much better to review. My comments are in regard to this version of your document: Section 2:
Section 3: Blockettes
This is an excellent example of how "all blockettes" makes doing a common operation like reading through a stream of records and skipping some of them a more difficult task compared to a fixed header of core values. So to skip through records based on identifiers and/or times I have to go about searching through the blockette chain to find the right blockettes for each record. Add in the possibility of multiple identifiers and it gets worse. This is arguably, the most common operation for a data center and would certainly get more expensive.
Section 4: Definition of standard blockettes:
In general:
This structuring pushes more complexity onto the readers. This is very important because the records will be read much, much more often than they are written, modified or streamed in real-time. As pointed out above, even simple operations of reading through files/streams of miniSEED to subset them (something done millions of times per day at our data centers) are more complex, i.e. expensive. Other examples include needing to check for dependee/depender blockette ordering, checking for duplicated records that should not be duplicated, needing to know the structure & schema of extra headers to even print anything about them, needing to do varints, two blockette types for time series data, potentially multiple versions of the same blockette types, potentially multiple blockettes with the same information (if we keep re-defining them to try and get it right) and probably more.
If size were the main driver then this structure has an advantage over other options we've discussed. Although even in that case I would look for better waveform compression before making the header even more complex to save bytes.
The specification is also missing quite some detail, in particular there are not enough details to losslessly convert mseed2 to this format. Previously it has been suggested that mseed2 blockettes would be inserted verbatim into mseed3 blockettes. I think this would be a mistake for two main reasons: 1) many mseed2 blockettes are terribly constructed and all suffer from some problems and 2) most but not all of them that exist today are big endian, so we'd have little-endian structuring containing mostly big-endian blockettes (but some little-endian) with no byte order flag to know the difference.
P.S. Minor, but statements like "a sane value" (General structure) expose bias and judgement and do not belong in format specifications. Other values are insane? In the text that you copied I had written actual reasons for recommending records less than ~4096 bytes. |
I find this troubling. The chunks are opaque unless you have the schema, which might be maintained outside of the FDSN. And so if a company goes out of business and the link dies, the information could easily become unparsable, unreadable and essentially garbage? Do we really want to create a data format that encourages data to be stored in a way that might not even be able to be parsed decades from now? This is just asking for bit rot! |
Many thanks for the exhaustive review. I appreciate this.
Everything becomes a blockette. While this seems simple and would be
quite flexible, it invokes a number of subtle problems. For any
blockettes that include more than one field there is a very real
problem of over-grouping, as demonstrated in mseed2. Such grouping
is totally rigid, making fields required that should not necessarily
be if the blockette is desired for any one of the fields. An
alternative approach is to make each field a blockette, which incurs
more overhead of the structuring (possibly minor), and really just
becomes a smaller, binary version of the tilde-separated string
headers with the problems we've identified with that.
I don't really see a problem. Grouping would be used where it is
natural, other blockettes would have single values (only 2 bytes
overhead per blockette).
A possible
solution is to create alternatives of different blockettes that
group fields differently; that could be really messy indeed.
Another alternative would be using an encoding where fields can be unset
(eg., Protobuf).
Binary blockettes for extra, non-FDSN headers provide complete
flexibility, possibly too much because the content is absolutely
opaque without a definition of the structure and the meaning. The
other structuring options for extra headers we've discussed at least
give the reader basic data types, this would allow, for example,
printing of the extra headers without knowing anything about the
contents. Any of the approaches would need some sort of schema
definition for full use of any extra headers, but wide-open binary
blockettes force small parsing engines for each different header.
You mean type 127? I wouldn't recommend using that, but it could be
used for things like SeedLink INFO packets (XML). I guess a future
SeedLink would use 126 and JSON, though.
"This standard documents only archive record header, which is used
with MS3 files. Real-time transfer protocols may use a different header."
* Can you expand on what advantage there would be to use a different
header? Also, do you mean "real-time" transfer could potentially use
an alternate header? Or a header in addition to the standard header?
If the former, should there be a requirement that a standard header
be added at some later time? If the latter, then maybe this doesn't
belong in the spec at all.
I think including the archive header in real-time transfer is pointless
and problematic, because the record length is not known in advance.
Moreover, real-time header needs other fields to correctly assemble the
records on the receiving side. I think a future SeedLink packet would
look like this:
<SeedLink header (incl. stream ID)><blockette><blockette>...
The packet would not contain whole record, just one or more blockettes.
I will clarify this in the next revision.
I do not think padding should be allowed. It's a waste and is only
used as a kludge to address other problems, as far as I can tell. If you
want padding in a file for some kind of storage/access pattern, add
it between records, it should not be allowed within as that forces
everyone handling that record to pay the penalty.
OK, I think I can agree with that.
varints save a few bytes but add a bit of complexity. Their value
increases if there are /lots/ of blockettes or blockette ids are
really huge numbers. I think we want a format that is as easy to use
as possible and this trade off is not worth it.
In the allocation that I suggested, no ID would take more than 3 bytes
and I think IDs that are larger than 1 byte would be rare.
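The size claim can be checked with a short sketch, assuming the blockette IDs use the base-128 varint scheme also used by Protobuf (7 payload bits per byte, high bit set on every byte except the last). The draft may define varints differently; the function name `encode_varint` is hypothetical.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint (low groups first)."""
    if value < 0:
        raise ValueError("varints here are unsigned")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # more bytes follow
        else:
            out.append(low_bits)         # last byte, high bit clear
            return bytes(out)
```

Under this assumption IDs up to 127 fit in one byte, IDs up to 16383 in two, and IDs up to 2097151 in three, so the ranges up to 299999 proposed in the draft stay within 3 bytes.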
Section 3: Blockettes
In exceptional cases, new revisions of the standard may append
fields to existing blockettes (this was a practice in miniSEED 2.x)...
* Versions of the same blockettes would be terrible, suggest dropping
that entirely. This is one of the lessons of mseed2 we do not want to
re-learn.
Like I said, "in exceptional cases". I think there should be a way to
amend blockettes if we forget something. Like "git commit --amend". It
is discouraged, but can be used if needed.
The problem with mseed2 was that blockette lengths were not defined. You
could only guess the length from "next blockette's byte number" or
"beginning of data", both of which can be unset.
*Order* For efficiency reasons, essential blockettes (eg., time
series identifier, record start time) should occur near the
beginning of a record.
* I think you meant usability or something other than "efficiency" or
I don't understand which characteristic is more/less efficient
depending on blockette order. Definitely not size.
In this case, assuming that only one instance of a blockette per
record is allowed, and knowing the record length, it would be
possible to skip to next record as soon as all relevant blockettes
are found.
This is an excellent example of how "all blockettes" makes doing a
common operation like reading through a stream of records and skipping
some of them a more difficult task compared to a fixed header of core
values. So to skip through records based on identifiers and/or times I
have to go about searching through the blockette chain to find the right
blockettes for each record. Add in the possibility of multiple
identifiers and it gets worse. This is arguably, /the/ most common
operation for a data center and would certainly get more expensive.
This is exactly what I mean with efficiency. The core values would be in
the beginning of record and only one instance would be allowed, so you
don't have to search through the blockette chain.
100000..199999 reserved for IRIS extensions
200000..299999 reserved for EIDA extensions
* I do not think that is a good idea. We don't even know if those
organizations will be around for the expected lifetime of the format
and what about other groups?
How many big datacentres/federations are there? I think there are enough
IDs for everyone.
I guess datacentres may want to define their quality control blockettes
or things like that, but the ID ranges could be allocated on demand.
Probably many are not even interested.
Section 4: Definition of standard blockettes:
* Multiple time series identifiers? I see downstream problems
identifying the data, e.g. at data centers are we expected to
track unlimited aliases /per-record/ that may vary over time
and allow requests for any of the aliases? Without a
preferred/primary ID are systems that report on data supposed to
provide all IDs? The first in the record could be identified as the
preferred/primary. I understand the desire, but not sure we can
justify putting it in every record. I think aliases for time series
identifiers fit much better in external metadata.
There would be one FDSN identifier allowed and FDSN web services like
dataselect would use only that.
Alternative identifiers could be used by groups like ETH who need opaque
URI identifiers, for example.
Sensor (10) Optional sensor identification.
Datalogger (11) Optional datalogger (digitizer) identification.
* Both of these are over-grouped. Those fields will not all be
appropriate for every case where some of this information is known
and would otherwise be useful.
You know vendor ID, but not product ID? I don't think this would be useful.
Or you know vendor ID, product ID, but have no idea which filter and
gain settings are in effect? IMO not useful either.
The serial number can be set to 0 if not known. I don't think it would
be that bad.
We also need a channel ID, though. See below.
Also does not handle serial numbers
that are not all digits or prefixed with zeros or have dashes, etc. etc.
I think there are two options:
1. Serial number would be a variable length string.
2. (preferred) Manufacturers provide numeric serial number in addition
to the fancy one.
Gain (12)
* I like this in concept, could be used in scenarios where instrument
gain changes dynamically (not sure how common that is) but there are
some problems. It's probably worth clarifying whether this is the
total sensitivity (that's how SEED refers to total system gain),
and if it is maybe units could be reported also? It may be
challenging to define the standard for each combination. Also, I do
not understand why "The value 1.0 corresponds to standard gain" would
be used, why not just put in the gain so it's directly usable?
It's *not* the total sensitivity.
The problem with total sensitivity is that it is only valid for given
frequency and in case of polynomial response, there is no gain at all
(you need to specify the polynomial).
Eg. (EarthData digitizer): temperature = counts / 10.0 - 50.0
Instead of specifying all that (and units), I'd prefer to refer to the
devices.
In fact, we must add channel ID (Z, N, E, voltage, temperature, etc.).
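The EarthData temperature example shows why a single gain value is not enough: the conversion has an offset, so it is a first-degree polynomial rather than a scale factor. A minimal sketch, where the coefficient list is derived from the formula quoted above and the names `apply_polynomial` and `EARTHDATA_TEMPERATURE` are hypothetical:

```python
def apply_polynomial(counts, coefficients):
    """Evaluate c0 + c1*x + c2*x**2 + ... with Horner's method."""
    result = 0.0
    for coefficient in reversed(coefficients):
        result = result * counts + coefficient
    return result


# temperature = counts / 10.0 - 50.0, written as c0 + c1 * counts
EARTHDATA_TEMPERATURE = [-50.0, 0.1]
```

With these coefficients 700 counts evaluate to approximately 20.0 degrees, and 0 counts to -50.0, a value that no multiplication of the counts by a gain could produce.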
Waveform metadata (20)
Sample rate/period FLOAT32
* Might need to be FLOAT64 if we are pushing the time resolution up.
Agreed.
The specification is also missing quite some detail, in particular there
are not enough details to losslessly convert mseed2 to this format.
Previously it has been suggested that mseed2 blockettes would be
inserted verbatim into mseed3 blockettes. I think this would be a
mistake for two main reasons: 1) many mseed2 blockettes are terribly
constructed and all suffer from some problems and 2) most but not all of
them that exist today are big endian, so we'd have little-endian
structuring containing mostly big-endian blockettes (but some
little-endian) with no byte order flag to know the difference.
The specification is a work in progress and is missing many details.
Regarding mseed2 blockettes, I agree that creating mseed3 version of
them instead of copying the data verbatim would make sense. At least the
blockettes would have to be converted to little-endian.
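As an illustration of the byte-order part of such a conversion, here is a minimal sketch that re-packs a miniSEED 2 Blockette 1000 (type, next-blockette offset, encoding, word order, record-length exponent, reserved byte) from big-endian to little-endian. It assumes the input is known to be big-endian, which, as noted above, is not guaranteed for existing data; the function name is hypothetical and nothing here reflects an agreed MS3 layout.

```python
import struct

# Blockette 1000 fields: uint16 type, uint16 next-blockette offset,
# uint8 encoding, uint8 word order, uint8 record-length exponent,
# uint8 reserved.
B1000_FIELDS = "HHBBBB"


def b1000_to_little_endian(raw):
    """Re-pack an 8-byte big-endian Blockette 1000 as little-endian."""
    fields = struct.unpack(">" + B1000_FIELDS, raw)
    return struct.pack("<" + B1000_FIELDS, *fields)
```

Only the two 16-bit fields change on the wire; the single-byte fields are unaffected. A real converter would also have to detect the byte order of each input record first and would likely drop or remap fields such as the next-blockette offset.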
P.S. Minor, but statements like "a sane value" (General structure)
expose bias and judgement and do not belong in format specifications.
Other values are insane? In the text that you copied I had written
actual reasons for recommending records less than ~4096 bytes.
Of course. My English is not perfect and the wording can be improved a
lot. Probably I copied the text from 20170622 draft (I had both drafts
open on my screen), which only said that "typical record lengths are
between 256 and 4096 bytes".
|
On 07/13/2017 05:00 PM, Andres Heinloo wrote:
Another alternative would be using an encoding where fields can be unset
(eg., Protobuf).
In fact, I think we should give Protobuf another look, because it is
quite similar to chunks/blockettes, but more flexible.
Each field has an ID (varint) -> similar to chunks/blockettes.
A field can be present or not -> similar to chunks/blockettes.
A bunch of fields can be concatenated -> similar to chunks/blockettes.
A field can be a single value or an embedded message (eg., chunk/blockette).
The embedded message has again fields with ID -> fields of
chunk/blockette can be optional/unset.
Besides, the encoding looks quite simple and there is even RFC.
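The correspondence listed above can be sketched in a few lines using the Protobuf wire format, in which each field starts with a varint key `(field_number << 3) | wire_type`. The helper names are hypothetical, the field numbers in the example are arbitrary, and only wire types 0 (varint) and 2 (length-delimited) are handled.

```python
WIRE_VARINT = 0
WIRE_LEN = 2


def encode_varint(value):
    """Base-128 varint, least significant group first."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(buf, pos):
    """Return (value, next_position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_varint_field(number, value):
    return encode_varint((number << 3) | WIRE_VARINT) + encode_varint(value)


def encode_bytes_field(number, payload):
    """Length-delimited field; the payload may itself be an embedded message."""
    key = encode_varint((number << 3) | WIRE_LEN)
    return key + encode_varint(len(payload)) + payload


def parse_message(buf):
    """Return a list of (field_number, value); absent fields simply do not appear."""
    fields, pos = [], 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == WIRE_VARINT:
            value, pos = decode_varint(buf, pos)
        elif wire_type == WIRE_LEN:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        fields.append((number, value))
    return fields
```

A length-delimited field plays the role of a chunk/blockette: a reader that does not know the field number can still skip it by its length, and the embedded bytes can be parsed again with `parse_message` to reach the inner, individually optional fields.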
|
New revision submitted.
Edit: removed record terminator, because it does not make much sense without padding now.
|
I committed an implementation of protobuf encoding in Python, no .proto
files needed.
Might do Javascript as well.
|
I've committed a Javascript implementation, including a .proto file. Result can be verified using msi3-protobuf.py. |
Note that there are two things that should not be confused with each other. It wasn't fully clear even to myself in the beginning. One is the protobuf encoding, which is very simple and fits perfectly with my chunks concept. The specification of protobuf encoding takes just a couple of pages. Two is various protobuf toolsets whose purpose is to provide highly efficient encoders and decoders for various languages. There are two versions of them developed by Google (proto2 and proto3), both of which use the same encoding! Using those toolsets is not a requirement to parse MS3 records, but they can be used to implement highly efficient parsers. My Python implementation (not the most efficient one) uses small bits of Google's code, just because I was too lazy to write all code myself. My Javascript implementation is based on protobufjs. I just love this package! It's so elegant and easy to parse miniSEED now, not to mention that all problems with the miniSEED format are solved that I can think of. PS. If microseconds resolution is needed in most use cases, it might make sense to have required microseconds field and an optional nanoseconds (0..999) (or picoseconds) field. |
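The P.S. above could look like this in a reader, as a minimal sketch. The class name, the field names and the idea of counting microseconds from some epoch are assumptions for illustration, not part of the posted specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecordTime:
    """Hypothetical time value: required microseconds, optional extra nanoseconds."""

    microseconds: int                  # required; the epoch is an assumption
    nanoseconds: Optional[int] = None  # optional; 0..999 when present

    def __post_init__(self):
        if self.nanoseconds is not None and not 0 <= self.nanoseconds <= 999:
            raise ValueError("nanoseconds must be in 0..999")

    def total_nanoseconds(self):
        """Combine both fields; a missing nanoseconds field counts as 0."""
        return self.microseconds * 1000 + (self.nanoseconds or 0)
```

Readers that only need microsecond resolution can ignore the optional field entirely, and writers that do not have finer timing never pay for it.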
@andres-h |
Also, please remove my name from the javascript as well. What you have done is such a huge change from what I did that I don't feel there is any reason to keep my name on it. |
OK, I haven't posted this anywhere else. Will do the changes with the
next update.
Your work served as a very useful base actually. I probably wouldn't
have done Javascript without it.
|
OK, thanks. I am happy you were able to reuse my code. No bad feelings, I just think having my name there was confusing. |
Not a draft, but I wrote a couple of crude Python scripts. ms2to3 converts an MS2 file to MS3 chunks format. msi3 lists all chunks in a file. Sorry that there are not many comments, but the scripts are very short. The scripts are tested with Python3 only.
A few things to note:
Records are converted 1:1, i.e., an MS3 record is created for each MS2 record, so MS3 records have non-standard size (typically 522 bytes) and may have variable length.
I wanted to keep the parser simple, so sub-chunk transfer is not possible. For smaller latency, it would be recommended to use multiple shorter WFDATA chunks in a record.
Optimal record size would be much larger and I would prefer fixed size.