
Chunks format #14

Open
andres-h opened this issue Jul 5, 2017 · 31 comments

@andres-h
Collaborator

andres-h commented Jul 5, 2017

It might help others understand your complete vision. Also, I find that when you document something in a near-complete description, even in rough draft, you are pushed to think through the details, and it helps identify problems that are not obvious from a very general concept view.

Not a draft, but I wrote a couple of crude Python scripts. ms2to3 converts an MS2 file to the MS3 chunks format. msi3 lists all chunks in a file. Sorry that there are not many comments, but the scripts are very short. The scripts are tested with Python 3 only.

A few things to note:

  • Records are converted 1:1, i.e., an MS3 record is created for each MS2 record, so MS3 records have a non-standard size (typically 522 bytes) and may have variable length.

  • I wanted to keep the parser simple, so sub-chunk transfer is not possible. For lower latency, it would be recommended to use multiple shorter WFDATA chunks in a record (see the sketch after this list).

  • The optimal record size would be much larger, and I would prefer a fixed size.
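
To make the chunk layout concrete, here is a minimal sketch of a key-length-value chunk writer in the spirit of ms2to3. The two-byte chunk ID and two-byte length header are assumptions for illustration only; the "<fBH" WFDATA field layout is the one from the schema quoted later in this thread.

import struct

def write_chunk(out, chunk_id, payload):
    # Assumed key-length-value layout: uint16 chunk ID, uint16 payload
    # length, then the payload bytes (the real header may differ).
    out.write(struct.pack("<HH", chunk_id, len(payload)))
    out.write(payload)

def write_wfdata(out, sample_rate, encoding, nsamples, sample_bytes):
    # Hypothetical WFDATA chunk (type 20): sample rate, encoding code
    # and sample count, followed by the already-encoded samples.
    header = struct.pack("<fBH", sample_rate, encoding, nsamples)
    write_chunk(out, 20, header + sample_bytes)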

@andres-h
Collaborator Author

andres-h commented Jul 5, 2017

How is the schema specified? Is that part of your proposal?

The schema language is of course independent of the data format. See a simplistic schema in Python here. The final standard should have a language-independent schema language, maybe based on JSON, but it can be very simple. The specification of the schema language could be included in an appendix.
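
As a purely illustrative sketch of what such a JSON-based schema entry might look like (only the field names come from the Python schema; the key names and type spellings are made up):

{
    "name": "WFDATA",
    "id": 20,
    "fields": [
        {"name": "sample_rate", "type": "float32"},
        {"name": "encoding", "type": "uint8"},
        {"name": "number_of_samples", "type": "uint16"},
        {"name": "data", "type": "bytes"}
    ]
}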

@chad-earthscope

WFDATA = ChunkType("WFDATA", 20, "<fBH", 7,
                   "sample_rate",
                   "encoding",
                   "number_of_samples",
                   "data")

It doesn't seem like repeating sample_rate and encoding in each waveform chunk is needed. I wouldn't think those should change within a record.

MS2BLK = {
    100: MS2BLK100,
    200: MS2BLK200,
    201: MS2BLK201,
    300: MS2BLK300,
    310: MS2BLK310,
    320: MS2BLK320,
    390: MS2BLK390,
    395: MS2BLK395,
    400: MS2BLK400,
    405: MS2BLK405,
    500: MS2BLK500
}

The old blockettes all suffer from one problem or another: wasted "reserved" bytes, over-lumping, or over-requirements.

If this is just for a demo, fine. But we will miss a huge opportunity to create better extra headers in miniSEED if we just embed them; they should be mapped to something better when converting.

@andres-h
Collaborator Author

andres-h commented Jul 6, 2017 via email

@andres-h
Collaborator Author

andres-h commented Jul 7, 2017

I think everything is now clear in my head, so I'm ready to write a full draft. If @chad-iris wants to do another draft first, it would be OK with me, though.

@chad-earthscope

True, but if you add all this info to MS3, you create lots of extra headers or blocks that will be rarely if ever used.

It turns out it's simply not that much; look at any of the posted drafts. So even if they are rarely used, who cares? They are there for conversion of data from mseed2.

If you drop something, some guy who does use obscure MS2 blockettes will complain.

Same with MS2 flags. There should be a way to include all of those flags in MS3, even if they will never be used. I don't like discarding any information, especially if you physically convert the archive.

Nothing was dropped in the 20170622 draft, so I'm not sure what you mean here. There were some structural problems (as you pointed out, multiple event detection "occurrences" could not be grouped very nicely), but there was no loss of information from mseed2.

@andres-h
Collaborator Author

andres-h commented Jul 9, 2017

@crotwell I ported your JavaScript to the chunks format, see this. Looks nice IMO. Saving MS3 data is now possible too (the result can be verified with msi3.py).

@crotwell
Collaborator

It is a little hard to evaluate as code, would be easier to talk about if we had a document, but this looks like there would be a huge bloat factor due to storing the key and size for everything. This is really flexible, but the flexibility comes at a cost of wasted bytes and parsing complexity. For example when parsing, there is no order enforcement that I can see, so I may have to parse most of the record just to find the start time or identifier? It is much easier in a fixed case to know you can get the identifier at offset 25 bytes.

I think that all of the fixed header as we have defined it is stuff that really always needs to be there, and this idea of making everything key-length-value is too much flexibility and comes at too great a cost. Everything is about trade-offs.
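
To make the parsing cost concrete, here is a minimal sketch of the linear scan a reader would need, reusing the assumed uint16-ID/uint16-length chunk header from earlier in this thread, versus the single fixed-offset read a fixed header allows:

import struct

def find_chunk(record, want_id):
    # Walk the key-length-value chain until the wanted chunk ID turns
    # up; in the worst case this touches every chunk in the record.
    offset = 0
    while offset + 4 <= len(record):
        chunk_id, length = struct.unpack_from("<HH", record, offset)
        if chunk_id == want_id:
            return record[offset + 4:offset + 4 + length]
        offset += 4 + length
    return None

# With a fixed header, the identifier is a single slice, e.g.:
#   identifier = record[25:25 + identifier_length]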

@crotwell
Collaborator

Also relevant to many chunks, each with its own size: see issue #25. Total record size should be easily calculated without reading/parsing many chunks. Specifically, in this design you have to search for the fdsn.NULL_CHUNK to find the end of the record.

I think it is really important to be able to load an entire record into memory or skip to the next record without having to do many reads to sum up sizes.
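
A sketch of what skipping a record would involve under this scheme, again assuming the illustrative uint16/uint16 chunk header, with chunk ID 0 standing in for fdsn.NULL_CHUNK:

import struct

def skip_record(stream):
    # Without a total record length up front, finding the record
    # boundary means reading chunk headers one by one until the
    # terminating NULL chunk; a length-prefixed or fixed-length
    # record would need a single seek instead.
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return False                # end of stream
        chunk_id, length = struct.unpack("<HH", header)
        if chunk_id == 0:               # NULL chunk ends the record
            return True
        stream.seek(length, 1)          # skip the payload unparsed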

@andres-h
Collaborator Author

It is a little hard to evaluate as code, would be easier to talk about if we had a document

It's option #3 in the white paper...

there would be a huge bloat factor due to storing the key and size for everything.

Much less than with JSON.

I think that all of the fixed header as we have defined it is stuff that really always needs to be there

I don't agree. A minimal fixed header would be OK, but only for stuff that is fundamental.

Also relevant to many chunks, each with its own size: see issue #25. Total record size should be easily calculated without reading/parsing many chunks.

It's not possible with Chad's proposal either.

I think it is really important to be able to load an entire record into memory or skip to the next record without having to do many reads to sum up sizes.

Totally agreed. That's why I prefer fixed-length records. See comments here.

@andres-h
Collaborator Author

It is a little hard to evaluate as code, would be easier to talk about if we had a document

I started to write better documentation here.

@crotwell
Collaborator

there would be a huge bloat factor due to storing the key and size for everything.

Much less than with JSON.

Nobody is suggesting storing the header as JSON. Most records would have very minimal extra headers. That is kind of the point: fix the things that have to be there, and use flexible storage for things that are optional or new, without a format revision.

Have you done a byte size calculation for converting an average mseed2 record to your style?

I think that all of the fixed header as we have defined it is stuff that really always needs to be there

I don't agree. A minimal fixed header would be OK, but only for stuff that is fundamental.

What exactly do you think is not fundamental in the latest fixed header? Maybe "Flags" or "data version". I do not see the point of a miniSEED replacement where the sample rate and encoding format are optional. That would just repeat the blockette 1000 problem.

@andres-h
Collaborator Author

Have you done a byte size calculation for converting an average mseed2 record to your style?

Yes, as written in the first post of this thread, the size jumped from 512 to 522 bytes, but that was with the sensor and datalogger IDs included. Otherwise the size would have been smaller than mseed2.

What exactly do you think is not fundamental in the latest fixed header?

  • Flags
  • Time -> can be absolute or relative (simulations, synthetic data)
  • Time series identifier -> multiple identifiers (FDSN and non-FDSN) should be allowed
  • Sample rate/period, encoding -> not applicable to non-waveform data
  • etc.

I have a multi-header concept now. It's all documented here.

@crotwell
Collaborator

Small thing, but a serial number is often not an actual number, very often a mix of letters and numbers.

Keeping a registry for vendor/product codes feels like more work than the FDSN is capable of providing on an ongoing basis. Maybe just strings for all of these would be more usable. USB collects fees from big companies; the FDSN is effectively all volunteer.

I do like the idea of the logger being able to tag data with its serial number and such. It might be good to be as compatible as possible with what is in StationXML, given byte-size limitations.

@andres-h
Collaborator Author

andres-h commented Jul 12, 2017 via email

@chad-earthscope

Small thing, but a serial number is often not an actual number, very often a mix of letters and numbers.

Yes, I was thinking about that, but I'd like to not waste too much space. It is unlikely that more than 65536 items of a single model are produced, so the instrument should somehow be able to put its actual serial number there.

The actual serial number is not always digits.

@crotwell
Collaborator

USB collects fees from big companies; the FDSN is effectively all volunteer.

Yes, it's possible that the idea will never be realized :(

Kind of the same issue with the chunk id number registry. I just don't see the FDSN being able to support that long term.

I guess I just don't understand: why design a format that depends on something external that even you don't think is likely to exist?

@andres-h
Collaborator Author

andres-h commented Jul 12, 2017 via email

@crotwell
Collaborator

There has to be a registry for the "Allocation of blockette types", right? I can't just choose a random number and write chunks, can I?

If I find a chunk with ID 123456789, how do I find the URL that goes with it?

@andres-h
Collaborator Author

andres-h commented Jul 12, 2017 via email

@chad-earthscope

I started to write better documentation here.

Thanks, this is much better to review. My comments are in regard to this version of your document:
https://github.com/iris-edu/mseed3-evaluation/wiki/Chunks/385b84eda92dd48974ee7c49e8b0b4c81ed0bd37

Section 2:

  • Everything becomes a blockette. While this seems simple and would be quite flexible, it introduces a number of subtle problems. For any blockette that includes more than one field there is a very real problem of over-grouping, as demonstrated in mseed2. Such grouping is totally rigid, making fields required that should not necessarily be if the blockette is desired for any one of its fields. An alternative approach is to make each field a blockette, which incurs more structuring overhead (possibly minor) and really just becomes a smaller, binary version of the tilde-separated string headers, with the problems we've identified with that. A possible solution is to create alternatives of different blockettes that group fields differently; that could be really messy indeed.

  • Binary blockettes for extra, non-FDSN headers provide complete flexibility, possibly too much, because the content is absolutely opaque without a definition of the structure and the meaning. The other structuring options for extra headers we've discussed at least give the reader basic data types, which would allow, for example, printing the extra headers without knowing anything about their contents. Any of the approaches would need some sort of schema definition for full use of the extra headers, but wide-open binary blockettes force a small parsing engine for each different header.

"This standard documents only archive record header, which is used with MS3
files. Real-time transfer protocols may use a different header."

  • Can you expand on what advantage there would be to using a different header? Also, do you mean "real-time" transfer could potentially use an alternate header, or a header in addition to the standard header? If the former, should there be a requirement that a standard header be added at some later time? If the latter, then maybe this doesn't belong in the spec at all.

  • I do not think padding should be allowed. It's a waste and is only used as a kludge to address other problems, as far as I can tell. If you want padding in a file for some kind of storage/access pattern, add it between records; it should not be allowed within a record, as that forces everyone handling that record to pay the penalty.

  • varints save a few bytes but add a bit of complexity. Their value increases if there are lots of blockettes or the blockette IDs are really huge numbers. I think we want a format that is as easy to use as possible, and this trade-off is not worth it (a minimal codec is sketched below).
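
For reference, a minimal unsigned varint codec (the same base-128 scheme protobuf uses), showing the decode loop every reader would have to carry:

def encode_varint(value):
    # 7 data bits per byte; the high bit is set on every byte except
    # the last.
    out = bytearray()
    while True:
        byte = value & 0x7f
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, offset=0):
    # Returns (value, offset just past the varint).
    result = shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7f) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7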

Section 3: Blockettes

In exceptional cases, new revisions of the standard may append fields to existing blockettes (this was a practice in miniSEED 2.x)...

  • Versions of the same blockettes would be terrible; I suggest dropping that entirely. This is one of the lessons of mseed2 we do not want to re-learn.

Order: For efficiency reasons, essential blockettes (e.g., time series identifier, record start time) should occur near the beginning of a record.

  • I think you meant usability or something other than "efficiency", or else I don't understand which characteristic is more or less efficient depending on blockette order. Definitely not size.

In this case, assuming that only one instance of a blockette per record is allowed, and knowing the record length, it would be possible to skip to the next record as soon as all relevant blockettes are found.

This is an excellent example of how "all blockettes" makes a common operation, like reading through a stream of records and skipping some of them, a more difficult task compared to a fixed header of core values. So to skip through records based on identifiers and/or times, I have to go searching through the blockette chain to find the right blockettes for each record. Add in the possibility of multiple identifiers and it gets worse. This is arguably the most common operation for a data center and would certainly get more expensive.

100000..199999 reserved for IRIS extensions
200000..299999 reserved for EIDA extensions

  • I do not think that is a good idea. We don't even know if those organizations will be around for the expected lifetime of the format, and what about other groups?

Section 4: Definition of standard blockettes:

  • Multiple time series identifiers? I see downstream problems identifying the data, e.g. at data centers: are we expected to track unlimited aliases per record that may vary over time and allow requests for any of the aliases? Without a preferred/primary ID, are systems that report on data supposed to provide all IDs? The first in the record could be identified as the preferred/primary. I understand the desire, but I'm not sure we can justify putting it in every record. I think aliases for time series identifiers fit much better in external metadata.

Sensor (10) Optional sensor identification.
Datalogger (11) Optional datalogger (digitizer) identification.

  • Both of these are over-grouped. Those fields will not all be appropriate for every case where some of this information is known and would otherwise be useful. This also does not handle serial numbers that are not all digits, are prefixed with zeros, have dashes, etc.

Gain (12)

  • I like this in concept; it could be used in scenarios where instrument gain changes dynamically (not sure how common that is), but there are some problems. It's probably worth clarifying whether this is the total sensitivity (that's how SEED refers to total system gain), and if it is, maybe units could be reported also? It may be challenging to define the standard for each combination. Also, I do not understand how "The value 1.0 corresponds to standard gain" would be used; why not just put in the gain so it's directly usable?

Waveform metadata (20)
Sample rate/period FLOAT32

  • Might need to be FLOAT64 if we are pushing the time resolution up.

Waveform data (21)
Large waveform data (22)

  • Two waveform blockettes, presumably to save 3 bytes in each header. This is another complexity cost of trying to make miniSEED tailored for low-latency, near-real-time data transmission.

In general

This structuring pushes more complexity onto the readers. This is very important because the records will be read much, much more often than they are written, modified, or streamed in real-time. As pointed out above, even simple operations of reading through files/streams of miniSEED to subset them (something done millions of times per day at our data centers) are more complex, i.e. expensive. Other examples include needing to check for dependee/depender blockette ordering, checking for duplicated records that should not be duplicated, needing to know the structure & schema of extra headers to even print anything about them, needing to do varints, two blockette types for time series data, potentially multiple versions of the same blockette types, potentially multiple blockettes with the same information (if we keep re-defining them to try and get it right), and probably more. If size were the main driver then this structure has an advantage over the other options we've discussed, although even in that case I would look for better waveform compression before making the header even more complex to save bytes.

The specification is also missing quite a bit of detail; in particular, there are not enough details to losslessly convert mseed2 to this format. Previously it has been suggested that mseed2 blockettes would be inserted verbatim into mseed3 blockettes. I think this would be a mistake for two main reasons: 1) many mseed2 blockettes are terribly constructed and all suffer from some problems, and 2) most, but not all, of them that exist today are big-endian, so we'd have little-endian structuring containing mostly big-endian blockettes (but some little-endian) with no byte order flag to know the difference.

P.S. Minor, but statements like "a sane value" (General structure) expose bias and judgement and do not belong in format specifications. Other values are insane? In the text that you copied I had written actual reasons for recommending records smaller than ~4096 bytes.

@crotwell
Collaborator

Indeed there has to be a "master table" that links ID ranges to URLs from where you can download the schema.

I find this troubling. The chunks are opaque unless you have the schema, which might be maintained outside of the FDSN. So if a company goes out of business and the link dies, the information could easily become unparseable, unreadable, and essentially garbage. Do we really want to create a data format that encourages data to be stored in a way that might not even be parseable decades from now? This is just asking for bit rot!

@andres-h
Collaborator Author

andres-h commented Jul 13, 2017 via email

@andres-h
Collaborator Author

andres-h commented Jul 13, 2017 via email

@andres-h
Collaborator Author

andres-h commented Jul 13, 2017 via email

@andres-h
Collaborator Author

andres-h commented Jul 14, 2017 via email

@andres-h
Collaborator Author

I've committed a JavaScript implementation, including a .proto file. The result can be verified using msi3-protobuf.py.

@andres-h
Collaborator Author

Note that there are two things that should not be confused with each other. It wasn't fully clear even to me in the beginning.

The first is the protobuf encoding, which is very simple and fits perfectly with my chunks concept. The specification of the protobuf encoding takes just a couple of pages.

The second is the various protobuf toolsets, whose purpose is to provide highly efficient encoders and decoders for various languages. There are two versions developed by Google (proto2 and proto3), both of which use the same encoding!

Using those toolsets is not a requirement to parse MS3 records, but they can be used to implement highly efficient parsers.
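
As a sketch of that point, the wire format can be walked with a few lines of Python and no Google code at all. Only the varint and length-delimited wire types are handled here; the field numbers and their meanings would come from the .proto file.

def decode_varint(buf, offset=0):
    # Base-128 varint, as in the protobuf wire format.
    result = shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7f) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7

def parse_message(buf):
    # Generic protobuf wire-format walk: each field starts with a
    # varint tag holding (field_number << 3) | wire_type.
    fields = {}
    offset = 0
    while offset < len(buf):
        tag, offset = decode_varint(buf, offset)
        field_number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:              # varint
            value, offset = decode_varint(buf, offset)
        elif wire_type == 2:            # length-delimited (bytes, sub-messages)
            length, offset = decode_varint(buf, offset)
            value = buf[offset:offset + length]
            offset += length
        else:                           # 64-bit and 32-bit types omitted
            raise ValueError("wire type %d not handled in this sketch" % wire_type)
        fields.setdefault(field_number, []).append(value)
    return fields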

My Python implementation (not the most efficient one) uses small bits of Google's code, just because I was too lazy to write all the code myself.

My JavaScript implementation is based on protobufjs. I just love this package! It's so elegant, and it's easy to parse miniSEED now, not to mention that all the problems with the miniSEED format that I can think of are solved.

P.S. If microsecond resolution is needed in most use cases, it might make sense to have a required microseconds field and an optional nanoseconds (0..999) (or picoseconds) field.

@crotwell
Collaborator

@andres-h
You should change the author in the package.json in both of your JavaScript examples to be you instead of me.

@crotwell
Collaborator

Also, please remove my name from the JavaScript as well. What you have done is such a huge change from what I did that I don't feel there is any reason to keep my name on it.

@andres-h
Collaborator Author

andres-h commented Jul 17, 2017 via email

@crotwell
Collaborator

OK, thanks.

I am happy you were able to reuse my code. No bad feelings, I just think having my name there was confusing.
