Chunks

MiniSEED 3 "chunks" format (white paper option #3)

This document incorporates feedback from the technical evaluation group. For discussion and background information, see https://github.com/iris-edu/mseed3-evaluation/issues/14.

Section 1: Overview

General structure of the miniSEED 3 "chunks" format

The fundamental unit of miniSEED is the data record. Normally, a time series is stored and exchanged as a sequence of these records. Each record is independently usable even when presented in a sequence. An MS3 record is composed of a header, followed by zero or more blockettes (also called chunks to distinguish them from miniSEED 2.x blockettes). The proposed format supports streaming at blockette granularity while the record is still being generated. The format does not limit the total size of a data record. To limit problems with timing system drift and resolution, and practical issues of subsetting and resource limitation for readers of the data, typical record lengths for raw data generation and archiving are recommended to be roughly in the range of 256 to 4096 bytes.

What's wrong with miniSEED 2.x?

Similar to miniSEED 3, miniSEED 2.x is record-oriented. A record is composed of a header, followed by zero or more blockettes, followed by waveform samples (if any). Unfortunately the miniSEED 2.x format has a number of problems:

Too much information in the fixed header, which is impossible to change. Some of this information became obsolete over time and wastes space in every record; other information (e.g., the network code) needed expansion, which was impossible due to fixed-length fields.
The number of samples, flags, etc. at the beginning of the record prevent sub-record transfer, which is needed to achieve lower latency. As a workaround, very small records (e.g., 64 bytes of header + 64 bytes of data) or partly filled records are used, where overhead can be 50% or more.
Blockette lengths are not defined. It is only possible to guess a blockette's length from the "next blockette's byte number" or "beginning of data" fields, both of which can be unset.
Blockette encoding is rather inefficient, which forces unrelated fields to be grouped to save space. For example, blockette 1001 contains both timing quality and microseconds. In order to add microsecond resolution, one must also add a timing quality value, which may not be known.
Waveform data is handled as a special case rather than as a blockette. This means that fields like "sample rate" and "number of samples" must be provided even when a record contains no waveform data.
The power-of-two record size makes it difficult to add blockettes if a record is already full. For example, to add a blockette to a 512-byte record, the size must be expanded to 1024 bytes; to add a blockette to a 4096-byte record, the size must be expanded to 8192 bytes, and so on.

All of the above problems are solved with the current proposal.

How does this proposal (#3) compare to miniSEED 2.x and proposals #1, #2 and #4?

Feature	miniSEED 2.x	#1 (IRIS)	#2 (ORFEUS)	#3 (GFZ)	#4 (ETH)
Fixed Header	yes	yes	yes	no	yes
Extensions	blockettes	CBOR	blockettes	chunks	opaque
Time series identifier	SNCL (fixed)	URI	SNCL (variable)	multiple	URI
Sub-record streaming	no	no	no	yes	yes
Waveform data	section	section	section	chunks	section
Non-waveform data	blockettes	CBOR	blockettes	chunks	opaque
Mixed data	yes	yes	yes	yes	no
Checksum	no	CRC-32	CRC-32	multiple	MD5
Record length	POT	variable	POT	variable	variable

Fixed Header

A fixed header is forever, so one must think carefully about what to put there. Many fields are obviously essential until... they aren't. One of the more obvious examples of such fields is "data publication version", but even "record start time" and/or "time series identifier" may require alternative representations in the future.

The current proposal (#3) solves this problem by having no fixed header at all.

Extensions

Extensions are used to add information that is not in the fixed header. The classic extension mechanism of miniSEED 2.x is the blockette. The current proposal builds on the blockette concept and solves the issues with classic blockettes that are outlined above. The miniSEED 3 blockettes are sometimes called "chunks" to distinguish them from miniSEED 2.x blockettes.

Proposal #1 uses a binary JSON-like format. The idea of JSON-like formats is that data is interpretable even without a schema. For example, here is a proper JSON document:

{"TimingQualityInPercent": 100, "MaximumTimingErrorInSeconds": 1e6, "QualityIndicator":"RawData"}

The problem is that miniSEED is a record-oriented format. When the above data is added to every record, a lot of space is wasted. That is probably why #1 proposes shorter keys:

{"TQ": 100, "MTE": 1e6, "QI":"R"}

More efficient. But a casual reader no longer has any idea what the keys mean. Moreover, a different organization may coincidentally use a key like "QI" for something completely different.

Proposal #1 suggests using a prefix (namespace) to avoid collisions. OK, so we use:

{"TQ": 100, "MTE": 1e6, "EIDA:QI": 537}

At this point it is probably clear that a registry is needed to resolve "EIDA:QI" to something more descriptive.

The current proposal (#3) jumps straight into it -- blockette (chunk) types are numeric. Everything is a blockette, so there is no need to use a different language (and different parser) for extensions.

Time series identifier

The expansion of network codes was the main motivation for redesigning miniSEED in the first place.

Proposal #2 added new, variable-length SNCL fields that override the SNCL fields in the fixed header.

Proposals #1 and #4 added a time series identifier in URI format. Only one URI is allowed, so a time series can be identified in only one ecosystem at the same time.

In the current proposal (#3), the time series identifier is stored in a blockette and can have any structure defined by that blockette (string, number, multiple fields, etc.), which avoids the need to parse non-opaque URIs. It is possible to use multiple types of time series identifiers at the same time.

Sub-record streaming

An important property of miniSEED is that every record is self-contained. This means that every header field, for example "TimingQualityInPercent", must be repeated in every record, even if its value has not changed. Because of this, the header can be relatively large (and even larger with JSON), and to avoid huge overhead, one usually tries to put as much data into a record as possible.

MiniSEED 2.x as well as proposals #1 and #2 put the number of samples at the beginning of the record, which means that it is not possible to start sending the record before record generation is complete (the number of samples that fit into a record is not known in advance due to compression). As a result, one must choose between high overhead and high latency.

Sub-record streaming means that a sender (e.g., a digitizer) can start sending the record before record generation is complete. Proposal #4 implemented sub-record streaming by adding a "footer" section that stores the number of samples and the checksum. However, the record length is stored at the beginning of the record, so if it is necessary to finish a record earlier than expected, padding must be used.

Another idea discussed in the technical evaluation group was to use multiple data blocks, each with its own size. The drawback of that solution was that jumping directly to the next record after reading the header was no longer possible, which is a typical pattern when reading multiplexed miniSEED files.

The format proposed here (#3) is efficient for both real-time streaming and data archival, thanks to the "archive record header", which is added just before a record is archived. The archive record header contains only the format identifier and the record size.
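The split between streaming and archiving can be sketched in a few lines of Python. This is an illustration only, not a reference implementation: it assumes the struct-style layout from Section 2 (varint type, varint length, payload), and the function names, the identifier string and the log texts are all hypothetical.

```python
from typing import Iterable, Iterator


def encode_varint(value: int) -> bytes:
    """Base-128 varint, least significant group first (as in Protobuf)."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # continuation bit set
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_blockette(btype: int, payload: bytes) -> bytes:
    """Type, payload length, payload: a unit the sender can emit on its own."""
    return encode_varint(btype) + encode_varint(len(payload)) + payload


def stream_blockettes() -> Iterator[bytes]:
    """Stand-in for a digitizer: yields each blockette as soon as it is ready."""
    yield encode_blockette(1, b"XX_STA__BHZ")  # hypothetical identifier string
    yield encode_blockette(23, b"first log line")
    yield encode_blockette(23, b"second log line")


def archive_record(blockettes: Iterable[bytes]) -> bytes:
    """Prepend the archive record header once the record is complete."""
    body = b"".join(blockettes)
    return b"MS3" + encode_varint(len(body)) + body


record = archive_record(stream_blockettes())
print(record[:3], len(record))
```

The sender never needs to know the final record length; only the archiving side computes it, after the last blockette has arrived.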

Waveform data

MiniSEED is primarily used for waveform data. MiniSEED 2.x and proposals #1, #2 and #4 treat waveform data specially -- there is a variable-length section of the record that is dedicated to waveform data. The fixed header (and footer) have dedicated fields to store the sample rate and the number of samples.

The current proposal (#3) takes a different approach -- waveform data is encapsulated in a blockette (chunk) like everything else.

Non-waveform data

Non-waveform data, such as event detections, are stored using the same mechanism that is used for extensions: #1 uses CBOR, #2 uses blockettes, #3 uses chunks and #4 treats non-waveform data as opaque.

Usage of non-waveform data may increase if miniSEED expands to other disciplines besides seismology.

Mixed data

In miniSEED 2.x and proposals #1, #2 and #3, it is possible to store waveform and non-waveform data in the same record. In proposal #4, this is not possible.

Checksum

All proposals add some sort of record checksum. The problem with #1 and #2 is that adding anything to the record invalidates the checksum. This applies to a blockette or CBOR object describing the quality control procedure performed at the data centre, and even to incrementing the data publication version just to indicate that the data passed quality control. The checksum must then be recalculated, and if a hardware or software glitch has corrupted the data in the meantime, the (new) checksum is nevertheless correct.

The concept of the current proposal (#3) is that data should be modified as little as possible. In particular, adding "quality control passed" should not invalidate the primary checksum if no other changes were made to the record. Multiple checksums and hashes are supported (CRC-32, MD5, SHA, etc.).

Record length

MiniSEED 2.x and proposal #2 have power-of-two record lengths, leading to the problems outlined above. #1 and #4 propose variable-length records, but the record length must be known in advance. In the case of #4, the actual number of samples is not known in advance, so padding might be needed to finish a record earlier than expected.

In the current proposal (#3), the record length is stored in the archive record header, which is added just before the record is archived, making the format efficient for both real-time streaming and data archival.

Section 2: Structure of the MS3 record

An MS3 record is composed of a header, followed by zero or more blockettes. This standard documents the archive record header, which is used in MS3 files. Streaming protocols may transfer individual blockettes and use a different header.

Layout

Field	Field name	Type	Length	Offset	Content
[Archive record header]
1	Record indicator	CHAR	3	0	ASCII "MS3"
2	Record length	VARINT	V	3
[Blockettes, zero or more may be present]
3	Blockette type	VARINT	V	V
4	Blockette length	VARINT	V	V
5	Blockette payload	encoded	V	V

All length values are specified in bytes, which are assumed to be 8 bits long. "V" denotes variable length.

Data types

CHAR: Character data.
INT8: Signed 8-bit integer.
UINT8: Unsigned 8-bit integer.
UINT16: Unsigned 16-bit integer, little-endian.
UINT32: Unsigned 32-bit integer, little-endian.
FLOAT64: IEEE-754 64-bit (double precision) floating point number, little-endian.
VARINT: Base 128 variable length integer (little-endian) as defined in Protobuf. See also example Python implementation.
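As an illustration of the VARINT type, here is a minimal Python sketch of base-128 encoding and decoding for non-negative integers, following the Protobuf convention of emitting the least significant 7-bit group first. The function names are hypothetical and the sketch is not the implementation linked above.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (low 7 bits first)."""
    if value < 0:
        raise ValueError("this sketch covers non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode the varint starting at buf[pos]; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

For example, `encode_varint(300)` yields the two bytes `AC 02`, the same example used in the Protobuf encoding documentation.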

Description of fields

Record indicator -- ASCII "MS3".
Record length in bytes, excluding the header.
Blockette type (incl. Protobuf wire type), see section 3.
Length of blockette payload in bytes (skipped with certain Protobuf wire types).
Encoded blockette payload, see section 3.
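The layout above can be read back with a short loop. The following Python sketch assumes the struct-style encoding in which the blockette type is a plain varint and every blockette carries an explicit length (with Protobuf encoding, the type would also carry a wire type and the length may be absent). The function names and the sample record are hypothetical.

```python
from typing import List, Tuple


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint at buf[pos]; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_record(buf: bytes, pos: int = 0) -> Tuple[List[Tuple[int, bytes]], int]:
    """Parse one archived record; return ([(type, payload), ...], next_position)."""
    if buf[pos:pos + 3] != b"MS3":
        raise ValueError("missing MS3 record indicator")
    length, pos = decode_varint(buf, pos + 3)
    end = pos + length  # record length excludes the header itself
    if end > len(buf):
        raise ValueError("truncated record")
    blockettes = []
    while pos < end:
        btype, pos = decode_varint(buf, pos)
        blen, pos = decode_varint(buf, pos)
        if pos + blen > end:
            raise ValueError("blockette runs past the end of the record")
        blockettes.append((btype, buf[pos:pos + blen]))
        pos += blen
    return blockettes, pos


# A hand-built record: leap second blockette (3) and a log blockette (23).
body = bytes([3, 1, 0x01]) + bytes([23, 2]) + b"hi"
record = b"MS3" + bytes([len(body)]) + body
blockettes, next_pos = read_record(record)
print(blockettes, next_pos)
```

Because the returned position points just past the record, the same function can be called repeatedly to walk through a file of concatenated records.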

Section 3: Blockettes

Encoding

The following encodings are currently under consideration:

Fixed-length struct (little-endian), optionally followed by opaque variable length data (same as miniSEED 2.x blockettes, except that only little-endian is allowed).
Protobuf (preferred). A blockette would be represented as a single field or an embedded message, where the field number would be equal to the blockette type. Note that Protobuf supports "repeated" fields, which are useful for waveform data and other blockettes that may appear multiple times in a record.

Order

For efficiency reasons, essential blockettes (e.g., time series identifier, record start time) should occur near the beginning of a record. In this case, assuming that only one instance of each such blockette per record is allowed, and knowing the record length, it would be possible to skip to the next record as soon as all relevant blockettes are found.

If a blockette depends on other blockettes, the dependee must occur before the depender. For example, the waveform metadata blockette must occur before waveform blockettes that depend on it.

Waveform blockettes must be sorted by time. Intra-record data gaps are not possible.
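The benefit of placing essential blockettes first can be illustrated with a scanner that stops reading a record as soon as it has what it needs. This Python sketch assumes the struct-style encoding with explicit blockette lengths and the example blockette numbers 1 and 2 from Section 4; the function names and sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint at buf[pos]; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def scan_records(buf: bytes, wanted=(1, 2)) -> List[Dict[int, bytes]]:
    """Collect the wanted blockettes of every record, skipping the rest."""
    results = []
    pos = 0
    while pos < len(buf):
        if buf[pos:pos + 3] != b"MS3":
            raise ValueError("missing MS3 record indicator")
        length, pos = decode_varint(buf, pos + 3)
        end = pos + length
        found: Dict[int, bytes] = {}
        # Stop as soon as every wanted type has been seen (one instance each).
        while pos < end and len(found) < len(wanted):
            btype, pos = decode_varint(buf, pos)
            blen, pos = decode_varint(buf, pos)
            if btype in wanted:
                found[btype] = buf[pos:pos + blen]
            pos += blen
        results.append(found)
        pos = end  # jump straight to the next record
    return results


def make_record(ident: bytes) -> bytes:
    """Build a toy record: identifier (1), start time stub (2), log (23)."""
    body = bytes([1, len(ident)]) + ident + bytes([2, 1, 0]) + bytes([23, 3]) + b"log"
    return b"MS3" + bytes([len(body)]) + body


found = scan_records(make_record(b"AA") + make_record(b"BBB"))
print(found)
```

The log blockette at the end of each toy record is never decoded, because the inner loop exits once both wanted types have been found.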

Allocation of blockette types

0...999999	reserved for organizations
	0...99999	reserved for the FDSN standard
		0..127	essential blockettes (1-byte ID)
		128..16383	important blockettes (2-byte ID)
1000000+	reserved for manufacturer extensions

Note: in the case of Protobuf, IDs 1..15 would take 1 byte, 16..2047 would take 2 bytes, 2048..262143 would take 3 bytes, etc. However, 1 byte would be saved when encoding a single-field blockette.
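The byte counts quoted above follow from how varints and Protobuf tags are built: a plain varint carries 7 bits per byte, while a Protobuf tag spends its lowest 3 bits on the wire type. A small Python sketch (function names are hypothetical) reproduces the boundaries:

```python
def varint_size(value: int) -> int:
    """Number of bytes a non-negative integer occupies as a base-128 varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def protobuf_tag_size(field_number: int) -> int:
    """Size of a Protobuf tag, which is the varint of (field_number << 3 | wire_type)."""
    # The wire type fills only the low 3 bits, so it never changes the size.
    return varint_size(field_number << 3)


for blockette_type in (15, 16, 127, 128, 2047, 2048, 16383, 16384, 262143):
    print(blockette_type, varint_size(blockette_type), protobuf_tag_size(blockette_type))
```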

Section 4: Definition of standard blockettes

Below is the [incomplete] list of standard blockettes. Unless noted otherwise, only one instance of a blockette per record is allowed.

Flags are currently represented as a group of 8 bits (UINT8) in a single blockette. An alternative would be using a zero-length blockette (2 bytes) or boolean (3 bytes) for each individual flag.

If Protobuf encoding is used, blockettes 21 and 22 would be unified, because both would have the same size. The type of the waveform blockette should be in the 1..15 range to save 1 byte per blockette.

Offsets and lengths are not applicable to Protobuf encoding.

Blockette numbers below are arbitrary and only used as an example.

Time series identifier (1)

Time series identifier as defined by the FDSN. Future revisions of the standard may add alternative time series identifiers to be used in other ecosystems.

Field	Field name	Type	Length	Offset
1	Time series identifier	V	V	0

Record start time (2)

Time of the first data sample and related flags. A representation of UTC using individual fields for year, day-of-year, hour, minute, second and nanosecond. A 60 second value is used to represent a time value during a positive leap second.

Future revisions of the standard may add a relative time blockette, which could be useful with simulations and synthetic data.

Field	Field name	Type	Length	Offset
1	Year (0-65535)	UINT16	2	0
2	Day-of-year (1-366)	UINT16	2	2
3	Hour (0-23)	UINT8	1	4
4	Minute (0-59)	UINT8	1	5
5	Second (0-60)	UINT8	1	6
6	Nanosecond (0-999999999)	UINT32	4	7
7	Flags	UINT8	1	11

Flags

[Bit 0]: Time tag is questionable.
[Bit 1]: Clock locked.

Leap second (3)

One or more leap seconds occurred during this record. The value specifies the number of leap seconds and their direction. For example, use “+1” to specify a single positive leap second and “-1” to specify a single negative leap second.

Field	Field name	Type	Length	Offset
1	Leap second	INT8	1	0

Sensor (10)

Optional sensor identification.

Field	Field name	Type	Length	Offset
1	Vendor ID	UINT16	2	0
2	Product ID	UINT16	2	2
3	Serial number	UINT16	2	4
4	Component	UINT8	1	6
5	Preset	UINT8	1	7

Vendor ID: Vendor ID, such as used with USB devices.
Product ID: Product ID, such as used with USB devices.
Serial number: Serial number of the device.
Component: Component, e.g., 0=Z, 1=N, 2=E. Device-specific.
Preset: A code indicating gain and filter settings. Device-specific.

Datalogger (11)

Optional datalogger (digitizer) identification.

Field	Field name	Type	Length	Offset
1	Vendor ID	UINT16	2	0
2	Product ID	UINT16	2	2
3	Serial number	UINT16	2	4
4	Channel	UINT8	1	6
5	Preset	UINT8	1	7

Vendor ID: Vendor ID, such as used with USB devices.
Product ID: Product ID, such as used with USB devices.
Serial number: Serial number of the device.
Channel: Channel, e.g., 0=Z1, 1=N1, 2=E1, 3=Z2, 4=N2, 5=E2, 6=supply voltage, etc. Device-specific.
Preset: A code indicating channel settings (gain, filters, etc.). Device/channel-specific.

Gain (12)

This blockette must be added alongside the sensor (10) and datalogger (11) blockettes when non-standard gain or custom gain reduction is used.

Field	Field name	Type	Length	Offset
1	Gain	FLOAT64	8	0

Gain: The value 1.0 corresponds to standard gain of the respective sensor/datalogger/preset combination.

Waveform metadata (20)

Metadata for all waveform blockettes in a record. This blockette must occur before any waveform data blockettes (21, 22).

Field	Field name	Type	Length	Offset
1	Sample rate/period	FLOAT64	8	0
2	Data encoding format	UINT8	1	8

Sample rate/period

When the value is positive, it represents the rate in samples per second; when it is negative, it represents the sample period in seconds. Creators should use the negative-value sample period notation for rates of less than 1 sample per second to retain resolution.
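The sign convention can be expressed as a pair of small conversion functions. This Python sketch uses hypothetical function names; rejecting a value of zero is an assumption, since the text above does not define its meaning.

```python
from typing import Tuple


def encode_sample_rate(rate_hz: float) -> float:
    """Return the field value for a given rate in samples per second."""
    if rate_hz <= 0:
        raise ValueError("rate must be positive")
    if rate_hz >= 1.0:
        return rate_hz          # positive: samples per second
    return -1.0 / rate_hz       # negative: sample period in seconds


def decode_sample_rate(value: float) -> Tuple[float, float]:
    """Return (rate in samples per second, period in seconds)."""
    if value > 0:
        return value, 1.0 / value
    if value < 0:
        return -1.0 / value, -value
    raise ValueError("zero is not covered by this sketch")  # assumption


print(encode_sample_rate(100.0))   # a 100 Hz channel
print(encode_sample_rate(0.1))     # one sample every 10 seconds
print(decode_sample_rate(-10.0))
```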

Data encoding format

A code indicating the encoding format. The following codes are defined:

1: 16-bit integers, little-endian
3: 32-bit integers, little-endian
4: IEEE 32-bit floats, little-endian
5: IEEE 64-bit floats, little-endian
10: Steim-1 integer compression (defined only in big-endian)
11: Steim-2 integer compression (defined only in big-endian)
19: Steim-3 integer compression (defined only in big-endian)
53: 32-bit integers, little-endian, general compressor (TBD)
54: 32-bit IEEE floats, little-endian, general compressor (TBD)
55: 64-bit IEEE floats, little-endian, general compressor (TBD)

Waveform data (21)

Waveform data up to 255 samples. It is recommended to use multiple small waveform blockettes per record to achieve better real-time latency.

Field	Field name	Type	Length	Offset
1	Number of samples	UINT8	1	0
2	Data payload	encoded	V	1
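To illustrate how a sample stream might be split into several small waveform blockettes, here is a Python sketch that builds payloads in the struct layout above. It assumes data encoding format 3 and treats the 32-bit integers as signed, which the encoding table does not state explicitly; the function names are hypothetical.

```python
import struct
from typing import Iterator, Sequence, Tuple

MAX_SAMPLES = 255  # limit imposed by the UINT8 "number of samples" field


def waveform_payloads(samples: Sequence[int]) -> Iterator[bytes]:
    """Yield one blockette payload per run of at most 255 samples."""
    for start in range(0, len(samples), MAX_SAMPLES):
        chunk = samples[start:start + MAX_SAMPLES]
        # One count byte followed by little-endian 32-bit integers.
        yield struct.pack("<B%di" % len(chunk), len(chunk), *chunk)


def unpack_waveform(payload: bytes) -> Tuple[int, ...]:
    """Recover the samples from a payload produced by waveform_payloads."""
    count = payload[0]
    return struct.unpack_from("<%di" % count, payload, 1)


samples = list(range(-300, 300))
payloads = list(waveform_payloads(samples))
print([p[0] for p in payloads])
```

A sender aiming for low latency would pick a much smaller chunk size than 255 and emit each payload as soon as it is filled.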

Large waveform data (22)

Waveform data up to 2^32 samples. Multiple instances of this blockette per record are allowed.

Field	Field name	Type	Length	Offset
1	Number of samples	UINT32	4	0
2	Data payload	encoded	V	4

Log (23)

Log message. Multiple instances of this blockette per record are allowed.

Field	Field name	Type	Length	Offset
1	UTF-8 text	V	V	0

CRC-32 (30)

CRC-32C (Castagnoli) value, calculated over preceding blockettes, header excluded. Excluding the header (with record length) makes it possible to add blockettes in a data center without invalidating the CRC-32 value calculated in a digitizer. Multiple CRC-32 blockettes per record can be used.

With Protobuf encoding, the byte position at which the CRC-32 is calculated should be added for convenience, because parsers may not preserve the position of the CRC-32 blockette itself.

Field	Field name	Type	Length	Offset
1	CRC-32 value	UINT32	4	0
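The following Python sketch shows how a CRC-32C computed over the preceding blockettes stays valid when a data center appends a blockette afterwards. It uses a plain table-driven CRC-32C and the struct-style encoding with the example blockette numbers from this section; the function names are hypothetical, and the caller is assumed to know the byte offset of the CRC-32 blockette.

```python
import struct


def _build_table() -> list:
    """Lookup table for the reflected Castagnoli polynomial 0x82F63B78."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()
CRC_TYPE = 30  # example blockette number used in this document


def crc32c(data: bytes) -> int:
    """CRC-32C of data, with the usual initial and final inversion."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def append_crc(body: bytes) -> bytes:
    """Append a CRC-32 blockette covering every byte of body before it."""
    return body + bytes([CRC_TYPE, 4]) + struct.pack("<I", crc32c(body))


def crc_matches(body: bytes, crc_pos: int) -> bool:
    """Check the CRC-32 blockette that starts at byte offset crc_pos."""
    if body[crc_pos:crc_pos + 2] != bytes([CRC_TYPE, 4]):
        raise ValueError("no CRC-32 blockette at this position")
    (stored,) = struct.unpack_from("<I", body, crc_pos + 2)
    return crc32c(body[:crc_pos]) == stored


digitizer_body = bytes([3, 1, 1])                # leap second blockette, value +1
signed = append_crc(digitizer_body)              # done in the digitizer
at_data_center = signed + bytes([90, 1, 2])      # data version blockette added later
print(crc_matches(at_data_center, len(digitizer_body)))
```

Only the archive record header (with its record length) changes when the data version blockette is appended, and the header is excluded from the CRC by definition.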

Data version (90)

Recommended values: 1 for raw data, 2 for data following quality control procedures, and the value is incremented for each later revision.

Field	Field name	Type	Length	Offset
1	Data version	UINT8	1	0

Quality indicator (91)

Quality indicator. Primarily for older data; not recommended for new data.

Field	Field name	Type	Length	Offset
1	Quality indicator	CHAR	1	0

Signal quality flags (92)

Signal quality flags, ported from miniSEED 2.

Field	Field name	Type	Length	Offset
1	Flags	UINT8	1	0

Flags

[Bit 0]: The mass position is off-scale.
[Bit 1]: Amplifier saturation detected.
[Bit 2]: Digitizer clipping detected.
[Bit 3]: Spikes detected.
[Bit 4]: Glitches detected.
[Bit 5]: A digital filter may be charging.
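Reading and writing such a flags byte is a matter of testing individual bits. A short Python sketch, with hypothetical names and with labels shortened from the list above:

```python
SIGNAL_QUALITY_BITS = {
    0: "mass position off-scale",
    1: "amplifier saturation detected",
    2: "digitizer clipping detected",
    3: "spikes detected",
    4: "glitches detected",
    5: "digital filter may be charging",
}


def encode_signal_quality(*bits: int) -> int:
    """Combine bit numbers into a single UINT8 flags value."""
    value = 0
    for bit in bits:
        if bit not in SIGNAL_QUALITY_BITS:
            raise ValueError("undefined flag bit: %d" % bit)
        value |= 1 << bit
    return value


def decode_signal_quality(flags: int) -> list:
    """List the descriptions of all flags that are set, lowest bit first."""
    if not 0 <= flags <= 0xFF:
        raise ValueError("flags must fit in one byte")
    return [name for bit, name in SIGNAL_QUALITY_BITS.items() if flags & (1 << bit)]


flags = encode_signal_quality(1, 3)
print(flags, decode_signal_quality(flags))
```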

Legacy flags (93)

Deprecated miniSEED 2 flags; do not use.

Field	Field name	Type	Length	Offset
1	Flags	UINT8	1	0

Flags

[Bit 0]: Station volume parity error possibly present.
[Bit 1]: Long record read (possibly no problem).
[Bit 2]: Short record read (record padded).
[Bit 3]: Start of time series.
[Bit 4]: End of time series.
[Bit 5]: Telemetry synchronization error.
[Bit 6]: Missing/padded data present.

Timing quality (100)

A vendor-specific timing quality value from 0 to 100% of maximum accuracy.

Field	Field name	Type	Length	Offset
1	Timing quality	UINT8	1	0

Maximum timing error (101)

Estimated maximum timing error in seconds.

Field	Field name	Type	Length	Offset
1	Maximum timing error	FLOAT64	8	0

Time correction (102)

Time correction in seconds applied to record start time.

Field	Field name	Type	Length	Offset
1	Time correction	FLOAT64	8	0

JSON data (126)

User-defined extension (JSON). Multiple instances of this blockette per record are allowed.

Field	Field name	Type	Length	Offset
1	JSON data (UTF-8)	V	V	0

Generic (127)

User-defined extension (binary). Multiple instances of this blockette per record are allowed.

Field	Field name	Type	Length	Offset
1	UUID	CHAR	16	0
2	Data payload	V	V	16

UUID: Data type identification (https://en.wikipedia.org/wiki/Universally_unique_identifier).
Data payload: Data payload, corresponding to the UUID.
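A generic blockette payload is simply the 16 raw UUID bytes followed by the data. The Python sketch below uses the standard uuid module; the function names and the UUID value are hypothetical placeholders, not registered identifiers.

```python
import uuid
from typing import Tuple

# Hypothetical data type identifier, chosen only for this example.
EXAMPLE_TYPE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def pack_generic(type_id: uuid.UUID, data: bytes) -> bytes:
    """Build the payload: 16 raw UUID bytes, then the user-defined data."""
    return type_id.bytes + data


def unpack_generic(payload: bytes) -> Tuple[uuid.UUID, bytes]:
    """Split a payload into its UUID and the data that follows it."""
    if len(payload) < 16:
        raise ValueError("payload too short to contain a UUID")
    return uuid.UUID(bytes=payload[:16]), payload[16:]


payload = pack_generic(EXAMPLE_TYPE, b"\x01\x02\x03")
type_id, data = unpack_generic(payload)
print(type_id, data)
```

A reader that does not recognize the UUID can skip the blockette using its length, without needing to understand the payload.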

MiniSEED 2.x blockettes

Further miniSEED 2.x blockettes (timing, detection, calibration, beam) will be converted to MS3 counterparts. Some MS2 blockettes will be split into multiple MS3 blockettes.

Appendix: Proto2 schema

Below is the schema in proto2 language. Note again that the blockette numbers are arbitrary and should be taken as example only. Blockette 22 (LargeWaveformData) will be removed, because it is identical to blockette 21 (WaveformData).

syntax = "proto2";

package mseed3;

// Note: all int types are varints; uint8 and uint16 do not exist in protobuf.

message RecordStartTime {
        required uint32 year = 1;
        required uint32 day_of_year = 2;
        required uint32 hour = 3;
        required uint32 minute = 4;
        required uint32 second = 5;
        required uint32 nanosecond = 6;
        required uint32 flags = 7;
}

message Sensor {
        required uint32 vendor_id = 1;
        required uint32 product_id = 2;
        optional uint32 serial_no = 3;
        required uint32 component = 4;
        required uint32 preset = 5;
}

message Datalogger {
        required uint32 vendor_id = 1;
        required uint32 product_id = 2;
        optional uint32 serial_no = 3;
        required uint32 channel = 4;
        required uint32 preset = 5;
}

message WaveformMetadata {
        required double sample_rate_period = 1;
        required uint32 encoding = 2;
}

message WaveformData {
        required uint32 number_of_samples = 1;
        required bytes data = 2;
}

message LargeWaveformData {
        required uint32 number_of_samples = 1;
        required bytes data = 2;
}

message UserData {
        required bytes uuid = 1;
        required bytes payload = 2;
}

message CRC32 {
        required fixed32 value = 1;
        required uint32 byte_position = 2;
}

message Record {
        // The "required" qualifier has no effect on the encoding;
        // it only tells the parser to report an error if the
        // field is missing.

        required string time_series_identifier = 1;
        required RecordStartTime record_start_time = 2;

        optional sint32 leap_second = 3;
        optional Sensor sensor = 10;
        optional Datalogger datalogger = 11;
        optional double gain = 12;
        optional WaveformMetadata waveform_metadata = 20;

        // repeated field -- can appear zero or more times in a record
        repeated WaveformData waveform_data = 21;

        // repeated field -- can appear zero or more times in a record
        repeated LargeWaveformData large_waveform_data = 22;

        // repeated field -- can appear zero or more times in a record
        repeated string log = 23;

        // repeated field -- can appear zero or more times in a record
        repeated CRC32 crc32 = 30;

        optional uint32 data_version = 90;
        optional string quality_indicator = 91;
        optional uint32 signal_quality_flags = 92;
        optional uint32 legacy_flags = 93;
        optional uint32 timing_quality = 100;
        optional double maximum_timing_error = 101;
        optional double time_correction = 102;
        repeated string json_data = 126;
        repeated UserData user_data = 127;
}
