
Requirement: Include a CRC (cyclic redundancy check) of the complete record #12

Open
krischer opened this issue Jan 3, 2018 · 15 comments


@krischer
Contributor

krischer commented Jan 3, 2018

Include a CRC (cyclic redundancy check) of the complete record.

@chad-earthscope
Member

This requirement was discussed in change proposal 6 to the 2016 strawman. The Quaterra CRC algorithm was offered for use in change proposal 25 to the 2016 strawman.

In the previous format specifications from IRIS, I suggested adopting the CRC-32C (Castagnoli) algorithm for the following reasons:

  • It is well understood, standardized, and documented in multiple RFCs (e.g. RFC 4960), and it has been adopted by standardized transmission protocols (iSCSI, SCTP), file systems (Btrfs, ext4) and beyond.
  • Support is ubiquitous across programming languages. Heck, many modern CPUs even have hardware support for it: it is part of SSE 4.2, which both Intel and AMD support.
  • It is really not that complex to code a simple version should that ever be needed; see the sketch after this list.
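To illustrate that last point, here is a minimal, unoptimized bitwise sketch of CRC-32C (reflected polynomial 0x82F63B78, initial value and final XOR of 0xFFFFFFFF). It is only an example of how small a portable fallback can be, not part of any proposal in this thread:

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli): reflected polynomial 0x82F63B78,
   initial value 0xFFFFFFFF, final XOR 0xFFFFFFFF.
   Check value: crc32c((const uint8_t *)"123456789", 9) == 0xE3069283. */
uint32_t crc32c(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
    }
    return crc ^ 0xFFFFFFFFu;
}
```

A production implementation would typically use a lookup table or the SSE 4.2 crc32 instruction (exposed as the _mm_crc32_u8 / _mm_crc32_u64 intrinsics), but the bitwise form above is enough where neither is available.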

@andres-h

andres-h commented Jan 6, 2018

I think the type of CRC is an irrelevant implementation detail at this stage. The question is rather whether there should be a CRC of the complete record, of a partial record, or no record-level CRC at all.

I'm in favor of a partial-record CRC.

@krischer
Contributor Author

krischer commented Jan 8, 2018

The type of CRC should probably also be discussed at this stage, as stage 3 of the design will hopefully deal only with complete implementations, and the type of CRC is a detail that should be sorted out by then. Thus we have two things to discuss here: which type of checksum calculation to use, and which parts of the record should be checksummed.

Excerpt from @andres-h's link:

All proposals add some sort of record checksum. The problem with #1 and #2 is that adding anything to the record (for example a blockette or CBOR object describing the quality-control procedure performed at the data centre), or even increasing the data publication version just to indicate that the data passed quality-control procedures, invalidates the checksum. Due to a hardware or software glitch, it can happen that the data is corrupted but the (new) checksum is still correct.

The concept of the current proposal (#3) is that data should be modified as little as possible. In particular, adding "quality control passed" should not invalidate the primary checksum if no other changes were made to the record. Multiple checksums and hashes are supported (CRC-32, MD5, SHA, etc.).

I personally disagree with this and think that everything should be checksummed - it's cheap and, to some extent, also serves as a safety mechanism when touching any part of a record. It could also be two separate 16-bit checksums - one for the data, one for the header. The risk of false positives should still be small enough not to have to worry about it.

@andres-h

andres-h commented Jan 8, 2018

I agree that everything should be checksummed, but having a single checksum does not fit the requirement (?) of being able to modify records in the data centre (adding QC, etc.). As a user, I want to check whether the CRC matches the one generated by the digitizer. I also want to know which changes, if any, were made in the data centre. This assumes that digitizers support NGF directly, similar to mseed2/seedlink (requirement?).

@crotwell

crotwell commented Jan 8, 2018

I feel the checksum should be over the whole record, and modification of a record should force a checksum recalculation. The purpose of this checksum is simply to detect corruption of the data during transmission or storage, not to provide "provenance" of the data back to the digitizer. While provenance is an important concept, it is a much larger problem, and to do it correctly it needs to be handled separately from the time series format.

For this purpose I think simpler is better, so CRC-32 makes the most sense.

@chad-earthscope
Member

chad-earthscope commented Jan 9, 2018

I agree with @crotwell. The main value of adding a CRC is being able to check whether the record has been corrupted during transmission or storage. Even if multiple CRCs could be used in a provenance scheme, requiring a reader to calculate multiple CRCs just to answer the "has this record been corrupted?" question is not justified.

@krischer
Contributor Author

Summary

(Please let me know if I missed a point or misunderstood something)

There is agreement that we want a CRC, but not yet on the algorithm or on the "type" of CRC. Technically it is also clear that the CRC field itself must be set to some pre-determined value (or be ignored) for the actual calculation of the CRC; a minimal sketch of this follows the questions below. Thus please vote on:

  1. What should the CRC include? (complete record / partial record / no record-level CRC)
  2. What algorithms should be used? (CRC-32/other suggestion)
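As an illustration of the "CRC field set to a pre-determined value" point, verification of a whole-record CRC could look like the sketch below. This is not part of any specification text; the 28-byte offset is only a placeholder for wherever the CRC field ends up in the fixed header, and crc32c() is the earlier sketch from this thread:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CRC_FIELD_OFFSET 28  /* placeholder; depends on the final header layout */

/* Check a record whose stored CRC was computed over the complete record
   with the 4-byte CRC field set to zero, reusing crc32c() from above. */
bool record_crc_ok(uint8_t *record, size_t record_length, uint32_t stored_crc)
{
    uint8_t saved[4];

    memcpy(saved, record + CRC_FIELD_OFFSET, 4);     /* keep the stored bytes */
    memset(record + CRC_FIELD_OFFSET, 0, 4);         /* zero the CRC field    */
    uint32_t computed = crc32c(record, record_length);
    memcpy(record + CRC_FIELD_OFFSET, saved, 4);     /* restore the record    */

    return computed == stored_crc;
}
```

Writers would do the mirror image: zero the field, compute the CRC over the complete record, then store the result in the field.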

@crotwell

1 complete record
2 crc-32c

@chad-earthscope
Member

  1. Complete record with the CRC field set to 0's.
  2. CRC-32C

@kaestli

kaestli commented Jan 30, 2018

  1. complete record
  2. crc-32

@ozym

ozym commented Jan 30, 2018

  1. complete record as per @chad-iris
  2. crc-32

@claudiodsf

  1. What should the CRC include? (complete record / partial record / no record-level CRC)

Complete record

  2. What algorithms should be used? (CRC-32/other suggestion)

CRC-32 or any other lightweight algorithm

@ihenson-bsl

1 complete record
2 crc-32c

@ValleeMartin

  1. Complete record
  2. CRC-32 (or similar lightweight algorithm)

@JoseAntonioJara

  1. Complete record
  2. CRC-32
