For discussion: full-file hashes in hyperdrive metadata #12
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,178 @@ | ||
|
||
Title: **DEP-0000: Hyperdrive File Hashes** | ||
|
||
Short Name: `0000-hyperdrive-hashes` | ||
|
||
Type: Standard | ||
|
||
Status: Undefined (as of YYYY-MM-DD) | ||
|
||
Github PR: [Discussion](https://github.com/datprotocol/DEPs/pull/12) | ||
|
||
Authors: [Bryan Newbold](https://github.com/bnewbold) | ||
|
||
|
||
# Summary | ||
[summary]: #summary | ||
|
||
Full-file hashes are optionally included in hyperdrive metadata to complement | ||
the existing cryptographic-strength hashing of sub-file chunks. Multiple | ||
popular hash algorithms can be included at the same time. | ||
|
||
|
||
# Motivation | ||
[motivation]: #motivation | ||
|
||
Naming, discovering, and cataloging data "by content" (that is, by fixed-size | ||
hashes of the data) is a powerful pattern for robust distributed systems. Dat | ||
is one among several such systems. Unfortunately, interoperating between such | ||
systems, or layering them on top of each other, is difficult because each tends to | ||
adopt its own hashing norms and formats. Design variances can include hash | ||
algorithm selection, hash configuration, salting, data chunking, and | ||
intermediate Merkle tree data formats. | ||
|
||
As a concrete example, the sha1sum command-line tool, the bittorrent P2P | ||
protocol, and the git code versioning software all use the SHA-1 algorithm to | ||
hash file contents. However, one cannot use the simple `sha1sum` hash of a | ||
given file to check whether that file is the same as one referenced in either a | ||
bittorrent `.torrent` file or git metadata, because each calculates the | ||
hash in a different way. Bittorrent combines all files in the torrent into a | ||
single stream, then splits it into fixed-size chunks and hashes those | ||
separately; the chunk boundaries usually do not correspond to individual files. | ||
git prepends a short header containing the size of the file (in bytes) before hashing | ||
and storing the file as a "blob". This makes comparison or interoperability | ||
between these systems impossible without either a universal cross-hash | ||
table (infeasible to build in the general sense) or the full | ||
file contents on hand to compare or re-hash in all three formats. | ||
|
||
The design decisions to adopt hash variants are usually well-founded, motivated | ||
by security concerns (such as pre-image attacks), efficiency, and | ||
implementation concerns. | ||
... but this is not the case in dat. The data is both validated and secured on a chunk level. Isn't it in our
I'm not sure I understand. The dat implementation does use something other than popular full-file hash algorithms internally, for good reasons.
It uses chunk-based hashes, yes. But all data received through the dat protocol is signed. I am not sure how you would smuggle unsigned content into a dat.
||
|
||
By adding simple full-file hashes of files as optional complementary metadata | ||
in our distributed data systems, we can make interoperability and powerful | ||
efficiency gains possible. | ||
|
||
For example, a large collection of files could be stored in a simple format on | ||
disk, indexed by a popular hash format. Gateway clients to several P2P networks | ||
could make the same files accessible by storing metadata (relatively small) | ||
separately for each network, but accessing the file contents from the shared | ||
store by a common hash. | ||
|
||
In the case of Dat, a particular efficiency of this use case would be enabling | ||
fast de-duplication of file storage between multiple Dat archives on a | ||
full-file level, instead of at the chunk-level (which would be sensitive to | ||
changes in chunking algorithm). | ||
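
As a rough illustration of full-file de-duplication, the sketch below keeps one copy of each distinct file, keyed by its SHA-1, while each archive stores only a small path-to-hash mapping. The `SharedStore` class and its methods are hypothetical and exist only for this example; they are not part of hyperdrive or any Dat client.

```python
import hashlib
from typing import Dict


class SharedStore:
    """Hypothetical content store keyed by the SHA-1 of each full file."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}                # hash -> file contents
        self.archives: Dict[str, Dict[str, str]] = {}    # archive -> path -> hash

    def add_file(self, archive: str, path: str, data: bytes) -> str:
        digest = hashlib.sha1(data).hexdigest()
        # Contents are stored once, however many archives reference them.
        self.blobs.setdefault(digest, data)
        self.archives.setdefault(archive, {})[path] = digest
        return digest

    def read_file(self, archive: str, path: str) -> bytes:
        return self.blobs[self.archives[archive][path]]


store = SharedStore()
store.add_file("archive-a", "/data.csv", b"a,b,c\n1,2,3\n")
store.add_file("archive-b", "/copy/data.csv", b"a,b,c\n1,2,3\n")
print(len(store.blobs))  # 1
```

Because the key is a full-file hash, the lookup does not depend on how either archive chunked the file.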
|
||
|
||
# Usage Documentation | ||
[usage-documentation]: #usage-documentation | ||
|
||
Implementations would include hashes as file-level metadata along with existing | ||
"stat" fields. | ||
|
||
Existing API methods would include options to control generation of hashes (and | ||
which types) when creating a new drive or adding files. | ||
|
||
|
||
# Reference Documentation | ||
[reference-documentation]: #reference-documentation | ||
|
||
Hashes would be stored as additional fields in hyperdrive's existing `Stat` | ||
protobuf message, with the following structure: | ||
|
||
```protobuf | ||
message Stat { | ||
message ExtraHash { | ||
required uint32 type = 1; | ||
required bytes value = 2; | ||
} | ||
required uint32 mode = 1; | ||
optional uint32 uid = 2; | ||
optional uint32 gid = 3; | ||
optional uint64 size = 4; | ||
optional uint64 blocks = 5; | ||
optional uint64 offset = 6; | ||
optional uint64 byteOffset = 7; | ||
optional uint64 mtime = 8; | ||
optional uint64 ctime = 9; | ||
repeated ExtraHash hashes = 10; | ||
} | ||
``` | ||
|
||
`type` is a number representing the hash algorithm, and `value` is the | ||
bytestring of the hash output itself. The length of the hash digest (in bytes) | ||
is available from protobuf metadata for the value. This scheme, and the `type` | ||
value table, is intended to be interoperable with the [multihash][multihash] | ||
We should reference the multihash table to specify what the types are?!
The table is linked from the multihash homepage linked... I'm wary of deep linking directly into a GitHub blob (the repo could move to a new platform or the file could be renamed), but maybe that's an overblown concern.
Having a link seems better than not having it. To remove/reduce that concern you could link to a commit number or make a mirror of it.
||
scheme from the IPFS community. | ||
|
||
A subset of the multihash hash digest table includes: | ||
|
||
``` | ||
md5 0x00D5 | ||
sha1 0x0011 | ||
sha2-256 0x0012 | ||
sha2-512 0x0013 | ||
blake2b-256 0xB220 | ||
``` | ||
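
One way to see the intended interoperability: an `ExtraHash` entry carries the same information as a multihash value, which (per the multihash format) is the hash code as an unsigned varint, the digest length as an unsigned varint, and then the digest. The Python sketch below is illustrative only; `encode_varint` and `to_multihash` are hypothetical helper names, not an existing API.

```python
import hashlib

# Codes copied from the table above.
MULTIHASH_CODES = {
    "md5": 0xD5,
    "sha1": 0x11,
    "sha2-256": 0x12,
    "sha2-512": 0x13,
    "blake2b-256": 0xB220,
}


def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def to_multihash(hash_type: int, value: bytes) -> bytes:
    """Convert an ExtraHash (type, value) pair to multihash bytes."""
    return encode_varint(hash_type) + encode_varint(len(value)) + value


digest = hashlib.sha1(b"").digest()
print(to_multihash(MULTIHASH_CODES["sha1"], digest).hex())
# 1114da39a3ee5e6b4b0d3255bfef95601890afd80709
```

The digest length does not need its own field in `ExtraHash` because protobuf already length-prefixes `bytes` values.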
|
||
Multiple hashes would be calculated in parallel with the existing | ||
chunking/hashing process, in a streaming fashion. Final hashes would be | ||
calculated when the chunking is complete, and included in the `Stat` metadata. | ||
But multiple hashes are not required, right?
Correct; this is making the point that even with multiple hashes, the file needs to be scanned (read from disk) only once.
||
|
||
For 2018, recommended default full-file hash functions to include are `SHA1` | ||
(for popularity and interoperability) and `blake2b-256` (already used in other | ||
parts of the Dat protocol stack). | ||
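
The single-pass computation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `hash_stream` is a hypothetical name, the read size is arbitrary and unrelated to hyperdrive's own chunking, and a real implementation would feed the hashers from the same stream that drives chunking.

```python
import hashlib
import io
from typing import BinaryIO, Dict


def hash_stream(stream: BinaryIO, read_size: int = 64 * 1024) -> Dict[int, bytes]:
    """Read the stream once, updating every hasher with each piece.

    Returns a mapping from multihash code to digest, i.e. the data needed
    to fill the repeated `hashes` field of `Stat`.
    """
    hashers = {
        0x11: hashlib.sha1(),                      # sha1
        0xB220: hashlib.blake2b(digest_size=32),   # blake2b-256
    }
    while True:
        piece = stream.read(read_size)
        if not piece:
            break
        for hasher in hashers.values():
            hasher.update(piece)
    return {code: hasher.digest() for code, hasher in hashers.items()}


hashes = hash_stream(io.BytesIO(b"hello world"), read_size=4)
for code, value in hashes.items():
    print(hex(code), value.hex())
```

Each additional algorithm adds CPU work per piece but no additional reads of the file.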
|
||
[multihash]: https://multiformats.io/multihash/ | ||
|
||
|
||
# Drawbacks | ||
[drawbacks]: #drawbacks | ||
|
||
The metadata storage overhead (on a per-file basis) should be minimal, but the | ||
additional computational resources to hash a large file multiple times are | ||
non-trivial on machines with a single (or few) cores, even when computed in a | ||
parallel/streaming fashion. | ||
|
||
|
||
# Security and Privacy Concerns | ||
[privacy]: #privacy | ||
|
||
Additional optional fields may leak additional bits of user-specific | ||
configuration metadata, analogous to the "[evercookie][]" and | ||
"[panopticlick][]" browser fingerprinting issues. | ||
|
||
[evercookie]: https://en.wikipedia.org/wiki/Evercookie | ||
[panopticlick]: https://panopticlick.eff.org/ | ||
|
||
|
||
# Rationale and alternatives | ||
[alternatives]: #alternatives | ||
|
||
Users wanting this metadata could instead maintain a manifest file (mapping | ||
paths to hashes) inside the Dat archive itself. The Dat client could support | ||
this with a special mode or flag. One downside of this is that for large | ||
archives, the file would need to be updated and duplicated for every new or | ||
modified file. | ||
|
||
|
||
# Unresolved questions | ||
[unresolved]: #unresolved-questions | ||
|
||
What does the user-facing API look like, specifically? | ||
|
||
Should we allow non-standard hashes, like the git "hash", or higher-level | ||
references like (single-file) bittorrent magnet links or IPFS file references? | ||
We shouldn't allow arbitrary values for the hash field. There should be one clear specification of what type defines which hashing algorithm.
I'm not sure I follow. What I was getting at here was "what if there are additional hash or Merkle tree references a user would want to include that are not in the multihash table"?
Let me rephrase: No: we should not allow non-standard hashes. Though "standard" in this context means the standard we set. I am okay with a "githash" being added for example.
||
|
||
Modifying a small part of a large file would require re-hashing the entire | ||
file, which is slow. Should we skip including the updated hashes in this case? | ||
This is currently mitigated by the fact that we duplicate the entire file when recording | ||
changes or additions. | ||
Hashes are optional so it is up to the implementor/user to add a hash or not. Including for updates of a file.
||
|
||
|
||
# Changelog | ||
[changelog]: #changelog | ||
|
||
- 2018-03-17: First draft for comment. | ||
|
Does it matter if they are popular?
Yes! If we are trying to be inter-operable with existing databases and large users, and in particular to bridge to older "legacy" systems which might not support newer ("better") hash functions. To be transparent, I work at the Internet Archive, and we have dozens of petabytes of files hashed and cataloged with the MD5 and SHA1 hashes (because they are popular, not because they are "strong" or the "best" in any sense). We'll probably re-hash with new algorithms some day, but would like to do so only rarely.
This was a little nitpicking around my impression that the sentence would have the same meaning and impact without "popular"; this blew out of proportion.