Draft: HyperDB #3
Conversation
proposals/0000-hyperdb.md
Outdated
# Summary
[summary]: #summary

HyperDB is a new abstraction layer providing a general purpose distributed
Is it "new"? Ain't this all new?
(Not just nitpicking on prose; I'm concerned "new" will be confusing, both now and later when nothing is comparatively newer.)
Good point; I'm still finding the right "voice" to use: tense, "we/I", "should/could/must", "one/you".
I was thinking of adding a CI pass with something like http://proselint.com/; we should probably have at least a pass that validates the markdown, and maybe checks out-links for valid HTTP responses.
proposals/0000-hyperdb.md
Outdated
Hyperdrive (used by the Dat application) is expected to be re-implemented on
top of HyperDB for improved performance with many files (eg, millions). The
hyperdrive API should be largely unchanged, but the `metadata` format will be
backwards-incompatible.
Yeah I suggest we move this paragraph about hyperdrive into the motivation, where the plans and reasoning at the time of writing can all be contained.
proposals/0000-hyperdb.md
Outdated
A secondary benefit is to refactor the [trie][trie]-structured key/value API
out of hyperdrive, allowing third party code to build applications directly on
this abstraction layer.
Third motivation: add multi-writer support
Oh scratch that, multi-writer isn't discussed in this doc
proposals/0000-hyperdb.md
Outdated
`db.get(key)`: Reading a non-existent `key` is an error. Read-only.

`db.delete(key)`: Removes the key from the database. Deleting a non-existent
key is an error. Requires read-write access.
Does deletion of a path segment cause all keys of which the segment is a prefix to be deleted as well?
Good question! I think "No"; a separate `delete_recursive()` (or a flag) would be needed.
I've tried to minimize the API surface here in the spec; e.g., I haven't included the diff or history streaming stuff. I think in this case it's pretty straightforward to combine `list(prefix)` and `delete(key)`.
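To sketch what combining them might look like (a sketch only; the callback-style API and `list()` returning an array of keys are assumptions for illustration):

```js
// Hypothetical delete_recursive() built from the list(prefix) and
// delete(key) operations this spec already defines.
function deleteRecursive (db, prefix, cb) {
  db.list(prefix, function (err, keys) {
    if (err) return cb(err)
    let pending = keys.length
    let failed = false
    if (pending === 0) return cb(null)
    for (const key of keys) {
      db.delete(key, function (err) {
        if (failed) return
        if (err) { failed = true; return cb(err) }
        if (--pending === 0) cb(null)
      })
    }
  })
}
```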
Supporting recursive deletes through a single append is actually not hard and could be added in the future.
proposals/0000-hyperdb.md
Outdated
- `path`: a 2-bit hash sequence of `key`'s components.
- `seq`: the zero-indexed sequence number of this Node (hyperdb) entry in
  the feed. Note that this may be offset from the hypercore entry index if
  there are any leading non-hyperdb entries.
Does this mean that `seq` only increments when a hyperdb Node is written to the feed?
proposals/0000-hyperdb.md
Outdated
HyperDB is not backwards compatible with the existing hyperdrive metadata,
meaning dat clients may need to support both versions during a transition
period. This applies both to archives saved to disk (eg, in SLEEP) and to
archives received and published to peers over the network.
How can a client detect which metadata protocol to use?
Good question. This is related to the issue of hashbase (and other tools) choking when they try to download a non-hyperdrive hypercore.
Two long-term solutions I could imagine are:
- add a Feed-level "type" field to hypercore itself. The obvious place for this would be the Feed message, except this would then leak type information in clear text on the wire, which is bad. A new message type? I could imagine refactoring hypercore messages to have a new "connection init" message for crypto setup, then re-send a Feed encrypted. This would be a good place to toss in a protocol version header as well. hypercore has been pretty stable though, this would be disruptive. Also, (thinking out loud) how would the software figure out the type of feeds on disk (aka, mystery meat SLEEP files)?
- extend the convention of always writing a "special" protobuf message as the first entry in every hypercore feed, and encode type info there. hyperdrive currently does this with a pointer to the data feed, but no other hypercore feed "types" do this (AFAIK).
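For illustration only, the second option might look something like the sketch below; the `Header` message, its fields, and the `feedType` helper are hypothetical, not a defined part of hypercore or this DEP:

```js
// Sketch: detect a feed's "type" by decoding a special first entry.
const protobuf = require('protocol-buffers')

const messages = protobuf(`
  message Header {
    required string type = 1; // e.g. "hyperdrive" or "hyperdb"
    optional bytes contentFeed = 2;
  }
`)

function feedType (feed, cb) {
  feed.get(0, function (err, data) {
    if (err) return cb(err)
    try {
      cb(null, messages.Header.decode(data).type)
    } catch (e) {
      cb(null, 'unknown') // first entry is not a recognized header
    }
  })
}
```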
@mafintosh, thoughts?
Looking really good!
proposals/0000-hyperdb.md
Outdated
An example pseudo-code session working with a database might be:

db.put('/life/animal/mammal/kitten', '{"cuteness": 500.3}')
db.put('/life/plant/bush/bananna', '{"delicious": 103.4}')
don't know if you care but the banana has an extra n
Thanks for the catch!
proposals/0000-hyperdb.md
Outdated
1. Calculate the `path` array for the key you are looking for.
2. Select the most-recent ("latest") Node for the feed.
3. Compare path arrays. If the paths match exactly, you have found the Node you
SipHash is not collision resistant, so two distinct paths could have the same path array, right?
Yes, in case of a collision the final node bucket will just have more than one node in it. The full final node bucket is read on every read (usually only a single node) to do a full key comparison to check for collisions.
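In other words (a sketch, with the Node shape assumed from the spec), resolving a read against the final bucket is just a full-key scan:

```js
// The final trie bucket may hold several Nodes whose keys hash to the
// same path array; the reader picks the one whose full key matches.
function resolveBucket (bucket, key) {
  return bucket.find(function (node) { return node.key === key }) || null
}
```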
proposals/0000-hyperdb.md
Outdated
implementation. The current implementation requires the most recent node in a
directory to point to all other nodes in the directory. Even with pointer
compression, this requires on the order of `O(N^2)` bytes; the HyperDB
implementation scales with `O(N log(N))`.
👍 👍
this is what prevented us from doing datpedia the straightforward way
proposals/0000-hyperdb.md
Outdated
specified limit of 2 GByte (due to 32-bit signed arithmetic), and most
implementations impose a (configurable) 64 MByte limit. Should this DEP impose
specific limits on key and value sizes? Would be good to decide before Draft
status.
since dat does rabin chunking, the 2GB limit seems fine ~ large files will always be divided into chunks smaller than that.
will the chunks always be below 64MB though? @mafintosh
@dcposch note this is actually not for the files but just for the file metadata. in our case that is just a fs.Stat object with an additional numeric pointer to another feed where the actual file content is stored. This basically gives you as large files as you'd want.
Also, FWIW, dat (via hyperdrive) does not actually use Rabin chunking at the moment, just fixed-size chunks. The protocol is agnostic to chunking, so this can evolve in the future with little disruption if/when content-aware chunking is more performant.
As a reminder, dat (via hyperdrive) is only one possible application on top of hyperdb; while hyperdrive should be pretty comfortable even with a couple-KByte metadata size limit, other applications might want to store arbitrary data in the hyperdb values; setting expectations around this could save frustration or growing pains later.
@bnewbold agreed. imo you should only store "small" values directly in the hyperdb, say < 1mb; anything larger you should put in a "content feed" and then link it from the value.
I think for Draft we can mention the inherent limits and then mention:
"It is STRONGLY recommended to keep both keys and values to less than 2 megabytes (individually and combined). If larger value sizes are desired, a separate content feed can be pointed to (via `contentFeed`), and the value in this feed can be metadata pointing to data (eg, contiguous blocks of entries) in that feed."
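Such a pattern might look like the following sketch (the append-then-point flow is an illustration; `putLargeValue` and the pointer format are hypothetical):

```js
// Keep the hyperdb value small: write bulk data to the content feed and
// store only a pointer to it in the db.
function putLargeValue (db, contentFeed, key, bigBuffer, cb) {
  contentFeed.append(bigBuffer, function (err) {
    if (err) return cb(err)
    const pointer = {
      block: contentFeed.length - 1, // index of the block just appended
      size: bigBuffer.length
    }
    db.put(key, JSON.stringify(pointer), cb)
  })
}
```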
proposals/0000-hyperdb.md
Outdated
Are the "deletion" semantics here correct, in that deletion Nodes persist as
"tombstones", or should deleted keys be omitted from future `trie` fields to
remove their presence?
it seems important to leave deletion nodes in place--that way, `list()` can iterate over all current and deleted nodes under a directory and then omit the deleted ones
otherwise, i think `list()` might return partial results
@dcposch that's how it works atm actually
proposals/0000-hyperdb.md
Outdated
array. How unlikely is this situation? Eg, how many keys in a database before
the birthday paradox results in a realistic chance of a collision? How should
implementations behave if they detect a collision (matching path but not
matching key)?
i found a collision so we can test this
after an hour or so with a little Go script on a 122GB r4.4xlarge instance...
Checked 7059000000 hashes
Checked 7060000000 hashes
Checked 7061000000 hashes
Checked 7062000000 hashes
Checked 7063000000 hashes
Checked 7064000000 hashes
COLLISION: SipHash24(mpomeiehc) = SipHash24(idgcmnmna) = 11615559218517537840
that's SipHash24 with an all-zero key, as described above. so if you make a dat and add the following files:
/mpomeiehc
/idgcmnmna
...you should be able to test a path collision
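For example (values arbitrary; the callback shapes are assumptions for illustration):

```js
// Both keys hash to the same path array, so they land in the same final
// trie bucket and reads must fall back to full-key comparison.
db.put('/mpomeiehc', 'first', function () {
  db.put('/idgcmnmna', 'second', function () {
    db.get('/idgcmnmna', function (err, node) {
      console.log(node.value) // expect 'second', not 'first'
    })
  })
})
```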
mafintosh/hyperdb@fd60344 <-- i've added @dcposch's hash collision to the tests. We support hash collisions; the two nodes just end up in the same bucket, similar to how a multiwriter fork would.
Thanks @dcposch!
Will need copy editing, and there are a number of new changes that need to be understood and integrated.
proposals/0000-hyperdb.md
Outdated
When converting the 8-byte hash to an array of 2-bit bytes, the ordering is to
proceed byte-by-byte, and for each take the two lowest-value bits (aka, `hash &
0x3`) as byte index `0`, the next two bits (aka, `hash & 0xC`) as byte index
`1`, etc. When concatanating path hashes into a longer array, the first
concatenating
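For concreteness, the byte-to-2-bit ordering described in this excerpt could be implemented as follows (a sketch, not normative):

```js
// Convert an 8-byte hash Buffer into 32 2-bit values, lowest bits first
// within each byte, per the ordering described above.
function hashToPath (hash) {
  const path = []
  for (const b of hash) {
    path.push(b & 0x3, (b >> 2) & 0x3, (b >> 4) & 0x3, (b >> 6) & 0x3)
  }
  return path // 32 values, each in the range 0..3
}
```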
proposals/0000-hyperdb.md
Outdated
For a `trie` with `N` buckets, each may have zero or more pointers. Typically
there are a maximum of 3 pointers per bucket, corresponding to the 4 possible
values minus the current Entry's value, but in the event of hash collisions (in
the path array space) , there may be multiple pointers in the same bucket
),
proposals/0000-hyperdb.md
Outdated
a pointer to the Entry with the same hash at the final array index.
5. If the paths don't entirely match, find the first index at which the two
arrays differ. Copy all `trie` elements (empty or not) into the new `trie`
for indicies between the "current index" and the "differing index".
indices
proposals/0000-hyperdb.md
Outdated
field of an Entry protobuf message.

Consider a trie array with `N` buckets and `M` non-empty buckets (`0 <= M <=
N`). In the encoded field, there will be `M` concatanated bytestrings of the
concatenated
proposals/0000-hyperdb.md
Outdated
the second bytestring chunk would be:

- index varint is `64` (65th element in the trie array; the terminating value)
- btifield is `0b10000` (varint 1); there is a single pointer set... but it
bitfield
proposals/0000-hyperdb.md
Outdated
hash arrays, we now get all the way to index 34 before there is a difference.
We again look in the `trie`, find a pointer to entry index 0, and fetch the
first Entry and recurse. Now the path elements match exactly; we have found the
Entry we are looking for, and it has an existant `value`, so we return the
existent
proposals/0000-hyperdb.md
Outdated
The basic key/value semantics of hyperdb (as discussed in this DEP, not
considering multi-writer changes) are not known to introduce new privacy issues
when compared with, eg, storing binary values at key-like paths using the
e.g.
proposals/0000-hyperdb.md
Outdated
Hyperdb is not backwards compatible with the existing hyperdrive metadata,
meaning dat clients may need to support both versions during a transition
period. This applies both to archives saved to disk (eg, in SLEEP) and to
e.g.
proposals/0000-hyperdb.md
Outdated
a feed? See also <https://github.com/datprotocol/DEPs/issues/13>

Need to think through deletion process with respect to listing a path prefix;
will previously deleted nodes be occulded, or potentially show up in list
occluded?
proposals/0000-hyperdb.md
Outdated
db.get('/life/animal/mammal/kitten')
=> {"cuteness": 500.3}
db.list('/life/')
=> ['/life/animal/mammal/kitten', '/life/plant/tree/banana']
Is this example listing correct? It's showing keys multiple prefix-segments below.
I intended for it to be recursive; looks like there is a `recursive` flag in maf's hyperdb library. I've added a clarification.
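Assuming the flag is exposed as an option (the exact name and signature in maf's library are assumptions here), usage might look like:

```js
// Non-recursive: only the immediate children under the prefix.
db.list('/life/', { recursive: false }, function (err, keys) { /* ... */ })

// Recursive: all keys below the prefix, as in the example above.
db.list('/life/', { recursive: true }, function (err, keys) { /* ... */ })
```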
proposals/0000-hyperdb.md
Outdated
care to the following scenarios: A malicious writer may be able to produce very
frequent key path hash collisions, which could degrade to linear performance. A
malicious writer could send broken trie structures that contain pointer loops,
duplicate pointers, or other invalid contents. A malicious writer could write
Pointer loops sound like they could be a pretty serious DoS attack vector. Do we have any way to mitigate that?
There's a calculable upper limit on lookups based on the number of prefixes... Perhaps the recursive algorithm needs to track how many lookups it does and then abort if it hits the limit?
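A sketch of that guard (names hypothetical), using the `4 * 32 * N = 128 * N` worst case derived later in this DEP:

```js
// Returns a counter to call before each trie pointer dereference; throws
// once the traversal exceeds the calculable worst case for N segments.
function makeLookupGuard (numSegments) {
  const limit = 128 * numSegments
  let lookups = 0
  return function countLookup () {
    if (++lookups > limit) {
      throw new Error('possible trie pointer loop: lookup limit exceeded')
    }
  }
}
```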
@noffle I've still got you in here as a co-author in attribution of your ARCHITECTURE.md document that was very helpful when starting this DEP. However, while this is derivative, almost none of that document remains in here, and I don't want to "co-author" you without consent. Any thoughts? I'll move you down to an Acknowledgements section if I don't hear back in a day or two.
@bnewbold whatever you think is appropriate is fine by me!
Any further comments here? Poke @mafintosh. If we could approve/merge this week that would be great (eg, at the protocol WG meeting next Wednesday at the latest).
Let me give this a thorough review over the weekend. At last glance this looked great!
proposals/0000-hyperdb.md
Outdated
For a database with most keys having N path segments, the cost of a `get()`
scales with the number of entries M as `O(log(M))` with best case 1 lookup and
worst case `4 * 32 * N = 128 * N` lookups (for a saturated `trie`).
This worst case will only happen if all the hashes in the db are colliding, which is extremely unlikely. The worst case for a "perfect" hash is still log4(n) afaik. The best case is O(1) and the normal case is log4(n) / k, where k depends on how many shared hash prefixes there are.
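To put a rough number on the normal case (plain arithmetic, assuming k near 1):

$$\log_4(10^6) = \frac{6}{\log_{10} 4} \approx 9.97$$

so a database with a million entries needs on the order of ten lookups per `get()`.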
proposals/0000-hyperdb.md
Outdated
# Unresolved questions
[unresolved]: #unresolved-questions

Should we declare that `contentFeed` pointers *not* change over the history of
yes
proposals/0000-hyperdb.md
Outdated
mafintosh mentioned this might be in the works. Does this DEP need to "leave
room" for those changes, or should we call out the potential for future change?
Probably not, should only describe existing solutions. This can be resolved
after Draft.
it can and should be improved in the future (backwards compat). step one is shipping what we have
proposals/0000-hyperdb.md
Outdated
after Draft.

Review (or prove?) the `O(log(M))` intuition for `get()` operations. This could
happen after Draft status.
it's log(M) yes
To be explicit: left @noffle as a co-author. Thanks for kicking this whole thing off!
This DEP is submitted for review for Draft status.
Rendered pre-merge
A diagram showing the trie structure might be nice to help with learning, but maybe the specification isn't the right place for that. The examples are verbose, but I think are important to clarify exactly how things work. I think exact binary protobuf output (hexdump style?) could also be included for some example messages, to make encoding and endianness completely unambiguous.