
Draft: HyperDB #3

Merged (27 commits) on May 7, 2018

Conversation

@bnewbold (Contributor) commented Feb 5, 2018:

This DEP is submitted for review for Draft status.

Rendered pre-merge

A diagram showing the trie structure might be nice to help with learning, but maybe the specification isn't the right place for that. The examples are verbose, but I think they are important to clarify exactly how things work. Exact binary protobuf output (hexdump style?) could also be included for some example messages, to make encoding and endianness completely unambiguous.

# Summary
[summary]: #summary

HyperDB is a new abstraction layer providing a general purpose distributed
Comment (Contributor):

Is it "new"? Aint this all new?

Comment (Contributor):

(Not just nitpicking on prose; I'm concerned "new" will be confusing, both now and later when nothing is comparatively newer.)

bnewbold (author):

Good point; I'm still finding the right "voice" to use: tense, "we/I", "should/could/must", "one/you".

I was thinking of adding a CI pass with something like http://proselint.com/; we should probably have at least a pass that validates the markdown, and maybe checks out-links for valid HTTP responses.

Hyperdrive (used by the Dat application) is expected to be re-implemented on
top of HyperDB for improved performance with many files (e.g., millions). The
hyperdrive API should be largely unchanged, but the `metadata` format will be
backwards-incompatible.
Comment (Contributor):

Yeah I suggest we move this paragraph about hyperdrive into the motivation, where the plans and reasoning at the time of writing can all be contained.

A secondary benefit is to refactor the [trie][trie]-structured key/value API
out of hyperdrive, allowing third party code to build applications directly on
this abstraction layer.

Comment (Contributor):

Third motivation: add multi-writer support

Comment (Contributor):

Oh scratch that, multi-writer isn't discussed in this doc

`db.get(key)`: Reading a non-existent `key` is an error. Read-only.

`db.delete(key)`: Removes the key from the database. Deleting a non-existent
key is an error. Requires read-write access.
Comment (Contributor):

Does deleting a path segment cause all keys that have that segment as a prefix to be deleted as well?

bnewbold (author):

Good question! I think "No"; a separate delete_recursive() (or a flag) would be needed.
I've tried to minimize the API surface here in the spec; e.g., I haven't included the diff or history streaming stuff. I think in this case it's pretty straightforward to combine list(prefix) and delete(key).

Reply:

Supporting recursive deletes through a single append is actually not hard and could be added in the future.
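
In the meantime, a client-side sketch along the lines bnewbold suggests, combining `list(prefix)` and `delete(key)` from the DEP's API (the node-style callback signature and the `deleteRecursive` helper name are assumptions for illustration, not part of the DEP):

```js
// Hypothetical helper: delete every key under a prefix by combining the
// DEP's list(prefix) and delete(key) operations. Sketch only.
function deleteRecursive (db, prefix, cb) {
  db.list(prefix, function (err, keys) {
    if (err) return cb(err)
    let pending = keys.length
    let failed = false
    if (pending === 0) return cb(null)
    for (const key of keys) {
      db.delete(key, function (err) {
        if (failed) return
        if (err) { failed = true; return cb(err) }
        if (--pending === 0) cb(null)
      })
    }
  })
}
```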

- `path`: a 2-bit hash sequence of `key`'s components.
- `seq`: the zero-indexed sequence number of this Node (hyperdb) entry in
the feed. Note that this may be offset from the hypercore entry index if
there are any leading non-hyperdb entries.
Comment (Contributor):

Does this mean that seq only increments when a hyperdb Node is written to the feed?

HyperDB is not backwards compatible with the existing hyperdrive metadata,
meaning dat clients may need to support both versions during a transition
period. This applies both to archives saved to disk (e.g., in SLEEP) and to
archives received and published to peers over the network.
Comment (Contributor):

How can a client detect which metadata protocol to use?

bnewbold (author):

Good question. This is related to the issue of hashbase (and other tools) choking when they try to download a non-hyperdrive hypercore.
Two long-term solutions I could imagine are:

- Add a Feed-level "type" field to hypercore itself. The obvious place for this would be the Feed message, except this would then leak type information in clear text on the wire, which is bad. A new message type? I could imagine refactoring hypercore messages to have a new "connection init" message for crypto setup, then re-sending Feed encrypted. This would be a good place to toss in a protocol version header as well. hypercore has been pretty stable though, and this would be disruptive. Also (thinking out loud), how would the software figure out the type of feeds on disk (aka, mystery meat SLEEP files)?
- Extend the convention of always writing a "special" protobuf message as the first entry in every hypercore feed, and encode type info there. hyperdrive currently does this with a pointer to the data feed, but no other hypercore feed "types" do this (AFAIK).

@mafintosh, thoughts?

@pfrazee (Contributor) commented Mar 5, 2018:

Looking really good!

An example pseudo-code session working with a database might be:

db.put('/life/animal/mammal/kitten', '{"cuteness": 500.3}')
db.put('/life/plant/bush/banana', '{"delicious": 103.4}')
Comment:

don't know if you care but the banana has an extra n

bnewbold (author):

Thanks for the catch!


1. Calculate the `path` array for the key you are looking for.
2. Select the most-recent ("latest") Node for the feed.
3. Compare path arrays. If the paths match exactly, you have found the Node you
Comment:

SipHash is not collision resistant, so two distinct paths could have the same path array, right?

Reply:

Yes, in case of a collision the final node bucket will just have more than one node in it. The full final node bucket is read on every read (usually it contains only a single node) to do a full key comparison and check for collisions.
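
To make steps 1-3 concrete, here is a minimal in-memory sketch of the traversal, including the full-key comparison described above (the node shapes are assumed for illustration; real path arrays come from SipHash24, and multi-node collision buckets are elided):

```js
// In-memory sketch of the get() traversal. Assumed node shape:
//   { key, value, path: [2-bit values], trie: { index: { value: seq } } }
// `nodes` is the feed indexed by seq; `latest` is the most recent node.
function lookup (nodes, latest, targetKey, targetPath) {
  let node = latest
  while (node) {
    // Step 3: find the first index where the path arrays differ.
    let i = 0
    while (i < targetPath.length && node.path[i] === targetPath[i]) i++
    if (i === targetPath.length) {
      // Full path match: compare keys to rule out a hash collision.
      return node.key === targetKey ? node.value : undefined
    }
    // Otherwise follow the trie pointer for our 2-bit value at index i.
    const bucket = node.trie[i]
    const seq = bucket && bucket[targetPath[i]]
    node = seq === undefined ? null : nodes[seq]
  }
  return undefined // no pointer to follow: key not present
}
```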

implementation. The current implementation requires the most recent node in a
directory to point to all other nodes in the directory. Even with pointer
compression, this requires on the order of `O(N^2)` bytes; the HyperDB
implementation scales with `O(N log(N))`.
Comment:

👍 👍

this is what prevented us from doing datpedia the straightforward way

specified limit of 2 GByte (due to 32-bit signed arithmetic), and most
implementations impose a (configurable) 64 MByte limit. Should this DEP impose
specific limits on key and value sizes? Would be good to decide before Draft
status.
Comment:

since dat does rabin chunking, the 2GB limit seems fine ~ large files will always be divided into chunks smaller than that.

will the chunks always be below 64MB though? @mafintosh

Reply:

@dcposch note this is actually not for the files but just for the file metadata. In our case that is just a fs.Stat object with an additional numeric pointer to another feed where the actual file content is stored. This basically allows files as large as you'd want.

bnewbold (author):

Also, FWIW, dat (via hyperdrive) does not actually use Rabin chunking at the moment, just fixed-size chunks. The protocol is agnostic to chunking, so this can evolve in the future with little disruption if/when content-aware chunking is more performant.

bnewbold (author):

As a reminder, dat (via hyperdrive) is only one possible application on top of hyperdb; while hyperdrive should be pretty comfortable even with a couple-KByte metadata size limit, other applications might want to store arbitrary data in the hyperdb values; setting expectations around this could save frustration or growing pains later.

Reply:

@bnewbold agreed. IMO you should always only store "small" values directly in the hyperdb, say < 1 MB. Anything larger you should put in a "content feed" and then link it from the value.

bnewbold (author):

I think for Draft we can mention the inherent limits and then mention:

"It is STRONGLY recommended to keep both keys and values to less than 2 megabytes (individually and combined). If larger value sizes are desired, a separate content can be pointed to (via contentFeed), and the value in this feed can be metadata pointing to data (eg, contiguous blocks of entries) in that feed."


Are the "deletion" semantics here correct, in that deletion Nodes persist as
"tombstones", or should deleted keys be omitted from future `trie` fields to
remove their presence?
Comment:

it seems important to leave deletion nodes in place; that way, list() can iterate over all current and deleted nodes under a directory and then omit the deleted ones

otherwise, i think list() might return partial results

Reply:

@dcposch that's how it works atm, actually

array. How unlikely is this situation? Eg, how many keys in a database before
the birthday paradox results in a realistic chance of a collision? How should
implementations behave if they detect a collision (matching path but not
matching key)?
@dcposch commented Mar 15, 2018:

i found a collision so we can test this

after an hour or so with a little Go script on a 122GB r4.4xlarge instance...

Checked 7059000000 hashes
Checked 7060000000 hashes
Checked 7061000000 hashes
Checked 7062000000 hashes
Checked 7063000000 hashes
Checked 7064000000 hashes
COLLISION: SipHash24(mpomeiehc) = SipHash24(idgcmnmna) = 11615559218517537840

that's SipHash24 with an all-zero key, as described above. so if you make a dat and add the following files:

/mpomeiehc
/idgcmnmna

...you should be able to test a path collision

@mafintosh @bnewbold

Reply:

mafintosh/hyperdb@fd60344 <-- I've added @dcposch's hash collision to the tests. We support hash collisions; the two nodes just end up in the same bucket, similar to how a multiwriter fork would.
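
A sketch of checking the collision from Node.js, assuming the `siphash24` npm package with a `siphash24(data, key)` call that returns the 8-byte hash (the package and its exact signature are an assumption here, not something this DEP specifies):

```js
// Verify that the two path components above collide under SipHash24 with
// an all-zero 16-byte key, as the DEP specifies.
const siphash24 = require('siphash24') // assumed API: siphash24(data, key)

const key = Buffer.alloc(16) // all zeros
const a = Buffer.from(siphash24(Buffer.from('mpomeiehc'), key))
const b = Buffer.from(siphash24(Buffer.from('idgcmnmna'), key))

console.log(a.equals(b)) // expected: true
```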

When converting the 8-byte hash to an array of 2-bit values, the ordering is to
proceed byte-by-byte, and for each byte take the two lowest-value bits (aka,
`hash & 0x3`) as index `0`, the next two bits (aka, `hash & 0xC`) as index
`1`, etc. When concatenating path hashes into a longer array, the first

Reply:

Typo: "concatanating" → "concatenating".
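
A short sketch of the bit ordering just described (the function name is illustrative):

```js
// Convert an 8-byte hash into 32 2-bit values, taking the lowest-order
// bit pair of each byte first, exactly as described above.
function hashToPath (hash) { // hash: Buffer or Uint8Array of 8 bytes
  const path = []
  for (const byte of hash) {
    path.push(byte & 0x3)        // bits 0-1 (hash & 0x3) -> index 0
    path.push((byte >> 2) & 0x3) // bits 2-3 (hash & 0xC) -> index 1
    path.push((byte >> 4) & 0x3) // bits 4-5 -> index 2
    path.push((byte >> 6) & 0x3) // bits 6-7 -> index 3
  }
  return path // 32 values, each in 0..3
}
```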

For a `trie` with `N` buckets, each may have zero or more pointers. Typically
there are a maximum of 3 pointers per bucket, corresponding to the 4 possible
values minus the current Entry's value, but in the event of hash collisions (in
the path array space), there may be multiple pointers in the same bucket

Reply:

Typo: stray space before the comma in "(in the path array space) ,"; should read "(in the path array space),".

a pointer to the Entry with the same hash at the final array index.
5. If the paths don't entirely match, find the first index at which the two
arrays differ. Copy all `trie` elements (empty or not) into the new `trie`
for indices between the "current index" and the "differing index".

Reply:

Typo: "indicies" → "indices".

field of an Entry protobuf message.

Consider a trie array with `N` buckets and `M` non-empty buckets (`0 <= M <=
N`). In the encoded field, there will be `M` concatenated bytestrings of the

Reply:

Typo: "concatanated" → "concatenated".

the second bytestring chunk would be:

- index varint is `64` (65th element in the trie array; the terminating value)
- bitfield is `0b10000` (varint 1); there is a single pointer set... but it

Reply:

Typo: "btifield" → "bitfield".
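
A simplified decoding sketch for this layout, using the `varint` npm package; it assumes each pointer is a single varint entry-seq, which elides per-pointer details the real encoding may carry:

```js
// Simplified sketch: decode trie bytestrings of the form
//   <index varint> <bitfield varint> <one pointer varint per set bit>
const varint = require('varint')

function decodeTrie (buf) {
  const trie = {}
  let offset = 0
  while (offset < buf.length) {
    const index = varint.decode(buf, offset) // position in the trie array
    offset += varint.decode.bytes
    let bitfield = varint.decode(buf, offset) // which slots hold pointers
    offset += varint.decode.bytes
    const bucket = {}
    for (let slot = 0; bitfield !== 0; slot++, bitfield >>= 1) {
      if (bitfield & 1) {
        bucket[slot] = varint.decode(buf, offset) // pointer: entry seq
        offset += varint.decode.bytes
      }
    }
    trie[index] = bucket
  }
  return trie
}
```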

hash arrays, we now get all the way to index 34 before there is a difference.
We again look in the `trie`, find a pointer to entry index 0, and fetch the
first Entry and recurse. Now the path elements match exactly; we have found the
Entry we are looking for, and it has an existent `value`, so we return the

Reply:

Typo: "existant" → "existent".


The basic key/value semantics of hyperdb (as discussed in this DEP, not
considering multi-writer changes) are not known to introduce new privacy issues
when compared with, e.g., storing binary values at key-like paths using the

Reply:

Style: "eg," → "e.g.,".


Hyperdb is not backwards compatible with the existing hyperdrive metadata,
meaning dat clients may need to support both versions during a transition
period. This applies both to archives saved to disk (e.g., in SLEEP) and to

Reply:

Style: "eg," → "e.g.,".

a feed? See also <https://github.com/datprotocol/DEPs/issues/13>

Need to think through deletion process with respect to listing a path prefix;
will previously deleted nodes be occluded, or potentially show up in list

Reply:

Typo: "occulded" → "occluded".

db.get('/life/animal/mammal/kitten')
=> {"cuteness": 500.3}
db.list('/life/')
=> ['/life/animal/mammal/kitten', '/life/plant/tree/banana']
Comment (Contributor):

Is this example listing correct? It's showing keys multiple path segments below the prefix.

bnewbold (author):

I intended for it to be recursive; looks like there is a recursive flag in maf's hyperdb library. I've added a clarification.

care to the following scenarios: A malicious writer may be able to produce very
frequent key path hash collisions, which could degrade to linear performance. A
malicious writer could send broken trie structures that contain pointer loops,
duplicate pointers, or other invalid contents. A malicious writer could write
Comment (Contributor):

Pointer loops sound like they could be a pretty serious DoS attack vector. Do we have any way to mitigate that?

There's a calculable upper limit on lookups based on the number of prefixes... Perhaps the recursive algorithm needs to track how many lookups it does and then abort if it hits the limit?
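
A sketch of that mitigation: bound the number of pointer hops by what a well-formed trie allows, and treat anything beyond it as corrupt (the bound and error handling are illustrative, reusing the in-memory node shapes from the earlier lookup sketch):

```js
// In a well-formed trie each hop strictly increases the matched prefix, so
// hops can never exceed the path length; anything more implies a loop.
function guardedLookup (nodes, latest, targetKey, targetPath) {
  const limit = targetPath.length + 1
  let node = latest
  for (let hops = 0; node; hops++) {
    if (hops > limit) throw new Error('trie pointer loop: feed is corrupt')
    let i = 0
    while (i < targetPath.length && node.path[i] === targetPath[i]) i++
    if (i === targetPath.length) {
      return node.key === targetKey ? node.value : undefined
    }
    const bucket = node.trie[i]
    const seq = bucket && bucket[targetPath[i]]
    node = seq === undefined ? null : nodes[seq]
  }
  return undefined
}
```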

@bnewbold (author):

@noffle I've still got you in here as a co-author in attribution of your ARCHITECTURE.md document that was very helpful when starting this DEP. However, while this is derivative, almost none of that document remains in here, and I don't want to "co-author" you without consent. Any thoughts? I'll move you down to an Acknowledgements section if I don't hear back in a day or two.

@hackergrrl:

@bnewbold whatever you think is appropriate is fine by me!

@bnewbold (author):

Any further comments here? Poke @mafintosh. If we could approve/merge this week that would be great (e.g., at the protocol WG meeting next Wednesday at the latest).

@mafintosh commented Apr 26, 2018 via email.


For a database with most keys having N path segments, the cost of a `get()`
scales with the number of entries M as `O(log(M))` with best case 1 lookup and
worst case `4 * 32 * N = 128 * N` lookups (for a saturated `trie`).

Reply:

This worst case will only happen if all the hashes in the db are colliding, which is extremely unlikely. The worst case for a "perfect" hash is still log4(n) afaik. The best case is O(1), and the normal case is log4(n) / k, where k depends on how many shared hash prefixes there are.
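
To make the normal case concrete (illustrative arithmetic, not from the thread): a database of one million entries needs roughly log4(10^6) ≈ 10 pointer hops per get(), far below the `128 * N` worst case quoted above.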

# Unresolved questions
[unresolved]: #unresolved-questions

Should we declare that `contentFeed` pointers must *not* change over the history of

Reply:

yes

mafintosh mentioned this might be in the works. Does this DEP need to "leave
room" for those changes, or should we call out the potential for future change?
Probably not; the DEP should only describe existing solutions. This can be
resolved after Draft.

Reply:

It can and should be improved in the future (backwards compatibly). Step one is shipping what we have.


Review (or prove?) the `O(log(M))` intuition for `get()` operations. This could
happen after Draft status.

Reply:

It's log(M), yes.

@bnewbold merged commit 23fa355 into dat-ecosystem-archive:master on May 7, 2018
@bnewbold deleted the dep-hyperdb branch on May 7, 2018 03:07
@bnewbold (author) commented May 7, 2018:

To be explicit: left @noffle as a co-author. Thanks for kicking this whole thing off!
