Skip to content
This repository has been archived by the owner on Dec 1, 2024. It is now read-only.

Batch get option #491

Closed
NirmalVatsyayan opened this issue Sep 20, 2017 · 23 comments · Fixed by #725
Closed

Batch get option #491

NirmalVatsyayan opened this issue Sep 20, 2017 · 23 comments · Fixed by #725
Labels
enhancement New feature or request

Comments

@NirmalVatsyayan
Copy link

Hi,
Is there any batch get option or trick to get for multiple keys in minimum pass. We have a use case where we need to get values of 15 K keys, firing so many get requests is not a proper way probably.

Kindly assist.

@ralphtheninja
Copy link
Member

Depending on what your keys look like, you could do a db.createReadStream() with a key interval that incorporates all keys and just filter out the values for the keys that you are interested in.

@juliangruber Any other ideas?

@vweevers
Copy link
Member

If the keys are adjacent, then db.createReadStream({ limit: 15000 }) will do the trick.

If there are small gaps (say you have keys a, b, c in the database and you want to get a and c), then @ralphtheninja's suggestion of filtering is fine.

If there are big gaps (you have keys a-z but only want to get a and z), you could consider using seek() on an iterator, but this is a lower-level mechanism and only supported by leveldown and not exposed in levelup. So only go there if you really need it.

@vweevers
Copy link
Member

If you can design your keys in such a way (making use of the lexicographical sort order) that they are indeed adjacent, that's the preferred way. You wouldn't need to skip anything.

@fergiemcdowall
Copy link

Retrieving a "bag of keys" is pretty valid and common scenario which is not totally straightforward due the to asnyc-iness of the API. Just now the API handles batches of noncontiguous writes, so it seems reasonable for it to handle batches of reads.

If you want to fetch noncontiguous values, at the moment you have to carefully read the whole README just to confirm that there isnt a builtin, and then you have to work out some sort of boilerplate incantation to wire up a sequence of gets, and package up the result (slightly tricky owing to the async nature of the API).

Changing the API would be a bit drastic at this stage in the 2.x.x release, but I think the lib would be a lot more accessible if there was at least a quick novice-friendly copy-paste example for "batch get" in the docs.

@vweevers
Copy link
Member

@fergiemcdowall Can you share a use case? I'd try to shift the thinking to storing a bag of keys. Key design is important. If the keys are contiguous on disk, reading is too. In other words, if you need to read a lot of keys that "belong together", shouldn't they be stored together?

Indexes are another option.

A "batch get" is doable to implement BTW, with leveldown at least. Performance might depend on whether the input keys are sorted because forward seeks are faster. In any case it would beat multiple gets, ánd give you snapshot guarantees. But I'm not yet convinced levelup should have this interface.

@fergiemcdowall
Copy link

@fergiemcdowall Can you share a use case? I'd try to shift the thinking to storing a bag of keys. Key design is important. If the keys are contiguous on disk, reading is too. In other words, if you need to read a lot of keys that "belong together", shouldn't they be stored together?

@vweevers I get why you are saying this- one of the great things about leveldb is that it handles ranges really well, and doing clever things with sorted indexes is where leveldb really shines.

That said, there are plenty of situations where you have to grab a set of keys that are spread out all over an index. For example- I use level mostly to build the search-index module. search-index might handle a search for "The quick brown fox jumps over the lazy dog" in which case search-index looks for keys related to 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy' and 'dog'. There is no clever key structure that can make these arbitrary, unpredictable keys exist contiguously in the index.

Although I use the hell out of .createReadStream(), I also have quite a few "bag of keys" scenarios. A "batch get" would be mostly about reducing "code smell" for level in these instances, and if there was actually a performance benefit to boot, as you mention that there might be, then that would be even better.

@vweevers
Copy link
Member

Good point. I've had similar use cases. There's a difference though between needing a few keys (quick brown fox) and 15k. So I'm interested in the specifics of your use case too @NirmalVatsyayan.

I think, rather than a batch get, I prefer to officially expose iterators with a seek() method. This not only enables fetching multiple non-contiguous keys but also getting multiple ranges of keys. You seek, read some, seek again. Also you get backpressure and can do things like skip scans (for compound aka composite indexes).

For the simple use cases, "batch get" can be a userland module on top of iterators.

@vweevers
Copy link
Member

It's also doable in memdown (because there, iterators operate on an immutable tree; seeking is theoretically possible) and level.js (without backpressure though, because of how IndexedDB cursors work). If we choose to forego snapshot guarantees, then it's doable everywhere, but probably not with the same performance.

@NirmalVatsyayan
Copy link
Author

NirmalVatsyayan commented Sep 20, 2017

@vweevers I am using the cache for re-ranking of results from Support vector machine, i get N most similar items for some input vector and need to re rank them based upon some business logic (fuzzy match + noun intersection etc), so the keys could be any thing among the data set. Basically i need to cache 2-3 million objects and want to access any random set of 15 K from them.

@vweevers
Copy link
Member

@dominictarr this sounds like something more up your alley. Any advice for @NirmalVatsyayan?

@peakji
Copy link
Member

peakji commented Sep 21, 2017

@vweevers I prefer to officially expose iterators with a seek() method.

I was thinking about this too. For those *-down without iterator.seek(), we may provide a polyfill in levelup by swapping internal iterators:

Iterator.prototype.seek = function(target) {

	// Native implementation
	if (typeof(this._iterator.seek) == "function") {
		this._iterator.seek(target);
		return this;
	}

	// Polyfill

	// 1. End the internal iterator
	this._iterator.end(noop);

	// 2. New options
	var options = Object.assign({}, this._options);
	options[this._options.reverse === true ? "lte" : "gte"] = target;
	// ... what about 'limit'?

	// 3. Swap!
	this._iterator = this._db.iterator(options);

	return this;
};

@vweevers
Copy link
Member

@peakji what was your use case for seeking again? Just so we have a full picture (and another argument for exposing iterators 😉).

Also do you need snapshot guarantees? Because with this proposed polyfill, seek() operates on a new snapshot. If we can't provide consistent semantics, I suggest to just throw if the *down doesn't support seek().

@peakji
Copy link
Member

peakji commented Sep 21, 2017

@vweevers I'm using seek() to collect the intersection of two sorted lists (contiguous keys).
For example, if we have two sorted lists stored in two namespaces, sharing the same ID scheme:

A = [a:001, a:002, a:008, ... , a:070, a:089, a:091, a:111]
B = [b:072, b:089, b:091, b:200]

If we want to get the intersection (089 and 091), we only need two iterators (a:* and b:*) and several skips:

a:001 -> b:072 -> a:089 -> b:089 -> a:091 -> b:091 - ...

This is much faster than naive scanning on large datasets, because we are able to skip a lot of keys (a:002 to a:070).

Also do you need snapshot guarantees?

Ahh... I forgot the snapshot thing... 👍 for throwing error.

@vweevers
Copy link
Member

vweevers commented Sep 21, 2017

A little history:

  • @dominictarr proposed multiread (aka "batch get", "multiget") 4 years ago (multiread leveldown#13)
  • @juliangruber said "With a multiget the gain would increase with the numbers of keys given. I'm not sure yet about when this would pay off. Someone would have to do an implementation so we can see."
  • @rvagg warned "of potential size problems, perhaps we should put an upper limit on it or at least enforce the use of a limit property so at least they know what they are doing."
  • @dominictarr replied "I'm opposed to adding safety features to leveldown", referencing The Hole Hawg: "It is dangerous because it does exactly what you tell it to. [..] The danger lies not in the machine itself but in the user's failure to envision the full consequences of the instructions he gives to it."

  • @juliangruber proposed exposing snapshots (#138), which would allow multiple operations on a single snapshot. No compelling use cases were found.


@gogoout
Copy link

gogoout commented Sep 11, 2021

Hi, I'm currently try to implement a multi get with iterator and seek. My use case is we have a db with around 40m records in there. And we have to random get up to 15k of them when we do lookup.

So for our use case (a random sparse list of keys), I have an implementation a bit like below:

const iterator=level.iterator();

// wrap next in promise
const next = async()=>{
  return new Promise((resolve, reject)=>{
      iterator.next((err,key,value)=>{
           // resolve, reject
          // ...
      });
  })
}

for (let key of sortedKeys) {
  iterator.seek(key);
  const result =  await next();
  // do something
}

Or should I just use next to iterator every possible value than check it against the target?
My guts feeling is that using seek will be much faster because it's possible to skip a potential big gap.

Also I have a question about seek behaviour, in the document it says:

By calling seek(key), subsequent calls to next(cb) will return key/values larger or smaller than key, based on your reverse setting in the iterator constructor.

Does this mean after calling the seek(key), next() will land on a key that is larger but not equal to the given key? If this is the case, maybe I should make the key offset 1 character code before I call seek?

I guess for my use case, the most ideal method is using the purposed multiGet feature.

@vweevers
Copy link
Member

By calling seek(key), subsequent calls to next(cb) will return key/values larger or smaller than key, based on your reverse setting in the iterator constructor.

Where is it documented like that? We should update that, because in abstract-leveldown we have a more precise explanation that should answer your question:

Seek the iterator to a given key or the closest key. Subsequent calls to iterator.next() will yield entries with keys equal to or larger than target, or equal to or smaller than target if the reverse option passed to db.iterator() was true.

@gogoout
Copy link

gogoout commented Sep 11, 2021

Thanks for the quick response, the doc is in https://github.com/Level/leveldown

vweevers added a commit to Level/leveldown that referenced this issue Sep 11, 2021
@vweevers
Copy link
Member

My guts feeling is that using seek will be much faster because it's possible to skip a potential big gap.

Potentially, yes. With leveldown it's recommended to use db.iterator({ highWaterMark }) (Level/leveldown#468 (comment)).

@gogoout
Copy link

gogoout commented Sep 11, 2021

I'm actually using rocksdb in my project. And was passing {gte,lte} based on the sorted key when I created the iterator. I assume it will be the same as highWaterMark ?

@vweevers
Copy link
Member

rocksdb also has the highWaterMark option. If you're expecting gaps between the keys (within your gte/lte range) then it can significantly boost perf.

@gogoout
Copy link

gogoout commented Sep 11, 2021

Thanks, that'll be really useful. I've investigated a bit around the highWaterMark, and seems there is an internal batchNext function as well. Should I use batchNext together with highWaterMark or the normal seek will benefit from it as well?

@vweevers
Copy link
Member

That doesn't ring a bell, can you link to batchNext in source code?

@gogoout
Copy link

gogoout commented Sep 11, 2021

Nah, sorry to border you. I've find it in a third party libaray and made a bad assumption it's calling the internal method of leveldown. Thanks for the help again, that's really useful.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants