Batch get option #491
Depending on what your keys look like, you could do a … @juliangruber Any other ideas?
- If the keys are adjacent, then …
- If there are small gaps (say you have keys …), …
- If there are big gaps (you have keys …), …
If you can design your keys in such a way (making use of the lexicographical sort order) that they are indeed adjacent, that's the preferred way. You wouldn't need to skip anything.
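The key-design point above can be sketched as follows. The `user!…!post!…` scheme and all names are invented for illustration, and a plain sorted array stands in for the store's key space:

```javascript
// Illustrative key scheme (names invented for this sketch): because
// keys sort lexicographically, all posts of one user share a prefix
// and are therefore adjacent on disk.
function postKey (userId, postId) {
  return "user!" + userId + "!post!" + postId;
}

// Range covering every post of one user; "\xff" sorts after any
// printable byte used in these keys.
function postRange (userId) {
  var prefix = "user!" + userId + "!post!";
  return { gte: prefix, lt: prefix + "\xff" };
}

// A plain sorted array stands in for the store's key space
var keys = [postKey(1, "a"), postKey(1, "b"), postKey(2, "a")].sort();
var range = postRange(1);
var hits = keys.filter(function (k) { return k >= range.gte && k < range.lt; });
console.log(hits); // only user 1's keys
```

With keys laid out like this, a single range read (one iterator with `gte`/`lt`) replaces many point gets.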
Retrieving a "bag of keys" is a pretty valid and common scenario which is not totally straightforward due to the async-ness of the API. Just now the API handles batches of noncontiguous writes, so it seems reasonable for it to handle batches of reads. If you want to fetch noncontiguous values, at the moment you have to carefully read the whole README just to confirm that there isn't a builtin, and then you have to work out some sort of boilerplate incantation to wire up a sequence of gets and package up the result (slightly tricky owing to the async nature of the API). Changing the API would be a bit drastic at this stage in the 2.x.x release, but I think the lib would be a lot more accessible if there was at least a quick novice-friendly copy-paste example for "batch get" in the docs.
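A copy-paste "batch get" of the kind asked for above might look like this. The `batchGet` helper is hypothetical (not part of levelup), and the in-memory stub below only mimics levelup's callback-style `get(key, cb)` and its `err.notFound` convention:

```javascript
// Hypothetical batch-get helper: wrap each callback-style get in a
// promise and run them together. Missing keys resolve to undefined
// instead of rejecting the whole batch.
function batchGet (db, keys) {
  return Promise.all(keys.map(function (key) {
    return new Promise(function (resolve, reject) {
      db.get(key, function (err, value) {
        if (err && err.notFound) return resolve(undefined);
        if (err) return reject(err);
        resolve(value);
      });
    });
  }));
}

// In-memory stub standing in for a levelup instance (illustration only)
var store = new Map([["a", 1], ["b", 2]]);
var db = {
  get: function (key, cb) {
    if (store.has(key)) return process.nextTick(cb, null, store.get(key));
    var err = new Error("NotFound");
    err.notFound = true;
    process.nextTick(cb, err);
  }
};

batchGet(db, ["a", "missing", "b"]).then(function (values) {
  console.log(values); // logs [ 1, undefined, 2 ]
});
```

Note that N parallel gets give no snapshot guarantee across the batch; for that you'd want a single iterator instead.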
@fergiemcdowall Can you share a use case? I'd try to shift the thinking to storing a bag of keys. Key design is important. If the keys are contiguous on disk, reading is too. In other words, if you need to read a lot of keys that "belong together", shouldn't they be stored together? Indexes are another option. A "batch get" is doable to implement BTW, with …
@vweevers I get why you are saying this: one of the great things about leveldb is that it handles ranges really well, and doing clever things with sorted indexes is where leveldb really shines. That said, there are plenty of situations where you have to grab a set of keys that are spread out all over an index. For example, I use … Although I use the hell out of …
Good point. I've had similar use cases. There's a difference though between needing a few keys (quick, brown, fox) and 15K. So I'm interested in the specifics of your use case too @NirmalVatsyayan. I think, rather than a batch get, I'd prefer to officially expose iterators with a … For the simple use cases, "batch get" can be a userland module on top of iterators.
It's also doable in …
@vweevers I am using the cache for re-ranking of results from a support vector machine. I get the N most similar items for some input vector and need to re-rank them based upon some business logic (fuzzy match + noun intersection etc.), so the keys could be anything among the data set. Basically I need to cache 2-3 million objects and want to access any random set of 15K from them.
@dominictarr this sounds like something more up your alley. Any advice for @NirmalVatsyayan? |
I was thinking about this too. For those *-down implementations without a native seek, something like:

```js
function noop () {}

Iterator.prototype.seek = function (target) {
  // Native implementation
  if (typeof this._iterator.seek === "function") {
    this._iterator.seek(target);
    return this;
  }

  // Polyfill
  // 1. End the internal iterator
  this._iterator.end(noop);

  // 2. New options: move the range start (or end, when reversed)
  //    to the seek target
  var options = Object.assign({}, this._options);
  options[this._options.reverse === true ? "lte" : "gte"] = target;
  // ... what about 'limit'?

  // 3. Swap!
  this._iterator = this._db.iterator(options);
  return this;
};
```
@peakji what was your use case for seeking again? Just so we have a full picture (and another argument for exposing iterators 😉). Also, do you need snapshot guarantees? Because with this proposed polyfill, …
@vweevers I'm using …

If we want to get the intersection (089 and 091), we only need two iterators (a:* and b:*) and several skips: …

This is much faster than naive scanning on large datasets, because we are able to skip a lot of keys (a:002 to a:070).
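The skip-based intersection described above can be sketched with mock iterators. Plain sorted arrays stand in for the a:* and b:* ranges, and `seek()` positions the iterator at the first entry >= target, as with seekable *-down iterators:

```javascript
// Mock iterator over a sorted array, mimicking next()/seek(target)
// (illustration only, not the *-down API itself).
function arrayIterator (sorted) {
  let i = 0;
  return {
    next: () => (i < sorted.length ? sorted[i++] : null),
    seek: (target) => {
      i = sorted.findIndex(v => v >= target);
      if (i < 0) i = sorted.length;
    }
  };
}

// Classic leapfrog intersection: whichever side is behind seeks
// directly to the other side's current key, skipping the gap.
function intersect (a, b) {
  const out = [];
  let x = a.next();
  let y = b.next();
  while (x !== null && y !== null) {
    if (x === y) { out.push(x); x = a.next(); y = b.next(); }
    else if (x < y) { a.seek(y); x = a.next(); }
    else { b.seek(x); y = b.next(); }
  }
  return out;
}

const docsA = ["002", "070", "089", "091"]; // doc ids under a:*
const docsB = ["089", "090", "091"];        // doc ids under b:*
console.log(intersect(arrayIterator(docsA), arrayIterator(docsB))); // [ '089', '091' ]
```

One seek jumps over "002" through "070" in a single step, which is the saving on large datasets.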
Ahh... I forgot the snapshot thing... 👍 for throwing an error.
A little history: …
Hi, I'm currently trying to implement a multi get with …

So for our use case (a random sparse list of keys), I have an implementation a bit like below:

```js
const iterator = level.iterator();

// wrap next() in a promise
const next = async () => {
  return new Promise((resolve, reject) => {
    iterator.next((err, key, value) => {
      if (err) return reject(err);
      resolve([key, value]);
    });
  });
};

for (let key of sortedKeys) {
  iterator.seek(key);
  const result = await next();
  // do something
}
```

Or should I just use …? Also I have a question about …

Does this mean after calling the …?

I guess for my use case, the most ideal method would be the proposed multiGet feature.
Where is it documented like that? We should update that, because in …
Thanks for the quick response, the doc is in https://github.com/Level/leveldown
Potentially, yes. With leveldown it's recommended to use …
I'm actually using rocksdb in my project, and was passing …
rocksdb also has the highWaterMark option. If you're expecting gaps between the keys (within your gte/lte range) then it can significantly boost perf.
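The tuning mentioned above would look something like the fragment below. This is a sketch only: the option name and byte-based semantics are assumed from the leveldown/rocksdb bindings, and the fragment is not runnable without them:

```javascript
// Hypothetical range read with a raised buffer size: the native side
// can then hand larger batches of entries to JS per internal round
// trip, which helps when matching keys are sparse within the range.
const it = db.iterator({
  gte: "a!",
  lt: "a!\xff",
  highWaterMark: 64 * 1024 // bytes per batch (assumed option name)
});
```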
Thanks, that'll be really useful. I've investigated a bit around the …
That doesn't ring a bell. Can you link to batchNext in the source code?
Nah, sorry to bother you. I found it in a third-party library and made a bad assumption that it was calling an internal method of …
Hi,

Is there any batch get option, or a trick to get values for multiple keys in a minimal number of passes? We have a use case where we need to get the values of 15K keys, and firing that many individual get requests is probably not the proper way.

Kindly assist.