
Does rayon guarantee that .par_bridge().map().collect() will not store too many "Item"s in memory? #1068

Closed
safinaskar opened this issue Jul 2, 2023 · 6 comments · Fixed by #1075


@safinaskar commented Jul 2, 2023

Hi. I wrote an iterator that reads data from a file, splits it into chunks, and returns them one by one. Then I apply .par_bridge().map(...).collect::<Vec<_>>() to that iterator. But my file does not necessarily fit into memory. So my question is: does this use case guarantee that rayon will never store too many chunks in memory (i.e. too many items of the original sequential iterator)? (The values produced by map are small -- in my application they are hashes of the chunks -- so it is okay for my application to store all of them in memory.)

If such a guarantee exists, then please document it.
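For concreteness, here is a minimal sketch of the usage described above; the file name, the 4 MiB chunk size, and DefaultHasher are placeholders for whatever the real application uses:

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::Hasher;
use std::io::Read;

use rayon::prelude::*;

// Sequential iterator that yields fixed-size chunks read from a file.
fn chunk_iter(mut file: File, chunk_size: usize) -> impl Iterator<Item = Vec<u8>> {
    std::iter::from_fn(move || {
        let mut buf = vec![0u8; chunk_size];
        let mut filled = 0;
        while filled < chunk_size {
            match file.read(&mut buf[filled..]) {
                Ok(0) => break,
                Ok(n) => filled += n,
                Err(e) => panic!("read error: {e}"),
            }
        }
        if filled == 0 {
            None
        } else {
            buf.truncate(filled);
            Some(buf)
        }
    })
}

fn main() -> std::io::Result<()> {
    let file = File::open("data.bin")?; // hypothetical input file
    let hashes: Vec<u64> = chunk_iter(file, 4 * 1024 * 1024)
        .par_bridge()
        .map(|chunk| {
            // Placeholder hash; the real application would use its own.
            let mut h = DefaultHasher::new();
            h.write(&chunk);
            h.finish()
        })
        .collect(); // note: the resulting order is not the original chunk order
    println!("hashed {} chunks", hashes.len());
    Ok(())
}
```

The question is whether only a bounded number of the chunks produced by chunk_iter can ever be alive at once inside rayon.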

@cuviper (Member) commented Jul 2, 2023

"Too many" is not specific enough to make guarantees.

par_bridge used to cache a lot of items from the iterator -- I think it was n^2 items for an n-thread pool -- but even then I don't think you'd have a memory problem unless each individual item was huge. The current implementation only pulls items one at a time as needed, which is as minimal as we can get: at most n items in flight, one for each thread.

I'm not sure that we want to commit to guarantees about that implementation, though. We've already changed it in a major way, and we might want to change it again if we figure out a new heuristic for performance.
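As an illustration of the "one item at a time" model, here is a simplified sketch -- not rayon's actual code -- of worker threads sharing a Mutex-protected iterator and pulling a single item each before processing it:

```rust
use std::sync::Mutex;
use std::thread;

// Simplified illustration only, not rayon's implementation: each worker locks
// the shared iterator just long enough to take one item, so at most one
// unprocessed item per thread is ever held.
fn pull_one_at_a_time<I, F>(iter: I, workers: usize, f: F)
where
    I: Iterator + Send,
    I::Item: Send,
    F: Fn(I::Item) + Sync,
{
    let shared = Mutex::new(iter);
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Release the lock before doing the (possibly slow) work.
                let next = shared.lock().unwrap().next();
                match next {
                    Some(item) => f(item),
                    None => break,
                }
            });
        }
    });
}

fn main() {
    // Each worker prints items as it pulls them; output order is arbitrary.
    pull_one_at_a_time(1..=20, 4, |n| println!("processed {n}"));
}
```

With per-thread pulls like this, at most `workers` unprocessed items exist at any moment, one held by each thread.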

@safinaskar (Author)

Thanks for the answer! It would still be great to get at least some guarantee; n^2 or even n^3 would do. Otherwise the library is unusable for my application. (Well, it is still usable in its current form, but I have to write rayon = "=1.7.0" to pin to a particular version, which is ugly.)

@cuviper (Member) commented Jul 3, 2023

We can (and should) at least guarantee that we don't consume the entire iterator at once -- this means that it's safe to use with unbounded iterators like std::ops::RangeFrom, std::iter::Repeat, etc.
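A small example of why that guarantee matters, assuming the current pull-on-demand behavior: searching an unbounded range through par_bridge terminates because no attempt is ever made to drain the iterator first.

```rust
use rayon::prelude::*;

fn main() {
    // The unbounded iterator is fine here because par_bridge pulls items only
    // as workers need them; find_any stops pulling once a match is found.
    let hit = (1_u64..).par_bridge().find_any(|&x| x % 99_991 == 0);
    println!("{hit:?}"); // Some multiple of 99_991; which one is not deterministic
}
```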

The other extreme is to say that we don't buffer items at all, which is actually the current state of things. That's a useful property because it enables patterns like channel send/recv that might otherwise deadlock, if existing items don't get processed before trying to recv more. Once promised though, I hope we would not regret it...
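A sketch of the kind of send/recv pipeline this enables, assuming the unbuffered behavior described here; the chunk contents and the channel capacity are made up:

```rust
use std::{sync::mpsc, thread};

use rayon::prelude::*;

fn main() {
    // Bounded channel: at most 4 unprocessed chunks can sit in the channel.
    let (tx, rx) = mpsc::sync_channel::<Vec<u8>>(4);

    // The producer blocks whenever the channel is full, so the number of
    // in-flight chunks stays small as long as the consumer side does not
    // stockpile items it has not yet processed.
    let producer = thread::spawn(move || {
        for i in 0..100u8 {
            tx.send(vec![i; 1024]).unwrap();
        }
        // tx is dropped here, which ends the receiver's iterator.
    });

    let sums: Vec<u64> = rx
        .into_iter()
        .par_bridge()
        .map(|chunk| chunk.iter().map(|&b| u64::from(b)).sum::<u64>())
        .collect();

    producer.join().unwrap();
    println!("processed {} chunks", sums.len()); // order of sums is arbitrary
}
```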

I'm less certain about the value of trying to put any particular numbers in-between.

@adamreichold (Collaborator) commented Jul 3, 2023

> Once promised though, I hope we would not regret it...

Thinking about the StreamExt API, this could be approached by multiple methods with different buffering strategies or an explicit parameter controlling the buffering. Possibly only if it ever becomes a problem after committing to no buffering for the default .par_bridge() invocation.
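Purely as an illustration of that idea, a hypothetical signature for an opt-in buffering knob might look like the sketch below; this is not rayon's API, and the names are made up.

```rust
use rayon::iter::ParallelIterator;

// Hypothetical API sketch only (not part of rayon): an opt-in buffering knob
// alongside the plain, unbuffered `par_bridge()`.
trait ParallelBridgeBuffered: Iterator + Send + Sized
where
    Self::Item: Send,
{
    type Bridged: ParallelIterator<Item = Self::Item>;

    /// Bridge the iterator, allowing at most `capacity` not-yet-processed
    /// items to be pulled ahead of the worker threads.
    fn par_bridge_buffered(self, capacity: usize) -> Self::Bridged;
}
```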

> I'm less certain about the value of trying to put any particular numbers in-between.

👍

@cuviper (Member) commented Jul 3, 2023

There are a lot of ways to implement buffering on the pre-rayon iterator side. Buffering on the rayon side would be trying to avoid some of the Mutex<Iter> bottleneck, as we used to do, but you're right that we could re-introduce that with explicit methods, or even a whole new type of bridge.
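One example of buffering on the pre-rayon iterator side (a sketch with made-up names): batch items sequentially before bridging, so each pull from the shared iterator hands a worker a whole batch instead of a single item.

```rust
use rayon::prelude::*;

// Group items from a sequential iterator into fixed-size batches.
fn batched<I: Iterator>(mut iter: I, size: usize) -> impl Iterator<Item = Vec<I::Item>> {
    std::iter::from_fn(move || {
        let batch: Vec<_> = iter.by_ref().take(size).collect();
        if batch.is_empty() { None } else { Some(batch) }
    })
}

fn main() {
    // Workers now pull one batch (4096 items) per lock of the shared iterator.
    let partial_sums: Vec<u64> = batched(0..1_000_000u64, 4096)
        .par_bridge()
        .map(|batch| batch.into_iter().sum::<u64>())
        .collect();
    println!("{} batches", partial_sums.len());
}
```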

So, okay, let's commit to not buffering. Anyone want to write that up in a PR?

@safinaskar (Author)

I will adopt my pull request #1071 in my code base, so I personally don't need any guarantees from par_bridge anymore. Still, the guarantees can be useful for others.
