
Does rayon guarantee that .par_bridge().map().collect() will not store too many "Item"s in memory? #1068

Closed
safinaskar opened this issue Jul 2, 2023 · 6 comments · Fixed by #1075


@safinaskar commented Jul 2, 2023

Hi. I wrote an iterator that reads data from a file, splits it into chunks, and returns them one by one. Then I apply .par_bridge().map(...).collect::<Vec<_>>() to that iterator. But my file does not necessarily fit into memory. So my question is: does this use case guarantee that rayon will never store too many chunks in memory (i.e. too many items of the original sequential iterator)? (The values produced by map are small -- in my application they are hashes of the chunks -- so it is okay for my application to store all of them in memory.)

If such a guarantee exists, then please document it.
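For concreteness, here is a minimal sketch of the usage described above; the file name, the 4 MiB chunk size, and DefaultHasher are placeholders for whatever the real application uses:

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::Hasher;
use std::io::Read;

use rayon::prelude::*;

// Sequential iterator that yields fixed-size chunks read from a file.
fn chunk_iter(mut file: File, chunk_size: usize) -> impl Iterator<Item = Vec<u8>> {
    std::iter::from_fn(move || {
        let mut buf = vec![0u8; chunk_size];
        let mut filled = 0;
        while filled < chunk_size {
            match file.read(&mut buf[filled..]) {
                Ok(0) => break,
                Ok(n) => filled += n,
                Err(e) => panic!("read error: {e}"),
            }
        }
        if filled == 0 {
            None
        } else {
            buf.truncate(filled);
            Some(buf)
        }
    })
}

fn main() -> std::io::Result<()> {
    let file = File::open("data.bin")?; // hypothetical input file
    let hashes: Vec<u64> = chunk_iter(file, 4 * 1024 * 1024)
        .par_bridge()
        .map(|chunk| {
            // Placeholder hash; the real application would use its own.
            let mut h = DefaultHasher::new();
            h.write(&chunk);
            h.finish()
        })
        .collect(); // note: the resulting order is not the original chunk order
    println!("hashed {} chunks", hashes.len());
    Ok(())
}
```

The question is whether only a bounded number of the chunks produced by chunk_iter can ever be alive at once inside rayon.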

@cuviper (Member) commented Jul 2, 2023

"Too many" is not specific enough to make guarantees.

par_bridge used to cache a lot of items from the iterator -- I think it was n^2 items for an n-thread pool -- but even then I don't think you'd have a memory problem unless each individual item was huge. The current implementation only pulls items one at a time as needed, which is as minimal as we can get: at most n items in flight, one for each thread.

I'm not sure that we want to commit to guarantees about that implementation, though. We've already changed it in a major way, and we might want to change it again if we figure out a new heuristic for performance.
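As an illustration of the "one item at a time" model, here is a simplified sketch -- not rayon's actual code -- of worker threads sharing a Mutex-protected iterator and pulling a single item each before processing it:

```rust
use std::sync::Mutex;
use std::thread;

// Simplified illustration only, not rayon's implementation: each worker locks
// the shared iterator just long enough to take one item, so at most one
// unprocessed item per thread is ever held.
fn pull_one_at_a_time<I, F>(iter: I, workers: usize, f: F)
where
    I: Iterator + Send,
    I::Item: Send,
    F: Fn(I::Item) + Sync,
{
    let shared = Mutex::new(iter);
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Release the lock before doing the (possibly slow) work.
                let next = shared.lock().unwrap().next();
                match next {
                    Some(item) => f(item),
                    None => break,
                }
            });
        }
    });
}

fn main() {
    // Each worker prints items as it pulls them; output order is arbitrary.
    pull_one_at_a_time(1..=20, 4, |n| println!("processed {n}"));
}
```

With per-thread pulls like this, at most `workers` unprocessed items exist at any moment, one held by each thread.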

@safinaskar (Author)

Thanks for the answer! It would still be great to get at least some guarantee; n^2 or even n^3 would do. Otherwise the library is unusable for my application. (Well, it is still usable in its current form, but I have to write rayon = "=1.7.0" to pin to a particular version, which is ugly.)

@cuviper (Member) commented Jul 3, 2023

We can (and should) at least guarantee that we don't consume the entire iterator at once -- this means that it's safe to use with unbounded iterators like std::ops::RangeFrom, std::iter::Repeat, etc.
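A small example of why that guarantee matters, assuming the current pull-on-demand behavior: searching an unbounded range through par_bridge terminates because no attempt is ever made to drain the iterator first.

```rust
use rayon::prelude::*;

fn main() {
    // The unbounded iterator is fine here because par_bridge pulls items only
    // as workers need them; find_any stops pulling once a match is found.
    let hit = (1_u64..).par_bridge().find_any(|&x| x % 99_991 == 0);
    println!("{hit:?}"); // Some multiple of 99_991; which one is not deterministic
}
```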

The other extreme is to say that we don't buffer items at all, which is actually the current state of things. That's a useful property because it enables patterns like channel send/recv that might otherwise deadlock, if existing items don't get processed before trying to recv more. Once promised though, I hope we would not regret it...
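A sketch of the kind of send/recv pipeline this enables, assuming the unbuffered behavior described here; the chunk contents and the channel capacity are made up:

```rust
use std::{sync::mpsc, thread};

use rayon::prelude::*;

fn main() {
    // Bounded channel: at most 4 unprocessed chunks can sit in the channel.
    let (tx, rx) = mpsc::sync_channel::<Vec<u8>>(4);

    // The producer blocks whenever the channel is full, so the number of
    // in-flight chunks stays small as long as the consumer side does not
    // stockpile items it has not yet processed.
    let producer = thread::spawn(move || {
        for i in 0..100u8 {
            tx.send(vec![i; 1024]).unwrap();
        }
        // tx is dropped here, which ends the receiver's iterator.
    });

    let sums: Vec<u64> = rx
        .into_iter()
        .par_bridge()
        .map(|chunk| chunk.iter().map(|&b| u64::from(b)).sum::<u64>())
        .collect();

    producer.join().unwrap();
    println!("processed {} chunks", sums.len()); // order of sums is arbitrary
}
```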

I'm less certain about the value of trying to put any particular numbers in-between.

@adamreichold (Collaborator) commented Jul 3, 2023

> Once promised though, I hope we would not regret it...

Thinking about the StreamExt API, this could be approached by multiple methods with different buffering strategies or an explicit parameter controlling the buffering. Possibly only if it ever becomes a problem after committing to no buffering for the default .par_bridge() invocation.
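Purely as an illustration of that idea, a hypothetical signature for an opt-in buffering knob might look like the sketch below; this is not rayon's API, and the names are made up.

```rust
use rayon::iter::ParallelIterator;

// Hypothetical API sketch only (not part of rayon): an opt-in buffering knob
// alongside the plain, unbuffered `par_bridge()`.
trait ParallelBridgeBuffered: Iterator + Send + Sized
where
    Self::Item: Send,
{
    type Bridged: ParallelIterator<Item = Self::Item>;

    /// Bridge the iterator, allowing at most `capacity` not-yet-processed
    /// items to be pulled ahead of the worker threads.
    fn par_bridge_buffered(self, capacity: usize) -> Self::Bridged;
}
```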

> I'm less certain about the value of trying to put any particular numbers in-between.

👍

@cuviper (Member) commented Jul 3, 2023

There are a lot of ways to implement buffering on the pre-rayon iterator side. Buffering on the rayon side would be trying to avoid some of the Mutex<Iter> bottleneck, as we used to do, but you're right that we could re-introduce that with explicit methods, or even a whole new type of bridge.
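One example of buffering on the pre-rayon iterator side (a sketch with made-up names): batch items sequentially before bridging, so each pull from the shared iterator hands a worker a whole batch instead of a single item.

```rust
use rayon::prelude::*;

// Group items from a sequential iterator into fixed-size batches.
fn batched<I: Iterator>(mut iter: I, size: usize) -> impl Iterator<Item = Vec<I::Item>> {
    std::iter::from_fn(move || {
        let batch: Vec<_> = iter.by_ref().take(size).collect();
        if batch.is_empty() { None } else { Some(batch) }
    })
}

fn main() {
    // Workers now pull one batch (4096 items) per lock of the shared iterator.
    let partial_sums: Vec<u64> = batched(0..1_000_000u64, 4096)
        .par_bridge()
        .map(|batch| batch.into_iter().sum::<u64>())
        .collect();
    println!("{} batches", partial_sums.len());
}
```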

So, okay, let's commit to not buffering. Anyone want to write that up in a PR?

@safinaskar (Author)

I will adopt my pull request #1071 in my code base, so I personally don't need any guarantees from par_bridge anymore. Still, the guarantees can be useful for others.
