gh-74028: concurrent.futures.Executor.map: add `buffersize` param for lazy behavior #125663

Open

ebonnal wants to merge 12 commits into python:main from ebonnal:fix-issue-29842

ebonnal commented Oct 17, 2024

Context recap (#74028)

concurrent.futures.Executor.map is not lazy:

it cannot be used to map infinite input iterables
even for finite input iterables, it uses $O(n)$ memory because it collects all the future results into a list
because it submits all the tasks at once, client code cannot limit the rate at which the input iterables are consumed, which is critical when the mapped fn calls an external service that you don't want to overload
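
The eagerness described above can be observed with a small experiment. This is only an illustrative sketch: `tracked` is a hypothetical helper (not part of the PR) that records each item as it is pulled from the input, and a `ThreadPoolExecutor` is assumed as the executor.

```python
from concurrent.futures import ThreadPoolExecutor


def tracked(n, pulled):
    """Hypothetical helper: yield 0..n-1, recording each item as it is pulled."""
    for i in range(n):
        pulled.append(i)
        yield i


pulled = []
with ThreadPoolExecutor(max_workers=2) as executor:
    results = executor.map(str, tracked(5, pulled))
    # No result has been requested yet, but the input is already exhausted.
    pulled_before_first_next = list(pulled)
    first = next(results)

print(pulled_before_first_next)  # [0, 1, 2, 3, 4]
print(first)                     # 0
```

With an infinite input, the call to `executor.map` itself would never return, for the same reason.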

Proposal: add `buffersize` param

Similar to the work of @graingert in #18566 and @Jason-Y-Z in #114975: use a fixed-size queue to hold the not-yet-yielded future results, and only iterate over the input iterables when the queue is not full.

In addition this PR:

uses the intuitive term "buffer"
keeps the exact existing list-based behavior when buffersize=None
integrates concisely into existing logic
forbids the usage of both timeout and buffersize at the same time
tests "If the buffer is full, then the iteration over iterables is paused until a result is yielded from the buffer."
adds intuitive docstrings
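
To make the buffering idea concrete, here is a minimal standalone sketch of the mechanism. It is not the PR's implementation: `buffered_map` is a hypothetical free function written for illustration, it takes a single iterable, and it ignores `timeout`, `chunksize` and future cancellation.

```python
import collections
import itertools
from concurrent.futures import ThreadPoolExecutor


def buffered_map(executor, fn, iterable, buffersize):
    """Hypothetical helper: map fn over iterable, in input order, keeping at
    most `buffersize` submitted-but-not-yet-yielded futures between pulls."""
    if buffersize < 1:
        raise ValueError("buffersize must be >= 1")
    args_iter = iter(iterable)
    # Pre-fill the buffer with the first `buffersize` tasks at call time.
    buffer = collections.deque(
        executor.submit(fn, arg)
        for arg in itertools.islice(args_iter, buffersize)
    )

    def result_iterator():
        while buffer:
            # Pull at most one more input, then yield the oldest result.
            for arg in itertools.islice(args_iter, 1):
                buffer.append(executor.submit(fn, arg))
            yield buffer.popleft().result()

    return result_iterator()


source = itertools.count(0)
with ThreadPoolExecutor(max_workers=2) as executor:
    mapped = buffered_map(executor, str, source, buffersize=2)
    first_three = [next(mapped) for _ in range(3)]
    # 2 inputs were pre-fetched and 3 more were pulled by the next() calls.
    next_input = next(source)

print(first_three)  # ['0', '1', '2']
print(next_input)   # 5
```

The input is only advanced when a result is requested, so the consumer controls the iteration rate.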

Issue: Make Executor.map work with infinite/large inputs correctly #74028

📚 Documentation preview 📚: https://cpython-previews--125663.org.readthedocs.build/

cpython-cla-bot bot commented Oct 17, 2024

All commit authors signed the Contributor License Agreement.

bedevere-app bot commented Oct 17, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

bedevere-app bot added the awaiting review label

bedevere-app bot mentioned this pull request

Make Executor.map work with infinite/large inputs correctly #74028

Open

This was referenced Oct 17, 2024

bpo-29842: Make Executor.map less eager so it handles large/unbounded… #18566

Open

gh-74028: Introduce a prefetch parameter to Executor.map to handle large iterators #114975

Closed

rruuaanng reviewed

Lib/test/test_concurrent_futures/test_pool.py (outdated)

ebonnal force-pushed the fix-issue-29842 branch from 6a58c7d to 21f7b8d Compare

October 18, 2024 10:04

ebonnal requested a review from rruuaanng

October 18, 2024 10:06

Contributor

Zheaoli commented Oct 18, 2024

Thanks for the PR.

First, I think this is a big behavior change for Executor. I think we need to discuss it on https://discuss.python.org/ first.

In my personal opinion, adding the buffersize argument to the API is not a good choice. For now, the API design is based on the original map API. I think this argument will bring more inconsistency into the codebase. And BTW, even if we need the buffersize argument, I think it's not reasonable to forbid the usage of both timeout and buffersize at the same time:

The returned iterator raises a TimeoutError if next() is called and the result isn’t available after timeout seconds from the original call to Executor.map().
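
For reference, the quoted timeout behavior of the existing API can be reproduced as follows. This is a small sketch with arbitrary durations; `slow` is a hypothetical task function and a `ThreadPoolExecutor` is assumed.

```python
import concurrent.futures
import time


def slow(x):
    """Hypothetical task that takes far longer than the timeout below."""
    time.sleep(0.5)
    return x


timed_out = False
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
    # The deadline is computed once, when map() is called.
    results = executor.map(slow, [1, 2, 3], timeout=0.05)
    try:
        next(results)
    except concurrent.futures.TimeoutError:
        timed_out = True

print(timed_out)  # True
```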

ebonnal force-pushed the fix-issue-29842 branch 2 times, most recently from 9eef605 to e5c867a Compare

October 18, 2024 13:49

Author

ebonnal commented Oct 18, 2024

Hi @Zheaoli, thank you for your comment!

First, I think this is a big behavior change for Executor.

You mean a big alternative behavior, right? (The default behavior when omitting buffersize remains unchanged.)

I think we need to discuss it on https://discuss.python.org/ first.

Fair, I will start a thread there and ping you.

For now, the API design is based on the original map API. I think this argument will bring more inconsistency into the codebase.

I'm not sure I get it, could you detail that point? 🙏🏻

I think it's not reasonable to forbid the usage of both timeout and buffersize at the same time

You are completely right, makes more sense! I have fixed that (commit)

ebonnal force-pushed the fix-issue-29842 branch from 8bf7be7 to 769060e Compare

October 18, 2024 13:59

Contributor

Zheaoli commented Oct 19, 2024

I'm not sure I get it, could you detail that point? 🙏🏻

For me, the basic map API's behavior is that when we pass it an infinite iterator, the result is infinite and only stops when the iterator has been stopped. I think we need to keep the same behavior between map and executor.map

Author

ebonnal commented Oct 20, 2024

For me, the basic map API's behavior is that when we pass it an infinite iterator, the result is infinite and only stops when the iterator has been stopped. I think we need to keep the same behavior between map and executor.map

There may be a misunderstanding here: the goal of this PR is precisely to make Executor.map closer to the built-in map behavior, i.e. to make it lazier. (map and the current executor.map do not have the same behavior.)

I will recap the behaviors so that everybody is on the same page:

built-in `map`

import itertools

infinite_iterator = itertools.count(0)

# a `map` instance is created and the func and iterable are just stored as attributes
mapped_iterator = map(str, infinite_iterator)

# retrieves the first element of its input iterator, applies
# the transformation and returns the result
assert next(mapped_iterator) == "0" 

# the next element in the input iterator is the 2nd
assert next(infinite_iterator) == 1

# one can next infinitely
assert next(mapped_iterator) == "2"
assert next(mapped_iterator) == "3" 
assert next(mapped_iterator) == "4" 
assert next(mapped_iterator) == "5" 
...

`Executor.map` without `buffersize` (= current `Executor.map`)

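import itertools
from concurrent.futures import ThreadPoolExecutor

# any Executor would do; a thread pool is assumed here
executor = ThreadPoolExecutor()
infinite_iterator = itertools.count(0)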

# this line runs FOREVER, trying to iterate over input iterator until exhaustion
mapped_iterator = executor.map(str, infinite_iterator)

⏫ This line runs forever because it eagerly collects the entire input iterable in order to build the full list of futures, `fs = [self.submit(fn, *args) for args in zip(*iterables)]`, which requires infinite time and memory.

`Executor.map` with `buffersize`

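import itertools
from concurrent.futures import ThreadPoolExecutor

# any Executor would do; a thread pool is assumed here
executor = ThreadPoolExecutor()
infinite_iterator = itertools.count(0)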

# retrieves the first 2 elements (=buffersize) and submits 2 tasks for them
mapped_iterator = executor.map(str, infinite_iterator, buffersize=2)

# retrieves the 3rd element of input iterator and submits a task for it,
# then wait for the oldest future in the buffer to complete and returns the result
assert next(mapped_iterator) == "0" 

# the next element of the input iterator is the 4th
assert next(infinite_iterator) == 3

# one can next infinitely while only a buffer of finite not-yet-yielded future results is kept in memory
assert next(mapped_iterator) == "1" 
assert next(mapped_iterator) == "2" 
assert next(mapped_iterator) == "4"
assert next(mapped_iterator) == "5" 
...

note

I used the example of an infinite input iterator because it is a case where the current Executor.map is simply unusable. But even for finite input iterables, a developer who writes mapped_iterator = executor.map(fn, iterable) often doesn't want the iterable to be eagerly exhausted right away, but rather to be iterated at the same rate as mapped_iterator. This PR proposes to allow that by setting a buffersize.

ebonnal force-pushed the fix-issue-29842 branch from 769060e to be419ed Compare

October 21, 2024 17:54

Author

ebonnal commented Oct 25, 2024

hey @rruuaanng, fyi I have applied your requested changes regarding the integration of unit tests into existing class 🙏🏻

rruuaanng reviewed

Lib/concurrent/futures/_base.py

@@ -594,10 +599,21 @@ def map(self, fn, *iterables, timeout=None, chunksize=1):
                               before the given timeout.
                           Exception: If fn(*args) raises for any values.
                       """
+                      if buffersize is not None and buffersize < 1:

Contributor

rruuaanng Oct 25, 2024

Why does it have to be None?

Author

ebonnal Oct 25, 2024

It has to be either None or greater than 0. Would adding "either" make it clearer? -> ValueError("buffersize must be either None or >= 1.")
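
The accepted and rejected values can be illustrated with a tiny standalone check. `check_buffersize` is a hypothetical helper written only for this illustration; in the diff the check lives inline in `map`.

```python
def check_buffersize(buffersize):
    """Hypothetical helper mirroring the validation discussed above."""
    if buffersize is not None and buffersize < 1:
        raise ValueError("buffersize must be either None or >= 1.")
    return buffersize


# None keeps the existing eager behavior; positive integers enable buffering.
accepted = [check_buffersize(value) for value in (None, 1, 10)]

rejected = []
for value in (0, -3):
    try:
        check_buffersize(value)
    except ValueError:
        rejected.append(value)

print(accepted)  # [None, 1, 10]
print(rejected)  # [0, -3]
```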

Lib/concurrent/futures/_base.py

+                      args_iter = iter(zip(*iterables))
+                      if buffersize:
+                          fs = collections.deque(
+                              self.submit(fn, *args) for args in islice(args_iter, buffersize)

Contributor

rruuaanng Oct 25, 2024

Isn't buffersize empty? Can you introduce it? (Forgive me for not understanding it).

Author

ebonnal Oct 25, 2024

Absolutely no problem, thank you for taking the time to review my proposal. To make sure I understand the question, what do you mean by "Isn't buffersize empty?"

ebonnal and others added 6 commits

October 25, 2024 14:47


          bpo-29842: concurrent.futures.Executor.map: add buffersize param for lazy behavior

4e0bea6


          test_map_buffersize: 1s sleep

e6a8396


          mention chunksize in ProcessPoolExecutor's buffersize docstring

6f14f0b


          merge unittest into ExecutorTest

a996080


          fix versionchanged

2cd052d


          📜🤖 Added by blurb_it.

d27be36

ebonnal added 6 commits

October 25, 2024 14:47


          fix tests determinism

d4a8996


          add test_map_with_buffersize_on_empty_iterable

d8c3949


          allow timeout + buffersize

b28a996


          lint import

3cd7f6e


          tests: polish

5f3dd2f


          rephrase docstring

bb0e747

ebonnal force-pushed the fix-issue-29842 branch from e28a0f0 to bb0e747 Compare

October 25, 2024 13:47


Labels

awaiting review