gh-74028: concurrent.futures.Executor.map: add buffersize param for lazy behavior #125663

Open
wants to merge 12 commits into
base: main

Conversation

@ebonnal ebonnal commented Oct 17, 2024

Context recap (#74028)

concurrent.futures.Executor.map is not lazy:

  • it cannot be used to map over infinite input iterables
  • even for finite input iterables it uses $O(n)$ memory, because it collects all the futures into a list
  • because it submits all the futures at once, client code cannot limit the rate at which the input iterables are consumed, which is a critical feature when the mapped fn calls an external service that you don't want to overload.
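
The eager consumption described in these points can be observed with a small sketch, assuming a ThreadPoolExecutor. The names tracked and pulled are illustrative only and are not part of this PR:

```python
from concurrent.futures import ThreadPoolExecutor

pulled = []  # records every item taken from the input iterable


def tracked(n):
    """Yield 0..n-1, recording each item at the moment it is pulled."""
    for i in range(n):
        pulled.append(i)
        yield i


with ThreadPoolExecutor(max_workers=2) as executor:
    results = executor.map(str, tracked(5))
    # No result has been requested yet, but the whole input is already consumed.
    pulled_before_first_next = len(pulled)
    first = next(results)
```

Here pulled_before_first_next is 5: the call to executor.map alone exhausts the input, before any result is requested.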

Proposal: add buffersize param

This is similar to the work of @graingert in #18566 and @Jason-Y-Z in #114975: use a fixed-size queue to hold the not-yet-yielded future results, and only iterate over the input iterables while the queue is not full.

In addition this PR:

  • uses the intuitive term "buffer"
  • keeps the exact existing list-based behavior when buffersize=None
  • integrates concisely into existing logic
  • forbids using timeout and buffersize at the same time
  • tests "If the buffer is full, then the iteration over iterables is paused until a result is yielded from the buffer."
  • provides intuitive docstrings
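
For readers who want to try the idea without this branch, here is a rough standalone sketch of the buffering approach. It is not the PR's diff: lazy_map is a hypothetical helper written for illustration, and it assumes a ThreadPoolExecutor.

```python
import collections
import itertools
from concurrent.futures import ThreadPoolExecutor


def lazy_map(executor, fn, *iterables, buffersize):
    """Hypothetical helper: ordered map keeping at most ~buffersize pending futures."""
    if buffersize < 1:
        raise ValueError("buffersize must be >= 1")
    args_iter = zip(*iterables)
    # Submit the first `buffersize` tasks right away.
    fs = collections.deque(
        executor.submit(fn, *args)
        for args in itertools.islice(args_iter, buffersize)
    )

    def result_iterator():
        while fs:
            # Pull at most one more input and submit it,
            # then yield the oldest pending result.
            for args in itertools.islice(args_iter, 1):
                fs.append(executor.submit(fn, *args))
            yield fs.popleft().result()

    return result_iterator()


source = itertools.count(0)
with ThreadPoolExecutor(max_workers=2) as executor:
    results = lazy_map(executor, str, source, buffersize=2)
    first = next(results)      # submits a task for 2, then returns "0"
    next_input = next(source)  # only 0, 1 and 2 were pulled, so this is 3
    following = [next(results) for _ in range(3)]
```

The input is only advanced when a result is requested, so the infinite itertools.count source is usable here.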

📚 Documentation preview 📚: https://cpython-previews--125663.org.readthedocs.build/

cpython-cla-bot bot commented Oct 17, 2024

All commit authors signed the Contributor License Agreement.
CLA signed

bedevere-app bot commented Oct 17, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

Contributor

Zheaoli commented Oct 18, 2024

Thanks for the PR.

First, I think this is a big behavior change for Executor. I think we need to discuss it on https://discuss.python.org/ first.

In my personal opinion, adding the buffersize argument to the API is not a good choice. For now, the API design is based on the original map API. I think this argument will bring more inconsistency into the codebase. And by the way, even if we need the buffersize argument, I think it's not reasonable to forbid the usage of both timeout and buffersize at the same time:

The returned iterator raises a TimeoutError if next() is called and the result isn’t available after timeout seconds from the original call to Executor.map().
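
To make the quoted timeout semantics concrete, here is a minimal sketch of the existing timeout parameter, assuming a ThreadPoolExecutor. The names slow_identity and timed_out are hypothetical and only serve the illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeoutError


def slow_identity(x):
    """Hypothetical task that takes about half a second."""
    time.sleep(0.5)
    return x


def timed_out(timeout):
    """Return True if fetching the first result exceeds the deadline."""
    with ThreadPoolExecutor(max_workers=1) as executor:
        # The deadline is measured from this call, not from each next().
        results = executor.map(slow_identity, [1], timeout=timeout)
        try:
            next(results)
        except FuturesTimeoutError:
            return True
        return False


short_deadline_hit = timed_out(0.05)  # the task needs ~0.5s, so this times out
long_deadline_hit = timed_out(5)      # the task finishes well within 5s
```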

@ebonnal ebonnal force-pushed the fix-issue-29842 branch 2 times, most recently from 9eef605 to e5c867a Compare October 18, 2024 13:49
Author

ebonnal commented Oct 18, 2024

Hi @Zheaoli, thank you for your comment!

First, I think this is a big behavior change for Executor.

You mean a big alternative behavior, right? (the default behavior when omitting buffersize remains unchanged)

I think we need to discuss it on https://discuss.python.org/ first.

Fair, I will start a thread there and ping you.

For now, the API design is based on the original map API. I think this argument will bring more inconsistency into the codebase.

I'm not sure I follow; could you elaborate on that point? 🙏🏻

I think it's not reasonable to forbid the usage of both timeout and buffersize at the same time

You are completely right, makes more sense! I have fixed that (commit)

Contributor

Zheaoli commented Oct 19, 2024

I'm not sure I follow; could you elaborate on that point? 🙏🏻

For me, the basic map API's behavior is that when we pass an infinite iterator, the result is infinite and only stops when the iterator has stopped. I think we need to keep the same behavior between map and executor.map

Author

ebonnal commented Oct 20, 2024

Hi @Zheaoli

For me, the basic map API's behavior is that when we pass an infinite iterator, the result is infinite and only stops when the iterator has stopped. I think we need to keep the same behavior between map and executor.map

There may be a misunderstanding here: the goal of this PR is precisely to bring Executor.map closer to the built-in map behavior, i.e. to make it lazier. (map and the current executor.map do not have the same behavior.)

I will recap the behaviors so that everybody is on the same page:

built-in map

import itertools

infinite_iterator = itertools.count(0)

# a `map` instance is created and the func and iterable are just stored as attributes
mapped_iterator = map(str, infinite_iterator)

# retrieves the first element of its input iterator, applies
# the transformation and returns the result
assert next(mapped_iterator) == "0" 

# the next element in the input iterator is the 2nd
assert next(infinite_iterator) == 1

# next() can be called indefinitely
assert next(mapped_iterator) == "2"
assert next(mapped_iterator) == "3" 
assert next(mapped_iterator) == "4" 
assert next(mapped_iterator) == "5" 
...

Executor.map without buffersize (= current Executor.map)

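import itertools
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor()
infinite_iterator = itertools.count(0)

# this line runs FOREVER, trying to iterate over the input iterator until exhaustion
mapped_iterator = executor.map(str, infinite_iterator)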

⏫ This line runs forever because it eagerly consumes the entire input iterable in order to build the full list of futures, fs = [self.submit(fn, *args) for args in zip(*iterables)], which requires infinite time and memory.

Executor.map with buffersize

import itertools
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor()
infinite_iterator = itertools.count(0)

# retrieves the first 2 elements (=buffersize) and submits 2 tasks for them
mapped_iterator = executor.map(str, infinite_iterator, buffersize=2)

# retrieves the 3rd element of the input iterator and submits a task for it,
# then waits for the oldest future in the buffer to complete and returns its result
assert next(mapped_iterator) == "0" 

# the next element of the input iterator is the 4th
assert next(infinite_iterator) == 3

# next() can be called indefinitely, while only a finite buffer of not-yet-yielded future results is kept in memory
assert next(mapped_iterator) == "1" 
assert next(mapped_iterator) == "2" 
assert next(mapped_iterator) == "4"
assert next(mapped_iterator) == "5" 
...

note

I used an infinite input iterator because it is a case where the current Executor.map is simply unusable. But even for finite input iterables, a developer who writes mapped_iterator = executor.map(fn, iterable) often doesn't want the iterable to be eagerly exhausted right away, but rather to be iterated at the same rate as mapped_iterator. This PR proposes to allow that by setting a buffersize.

Author

ebonnal commented Oct 25, 2024

hey @rruuaanng, fyi I have applied your requested changes regarding the integration of unit tests into existing class 🙏🏻

@@ -594,10 +599,21 @@ def map(self, fn, *iterables, timeout=None, chunksize=1):
before the given timeout.
Exception: If fn(*args) raises for any values.
"""
if buffersize is not None and buffersize < 1:
Contributor

Why does it have to be None?

Author

It has to be either None or greater than 0. Would adding "either" make it clearer? -> ValueError("buffersize must be either None or >= 1.")
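
As a small illustration of the check being discussed, here is a sketch using a hypothetical standalone helper, check_buffersize, which is not a function in this PR. It reuses the condition from the diff and the error message suggested above:

```python
def check_buffersize(buffersize):
    """Hypothetical helper: accept None (eager behavior) or an integer >= 1."""
    if buffersize is not None and buffersize < 1:
        raise ValueError("buffersize must be either None or >= 1.")
    return buffersize


accepted = [check_buffersize(None), check_buffersize(1), check_buffersize(8)]

try:
    check_buffersize(0)
    rejected_zero = False
except ValueError:
    rejected_zero = True
```

None keeps the existing list-based behavior, so only non-None values below 1 are rejected.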

args_iter = iter(zip(*iterables))
if buffersize:
fs = collections.deque(
self.submit(fn, *args) for args in islice(args_iter, buffersize)
Contributor

Isn't buffersize empty? Can you introduce it? (Forgive me for not understanding it).

Author

Absolutely no problem, thank you for taking the time to review my proposal. To make sure I understand the question, what do you mean by "Isn't buffersize empty?"

3 participants