
Support multithreading in groupreduce #2491

Draft: nalimilan wants to merge 12 commits into main from nl/threadedgrouping

Conversation

@nalimilan (Member)

Keep the default to a single thread until we find a reliable way of predicting a reasonably optimal number of threads.

@@ -453,18 +453,22 @@ julia> combine(gd, :, AsTable(Not(:a)) => sum, renamecols=false)
 ```
 """
 function combine(f::Base.Callable, gd::GroupedDataFrame;
-                 keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)
+                 keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true,
+                 nthreads::Int=NTHREADS)
@nalimilan (Member Author) commented on this diff:

Ideally we should use a keyword argument name that is standard across packages, but for now I haven't found examples of similar cases elsewhere.

We should also anticipate what it will look like once we switch the default to the automatic optimal number of threads. Would it be nthreads=nothing, nthreads=:auto, or nthreads=0? We could also use another name: threads=true looks good as it also works with e.g. threads=5. Though using a different type forces recompilation, so maybe 0 is better as a default, even though it's not super logical.
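
To make the trade-off concrete, here is a toy illustration of the three default styles mentioned above (hypothetical function names, not DataFrames code):

```julia
# Option A: `nothing` as the "auto" sentinel; the keyword accepts two types.
auto_a(; nthreads::Union{Nothing, Int}=nothing) = something(nthreads, Threads.nthreads())

# Option B: 0 as the "auto" sentinel; the keyword stays a plain Int.
auto_b(; nthreads::Int=0) = nthreads == 0 ? Threads.nthreads() : nthreads

# Option C: a `threads` keyword accepting `true`/`false` or an explicit count.
auto_c(; threads::Union{Bool, Int}=true) =
    threads === true ? Threads.nthreads() :
    threads === false ? 1 : threads
```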

@bkamins (Member) commented Oct 17, 2020

Note that after #2481 method signatures will change so this PR will have merge conflicts.

@bkamins bkamins added this to the 1.x milestone Oct 17, 2020
@nalimilan nalimilan force-pushed the nl/threadedgrouping branch from b676e63 to bc4dc71 Compare October 17, 2020 22:25
@nalimilan nalimilan force-pushed the nl/threadedgrouping branch 2 times, most recently from 1a977d3 to 1933482 Compare November 24, 2020 22:48
Keep the default to a single thread until we find a reliable way of
predicting a reasonably optimal number of threads.
@nalimilan (Member Author) commented Nov 25, 2020

I've rebased against master. Should be good for a review now. I just hope CI really used 2 threads but it's not visible in the logs AFAICT. EDIT: seeing Codecov, probably not.

@nalimilan nalimilan marked this pull request as ready for review November 25, 2020 10:03
@bkamins (Member) commented Nov 25, 2020

> EDIT: seeing Codecov, probably not.

@quinnj - could you please have a quick look at the CI set-up? Thank you!

@nalimilan (Member Author)

Wait, I just did something completely stupid when setting the environment variable. I've committed a probable fix.

@nalimilan (Member Author)

OK that worked!

@nalimilan (Member Author)

I've added DataFrames.nthreads (called to choose the default nthreads) and DataFrames.nthreads! to set it, to make it easier to choose the number of threads once for a session, as requested by Jan.
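
A minimal sketch of what such a getter/setter pair could look like, assuming a module-level `Ref` as the backing store (the names follow the PR, but the details are illustrative):

```julia
# Illustrative sketch, not the exact PR implementation.
const NTHREADS = Ref(1)  # session-wide default number of threads

"Return the current default number of threads used by grouped operations."
nthreads() = NTHREADS[]

"Set the default number of threads used by grouped operations for this session."
function nthreads!(n::Int)
    n > 0 || throw(ArgumentError("n must be equal to or greater than 1 (got $n)"))
    NTHREADS[] = n
    return n
end
```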

@tk3369 (Contributor) commented Nov 25, 2020

I was just thinking about nthreads! sounding a little weird because it's not a verb. Given that it's a global setting, would it make sense to create a struct that holds on to all global settings? Then, the user can configure it via setproperty!. Just a thought...
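
For illustration, the alternative suggested here might look roughly like this (hypothetical names, not part of the PR):

```julia
# A mutable struct holding global settings; `SETTINGS.nthreads = 4` goes through
# `setproperty!`, so validation could be added by overloading `Base.setproperty!`.
mutable struct Settings
    nthreads::Int
end

const SETTINGS = Settings(1)

SETTINGS.nthreads = 4   # configure the global setting via property assignment
```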

@nalimilan (Member Author)

> I was just thinking about nthreads! sounding a little weird because it's not a verb. Given that it's a global setting, would it make sense to create a struct that holds on to all global settings? Then, the user can configure it via setproperty!. Just a thought...

Yes, I also thought about setnthreads!. That's an eternal question about naming conventions for setters in Julia. But note that we have Random.seed! already in the stdlib, which is quite similar.

Adding a struct for global settings sounds like overkill at this point since it's not clear we're going to add more settings in the future. I'd rather keep this minimal, so that it can be extended if needed in the future, e.g. using Preferences.jl for storage across sessions.
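
For reference, persisting such a setting across sessions with Preferences.jl could look roughly like this (illustrative only; `@set_preferences!` and `@load_preference` need to be called from within a package module, since Preferences keys off the package UUID):

```julia
using Preferences

# Save the chosen default to LocalPreferences.toml so it survives restarts.
save_default_nthreads!(n::Int) = @set_preferences!("nthreads" => n)

# Read it back, falling back to 1 when nothing has been saved.
default_nthreads() = @load_preference("nthreads", 1)
```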

Use [`DataFrames.nthreads`](@ref) to access the value.
"""
function nthreads!(n::Int)
    n > 0 || throw(ArgumentError("n must be equal to or greater than 1 (got $n)"))
A Member commented on this diff:

We might also cap this at Threads.nthreads() here so that later DataFrames.nthreads reports the number of threads that is actually going to be used.
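
Applying that suggestion to the setter would be a one-line change, roughly (sketch only):

```julia
function nthreads!(n::Int)
    n > 0 || throw(ArgumentError("n must be equal to or greater than 1 (got $n)"))
    # Cap at the number of threads Julia was started with, so that
    # DataFrames.nthreads() reports what will actually be used.
    NTHREADS[] = min(n, Threads.nthreads())
    return NTHREADS[]
end
```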

@bkamins (Member) commented Nov 26, 2020

> I'd rather keep this minimal

I am OK with the current design as it is an implementation detail. Though probably having a mutable struct instead of a Ref is more natural (but we can always change it in the future).

Regarding the naming - I am OK with nthreads!, but in general I do not have a super strong opinion here.

@nalimilan (Member Author)

The Ref is an implementation detail, but we have to decide whether DataFrames.nthreads and DataFrames.nthreads! are part of the public API. If we're hesitant we could avoid mentioning these in the docs and only use them for benchmarking for now (like DataFrames.precompile).

@bkamins (Member) commented Nov 26, 2020

I am OK to have them as a part of the public API.

@StefanKarpinski - if Julia Base were to add an option to change the number of available threads dynamically during the Julia session, would Threads.nthreads! be the name for this operation?

@bkamins (Member) commented Jan 24, 2021

Can this be closed given the other approach we took? Or should we keep it open for the future?

@nalimilan (Member Author)

This PR is complementary to the other threaded grouping PRs, since without it optimized reductions are not multithreaded. We should check whether we can identify conditions under which multithreading is faster than single-threading.

@bkamins (Member) commented Mar 6, 2021

I had totally forgotten about this PR. I thought you had merged it :). So what is the status here - are you working on the condition, or should I review it as is?

Thank you!

@nalimilan (Member Author)

Given that we now automatically spawn as many tasks as there are threads in other places, requiring users to set NTHREADS for this function may be a bit weird. I guess we should try to automatically compute a reasonable number of threads -- but that turned out to be quite tricky.

@bkamins (Member) commented Mar 6, 2021

> requiring users to set NTHREADS for this function may be a bit weird

I agree

> but that turned out to be quite tricky.

Is it possible to give some safe condition though? (I do not remember the details of our earlier discussion)

@nalimilan (Member Author)

> Is it possible to give some safe condition though? (I do not remember the details of our earlier discussion)

Probably. I need to look at the data again. What's difficult is that there are several dimensions to cross: number of groups, number of rows, number of threads. And IIRC the overhead due to threading is larger when Julia was started with many threads.

Finally, there's always the possibility that all CPUs are already busy with another threaded operation, so that spawning multiple tasks is counter-productive. Maybe that's not a serious issue.
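
As a purely illustrative example of the kind of condition being discussed (the thresholds below are made up and would have to come from benchmarks across those dimensions):

```julia
# Hypothetical heuristic: only use multiple tasks when the table is large enough
# that task-spawning overhead is likely to be amortized.
function use_threads(nrows::Integer, ngroups::Integer)
    return Threads.nthreads() > 1 && nrows >= 1_000_000 && ngroups >= 2
end
```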

@bkamins (Member) commented Mar 8, 2021

Should I mark it as 1.0, or keep it 1.x and we will merge it when we are comfortable?

@nalimilan (Member Author)

I think this doesn't have to happen before 1.0.

Comment on lines +203 to +204
start = 1 + ((tid - 1) * length(groups)) ÷ nt
stop = (tid * length(groups)) ÷ nt
A Member commented on this diff:

Avoid overflow on 32-bit machines; on 64-bit machines this is a no-op.

Suggested change:
-start = 1 + ((tid - 1) * length(groups)) ÷ nt
-stop = (tid * length(groups)) ÷ nt
+start = 1 + ((tid - 1) * Int64(length(groups))) ÷ nt
+stop = (tid * Int64(length(groups))) ÷ nt
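
For context, the two lines above split `1:length(groups)` into `nt` contiguous chunks, one per task; a self-contained sketch of the same pattern (not the PR's actual groupreduce code) looks like this:

```julia
using Base.Threads: @spawn

# Split the work into `nt` contiguous chunks and reduce each one in its own task.
# Converting to Int64 first keeps the intermediate product from overflowing on
# 32-bit machines; on 64-bit machines it is a no-op.
function partitioned_sum(v::AbstractVector, nt::Int)
    tasks = map(1:nt) do tid
        start = 1 + ((tid - 1) * Int64(length(v))) ÷ nt
        stop = (tid * Int64(length(v))) ÷ nt
        @spawn sum(view(v, start:stop))
    end
    return sum(fetch, tasks)
end
```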

@bkamins (Member) commented Jan 31, 2022

@nalimilan - what do we do with this PR? Close it (as it is likely outdated)?

In general, for the future we should rather add a kwarg allowing users to choose whether to use multithreading, cf. #2992.

@bkamins (Member) commented Feb 20, 2022

bump

@nalimilan - what do you think we should do with this PR?

@nalimilan (Member Author)

I'm not sure. There are two aspects to address:

@bkamins (Member) commented Feb 21, 2022

@nalimilan - following our discussion I recommend to do what is written in #2988 (comment).

We would then implement the proposed change. The following things need to be done to implement it:

  1. implement the feature of this PR
  2. implement enabling/disabling threading of what you propose in this PR
  3. implement enabling/disabling threading in general in _combine (across transformation operations)
  4. implement ntasks keyword argument in all relevant functions

@nalimilan - can you please comment on which parts of this list you would be willing to have a look at? (I understand that 1, 2, and 4 must be covered in this PR, but 3 can be done in a separate follow-up PR, which I can do if you prefer, after this PR is finished but before the 1.4 release.)

Additional maintenance tasks:

  • add tests for all cases
  • create and run benchmarks
  • update docstrings
  • update documentation
  • update NEWS.md

@nalimilan (Member Author)

OK I'll rebase this PR, but I'd rather leave point 3 for another one.

@bkamins (Member) commented Feb 23, 2022

Sure. Thank you!

@bkamins (Member) commented Feb 23, 2022

(probably starting a new branch and making proper changes + force push rather than rebase would be easier, as I am not sure by how much we have changed the sources you touched)

@bkamins (Member) commented Jun 12, 2022

@nalimilan - what are your plans with this PR 😄? (I am asking because it is getting so old that there is a risk that it is better to re-implement it from scratch if we want to have it given the recent changes)

@nalimilan (Member Author)

Now that we have the threads keyword argument, we could allow passing an integer to enable multithreaded groupreduce. But it would be kind of weird that the default threads=true wouldn't enable multithreading here, while e.g. threads=2 would. Also threads=2 wouldn't actually limit the number of tasks to 2 as the existing code doesn't really allow doing that (nor should it). And anyway many people wouldn't benefit from it as they will probably expect threading to be automatic, just like it is in other cases.

So it would probably be worth experimenting more to find conditions under which we can be sure starting multiple tasks is faster.
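
For reference, a quick sketch of how the `threads` keyword reads today versus the hypothetical integer form discussed above (current API to the best of my understanding; the integer form is not implemented):

```julia
using DataFrames

df = DataFrame(a=repeat(1:3, 4), x=1:12)
gd = groupby(df, :a)

combine(gd, :x => sum; threads=true)   # current default: spawning tasks is allowed
combine(gd, :x => sum; threads=false)  # force single-threaded execution

# Hypothetical extension discussed above (not implemented):
# combine(gd, :x => sum; threads=2)
```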

@bkamins (Member) commented Jun 14, 2022

OK - so let us keep it open for now.

@bkamins bkamins modified the milestones: 1.x, 1.5 Jun 19, 2022
@bkamins bkamins mentioned this pull request Aug 24, 2022
@nalimilan nalimilan changed the base branch from master to main August 25, 2022 19:34
@bkamins bkamins marked this pull request as draft December 2, 2022 10:07
@bkamins (Member) commented Dec 2, 2022

I converted this PR to draft to avoid accidental merging.

@bkamins bkamins modified the milestones: 1.5, 1.x Dec 2, 2022