Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix world age issue with custom streams for Distributed workers #42481

Merged
merged 1 commit into from
Nov 16, 2021

Conversation

c42f
Copy link
Member

@c42f c42f commented Oct 3, 2021

If connect(::CustomClusterManager, ...) returns a custom transport
stream, use of that stream by the task in start_gc_msgs_task() may fail
due to the task executing in an old world age. Add an invokelatest() to
prevent this problem.

CC @tanmaykm

Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems odd to be doing this without holding a relevant lock, but lgtm

@kshyatt kshyatt added the parallelism Parallel or distributed computation label Oct 4, 2021
@c42f
Copy link
Member Author

c42f commented Oct 5, 2021

Seems odd to be doing this without holding a relevant lock

A lock around the global worker state in Distributed? Definitely a good idea, though orthogonal to the problem fixed here.

@vchuravy
Copy link
Member

vchuravy commented Oct 5, 2021

Would be good to have a test for custom message streams.

@vtjnash
Copy link
Member

vtjnash commented Oct 5, 2021

yeah, many locks are likely missing, since we next try to iterate the shared (PGRP::ProcessGroup).workers without any synchronization

@c42f
Copy link
Member Author

c42f commented Oct 6, 2021

Would be good to have a test for custom message streams.

Yeah, testing this reliably seemed like a lot more work than just fixing it 😬 In practice we found this problem to be quite intermittent in our use case where we have a custom TLSStream type which was loaded in a library as part of the same using statement as Distributed. I guess we could try to add a bunch of yield() to encourage the task within Distributed.__init__ to be scheduled, after which we can load the code for a custom stream.

@vchuravy
Copy link
Member

vchuravy commented Oct 6, 2021

I guess we could try to add a bunch of yield() to encourage the task within Distributed.init to be scheduled, after which we can load the code for a custom stream.

Not necessaily? After #41449 tasks have a predictable world. So you just need to load and define some new code, and you should hit the worldage issue.

@vchuravy vchuravy added the needs tests Unit tests are required for this change label Oct 6, 2021
@c42f
Copy link
Member Author

c42f commented Oct 6, 2021

Good point! We hit this in 1.6 but I forgot that #41449 is already merged in dev.

@vtjnash vtjnash added the merge me PR is reviewed. Merge when all tests are passing label Nov 16, 2021
@DilumAluthge
Copy link
Member

@c42f Can you rebase on the latest master? This should fix the tester_freebsd64 failures.

If connect(::CustomClusterManager, ...) returns a custom transport
stream, use of that stream by the task in start_gc_msgs_task() may fail
due to the task executing in an old world age. Add an invokelatest() to
prevent this problem.
@c42f c42f force-pushed the cjf/distributed-custom-streams-fix branch from 2ff7930 to 2e44818 Compare November 16, 2021 04:47
@c42f
Copy link
Member Author

c42f commented Nov 16, 2021

Rebased.

Truth be told I still don't know how to write a good test for this without rewriting some things to have more control over how the flush_gc_msgs task is launched.

@DilumAluthge
Copy link
Member

CI is all green, but it looks like the PR still needs a test, so I'll remove the merge me label until a test is added.

@DilumAluthge DilumAluthge removed the merge me PR is reviewed. Merge when all tests are passing label Nov 16, 2021
@KristofferC
Copy link
Member

KristofferC commented Nov 16, 2021

This seems strictly better than the existing code and is likely to fix a real problem. The change is also such that it is extremely unlikely to cause any regressions. So merge and open an issue about adding a test?

@DilumAluthge
Copy link
Member

That seems fine to me!

@c42f c42f merged commit a05bcb2 into master Nov 16, 2021
@c42f c42f deleted the cjf/distributed-custom-streams-fix branch November 16, 2021 23:32
@c42f
Copy link
Member Author

c42f commented Nov 17, 2021

Seems good to me. I've merged this and opened #43109.

@c42f c42f added backport 1.6 Change should be backported to release-1.6 backport 1.7 labels Nov 17, 2021
@KristofferC KristofferC mentioned this pull request Nov 19, 2021
29 tasks
KristofferC pushed a commit that referenced this pull request Nov 19, 2021
If connect(::CustomClusterManager, ...) returns a custom transport
stream, use of that stream by the task in start_gc_msgs_task() may fail
due to the task executing in an old world age. Add an invokelatest() to
prevent this problem.

(cherry picked from commit a05bcb2)
KristofferC pushed a commit that referenced this pull request Nov 26, 2021
If connect(::CustomClusterManager, ...) returns a custom transport
stream, use of that stream by the task in start_gc_msgs_task() may fail
due to the task executing in an old world age. Add an invokelatest() to
prevent this problem.

(cherry picked from commit a05bcb2)
KristofferC pushed a commit that referenced this pull request Dec 7, 2021
If connect(::CustomClusterManager, ...) returns a custom transport
stream, use of that stream by the task in start_gc_msgs_task() may fail
due to the task executing in an old world age. Add an invokelatest() to
prevent this problem.

(cherry picked from commit a05bcb2)
@KristofferC KristofferC removed the backport 1.6 Change should be backported to release-1.6 label Dec 11, 2021
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
…aLang#42481)

If connect(::CustomClusterManager, ...) returns a custom transport
stream, use of that stream by the task in start_gc_msgs_task() may fail
due to the task executing in an old world age. Add an invokelatest() to
prevent this problem.
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
…aLang#42481)

If connect(::CustomClusterManager, ...) returns a custom transport
stream, use of that stream by the task in start_gc_msgs_task() may fail
due to the task executing in an old world age. Add an invokelatest() to
prevent this problem.
staticfloat pushed a commit that referenced this pull request Dec 23, 2022
If connect(::CustomClusterManager, ...) returns a custom transport
stream, use of that stream by the task in start_gc_msgs_task() may fail
due to the task executing in an old world age. Add an invokelatest() to
prevent this problem.

(cherry picked from commit a05bcb2)
vchuravy pushed a commit to JuliaLang/Distributed.jl that referenced this pull request Oct 6, 2023
…aLang/julia#42481)

If connect(::CustomClusterManager, ...) returns a custom transport
stream, use of that stream by the task in start_gc_msgs_task() may fail
due to the task executing in an old world age. Add an invokelatest() to
prevent this problem.

(cherry picked from commit 24afe90)
Keno pushed a commit that referenced this pull request Jun 5, 2024
If connect(::CustomClusterManager, ...) returns a custom transport
stream, use of that stream by the task in start_gc_msgs_task() may fail
due to the task executing in an old world age. Add an invokelatest() to
prevent this problem.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs tests Unit tests are required for this change parallelism Parallel or distributed computation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants