Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Distributed.Worker threadsafe [Take 2] #38134

Merged
merged 4 commits into from
Oct 22, 2020
Merged

Conversation

vchuravy
Copy link
Member

cc: @jonas-schulze

If my understanding is correct, the issue was that on the slower CI machines the timedwait could feasibly timeout and cause the subsequent test to fail. This replaces the timedwait with actual waits.

Should be squashed on merge.

Fixes JuliaLang/Distributed.jl#73

@vchuravy vchuravy added the parallelism Parallel or distributed computation label Oct 21, 2020
Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, the dreaded timedwait function. Yes we should always try to avoid that in preference for more responsible implementations.

@jonas-schulze
Copy link
Contributor

The timedwait was not the issue; I used it on purpose so that the tests definitely do not get stuck (or: fail before CI times out after 1h or so). If it would have been the issue, the tests would just have failed, but they errored.

I'm on vacation this week and don't have a laptop at hand.

@vchuravy
Copy link
Member Author

If it would have been the issue, the tests would just have failed, but they errored.

I was looking at https://build.julialang.org/#/builders/52/builds/4632/steps/5/logs/stdio

      From worker 4:	from thread 1.1 to 2.2: Test Failed at /buildworker/worker/tester_linuxaarch64/build/share/julia/stdlib/v1.6/Distributed/test/threads.jl:48
      From worker 4:	  Expression: isdone(recv)
      From worker 4:	Stacktrace:
      From worker 4:	 [1] top-level scope
      From worker 4:	   @ /buildworker/worker/tester_linuxaarch64/build/share/julia/stdlib/v1.6/Distributed/test/threads.jl:48
      From worker 4:	 [2] top-level scope
      From worker 4:	   @ /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1219
      From worker 4:	 [3] top-level scope
      From worker 4:	   @ /buildworker/worker/tester_linuxaarch64/build/share/julia/stdlib/v1.6/Distributed/test/threads.jl:32
      From worker 4:	 [4] top-level scope
      From worker 4:	   @ /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1219
      From worker 4:	 [5] top-level scope
      From worker 4:	   @ /buildworker/worker/tester_linuxaarch64/build/share/julia/stdlib/v1.6/Distributed/test/threads.jl:31
      From worker 4:	 [6] top-level scope
      From worker 4:	   @ /buildworker/worker/package_linuxaarch64/build/usr/share/julia/stdlib/v1.6/Test/src/Test.jl:1144
      From worker 4:	 [7] top-level scope
      From worker 4:	   @ /buildworker/worker/tester_linuxaarch64/build/share/julia/stdlib/v1.6/Distributed/test/threads.jl:29

Which definitely looks like a time out.

@vchuravy vchuravy merged commit 31026e8 into master Oct 22, 2020
@vchuravy vchuravy deleted the vc/threadsafe_distributed branch October 22, 2020 17:31
@vtjnash
Copy link
Member

vtjnash commented Oct 27, 2020

Still might be failing? https://build.julialang.org/#/builders/4/builds/5236

@vchuravy
Copy link
Member Author

  From worker 56:	error in running finalizer: ErrorException("concurrency violation detected")

yeah...., much rarer so this might be a case of the actual code change not being sufficient.

@jonas-schulze
Copy link
Contributor

Maybe we should replace any_gc_flag?

@vtjnash
Copy link
Member

vtjnash commented Oct 28, 2020

I don't know if it is much rarer or a different issue, but almost 25% of builds are crashing or segfaulting or failing in this test (ie. https://build.julialang.org/#/builders/52/builds/4884)

@vchuravy
Copy link
Member Author

vchuravy commented Oct 28, 2020

      From worker 56:	signal (11): Segmentation fault
      From worker 56:	in expression starting at none:0
      From worker 56:	*** Error in `/buildworker/worker/tester_linuxaarch64/build/bin/julia': free(): corrupted unsorted chunks: 0x000000000e0e13f0 ***
      From worker 56:	
      From worker 56:	signal (6): Aborted
      From worker 56:	in expression starting at none:0
      From worker 56:	gsignal at /lib/aarch64-linux-gnu/libc.so.6 (unknown line)
      From worker 56:	abort at /lib/aarch64-linux-gnu/libc.so.6 (unknown line)
      From worker 56:	Allocations: 13059132 (Pool: 13054874; Big: 4258); GC: 15

That looks like a peculiar aarch64 issue.

vtjnash added a commit that referenced this pull request Oct 30, 2020
vchuravy added a commit that referenced this pull request Nov 12, 2020
vchuravy added a commit that referenced this pull request Nov 16, 2020
vchuravy added a commit that referenced this pull request Nov 22, 2020
vchuravy added a commit that referenced this pull request Dec 4, 2020
vchuravy added a commit that referenced this pull request Dec 10, 2020
vchuravy added a commit that referenced this pull request Mar 24, 2021
Sacha0 pushed a commit that referenced this pull request Jun 22, 2021
JeffBezanson pushed a commit that referenced this pull request Jul 15, 2021
Keno pushed a commit that referenced this pull request Jun 5, 2024
Keno pushed a commit that referenced this pull request Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parallelism Parallel or distributed computation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Concurrency violation on interplay between Distributed and Base.Threads
4 participants