
sealing sched: Fix deadlock between worker.wndLk / workersLk #3489

Merged
merged 2 commits into from
Sep 2, 2020


magik6k
Contributor

@magik6k magik6k commented Sep 2, 2020

The deadlock could happen between sh.workersLk.RLock() in runWorker (sched.go:L577), which is called while holding worker.wndLk, and trySched in the main scheduler goroutine, which indirectly tries to acquire worker.wndLk through the task selector while holding sh.workersLk.RLock().

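Normally this doesn't manifest, because how could you possibly deadlock when two goroutines are only acquiring an RLock?

Well.

Throw a 3rd goroutine and write-preferring RWLocks into the mix and you get the deadlock: once a writer is waiting in Lock(), any new RLock() call blocks behind it, even though the lock is currently only read-held.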

main trySched goroutine:

goroutine 213 [semacquire, 245 minutes]:
sync.runtime_Semacquire(0xc01533f888)
	/usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc01533f880)
	/usr/local/go/src/sync/waitgroup.go:130 +0x64
github.com/filecoin-project/lotus/extern/sector-storage.(*scheduler).trySched(0xc0004ffcc0)
	/home/downloads/lotus/extern/sector-storage/sched.go:413 +0x338
github.com/filecoin-project/lotus/extern/sector-storage.(*scheduler).runSched(0xc0004ffcc0)
	/home/downloads/lotus/extern/sector-storage/sched.go:278 +0x48f
created by github.com/filecoin-project/lotus/extern/sector-storage.New
	/home/downloads/lotus/extern/sector-storage/manager.go:119 +0x64c

One of the trySched sub-goroutines, blocked on acquiring wndLk:

goroutine 4514747 [semacquire, 245 minutes]:
sync.runtime_SemacquireMutex(0xc022aaf254, 0xc01f42a200, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc022aaf250)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/filecoin-project/lotus/extern/sector-storage.(*workerHandle).utilization(0xc022aaf1e0, 0xc0231a9080)
	/home/downloads/lotus/extern/sector-storage/sched_resources.go:117 +0x44d
github.com/filecoin-project/lotus/extern/sector-storage.(*allocSelector).Cmp(0xc0552f37d0, 0x2dfd6a0, 0xc0231a9080, 0x29f6815, 0x13, 0xc022aaf1e0, 0xc021e3dd90, 0x581000, 0x0, 0x0)
	/home/downloads/lotus/extern/sector-storage/selector_alloc.go:62 +0x2b
github.com/filecoin-project/lotus/extern/sector-storage.(*scheduler).trySched.func1.3(0x24, 0x23, 0x0)
	/home/downloads/lotus/extern/sector-storage/sched.go:404 +0x212
sort.insertionSort_func(0xc0158aff28, 0xc011262280, 0x14, 0x28)
	/usr/local/go/src/sort/zfuncversion.go:12 +0xb1
sort.stable_func(0xc0158aff28, 0xc011262280, 0x4c)
	/usr/local/go/src/sort/zfuncversion.go:167 +0x51
sort.SliceStable(0x2659460, 0xc011262260, 0xc0158aff28)
	/usr/local/go/src/sort/slice.go:27 +0xcb
github.com/filecoin-project/lotus/extern/sector-storage.(*scheduler).trySched.func1(0xc01533f880, 0xc015ef28a0, 0xc0004ffcc0, 0xc02721e000, 0x58, 0x58, 0xc033a80000, 0xc1, 0xc1, 0xbb)
	/home/downloads/lotus/extern/sector-storage/sched.go:389 +0x7f1
created by github.com/filecoin-project/lotus/extern/sector-storage.(*scheduler).trySched
	/home/downloads/lotus/extern/sector-storage/sched.go:343 +0x306

runWorker goroutine:

goroutine 2269100 [semacquire, 245 minutes]:
sync.runtime_SemacquireMutex(0xc0004ffcd4, 0xc05fc5d800, 0x0)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).RLock(...)
	/usr/local/go/src/sync/rwmutex.go:50
github.com/filecoin-project/lotus/extern/sector-storage.(*scheduler).runWorker.func1(0xc0004ffcc0, 0x26, 0xc01ed8fb40)
	/home/downloads/lotus/extern/sector-storage/sched.go:577 +0xdaf
created by github.com/filecoin-project/lotus/extern/sector-storage.(*scheduler).runWorker
	/home/downloads/lotus/extern/sector-storage/sched.go:503 +0xa6

Some other goroutine (here from assignWorker) waiting for the write lock on sh.workersLk:

goroutine 1703990 [semacquire, 245 minutes]:
sync.runtime_SemacquireMutex(0xc0004ffccc, 0x2dfd600, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0004ffcc8)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0xc0004ffcc8)
	/usr/local/go/src/sync/rwmutex.go:98 +0x97
github.com/filecoin-project/lotus/extern/sector-storage.(*scheduler).assignWorker.func1(0xc012d9f0e0, 0xc01c3dae70, 0xc0004ffcc0, 0xe00000000, 0x1000000000, 0x1, 0x0, 0xa00000, 0xc01490bd40, 0x18)
	/home/downloads/lotus/extern/sector-storage/sched.go:686 +0x114
created by github.com/filecoin-project/lotus/extern/sector-storage.(*scheduler).assignWorker
	/home/downloads/lotus/extern/sector-storage/sched.go:684 +0x223

Fixes #3460

(I have to give a fair bit of credit to danstark on the Filecoin Slack for helping figure this out.)

@magik6k magik6k requested review from Kubuxu and arajasek September 2, 2020 14:52
@magik6k magik6k changed the title sealing sched: Fix deadleck witween worker.wndLk / workersLk sealing sched: Fix deadlock between worker.wndLk / workersLk Sep 2, 2020
@magik6k magik6k force-pushed the fix/sched-deadlocks branch from a05cd2f to 7fe8580 Compare September 2, 2020 15:06
@magik6k magik6k merged commit 5f79ff3 into master Sep 2, 2020
@magik6k magik6k deleted the fix/sched-deadlocks branch September 2, 2020 16:03
Development

Successfully merging this pull request may close these issues.

Scheduler deadlocking
2 participants