pidfile lock waiting broken in docker containers #1261
iiiinteresting, that is a particularly nasty interaction of things. (/cc @matjam).
i don't think there's really a way we can do this portably (have it work correctly for both network filesystems AND docker, etc.). in general, pid files that are shared across different kernels, logical or temporal or otherwise, just aren't a good idea. that it's turning out to work that way in containers, too, is less than great. for your purposes here, we've recently added an environment-based way of disabling the lock file in #1206 - set DEPNOLOCK.
I kind of want the opposite, DEPALWAYSLOCK=1. If the pid file exists, just wait a bit and see if it goes away. The problem with the current implementation is that if the pidfile's pid == my pid, it aborts instead of waiting like it should. Something like:
```go
// add a private global, probably needs a mutex too.
var isHoldingLock = false

// (process here being the owner read back from the lock file)
if isHoldingLock && process.Pid == os.Getpid() {
	// error, already held by this process
}

// update isHoldingLock = true while the lock is held.
```
> // add a private global, probably needs a mutex too.
> var isHoldingLock = false

we don't have private globals in a place where they could facilitate such a change - we treat gps as a library, and follow the associated good coding practices.

> if isHoldingLock && process.Pid == os.Getpid() {
>     // error, already held by this process
> }
> // update isHoldingLock = true while the lock is held.

we could maybe do equivalent workarounds in the main package, but... even if we did, or if we allowed a private global, this implementation would still be wrong on a network filesystem, where having the same pid is not a reliable indication that the old process is gone.
again, this locking system has nothing to do with deadlocks, or anything in-process at all. it's a mechanism for preventing multiple processes from accessing the cache dir simultaneously, as that would have undefined results.
Imagine removing these lines: dep/internal/gps/source_manager.go, lines 233 to 238 in 3fd5bb3.

All of the locking would still work, and multiple processes would wait for each other correctly. So why are those lines there, if not to prevent a deadlock?
Perhaps it should be more of a lock, and less of a pidfile. Pids don't make sense in docker or across network boundaries, but the lock does.

The only non-portable code in github.com/nightlyone/lockfile is dealing with pids, and pids are the problem.
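One PID-free direction along those lines - purely an illustrative sketch, not something dep or github.com/nightlyone/lockfile does today - is an OS advisory lock such as flock(2), which the kernel releases automatically when the holder exits. That would cover the shared-volume docker case (all containers on a host share one kernel), though notably not network filesystems, and not Windows:

```go
// flock_sketch.go - hypothetical, Unix-only illustration of a PID-free
// advisory lock. The /shared-cache path is an assumption for the example.
package main

import (
	"fmt"
	"os"
	"syscall"
	"time"
)

func main() {
	// Open (or create) a lock file in the shared cache directory.
	f, err := os.OpenFile("/shared-cache/sm.lock", os.O_CREATE|os.O_RDWR, 0644)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open lock file:", err)
		os.Exit(1)
	}
	defer f.Close()

	// Block until the exclusive lock is acquired. No PID is ever written,
	// so identical PIDs in different containers cannot be confused.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		fmt.Fprintln(os.Stderr, "flock:", err)
		os.Exit(1)
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)

	fmt.Println("holding lock, doing work...")
	time.Sleep(5 * time.Second)
}
```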
the purpose of those lines is to provide an immediate return case, differentiated from the remainder of the logic (where it does a sleeping thing, much like what you've described), such that the caller can choose to take remedial action rather than blocking. i suppose you could see that particular subsegment as protecting against deadlocks? it's not impossible that you could interleave goroutines in that way - it's just, given the domain at hand here, sorta absurd.
like this? dep/internal/gps/source_manager.go, lines 244 to 276 in 3fd5bb3

and yes, that file has always been intended to be a lock - that it's had pids conflated in with that isn't ideal. we're not terribly happy with the current implementation anyway - see #1117. i understand it may not be your preference, but...
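For readers without the permalink handy, the general shape under discussion - an immediate error when the recorded owner is this very process, and a sleep-and-retry loop otherwise - looks roughly like the following. This is a generic sketch around github.com/nightlyone/lockfile, not dep's actual source, and the path is made up:

```go
package main

import (
	"fmt"
	"os"
	"time"

	"github.com/nightlyone/lockfile"
)

// acquire returns immediately with an error if this process already appears
// to own the lock, and otherwise sleeps and retries until the lock is free.
func acquire(lf lockfile.Lockfile) error {
	if owner, err := lf.GetOwner(); err == nil && owner.Pid == os.Getpid() {
		// Immediate-return case: let the caller decide instead of blocking.
		return fmt.Errorf("lock %s already held by this process", string(lf))
	}
	for {
		if err := lf.TryLock(); err == nil {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
}

func main() {
	lf, err := lockfile.New("/shared-cache/sm.lock") // hypothetical path
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := acquire(lf); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer lf.Unlock()
	fmt.Println("lock acquired")
}
```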
It's good to hear there is some progress. DEPNOLOCK might make sense if the cache wasn't shared, but then you wouldn't have a problem anyway. In this case (two containers accessing the same shared cache) correct locking semantics are important, otherwise you risk corruption...
i'm sorry, i misunderstood - i shouldn't pick up concurrency issues late at night 😛. OK, so your situation is multiple containers operating simultaneously, sharing a dep cache that's mounted into each container. this is effectively the same as a network filesystem issue - multiple active kernels sharing a single disk. while we work on #1117, which should (i think) resolve your problem, you might put some simple hack in place - e.g., run a random number of processes before invoking dep in order to (very probably) get different pids.
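A concrete (and admittedly crude) version of that hack - my own sketch, not an endorsed workaround - is a small wrapper that burns a random number of PIDs before handing off to dep, so concurrent containers very probably invoke dep under different PIDs:

```go
// pidshift.go - hypothetical wrapper that advances the container's PID
// counter by a random amount before running dep, so two containers are
// unlikely to run dep under the same PID.
package main

import (
	"math/rand"
	"os"
	"os/exec"
	"time"
)

func main() {
	rand.Seed(time.Now().UnixNano())

	// Spawn and reap a random number of trivial processes; each one
	// consumes a PID inside this container's PID namespace.
	for i := 0; i < rand.Intn(200); i++ {
		_ = exec.Command("/bin/true").Run()
	}

	// Now invoke dep as usual, forwarding arguments and stdio.
	cmd := exec.Command("dep", os.Args[1:]...)
	cmd.Stdout, cmd.Stderr, cmd.Stdin = os.Stdout, os.Stderr, os.Stdin
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```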
Running into the same issue, any news on this?
PIDs in docker containers always start at one, so if you run dep inside a docker container as part of a script, the PIDs are very likely to collide between different containers.

The cache lock in source_manager.go uses the PID to detect and avoid a deadlock where the process would be waiting for itself.

Unfortunately, this results in a false-positive abort when running in docker.
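To see the premise directly (an illustrative snippet, not part of the original report), a program that prints its own PID reports the same small number in every freshly started container:

```go
// Prints this process's PID; inside a fresh container's PID namespace the
// value is small (often 1) and tends to repeat across container runs.
package main

import (
	"fmt"
	"os"
)

func main() {
	fmt.Println("my pid:", os.Getpid())
}
```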
Steps to reproduce

Given a directory containing a Dockerfile and a main.go, build the image and run two containers concurrently against the same shared cache.

Expected output

One process would get the lock, the other would wait.

Actual output

dep aborts immediately with a lock error - the false positive described above.
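As a hypothetical stand-in for the reporter's main.go (the /cache path and the logic are my assumptions, not dep's code), the following stripped-down program mimics the problematic pidfile check using only the standard library; run it in two containers sharing /cache as a volume and the second one aborts, because both run as PID 1:

```go
// pidlock_repro.go - hypothetical stand-in demonstrating the mechanism.
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

const pidfile = "/cache/sm.lock" // assumed shared volume path

func main() {
	for {
		data, err := os.ReadFile(pidfile)
		if err != nil {
			// No lock yet: claim it by writing our PID, do "work", release.
			if err := os.WriteFile(pidfile, []byte(strconv.Itoa(os.Getpid())), 0644); err != nil {
				fmt.Fprintln(os.Stderr, "write pidfile:", err)
				os.Exit(1)
			}
			fmt.Println("acquired lock as pid", os.Getpid())
			time.Sleep(10 * time.Second)
			os.Remove(pidfile)
			return
		}

		owner, _ := strconv.Atoi(string(data))
		if owner == os.Getpid() {
			// The problematic branch: in separate containers the PIDs
			// collide, so this aborts instead of waiting.
			fmt.Fprintln(os.Stderr, "lockfile already owned by my pid - aborting")
			os.Exit(1)
		}

		// Different PID: wait and retry, as one would expect.
		time.Sleep(time.Second)
	}
}
```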
Proposed solution
The deadlock detection is useful, but perhaps the process could track its own state? If dep knows it isn't holding the lock, but the lock file exists with our PID, it should wait like normal. See the sketch below.
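A minimal sketch of that proposal (hypothetical names, not dep's API): keep a package-level flag, guarded by a mutex, that records whether this process currently holds the cache lock, and only treat a matching PID in the lock file as a self-deadlock when that flag is set:

```go
// Package lockstate is a hypothetical illustration of in-process lock
// tracking; it is not part of dep or gps.
package lockstate

import (
	"os"
	"sync"
)

var (
	mu          sync.Mutex
	holdingLock bool
)

// ShouldWait reports whether a lock file owned by ownerPid means "wait and
// retry" (true) or "this process already holds it, abort" (false).
func ShouldWait(ownerPid int) bool {
	mu.Lock()
	defer mu.Unlock()
	if holdingLock && ownerPid == os.Getpid() {
		return false // genuine self-deadlock within this process
	}
	return true // PID collision from another container, or another process: wait
}

// MarkHeld records whether this process currently holds the lock.
func MarkHeld(held bool) {
	mu.Lock()
	holdingLock = held
	mu.Unlock()
}
```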