
pidfile lock waiting broken in docker containers #1261

Closed
vektah opened this issue Oct 11, 2017 · 9 comments


vektah commented Oct 11, 2017

Pids in docker containers always start at one, so if you run dep inside a docker container as part of a script the pids are very likely to collide between different containers.

The cache lock in source_manager.go uses the pid to detect and avoid a deadlock where the process is waiting for itself.

Unfortunately, this all results in a false-positive abort when running in docker.
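
To see the pid behaviour concretely, here is an illustrative check (not part of the repro; any image works, since the shell becomes pid 1 inside each container):

docker run --rm golang:1.9.1 sh -c 'echo "my pid is $$"'
docker run --rm golang:1.9.1 sh -c 'echo "my pid is $$"'

Both containers report pid 1, so two dep processes started in separate containers are very likely to record the same pid in a shared lock file.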

Steps to reproduce

Given directory containing Dockerfile

FROM golang:1.9.1

RUN go get -u github.com/golang/dep/cmd/dep

RUN mkdir /go/src/test
WORKDIR /go/src/test
ADD main.go .
RUN dep init

and main.go

package main

import _ "github.com/go-chi/chi"

func main() {}

run

docker build . -t test
mkdir cache
(docker run -v $PWD/cache:/go/pkg/dep test dep ensure -v &) && \
    docker run -v $PWD/cache:/go/pkg/dep test dep ensure -v 

expected output

One process would get the lock, the other would wait:

Gopkg.lock was already in sync with imports and Gopkg.toml
(1/1) Wrote github.com/go-chi/chi@…
Gopkg.lock was already in sync with imports and Gopkg.toml
(1/1) Wrote github.com/go-chi/chi@…

actual output

Gopkg.lock was already in sync with imports and Gopkg.toml
lockfile /go/pkg/dep/sm.lock already locked by this process
(1/1) Wrote github.com/go-chi/chi@…

Proposed solution

The deadlock detection is useful, but perhaps the process could track its own state? If dep knows it isn't holding the lock but the lock file exists with our pid, wait like normal.


sdboyer commented Oct 12, 2017

iiiinteresting, that is a particularly nasty interaction of things. (/cc @matjam).

perhaps the process could track it own state?

i don't think there's really a way we can do this portably (have it work correctly for both network filesystems AND docker, etc.). in general, pid files that are shared across different kernels, logical or temporal or otherwise, just aren't a good idea. that it's turning out to work that way in containers, too, is less than great.

for your purposes here, we've recently added an environment-based way for disabling the lock file in #1206 - set DEPNOLOCK=1 in the env, and no lock will be taken at all. it's not deadlocks we're protecting against here, but rather simultaneous access to shared disk state - and if there's no chance of simultaneous access, then you're fine disabling the lock.
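
in your repro above that would look something like this (again, only if there's no chance of two dep runs hitting the shared cache at the same time):

docker run -e DEPNOLOCK=1 -v $PWD/cache:/go/pkg/dep test dep ensure -v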


vektah commented Oct 12, 2017

I kind of want the opposite, DEPALWAYSLOCK=1. If the pid file exists, just wait a bit and see if it goes away.

The problem with the current implementation is that if the pidfile's pid == my pid it aborts, instead of waiting like it should.

i don't think there's really a way we can do this portably

// add a private global, probably needs a mutex too.
var isHoldingLock = false

if isHoldingLock && process.Pid == os.Getpid() {
    // error, already held by this process
}

// update isHoldingLock = true while the lock is held.


sdboyer commented Oct 12, 2017

// add a private global, probably needs a mutex too.
var isHoldingLock = false

we don't have private globals in a place where they could facilitate such a change - we treat gps as a library, and follow the associated good coding practices.

if isHoldingLock && process.Pid == os.Getpid() {
    // error, already held by this process
}

// update isHoldingLock = true while the lock is held.

we could maybe do equivalent workarounds in the main package, but...even if we did, or if we allowed a private global, this implementation would still be wrong on a network filesystem, where having the same pid is not a reliable indication that the old process is gone.

I kind of want the opposite, DEPALWAYSLOCK=1. If the pid file exists just wait a bit and see if it goes away.

again, this locking system has nothing to do with deadlocks, or anything in-process at all. it's a mechanism for preventing multiple processes from accessing the cache dir simultaneously, as that would have undefined results.


vektah commented Oct 12, 2017

Imagine removing these lines

if process.Pid == os.Getpid() {
    return nil, CouldNotCreateLockError{
        Path: glpath,
        Err:  fmt.Errorf("lockfile %s already locked by this process", glpath),
    }
}

All of the locking would still work, and multiple processes would wait for each other correctly. So why are those lines there if not to prevent a deadlock?


vektah commented Oct 12, 2017

Perhaps it should be more of a lock, and less of a pidfile. Pids don't make sense in docker / across network boundaries, but the lock does.

while lock file exists {
   sleep a bit
}

create lock file
defer delete lock file
do stuff with lock

the only non-portable code in github.com/nightlyone/lockfile is the part dealing with pids, and pids are the problem.
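
a rough sketch of what I mean in Go (illustrative only, not dep's implementation; lockPath stands in for dep's sm.lock path - the existence of the file is the lock, and O_EXCL makes creation atomic, so no pid is ever written or compared):

package main

import (
    "fmt"
    "os"
    "time"
)

// withLock runs fn while holding a simple existence-based file lock.
func withLock(lockPath string, fn func() error) error {
    for {
        f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0644)
        if err == nil {
            f.Close()
            break // we created the file, so we hold the lock
        }
        if !os.IsExist(err) {
            return fmt.Errorf("creating lock file %s: %v", lockPath, err)
        }
        time.Sleep(time.Second) // someone else holds it; wait a bit and retry
    }
    defer os.Remove(lockPath) // release the lock when done
    return fn()
}

func main() {
    err := withLock("/tmp/sm.lock", func() error {
        fmt.Println("doing stuff with the lock held")
        return nil
    })
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
}

(a real implementation would also need some way to clean up a lock left behind by a crashed process, which is part of why the pid is in the file today.)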


sdboyer commented Oct 12, 2017

So why are those lines there if not to prevent a deadlock?

the purpose of those lines is to provide an immediate return case, differentiated from the remainder of the logic (where it does a sleeping thing, much like what you've described), such that the caller can choose to take remedial action, rather than blocking.

i suppose you could see that particular subsegment as protecting against deadlocks? it's not impossible that you could interleave goroutines in that way - it's just, given the domain at hand here, sorta absurd.

If the pid file exists just wait a bit and see if it goes away.

like this?

// If it's a TemporaryError, we retry every second. Otherwise, we fail
// permanently.
//
// TODO: #534 needs to be implemented to provide a better way to log warnings,
// but until then we will just use stderr.

// Implicit Time of 0.
var lasttime time.Time
err = lockfile.TryLock()
for err != nil {
    nowtime := time.Now()
    duration := nowtime.Sub(lasttime)
    // The first time this is evaluated, duration will be very large as lasttime is 0.
    // Unless time travel is invented and someone travels back to the year 1, we should
    // be ok.
    if duration > 15*time.Second {
        fmt.Fprintf(os.Stderr, "waiting for lockfile %s: %s\n", glpath, err.Error())
        lasttime = nowtime
    }
    if t, ok := err.(interface {
        Temporary() bool
    }); ok && t.Temporary() {
        time.Sleep(time.Second * 1)
    } else {
        return nil, CouldNotCreateLockError{
            Path: glpath,
            Err:  errors.Wrapf(err, "unable to lock %s", glpath),
        }
    }
    err = lockfile.TryLock()
}

and yes, that file has always been intended to be a lock - that it's had pids conflated in with that isn't ideal. we're not terribly happy with the current implementation, anyway - see #1117.

i understand it may not be your preference, but DEPNOLOCK=1 still seems like the best option you have right now.


vektah commented Oct 12, 2017

It's good to hear there is some progress.

DEPNOLOCK might make sense if the cache weren't shared, but then you wouldn't have a problem anyway.

In this case (two containers accessing the same shared cache), correct locking semantics are important, otherwise you risk corruption...


sdboyer commented Oct 12, 2017

DEPNOLOCK might make sense if the cache weren't shared, but then you wouldn't have a problem anyway.

In this case (two containers accessing the same shared cache), correct locking semantics are important, otherwise you risk corruption...

i'm sorry, i misunderstood - i shouldn't pick up concurrency issues late at night 😛. OK, so your situation is multiple containers operating simultaneously, sharing a dep cache that's mounted into each container.

this is effectively the same as a network filesystem issue - multiple active kernels sharing a single disk. while we work on #1117, which should (i think) resolve your problem, you might put some simple hack in place - e.g., run a random number of processes before invoking dep in order to (very probably) get different pids.
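
for example, something along these lines in your script would probably do it (a crude sketch; assumes bash for $RANDOM):

# burn a random number of pids so containers started together diverge
for i in $(seq 1 $((RANDOM % 100 + 1))); do /bin/true; done
dep ensure -v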

@nawaitesidah

Running into the same issue, any news for this?
