
pidfile lock waiting broken in docker containers #1261

Closed
vektah opened this issue Oct 11, 2017 · 9 comments


vektah commented Oct 11, 2017

Pids in docker containers always start at one, so if you run dep inside a docker container as part of a script the pids are very likely to collide between different containers.

The cache lock in source_manager.go uses the pid to detect and avoid a deadlock where the process is waiting for itself.

Unfortunately, this all results in a false-positive abort when running in docker.
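
To see the pid behaviour concretely, here is an illustrative check (not part of the repro; any image works, since the shell becomes pid 1 inside each container):

docker run --rm golang:1.9.1 sh -c 'echo "my pid is $$"'
docker run --rm golang:1.9.1 sh -c 'echo "my pid is $$"'

Both containers report pid 1, so two dep processes started in separate containers are very likely to record the same pid in a shared lock file.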

Steps to reproduce

Given directory containing Dockerfile

FROM golang:1.9.1

RUN go get -u github.com/golang/dep/cmd/dep

RUN mkdir /go/src/test
WORKDIR /go/src/test
ADD main.go .
RUN dep init

and main.go

package main

import _ "github.com/go-chi/chi"

func main() {}

run

docker build . -t test
mkdir cache
(docker run -v $PWD/cache:/go/pkg/dep test dep ensure -v &) && \
    docker run -v $PWD/cache:/go/pkg/dep test dep ensure -v 

expected output

One process would get the lock, the other would wait:

Gopkg.lock was already in sync with imports and Gopkg.toml
(1/1) Wrote github.com/go-chi/chi@…
Gopkg.lock was already in sync with imports and Gopkg.toml
(1/1) Wrote github.com/go-chi/chi@…

actual output

Gopkg.lock was already in sync with imports and Gopkg.toml
lockfile /go/pkg/dep/sm.lock already locked by this process
(1/1) Wrote github.com/go-chi/chi@…

Proposed solution

The deadlock detection is useful, but perhaps the process could track its own state? If dep knows it isn't holding the lock but the lock file exists with our pid, wait like normal.


sdboyer commented Oct 12, 2017

iiiinteresting, that is a particularly nasty interaction of things. (/cc @matjam).

perhaps the process could track it own state?

i don't think there's really a way we can do this portably (have it work correctly for both network filesystems AND docker, etc.). in general, pid files that are shared across different kernels, logical or temporal or otherwise, just aren't a good idea. that it's turning out to work that way in containers, too, is less than great.

for your purposes here, we've recently added an environment-based way for disabling the lock file in #1206 - set DEPNOLOCK=1 in the env, and no lock will be taken at all. it's not deadlocks we're protecting against here, but rather simultaneous access to shared disk state - and if there's no chance of simultaneous access, then you're fine disabling the lock.
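
in your repro above that would look something like this (again, only if there's no chance of two dep runs hitting the shared cache at the same time):

docker run -e DEPNOLOCK=1 -v $PWD/cache:/go/pkg/dep test dep ensure -v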


vektah commented Oct 12, 2017

I kind of want the opposite, DEPALWAYSLOCK=1. If the pid file exists, just wait a bit and see if it goes away.

The problem with the current implementation is that if the pidfile's pid == my pid it aborts, instead of waiting like it should.

i don't think there's really a way we can do this portably

// add a private global, probably needs a mutex too.
var isHoldingLock = false

if isHoldingLock && process.Pid == os.Getpid() {
    // error, already held by this process
}

// update isHoldingLock = true while the lock is held.


sdboyer commented Oct 12, 2017

// add a private global, probably needs a mutex too.
var isHoldingLock = false

we don't have private globals in a place where they could facilitate such a change - we treat gps as a library, and follow the associated good coding practices.

if isHoldingLock && process.Pid == os.Getpid() {
    // error, already held by this process
}

// update isHoldingLock = true while the lock is held.

we could maybe do equivalent workarounds in the main package, but...even if we did, or if we allowed a private global, this implementation would still be wrong on a network filesystem, where having the same pid is not a reliable indication that the old process is gone.

I kind of want the opposite, DEPALWAYSLOCK=1. If the pid file exists just wait a bit and see if it goes away.

again, this locking system has nothing to do with deadlocks, or anything in-process at all. it's a mechanism for preventing multiple processes from accessing the cache dir simultaneously, as that would have undefined results.


vektah commented Oct 12, 2017

Imagine removing these lines

if process.Pid == os.Getpid() {
    return nil, CouldNotCreateLockError{
        Path: glpath,
        Err:  fmt.Errorf("lockfile %s already locked by this process", glpath),
    }
}

All of the locking would still work, and multiple processes would wait for each other correctly. So why are those lines there if not to prevent a deadlock?


vektah commented Oct 12, 2017

Perhaps it should be more of a lock, and less of a pidfile. Pids don't make sense in docker / across network boundaries, but the lock does.

while lock file exists {
   sleep a bit
}

create lock file
defer delete lock file
do stuff with lock

the only non-portable code in github.com/nightlyone/lockfile is the part dealing with pids, and pids are the problem.
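
a rough sketch of what I mean in Go (illustrative only, not dep's implementation; lockPath stands in for dep's sm.lock path - the existence of the file is the lock, and O_EXCL makes creation atomic, so no pid is ever written or compared):

package main

import (
    "fmt"
    "os"
    "time"
)

// withLock runs fn while holding a simple existence-based file lock.
func withLock(lockPath string, fn func() error) error {
    for {
        f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0644)
        if err == nil {
            f.Close()
            break // we created the file, so we hold the lock
        }
        if !os.IsExist(err) {
            return fmt.Errorf("creating lock file %s: %v", lockPath, err)
        }
        time.Sleep(time.Second) // someone else holds it; wait a bit and retry
    }
    defer os.Remove(lockPath) // release the lock when done
    return fn()
}

func main() {
    err := withLock("/tmp/sm.lock", func() error {
        fmt.Println("doing stuff with the lock held")
        return nil
    })
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
}

(a real implementation would also need some way to clean up a lock left behind by a crashed process, which is part of why the pid is in the file today.)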


sdboyer commented Oct 12, 2017

So why are those lines there if not to prevent a deadlock?

the purpose of those lines is to provide an immediate return case, differentiated from the remainder of the logic (where it does a sleeping thing, much like what you've described), such that the caller can choose to take remedial action, rather than blocking.

i suppose you could see that particular subsegment as protecting against deadlocks? it's not impossible that you could interleave goroutines in that way - it's just, given the domain at hand here, sorta absurd.

If the pid file exists just wait a bit and see if it goes away.

like this?

// If it's a TemporaryError, we retry every second. Otherwise, we fail
// permanently.
//
// TODO: #534 needs to be implemented to provide a better way to log warnings,
// but until then we will just use stderr.

// Implicit Time of 0.
var lasttime time.Time
err = lockfile.TryLock()
for err != nil {
    nowtime := time.Now()
    duration := nowtime.Sub(lasttime)
    // The first time this is evaluated, duration will be very large as lasttime is 0.
    // Unless time travel is invented and someone travels back to the year 1, we should
    // be ok.
    if duration > 15*time.Second {
        fmt.Fprintf(os.Stderr, "waiting for lockfile %s: %s\n", glpath, err.Error())
        lasttime = nowtime
    }
    if t, ok := err.(interface {
        Temporary() bool
    }); ok && t.Temporary() {
        time.Sleep(time.Second * 1)
    } else {
        return nil, CouldNotCreateLockError{
            Path: glpath,
            Err:  errors.Wrapf(err, "unable to lock %s", glpath),
        }
    }
    err = lockfile.TryLock()
}

and yes, that file has always been intended to be a lock - that it's had pids conflated in with that isn't ideal. we're not terribly happy with the current implementation, anyway - see #1117.

i understand it may not be your preference, but DEPNOLOCK=1 still seems like the best option you have right now.


vektah commented Oct 12, 2017

It's good to hear there is some progress.

DEPNOLOCK might make sense if the cache weren't shared, but then you wouldn't have a problem anyway.

In this case (two containers accessing the same shared cache), correct locking semantics are important, otherwise you risk corruption...


sdboyer commented Oct 12, 2017

DEPNOLOCK might make sense if the cache weren't shared, but then you wouldn't have a problem anyway.

In this case (two containers accessing the same shared cache), correct locking semantics are important, otherwise you risk corruption...

i'm sorry, i misunderstood - i shouldn't pick up concurrency issues late at night 😛. OK, so your situation is multiple containers operating simultaneously, sharing a dep cache that's mounted into each container.

this is effectively the same as a network filesystem issue - multiple active kernels sharing a single disk. while we work on #1117, which should (i think) resolve your problem, you might put some simple hack in place - e.g., run a random number of processes before invoking dep in order to (very probably) get different pids.
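
for example, something along these lines in your script would probably do it (a crude sketch; assumes bash for $RANDOM):

# burn a random number of pids so containers started together diverge
for i in $(seq 1 $((RANDOM % 100 + 1))); do /bin/true; done
dep ensure -v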

@nawaitesidah

Running into the same issue, any news for this?
