Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add lock timeout to reduce lock failure cases #2287

Closed

Conversation

allencloud
Copy link
Collaborator

Signed-off-by: Allen Sun [email protected]

Ⅰ. Describe what this PR did

failed to xxx container xxxx: lock failed is a normal error when pouchd meets lots of concurrent request for the same container object, such as container.

In the original implementation, ctrd will try to get the lock for one container to continue the request's operation. While if the container's access(lock) is owned by others, then the request will not be blocked and will return a lock failure instead. This action has benefit and shortcomings as well. The benefit is that it could increase the real time of request even if it fails. The drawback is that it would depends on the client callers to retry the request.

To avoid this, we could take other ways like:

  • make the lock infinite blocked;
  • add a timeout when accessing the lock.

This pull request uses the second way to make it.

Besides the logic update for timeout(default is 5 seconds). I also changed the following things:

  • make the lock failure error to be more explicit, like container xxx is accessed by other request and please try again: lock failed;
  • remove the containerLock in ctrd since it could be replaced by kmutex pkg.

Ⅱ. Does this pull request fix one issue?

fix #2281

Ⅲ. Why don't you add test cases (unit test/integration test)? (你真的觉得不需要加测试吗?)

no need for this, I added more test cases for the common package knutex.

Ⅳ. Describe how to verify it

none

Ⅴ. Special notes for reviews

none

@codecov
Copy link

codecov bot commented Sep 29, 2018

Codecov Report

Merging #2287 into master will decrease coverage by 0.04%.
The diff coverage is 63.04%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2287      +/-   ##
==========================================
- Coverage   66.97%   66.93%   -0.05%     
==========================================
  Files         211      210       -1     
  Lines       17306    17298       -8     
==========================================
- Hits        11591    11578      -13     
- Misses       4312     4313       +1     
- Partials     1403     1407       +4
Flag Coverage Δ
#criv1alpha1test 32.32% <41.3%> (+0.25%) ⬆️
#criv1alpha2test 35.78% <41.3%> (+0.18%) ⬆️
#integrationtest 40.24% <52.17%> (+0.06%) ⬆️
#nodee2etest 33.11% <52.17%> (-0.01%) ⬇️
#unittest 23.35% <26.08%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pkg/kmutex/kmutex.go 100% <100%> (+7.81%) ⬆️
ctrd/utils.go 65.38% <100%> (+0.67%) ⬆️
ctrd/client.go 57.47% <100%> (+0.4%) ⬆️
ctrd/container.go 59.76% <37.03%> (+1.9%) ⬆️
cri/ocicni/cni_manager.go 65% <0%> (-15%) ⬇️
cri/v1alpha2/cri_wrapper.go 60% <0%> (-2.4%) ⬇️
cri/v1alpha2/cri.go 64.87% <0%> (-1.46%) ⬇️
cri/v1alpha2/cri_utils.go 90.28% <0%> (-0.29%) ⬇️
daemon/logger/jsonfile/utils.go 73.17% <0%> (+1.62%) ⬆️
daemon/mgr/container_stats.go 12.61% <0%> (+1.8%) ⬆️
... and 3 more

@pouchrobot pouchrobot added kind/bug This is bug report for project size/XL labels Sep 29, 2018
@allencloud allencloud force-pushed the fix-2281 branch 2 times, most recently from 1e822b7 to f8d49a0 Compare September 30, 2018 01:47
if ok && len(v.c) == 0 {
// filled up the chan to identify caller has released the lock.
v.c <- struct{}{}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better that free unused lockedKeys in this Unlock method than using a extra go routine?
https://github.com/alibaba/pouch/blob/f56461421b3a2148db94d825a5e715fb156e2972/pkg/kmutex/kmutex.go#L41-L52

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is a legacy function which is designed by @skoo87 .
Could you help explain a little bit more about the goroutine usage here?

@@ -5,7 +5,6 @@ import (
"testing"

"github.com/alibaba/pouch/pkg/errtypes"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add blank line here

return false
}
}

// LockWithTimeout trys to lock, if can't acquire the lock, will block util timeout.
// LockWithTimeout tries to lock.
// It can't acquire the lock, will block util timeout.
func (m *KMutex) LockWithTimeout(k string, to time.Duration) bool {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/LockWithTimeout/TrylockWithTimeout

select {
case <-v.c:
// the locker has released the lock.
return true
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this line, we think there is a bug there.
If this goroutine gets the lock and returns true, there is a very tiny possibility that the GC function would delete the corresponding locked key.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. These lock functions, such as Trylock, LockWithTimeout and Lock all have same problem that they didn't not get the lock after <-v.c.

select {
case <-v.c:
// the locker has released the lock.
return true
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this line, we think there is a bug there.
If this goroutine gets the lock and returns true, there is a very tiny possibility that the GC function would delete the corresponding locked key.

@allencloud
Copy link
Collaborator Author

Currently block this pull request. For the issue #2281 , we will try to make the code :

if !c.Trylock(id) {
		return nil, errtypes.ErrLockfailed			
}		

to be:

if !c.Trylock(id) {
      for i:= 0;i<3;i++{
            // try another 3 times.
     }			
}

select {
case <-v.c:
// the locker has released the lock.
return true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. These lock functions, such as Trylock, LockWithTimeout and Lock all have same problem that they didn't not get the lock after <-v.c.

@pouchrobot
Copy link
Collaborator

ping @allencloud
Conflict happens after merging a previous commit.
Please rebase the branch against master and push it back again. Thanks a lot.

@fuweid
Copy link
Contributor

fuweid commented Oct 18, 2018

this issue has been fixed by #2319

@fuweid fuweid closed this Oct 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
conflict/needs-rebase kind/bug This is bug report for project size/XL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failed to stop container: lock failed
5 participants