Fix/datarace when concurrent save the same file #836

Merged

Conversation

Contributor

@9547 9547 commented Oct 10, 2020

What problem does this PR solve?

In some cases, saving the same file concurrently can fail because of a race condition. For example, when deploying a cluster with TLS enabled, the ca.crt file is saved multiple times in parallel; in my local environment this left the ca.crt file empty.

root@control:~/.tiup/storage/cluster/clusters/test_scale_core_30845# ll config-cache/ca*
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:07.798975201Z.crt
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:08.057441637Z.crt
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:08.070494012Z.crt
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:08.448460096Z.crt
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:08.550741396Z.crt
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:08.719875556Z.crt
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:09.14992231Z.crt
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:09.312662983Z.crt
-rw-r--r-- 1 root root    0 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:09.315387179Z.crt
-rw-r--r-- 1 root root    0 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:09.401021401Z.crt
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:09.505369444Z.crt
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:09.614051511Z.crt
-rw-r--r-- 1 root root    0 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:09.671778643Z.crt
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:10.837351042Z.crt
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:10.924662339Z.crt
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:11.240826755Z.crt
-rw-r--r-- 1 root root    0 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:11.414010019Z.crt
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca-2020-10-10T13:54:11.921334897Z.crt
-rw-r--r-- 1 root root 1123 Oct 10 13:54 config-cache/ca.crt

What is changed and how it works?

Introduce a simple per-file lock so that writes to the same file are serialized.

Check List

Tests

  • Unit test
  • Integration test

Code changes

  • Has exported function/method change
  • Has exported variable/fields change
  • Has interface methods change
  • Has persistent data change

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release notes:

NONE

@codecov-io

codecov-io commented Oct 10, 2020

Codecov Report

Merging #836 into master will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #836      +/-   ##
==========================================
+ Coverage   52.94%   52.98%   +0.03%     
==========================================
  Files         261      261              
  Lines       18999    19005       +6     
==========================================
+ Hits        10059    10069      +10     
+ Misses       7384     7376       -8     
- Partials     1556     1560       +4     
Flag Coverage Δ
#cluster 44.82% <100.00%> (+0.05%) ⬆️
#dm 25.25% <100.00%> (+0.01%) ⬆️
#integrate 47.79% <100.00%> (+0.03%) ⬆️
#playground 22.17% <ø> (ø)
#tiup 10.78% <0.00%> (-0.01%) ⬇️
#unittest 20.79% <100.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/cluster/task/builder.go 97.10% <ø> (ø)
pkg/file/file.go 63.88% <100.00%> (+7.22%) ⬆️
pkg/cluster/api/pdapi.go 52.58% <0.00%> (-1.62%) ⬇️
pkg/cluster/spec/tikv.go 57.14% <0.00%> (+1.53%) ⬆️
pkg/utils/retry.go 47.36% <0.00%> (+15.78%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1a4fbe7...6959440.

@9547
Contributor Author

9547 commented Oct 11, 2020

@lonng PTAL

@lucklove lucklove added status/PTAL type/new-feature Categorizes pr as related to a new feature. labels Oct 11, 2020
@@ -25,6 +28,13 @@ func SaveFileWithBackup(path string, data []byte, backupDir string) error {
return errors.Errorf("%s is directory", path)
}

if _, ok := fileLocks[path]; !ok {
Member

The fileLocks map itself is not thread-safe, so I think the data race still exists, e.g.

# assume fileLocks is empty as the initial state.
# thread 1 starts
if _, ok := fileLocks["/path/a"]; !ok { # ok is false because fileLocks is empty
# thread 1 is preempted after the !ok check but before fileLocks[path] = &sync.Mutex{}

# thread 2 starts
if _, ok := fileLocks["/path/a"]; !ok { # ok is false because thread 1 hasn't assigned it yet
    fileLocks["/path/a"] = &sync.Mutex{} # assign a new mutex
}

fileLocks["/path/a"].Lock()
# thread 2 enters the critical section

# thread 1 resumes
fileLocks["/path/a"] = &sync.Mutex{} # overwrites the mutex thread 2 is holding
fileLocks["/path/a"].Lock() # succeeds because thread 1 replaced the mutex
# thread 1 enters the critical section

In this case, both threads end up inside the same critical section at once.

Contributor Author

Thanks for the suggestion; sorry, I forgot that the map needs to be locked for concurrent access.

@@ -10,7 +10,7 @@ GOARCH := $(if $(GOARCH),$(GOARCH),amd64)
GOENV := GO111MODULE=on CGO_ENABLED=0 GOOS=$(GOOS) GOARCH=$(GOARCH)
GO := $(GOENV) go
GOBUILD := $(GO) build $(BUILD_FLAG)
GOTEST := GO111MODULE=on CGO_ENABLED=1 $(GO) test -p 3
GOTEST := GO111MODULE=on CGO_ENABLED=1 go test -p 3
Contributor

It's better to revert this change.

Contributor Author

I want to run the tests with -race, which requires CGO to be enabled. It seems there is no effective way to set GO111MODULE=on CGO_ENABLED=1 in front of $(GO), because $(GO) already expands to $(GOENV), whose later CGO_ENABLED=0 assignment wins, e.g.

$ make race
TIUP_HOME=/root/.go/src/github.com/pingcap/tiup/tests/tiup GO111MODULE=on CGO_ENABLED=1 GO111MODULE=on CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go test -p 3 -race ./...  || { $(tools/bin/failpoint-ctl disable); exit 1; }
go test: -race requires cgo; enable cgo by setting CGO_ENABLED=1
make: *** [Makefile:97: race] Error 1

Contributor

@lonng lonng left a comment

Rest LGTM

@lonng lonng added this to the v1.2.1 milestone Oct 15, 2020
@9547 9547 force-pushed the fix/datarace-when-concurrent-save-the-same-file branch from 4e67051 to 96895d0 Compare October 15, 2020 14:36
Contributor

@lonng lonng left a comment

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Oct 16, 2020
@lonng lonng merged commit b8f201d into pingcap:master Oct 16, 2020
@9547 9547 deleted the fix/datarace-when-concurrent-save-the-same-file branch October 17, 2020 10:35
Labels
status/LGT1 Indicates that a PR has LGTM 1. type/new-feature Categorizes pr as related to a new feature.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants