karmada-search: Fix lock race affects watch RestChan not close, causing client watch api to hang. #4212
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@            Coverage Diff             @@
##           master    #4212      +/-   ##
==========================================
- Coverage   52.78%   52.77%   -0.01%
==========================================
  Files         239      239
  Lines       23584    23585       +1
==========================================
  Hits        12448    12448
  Misses      10460    10460
- Partials      676      677       +1
376ff1b → 40922ea (force-pushed)
Thanks @xigang
@xigang Do you know how to reproduce it?
Since the result channel is not buffered, I can understand that the goroutine will be blocked when trying to send an event to it: karmada/pkg/search/proxy/store/util.go line 259 in d6e2881.
Can you remind me why it will cause a goroutine leak and a lock race?
Currently, our federated cluster has 18,000+ nodes (and the node count will keep growing). The client uses the
Reproduce log:
[client side] (log elided)
[server side] ServeHTTP: (log elided)
serveWatch: (log elided)
Finally, through the goroutine profile, it was found that a large number of goroutines were stuck waiting on the lock.
@ikaven1024 In the log, the hang does not appear on the fourth watch, but on the Nth. Because the server regularly closes the watch API connection, the client renews the watch. This prevents the client from always watching a single
The leak comes from here:
This can be tested with:

func TestName(t *testing.T) {
	// watcher from cacher
	cacheW := watch.NewFakeWithChanSize(1, false)

	// client starts watching but does not receive the event
	clientW := newWatchMux()
	clientW.AddSource(cacheW, func(event watch.Event) {})
	clientW.Start()

	// receive an object from cacher,
	// while the client does not consume it.
	cacheW.Add(&corev1.Pod{})
	time.Sleep(time.Second)

	// client closes.
	clientW.Stop()
	// DEADLOCK!!!
}
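For what it's worth, here is a minimal standalone sketch of the deadlock pattern I believe this test triggers, under the assumption (taken from the locked send shown later in this thread) that the source goroutine holds watchMux's read lock while blocked sending on the unbuffered result channel, and that Stop() takes the write lock before closing that channel. The names below are illustrative, not the karmada code itself:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var lock sync.RWMutex
	result := make(chan struct{}) // unbuffered, like watchMux's result channel

	// Source goroutine: holds the read lock while blocked on the send,
	// because the client never reads from result.
	go func() {
		lock.RLock()
		defer lock.RUnlock()
		result <- struct{}{}
	}()

	time.Sleep(100 * time.Millisecond) // let the sender grab the read lock first

	// Stop-like path: needs the write lock to close result, but the read lock
	// is still held by the blocked sender, so this never proceeds and the
	// result channel is never closed for the client.
	fmt.Println("stopping...")
	lock.Lock()
	close(result)
	lock.Unlock()
	fmt.Println("stopped") // never reached; the Go runtime reports
	// "all goroutines are asleep - deadlock!"
}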
Does the below code work, without adding a timeout? @xigang
@ikaven1024 Thanks for the analysis. Is there any problem in removing it? If
It's safe to read, write, and close a chan from multiple goroutines.
It will cause the client to break the watch and send a new watch request, increasing the burden on the server.
40922ea → 4eadcdf (force-pushed)
4eadcdf → 76ffd32 (force-pushed)
@ikaven1024 I think it's OK; I resubmitted the code. To avoid the exception, I added an additional
76ffd32 → cb289a2 (force-pushed)
pkg/search/proxy/store/util.go (Outdated)

 			select {
 			case <-w.done:
 				return
-			default:
-				w.result <- copyEvent
+			case w.result <- copyEvent:
Simplify it:

--- a/pkg/search/proxy/store/util.go
+++ b/pkg/search/proxy/store/util.go
@@ -246,19 +246,8 @@ func (w *watchMux) startWatchSource(source watch.Interface, decorator func(watch.Event)) {
 		select {
 		case <-w.done:
 			return
-		default:
+		case w.result <- copyEvent:
 		}
-
-		func() {
-			w.lock.RLock()
-			defer w.lock.RUnlock()
-			select {
-			case <-w.done:
-				return
-			default:
-				w.result <- copyEvent
-			}
-		}()
 	}
 }
@ikaven1024 Code updated. Asking for help: the unit test did not pass, and I couldn't find what went wrong. 😄
Which test fails? Would you show the error message?
pkg/search/proxy/store/util.go (Outdated)

@@ -225,6 +225,7 @@ func (w *watchMux) Stop() {
 }
 
 func (w *watchMux) startWatchSource(source watch.Interface, decorator func(watch.Event)) {
+	defer func() { _ = recover() }()
Add a log when it panics?
Yes. This avoids a panic from writing after w.result is closed.
This recover() suppresses panics forever, can you explain why we need to do that?

"This avoids writing panic after w.result is closed."

Can we eliminate this risk? My concern about the recover() is that it just suppresses issues, which might bring extra debugging effort.
"Yes, this avoids writing panic after w.result is closed."

I mean we should print a log when a panic occurs.
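For illustration, printing the recovered value instead of silently discarding it could look roughly like the sketch below. It assumes klog/v2 (which karmada uses elsewhere); the helper name is made up, and this is not the code proposed in this PR:

import "k8s.io/klog/v2"

// guarded runs fn and logs any panic instead of swallowing it silently.
func guarded(fn func()) {
	defer func() {
		if r := recover(); r != nil {
			klog.Errorf("watchMux source goroutine recovered from panic: %v", r)
		}
	}()
	fn()
}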
@RainbowMango The reason for adding recover() is that I am worried that multiple goroutines write to the w.result channel at the same time. When one goroutine closes w.result, other goroutines may continue to write to w.result, triggering a panic.
Write to a closed channel won't panic in a select case, like https://github.com/karmada-io/karmada/pull/4212/files#diff-688048f0ec4cf2d213f3bd0437bde2531b124c71d172319b20f25cd140fc238cR255.
cb289a2 → 249d571 (force-pushed)
We usually don't open another PR to replace the current in-review PR. I don't know which PR we should move forward with. @xigang What do you think?
@RainbowMango Agreed. In order to track this issue, let's continue fixing it in this PR.
Yes, we can focus on fixing this issue and keep the change minimal (so that we can easily backport it to release branches), and let #4221 focus on additional improvements. Does that make sense to you? @xigang @ikaven1024
@ikaven1024 Should we migrate the #4221 code to this PR? Or use a timeout to temporarily fix this issue?
You can migrate #4221 here, and I will close it.
f5fb699 → 640f6cc (force-pushed)
Done.
pkg/search/proxy/store/util.go (Outdated)

 	for _, source := range w.sources {
-		go w.startWatchSource(source.watcher, source.decorator)
+		source := source
"Reports declarations of variables that overlap with the declarations in the outer scope." (a warning from the GoLand IDE.) If you want to pin the variable, we can give it another name, or:
diff --git a/pkg/search/proxy/store/util.go b/pkg/search/proxy/store/util.go
index 00f93c467..dbb765959 100644
--- a/pkg/search/proxy/store/util.go
+++ b/pkg/search/proxy/store/util.go
@@ -196,8 +196,8 @@ func (w *watchMux) AddSource(watcher watch.Interface, decorator func(watch.Event))
 // Start run the watcher
 func (w *watchMux) Start() {
 	wg := sync.WaitGroup{}
-	for _, source := range w.sources {
-		source := source
+	for i := range w.sources {
+		source := w.sources[i]
 		wg.Add(1)
 		go func() {
 			defer wg.Done()
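As a side note on why the pinning matters at all (generic Go behavior before Go 1.22, not code from this PR): goroutines started in a loop that capture the loop variable directly may all observe its final value, while pinning it, or indexing as in the diff above, gives each iteration its own copy. A small self-contained example:

package main

import (
	"fmt"
	"sync"
)

func main() {
	values := []string{"a", "b", "c"}
	var wg sync.WaitGroup
	for i := range values {
		v := values[i] // pin: each goroutine captures its own copy
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println(v) // prints a, b, c (in some order)
		}()
	}
	wg.Wait()
}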
@RainbowMango Fixed. Please review again.
karmada-search: Fix lock race affects watch RestChan not close, causing client watch api to hang
Signed-off-by: xigang <[email protected]>
640f6cc → 9c9c52c (force-pushed)
/lgtm
/assign @ikaven1024
For the final decision.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ikaven1024
Hi @xigang, do you think we need to cherry-pick this patch to the previous branches?
@XiShanYongYe-Chang I think we should cherry-pick it to the previous branches; let me handle it.
Thanks~
cherry pick of #4212: Fix lock race affects watch RestChan not close, causing client watch api to hang.
cherry pick of #4212: Fix lock race affects watch RestChan not close, causing client watch api to hang.
cherry pick of #4212: Fix lock race affects watch RestChan not close, causing client watch api to hang.
What type of PR is this?
/kind bug
What this PR does / why we need it:
When the result channel of watchMux is blocked, it causes a goroutine leak and a lock race. This problem makes the client watch API hang when consuming watch.RestChan.

fetch goroutine stack:
Through the goroutine stack below, there are 1892 goroutines blocked in runtime_SemacquireMutex.

Refer to how cacheWatcher sends events to its input channel to implement adding a timeout:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher/cache_watcher.go#L216
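For reference, a timer-guarded send loosely following that cacheWatcher pattern might look like the sketch below. The function, parameter names, and the one-second budget are illustrative assumptions; the fix that was finally merged in this PR selects on w.done instead of using a timeout:

import (
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// trySend attempts to deliver ev to result, giving up if done is closed
// or the consumer does not read within one second.
func trySend(result chan<- watch.Event, done <-chan struct{}, ev watch.Event) bool {
	t := time.NewTimer(time.Second)
	defer t.Stop()

	select {
	case result <- ev:
		return true
	case <-done:
		return false
	case <-t.C:
		// The consumer is too slow; give up instead of blocking forever.
		return false
	}
}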
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
@ikaven1024 @RainbowMango @XiShanYongYe-Chang
Does this PR introduce a user-facing change?: