
otelgrpc: fix goroutine leak after context cancellation #840

Merged
5 commits merged into open-telemetry:main on Jul 20, 2021

Conversation

glazychev-art
Contributor

This PR is related to this issue: #839

@linux-foundation-easycla

linux-foundation-easycla bot commented Jun 30, 2021

CLA Signed

The committers are authorized under a signed CLA.

@glazychev-art
Contributor Author

@vadimberezniker @tonistiigi
Hello. As far as I know, you have run into a similar problem. Could you please take a look as well?

glazychev-art marked this pull request as draft on June 30, 2021 15:59
			return
		}

	case <-s.Context().Done():
@vadimberezniker
Contributor

vadimberezniker commented Jun 30, 2021

Disclaimer: I'm not a maintainer on this project but I have been looking at this code recently.

If we trust the ClientStream.Context() docs then it's not safe to use the Context() call here:

	// Context returns the context for this stream.
	//
	// It should not be called until after Header or RecvMsg has returned. Once
	// called, subsequent client-side retries are disabled.
	Context() context.Context

Edit: I just double checked the gRPC code and Context() does have side-effects.

The alternative is to pass the parent context and check that.

Thanks for fixing this, I noticed this while fixing the other issue and this was next on my plate.

@tonistiigi
Contributor

Does <-s.Context().Done() return when the calling context is canceled, or when the stream is closed from the server side and Recv returns EOF? If the former, then I don't think this is correct and we should wait for the server side to close the stream; maybe calling RecvMsg in a goroutine is an option if there isn't anything better.

@vadimberezniker
Contributor

Does <-s.Context().Done() return when the calling context is canceled, or when the stream is closed from the server side and Recv returns EOF? If the former, then I don't think this is correct and we should wait for the server side to close the stream; maybe calling RecvMsg in a goroutine is an option if there isn't anything better.

If the client properly handles the stream, the ctx.Done() case should not be reached. That should be handled by the event for loop.
The issue @glazychev-art is trying to address is the case where the parent context is cancelled before the client is done with the stream.
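
To make the leak concrete, here is a minimal, self-contained sketch (plain channels and context only, not the actual interceptor code): a watcher goroutine that waits solely for a final stream event never exits when the caller cancels the parent context and abandons the stream.

	package main

	import (
		"context"
		"fmt"
		"runtime"
		"time"
	)

	// watchEvents mimics the pre-fix watcher goroutine: it only returns after
	// the final stream event, which is produced when the caller drains the
	// stream to io.EOF (or an error).
	func watchEvents(finished <-chan struct{}) {
		<-finished
	}

	func main() {
		ctx, cancel := context.WithCancel(context.Background())

		finished := make(chan struct{}) // would be signalled once the stream is drained
		go watchEvents(finished)

		// The caller cancels the parent context and walks away without reading
		// the stream until io.EOF, so `finished` is never signalled.
		cancel()
		<-ctx.Done()

		time.Sleep(100 * time.Millisecond)
		fmt.Println("goroutines still running:", runtime.NumGoroutine()) // the leaked watcher is among them
	}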

@tonistiigi
Contributor

@vadimberezniker I understand that. The canceling flow should be: 1) the context the client passed gets canceled, 2) the cancellation is passed to the server side, 3) the server closes either cleanly or with an error, 4) the client receives the close and the stream is finished. Cancelling the context is just a signal and doesn't atomically close the stream by itself. I'm not sure about the internals, so it may be possible that s.Context() only closes after 4); in that case, the change looks correct to me. But if it closes after 1), then I don't think this approach is correct.

@glazychev-art
Contributor Author

@vadimberezniker
Thank you. Updated to parent context
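
For readers following along, a minimal sketch of the approach the PR takes (hypothetical channel and field names, not the actual diff): the watcher goroutine selects on the parent ctx.Done() alongside the event channel, so it exits even if the caller never drains the stream.

	package main

	import (
		"context"
		"fmt"
		"time"
	)

	// streamEvent is a hypothetical stand-in for the events the interceptor records.
	type streamEvent struct {
		last bool
	}

	// watchEvents exits either when the final event arrives or when the parent
	// context passed by the caller is cancelled, so it cannot leak.
	func watchEvents(ctx context.Context, events <-chan streamEvent) {
		for {
			select {
			case ev := <-events:
				if ev.last {
					return
				}
			case <-ctx.Done():
				return
			}
		}
	}

	func main() {
		ctx, cancel := context.WithCancel(context.Background())
		events := make(chan streamEvent)

		done := make(chan struct{})
		go func() {
			watchEvents(ctx, events)
			close(done)
		}()

		cancel() // the caller gives up without draining the stream

		select {
		case <-done:
			fmt.Println("watcher exited after cancellation")
		case <-time.After(time.Second):
			fmt.Println("watcher leaked")
		}
	}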

@glazychev-art
Contributor Author

@tonistiigi
It is a good idea, but I'm not sure if this is possible. Let's take a look at:
https://github.com/grpc/grpc-go/blob/master/stream.go#L358-L372

		go func() {
			select {
			case <-cc.ctx.Done():
				cs.finish(ErrClientConnClosing)
			case <-ctx.Done():
				cs.finish(toRPCErr(ctx.Err()))
			}
		}()

ClientStream keeps track of the context closing itself and does not notify us about it. So, by the time we call RecvMsg, the stream may already be finished.

glazychev-art marked this pull request as ready for review on July 1, 2021 14:23
@vadimberezniker
Contributor

@vadimberezniker I understand that. The canceling flow should be: 1) the context the client passed gets canceled, 2) the cancellation is passed to the server side, 3) the server closes either cleanly or with an error, 4) the client receives the close and the stream is finished. Cancelling the context is just a signal and doesn't atomically close the stream by itself. I'm not sure about the internals, so it may be possible that s.Context() only closes after 4); in that case, the change looks correct to me. But if it closes after 1), then I don't think this approach is correct.

Based on my understanding, when the context is cancelled, gRPC automatically closes the underlying HTTP/2 stream, so there's no opportunity for the server to send anything after that.

@tonistiigi
Contributor

tonistiigi commented Jul 4, 2021

I did some tests and indeed the server-side graceful cancel does not happen as I described before. I had forgotten that in the code where I use it, I actually need to make two requests and only cancel one of them (which forces the other to close cleanly) to achieve this behavior.

But it is still possible that RecvMsg returns cleanly after the context has been canceled, if the server side sent the message before it received the cancel (after that, sends will fail). Looking at the interceptor, I think this probably just means that the events for the receive get dropped, which isn't catastrophic. Although I still somewhat consider it a caller-side issue if the client just cancels the context and doesn't Recv until an error after that.


Aneurysm9 requested a review from pellared as a code owner on July 13, 2021 17:26
@Aneurysm9
Member

The approach seems sane. Please fix the linting issue.

Signed-off-by: Artem Glazychev <[email protected]>
@glazychev-art
Contributor Author

@Aneurysm9
Thank you. Fixed

@codecov

codecov bot commented Jul 14, 2021

Codecov Report

Merging #840 (da5c1aa) into main (e2210a9) will increase coverage by 0.2%.
The diff coverage is 81.2%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main    #840     +/-   ##
=======================================
+ Coverage   79.2%   79.4%   +0.2%     
=======================================
  Files         62      62             
  Lines       2750    2755      +5     
=======================================
+ Hits        2178    2188     +10     
+ Misses       440     437      -3     
+ Partials     132     130      -2     
Impacted Files                                           Coverage Δ
...ion/google.golang.org/grpc/otelgrpc/interceptor.go    89.2% <81.2%> (+2.2%) ⬆️

@XSAM
Member

XSAM left a comment

Looks good overall.

Signed-off-by: Artem Glazychev <[email protected]>
@edwarnicke

@pellared Pardon my ignorance of the process... but what else has to be done before this fix can be merged?

@pellared
Member

@Aneurysm9 @MrAlias PTAL

MrAlias merged commit e8b0fa6 into open-telemetry:main on Jul 20, 2021
@edwarnicke

Yay!
