
fix: monitorUpstreamConnectionState CPU consumption #3884

Merged

merged 4 commits into main from elimt-fix-cpu-consumption on Oct 23, 2023

Conversation

@elimt (Member) commented on Oct 23, 2023

Bug found in #3881

`monitorUpstreamConnectionState()` is a goroutine that listens for GRPC client connection changes using GRPC's `WaitForStateChange` method.

`monitorUpstreamConnectionState()` is currently consuming a lot of CPU. It seems to be continuously running rather than actually waiting for state changes.

Reference for GRPC Connection State Transitions: https://grpc.github.io/grpc/core/md_doc_connectivity-semantics-and-api.html

The `state` variable was never getting updated, so `WaitForStateChange` was always passed the stale default state, which caused the loop not to actually wait for new state changes.
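
For illustration, the loop had roughly this shape before the fix. This is a sketch, not the exact Boundary source: the `connectionState` parameter type and the zero-value initialization of `state` are assumptions.

```go
package monitor // hypothetical package name, for illustration only

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// Rough sketch of the pre-fix loop. WaitForStateChange returns as soon as the
// connection's current state differs from the state passed in; because `state`
// is never reassigned below, any connection that has moved past Idle makes the
// call return immediately and the loop spins instead of blocking.
func monitorUpstreamConnectionState(ctx context.Context, cc *grpc.ClientConn, connectionState *atomic.Value) {
	var state connectivity.State // zero value is connectivity.Idle
	for {
		if !cc.WaitForStateChange(ctx, state) {
			return // context canceled
		}
		newState := cc.GetState()
		connectionState.Store(newState)
		// BUG: `state` still holds its stale value on the next iteration.
	}
}
```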

Before Fix

[image]

After Fix

[image]

`monitorUpstreamConnectionState()` is currently consuming a lot of CPU.

`monitorUpstreamConnectionState()` is a goroutine that listens for GRPC client connection changes using GRPC's `WaitForStateChange` method. It is continuously running and not actually waiting for state changes.

I suspect there are a lot of GRPC connection state transitions from `CONNECTING` to `TRANSIENT_FAILURE` and back to `CONNECTING`, which keeps triggering the continuously running loop.

Adding a slight sleep to the for loop stopped the high CPU consumption.
```diff
@@ -442,5 +442,7 @@ func monitorUpstreamConnectionState(ctx context.Context, cc *grpc.ClientConn, co
 	}

 	connectionState.Store(newState)
+
+	time.Sleep(10 * time.Millisecond)
```

Collaborator commented:

Is this intended to be a long term solution or just a hotfix that we can get out quickly to users? It feels like we need to spend more time digging into the source of the issue. Why is the state changing so often? It could indicate something about the application that's worrying.

Member Author (@elimt) commented:

Pushed another fix which should address the issue; took out the sleep change.

```diff
@@ -442,5 +442,7 @@ func monitorUpstreamConnectionState(ctx context.Context, cc *grpc.ClientConn, co
 	}

 	connectionState.Store(newState)
+
+	time.Sleep(10 * time.Millisecond)
```

Contributor commented:

The sleep is a band-aid that does not fix the bug, IMO.
`WaitForStateChange` should properly block on gRPC state changes if it is working correctly.
It appears that line 436 should be waiting on the value retrieved on line 437, but `state` never gets set to that value.

Member Author (@elimt) commented:

Updated the code so that `state` gets set to `newState`. That fixed the issue.
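
For clarity, the fix amounts to passing the most recently observed state into the next `WaitForStateChange` call. A sketch under the same assumptions as the earlier one (parameter type and names assumed, not the exact Boundary source):

```go
package monitor // hypothetical package name, for illustration only

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
)

// Sketch of the post-fix loop: the state handed to WaitForStateChange is always
// the one most recently observed, so the call blocks until gRPC actually
// reports a transition instead of returning immediately.
func monitorUpstreamConnectionState(ctx context.Context, cc *grpc.ClientConn, connectionState *atomic.Value) {
	state := cc.GetState()
	for {
		if !cc.WaitForStateChange(ctx, state) {
			return // context canceled
		}
		state = cc.GetState()        // carry the fresh state into the next wait
		connectionState.Store(state) // publish the latest state as before
	}
}
```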

@elimt force-pushed the elimt-fix-cpu-consumption branch from aba7ade to e51418d on October 23, 2023 at 22:00
@johanbrandhorst (Collaborator) left a comment:

Thanks! I expect it will be hard to test this, but if you find some way, please do!

@moduli (Collaborator) left a comment:

Change makes sense to me. Confirmed that running boundary dev against this branch does not show boundary running at 100% CPU.

@ajayreshc (Collaborator) left a comment:

LGTM; `state` never became `newState` even when it changed. If the zero value of `state` (`Idle`, I believe) didn't match `GetState()`, the result was a tight loop that never resolved itself, because it would always fall straight out of `WaitForStateChange()` without waiting for a new value.

EDIT: Ran the updated code and verified: CPU is < 1.0% on my system.
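
As a side note, the `WaitForStateChange` behavior described above can be seen with a small standalone program; the target address and timeout below are made up and nothing here is Boundary code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Dial a throwaway address; grpc.Dial is non-blocking, so this returns
	// immediately even though nothing is listening there.
	cc, err := grpc.Dial("localhost:9999", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer cc.Close()
	cc.Connect() // kick the connection out of Idle

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Once the connection has left Idle, every call that passes the stale Idle
	// value returns immediately -- the same tight spin the monitor loop hit.
	stale := connectivity.Idle
	for i := 0; i < 3; i++ {
		changed := cc.WaitForStateChange(ctx, stale)
		fmt.Printf("iteration %d: changed=%v, current state=%v\n", i, changed, cc.GetState())
	}
}
```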

@elimt merged commit 5741807 into main on Oct 23, 2023
54 checks passed
@elimt deleted the elimt-fix-cpu-consumption branch on October 23, 2023 at 23:35
elimt added a commit that referenced this pull request Oct 24, 2023
Update the changelog details for High CPU Utilization bug fix: #3884
elimt added a commit that referenced this pull request Oct 24, 2023
Update the changelog details for High CPU Utilization bug fix: #3884
elimt added a commit that referenced this pull request Oct 24, 2023
Update the changelog details for High CPU Utilization bug fix: #3884