Improving thread safety of K8s informer DB #1118

Merged: 4 commits merged into grafana:main from improve-k8s-db on Sep 2, 2024

Conversation

mariomac (Contributor) commented:

Should minimize the impact of #1117

The Kubernetes database was micro-optimized (the "root of all evil") by splitting data access across different mutexes. That kept Go from complaining about race conditions while avoiding blocking the whole database each time a single element is accessed (e.g. services and pods info could be accessed simultaneously).

However, that led to actual race conditions when, for example, the Pod information was updated without locking access to its containers and PID namespaces.

The frequency of writes to the kube database is not really high, so controlling access through a single mutex should not have a noticeable impact on performance.
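To make the approach concrete, here is a minimal sketch of a single-mutex database (illustrative only, not the actual db.go: the PodInfo type, the second map, and the method bodies are simplified stand-ins; only Database, access, and podsByIP mirror names visible in the diff):

```go
package kube

import "sync"

// PodInfo is a reduced stand-in for the real pod metadata type.
type PodInfo struct {
	Name string
	IPs  []string
}

// Database keeps all Kubernetes metadata indexes behind a single mutex
// instead of one mutex per index, so related maps are always updated together.
type Database struct {
	access sync.Mutex // single lock guarding every map below

	podsByIP        map[string]*PodInfo
	podsByContainer map[string]*PodInfo // hypothetical second index
}

// UpdateNewPodsByIPIndex updates the IP index atomically with respect to any
// other reader or writer of the database.
func (id *Database) UpdateNewPodsByIPIndex(pod *PodInfo) {
	id.access.Lock()
	defer id.access.Unlock()
	for _, ip := range pod.IPs {
		id.podsByIP[ip] = pod
	}
}

// PodByIP looks up a pod under the same lock, so it never observes a
// half-updated set of indexes.
func (id *Database) PodByIP(ip string) *PodInfo {
	id.access.Lock()
	defer id.access.Unlock()
	return id.podsByIP[ip]
}
```

With a single mutex, a write that touches several related maps cannot interleave with a read that expects them to be consistent, which is the kind of race described above.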

codecov-commenter commented Aug 29, 2024

Codecov Report

Attention: Patch coverage is 98.07692% with 1 line in your changes missing coverage. Please review.

Project coverage is 81.99%. Comparing base (ae750f7) to head (7f7fca9).
Report is 1 commit behind head on main.

Files with missing lines: pkg/internal/transform/kube/db.go (patch coverage 98.07%, 1 line missing ⚠️)
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1118   +/-   ##
=======================================
  Coverage   81.99%   81.99%           
=======================================
  Files         140      140           
  Lines       11574    11570    -4     
=======================================
- Hits         9490     9487    -3     
+ Misses       1560     1558    -2     
- Partials      524      525    +1     
Flag Coverage Δ
integration-test 57.31% <0.00%> (+0.14%) ⬆️
k8s-integration-test 59.17% <86.53%> (-0.03%) ⬇️
oats-test 36.89% <0.00%> (+0.01%) ⬆️
unittests 52.31% <67.30%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.


@grcevski (Contributor) left a comment:

LGTM! Nice catch! I added a few remarks that I think should make things even safer.

		}
	}
}

func (id *Database) addProcess(ifp *container.Info) {
	id.deletePodCache(ifp.PIDNamespace)
	id.nsMut.Lock()
	id.access.Lock()
grcevski (Contributor):

maybe safer to do a defer unlock() here?
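For illustration, a small self-contained sketch of the Lock/defer Unlock pattern being suggested (the type and field names here are hypothetical, not the actual Beyla code):

```go
package kube

import "sync"

// processIndex is a reduced stand-in for the Database type in the diff above.
type processIndex struct {
	access           sync.Mutex
	fetchedPodsCache map[uint32]struct{} // illustrative cache keyed by PID namespace
}

// addProcess locks and immediately defers the unlock, so the mutex is
// released on every return path, including early returns and panics.
func (id *processIndex) addProcess(pidNamespace uint32) {
	id.access.Lock()
	defer id.access.Unlock()

	// ... update the process-related indexes under the lock ...
	delete(id.fetchedPodsCache, pidNamespace)
}
```

The deferred unlock costs one extra function call but removes the risk of a future early-return path leaving the mutex held.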

@@ -128,36 +119,28 @@ func (id *Database) AddProcess(pid uint32) {
}

func (id *Database) CleanProcessCaches(ns uint32) {
	id.access.Lock()
grcevski (Contributor):

Defer here too?

	for _, ip := range pod.IPInfo.IPs {
		id.podsByIP[ip] = pod
	}
	id.access.Lock()
grcevski (Contributor):

I think this can lead to a race too: we check the pod state before we lock. Let's wrap the whole function in a lock.

mariomac (Contributor, author):

In this case, it's not needed because the Mutex protects access to the maps, and the pod value is provided from outside the database, from a value that is not concurrently modified.

Anyway, let's move the lock outside the block to make it clearer, since pods should always have IPs.
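A small sketch of the two variants being discussed, with illustrative names rather than the actual db.go code:

```go
package kube

import "sync"

// PodMeta stands in for the real pod type; only the IPs field matters here.
type PodMeta struct {
	IPs []string
}

// ipIndex is a reduced stand-in for the Database type in the diff above.
type ipIndex struct {
	access   sync.Mutex
	podsByIP map[string]*PodMeta
}

// Variant under review: the len() check runs before the lock is taken. This is
// safe only because the pod value comes from outside the database and is not
// modified concurrently.
func (id *ipIndex) updateNewPodsByIPIndexCheckOutside(pod *PodMeta) {
	if len(pod.IPs) > 0 {
		id.access.Lock()
		defer id.access.Unlock()
		for _, ip := range pod.IPs {
			id.podsByIP[ip] = pod
		}
	}
}

// Agreed variant: the lock covers the whole function, which is easier to
// reason about and costs little since pods should always have IPs.
func (id *ipIndex) updateNewPodsByIPIndex(pod *PodMeta) {
	id.access.Lock()
	defer id.access.Unlock()
	for _, ip := range pod.IPs {
		id.podsByIP[ip] = pod
	}
}
```

Since Go's sync.Mutex is not re-entrant, widening the critical section like this stays safe only as long as the method does not call other methods that also take the same lock.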

	defer id.podsMut.Unlock()
	for _, ip := range pod.IPInfo.IPs {
		delete(id.podsByIP, ip)
	id.access.Lock()
grcevski (Contributor):

Same here, we should check the pod state inside the lock.

@@ -203,30 +209,28 @@ func (id *Database) UpdateNewServicesByIPIndex(svc *kube.ServiceInfo) {

func (id *Database) UpdateDeletedServicesByIPIndex(svc *kube.ServiceInfo) {
	if len(svc.IPInfo.IPs) > 0 {
grcevski (Contributor):

Check the pod state inside a lock.

	return id.podsByIP[ip]
}

func (id *Database) UpdateNewServicesByIPIndex(svc *kube.ServiceInfo) {
	if len(svc.IPInfo.IPs) > 0 {
		id.svcMut.Lock()
		defer id.svcMut.Unlock()
		id.access.Lock()
grcevski (Contributor):

also check the pod state inside the lock.

mariomac merged commit b71dc56 into grafana:main on Sep 2, 2024
6 checks passed
mariomac deleted the improve-k8s-db branch on September 2, 2024 at 08:42