-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Moved locking to protect a read of a map in the router #15385
Moved locking to protect a read of a map in the router #15385
Conversation
The locking was not protecting a read, so a simultaneous write would crash the router. I made a bunch of new functions that implemented the functional part of the function without the locking, then made the locking functions acquire the lock and then call the internal part. Then in the rename, I moved the lock acquisition earlier and called the internal functions. In brief: re-jiggered the code so we could lock properly. Fixes bug 1473031 (https://bugzilla.redhat.com/show_bug.cgi?id=1473031)
@openshift/networking PTAL |
[test][testextended][extended: networking] |
Evaluated for origin testextended up to 0b305fb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Do you plan to remove the older functions entirely?
@rajatchopra the older functions can still be called, and thus, need to get the locks. This just makes lockless ones we can call from the update function. Update is handled with a delete and add. But it could not call the external delete and external add since it needs to hold a lock, and they acquire them. Or did I miss something in your queston? |
No. That was the question. Thanks for the answer. |
continuous-integration/openshift-jenkins/testextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended/889/) (Base Commit: d784563) (PR Branch Commit: 0b305fb) (Extended Tests: networking) |
// We have to call the internal form of functions after this | ||
// because we are holding the state lock. | ||
r.lock.Lock() | ||
defer r.lock.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is AddRoute called a lot in parallel? do we know what level of contention locking here is going to cause? is it worth making this a read lock, e.g.:
r.lock.RLock()
existingConfig, exists := r.state[backendKey]
r.lock.RUnlock()
if exists {
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't contend much... it's fed by the stream of events from the event queue and those are popped and handled one-by-one. The other access is when the router state is written out, and that only happens periodically.
BTW the current model is really appalling, we do the following:
- Pop event
- Lock the structure
- Update state
- Unlock the structure
- Call a rate-limited function to write state (default is no more often than 5s)
If it hasn't written state recently, it will:
- Lock the structure
- Write the conf file
- Reload haproxy
- Unlock the structure
On a system with lots of routes it can take 10+ seconds to reload haproxy... so we process one event and then reload... and do the same forever.
We have a card open to fix this. We should only hold the state lock while the state is being written out. But we need to not touch state again until the reload is complete... so we need to add another lock there. And make sure that we don't block the event processing, just the event writing, while the reload happens.
[test] flaked on #14385 (logs https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/3353/) |
Evaluated for origin test up to 0b305fb |
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/3386/) (Base Commit: 8833a3a) (PR Branch Commit: 0b305fb) |
LGTM |
[merge] |
Evaluated for origin merge up to 0b305fb |
Is there a 3.6 variant? |
@smarterclayton we had decided not to, but we could be talked into the backport. The code has been this way since at least 3.4, and we've only had the one crash reported, and that was recovered. |
Ok. Agree the current "one thing running" is worse.
|
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1345/) (Base Commit: 2ff8dfd) (PR Branch Commit: 0b305fb) (Image: devenv-rhel7_6477) |
The locking was not protecting a read, so a simultaneous write would
crash the router. I made a bunch of new functions that implemented
the functional part of the function without the locking, then made the
locking functions acquire the lock and then call the internal part.
Then in the rename, I moved the lock acquisition earlier and called
the internal functions.
In brief: re-jiggered the code so we could lock properly.
Fixes bug 1473031 (https://bugzilla.redhat.com/show_bug.cgi?id=1473031)