
Moved locking to protect a read of a map in the router #15385

Conversation

@knobunc (Contributor) commented Jul 20, 2017

The locking was not protecting a read, so a simultaneous write would
crash the router. I split each affected function into an internal,
lockless part and a locking wrapper that acquires the lock and then
calls the internal part. Then, in the rename path, I moved the lock
acquisition earlier and called the internal functions directly.

In brief: re-jiggered the code so we can lock properly.

Fixes bug 1473031 (https://bugzilla.redhat.com/show_bug.cgi?id=1473031)

@knobunc knobunc added this to the 3.7.0 milestone Jul 20, 2017
@knobunc knobunc self-assigned this Jul 20, 2017
@knobunc (Contributor, Author) commented Jul 20, 2017

@openshift/networking PTAL

@knobunc (Contributor, Author) commented Jul 20, 2017

[test][testextended][extended: networking]

@openshift-bot (Contributor): Evaluated for origin testextended up to 0b305fb

@rajatchopra (Contributor) left a comment:

LGTM
Do you plan to remove the older functions entirely?

@knobunc (Contributor, Author) commented Jul 20, 2017

@rajatchopra the older functions can still be called, and thus need to take the locks. This change just adds lockless ones we can call from the update function. Update is handled as a delete plus an add, but it could not call the external delete and external add, since it needs to hold the lock and they acquire it themselves.

Or did I miss something in your question?

@rajatchopra (Contributor):

No. That was the question. Thanks for the answer.
This looks good to go.

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/testextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended/889/) (Base Commit: d784563) (PR Branch Commit: 0b305fb) (Extended Tests: networking)

// We have to call the internal form of functions after this
// because we are holding the state lock.
r.lock.Lock()
defer r.lock.Unlock()
A Contributor commented on the diff above:
is AddRoute called a lot in parallel? do we know what level of contention locking here is going to cause? is it worth making this a read lock, e.g.:

r.lock.RLock()
existingConfig, exists := r.state[backendKey]
r.lock.RUnlock()
if exists {
  ...

knobunc (Contributor, Author) replied:
It won't contend much... it's fed by the stream of events from the event queue and those are popped and handled one-by-one. The other access is when the router state is written out, and that only happens periodically.

BTW the current model is really appalling; we do the following:

  • Pop event
  • Lock the structure
  • Update state
  • Unlock the structure
  • Call a rate-limited function to write state (default is no more often than 5s)

If it hasn't written state recently, it will:

  • Lock the structure
  • Write the conf file
  • Reload haproxy
  • Unlock the structure

On a system with lots of routes it can take 10+ seconds to reload haproxy... so we process one event and then reload... and do the same forever.

We have a card open to fix this. We should only hold the state lock while the state is being written out. But we need to not touch state again until the reload is complete... so we need to add another lock there. And make sure that we don't block the event processing, just the event writing, while the reload happens.

@knobunc (Contributor, Author) commented Jul 21, 2017

@openshift-bot (Contributor): Evaluated for origin test up to 0b305fb

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/3386/) (Base Commit: 8833a3a) (PR Branch Commit: 0b305fb)

@imcsk8 (Contributor) commented Jul 21, 2017

LGTM

@knobunc (Contributor, Author) commented Jul 21, 2017

[merge]

@openshift-bot (Contributor): Evaluated for origin merge up to 0b305fb

@smarterclayton (Contributor):

Is there a 3.6 variant?

@knobunc (Contributor, Author) commented Jul 21, 2017

@smarterclayton we had decided not to, but we could be talked into the backport. The code has been this way since at least 3.4, and we've only had the one crash reported, and that was recovered.

@smarterclayton (Contributor) commented Jul 21, 2017 via email

@openshift-bot (Contributor) commented Jul 22, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1345/) (Base Commit: 2ff8dfd) (PR Branch Commit: 0b305fb) (Image: devenv-rhel7_6477)

@openshift-bot openshift-bot merged commit 61ca304 into openshift:master Jul 22, 2017