Moved locking to protect a read of a map in the router #15385

knobunc · 2017-07-20T20:28:48Z

The locking was not protecting a read, so a simultaneous write would
crash the router. I split each affected function in two: a new internal
function that does the actual work without locking, and the original
function, which now acquires the lock and then calls the internal one.
Then in the rename, I moved the lock acquisition earlier and called
the internal functions.

In brief: re-jiggered the code so we could lock properly.

Fixes bug 1473031 (https://bugzilla.redhat.com/show_bug.cgi?id=1473031)
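
For readers who want to see the shape of this refactor, here is a minimal sketch of the pattern described above. It is not the code from this PR: the `router` type, its fields, and the method names (`AddRoute`, `RemoveRoute`, `RenameRoute`, and the `...Internal` variants) are hypothetical stand-ins chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// router is a hypothetical stand-in for the template router; only the
// locking pattern matters here.
type router struct {
	lock  sync.Mutex
	state map[string]string
}

// AddRoute is the public form: it takes the lock, then delegates.
func (r *router) AddRoute(key, host string) {
	r.lock.Lock()
	defer r.lock.Unlock()
	r.addRouteInternal(key, host)
}

// RemoveRoute is the public form: it takes the lock, then delegates.
func (r *router) RemoveRoute(key string) {
	r.lock.Lock()
	defer r.lock.Unlock()
	r.removeRouteInternal(key)
}

// addRouteInternal assumes the caller already holds r.lock.
func (r *router) addRouteInternal(key, host string) {
	r.state[key] = host
}

// removeRouteInternal assumes the caller already holds r.lock.
func (r *router) removeRouteInternal(key string) {
	delete(r.state, key)
}

// RenameRoute takes the lock before the map read, then uses only the
// internal forms so it never tries to re-acquire the lock it holds.
func (r *router) RenameRoute(oldKey, newKey string) bool {
	r.lock.Lock()
	defer r.lock.Unlock()
	host, exists := r.state[oldKey] // the read is now protected
	if !exists {
		return false
	}
	r.removeRouteInternal(oldKey)
	r.addRouteInternal(newKey, host)
	return true
}

func main() {
	r := &router{state: map[string]string{}}
	r.AddRoute("ns/a", "a.example.com")
	fmt.Println(r.RenameRoute("ns/a", "ns/b"), r.state["ns/b"])
}
```

The internal forms exist because Go's `sync.Mutex` is not reentrant: if `RenameRoute` called `AddRoute` while holding `r.lock`, it would block forever.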

knobunc · 2017-07-20T20:31:11Z

@openshift/networking PTAL

knobunc · 2017-07-20T20:31:27Z

[test][testextended][extended: networking]

openshift-bot · 2017-07-20T20:32:18Z

Evaluated for origin testextended up to 0b305fb

rajatchopra

LGTM
Do you plan to remove the older functions entirely?

knobunc · 2017-07-20T20:34:12Z

@rajatchopra the older functions can still be called, and thus need to take the lock. This just adds lockless variants that we can call from the update function. Update is handled as a delete followed by an add, but it cannot call the external delete and add because it already holds the lock, and they would try to acquire it again.

Or did I miss something in your question?

rajatchopra · 2017-07-20T20:46:51Z

No. That was the question. Thanks for the answer.
This looks good to go.

openshift-bot · 2017-07-20T22:08:16Z

continuous-integration/openshift-jenkins/testextended SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_extended/889/) (Base Commit: d784563) (PR Branch Commit: 0b305fb) (Extended Tests: networking)

liggitt · 2017-07-21T06:01:01Z

pkg/router/template/router.go

+	// We have to call the internal form of functions after this
+	// because we are holding the state lock.
+	r.lock.Lock()
+	defer r.lock.Unlock()


is AddRoute called a lot in parallel? do we know what level of contention locking here is going to cause? is it worth making this a read lock, e.g.:

r.lock.RLock()
existingConfig, exists := r.state[backendKey]
r.lock.RUnlock()
if exists { ...
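
To make the read-lock suggestion concrete, here is a small self-contained sketch. It assumes the lock is a `sync.RWMutex`, which this thread does not confirm; the `routerState` type and the `lookup` and `set` helpers are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// routerState is hypothetical; it assumes the lock is a sync.RWMutex.
type routerState struct {
	lock  sync.RWMutex
	state map[string]string
}

// lookup takes only the read lock, so concurrent lookups do not block
// each other, while writers still exclude readers.
func (r *routerState) lookup(backendKey string) (string, bool) {
	r.lock.RLock()
	defer r.lock.RUnlock()
	existingConfig, exists := r.state[backendKey]
	return existingConfig, exists
}

// set takes the write lock, which excludes readers and other writers.
func (r *routerState) set(backendKey, config string) {
	r.lock.Lock()
	defer r.lock.Unlock()
	r.state[backendKey] = config
}

func main() {
	r := &routerState{state: map[string]string{}}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			key := fmt.Sprintf("ns/route-%d", i)
			r.set(key, "cfg")
			if _, ok := r.lookup(key); !ok {
				panic("missing key after set")
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(len(r.state))
}
```

One trade-off to keep in mind: once the read lock is released, another goroutine can change the map before the caller acts on `exists`, so any follow-up write has to re-check under the write lock or tolerate a stale answer.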

knobunc

It won't contend much: AddRoute is fed by the stream of events from the event queue, and those are popped and handled one by one. The only other access is when the router state is written out, and that happens only periodically.

BTW the current model is really appalling, we do the following:

1. Pop event
2. Lock the structure
3. Update state
4. Unlock the structure
5. Call a rate-limited function to write state (default is no more often than 5s)
6. If it hasn't written state recently, it will:
   1. Lock the structure
   2. Write the conf file
   3. Reload haproxy
   4. Unlock the structure

On a system with lots of routes it can take 10+ seconds to reload haproxy... so we process one event and then reload... and do the same forever.

We have a card open to fix this. We should only hold the state lock while the state is being written out. But we need to not touch state again until the reload is complete... so we need to add another lock there. And make sure that we don't block the event processing, just the event writing, while the reload happens.
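
One possible reading of that proposal, sketched under assumptions: a state lock held only long enough to snapshot the routes, plus a second lock that serializes the write-and-reload step. Everything here is hypothetical (the `router` type, `stateLock`, `reloadLock`, `handleEvent`, `commit`), and the `reload` callback merely stands in for writing the conf file and reloading haproxy.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// router is a hypothetical model of the proposed two-lock design.
type router struct {
	stateLock  sync.Mutex // guards state only
	state      map[string]string
	reloadLock sync.Mutex // serializes write-and-reload
	reloads    int        // guarded by reloadLock
}

// handleEvent updates state; it never waits on a reload in progress.
func (r *router) handleEvent(key, host string) {
	r.stateLock.Lock()
	defer r.stateLock.Unlock()
	r.state[key] = host
}

// commit copies state under stateLock, releases it, and then runs the
// slow reload while holding only reloadLock.
func (r *router) commit(reload func(snapshot map[string]string)) {
	r.reloadLock.Lock()
	defer r.reloadLock.Unlock()

	r.stateLock.Lock()
	snapshot := make(map[string]string, len(r.state))
	for k, v := range r.state {
		snapshot[k] = v
	}
	r.stateLock.Unlock()

	reload(snapshot) // stands in for writing the conf file and reloading haproxy
	r.reloads++
}

func main() {
	r := &router{state: map[string]string{}}
	r.handleEvent("ns/a", "a.example.com")

	started := make(chan struct{})
	done := make(chan struct{})
	go func() {
		r.commit(func(s map[string]string) {
			close(started)
			time.Sleep(50 * time.Millisecond) // simulate a slow reload
		})
		close(done)
	}()

	<-started
	r.handleEvent("ns/b", "b.example.com") // proceeds while the reload is in flight
	<-done
	fmt.Println(len(r.state), r.reloads)
}
```

Copying the map costs memory on a system with many routes, but it lets event processing continue during a reload that may take 10+ seconds; rate limiting of `commit` is left out of the sketch.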

knobunc · 2017-07-21T13:41:42Z

[test] flaked on #14385 (logs https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/3353/)

openshift-bot · 2017-07-21T13:47:51Z

Evaluated for origin test up to 0b305fb

openshift-bot · 2017-07-21T15:38:04Z

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/3386/) (Base Commit: 8833a3a) (PR Branch Commit: 0b305fb)

imcsk8 · 2017-07-21T17:05:13Z

LGTM

knobunc · 2017-07-21T18:30:13Z

[merge]

openshift-bot · 2017-07-21T18:32:27Z

Evaluated for origin merge up to 0b305fb

smarterclayton · 2017-07-21T18:51:21Z

Is there a 3.6 variant?

knobunc · 2017-07-21T19:02:52Z

@smarterclayton we had decided not to, but we could be talked into the backport. The code has been this way since at least 3.4, and we've only had the one crash reported, and that was recovered.

smarterclayton · 2017-07-21T20:15:17Z

Ok. Agree the current "one thing running" is worse.

openshift-bot · 2017-07-22T10:07:47Z

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1345/) (Base Commit: 2ff8dfd) (PR Branch Commit: 0b305fb) (Image: devenv-rhel7_6477)

knobunc added the component/routing label Jul 20, 2017

knobunc added this to the 3.7.0 milestone Jul 20, 2017

knobunc self-assigned this Jul 20, 2017

knobunc requested review from rajatchopra, JacobTanenbaum and pecameron July 20, 2017 20:28

rajatchopra approved these changes Jul 20, 2017

liggitt reviewed Jul 21, 2017

openshift-bot merged commit 61ca304 into openshift:master Jul 22, 2017
