-
Notifications
You must be signed in to change notification settings - Fork 286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Grab the iptables lock before running iptables-restore. #1491
Conversation
c7990ce
to
5a3e989
Compare
3dd2673
to
48a743d
Compare
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few nits.
err = p.parseFailed(raw, "invalid float") | ||
return | ||
} | ||
result = time.Duration(millis * float64(time.Millisecond)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this cast needed? I'm curious to understand, if so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is. I want to allow the input to be a float, so you could say 1.2
(milliseconds). Go's casting rules are very strict so you can't do float * duration
(which is really an int64 number of nanoseconds) without a cast. Then I need to cast back to time.Duration
at the end.
Casting millis to time.Duration
first then multiplying doesn't work because that would lose precision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense - thanks.
intdataplane/int_dataplane.go
Outdated
forceDataplaneRefresh bool | ||
cleanupPending bool | ||
doneFirstApply bool | ||
// dataplaneNeedsSync is set if the dataplane is dirty in some way, I.e. we need to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lowercase "i.e."?
intdataplane/int_dataplane.go
Outdated
iptablesNATOptions.ExtraCleanupRegexPattern = rules.HistoricInsertedNATRuleRegex | ||
|
||
// Create the shared iptables lock. This allows us to block other processes from | ||
// manipulating iptables but it still allows us to do our work in parallel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what the "... in parallel" clause means here.
iptables/lock.go
Outdated
}) | ||
countLockTimeouts = prometheus.NewCounter(prometheus.CounterOpts{ | ||
Name: "felix_iptables_lock_timeouts", | ||
Help: "Number of lock timeouts while waiting for hte iptables lock(s).", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"hte"
iptables/lock.go
Outdated
return nil, Err14LockTimeout | ||
} | ||
time.Sleep(probeInterval) | ||
countLockRetries.Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels like it would be useful for the 14 and 16 metrics to be distinct - perhaps with labels. In a failing system, it's probably going to be one of them in particular that is problematic.
err = p.parseFailed(raw, "invalid float") | ||
return | ||
} | ||
result = time.Duration(millis * float64(time.Millisecond)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this cast needed? I'm curious to understand, if so.
intdataplane/int_dataplane.go
Outdated
iptablesNATOptions.ExtraCleanupRegexPattern = rules.HistoricInsertedNATRuleRegex | ||
|
||
// Create the shared iptables lock. This allows us to block other processes from | ||
// manipulating iptables but it still allows us to do our work in parallel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't understand the last clause on first reading this. Perhaps "but it still allows our own goroutines to modify iptables in parallel - which is OK because they are operating on different tables"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of queries about the new testing.
Makefile
Outdated
@@ -324,6 +330,16 @@ ut combined.coverprofile: vendor/.up-to-date $(FELIX_GO_FILES) | |||
@echo Running Go UTs. | |||
$(DOCKER_GO_BUILD) ./utils/run-coverage | |||
|
|||
bin/fv.test: vendor/.up-to-date $(FELIX_GO_FILES) | |||
$(DOCKER_GO_BUILD) go test ./fv -c --tags fvtests -o bin/fv.test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I added the k8sfv
directory, I needed to exclude it from some of the infrastructure for running UTs:
./utils/run-coverage:11: grep -vE '/vendor/|\./proto/|.glide|/k8sfv/' | \
./utils/run-coverage:24:ginkgo -cover -covermode=count -coverpkg=${go_dirs} -r -skipPackage k8sfv
./Makefile:352: $(DOCKER_GO_BUILD) ginkgo -r -skipPackage k8sfv $(GINKGO_OPTIONS)
./Makefile:357: $(DOCKER_GO_BUILD) ginkgo watch -r -skipPackage k8sfv $(GINKGO_OPTIONS)
Do you need to do that for fv
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw a neat trick in that Cloudflare article: using a build tag to identify which tests to build/run and I went for that approach. WDYT?
fv/iptables_lock_test.go
Outdated
BeforeEach(func() { | ||
containerName = fmt.Sprintf("felix-fv-%d-%d", os.Getpid(), containerIdx) | ||
containerIdx++ | ||
felixDir, err := os.Getwd() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these tests need felixDir to be correct? I would expect Getwd() here to give <reporoot>/fv
, not <reporoot>
- and I'm not sure that is what you intend with the name felixDir
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it works from the makefile due to the way we run it. Maybe I should be tolerant to that though, so that ginkgo watch
does the right thing.
- Copy and adapt kube-proxy's iptables lock implementation. - Implement a shared iptables lock that allows for writes to different tables (which are orthogonal). - Add a test executable that grabs the lock for a specified length of time along with FV tests that test correct blocking of iptables -w. These are the Felix changes required for #1470.
@fasaxc I think we've got duplicated release note information coming from this PR and projectcalico/calico#935. Can we remove the label from this one? (Or maybe we just leave it and say that's OK?) |
Sure. |
Grab the iptables lock before running iptables-restore.
Description
To avoid conflicts with kube-proxy (and others), take the iptables lock before running iptables-restore.
See #1470
Todos
Release Note