-
Notifications
You must be signed in to change notification settings - Fork 469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Healthcheck & Dashboard update #522
Conversation
docs/user-guide.md
Outdated
@@ -46,7 +46,9 @@ Usage of kube-router: | |||
-h, --help Print usage information. | |||
--hostname-override string Overrides the NodeName of the node. Set this if kube-router is unable to determine your NodeName automatically. | |||
--iptables-sync-period duration The delay between iptables rule synchronizations (e.g. '5s', '1m'). Must be greater than 0. (default 5m0s) | |||
--iptables-sync-timeout duration The timeout for iptables rule synchronizations (e.g. '5s', '1m'). Must be greater than 0. (default 30s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should specifically mention that it will cause health endpoint to return 500
pkg/cmd/kube-router.go
Outdated
@@ -17,11 +17,12 @@ import ( | |||
"github.com/cloudnativelabs/kube-router/pkg/options" | |||
"github.com/golang/glog" | |||
|
|||
"time" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should go to the very top with the rest of stdlib
@@ -19,14 +19,15 @@ import ( | |||
"github.com/golang/glog" | |||
"github.com/prometheus/client_golang/prometheus" | |||
|
|||
"regexp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, group with other stdlib
pkg/healthcheck/health_controller.go
Outdated
@@ -170,7 +153,7 @@ func (hc *HealthController) RunServer(stopCh <-chan struct{}, wg *sync.WaitGroup | |||
|
|||
//RunCheck starts the HealthController's check | |||
func (hc *HealthController) RunCheck(healthChan <-chan *ControllerHeartbeat, stopCh <-chan struct{}, wg *sync.WaitGroup) error { | |||
t := time.NewTicker(5000 * time.Millisecond) | |||
t := time.NewTicker(1000 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the new timeout, does it still make sense to lower this value?
} | ||
err := npc.Sync() | ||
if err != nil { | ||
metrics.ControllerErrors.WithLabelValues("NPC").Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make "NPC" a constant?
@@ -928,7 +941,7 @@ func cleanupStaleRules(activePolicyChains, activePodFwChains, activePolicyIPSets | |||
for podFwChain := range activePodFwChains { | |||
podFwChainRules, err := iptablesCmdHandler.List("filter", podFwChain) | |||
if err != nil { | |||
|
|||
glog.Errorf("Failed to list iptable filters") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
glog.Errorf("Failed to list iptable filters: %s", err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this have metrics.ControllerErrors.WithLabelValues(networkPolicyControllerLabel).Inc()
?
pkg/metrics/metrics_controller.go
Outdated
Help: "Time it took to sync internal bgp peers", | ||
}, []string{}) | ||
Name: "controller_bgp_advertisements", | ||
Help: "Number of BGP advertisements received and sent", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nit: can we make this received and advertised
to be consistent with code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ofc
@@ -354,7 +377,7 @@ func (nrc *NetworkRoutingController) advertisePodRoute() error { | |||
subnet), false, attrs, time.Now(), false)}); err != nil { | |||
return fmt.Errorf(err.Error()) | |||
} | |||
|
|||
metrics.ControllerBGPadvertisements.WithLabelValues("sent").Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need a if nrc.MetricsEnabled
check?
@roffe could you please rebase to latest master? |
We have encountered some cases where our nodes have not properly reloaded their iptable rules and this went silently undetected for some time. We looked through the logs and found errors such as:
I am curious if this PR could help us detect this and self heal as it would cause that effected router pod to fail a health check and get restarted. If it does, I am happy to help get this PR merged in. |
@jrthrawny no this pr would not adress that, it was mostly metrics and preventing a large rate of NSC.SYNC() to occur that was the motivation behind it but to much has changed in the codebase to salvage it now |
also parts of this pr is adressed by #595 the debounce parts and such i'm planing to recreate as soon as other changes i want in has been merged |
controller_errors
with label "controller"controller_bgp_advertisements_received
tocontroller_bgp_advertisements
with label "type"sent
oradvertised