Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Healthcheck & Dashboard update #522

Closed
wants to merge 14 commits into from
Closed

Healthcheck & Dashboard update #522

wants to merge 14 commits into from

Conversation

roffe
Copy link
Collaborator

@roffe roffe commented Aug 29, 2018

  • added configurable static timeout for the controllers, defaults to 30 sec
  • added metric controller_errors with label "controller"
  • changed metric controller_bgp_advertisements_received to controller_bgp_advertisements with label "type" sent or advertised

@roffe roffe requested review from murali-reddy and andrewsykim and removed request for murali-reddy August 29, 2018 10:56
@@ -46,7 +46,9 @@ Usage of kube-router:
-h, --help Print usage information.
--hostname-override string Overrides the NodeName of the node. Set this if kube-router is unable to determine your NodeName automatically.
--iptables-sync-period duration The delay between iptables rule synchronizations (e.g. '5s', '1m'). Must be greater than 0. (default 5m0s)
--iptables-sync-timeout duration The timeout for iptables rule synchronizations (e.g. '5s', '1m'). Must be greater than 0. (default 30s)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should specifically mention that it will cause health endpoint to return 500

@@ -17,11 +17,12 @@ import (
"github.com/cloudnativelabs/kube-router/pkg/options"
"github.com/golang/glog"

"time"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go to the very top with the rest of stdlib

@@ -19,14 +19,15 @@ import (
"github.com/golang/glog"
"github.com/prometheus/client_golang/prometheus"

"regexp"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, group with other stdlib

@@ -170,7 +153,7 @@ func (hc *HealthController) RunServer(stopCh <-chan struct{}, wg *sync.WaitGroup

//RunCheck starts the HealthController's check
func (hc *HealthController) RunCheck(healthChan <-chan *ControllerHeartbeat, stopCh <-chan struct{}, wg *sync.WaitGroup) error {
t := time.NewTicker(5000 * time.Millisecond)
t := time.NewTicker(1000 * time.Millisecond)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new timeout, does it still make sense to lower this value?

}
err := npc.Sync()
if err != nil {
metrics.ControllerErrors.WithLabelValues("NPC").Inc()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make "NPC" a constant?

@@ -928,7 +941,7 @@ func cleanupStaleRules(activePolicyChains, activePodFwChains, activePolicyIPSets
for podFwChain := range activePodFwChains {
podFwChainRules, err := iptablesCmdHandler.List("filter", podFwChain)
if err != nil {

glog.Errorf("Failed to list iptable filters")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

glog.Errorf("Failed to list iptable filters: %s", err)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this have metrics.ControllerErrors.WithLabelValues(networkPolicyControllerLabel).Inc()?

Help: "Time it took to sync internal bgp peers",
}, []string{})
Name: "controller_bgp_advertisements",
Help: "Number of BGP advertisements received and sent",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nit: can we make this received and advertised to be consistent with code?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ofc

@@ -354,7 +377,7 @@ func (nrc *NetworkRoutingController) advertisePodRoute() error {
subnet), false, attrs, time.Now(), false)}); err != nil {
return fmt.Errorf(err.Error())
}

metrics.ControllerBGPadvertisements.WithLabelValues("sent").Inc()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need a if nrc.MetricsEnabled check?

@murali-reddy
Copy link
Member

@roffe could you please rebase to latest master?

@jay-rob
Copy link

jay-rob commented Dec 8, 2018

We have encountered some cases where our nodes have not properly reloaded their iptable rules and this went silently undetected for some time. We looked through the logs and found errors such as:

E1206 22:10:16.636454 1 network_routes_controller.go:335] Failed to inject routes due to: no such process

I am curious if this PR could help us detect this and self heal as it would cause that effected router pod to fail a health check and get restarted. If it does, I am happy to help get this PR merged in.

@roffe
Copy link
Collaborator Author

roffe commented Dec 8, 2018

@jrthrawny no this pr would not adress that, it was mostly metrics and preventing a large rate of NSC.SYNC() to occur that was the motivation behind it but to much has changed in the codebase to salvage it now

@roffe roffe closed this Dec 8, 2018
@roffe
Copy link
Collaborator Author

roffe commented Dec 8, 2018

also parts of this pr is adressed by #595 the debounce parts and such i'm planing to recreate as soon as other changes i want in has been merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants