This repository has been archived by the owner on Feb 5, 2020. It is now read-only.

*: enable debug logging for etcd-operator by default #1425

Merged

Conversation

hasbro17
Contributor

This PR enables the debug logging feature introduced in etcd-operator 0.4.2.
coreos/etcd-operator#1232

The operator will now write logs for certain critical actions (like pod creation/deletion) to a logfile, which is persisted to an underlying hostPath volume on the node where the operator pod happens to run.

This logfile helps in debugging failed clusters where the logs for Kubernetes pods are no longer available. The logfile can be retrieved if the nodes of the failed cluster are still accessible by ssh.
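For context, here is a minimal sketch of how this ends up wired into the operator's Deployment manifest. The --debug-logfile-path flag and the hostPath path are taken from this PR; the volumeMounts stanza (mounting debug-volume at /var/tmp) is an assumption about how the volume is attached to the container, since that part is not shown in the review diff below.

# Sketch only, not the full manifest; the volumeMounts stanza is assumed.
spec:
  template:
    spec:
      containers:
      - name: etcd-operator
        image: ${etcd_operator_image}
        command:
        - /usr/local/bin/etcd-operator
        - --debug-logfile-path=/var/tmp/etcd-operator/debug/debug.log
        volumeMounts:
        - name: debug-volume
          mountPath: /var/tmp   # assumed: must contain the logfile path above
      volumes:
      - name: debug-volume
        hostPath:
          path: /var/tmp        # node directory where debug.log survives pod restarts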

/cc @xiang90 @hongchaodeng @coresolve

@ggreer
Contributor

ggreer commented Jul 17, 2017

Just curious, but could debug logging cause long-lived clusters to run out of disk space?

@xiang90

xiang90 commented Jul 17, 2017

@ggreer

We try to keep the debug logging minimal. If there is no failure, nothing will be logged.

We can rotate the log and clean up logs older than a day. Not sure if it is worth the effort for now.

/cc @hasbro17

@hasbro17
Contributor Author

@ggreer Yes, it could. However, the debug logs currently are not periodic and are only meant to record the following three events, so as to keep the logfile small:

  • Create an etcd pod
  • Delete an etcd pod
  • An update to the cluster spec (see the sketch at the end of this comment)

None of the above are common actions for a healthy cluster.
The cluster would have to be in a state of flux for long enough to generate enough logs to exhaust disk space on the underlying master node, e.g. the etcd cluster being constantly scaled up and down, or etcd pods repeatedly becoming unhealthy and getting recreated by the operator.

As for what counts as "long enough", I don't really have an estimate of the time it would take to run out of disk space, since that depends on factors like the available disk size and the average rate of changes to the cluster state.

Admittedly, this is a bit of a temporary hack to make debugging failed self-hosted clusters easier for the field team. In the future, we probably won't write the debug logs to the node's disk, but will instead send them to a more appropriate sink.
Plus, since self-hosted etcd is currently experimental, I'm not sure how common long-running self-hosted clusters are.
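As a rough illustration of the third event above (an update to the cluster spec), a scale-up would look something like the change below to the cluster resource managed by the etcd-operator. This is only a sketch: the apiVersion, kind, and field names are illustrative and depend on the etcd-operator version in use.

# Hypothetical cluster resource; a change to spec (e.g. size 3 -> 5) is one of the logged events.
apiVersion: "etcd.database.coreos.com/v1beta2"   # illustrative; older operator releases used a TPR with a different apiVersion
kind: EtcdCluster
metadata:
  name: example-etcd-cluster
spec:
  size: 5          # scaled up from 3; this spec update would be debug-logged
  version: "3.1.8"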

@hasbro17 hasbro17 changed the title *: enable debug logging for etcd-operator by default [WIP]*: enable debug logging for etcd-operator by default Jul 17, 2017
@hasbro17
Contributor Author

Actually, do not merge this just yet. I'm having some issues testing this out on the master branch. It worked fine with the 1.6.7 release.

@hasbro17 hasbro17 force-pushed the haseeb/enable-default-debug-logging branch from a1dbda0 to ffe1b4a Compare July 17, 2017 22:14
@hasbro17 hasbro17 changed the title [WIP]*: enable debug logging for etcd-operator by default *: enable debug logging for etcd-operator by default Jul 17, 2017
@hasbro17
Contributor Author

I messed up the pod spec while switching from an earlier branch.
I've tested this PR out manually by setting up a cluster and confirming the presence of the debug log file. So this should be good to go once the CI passes.

@hasbro17
Contributor Author

@coreosbot run e2e

Contributor

@s-urbaniak s-urbaniak left a comment

I have a concern regarding the hostPath mounting.

@@ -31,6 +35,12 @@ spec:
value: /tmp
image: ${etcd_operator_image}
name: etcd-operator
command:
- /usr/local/bin/etcd-operator
- --debug-logfile-path=/var/tmp/etcd-operator/debug/debug.log
Contributor

@s-urbaniak s-urbaniak Jul 18, 2017

question: why does this need to go into a log file? Are these debug messages also going to be visible in kubectl logs your-etcd-operator?

@xiang90

For self-hosted etcd, when etcd is down, k8s is down. When k8s is down, kubectl is unusable. The whole point of this is to ensure we log to disk for debugging purposes.

volumes:
- name: debug-volume
hostPath:
path: /var/tmp
Contributor

@s-urbaniak s-urbaniak Jul 18, 2017

I am worried that we are hardcoding a host path here. The etcd-operator is a deployment, hence subject to being rescheduled by k8s at any time. This /var/tmp/etcd-operator/debug/debug.log file will eventually be sprinkled across all master nodes. Judging from https://github.com/coreos/etcd-operator/blob/c946e30490947dc8b171fc4439a98356c7a85078/pkg/debug/debug_logger.go#L51 I see that this at least opens the file using O_APPEND, but those logs would still be pretty inconsistent in the face of rescheduling.

Can't the debug output simply go to stdout, so that it is captured by the standard k8s logging facilities?

@xiang90

If we could force every Tectonic user to use a logging system like Splunk, that would be a great help. But most of the users we interact with today have no logging system set up, which makes debugging self-hosted etcd a huge problem: when k8s is down, we have no easy way to get logs.

With this hacky approach, we can at least get the logging we want by downloading files from a well-known path on all master nodes. We are not really worried about the logs spreading too much: the operator is leader-elected, and time skew should not really be a problem.

And something is better than nothing.

@s-urbaniak
Contributor

@xiang90 thanks a lot for the reasoning! You have a point in the context of self-hosted etcd. Also, for failed etcd pods, the kubelet would clean them up, so even docker logs ... wouldn't work on the host itself. LGTM then.

@s-urbaniak
Contributor

ok to test

@xiang90

xiang90 commented Jul 18, 2017

Also, for failed etcd pods, the kubelet would clean them up, so even docker logs ... wouldn't work on the host itself

exactly.

@s-urbaniak
Contributor

@mxinden @cpanato do you mind looking into the "Permission denied" CI failure?

@cpanato
Contributor

cpanato commented Jul 19, 2017

@s-urbaniak will look into it

@s-urbaniak
Contributor

ok to test

@s-urbaniak s-urbaniak merged commit e98f485 into coreos:master Jul 19, 2017
@hasbro17 hasbro17 deleted the haseeb/enable-default-debug-logging branch July 19, 2017 22:50
squat pushed a commit to squat/tectonic-installer that referenced this pull request Sep 25, 2017