This repository has been archived by the owner on Feb 5, 2020. It is now read-only.

*: enable debug logging for etcd-operator by default #1425

Merged

Conversation

hasbro17
Contributor

This PR enables the debug logging feature introduced in etcd-operator 0.4.2.
coreos/etcd-operator#1232

The operator will now write logs for certain critical actions (like pod creation/deletion) to a logfile, which is persisted to an underlying hostPath volume on the node where the operator pod happens to run.

This logfile helps in debugging failed clusters where the logs for Kubernetes pods are no longer available. The logfile can be retrieved if the nodes of the failed cluster are still accessible by ssh.
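For context, here is a minimal sketch of how this ends up wired into the operator's Deployment manifest. The --debug-logfile-path flag and the hostPath path are taken from this PR; the volumeMounts stanza (mounting debug-volume at /var/tmp) is an assumption about how the volume is attached to the container, since that part is not shown in the review diff below.

# Sketch only, not the full manifest; the volumeMounts stanza is assumed.
spec:
  template:
    spec:
      containers:
      - name: etcd-operator
        image: ${etcd_operator_image}
        command:
        - /usr/local/bin/etcd-operator
        - --debug-logfile-path=/var/tmp/etcd-operator/debug/debug.log
        volumeMounts:
        - name: debug-volume
          mountPath: /var/tmp   # assumed: must contain the logfile path above
      volumes:
      - name: debug-volume
        hostPath:
          path: /var/tmp        # node directory where debug.log survives pod restarts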

/cc @xiang90 @hongchaodeng @coresolve

@ggreer
Contributor

ggreer commented Jul 17, 2017

Just curious, but could debug logging cause long-lived clusters to run out of disk space?

@xiang90

xiang90 commented Jul 17, 2017

@ggreer

We try to keep the debug logging minimal. If there is no failure, nothing will be logged.

We can rotate the log and clean up logs older than a day. Not sure if it is worth the effort for now.

/cc @hasbro17

@hasbro17
Contributor Author

@ggreer Yes, it could. However, the debug logs currently are not periodic and are only meant to record the following three events, so as to keep the logfile small:

  • Create an etcd pod
  • Delete an etcd pod
  • An update to the cluster spec (see the sketch at the end of this comment)

None of the above are common actions for a healthy cluster.
The cluster would have to be in a state of flux for long enough to generate enough logs to exhaust disk space on the underlying master node, e.g. the etcd cluster being constantly scaled up and down, or etcd pods repeatedly becoming unhealthy and getting recreated by the operator.

As for what counts as "long enough", I don't really have an estimate of the time it would take to run out of disk space, since that depends on factors like the available disk size and the average rate of changes to the cluster state.

Admittedly, this is a bit of a temporary hack to make debugging failed self-hosted clusters easier for the field team. In the future, we probably won't write the debug logs to the node's disk, but will instead send them to a more appropriate sink.
Plus, since self-hosted etcd is currently experimental, I'm not sure how common long-running self-hosted clusters are.
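As a rough illustration of the third event above (an update to the cluster spec), a scale-up would look something like the change below to the cluster resource managed by the etcd-operator. This is only a sketch: the apiVersion, kind, and field names are illustrative and depend on the etcd-operator version in use.

# Hypothetical cluster resource; a change to spec (e.g. size 3 -> 5) is one of the logged events.
apiVersion: "etcd.database.coreos.com/v1beta2"   # illustrative; older operator releases used a TPR with a different apiVersion
kind: EtcdCluster
metadata:
  name: example-etcd-cluster
spec:
  size: 5          # scaled up from 3; this spec update would be debug-logged
  version: "3.1.8"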

@hasbro17 hasbro17 changed the title *: enable debug logging for etcd-operator by default [WIP]*: enable debug logging for etcd-operator by default Jul 17, 2017
@hasbro17
Contributor Author

Actually, do not merge this just yet. I'm having some issues testing this out on the master branch. It worked fine with the 1.6.7 release.

@hasbro17 hasbro17 force-pushed the haseeb/enable-default-debug-logging branch from a1dbda0 to ffe1b4a Compare July 17, 2017 22:14
@hasbro17 hasbro17 changed the title [WIP]*: enable debug logging for etcd-operator by default *: enable debug logging for etcd-operator by default Jul 17, 2017
@hasbro17
Contributor Author

I messed up the pod spec while switching from an earlier branch.
I've tested this PR out manually by setting up a cluster and confirming the presence of the debug log file. So this should be good to go once the CI passes.

@hasbro17
Contributor Author

@coreosbot run e2e

Contributor

@s-urbaniak s-urbaniak left a comment

I have a concern regarding the hostPath mounting.

@@ -31,6 +35,12 @@ spec:
value: /tmp
image: ${etcd_operator_image}
name: etcd-operator
command:
- /usr/local/bin/etcd-operator
- --debug-logfile-path=/var/tmp/etcd-operator/debug/debug.log
Contributor

@s-urbaniak s-urbaniak Jul 18, 2017

question: why does this need to go into a log file? Are these debug messages also going to be visible in kubectl logs your-etcd-operator?

@xiang90

For self-hosted etcd, when etcd is down, k8s is down. When k8s is down, kubectl is unusable. The whole point of this is to ensure we log to disk for debugging purposes.

volumes:
- name: debug-volume
hostPath:
path: /var/tmp
Contributor

@s-urbaniak s-urbaniak Jul 18, 2017

I am worried that we are hardcoding a host path here. The etcd-operator is a deployment, hence subject to being rescheduled by k8s at any time. This /var/tmp/etcd-operator/debug/debug.log file will eventually be sprinkled across all master nodes. Judging from https://github.com/coreos/etcd-operator/blob/c946e30490947dc8b171fc4439a98356c7a85078/pkg/debug/debug_logger.go#L51 I see that this at least opens the file using O_APPEND, but those logs would still be pretty inconsistent in the face of rescheduling.

Can't the debug output simply go to stdout, so that it is captured by the standard k8s logging facilities?

@xiang90

If we could force every Tectonic user to use a logging system like Splunk, that would be a great help. But most of the users we interact with today have no logging system set up, which makes debugging self-hosted etcd a huge problem: when k8s is down, we have no easy way to get logs.

With this hacky approach, we can at least get the logging we want by downloading files from a well-known path on all master nodes. We are not really worried about the logs spreading too much: the operator is leader-elected, and time skew should not really be a problem.

And something is better than nothing.

@s-urbaniak
Contributor

@xiang90 thanks a lot for the reasoning! You have a point in the context of self-hosted etcd. Also, for failed etcd pods, the kubelet would clean them up, so even docker logs ... wouldn't work on the host itself. LGTM then.

@s-urbaniak
Contributor

ok to test

@xiang90

xiang90 commented Jul 18, 2017

Also, for failed etcd pods, the kubelet would clean them up, so even docker logs ... wouldn't work on the host itself

exactly.

@s-urbaniak
Contributor

@mxinden @cpanato do you mind looking into the "Permission denied" CI failure?

@cpanato
Contributor

cpanato commented Jul 19, 2017

@s-urbaniak will look into it

@s-urbaniak
Contributor

ok to test

@s-urbaniak s-urbaniak merged commit e98f485 into coreos:master Jul 19, 2017
@hasbro17 hasbro17 deleted the haseeb/enable-default-debug-logging branch July 19, 2017 22:50
squat pushed a commit to squat/tectonic-installer that referenced this pull request Sep 25, 2017