*: enable debug logging for etcd-operator by default #1425
Conversation
Just curious, but could debug logging cause long-lived clusters to run out of disk space?
@ggreer Yes, it could. However, the debug logs currently are not periodic and are only meant to record the following 3 events, so as to keep the file size minimal:
All of the above should not be common actions for a normal cluster. As for how long a cluster could run before filling the disk, I don't really have an estimate, since that depends on factors like the available disk size and the average rate of changes to the cluster state over time. Admittedly, this is a bit of a temporary hack to make debugging failed self-hosted clusters easier for the field team. In the future, we probably won't write the debug logs to the node's disk, but will instead send them to a more appropriate sink.
Actually, do not merge this just yet. I'm having some issues testing this out on the master branch. It worked fine with the 1.6.7 release.
a1dbda0 to ffe1b4a (Compare)
Messed up the pod spec while switching from an earlier branch.
@coreosbot run e2e
I have a concern regarding the hostPath mounting.
@@ -31,6 +35,12 @@ spec:
          value: /tmp
        image: ${etcd_operator_image}
        name: etcd-operator
        command:
        - /usr/local/bin/etcd-operator
        - --debug-logfile-path=/var/tmp/etcd-operator/debug/debug.log
Question: why does this have to go into a log file? Will these debug messages also be visible in kubectl logs your-etcd-operator?
For self-hosted etcd, when etcd is down, k8s is down; and when k8s is down, kubectl is unusable. The whole point of this is to make sure we log to disk for debugging purposes.
      volumes:
      - name: debug-volume
        hostPath:
          path: /var/tmp
I am worried that we are hardcoding a host path here. The etcd-operator is a Deployment, hence subject to being rescheduled by k8s at any time. This /var/tmp/etcd-operator/debug/debug.log file will eventually be sprinkled across all master nodes. Judging from https://github.com/coreos/etcd-operator/blob/c946e30490947dc8b171fc4439a98356c7a85078/pkg/debug/debug_logger.go#L51 I see that this at least opens the file using O_APPEND, but those logs would still be pretty inconsistent in the face of rescheduling.
Can't the debug output simply go to stdout, so that it is captured by the standard k8s logging facilities?
If we could force every Tectonic user to use a logging system like Splunk, that would be a great help. But most of the users we interact with today have no logging system set up, which makes debugging self-hosted etcd a huge problem: when k8s is down, we have no easy way to get logs.
With this hacky approach, we can at least get the logs we want by downloading files from a well-known path on all master nodes. We are not really worried about the logs spreading across nodes; the operator is leader-elected, and time skew should not be a real problem.
And something is better than nothing.
@xiang90 thanks a lot for the reasoning! You have a point in the context of self-hosted etcd. LGTM in that case. Also, for failed etcd pods the kubelet would clean up, so even
ok to test
Exactly.
@s-urbaniak will look into
ok to test
This PR enables the debug logging feature introduced in etcd-operator 0.4.2.
coreos/etcd-operator#1232
The operator will now write logs for certain critical actions (like pod creation/deletion) to a logfile, which will be persisted to an underlying hostPath volume on the node where the operator pod happens to run.
This logfile helps in debugging failed clusters where the logs for Kubernetes pods are no longer available. The logfile can be retrieved if the nodes of the failed cluster are still accessible by ssh.
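To make the manifest side of this concrete, here is a minimal consolidated sketch of how the quoted hunks fit together. Only the command flag, the debug-volume name, and the hostPath path appear in the hunks above; the volumeMounts entry and its mountPath are assumptions added purely for illustration.

```yaml
# Illustrative sketch only; not the full Deployment manifest from this PR.
# The volumeMounts entry and its mountPath are assumed, so that the container
# path given to --debug-logfile-path lands on the hostPath volume below.
spec:
  containers:
  - name: etcd-operator
    image: ${etcd_operator_image}
    command:
    - /usr/local/bin/etcd-operator
    - --debug-logfile-path=/var/tmp/etcd-operator/debug/debug.log
    volumeMounts:
    - name: debug-volume
      mountPath: /var/tmp   # assumed mount point inside the container
  volumes:
  - name: debug-volume
    hostPath:
      path: /var/tmp        # well-known path on whichever master runs the operator
```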
/cc @xiang90 @hongchaodeng @coresolve