[5.5.x] Add since flag to gravity report #1719
Conversation
Also, if we could roll a few of those additional system collectors from #1273 in the same PR, that'd be cool (less forward-porting burden later).
Mostly cosmetic feedback.
Force-pushed from 3c6c2cc to f57bd02
lib/defaults/defaults.go
Outdated
```diff
@@ -851,6 +854,8 @@ const (
 	// EtcdLocalAddr is the local etcd address
 	EtcdLocalAddr = "https://127.0.0.1:2379"
+	// EtcdLocalMetricsAddr is address where etcd metrics are exposed
+	EtcdLocalMetricsAddr = "https://127.0.0.1:4001"
```
I believe the metrics can be accessed from the :2379 endpoint. It's our intention to consolidate these into a single port, I just don't think anyone has gotten around to it.
lib/report/system.go
Outdated
```go
// etcdMetrics fetches etcd metrics
func etcdMetrics() Collectors {
	return Collectors{
		Cmd("etcd-metrics", utils.PlanetCommandArgs("/usr/bin/curl", "-s", "--insecure", "--tlsv1.2",
```
I don't believe the insecure flag is needed or should be used here. I believe all the required certs/configuration is being passed for mTLS.

BTW, we hardcode insecure all over the place, so this isn't a new issue, but we should really be moving away from hardcoding insecure TLS connections when we have all the info needed to properly do an mTLS connection.
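As an aside, here is a minimal sketch of what a proper mTLS connection could look like in Go, reusing the cert/key paths from the curl command in the description and the :2379 endpoint mentioned above. This is illustrative only, not the project's actual collector code:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Client cert/key and CA taken from the curl flags in the PR description.
	cert, err := tls.LoadX509KeyPair("/var/state/etcd.cert", "/var/state/etcd.key")
	if err != nil {
		log.Fatal(err)
	}
	caPEM, err := os.ReadFile("/var/state/root.cert")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{cert},
				RootCAs:      pool, // verify the server instead of skipping verification
				MinVersion:   tls.VersionTLS12,
			},
		},
	}
	resp, err := client.Get("https://127.0.0.1:2379/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body) // dump the raw Prometheus metrics
}
```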
tool/gravity/cli/register.go
Outdated
```diff
@@ -534,7 +534,8 @@ func RegisterCommands(app *kingpin.Application) *Application {
 	// get cluster diagnostics report
 	g.ReportCmd.CmdClause = g.Command("report", "Generate cluster diagnostics report")
-	g.ReportCmd.FilePath = g.ReportCmd.Flag("file", "target report file name").Default("report.tar.gz").String()
+	g.ReportCmd.FilePath = g.ReportCmd.Flag("file", "Target report file name").Default("report.tar.gz").String()
+	g.ReportCmd.Since = g.ReportCmd.Flag("since", "Only return logs newer than a relative duration like 5s, 2m, or 3h").Duration()
```
I think we should be setting a default for `since`; something sensible would be like 7 or 14 days.

As I see it, to get value out of it, all of our customers and customers of customers need to be taught about this flag, to use it, etc. We also have potential for UX friction, where for at least some time we'll have a number of clusters that do not have this flag, leading to unknown-flag errors when trying to follow an updated runbook or docs, for example. Basically, depending on the version of gravity deployed, the debug report needs to be used differently to actually get a smaller report.
Yeah, that's a good point. We can set it to 14 days. I'll edit the description to include the default and to note that you can pass `0s` if you would like the full logs.
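For reference, a minimal self-contained sketch of what that default could look like with kingpin (`time.ParseDuration` has no day unit, so 14 days has to be written as `336h`; the flag wording here is illustrative, not the PR's final text):

```go
package main

import (
	"fmt"
	"time"

	"gopkg.in/alecthomas/kingpin.v2"
)

func main() {
	app := kingpin.New("gravity", "demo")
	report := app.Command("report", "Generate cluster diagnostics report")
	// 336h == 14 days; time.ParseDuration does not accept a "d" unit,
	// so the default has to be expressed in hours.
	since := report.Flag("since",
		"Only return logs newer than a relative duration like 5s, 2m, or 3h. Pass 0s for full logs.").
		Default("336h").Duration()

	kingpin.MustParse(app.Parse([]string{"report"}))
	fmt.Println(*since == 336*time.Hour) // true
}
```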
```go
case FilterEtcd:
	collectors = append(collectors, etcdBackup()...)
	collectors = append(collectors, etcdMetrics()...)
```
I'm not familiar enough with the filters, but based on your sample output, this seems to only collect metrics from one node in the cluster. Some of the metrics will be relevant to the particular server, though, so this should likely be collected on each master node, so each server can be inspected individually. Be careful not to collect on the worker nodes, as those connections will be randomly load-balanced to a master.

Unfortunately, what makes this even more complicated is that many of the metrics are not useful on their own in a single snapshot. Many metrics will be counters since startup, which don't say anything on their own. It's only when you delta the current snapshot with one from a minute ago that you get a picture of what was happening over the last minute. This isn't really isolated to etcd; I don't think we've planned how to pull any metric histories as part of the debug reporting. So we might want to put together a plan for pulling out valuable metrics histories.
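To illustrate the counter point, a hypothetical sketch of turning two raw counter samples into a per-second rate (the metric name and sampling interval are made up for the example):

```go
package main

import (
	"fmt"
	"time"
)

// sample is one scrape of a monotonically increasing counter,
// e.g. something like etcd_server_proposals_committed_total.
type sample struct {
	value float64
	at    time.Time
}

// rate returns the per-second increase between two samples; a single
// snapshot of a counter says nothing about current behavior.
func rate(prev, cur sample) float64 {
	return (cur.value - prev.value) / cur.at.Sub(prev.at).Seconds()
}

func main() {
	prev := sample{value: 1200, at: time.Now().Add(-time.Minute)}
	cur := sample{value: 1500, at: time.Now()}
	fmt.Printf("%.1f ops/sec over the last minute\n", rate(prev, cur)) // 5.0
}
```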
Updated so that etcd metrics are collected on all master nodes in fab170f.
We should definitely look to do some additional processing of the metrics. For this PR I think the single snapshot can still provide some useful information?
- Reboots history
- Kernel information
- Serf cluster
- Swap status
- Etcd metrics
There is already a gravity-system file. Conflict occurs when trying to unzip gravity-system.log.gz.
- Fetch metrics from :2379 instead of :4001
- No need for the insecure flag
Force-pushed from f57bd02 to fab170f
Sorry I sort of missed on this PR - why was the kubernetes logs collector removed? I liked how it was grouping the logs by namespace, and it would also be beneficial to use the stock tools which could provide more details - why can't they co-exist?
Also, the new flag is not propagated to the planet journal log collector.
Description

This PR adds the `since` flag to the `gravity report` command. This flag is used to filter the log entries by time. The current implementation only affects the log entries collected from `kubectl logs` and `journalctl`.

Add a few more system collectors that have come up recently (a rough sketch of what they gather follows this list):
- last -x
- uname -a
- serf members
- swapon -s (we have vmstat but wouldn't hurt)
- curl -s --insecure --tlsv1.2 --cacert /var/state/root.cert --cert /var/state/etcd.cert --key /var/state/etcd.key https://localhost:4001/metrics
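A rough, self-contained approximation of what these collectors gather; this shells out directly and is not the project's actual Cmd/Collectors API, and binary availability via PATH is assumed:

```go
package main

import (
	"fmt"
	"os/exec"
)

// Named after the commands in the list above; binaries are resolved
// via PATH, so availability depends on the host.
var collectors = [][]string{
	{"last", "-x"},
	{"uname", "-a"},
	{"serf", "members"},
	{"swapon", "-s"},
}

func main() {
	for _, args := range collectors {
		out, err := exec.Command(args[0], args[1:]...).CombinedOutput()
		if err != nil {
			fmt.Printf("%v failed: %v\n", args, err)
			continue
		}
		fmt.Printf("=== %v ===\n%s\n", args, out)
	}
}
```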
Testing done

- Verify additional collectors
- Verify kubectl logs are filtered by time with `--since 1m`
- Verify journal logs are filtered by time with `--since 1m`
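For reference, the checks above presumably boil down to invocations along the lines of `gravity report --since 1m --file report.tar.gz` (flag names taken from the diff above), followed by confirming that the bundled kubectl and journal logs only cover the last minute.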