This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

[5.5.x] Add since flag to gravity report #1719

Merged
merged 15 commits into version/5.5.x from bernard/5.5.x/time-filter on Jun 23, 2020

Conversation

@bernardjkim (Contributor) commented Jun 16, 2020:

Description

This PR adds the --since flag to the gravity report command. The flag filters log entries by time; the current implementation only affects entries collected from kubectl logs and journalctl.
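
Mechanically, the flag just threads a time.Duration down to those two log sources. A minimal, self-contained sketch of the translation, assuming standard kubectl logs --since and journalctl --since semantics (the sinceArgs helper is illustrative, not gravity's actual API):

```go
package main

import (
	"fmt"
	"time"
)

// sinceArgs is a hypothetical helper showing how a --since duration
// could map onto the two log sources this PR filters.
func sinceArgs(since time.Duration) (kubectlArgs, journalctlArgs []string) {
	if since <= 0 {
		return nil, nil // zero duration: collect the full logs
	}
	// kubectl logs accepts a Go-style relative duration, e.g. --since=1m0s.
	kubectlArgs = []string{"logs", fmt.Sprintf("--since=%s", since)}
	// journalctl accepts an absolute timestamp, e.g. --since "2020-06-13 09:36:46".
	cutoff := time.Now().Add(-since).Format("2006-01-02 15:04:05")
	journalctlArgs = []string{"--since", cutoff}
	return kubectlArgs, journalctlArgs
}

func main() {
	k, j := sinceArgs(time.Minute)
	fmt.Println(k, j)
}
```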

It also adds a few more system collectors that have come up recently (a rough collection sketch follows the list):

  • Reboots history: last -x
  • Kernel information: uname -a
  • Serf cluster: serf members
  • Swap status: swapon -s (we already have vmstat, but it wouldn't hurt)
  • Etcd metrics: curl -s --insecure --tlsv1.2 --cacert /var/state/root.cert --cert /var/state/etcd.cert --key /var/state/etcd.key https://localhost:4001/metrics
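
For illustration, a self-contained sketch of gathering these outputs with plain os/exec (gravity's real Collectors/Cmd API isn't reproduced here, and the serf/etcd commands would additionally need to run inside the planet container):

```go
package main

import (
	"fmt"
	"os/exec"
)

// collector pairs a report file name with the command that produces it.
type collector struct {
	name string
	cmd  []string
}

func main() {
	collectors := []collector{
		{"reboot-history", []string{"last", "-x"}},
		{"uname", []string{"uname", "-a"}},
		{"serf-members", []string{"serf", "members"}},
		{"swapon", []string{"swapon", "-s"}},
	}
	for _, c := range collectors {
		out, err := exec.Command(c.cmd[0], c.cmd[1:]...).CombinedOutput()
		if err != nil {
			fmt.Printf("%s: %v\n", c.name, err)
			continue
		}
		fmt.Printf("=== %s ===\n%s\n", c.name, out)
	}
}
```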

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change has a user-facing impact

Linked tickets and other PRs

TODOs

  • Self-review the change
  • Perform manual testing
  • Address review feedback

Testing done

Verify additional collectors

[gravity-report] tree report
report
├── cluster.json
├── node-1-debug-logs
│   ├── bridge-fdb
│   ├── cluster-status
│   ├── df
│   ├── df-inodes
│   ├── dmesg
│   ├── dmsetup
│   ├── etcd-status
│   ├── etc-logs.tar.gz
│   ├── fdisk
│   ├── free
│   ├── gravity-install.log
│   ├── gravity-packages.yaml
│   ├── gravity-system.log
│   ├── gravity-system.log.gz
│   ├── host-system-failed
│   ├── host-system-jobs
│   ├── host-system-status
│   ├── ifconfig
│   ├── ip-addr
│   ├── ip-link
│   ├── ip-maddress
│   ├── ip-neighbor
│   ├── ip-netconf
│   ├── ip-ntable
│   ├── ip-route
│   ├── ip-rule
│   ├── iptables
│   ├── ip-tcp_metrics
│   ├── lsblk
│   ├── lscpu
│   ├── lsmod
│   ├── planet-journal-export.log.gz
│   ├── planet-status
│   ├── planet-system-failed
│   ├── planet-system-jobs
│   ├── planet-system-status
│   ├── reboot-history  <---
│   ├── route
│   ├── running-processes
│   ├── serf-members  <---
│   ├── slabinfo
│   ├── slabtop
│   ├── swapon <---
│   ├── uname  <---
│   └── vmstat
├── node-1-debug-logs.tar.gz
├── node-2-etcd
│   ├── etcd-backup.json
│   └── etcd-metrics  <---
├── node-2-etcd.tar.gz
[...]

Verify kubectl logs are filtered by time

  • --since 1m
[vagrant@node-1 node-2-k8s-logs]$ cat k8s-logs-kube-system-gravity-site-kn59t-gravity-site
2020-06-13T09:36:30Z WARN [PROXY:AGE] Failed to create remote tunnel: dial tcp 172.28.128.101:61024: connect: connection refused, conn: <nil>. target:172.28.128.101:61024 reversetunnel/agent.go:447
2020-06-13T09:36:30Z INFO             2020/06/13 09:36:30 http: TLS handshake error from 127.0.0.1:32834: tls: internal error: failed to clone hash runtime/asm_amd64.s:1373
2020-06-13T09:36:40Z WARN [PROXY:AGE] Failed to create remote tunnel: dial tcp 172.28.128.101:61024: connect: connection refused, conn: <nil>. target:172.28.128.101:61024 reversetunnel/agent.go:447
2020-06-13T09:36:40Z INFO             2020/06/13 09:36:40 http: TLS handshake error from 127.0.0.1:32900: tls: internal error: failed to clone hash runtime/asm_amd64.s:1373
2020-06-13T09:36:40Z WARN [TELEPROXY] "Failed to renew certificate for [email protected]: \nERROR REPORT:\nOriginal Error: *trace.ConnectionProblemError remote error: tls: internal error\nStack Trace:\n\t/home/bernard/go/src/github.com/gravitational/gravity/vendor/github.com/gravitational/teleport/lib/httplib/httplib.go:110 github.com/gravitational/gravity/vendor/github.com/gravitational/teleport/lib/httplib.ConvertResponse\n\t/home/bernard/go/src/github.com/gravitational/gravity/vendor/github.com/gravitational/teleport/lib/auth/clt.go:211 github.com/gravitational/gravity/vendor/github.com/gravitational/teleport/lib/auth.(*Client).PostJSON\n\t/home/bernard/go/src/github.com/gravitational/gravity/vendor/github.com/gravitational/teleport/lib/auth/clt.go:1169 github.com/gravitational/gravity/vendor/github.com/gravitational/teleport/lib/auth.(*Client).GenerateUserCert\n\t/home/bernard/go/src/github.com/gravitational/gravity/lib/process/proxy.go:125 github.com/gravitational/gravity/lib/process.(*teleportProxyService).initAuthMethods.func1\n\t/home/bernard/go/src/github.com/gravitational/gravity/lib/process/proxy.go:154 github.com/gravitational/gravity/lib/process.(*teleportProxyService).initAuthMethods\n\t/usr/lib/go-1.14/src/runtime/asm_amd64.s:1373 runtime.goexit\nUser Message: remote error: tls: internal error." process/proxy.go:155
2020-06-13T09:36:50Z WARN [PROXY:AGE] Failed to create remote tunnel: dial tcp 172.28.128.101:61024: connect: connection refused, conn: <nil>. target:172.28.128.101:61024 reversetunnel/agent.go:447
2020-06-13T09:36:58Z INFO [RBAC]      Access to create user in namespace default denied to roles Node,default-implicit-role: no allow rule matched. services/role.go:1838
2020-06-13T09:37:00Z WARN [PROXY:AGE] Failed to create remote tunnel: dial tcp 172.28.128.101:61024: connect: connection refused, conn: <nil>. target:172.28.128.101:61024 reversetunnel/agent.go:447
2020-06-13T09:37:10Z WARN [PROXY:AGE] Failed to create remote tunnel: dial tcp 172.28.128.101:61024: connect: connection refused, conn: <nil>. target:172.28.128.101:61024 reversetunnel/agent.go:447
2020-06-13T09:37:20Z WARN [PROXY:AGE] Failed to create remote tunnel: dial tcp 172.28.128.101:61024: connect: connection refused, conn: <nil>. target:172.28.128.101:61024 reversetunnel/agent.go:447

Verify journal logs are filtered by time

  • --since 1m
  • Not entirely sure how to read these timestamps (__REALTIME_TIMESTAMP appears to be microseconds since the Unix epoch, so 1592041006226388 is 2020-06-13T09:36:46Z), but there are far fewer log entries than in a report generated without a time filter.
[vagrant@node-1 node-1-debug-logs]$ cat gravity-system.log
__CURSOR=s=05ed0a8050e54d6eb25b26ebd98fd6d3;i=1d5d;b=7a3504e73fcc4cb186958464f737b949;m=1ea97c5500;t=5a7f3eea4dbd4;x=bfdc13f631df90fd
__REALTIME_TIMESTAMP=1592041006226388
__MONOTONIC_TIMESTAMP=131692516608
_BOOT_ID=7a3504e73fcc4cb186958464f737b949
_UID=0
_MACHINE_ID=ee32f6c4a34e4799afc544610c083982
PRIORITY=5
_CAP_EFFECTIVE=1fffffffff
_TRANSPORT=syslog
SYSLOG_FACILITY=10
_AUDIT_LOGINUID=1000
_SYSTEMD_OWNER_UID=1000
_SYSTEMD_SLICE=user-1000.slice
SYSLOG_IDENTIFIER=sudo
_GID=1000
_COMM=sudo
_EXE=/usr/bin/sudo
_SELINUX_CONTEXT=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
_HOSTNAME=node-1
MESSAGE= vagrant : TTY=pts/0 ; PWD=/home/vagrant ; USER=root ; COMMAND=/bin/gravity report --help
_CMDLINE=sudo gravity report --help
_AUDIT_SESSION=105
_SYSTEMD_CGROUP=/user.slice/user-1000.slice/session-105.scope
_SYSTEMD_SESSION=105
_SYSTEMD_UNIT=session-105.scope
_PID=17162
_SOURCE_REALTIME_TIMESTAMP=1592041006223422

__CURSOR=s=05ed0a8050e54d6eb25b26ebd98fd6d3;i=1d5e;b=7a3504e73fcc4cb186958464f737b949;m=1ea97c80c1;t=5a7f3eea50796;x=6c580e6969270a4f
__REALTIME_TIMESTAMP=1592041006237590
__MONOTONIC_TIMESTAMP=131692527809
_BOOT_ID=7a3504e73fcc4cb186958464f737b949
PRIORITY=6
_UID=0
_GID=0
_MACHINE_ID=ee32f6c4a34e4799afc544610c083982
_CAP_EFFECTIVE=1fffffffff
_TRANSPORT=syslog
SYSLOG_FACILITY=10
_AUDIT_LOGINUID=1000
_SYSTEMD_OWNER_UID=1000
_SYSTEMD_SLICE=user-1000.slice
SYSLOG_IDENTIFIER=sudo
_COMM=sudo
_EXE=/usr/bin/sudo
_SELINUX_CONTEXT=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
_HOSTNAME=node-1
MESSAGE=pam_unix(sudo:session): session opened for user root by vagrant(uid=0)
_CMDLINE=sudo gravity report --help
_AUDIT_SESSION=105
_SYSTEMD_CGROUP=/user.slice/user-1000.slice/session-105.scope
_SYSTEMD_SESSION=105
_SYSTEMD_UNIT=session-105.scope
_PID=17162
_SOURCE_REALTIME_TIMESTAMP=1592041006236963

__CURSOR=s=05ed0a8050e54d6eb25b26ebd98fd6d3;i=1d5f;b=7a3504e73fcc4cb186958464f737b949;m=1ea97d9723;t=5a7f3eea61df7;x=db7124e0e9f6124
__REALTIME_TIMESTAMP=1592041006308855
__MONOTONIC_TIMESTAMP=131692599075
_BOOT_ID=7a3504e73fcc4cb186958464f737b949
PRIORITY=6
_UID=0
_GID=0
_MACHINE_ID=ee32f6c4a34e4799afc544610c083982
_CAP_EFFECTIVE=1fffffffff
_TRANSPORT=syslog
SYSLOG_FACILITY=10
_AUDIT_LOGINUID=1000
_SYSTEMD_OWNER_UID=1000
_SYSTEMD_SLICE=user-1000.slice
SYSLOG_IDENTIFIER=sudo
_COMM=sudo
_EXE=/usr/bin/sudo
_SELINUX_CONTEXT=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
_HOSTNAME=node-1
MESSAGE=pam_unix(sudo:session): session closed for user root
_CMDLINE=sudo gravity report --help
_AUDIT_SESSION=105
_SYSTEMD_CGROUP=/user.slice/user-1000.slice/session-105.scope
_SYSTEMD_SESSION=105
_SYSTEMD_UNIT=session-105.scope
_PID=17162
_SOURCE_REALTIME_TIMESTAMP=1592041006308612

__CURSOR=s=05ed0a8050e54d6eb25b26ebd98fd6d3;i=1d60;b=7a3504e73fcc4cb186958464f737b949;m=1eac0643e1;t=5a7f3f12ecab5;x=8661f252df6da835
__REALTIME_TIMESTAMP=1592041048820405
__MONOTONIC_TIMESTAMP=131735110625
_BOOT_ID=7a3504e73fcc4cb186958464f737b949
PRIORITY=6
_UID=0
_GID=0
_SYSTEMD_SLICE=system.slice
_MACHINE_ID=ee32f6c4a34e4799afc544610c083982
SYSLOG_FACILITY=3
_CAP_EFFECTIVE=1fffffffff
_TRANSPORT=stdout
_SELINUX_CONTEXT=system_u:system_r:init_t:s0
_HOSTNAME=node-1
SYSLOG_IDENTIFIER=gravity
_COMM=teleport
_CMDLINE=rootfs/usr/bin/teleport start
_STREAM_ID=a4ff5bb0a2e34ea39c31b31ec84dbc20
_PID=1412
_EXE=/var/lib/gravity/local/packages/unpacked/gravitational.io/teleport/3.0.5/rootfs/usr/bin/teleport
_SYSTEMD_CGROUP=/system.slice/gravity__gravitational.io__teleport__3.0.5.service
_SYSTEMD_UNIT=gravity__gravitational.io__teleport__3.0.5.service
MESSAGE=INFO [NODE]      [LOCAL EXEC] Started command: "gravity --insecure system report --filter system --compressed --since 1m0s" id:9 local:172.28.128.101:3022 login:root remote:127.0.0.1:56576 teleportUser:[email protected] srv/exec.go:171
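
For reference, a quick sketch of decoding the __REALTIME_TIMESTAMP fields above, which are microseconds since the Unix epoch:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// __REALTIME_TIMESTAMP in a journal export is microseconds since the Unix epoch.
	const usec = 1592041006226388
	t := time.UnixMicro(usec).UTC()     // time.UnixMicro requires Go 1.17+
	fmt.Println(t.Format(time.RFC3339)) // 2020-06-13T09:36:46Z
}
```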

@bernardjkim added the port/5.5 (Requires port to version/5.5.x) label Jun 16, 2020
@r0mant (Contributor) left a comment:

Also, if we could roll a few of those additional system collectors from #1273 into the same PR, that'd be cool (less forward-porting burden later).

Resolved review threads: lib/ops/operatoracl.go, lib/ops/opsservice/report.go (2), lib/report/k8s.go
@r0mant (Contributor) left a comment:

Mostly cosmetic feedback.

Review threads: lib/ops/ops.go (3), lib/report/k8s.go (2), lib/ops/opsservice/report.go, lib/report/system.go, tool/gravity/cli/register.go, lib/report/report.go, lib/ops/opshandler/opshandler.go, lib/webapi/webapi.go (all resolved)
@bernardjkim marked this pull request as ready for review June 18, 2020 02:46
@bernardjkim requested a review from a team June 18, 2020 02:46
```diff
@@ -851,6 +854,8 @@ const (

 	// EtcdLocalAddr is the local etcd address
 	EtcdLocalAddr = "https://127.0.0.1:2379"
+	// EtcdLocalMetricsAddr is address where etcd metrics are exposed
+	EtcdLocalMetricsAddr = "https://127.0.0.1:4001"
```

I believe the metrics can be accessed from the :2379 endpoint. It's our intention to consolidate these into a single port; I just don't think anyone has gotten around to it.

```go
// etcdMetrics fetches etcd metrics
func etcdMetrics() Collectors {
	return Collectors{
		Cmd("etcd-metrics", utils.PlanetCommandArgs("/usr/bin/curl", "-s", "--insecure", "--tlsv1.2",
```

I don't believe the insecure flag is needed or should be used here; I believe all the required certs/configuration are being passed for mTLS.

BTW, we hardcode insecure all over the place, so this isn't a new issue, but we should really be moving away from hardcoding insecure TLS connections when we have all the info needed to make a proper mTLS connection.
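
For reference, a proper mTLS client in Go would look roughly like the sketch below (same cert paths as the curl invocation above; illustrative, not the PR's code):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"io"
	"net/http"
	"os"
)

func main() {
	// Trust the cluster CA instead of passing --insecure.
	caPEM, err := os.ReadFile("/var/state/root.cert")
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	// Present the etcd client cert/key for mutual TLS.
	cert, err := tls.LoadX509KeyPair("/var/state/etcd.cert", "/var/state/etcd.key")
	if err != nil {
		panic(err)
	}

	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{
		RootCAs:      pool, // verify the server; no InsecureSkipVerify
		Certificates: []tls.Certificate{cert},
		MinVersion:   tls.VersionTLS12,
	}}}
	resp, err := client.Get("https://localhost:4001/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body)
}
```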

```diff
@@ -534,7 +534,8 @@ func RegisterCommands(app *kingpin.Application) *Application {

 	// get cluster diagnostics report
 	g.ReportCmd.CmdClause = g.Command("report", "Generate cluster diagnostics report")
-	g.ReportCmd.FilePath = g.ReportCmd.Flag("file", "target report file name").Default("report.tar.gz").String()
+	g.ReportCmd.FilePath = g.ReportCmd.Flag("file", "Target report file name").Default("report.tar.gz").String()
+	g.ReportCmd.Since = g.ReportCmd.Flag("since", "Only return logs newer than a relative duration like 5s, 2m, or 3h").Duration()
```
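
For context, kingpin's Duration() flag parses Go duration strings; a standalone usage sketch (assuming the gopkg.in/alecthomas/kingpin.v2 package the *kingpin.Application type above comes from):

```go
package main

import (
	"fmt"
	"os"

	"gopkg.in/alecthomas/kingpin.v2"
)

func main() {
	app := kingpin.New("report-demo", "Demo of a --since duration flag.")
	since := app.Flag("since", "Only return logs newer than a relative duration like 5s, 2m, or 3h").Duration()
	kingpin.MustParse(app.Parse(os.Args[1:]))
	fmt.Println("since:", *since) // e.g. --since 2m prints "since: 2m0s"
}
```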

I think we should set a default for since; something sensible would be 7 or 14 days.

As I see it, without a default, all of our customers (and customers of customers) need to be taught about this flag to get any value out of it. There's also potential for UX friction: for at least some time we'll have a number of clusters that don't have this flag, leading to unknown-flag errors when following an updated runbook or docs, for example. Basically, depending on the version of gravity deployed, the debug report has to be used differently to actually get a smaller report.

@bernardjkim (author) replied:

Yeah, that's a good point. We can set it to 14 days. I'll edit the flag description to document the default and note that passing 0s returns the full logs.
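
Concretely, that could be a one-line change to the registration shown above (hypothetical; 14 days is spelled 336h because Go durations have no day unit):

```go
// Hypothetical: default --since to 14 days; --since 0s disables filtering.
g.ReportCmd.Since = g.ReportCmd.Flag("since",
	"Only return logs newer than a relative duration like 5s, 2m, or 3h (0s returns full logs)").
	Default("336h").Duration()
```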

```go
	case FilterEtcd:
		collectors = append(collectors, etcdBackup()...)
		collectors = append(collectors, etcdMetrics()...)
```

I'm not familiar enough with the filters, but based on your sample output, this seems to collect metrics from only one node in the cluster. Some of the metrics are specific to the particular server, though, so this should likely be collected on each master node so that each server can be inspected individually. Be careful not to collect on worker nodes, as those connections will be randomly load-balanced to a master.

Unfortunately, what makes this even more complicated is that many of the metrics are not useful as a single snapshot. Many will be counters since startup, which say nothing on their own; it's only when you delta the current snapshot against one from a minute ago that you get a picture of what was happening over the last minute. This isn't isolated to etcd, either: I don't think we've planned how to pull any metric histories as part of debug reporting, so we might want to put together a plan for pulling out valuable metric histories.
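
To make the counter point concrete, a toy sketch of the delta/rate computation between two snapshots (the metric name is a real etcd counter; everything else is illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// snapshot maps a counter name to its cumulative value at capture time.
type snapshot struct {
	taken    time.Time
	counters map[string]float64
}

// rates computes per-second rates between two snapshots of cumulative counters.
func rates(prev, cur snapshot) map[string]float64 {
	secs := cur.taken.Sub(prev.taken).Seconds()
	out := make(map[string]float64)
	for name, v := range cur.counters {
		if p, ok := prev.counters[name]; ok && secs > 0 {
			out[name] = (v - p) / secs
		}
	}
	return out
}

func main() {
	prev := snapshot{time.Unix(100, 0), map[string]float64{"etcd_server_proposals_committed_total": 1000}}
	cur := snapshot{time.Unix(160, 0), map[string]float64{"etcd_server_proposals_committed_total": 1600}}
	fmt.Println(rates(prev, cur)) // map[etcd_server_proposals_committed_total:10]
}
```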

@bernardjkim (author) replied:

Updated so that etcd metrics are collected on all master nodes: fab170f.

We should definitely look into doing some additional processing of the metrics, but for this PR I think the single snapshot can still provide some useful information.

@bernardjkim merged commit bed853f into version/5.5.x Jun 23, 2020
@bernardjkim deleted the bernard/5.5.x/time-filter branch June 23, 2020 02:13
@a-palchikov (Contributor) commented Jun 23, 2020:

Sorry, I sort of missed this PR. Why was the Kubernetes logs collector removed? I liked how it grouped the logs by namespace, and it would also be beneficial to use the stock tools, which could provide more detail. Why can't they co-exist?

@a-palchikov commented:

Also, the new flag is not propagated to the planet journal log collector.
