This repository has been archived by the owner on Feb 9, 2024. It is now read-only.

[5.5.x] Add since flag to gravity report #1719

Merged
merged 15 commits into version/5.5.x from bernard/5.5.x/time-filter on Jun 23, 2020

Conversation

@bernardjkim (Contributor) commented Jun 16, 2020:

Description

This PR adds the --since flag to the gravity report command. The flag filters log entries by time; the current implementation only affects entries collected from kubectl logs and journalctl.
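
Mechanically, the flag just threads a time.Duration down to those two log sources. A minimal, self-contained sketch of the translation, assuming standard kubectl logs --since and journalctl --since semantics (the sinceArgs helper is illustrative, not gravity's actual API):

```go
package main

import (
	"fmt"
	"time"
)

// sinceArgs is a hypothetical helper showing how a --since duration
// could map onto the two log sources this PR filters.
func sinceArgs(since time.Duration) (kubectlArgs, journalctlArgs []string) {
	if since <= 0 {
		return nil, nil // zero duration: collect the full logs
	}
	// kubectl logs accepts a Go-style relative duration, e.g. --since=1m0s.
	kubectlArgs = []string{"logs", fmt.Sprintf("--since=%s", since)}
	// journalctl accepts an absolute timestamp, e.g. --since "2020-06-13 09:36:46".
	cutoff := time.Now().Add(-since).Format("2006-01-02 15:04:05")
	journalctlArgs = []string{"--since", cutoff}
	return kubectlArgs, journalctlArgs
}

func main() {
	k, j := sinceArgs(time.Minute)
	fmt.Println(k, j)
}
```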

It also adds a few more system collectors that have come up recently (a rough collection sketch follows the list):

  • Reboots history: last -x
  • Kernel information: uname -a
  • Serf cluster: serf members
  • Swap status: swapon -s (we already have vmstat, but it wouldn't hurt)
  • Etcd metrics: curl -s --insecure --tlsv1.2 --cacert /var/state/root.cert --cert /var/state/etcd.cert --key /var/state/etcd.key https://localhost:4001/metrics
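
For illustration, a self-contained sketch of gathering these outputs with plain os/exec (gravity's real Collectors/Cmd API isn't reproduced here, and the serf/etcd commands would additionally need to run inside the planet container):

```go
package main

import (
	"fmt"
	"os/exec"
)

// collector pairs a report file name with the command that produces it.
type collector struct {
	name string
	cmd  []string
}

func main() {
	collectors := []collector{
		{"reboot-history", []string{"last", "-x"}},
		{"uname", []string{"uname", "-a"}},
		{"serf-members", []string{"serf", "members"}},
		{"swapon", []string{"swapon", "-s"}},
	}
	for _, c := range collectors {
		out, err := exec.Command(c.cmd[0], c.cmd[1:]...).CombinedOutput()
		if err != nil {
			fmt.Printf("%s: %v\n", c.name, err)
			continue
		}
		fmt.Printf("=== %s ===\n%s\n", c.name, out)
	}
}
```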

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change has a user-facing impact

Linked tickets and other PRs

TODOs

  • Self-review the change
  • Perform manual testing
  • Address review feedback

Testing done

Verify additional collectors

[gravity-report] tree report
report
├── cluster.json
├── node-1-debug-logs
│   ├── bridge-fdb
│   ├── cluster-status
│   ├── df
│   ├── df-inodes
│   ├── dmesg
│   ├── dmsetup
│   ├── etcd-status
│   ├── etc-logs.tar.gz
│   ├── fdisk
│   ├── free
│   ├── gravity-install.log
│   ├── gravity-packages.yaml
│   ├── gravity-system.log
│   ├── gravity-system.log.gz
│   ├── host-system-failed
│   ├── host-system-jobs
│   ├── host-system-status
│   ├── ifconfig
│   ├── ip-addr
│   ├── ip-link
│   ├── ip-maddress
│   ├── ip-neighbor
│   ├── ip-netconf
│   ├── ip-ntable
│   ├── ip-route
│   ├── ip-rule
│   ├── iptables
│   ├── ip-tcp_metrics
│   ├── lsblk
│   ├── lscpu
│   ├── lsmod
│   ├── planet-journal-export.log.gz
│   ├── planet-status
│   ├── planet-system-failed
│   ├── planet-system-jobs
│   ├── planet-system-status
│   ├── reboot-history  <---
│   ├── route
│   ├── running-processes
│   ├── serf-members  <---
│   ├── slabinfo
│   ├── slabtop
│   ├── swapon <---
│   ├── uname  <---
│   └── vmstat
├── node-1-debug-logs.tar.gz
├── node-2-etcd
│   ├── etcd-backup.json
│   └── etcd-metrics  <---
├── node-2-etcd.tar.gz
[...]

Verify kubectl logs are filtered by time

  • --since 1m
[vagrant@node-1 node-2-k8s-logs]$ cat k8s-logs-kube-system-gravity-site-kn59t-gravity-site
2020-06-13T09:36:30Z WARN [PROXY:AGE] Failed to create remote tunnel: dial tcp 172.28.128.101:61024: connect: connection refused, conn: <nil>. target:172.28.128.101:61024 reversetunnel/agent.go:447
2020-06-13T09:36:30Z INFO             2020/06/13 09:36:30 http: TLS handshake error from 127.0.0.1:32834: tls: internal error: failed to clone hash runtime/asm_amd64.s:1373
2020-06-13T09:36:40Z WARN [PROXY:AGE] Failed to create remote tunnel: dial tcp 172.28.128.101:61024: connect: connection refused, conn: <nil>. target:172.28.128.101:61024 reversetunnel/agent.go:447
2020-06-13T09:36:40Z INFO             2020/06/13 09:36:40 http: TLS handshake error from 127.0.0.1:32900: tls: internal error: failed to clone hash runtime/asm_amd64.s:1373
2020-06-13T09:36:40Z WARN [TELEPROXY] "Failed to renew certificate for [email protected]: \nERROR REPORT:\nOriginal Error: *trace.ConnectionProblemError remote error: tls: internal error\nStack Trace:\n\t/home/bernard/go/src/github.com/gravitational/gravity/vendor/github.com/gravitational/teleport/lib/httplib/httplib.go:110 github.com/gravitational/gravity/vendor/github.com/gravitational/teleport/lib/httplib.ConvertResponse\n\t/home/bernard/go/src/github.com/gravitational/gravity/vendor/github.com/gravitational/teleport/lib/auth/clt.go:211 github.com/gravitational/gravity/vendor/github.com/gravitational/teleport/lib/auth.(*Client).PostJSON\n\t/home/bernard/go/src/github.com/gravitational/gravity/vendor/github.com/gravitational/teleport/lib/auth/clt.go:1169 github.com/gravitational/gravity/vendor/github.com/gravitational/teleport/lib/auth.(*Client).GenerateUserCert\n\t/home/bernard/go/src/github.com/gravitational/gravity/lib/process/proxy.go:125 github.com/gravitational/gravity/lib/process.(*teleportProxyService).initAuthMethods.func1\n\t/home/bernard/go/src/github.com/gravitational/gravity/lib/process/proxy.go:154 github.com/gravitational/gravity/lib/process.(*teleportProxyService).initAuthMethods\n\t/usr/lib/go-1.14/src/runtime/asm_amd64.s:1373 runtime.goexit\nUser Message: remote error: tls: internal error." process/proxy.go:155
2020-06-13T09:36:50Z WARN [PROXY:AGE] Failed to create remote tunnel: dial tcp 172.28.128.101:61024: connect: connection refused, conn: <nil>. target:172.28.128.101:61024 reversetunnel/agent.go:447
2020-06-13T09:36:58Z INFO [RBAC]      Access to create user in namespace default denied to roles Node,default-implicit-role: no allow rule matched. services/role.go:1838
2020-06-13T09:37:00Z WARN [PROXY:AGE] Failed to create remote tunnel: dial tcp 172.28.128.101:61024: connect: connection refused, conn: <nil>. target:172.28.128.101:61024 reversetunnel/agent.go:447
2020-06-13T09:37:10Z WARN [PROXY:AGE] Failed to create remote tunnel: dial tcp 172.28.128.101:61024: connect: connection refused, conn: <nil>. target:172.28.128.101:61024 reversetunnel/agent.go:447
2020-06-13T09:37:20Z WARN [PROXY:AGE] Failed to create remote tunnel: dial tcp 172.28.128.101:61024: connect: connection refused, conn: <nil>. target:172.28.128.101:61024 reversetunnel/agent.go:447

Verify journal logs are filtered by time

  • --since 1m
  • Not entirely sure how to read these timestamps (__REALTIME_TIMESTAMP appears to be microseconds since the Unix epoch, so 1592041006226388 is 2020-06-13T09:36:46Z), but there are far fewer log entries than in a report generated without a time filter.
[vagrant@node-1 node-1-debug-logs]$ cat gravity-system.log
__CURSOR=s=05ed0a8050e54d6eb25b26ebd98fd6d3;i=1d5d;b=7a3504e73fcc4cb186958464f737b949;m=1ea97c5500;t=5a7f3eea4dbd4;x=bfdc13f631df90fd
__REALTIME_TIMESTAMP=1592041006226388
__MONOTONIC_TIMESTAMP=131692516608
_BOOT_ID=7a3504e73fcc4cb186958464f737b949
_UID=0
_MACHINE_ID=ee32f6c4a34e4799afc544610c083982
PRIORITY=5
_CAP_EFFECTIVE=1fffffffff
_TRANSPORT=syslog
SYSLOG_FACILITY=10
_AUDIT_LOGINUID=1000
_SYSTEMD_OWNER_UID=1000
_SYSTEMD_SLICE=user-1000.slice
SYSLOG_IDENTIFIER=sudo
_GID=1000
_COMM=sudo
_EXE=/usr/bin/sudo
_SELINUX_CONTEXT=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
_HOSTNAME=node-1
MESSAGE= vagrant : TTY=pts/0 ; PWD=/home/vagrant ; USER=root ; COMMAND=/bin/gravity report --help
_CMDLINE=sudo gravity report --help
_AUDIT_SESSION=105
_SYSTEMD_CGROUP=/user.slice/user-1000.slice/session-105.scope
_SYSTEMD_SESSION=105
_SYSTEMD_UNIT=session-105.scope
_PID=17162
_SOURCE_REALTIME_TIMESTAMP=1592041006223422

__CURSOR=s=05ed0a8050e54d6eb25b26ebd98fd6d3;i=1d5e;b=7a3504e73fcc4cb186958464f737b949;m=1ea97c80c1;t=5a7f3eea50796;x=6c580e6969270a4f
__REALTIME_TIMESTAMP=1592041006237590
__MONOTONIC_TIMESTAMP=131692527809
_BOOT_ID=7a3504e73fcc4cb186958464f737b949
PRIORITY=6
_UID=0
_GID=0
_MACHINE_ID=ee32f6c4a34e4799afc544610c083982
_CAP_EFFECTIVE=1fffffffff
_TRANSPORT=syslog
SYSLOG_FACILITY=10
_AUDIT_LOGINUID=1000
_SYSTEMD_OWNER_UID=1000
_SYSTEMD_SLICE=user-1000.slice
SYSLOG_IDENTIFIER=sudo
_COMM=sudo
_EXE=/usr/bin/sudo
_SELINUX_CONTEXT=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
_HOSTNAME=node-1
MESSAGE=pam_unix(sudo:session): session opened for user root by vagrant(uid=0)
_CMDLINE=sudo gravity report --help
_AUDIT_SESSION=105
_SYSTEMD_CGROUP=/user.slice/user-1000.slice/session-105.scope
_SYSTEMD_SESSION=105
_SYSTEMD_UNIT=session-105.scope
_PID=17162
_SOURCE_REALTIME_TIMESTAMP=1592041006236963

__CURSOR=s=05ed0a8050e54d6eb25b26ebd98fd6d3;i=1d5f;b=7a3504e73fcc4cb186958464f737b949;m=1ea97d9723;t=5a7f3eea61df7;x=db7124e0e9f6124
__REALTIME_TIMESTAMP=1592041006308855
__MONOTONIC_TIMESTAMP=131692599075
_BOOT_ID=7a3504e73fcc4cb186958464f737b949
PRIORITY=6
_UID=0
_GID=0
_MACHINE_ID=ee32f6c4a34e4799afc544610c083982
_CAP_EFFECTIVE=1fffffffff
_TRANSPORT=syslog
SYSLOG_FACILITY=10
_AUDIT_LOGINUID=1000
_SYSTEMD_OWNER_UID=1000
_SYSTEMD_SLICE=user-1000.slice
SYSLOG_IDENTIFIER=sudo
_COMM=sudo
_EXE=/usr/bin/sudo
_SELINUX_CONTEXT=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
_HOSTNAME=node-1
MESSAGE=pam_unix(sudo:session): session closed for user root
_CMDLINE=sudo gravity report --help
_AUDIT_SESSION=105
_SYSTEMD_CGROUP=/user.slice/user-1000.slice/session-105.scope
_SYSTEMD_SESSION=105
_SYSTEMD_UNIT=session-105.scope
_PID=17162
_SOURCE_REALTIME_TIMESTAMP=1592041006308612

__CURSOR=s=05ed0a8050e54d6eb25b26ebd98fd6d3;i=1d60;b=7a3504e73fcc4cb186958464f737b949;m=1eac0643e1;t=5a7f3f12ecab5;x=8661f252df6da835
__REALTIME_TIMESTAMP=1592041048820405
__MONOTONIC_TIMESTAMP=131735110625
_BOOT_ID=7a3504e73fcc4cb186958464f737b949
PRIORITY=6
_UID=0
_GID=0
_SYSTEMD_SLICE=system.slice
_MACHINE_ID=ee32f6c4a34e4799afc544610c083982
SYSLOG_FACILITY=3
_CAP_EFFECTIVE=1fffffffff
_TRANSPORT=stdout
_SELINUX_CONTEXT=system_u:system_r:init_t:s0
_HOSTNAME=node-1
SYSLOG_IDENTIFIER=gravity
_COMM=teleport
_CMDLINE=rootfs/usr/bin/teleport start
_STREAM_ID=a4ff5bb0a2e34ea39c31b31ec84dbc20
_PID=1412
_EXE=/var/lib/gravity/local/packages/unpacked/gravitational.io/teleport/3.0.5/rootfs/usr/bin/teleport
_SYSTEMD_CGROUP=/system.slice/gravity__gravitational.io__teleport__3.0.5.service
_SYSTEMD_UNIT=gravity__gravitational.io__teleport__3.0.5.service
MESSAGE=INFO [NODE]      [LOCAL EXEC] Started command: "gravity --insecure system report --filter system --compressed --since 1m0s" id:9 local:172.28.128.101:3022 login:root remote:127.0.0.1:56576 teleportUser:[email protected] srv/exec.go:171
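
For reference, a quick sketch of decoding the __REALTIME_TIMESTAMP fields above, which are microseconds since the Unix epoch:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// __REALTIME_TIMESTAMP in a journal export is microseconds since the Unix epoch.
	const usec = 1592041006226388
	t := time.UnixMicro(usec).UTC()     // time.UnixMicro requires Go 1.17+
	fmt.Println(t.Format(time.RFC3339)) // 2020-06-13T09:36:46Z
}
```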

@bernardjkim added the port/5.5 (Requires port to version/5.5.x) label Jun 16, 2020
@r0mant (Contributor) left a comment:

Also, if we could roll a few of those additional system collectors from #1273 into the same PR, that'd be cool (less forward-porting burden later).

Resolved review threads: lib/ops/operatoracl.go, lib/ops/opsservice/report.go (2), lib/report/k8s.go
@r0mant (Contributor) left a comment:

Mostly cosmetic feedback.

Review threads: lib/ops/ops.go (3), lib/report/k8s.go (2), lib/ops/opsservice/report.go, lib/report/system.go, tool/gravity/cli/register.go, lib/report/report.go, lib/ops/opshandler/opshandler.go, lib/webapi/webapi.go (all resolved)
@bernardjkim marked this pull request as ready for review June 18, 2020 02:46
@bernardjkim requested a review from a team June 18, 2020 02:46
```diff
@@ -851,6 +854,8 @@ const (

 	// EtcdLocalAddr is the local etcd address
 	EtcdLocalAddr = "https://127.0.0.1:2379"
+	// EtcdLocalMetricsAddr is address where etcd metrics are exposed
+	EtcdLocalMetricsAddr = "https://127.0.0.1:4001"
```

I believe the metrics can be accessed from the :2379 endpoint. It's our intention to consolidate these into a single port; I just don't think anyone has gotten around to it.

```go
// etcdMetrics fetches etcd metrics
func etcdMetrics() Collectors {
	return Collectors{
		Cmd("etcd-metrics", utils.PlanetCommandArgs("/usr/bin/curl", "-s", "--insecure", "--tlsv1.2",
```

I don't believe the insecure flag is needed or should be used here; I believe all the required certs/configuration are being passed for mTLS.

BTW, we hardcode insecure all over the place, so this isn't a new issue, but we should really be moving away from hardcoding insecure TLS connections when we have all the info needed to make a proper mTLS connection.
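
For reference, a proper mTLS client in Go would look roughly like the sketch below (same cert paths as the curl invocation above; illustrative, not the PR's code):

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"io"
	"net/http"
	"os"
)

func main() {
	// Trust the cluster CA instead of passing --insecure.
	caPEM, err := os.ReadFile("/var/state/root.cert")
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	// Present the etcd client cert/key for mutual TLS.
	cert, err := tls.LoadX509KeyPair("/var/state/etcd.cert", "/var/state/etcd.key")
	if err != nil {
		panic(err)
	}

	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{
		RootCAs:      pool, // verify the server; no InsecureSkipVerify
		Certificates: []tls.Certificate{cert},
		MinVersion:   tls.VersionTLS12,
	}}}
	resp, err := client.Get("https://localhost:4001/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body)
}
```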

```diff
@@ -534,7 +534,8 @@ func RegisterCommands(app *kingpin.Application) *Application {

 	// get cluster diagnostics report
 	g.ReportCmd.CmdClause = g.Command("report", "Generate cluster diagnostics report")
-	g.ReportCmd.FilePath = g.ReportCmd.Flag("file", "target report file name").Default("report.tar.gz").String()
+	g.ReportCmd.FilePath = g.ReportCmd.Flag("file", "Target report file name").Default("report.tar.gz").String()
+	g.ReportCmd.Since = g.ReportCmd.Flag("since", "Only return logs newer than a relative duration like 5s, 2m, or 3h").Duration()
```
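
For context, kingpin's Duration() flag parses Go duration strings; a standalone usage sketch (assuming the gopkg.in/alecthomas/kingpin.v2 package the *kingpin.Application type above comes from):

```go
package main

import (
	"fmt"
	"os"

	"gopkg.in/alecthomas/kingpin.v2"
)

func main() {
	app := kingpin.New("report-demo", "Demo of a --since duration flag.")
	since := app.Flag("since", "Only return logs newer than a relative duration like 5s, 2m, or 3h").Duration()
	kingpin.MustParse(app.Parse(os.Args[1:]))
	fmt.Println("since:", *since) // e.g. --since 2m prints "since: 2m0s"
}
```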

I think we should set a default for since; something sensible would be 7 or 14 days.

As I see it, without a default, all of our customers (and customers of customers) need to be taught about this flag to get any value out of it. There's also potential for UX friction: for at least some time we'll have a number of clusters that don't have this flag, leading to unknown-flag errors when following an updated runbook or docs, for example. Basically, depending on the version of gravity deployed, the debug report has to be used differently to actually get a smaller report.

@bernardjkim (author) replied:

Yeah, that's a good point. We can set it to 14 days. I'll edit the flag description to document the default and note that passing 0s returns the full logs.
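
Concretely, that could be a one-line change to the registration shown above (hypothetical; 14 days is spelled 336h because Go durations have no day unit):

```go
// Hypothetical: default --since to 14 days; --since 0s disables filtering.
g.ReportCmd.Since = g.ReportCmd.Flag("since",
	"Only return logs newer than a relative duration like 5s, 2m, or 3h (0s returns full logs)").
	Default("336h").Duration()
```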

```go
	case FilterEtcd:
		collectors = append(collectors, etcdBackup()...)
		collectors = append(collectors, etcdMetrics()...)
```

I'm not familiar enough with the filters, but based on your sample output, this seems to collect metrics from only one node in the cluster. Some of the metrics are specific to the particular server, though, so this should likely be collected on each master node so that each server can be inspected individually. Be careful not to collect on worker nodes, as those connections will be randomly load-balanced to a master.

Unfortunately, what makes this even more complicated is that many of the metrics are not useful as a single snapshot. Many will be counters since startup, which say nothing on their own; it's only when you delta the current snapshot against one from a minute ago that you get a picture of what was happening over the last minute. This isn't isolated to etcd, either: I don't think we've planned how to pull any metric histories as part of debug reporting, so we might want to put together a plan for pulling out valuable metric histories.
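
To make the counter point concrete, a toy sketch of the delta/rate computation between two snapshots (the metric name is a real etcd counter; everything else is illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// snapshot maps a counter name to its cumulative value at capture time.
type snapshot struct {
	taken    time.Time
	counters map[string]float64
}

// rates computes per-second rates between two snapshots of cumulative counters.
func rates(prev, cur snapshot) map[string]float64 {
	secs := cur.taken.Sub(prev.taken).Seconds()
	out := make(map[string]float64)
	for name, v := range cur.counters {
		if p, ok := prev.counters[name]; ok && secs > 0 {
			out[name] = (v - p) / secs
		}
	}
	return out
}

func main() {
	prev := snapshot{time.Unix(100, 0), map[string]float64{"etcd_server_proposals_committed_total": 1000}}
	cur := snapshot{time.Unix(160, 0), map[string]float64{"etcd_server_proposals_committed_total": 1600}}
	fmt.Println(rates(prev, cur)) // map[etcd_server_proposals_committed_total:10]
}
```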

@bernardjkim (author) replied:

Updated so that etcd metrics are collected on all master nodes: fab170f.

We should definitely look into doing some additional processing of the metrics, but for this PR I think the single snapshot can still provide some useful information.

@bernardjkim merged commit bed853f into version/5.5.x Jun 23, 2020
@bernardjkim deleted the bernard/5.5.x/time-filter branch June 23, 2020 02:13
@a-palchikov (Contributor) commented Jun 23, 2020:

Sorry, I sort of missed this PR. Why was the Kubernetes logs collector removed? I liked how it grouped the logs by namespace, and it would also be beneficial to use the stock tools, which could provide more detail. Why can't they co-exist?

@a-palchikov commented:

Also, the new flag is not propagated to the planet journal log collector.
