tsh db ls and tsh ls taking 10-30 seconds to execute #11159

Closed
pschisa opened this issue Mar 15, 2022 · 11 comments
Labels
bug c-bi Internal Customer Reference

Comments

@pschisa
Contributor

pschisa commented Mar 15, 2022

Description

What happened:
`tsh db ls` takes anywhere from 10 to 30 seconds to complete for DBA users running tsh client v8.1.0. Listing the same database records from the Web Console takes less time and performs consistently better.

In-memory caching is already enabled. There are ~3.5k nodes and ~2k databases.

The commands perform significantly better if the role returns no results for the ls.

Most of the added time appears to fall between these logging steps:

tsh db ls

[2022-03-15 20:27:11] DEBU [KEYSTORE]  Returning Teleport TLS certificate "xxx.pem" valid until "2022-03-15 14:26:55 +0000 UTC". client/keystore.go:285
[2022-03-15 20:27:11] DEBU [CLIENT]    Client  is connecting to auth server on cluster "xxx". client/client.go:896
[2022-03-15 20:27:20] DEBU [KEYSTORE]  Returning Teleport TLS certificate "xxx.pem" valid until "2022-03-15 14:26:55 +0000 UTC". client/keystore.go:285
[2022-03-15 20:27:20] DEBU [KEYSTORE]  Reading certificates from path "xxx.pub". client/keystore.go:308

tsh ls

[2022-03-09 13:32:25] DEBU [KEYSTORE]  Returning Teleport TLS certificate "xxx" valid until "2022-03-09 06:36:50 +0000 UTC". client/keystore.go:283
[2022-03-09 13:32:25] DEBU [CLIENT]    Client  is connecting to auth server on cluster "xxx". client/client.go:850
[2022-03-09 13:32:36] Node Name      Address    Labels 

Let me know if additional logging is required and I can sanitize more of the logs.
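
To make the pause in the debug output above easier to spot, here is a minimal sketch that scans timestamped tsh debug lines and reports the largest gap between consecutive entries. The helper name `largest_gap` is hypothetical and not part of tsh; the log lines are the ones quoted in this comment.

```python
import re
from datetime import datetime

# Matches the leading "[YYYY-MM-DD HH:MM:SS]" timestamp on each debug line.
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]")


def largest_gap(lines):
    """Return (seconds, line_before, line_after) for the longest pause
    between consecutive timestamped lines."""
    stamped = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            stamped.append((when, line))
    best = (0.0, None, None)
    for (t0, before), (t1, after) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if gap > best[0]:
            best = (gap, before, after)
    return best


db_ls_log = [
    '[2022-03-15 20:27:11] DEBU [KEYSTORE]  Returning Teleport TLS certificate "xxx.pem" valid until "2022-03-15 14:26:55 +0000 UTC". client/keystore.go:285',
    '[2022-03-15 20:27:11] DEBU [CLIENT]    Client  is connecting to auth server on cluster "xxx". client/client.go:896',
    '[2022-03-15 20:27:20] DEBU [KEYSTORE]  Returning Teleport TLS certificate "xxx.pem" valid until "2022-03-15 14:26:55 +0000 UTC". client/keystore.go:285',
    '[2022-03-15 20:27:20] DEBU [KEYSTORE]  Reading certificates from path "xxx.pub". client/keystore.go:308',
]

ls_log = [
    '[2022-03-09 13:32:25] DEBU [KEYSTORE]  Returning Teleport TLS certificate "xxx" valid until "2022-03-09 06:36:50 +0000 UTC". client/keystore.go:283',
    '[2022-03-09 13:32:25] DEBU [CLIENT]    Client  is connecting to auth server on cluster "xxx". client/client.go:850',
    '[2022-03-09 13:32:36] Node Name      Address    Labels',
]

for name, log in (("tsh db ls", db_ls_log), ("tsh ls", ls_log)):
    seconds, before, _ = largest_gap(log)
    print(f"{name}: {seconds:.0f}s pause after: {before}")
```

On these excerpts the longest pause is 9 seconds for `tsh db ls` and 11 seconds for `tsh ls`, and in both cases it follows the "connecting to auth server" line.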

@pschisa pschisa added bug c-bi Internal Customer Reference labels Mar 15, 2022
@pschisa
Contributor Author

pschisa commented Mar 16, 2022

Confirmed that this issue also happens with a new local user that has the default admin role assigned, so it does not seem related to roles.

@russjones
Contributor

@rosstimothy As we discussed, let's start with putting print statements at the DEBUG level into tsh to find out where the worst latency is. Paginated ls should help improve this situation as well.

@jeffery-jen

I raised this issue a while back, and logs are provided at https://support.goteleport.com/hc/en-us/requests/4310

@jeffery-jen

We tested downloading a 50MB binary through a sidecar NGINX container. Users who had trouble loading results through the web UI and the tsh CLI had no issue downloading that content.

We reviewed the Chrome RTT analysis for the web UI. The majority of the time is spent in TTFB, which means server-side processing is the major contributor to the API slowness.

Do we have any update on working towards a fix?

@pschisa
Contributor Author

pschisa commented Mar 29, 2022

I believe the pagination work at #11019 is going to help resolve this issue. Is that the correct PR @rosstimothy?

@rosstimothy
Contributor

I believe that is the last hurdle to clear in order to have RFD 55 fully implemented - @kimlisa can correct me if I am wrong. There have been several other PRs for RFD 55 that have already landed though, like #10980 which could potentially help with tsh db ls slowness.

@kimlisa
Contributor

kimlisa commented Mar 29, 2022

Hm... I don't think RFD 55 helps speed up tsh ls (it was mainly meant to improve the speed of the UI and to add filter capabilities to both the UI and the CLI tools). I can see speed improving if users use the filter options (if this is what you meant) to get fewer results back, which means fewer trips to the backend. By default everything is listed, which in the case of 3.5k servers takes 4 trips to the backend because we currently fetch 1k per loop.
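
The trip count above can be sketched with a small simulation. This is a minimal illustration, not Teleport's client code: `list_page` and `fetch_all` are hypothetical stand-ins for a paginated list request and the loop around it, and the 1k page size is the figure mentioned in this comment. The filtered case assumes a filter that is applied before pages are returned and that happens to match one node in ten.

```python
def list_page(resources, start, limit):
    """Hypothetical stand-in for one paginated list request.

    Returns the page and the next start index, or None when no more
    pages remain."""
    page = resources[start:start + limit]
    next_start = start + limit if start + limit < len(resources) else None
    return page, next_start


def fetch_all(resources, limit=1000):
    """Fetch every page in a loop and count the round trips it took."""
    results, trips, start = [], 0, 0
    while start is not None:
        page, start = list_page(resources, start, limit)
        results.extend(page)
        trips += 1
    return results, trips


nodes = list(range(3500))                     # ~3.5k nodes, as in this issue
_, unfiltered_trips = fetch_all(nodes)

matching = [n for n in nodes if n % 10 == 0]  # assumed filter: 1 node in 10
_, filtered_trips = fetch_all(matching)

print(unfiltered_trips, filtered_trips)
```

With 3,500 nodes and 1k per loop the unfiltered listing needs 4 trips, while the assumed filter leaves 350 nodes, which fit in a single trip.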

The api got switched to paginated starting here (for databases and apps) https://github.com/gravitational/teleport/pull/9458/files#diff-f5e514e352c12c01eab507e6d9bbdfa697791b35bf914d949b92245a8e950447R1084, which looking through v8 commits, would've been part of release since 8.0.6. For nodes i believe it was released even earlier. The work I added on top of these existing apis was mostly for filter capabilities (and extending pagination to other resources)

@jeffery-jen

  1. Download/upload tests through the NGINX sidecar in the Teleport Proxy ECS task produce similar results for people affected by slow tsh ls and tsh db ls and for people not severely affected by it.

  2. The web UI RTT graph for the database nodes fetch indicates TTFB as the primary contributor to the total time spent in the request.

  3. What is the root cause of the slow tsh ls and tsh db ls?

@jeffery-jen

Update on the issue

Our findings show that a role with a more complicated permission setup performs very differently from a simpler one.

# Reference: https://goteleport.com/teleport/docs/enterprise/ssh-rbac/#introduction
kind: role
version: v4
metadata:
  name: dba
spec:
  options:
    max_session_ttl: 4h
  allow:
    logins: ['user']
    db_labels:
      '*': '*'
    db_users:
      - '*'
    kubernetes_groups: []
    node_labels:
      "_access_team": "*dba*"
    request:
      roles: []
  deny:
    node_labels:
      "_access_team": "*dbcore*"
    db_labels:
      "instance-name": ["account"]
    rules: []

With the deny db_labels rule, the request takes 3s from the web UI and 5 minutes from the CLI. Without it, the web UI takes less than 1s and the CLI takes 10s.
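
For readers unfamiliar with how allow and deny label selectors interact, here is a rough sketch of which resources the role above should return. It is not Teleport's matcher: `labels_match` and `can_list` are hypothetical names, and glob matching through `fnmatchcase` is an assumption standing in for Teleport's own label matching. The sketch only shows that a deny rule is checked before an allow rule for each resource; it says nothing about where the time is spent.

```python
from fnmatch import fnmatchcase

# The dba role from this comment, reduced to its label selectors.
role = {
    "allow": {
        "db_labels": {"*": "*"},
        "node_labels": {"_access_team": "*dba*"},
    },
    "deny": {
        "node_labels": {"_access_team": "*dbcore*"},
        "db_labels": {"instance-name": ["account"]},
    },
}


def labels_match(selector, labels):
    """True if every key in the selector is satisfied by the resource labels.

    An empty selector matches nothing. Glob matching is an assumption."""
    if not selector:
        return False
    for key, patterns in selector.items():
        if isinstance(patterns, str):
            patterns = [patterns]
        if key == "*" and "*" in patterns:
            continue  # wildcard selector: matches any resource
        value = labels.get(key)
        if value is None or not any(fnmatchcase(value, p) for p in patterns):
            return False
    return True


def can_list(role, kind, labels):
    """Deny selectors are checked first; allow selectors only if not denied."""
    key = f"{kind}_labels"
    if labels_match(role["deny"].get(key, {}), labels):
        return False
    return labels_match(role["allow"].get(key, {}), labels)


print(can_list(role, "db", {"instance-name": "orders"}))
print(can_list(role, "db", {"instance-name": "account"}))
print(can_list(role, "node", {"_access_team": "dba-east"}))
print(can_list(role, "node", {"_access_team": "dba-dbcore"}))
```

Under these assumptions, the database labeled `instance-name: account` and the node whose `_access_team` contains `dbcore` are filtered out, while the other two resources are listed.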

@pschisa
Contributor Author

pschisa commented May 10, 2022

This issue should be resolved by #12501, which will be available in the next published v8 release.

@pschisa
Contributor Author

pschisa commented Jun 24, 2022

Confirmed this issue is resolved with the latest patches

@pschisa pschisa closed this as completed Jun 24, 2022