Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL][ASH] Include PID in yb_active_session_history view #23070

Closed
1 task done
pdvmoto opened this issue Jul 1, 2024 · 8 comments
Closed
1 task done

[YSQL][ASH] Include PID in yb_active_session_history view #23070

pdvmoto opened this issue Jul 1, 2024 · 8 comments
Assignees
Labels
kind/new-feature This is a request for a completely new feature

Comments

@pdvmoto
Copy link

pdvmoto commented Jul 1, 2024

Description

Asking to include PID in yb_active_session_history. This field is already present in pg_stat_activity, which seems to contain similar data. the current ysql_session_id is just a local-counter, and is not much use "externally", wheras the linux-PID will allow linkup to an actual process.

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@pdvmoto pdvmoto added the kind/new-feature This is a request for a completely new feature label Jul 1, 2024
@hbhanawat
Copy link
Contributor

@pdvmoto Thanks for submitting the issue. Can you elaborate on the use case when PID will help and ysql_session_id will not? Since the PID is not available for sessions that have ended, I am not sure how to correlate that.

@FranckPachot
Copy link
Contributor

I can add some cases where PID may be helpful:

  • OOM killer killed a process. It is visible in the trace. Then, it can be helpful to know what this process was doing before
  • a long query is running. It is visible in pg_stat_activity (with its PID) and looking at ASH may help understand (queryid is not always set and not unique)
  • OS reports high CPU usage in TOP for a Postgres backend, may be good to look at ASH to understand what it is doing

@pdvmoto
Copy link
Author

pdvmoto commented Jul 9, 2024

What Franck Said.
The first use of PID would be for "immediate viewing" and troubleshooting.
plus: I am already working to save some of the data from ahs and pg_stat_activity. This will allow after-the-fact investigation to put together the (sampled) activity of a given session (pid + startdate).

The "problem" I have with sql_session_id: it doesnt seem to link to anything (yet) ?

and I realize the limitation of "sampled" information. but ASH is a Good Start!

@abhinab-yb abhinab-yb changed the title [New Feature] yb_active_session_history (ash): Asking to include PID [YSQL][ASH] Include PID in yb_active_session_history view Aug 6, 2024
@abhinab-yb abhinab-yb self-assigned this Aug 6, 2024
abhinab-yb added a commit that referenced this issue Aug 13, 2024
Summary:
yb_active_session_history had a field for the pg client session id,
but it's not very useful, because it's an internal counter and the
mapping between session id and pid is only present in the log files.
This makes it hard for the user to correlate session id with an
actual process.

This diff replaces the session id with pid which is more readily
available. The pid is of the process which is executing the query.
For YSQL, it will be pid of the backend which is executing the query,
for YCQL and background activities like flushes and compactions, it
will be pid of the TServer.

This diff also does some refactoring
- Rename YbSetAshClientAddrAndPort to YbAshSetOneTimeMetadata
- Move YbAshSetOneTimeMetadata to yb_ash.c

Upgrade/Rollback safety:

- Old node sends the PB to new node
-- Old node sets the session_id, but the new node ignores it
-- Old node cannot set the pid, so the new node takes the pid as 0, it's incorrect until the nodes are upgraded, and that's fine

- New node sends the PB to old node
-- New node sets the pid, the old node is not aware of it and ignores it
-- New node doesn't set the session_id, old node takes the session id as 0, it's also incorrect and fine

The reason that the incorrect values are fine -
- The data in the circular buffer is only supposed to be there for a few hours, so it's fine to have incorrect value of one field during the upgrade

'ysql_yb_enable_ash' Will be made an AutoFlag at GA time

Test Plan:
./yb_build.sh --java-test TestYsqlUpgrade
./yb_build.sh --java-test TestYbAsh#testYsqlPids

Manually tested that the pids are correct

```
yugabyte=# select wait_event_component, pid, client_node_ip, count(*) from yb_active_session_history group by wait_event_component, pid, client_node_ip order by pid, wait_event_component;
 wait_event_component |  pid  | client_node_ip  | count
----------------------+-------+-----------------+-------
 TServer              | 78409 |                 |   872
 TServer              | 78623 | 127.0.0.1:57719 |    17
 YSQL                 | 78623 | 127.0.0.1:57719 |    15
 YSQL                 | 78667 | 127.0.0.1:57736 |     3
 TServer              | 78672 | 127.0.0.1:57740 |    57
 YSQL                 | 78672 | 127.0.0.1:57740 |   348
 TServer              | 78673 | 127.0.0.1:57742 |    57
 YSQL                 | 78673 | 127.0.0.1:57742 |   356
 TServer              | 78674 | 127.0.0.1:57744 |    59
 YSQL                 | 78674 | 127.0.0.1:57744 |   352
 TServer              | 78675 | 127.0.0.1:57746 |    60
 YSQL                 | 78675 | 127.0.0.1:57746 |   344
 TServer              | 78676 | 127.0.0.1:57748 |    52
 YSQL                 | 78676 | 127.0.0.1:57748 |   349
 TServer              | 78677 | 127.0.0.1:57750 |    59
 YSQL                 | 78677 | 127.0.0.1:57750 |   348
 TServer              | 78678 | 127.0.0.1:57752 |    50
 YSQL                 | 78678 | 127.0.0.1:57752 |   344
 TServer              | 78679 | 127.0.0.1:57754 |    44
 YSQL                 | 78679 | 127.0.0.1:57754 |   350
 TServer              | 78680 | 127.0.0.1:57756 |    54
 YSQL                 | 78680 | 127.0.0.1:57756 |   348
 TServer              | 78681 | 127.0.0.1:57758 |    46
 YSQL                 | 78681 | 127.0.0.1:57758 |   358
 YSQL                 | 78693 | 127.0.0.1:57761 |     2

```

Output of `ps -A | grep yugabyte`

```
78672 ??         0:06.75 postgres: yugabyte yugabyte 127.0.0.1(57740) idle
78673 ??         0:06.71 postgres: yugabyte yugabyte 127.0.0.1(57742) INSERT
78674 ??         0:06.64 postgres: yugabyte yugabyte 127.0.0.1(57744) UPDATE
78675 ??         0:06.66 postgres: yugabyte yugabyte 127.0.0.1(57746) COMMIT
78676 ??         0:06.55 postgres: yugabyte yugabyte 127.0.0.1(57748) UPDATE
78677 ??         0:06.59 postgres: yugabyte yugabyte 127.0.0.1(57750) UPDATE
78678 ??         0:06.58 postgres: yugabyte yugabyte 127.0.0.1(57752) COMMIT
78679 ??         0:06.52 postgres: yugabyte yugabyte 127.0.0.1(57754) UPDATE
78680 ??         0:06.56 postgres: yugabyte yugabyte 127.0.0.1(57756) UPDATE
78681 ??         0:06.48 postgres: yugabyte yugabyte 127.0.0.1(57758) idle
78693 ??         0:00.18 postgres: yugabyte yugabyte 127.0.0.1(57761) idle
78409 ttys006    4:08.89 /Users/asaha/code/yugabyte-db/build/latest/bin/yb-tserver ...
```

Reviewers: jason, amitanand

Reviewed By: jason

Subscribers: hsunder, yql, amitanand, hbhanawat, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D37081
@abhinab-yb
Copy link
Contributor

Hi @pdvmoto @FranckPachot, thanks for raising this issue, we have replaced the ysql_session_id with pid. cc: @hbhanawat

@pdvmoto
Copy link
Author

pdvmoto commented Aug 13, 2024

Thx! Will this be available in any of the docker-images to test ? (I'm unable to compile from source myself)

jasonyb pushed a commit that referenced this issue Aug 13, 2024
Summary:
 bd1e19e [PLAT-14835]: Add extra transient YCQL index tables in xClusterTableConfig during GET calls and metrics.
 2715c58 [docs] updated yb version with shortcode (#23456)
 a2b5495 [PLAT-14912] docs warning for install root as subdirectory (#23470)
 53365b1 [#23422] YSQL: Disable random warmup feature by default for connection manager tests
 09d6e96 [#22876][#22835][#22773] CDCSDK: Add null checks & remove optimisation in table removal from CDC stream
 69d4052 [#22862] XCluster: Improving XCluster Index Base WAL Retention Policy
 706e97d [#23460] DocDB: Read vector index data
 b1a90b9 [#23428] docdb: Remove non-transactional snapshot code
 581648f [PLAT-13957] Update RBAC wrapper for xCluster DR
 fbaf945 Whitepaper on migration (#23468)
 f6af2f5 [PLAT-13936] Upgrade Grpc and its dependencies to fix CVEs
 d7027fe [PLAT-14892] Update PITR configuration step text
 92804ac [PLAT-14760] Use new xCluster sync API
 3cb8faf [PLAT-11243] Upgrade python requests to latest version
 Excluded: a036313 [#23070] YSQL, ASH: Replace ysql_session_id with pid
 4d2f71f [PLAT-14882] Retrieve userName from the attribute lists in case not found in dn
 99489c0 [PLAT-14909] Upgrade YBC version to 2.2.0.0-b4

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37272
@abhinab-yb
Copy link
Contributor

@pdvmoto I pushed a custom docker image with this change here. Please note that it's not an official release and only meant for testing

abhinab-yb added a commit that referenced this issue Aug 16, 2024
…d with pid

Summary:
- postinit.c
-- /* Yugabyte includes */
--- YB de8c183 introduced `#include <arpa/inet.h>` first, and then YB a036313 moved it to yb_ash.c
--- PG 55782d5 added the adjacent line `/* Yugabyte includes */`
--- Keep the `/* Yugabyte includes */` line and remove the `#include <arpa/inet.h>` line
-- static void YbEnsureSysTablePrefetchingStopped();
--- YB de8c183 introduced the function declaration `YbSetAshClientAddrAndPort` and YB a036313 renamed `YbSetAshClientAddrAndPort` to `YbAshSetOneTimeMetadata` and moved in to yb_ash.h/yb_ash.c
--- PG 55782d5 introduced the function declaration `InitPostgresImpl` and `YbEnsureSysTablePrefetchingStopped` adjacent to it
--- Keep `InitPostgresImpl` and `YbEnsureSysTablePrefetchingStopped` and remove `YbSetAshClientAddrAndPort`

The original diff summary:

yb_active_session_history had a field for the pg client session id,
but it's not very useful, because it's an internal counter and the
mapping between session id and pid is only present in the log files.
This makes it hard for the user to correlate session id with an
actual process.

This diff replaces the session id with pid which is more readily
available. The pid is of the process which is executing the query.
For YSQL, it will be pid of the backend which is executing the query,
for YCQL and background activities like flushes and compactions, it
will be pid of the TServer.

This diff also does some refactoring
- Rename YbSetAshClientAddrAndPort to YbAshSetOneTimeMetadata
- Move YbAshSetOneTimeMetadata to yb_ash.c

Upgrade/Rollback safety:

- Old node sends the PB to new node
-- Old node sets the session_id, but the new node ignores it
-- Old node cannot set the pid, so the new node takes the pid as 0, it's incorrect until the nodes are upgraded, and that's fine

- New node sends the PB to old node
-- New node sets the pid, the old node is not aware of it and ignores it
-- New node doesn't set the session_id, old node takes the session id as 0, it's also incorrect and fine

The reason that the incorrect values are fine -
- The data in the circular buffer is only supposed to be there for a few hours, so it's fine to have incorrect value of one field during the upgrade

'ysql_yb_enable_ash' Will be made an AutoFlag at GA time

Test Plan:
./yb_build.sh --java-test TestYsqlUpgrade
./yb_build.sh --java-test TestYbAsh#testYsqlPids

Manually tested that the pids are correct

```
yugabyte=# select wait_event_component, pid, client_node_ip, count(*) from yb_active_session_history group by wait_event_component, pid, client_node_ip order by pid, wait_event_component;
 wait_event_component |  pid  | client_node_ip  | count
----------------------+-------+-----------------+-------
 TServer              | 78409 |                 |   872
 TServer              | 78623 | 127.0.0.1:57719 |    17
 YSQL                 | 78623 | 127.0.0.1:57719 |    15
 YSQL                 | 78667 | 127.0.0.1:57736 |     3
 TServer              | 78672 | 127.0.0.1:57740 |    57
 YSQL                 | 78672 | 127.0.0.1:57740 |   348
 TServer              | 78673 | 127.0.0.1:57742 |    57
 YSQL                 | 78673 | 127.0.0.1:57742 |   356
 TServer              | 78674 | 127.0.0.1:57744 |    59
 YSQL                 | 78674 | 127.0.0.1:57744 |   352
 TServer              | 78675 | 127.0.0.1:57746 |    60
 YSQL                 | 78675 | 127.0.0.1:57746 |   344
 TServer              | 78676 | 127.0.0.1:57748 |    52
 YSQL                 | 78676 | 127.0.0.1:57748 |   349
 TServer              | 78677 | 127.0.0.1:57750 |    59
 YSQL                 | 78677 | 127.0.0.1:57750 |   348
 TServer              | 78678 | 127.0.0.1:57752 |    50
 YSQL                 | 78678 | 127.0.0.1:57752 |   344
 TServer              | 78679 | 127.0.0.1:57754 |    44
 YSQL                 | 78679 | 127.0.0.1:57754 |   350
 TServer              | 78680 | 127.0.0.1:57756 |    54
 YSQL                 | 78680 | 127.0.0.1:57756 |   348
 TServer              | 78681 | 127.0.0.1:57758 |    46
 YSQL                 | 78681 | 127.0.0.1:57758 |   358
 YSQL                 | 78693 | 127.0.0.1:57761 |     2

```

Output of `ps -A | grep yugabyte`

```
78672 ??         0:06.75 postgres: yugabyte yugabyte 127.0.0.1(57740) idle
78673 ??         0:06.71 postgres: yugabyte yugabyte 127.0.0.1(57742) INSERT
78674 ??         0:06.64 postgres: yugabyte yugabyte 127.0.0.1(57744) UPDATE
78675 ??         0:06.66 postgres: yugabyte yugabyte 127.0.0.1(57746) COMMIT
78676 ??         0:06.55 postgres: yugabyte yugabyte 127.0.0.1(57748) UPDATE
78677 ??         0:06.59 postgres: yugabyte yugabyte 127.0.0.1(57750) UPDATE
78678 ??         0:06.58 postgres: yugabyte yugabyte 127.0.0.1(57752) COMMIT
78679 ??         0:06.52 postgres: yugabyte yugabyte 127.0.0.1(57754) UPDATE
78680 ??         0:06.56 postgres: yugabyte yugabyte 127.0.0.1(57756) UPDATE
78681 ??         0:06.48 postgres: yugabyte yugabyte 127.0.0.1(57758) idle
78693 ??         0:00.18 postgres: yugabyte yugabyte 127.0.0.1(57761) idle
78409 ttys006    4:08.89 /Users/asaha/code/yugabyte-db/build/latest/bin/yb-tserver ...
```

Reviewers: jason, tfoucher

Reviewed By: jason

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37288
@pdvmoto
Copy link
Author

pdvmoto commented Sep 14, 2024

Thx, only see this now. Will check ASAP, early next week.

@pdvmoto
Copy link
Author

pdvmoto commented Sep 16, 2024

Positive! First Tests done. It Works Fine. I can identify processes from ash, and I can find ash-data per-process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/new-feature This is a request for a completely new feature
Projects
None yet
Development

No branches or pull requests

4 participants