-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YSQL] Postgres crash in `SetLatch(latch=0x024) at latch.c #20309
Labels
kind/bug
This issue is a bug
priority/highest
Highest priority issue
qa_automation
Bugs identified via itest-system, LST, Stress automation or causing automation failures
qa_stress
Bugs identified via Stress automation
Comments
Karvy-yb
added
area/ysql
Yugabyte SQL (YSQL)
priority/medium
Medium priority issue
status/awaiting-triage
Issue awaiting triage
qa_automation
Bugs identified via itest-system, LST, Stress automation or causing automation failures
qa_stress
Bugs identified via Stress automation
labels
Dec 14, 2023
Observing the same issue for test_sequence_multizone_ddl in |
yugabyte-ci
added
priority/highest
Highest priority issue
and removed
status/awaiting-triage
Issue awaiting triage
priority/medium
Medium priority issue
labels
Dec 14, 2023
timothy-e
changed the title
[YSQL] [Stress Test ]Postgres crash in `SetLatch(latch=0x0000000000000024) at latch.c
[YSQL] Postgres crash in `SetLatch(latch=0x024) at latch.c
Dec 14, 2023
timothy-e
added a commit
that referenced
this issue
Dec 15, 2023
…cleanup Summary: Short version: `MyLatch` is safe to use at anytime, but `MyProc->procLatch` refers to a shared latch that may not exist at startup and shut down. We don't worry about the startup case, because exit handlers have not been registered yet. = Background = === Latch Context === ```lang=c, name=latch.h /* ... * There are two kinds of latches: local and shared. A local latch is * initialized by InitLatch, and can only be set from the same process. * A local latch can be used to wait for a signal to arrive, by calling * SetLatch in the signal handler. A shared latch resides in shared memory, * and must be initialized at postmaster startup by InitSharedLatch. Before * a shared latch can be waited on, it must be associated with a process * with OwnLatch. Only the process owning the latch can wait on it, but any * process can set it. ... ``` === Process Life Cycle === A process is forked by the postmaster. Inside the new process, we set the values in the following order: 1. `InitPostmasterChild` sets `MyLatch` to a local latch object. 2. `InitProcess` sets `MyProc` to a non-null value from the free proc list 3. `InitProcess` calls `OwnLatch(MyProc->procLatch)`, allowing itself to wait on the latch of the PGPROC struct. 4. `InitProcess` switches the value of`MyLatch` to `MyProc->procLatch` from the local latch. Since the PGPROC list is shared, other processes can send signals to this process now. 5. The process sets up its signal handlers. In most cases (including the webserver), these involve setting the process's own latch to cause it to enter normal execution again. 6. **The process runs normally, waiting on the latch for new events. ** 7. When the process begins to shut down, we enter the cleanup phase in `ProcKill`. (This runs under normal circumstances, even though the name is aggressive) 8. The process sets its `MyLatch` to the local latch, because its about to release it's shared latch. 9. The process sets its `MyProc` to NULL. 10. The process disowns the shared latch. = Problem = If the process receives a SIGHUP signal... * before (5) above, nothing happens - no signal handlers are registered. * during normal execution, `MyProc->procLatch` and `MyLatch` both point to the latch of the PGPROC struct. Both work identically. * during cleanup, after `MyProc` was set to null: `MyLatch` is still valid to set and wait on (points to its local latch) but `MyProc->procLatch` causes a segmentation fault. The solution is to use `MyLatch` in all circumstances, so that we allow the process life cycle to control whether we use the shared latch or the local latch. PG16 has very few references to `MyProc->procLatch`. Test Plan: 1. Apply the patch ```lang=diff diff --git a/src/postgres/src/backend/storage/lmgr/proc.c b/src/postgres/src/backend/storage/lmgr/proc.c index eedb2057ea..c4529af546 100644 --- a/src/postgres/src/backend/storage/lmgr/proc.c +++ b/src/postgres/src/backend/storage/lmgr/proc.c @@ -885,6 +885,7 @@ ProcKill(int code, Datum arg) SwitchBackToLocalLatch(); proc = MyProc; MyProc = NULL; + pg_usleep(10 * 1000 * 1000); DisownLatch(&proc->procLatch); ReleaseProcToFreeList(proc); @@ -995,6 +996,7 @@ AuxiliaryProcKill(int code, Datum arg) * latch. */ SwitchBackToLocalLatch(); proc = MyProc; MyProc = NULL; + pg_usleep(10 * 1000 * 1000); DisownLatch(&proc->procLatch); ``` 2. Get the webserver PID `WEBSERVER_PID=$(pgrep -f "YSQL webserver")` 3. Send a SIGTERM to the webserver - this will cause it to shut itself down, but sleep for 10s during shut down. `kill -SIGTERM $WEBSERVER_PID` 4. While it's sleeping, send a SIGHUP `kill -SIGHUP $WEBSERVER_PID` 5. Before this change, a segmentation fault would occur - MyProc was null. After this change, we use `MyLatch` which has been switched to use a local latch instead of the shared `PGPROC` `procLatch`, so the SIGHUP completes successfully. Reviewers: amartsinchyk, kfranz, smishra Reviewed By: amartsinchyk, smishra Subscribers: smishra, yql Differential Revision: https://phorge.dev.yugabyte.com/D31071
timothy-e
added a commit
that referenced
this issue
Dec 15, 2023
…ighup handler at cleanup Summary: Original commit: D31071 / af01dc0 Short version: `MyLatch` is safe to use at anytime, but `MyProc->procLatch` refers to a shared latch that may not exist at startup and shut down. We don't worry about the startup case, because exit handlers have not been registered yet. = Background = === Latch Context === ```lang=c, name=latch.h /* ... * There are two kinds of latches: local and shared. A local latch is * initialized by InitLatch, and can only be set from the same process. * A local latch can be used to wait for a signal to arrive, by calling * SetLatch in the signal handler. A shared latch resides in shared memory, * and must be initialized at postmaster startup by InitSharedLatch. Before * a shared latch can be waited on, it must be associated with a process * with OwnLatch. Only the process owning the latch can wait on it, but any * process can set it. ... ``` === Process Life Cycle === A process is forked by the postmaster. Inside the new process, we set the values in the following order: 1. `InitPostmasterChild` sets `MyLatch` to a local latch object. 2. `InitProcess` sets `MyProc` to a non-null value from the free proc list 3. `InitProcess` calls `OwnLatch(MyProc->procLatch)`, allowing itself to wait on the latch of the PGPROC struct. 4. `InitProcess` switches the value of`MyLatch` to `MyProc->procLatch` from the local latch. Since the PGPROC list is shared, other processes can send signals to this process now. 5. The process sets up its signal handlers. In most cases (including the webserver), these involve setting the process's own latch to cause it to enter normal execution again. 6. **The process runs normally, waiting on the latch for new events. ** 7. When the process begins to shut down, we enter the cleanup phase in `ProcKill`. (This runs under normal circumstances, even though the name is aggressive) 8. The process sets its `MyLatch` to the local latch, because its about to release it's shared latch. 9. The process sets its `MyProc` to NULL. 10. The process disowns the shared latch. = Problem = If the process receives a SIGHUP signal... * before (5) above, nothing happens - no signal handlers are registered. * during normal execution, `MyProc->procLatch` and `MyLatch` both point to the latch of the PGPROC struct. Both work identically. * during cleanup, after `MyProc` was set to null: `MyLatch` is still valid to set and wait on (points to its local latch) but `MyProc->procLatch` causes a segmentation fault. The solution is to use `MyLatch` in all circumstances, so that we allow the process life cycle to control whether we use the shared latch or the local latch. PG16 has very few references to `MyProc->procLatch`. Test Plan: 1. Apply the patch ```lang=diff diff --git a/src/postgres/src/backend/storage/lmgr/proc.c b/src/postgres/src/backend/storage/lmgr/proc.c index eedb2057ea..c4529af546 100644 --- a/src/postgres/src/backend/storage/lmgr/proc.c +++ b/src/postgres/src/backend/storage/lmgr/proc.c @@ -885,6 +885,7 @@ ProcKill(int code, Datum arg) SwitchBackToLocalLatch(); proc = MyProc; MyProc = NULL; + pg_usleep(10 * 1000 * 1000); DisownLatch(&proc->procLatch); ReleaseProcToFreeList(proc); @@ -995,6 +996,7 @@ AuxiliaryProcKill(int code, Datum arg) * latch. */ SwitchBackToLocalLatch(); proc = MyProc; MyProc = NULL; + pg_usleep(10 * 1000 * 1000); DisownLatch(&proc->procLatch); ``` 2. Get the webserver PID `WEBSERVER_PID=$(pgrep -f "YSQL webserver")` 3. Send a SIGTERM to the webserver - this will cause it to shut itself down, but sleep for 10s during shut down. `kill -SIGTERM $WEBSERVER_PID` 4. While it's sleeping, send a SIGHUP `kill -SIGHUP $WEBSERVER_PID` 5. Before this change, a segmentation fault would occur - MyProc was null. After this change, we use `MyLatch` which has been switched to use a local latch instead of the shared `PGPROC` `procLatch`, so the SIGHUP completes successfully. Reviewers: amartsinchyk, kfranz, smishra Reviewed By: smishra Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D31073
timothy-e
added a commit
that referenced
this issue
Dec 21, 2023
…ighup handler at cleanup Summary: Original commit: af01dc0 / D31071 Short version: `MyLatch` is safe to use at anytime, but `MyProc->procLatch` refers to a shared latch that may not exist at startup and shut down. We don't worry about the startup case, because exit handlers have not been registered yet. = Background = === Latch Context === ```lang=c, name=latch.h /* ... * There are two kinds of latches: local and shared. A local latch is * initialized by InitLatch, and can only be set from the same process. * A local latch can be used to wait for a signal to arrive, by calling * SetLatch in the signal handler. A shared latch resides in shared memory, * and must be initialized at postmaster startup by InitSharedLatch. Before * a shared latch can be waited on, it must be associated with a process * with OwnLatch. Only the process owning the latch can wait on it, but any * process can set it. ... ``` === Process Life Cycle === A process is forked by the postmaster. Inside the new process, we set the values in the following order: 1. `InitPostmasterChild` sets `MyLatch` to a local latch object. 2. `InitProcess` sets `MyProc` to a non-null value from the free proc list 3. `InitProcess` calls `OwnLatch(MyProc->procLatch)`, allowing itself to wait on the latch of the PGPROC struct. 4. `InitProcess` switches the value of`MyLatch` to `MyProc->procLatch` from the local latch. Since the PGPROC list is shared, other processes can send signals to this process now. 5. The process sets up its signal handlers. In most cases (including the webserver), these involve setting the process's own latch to cause it to enter normal execution again. 6. **The process runs normally, waiting on the latch for new events. ** 7. When the process begins to shut down, we enter the cleanup phase in `ProcKill`. (This runs under normal circumstances, even though the name is aggressive) 8. The process sets its `MyLatch` to the local latch, because its about to release it's shared latch. 9. The process sets its `MyProc` to NULL. 10. The process disowns the shared latch. = Problem = If the process receives a SIGHUP signal... * before (5) above, nothing happens - no signal handlers are registered. * during normal execution, `MyProc->procLatch` and `MyLatch` both point to the latch of the PGPROC struct. Both work identically. * during cleanup, after `MyProc` was set to null: `MyLatch` is still valid to set and wait on (points to its local latch) but `MyProc->procLatch` causes a segmentation fault. The solution is to use `MyLatch` in all circumstances, so that we allow the process life cycle to control whether we use the shared latch or the local latch. PG16 has very few references to `MyProc->procLatch`. Jira: DB-9276 Test Plan: 1. Apply the patch ```lang=diff diff --git a/src/postgres/src/backend/storage/lmgr/proc.c b/src/postgres/src/backend/storage/lmgr/proc.c index eedb2057ea..c4529af546 100644 --- a/src/postgres/src/backend/storage/lmgr/proc.c +++ b/src/postgres/src/backend/storage/lmgr/proc.c @@ -885,6 +885,7 @@ ProcKill(int code, Datum arg) SwitchBackToLocalLatch(); proc = MyProc; MyProc = NULL; + pg_usleep(10 * 1000 * 1000); DisownLatch(&proc->procLatch); ReleaseProcToFreeList(proc); @@ -995,6 +996,7 @@ AuxiliaryProcKill(int code, Datum arg) * latch. */ SwitchBackToLocalLatch(); proc = MyProc; MyProc = NULL; + pg_usleep(10 * 1000 * 1000); DisownLatch(&proc->procLatch); ``` 2. Get the webserver PID `WEBSERVER_PID=$(pgrep -f "YSQL webserver")` 3. Send a SIGTERM to the webserver - this will cause it to shut itself down, but sleep for 10s during shut down. `kill -SIGTERM $WEBSERVER_PID` 4. While it's sleeping, send a SIGHUP `kill -SIGHUP $WEBSERVER_PID` 5. Before this change, a segmentation fault would occur - MyProc was null. After this change, we use `MyLatch` which has been switched to use a local latch instead of the shared `PGPROC` `procLatch`, so the SIGHUP completes successfully. Reviewers: amartsinchyk, kfranz, smishra Reviewed By: amartsinchyk Subscribers: smishra, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D31167
timothy-e
added a commit
that referenced
this issue
Dec 21, 2023
…hup handler at cleanup Summary: Original commit: af01dc0 / D31071 Short version: `MyLatch` is safe to use at anytime, but `MyProc->procLatch` refers to a shared latch that may not exist at startup and shut down. We don't worry about the startup case, because exit handlers have not been registered yet. = Background = === Latch Context === ```lang=c, name=latch.h /* ... * There are two kinds of latches: local and shared. A local latch is * initialized by InitLatch, and can only be set from the same process. * A local latch can be used to wait for a signal to arrive, by calling * SetLatch in the signal handler. A shared latch resides in shared memory, * and must be initialized at postmaster startup by InitSharedLatch. Before * a shared latch can be waited on, it must be associated with a process * with OwnLatch. Only the process owning the latch can wait on it, but any * process can set it. ... ``` === Process Life Cycle === A process is forked by the postmaster. Inside the new process, we set the values in the following order: 1. `InitPostmasterChild` sets `MyLatch` to a local latch object. 2. `InitProcess` sets `MyProc` to a non-null value from the free proc list 3. `InitProcess` calls `OwnLatch(MyProc->procLatch)`, allowing itself to wait on the latch of the PGPROC struct. 4. `InitProcess` switches the value of`MyLatch` to `MyProc->procLatch` from the local latch. Since the PGPROC list is shared, other processes can send signals to this process now. 5. The process sets up its signal handlers. In most cases (including the webserver), these involve setting the process's own latch to cause it to enter normal execution again. 6. **The process runs normally, waiting on the latch for new events. ** 7. When the process begins to shut down, we enter the cleanup phase in `ProcKill`. (This runs under normal circumstances, even though the name is aggressive) 8. The process sets its `MyLatch` to the local latch, because its about to release it's shared latch. 9. The process sets its `MyProc` to NULL. 10. The process disowns the shared latch. = Problem = If the process receives a SIGHUP signal... * before (5) above, nothing happens - no signal handlers are registered. * during normal execution, `MyProc->procLatch` and `MyLatch` both point to the latch of the PGPROC struct. Both work identically. * during cleanup, after `MyProc` was set to null: `MyLatch` is still valid to set and wait on (points to its local latch) but `MyProc->procLatch` causes a segmentation fault. The solution is to use `MyLatch` in all circumstances, so that we allow the process life cycle to control whether we use the shared latch or the local latch. PG16 has very few references to `MyProc->procLatch`. Jira: DB-9276 Test Plan: 1. Apply the patch ```lang=diff diff --git a/src/postgres/src/backend/storage/lmgr/proc.c b/src/postgres/src/backend/storage/lmgr/proc.c index eedb2057ea..c4529af546 100644 --- a/src/postgres/src/backend/storage/lmgr/proc.c +++ b/src/postgres/src/backend/storage/lmgr/proc.c @@ -885,6 +885,7 @@ ProcKill(int code, Datum arg) SwitchBackToLocalLatch(); proc = MyProc; MyProc = NULL; + pg_usleep(10 * 1000 * 1000); DisownLatch(&proc->procLatch); ReleaseProcToFreeList(proc); @@ -995,6 +996,7 @@ AuxiliaryProcKill(int code, Datum arg) * latch. */ SwitchBackToLocalLatch(); proc = MyProc; MyProc = NULL; + pg_usleep(10 * 1000 * 1000); DisownLatch(&proc->procLatch); ``` 2. Get the webserver PID `WEBSERVER_PID=$(pgrep -f "YSQL webserver")` 3. Send a SIGTERM to the webserver - this will cause it to shut itself down, but sleep for 10s during shut down. `kill -SIGTERM $WEBSERVER_PID` 4. While it's sleeping, send a SIGHUP `kill -SIGHUP $WEBSERVER_PID` 5. Before this change, a segmentation fault would occur - MyProc was null. After this change, we use `MyLatch` which has been switched to use a local latch instead of the shared `PGPROC` `procLatch`, so the SIGHUP completes successfully. Reviewers: amartsinchyk, kfranz, smishra Reviewed By: amartsinchyk Subscribers: yql, smishra Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D31170
timothy-e
added a commit
that referenced
this issue
Dec 21, 2023
…hup handler at cleanup Summary: Original commit: af01dc0 / D31071 Short version: `MyLatch` is safe to use at anytime, but `MyProc->procLatch` refers to a shared latch that may not exist at startup and shut down. We don't worry about the startup case, because exit handlers have not been registered yet. = Background = === Latch Context === ```lang=c, name=latch.h /* ... * There are two kinds of latches: local and shared. A local latch is * initialized by InitLatch, and can only be set from the same process. * A local latch can be used to wait for a signal to arrive, by calling * SetLatch in the signal handler. A shared latch resides in shared memory, * and must be initialized at postmaster startup by InitSharedLatch. Before * a shared latch can be waited on, it must be associated with a process * with OwnLatch. Only the process owning the latch can wait on it, but any * process can set it. ... ``` === Process Life Cycle === A process is forked by the postmaster. Inside the new process, we set the values in the following order: 1. `InitPostmasterChild` sets `MyLatch` to a local latch object. 2. `InitProcess` sets `MyProc` to a non-null value from the free proc list 3. `InitProcess` calls `OwnLatch(MyProc->procLatch)`, allowing itself to wait on the latch of the PGPROC struct. 4. `InitProcess` switches the value of`MyLatch` to `MyProc->procLatch` from the local latch. Since the PGPROC list is shared, other processes can send signals to this process now. 5. The process sets up its signal handlers. In most cases (including the webserver), these involve setting the process's own latch to cause it to enter normal execution again. 6. **The process runs normally, waiting on the latch for new events. ** 7. When the process begins to shut down, we enter the cleanup phase in `ProcKill`. (This runs under normal circumstances, even though the name is aggressive) 8. The process sets its `MyLatch` to the local latch, because its about to release it's shared latch. 9. The process sets its `MyProc` to NULL. 10. The process disowns the shared latch. = Problem = If the process receives a SIGHUP signal... * before (5) above, nothing happens - no signal handlers are registered. * during normal execution, `MyProc->procLatch` and `MyLatch` both point to the latch of the PGPROC struct. Both work identically. * during cleanup, after `MyProc` was set to null: `MyLatch` is still valid to set and wait on (points to its local latch) but `MyProc->procLatch` causes a segmentation fault. The solution is to use `MyLatch` in all circumstances, so that we allow the process life cycle to control whether we use the shared latch or the local latch. PG16 has very few references to `MyProc->procLatch`. Jira: DB-9276 Test Plan: 1. Apply the patch ```lang=diff diff --git a/src/postgres/src/backend/storage/lmgr/proc.c b/src/postgres/src/backend/storage/lmgr/proc.c index eedb2057ea..c4529af546 100644 --- a/src/postgres/src/backend/storage/lmgr/proc.c +++ b/src/postgres/src/backend/storage/lmgr/proc.c @@ -885,6 +885,7 @@ ProcKill(int code, Datum arg) SwitchBackToLocalLatch(); proc = MyProc; MyProc = NULL; + pg_usleep(10 * 1000 * 1000); DisownLatch(&proc->procLatch); ReleaseProcToFreeList(proc); @@ -995,6 +996,7 @@ AuxiliaryProcKill(int code, Datum arg) * latch. */ SwitchBackToLocalLatch(); proc = MyProc; MyProc = NULL; + pg_usleep(10 * 1000 * 1000); DisownLatch(&proc->procLatch); ``` 2. Get the webserver PID `WEBSERVER_PID=$(pgrep -f "YSQL webserver")` 3. Send a SIGTERM to the webserver - this will cause it to shut itself down, but sleep for 10s during shut down. `kill -SIGTERM $WEBSERVER_PID` 4. While it's sleeping, send a SIGHUP `kill -SIGHUP $WEBSERVER_PID` 5. Before this change, a segmentation fault would occur - MyProc was null. After this change, we use `MyLatch` which has been switched to use a local latch instead of the shared `PGPROC` `procLatch`, so the SIGHUP completes successfully. Reviewers: amartsinchyk, kfranz, smishra Reviewed By: amartsinchyk Subscribers: smishra, yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D31169
timothy-e
added a commit
that referenced
this issue
Dec 21, 2023
…hup handler at cleanup Summary: Original commit: af01dc0 / D31071 Short version: `MyLatch` is safe to use at anytime, but `MyProc->procLatch` refers to a shared latch that may not exist at startup and shut down. We don't worry about the startup case, because exit handlers have not been registered yet. = Background = === Latch Context === ```lang=c, name=latch.h /* ... * There are two kinds of latches: local and shared. A local latch is * initialized by InitLatch, and can only be set from the same process. * A local latch can be used to wait for a signal to arrive, by calling * SetLatch in the signal handler. A shared latch resides in shared memory, * and must be initialized at postmaster startup by InitSharedLatch. Before * a shared latch can be waited on, it must be associated with a process * with OwnLatch. Only the process owning the latch can wait on it, but any * process can set it. ... ``` === Process Life Cycle === A process is forked by the postmaster. Inside the new process, we set the values in the following order: 1. `InitPostmasterChild` sets `MyLatch` to a local latch object. 2. `InitProcess` sets `MyProc` to a non-null value from the free proc list 3. `InitProcess` calls `OwnLatch(MyProc->procLatch)`, allowing itself to wait on the latch of the PGPROC struct. 4. `InitProcess` switches the value of`MyLatch` to `MyProc->procLatch` from the local latch. Since the PGPROC list is shared, other processes can send signals to this process now. 5. The process sets up its signal handlers. In most cases (including the webserver), these involve setting the process's own latch to cause it to enter normal execution again. 6. **The process runs normally, waiting on the latch for new events. ** 7. When the process begins to shut down, we enter the cleanup phase in `ProcKill`. (This runs under normal circumstances, even though the name is aggressive) 8. The process sets its `MyLatch` to the local latch, because its about to release it's shared latch. 9. The process sets its `MyProc` to NULL. 10. The process disowns the shared latch. = Problem = If the process receives a SIGHUP signal... * before (5) above, nothing happens - no signal handlers are registered. * during normal execution, `MyProc->procLatch` and `MyLatch` both point to the latch of the PGPROC struct. Both work identically. * during cleanup, after `MyProc` was set to null: `MyLatch` is still valid to set and wait on (points to its local latch) but `MyProc->procLatch` causes a segmentation fault. The solution is to use `MyLatch` in all circumstances, so that we allow the process life cycle to control whether we use the shared latch or the local latch. PG16 has very few references to `MyProc->procLatch`. Jira: DB-9276 Test Plan: 1. Apply the patch ```lang=diff diff --git a/src/postgres/src/backend/storage/lmgr/proc.c b/src/postgres/src/backend/storage/lmgr/proc.c index eedb2057ea..c4529af546 100644 --- a/src/postgres/src/backend/storage/lmgr/proc.c +++ b/src/postgres/src/backend/storage/lmgr/proc.c @@ -885,6 +885,7 @@ ProcKill(int code, Datum arg) SwitchBackToLocalLatch(); proc = MyProc; MyProc = NULL; + pg_usleep(10 * 1000 * 1000); DisownLatch(&proc->procLatch); ReleaseProcToFreeList(proc); @@ -995,6 +996,7 @@ AuxiliaryProcKill(int code, Datum arg) * latch. */ SwitchBackToLocalLatch(); proc = MyProc; MyProc = NULL; + pg_usleep(10 * 1000 * 1000); DisownLatch(&proc->procLatch); ``` 2. Get the webserver PID `WEBSERVER_PID=$(pgrep -f "YSQL webserver")` 3. Send a SIGTERM to the webserver - this will cause it to shut itself down, but sleep for 10s during shut down. `kill -SIGTERM $WEBSERVER_PID` 4. While it's sleeping, send a SIGHUP `kill -SIGHUP $WEBSERVER_PID` 5. Before this change, a segmentation fault would occur - MyProc was null. After this change, we use `MyLatch` which has been switched to use a local latch instead of the shared `PGPROC` `procLatch`, so the SIGHUP completes successfully. Reviewers: amartsinchyk, kfranz, smishra Reviewed By: amartsinchyk Subscribers: yql, smishra Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D31168
timothy-e
added a commit
that referenced
this issue
Dec 21, 2023
…hup handler at cleanup Summary: Original commit: af01dc0 / D31071 Short version: `MyLatch` is safe to use at anytime, but `MyProc->procLatch` refers to a shared latch that may not exist at startup and shut down. We don't worry about the startup case, because exit handlers have not been registered yet. = Background = === Latch Context === ```lang=c, name=latch.h /* ... * There are two kinds of latches: local and shared. A local latch is * initialized by InitLatch, and can only be set from the same process. * A local latch can be used to wait for a signal to arrive, by calling * SetLatch in the signal handler. A shared latch resides in shared memory, * and must be initialized at postmaster startup by InitSharedLatch. Before * a shared latch can be waited on, it must be associated with a process * with OwnLatch. Only the process owning the latch can wait on it, but any * process can set it. ... ``` === Process Life Cycle === A process is forked by the postmaster. Inside the new process, we set the values in the following order: 1. `InitPostmasterChild` sets `MyLatch` to a local latch object. 2. `InitProcess` sets `MyProc` to a non-null value from the free proc list 3. `InitProcess` calls `OwnLatch(MyProc->procLatch)`, allowing itself to wait on the latch of the PGPROC struct. 4. `InitProcess` switches the value of`MyLatch` to `MyProc->procLatch` from the local latch. Since the PGPROC list is shared, other processes can send signals to this process now. 5. The process sets up its signal handlers. In most cases (including the webserver), these involve setting the process's own latch to cause it to enter normal execution again. 6. **The process runs normally, waiting on the latch for new events. ** 7. When the process begins to shut down, we enter the cleanup phase in `ProcKill`. (This runs under normal circumstances, even though the name is aggressive) 8. The process sets its `MyLatch` to the local latch, because its about to release it's shared latch. 9. The process sets its `MyProc` to NULL. 10. The process disowns the shared latch. = Problem = If the process receives a SIGHUP signal... * before (5) above, nothing happens - no signal handlers are registered. * during normal execution, `MyProc->procLatch` and `MyLatch` both point to the latch of the PGPROC struct. Both work identically. * during cleanup, after `MyProc` was set to null: `MyLatch` is still valid to set and wait on (points to its local latch) but `MyProc->procLatch` causes a segmentation fault. The solution is to use `MyLatch` in all circumstances, so that we allow the process life cycle to control whether we use the shared latch or the local latch. PG16 has very few references to `MyProc->procLatch`. Jira: DB-9276 Test Plan: 1. Apply the patch ```lang=diff diff --git a/src/postgres/src/backend/storage/lmgr/proc.c b/src/postgres/src/backend/storage/lmgr/proc.c index eedb2057ea..c4529af546 100644 --- a/src/postgres/src/backend/storage/lmgr/proc.c +++ b/src/postgres/src/backend/storage/lmgr/proc.c @@ -885,6 +885,7 @@ ProcKill(int code, Datum arg) SwitchBackToLocalLatch(); proc = MyProc; MyProc = NULL; + pg_usleep(10 * 1000 * 1000); DisownLatch(&proc->procLatch); ReleaseProcToFreeList(proc); @@ -995,6 +996,7 @@ AuxiliaryProcKill(int code, Datum arg) * latch. */ SwitchBackToLocalLatch(); proc = MyProc; MyProc = NULL; + pg_usleep(10 * 1000 * 1000); DisownLatch(&proc->procLatch); ``` 2. Get the webserver PID `WEBSERVER_PID=$(pgrep -f "YSQL webserver")` 3. Send a SIGTERM to the webserver - this will cause it to shut itself down, but sleep for 10s during shut down. `kill -SIGTERM $WEBSERVER_PID` 4. While it's sleeping, send a SIGHUP `kill -SIGHUP $WEBSERVER_PID` 5. Before this change, a segmentation fault would occur - MyProc was null. After this change, we use `MyLatch` which has been switched to use a local latch instead of the shared `PGPROC` `procLatch`, so the SIGHUP completes successfully. Reviewers: amartsinchyk, kfranz, smishra Reviewed By: amartsinchyk Subscribers: yql, smishra Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D31166
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
kind/bug
This issue is a bug
priority/highest
Highest priority issue
qa_automation
Bugs identified via itest-system, LST, Stress automation or causing automation failures
qa_stress
Bugs identified via Stress automation
Jira Link: DB-9276
Description
Test:
test_create_alter_delete_tables_vm_restarts
Version:
2.18.5.0-b91
Test steps:
Core dump:
Note: There is another core dump observed which is already reported in #17172
Stress report: http://stress.dev.yugabyte.com/stress_test/75f1961a-10ff-4d36-a378-93251e23de71
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
The text was updated successfully, but these errors were encountered: