PGPool-II not forwarding client connection to backend during NETWORK ISOLATION #82

meorkamil · 2024-12-03T03:14:06Z

Hi,

We found something during our split-brain testing. We have a 3-node PGPool-II + Watchdog cluster (lifecheck in heartbeat mode) spread across different datacenters, as listed below. We simulate the split-brain scenario by dropping incoming and outgoing connections between DC1 <-> DC2, while connectivity to the PGPool-II node in AWS is left intact.

Setup

VMKDB01U - DC1 (PGPool-II + Watchdog + PostgreSQL)
VMKDB02U - DC2 (PGPool-II + Watchdog + PostgreSQL)
VMKDB03U - AWS (PGPool-II + Watchdog), acting as witness

Problem

We noticed that VMKDB02U changed its state to NETWORK ISOLATION, and clients connecting to our watchdog LEADER hung / buffered until network connectivity stabilized, after which they resumed as normal.

Questions

Is this the expected behavior of PGPool-II when a node enters the NETWORK ISOLATION state? We are asking based on the following:
- pgpool2/src/watchdog/watchdog.c, line 6530 in 009b197:
  * We can get into this state if we detect the total
- pgpool2/src/watchdog/watchdog.c, line 6621 in 009b197:
  * we could end up in tis state if we were connected to the
- 62c444f / b2f1526

VMKDB02U log

2024-11-27 18:34:58.983: watchdog pid 1850759: LOG:  watchdog node state changed from [INITIALIZING] to [STANDING FOR LEADER]
2024-11-27 18:34:58.984: watchdog pid 1850759: LOG:  our stand for coordinator request is rejected by node "192.168.1.3:9999 Linux VMKDB03U"
2024-11-27 18:34:58.984: watchdog pid 1850759: DETAIL:  we might be in partial network isolation and cluster already have a valid leader
2024-11-27 18:34:58.984: watchdog pid 1850759: HINT:  please verify the watchdog life-check and network is working properly
2024-11-27 18:34:58.984: watchdog pid 1850759: LOG:  watchdog node state changed from [STANDING FOR LEADER] to [NETWORK ISOLATION]
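
When comparing logs from several nodes, it can help to pull out only the watchdog state transitions. The following is a minimal Python sketch that assumes the log line format shown above; the function name `state_transitions` and the regular expression are illustrative and are not part of Pgpool-II.

```python
import re

# Matches lines such as:
#   <timestamp>: watchdog pid <pid>: LOG:  watchdog node state changed from [A] to [B]
# The pattern is derived only from the log excerpt in this issue.
TRANSITION = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): watchdog pid \d+: "
    r"LOG:\s+watchdog node state changed from "
    r"\[(?P<old>[^\]]+)\] to \[(?P<new>[^\]]+)\]"
)


def state_transitions(log_text):
    """Return (timestamp, old_state, new_state) tuples found in log_text."""
    result = []
    for line in log_text.splitlines():
        match = TRANSITION.match(line.strip())
        if match:
            result.append((match.group("ts"), match.group("old"), match.group("new")))
    return result


sample = """\
2024-11-27 18:34:58.983: watchdog pid 1850759: LOG:  watchdog node state changed from [INITIALIZING] to [STANDING FOR LEADER]
2024-11-27 18:34:58.984: watchdog pid 1850759: LOG:  our stand for coordinator request is rejected by node "192.168.1.3:9999 Linux VMKDB03U"
2024-11-27 18:34:58.984: watchdog pid 1850759: LOG:  watchdog node state changed from [STANDING FOR LEADER] to [NETWORK ISOLATION]
"""

for ts, old, new in state_transitions(sample):
    print(f"{ts}  {old} -> {new}")
```

Lines that do not describe a state change (such as the rejection message) are skipped, so the output is a compact timeline per node.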

VMKDB03U log

2024-11-27 18:37:12.547: watchdog pid 41337: LOG:  We are connected to leader node "192.168.1.1:9999 Linux VMKDB01U" and another node "192.168.1.2:9999 Linux VMKDB02U" is trying to become a leader
2024-11-27 18:37:13.126: sr_check_worker pid 41376: LOG:  get_query_result failed: status: -2
2024-11-27 18:37:13.126: sr_check_worker pid 41376: CONTEXT:  while checking replication time lag
2024-11-27 18:37:23.166: sr_check_worker pid 41376: LOG:  get_query_result failed: status: -2
2024-11-27 18:37:23.166: sr_check_worker pid 41376: CONTEXT:  while checking replication time lag

Watchdog Node Information
Node Name         : 192.168.1.3:9999 Linux VMKDB03U
Host Name         : 192.168.1.3
Delegate IP       : Not_Set
Pgpool port       : 9999
Watchdog port     : 9000
Node priority     : 1
Status            : 7
Status Name       : STANDBY
Membership Status : MEMBER

Node Name         : 192.168.1.1:9999 Linux VMKDB01U
Host Name         : 192.168.1.1
Delegate IP       : Not_Set
Pgpool port       : 9999
Watchdog port     : 9000
Node priority     : 3
Status            : 4
Status Name       : LEADER
Membership Status : MEMBER

Node Name         : 192.168.1.2:9999 Linux VMKDB02U
Host Name         : 192.168.1.2
Delegate IP       : Not_Set
Pgpool port       : 9999
Watchdog port     : 9000
Node priority     : 2
Status            : 12
Status Name       : NETWORK ISOLATION
Membership Status : MEMBER
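
For monitoring, the node listing above can be turned into structured data so that isolated nodes are easy to flag. This is a rough Python sketch that assumes the "Key : Value" layout with blank lines between nodes, exactly as pasted above; `parse_watchdog_nodes` is a hypothetical helper name.

```python
def parse_watchdog_nodes(text):
    """Split blank-line-separated 'Key : Value' blocks into one dict per node."""
    nodes, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current node block, if any.
            if current:
                nodes.append(current)
                current = {}
            continue
        # Split on the first colon only, so values like "192.168.1.3:9999 ..." stay whole.
        key, sep, value = line.partition(":")
        if sep:  # header lines without a colon are ignored
            current[key.strip()] = value.strip()
    if current:
        nodes.append(current)
    return nodes


sample = """\
Watchdog Node Information
Node Name         : 192.168.1.1:9999 Linux VMKDB01U
Host Name         : 192.168.1.1
Status Name       : LEADER

Node Name         : 192.168.1.2:9999 Linux VMKDB02U
Host Name         : 192.168.1.2
Status Name       : NETWORK ISOLATION
"""

nodes = parse_watchdog_nodes(sample)
isolated = [n["Host Name"] for n in nodes if n.get("Status Name") == "NETWORK ISOLATION"]
print(isolated)
```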


pengbo0328 · 2024-12-05T07:33:05Z

Thank you for reporting this issue.
Which version are you using? Does this issue occur in the latest version?

meorkamil · 2024-12-05T13:30:38Z

> Thank you for reporting this issue. Which version are you using? Does this issue occur in the latest version?

Our environment is set up as follows:

Postgres Version: postgresql16
PGPool-II Version: pgpool-II version 4.5.5 (hotooriboshi)
OS: Red Hat Enterprise Linux release 8.10 (Ootpa)

Yes, we are running the latest pgpool, 4.5.5.

pengbo0328 · 2024-12-06T06:51:57Z

I think it is the correct behavior.
VMKDB02U attempts to become the LEADER, but the request is rejected by VMKDB03U because VMKDB03U can connect to VMKDB01U, which is the current LEADER.
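
The rejection described above can be pictured with a toy decision rule. This Python sketch is only an illustration of the reasoning in this comment, not Pgpool-II's actual watchdog code; the function `respond_to_leader_request` and its parameters are hypothetical.

```python
def respond_to_leader_request(candidate, known_leader, reachable):
    """Toy rule: reject a candidate while this node still reaches a different leader.

    candidate    -- node asking to become leader
    known_leader -- leader this node currently follows, or None
    reachable    -- set of nodes this node can currently talk to
    """
    if (
        known_leader is not None
        and known_leader != candidate
        and known_leader in reachable
    ):
        return "REJECT"
    return "ACCEPT"


# VMKDB03U's point of view during the test: it still reaches the leader
# VMKDB01U, so the request from VMKDB02U is rejected.
print(respond_to_leader_request("VMKDB02U", "VMKDB01U", {"VMKDB01U", "VMKDB02U"}))
```

Under this simplified rule, VMKDB02U's request would only be accepted if VMKDB03U had no leader or could no longer reach it.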

> Problem
>
> We noticed that VMKDB02U changed its state to NETWORK ISOLATION, and clients connecting to our watchdog LEADER hung / buffered until network connectivity stabilized, after which they resumed as normal.

The LEADER is VMKDB01U.
Do you mean the client cannot connect to the LEADER, VMKDB01U?
Is the client using a VIP to connect to pgpool? The VIP is normally assigned to the LEADER.

meorkamil · 2024-12-06T17:18:08Z

> I think it is the correct behavior.
> VMKDB02U attempts to become the LEADER, but the request is rejected by VMKDB03U because VMKDB03U can connect to VMKDB01U, which is the current LEADER.

Agreed. I just wanted to confirm the connectivity from the client side. We observed that pgpool was unresponsive during the NETWORK ISOLATION event on VMKDB02U, even though the client was connected via the VIP (assigned to the LEADER).

In the setup from the original post we were not using a VIP. Instead, we use service discovery to register the LEADER/STANDBY instances. Clients always connect to the LEADER's host IP.

> The LEADER is VMKDB01U.
> Do you mean the client cannot connect to the LEADER, VMKDB01U?
> Is the client using a VIP to connect to pgpool? The VIP is normally assigned to the LEADER.

Moving forward, we tested with the VIP (sample configuration), where the client connected via VIP 192.168.1.4, but unfortunately, the same issue persisted.

Watchdog Cluster Information
Total Nodes              : 3
Remote Nodes             : 2
Member Remote Nodes      : 2
Alive Remote Nodes       : 1
Nodes required for quorum: 2
Quorum state             : QUORUM EXIST
Local node escalation    : NO
Leader Node Name         : 192.168.1.1:9999 Linux VMKDB01U
Leader Host Name         : 192.168.1.1

Watchdog Node Information
Node Name         : 192.168.1.3:9999 Linux VMKDB03U
Host Name         : 192.168.1.3
Delegate IP       : 192.168.1.4
Pgpool port       : 9999
Watchdog port     : 9000
Node priority     : 1
Status            : 7
Status Name       : STANDBY
Membership Status : MEMBER

Node Name         : 192.168.1.1:9999 Linux VMKDB01U
Host Name         : 192.168.1.1
Delegate IP       : 192.168.1.4
Pgpool port       : 9999
Watchdog port     : 9000
Node priority     : 100
Status            : 4
Status Name       : LEADER
Membership Status : MEMBER

Node Name         : 192.168.1.2:9999 Linux VMKDB02U
Host Name         : 192.168.1.2
Delegate IP       : 192.168.1.4
Pgpool port       : 9999
Watchdog port     : 9000
Node priority     : 2
Status            : 12
Status Name       : NETWORK ISOLATION
Membership Status : MEMBER
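
The quorum figures in the Watchdog Cluster Information above (3 total nodes, 1 alive remote node, 2 required for quorum) are consistent with a simple majority rule. The Python sketch below assumes such a rule and assumes the local node counts itself as alive; Pgpool-II's actual calculation may differ depending on configuration, and both function names are hypothetical.

```python
def required_for_quorum(total_nodes):
    """Smallest strict majority of the cluster (assumed rule)."""
    return total_nodes // 2 + 1


def quorum_exists(total_nodes, alive_remote_nodes):
    """Count the local node plus alive remote nodes against the majority."""
    alive = alive_remote_nodes + 1  # assumption: the local node is always counted
    return alive >= required_for_quorum(total_nodes)


# Values reported above: Total Nodes = 3, Alive Remote Nodes = 1
print(required_for_quorum(3))
print(quorum_exists(3, 1))
```

With these inputs the sketch agrees with the reported "Nodes required for quorum: 2" and "QUORUM EXIST"; with zero alive remote nodes the same rule would report no quorum.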

VMKDB02U Logs

2024-12-07 00:41:58.469: watchdog pid 1123869: LOG:  watchdog node state changed from [NETWORK ISOLATION] to [JOINING]
2024-12-07 00:41:58.469: watchdog pid 1123869: LOG:  watchdog node state changed from [JOINING] to [INITIALIZING]
2024-12-07 00:41:59.471: watchdog pid 1123869: LOG:  watchdog node state changed from [INITIALIZING] to [STANDING FOR LEADER]
2024-12-07 00:41:59.471: watchdog pid 1123869: LOG:  our stand for coordinator request is rejected by node "192.168.1.3:9999 Linux VMKDB03U"
2024-12-07 00:41:59.471: watchdog pid 1123869: DETAIL:  we might be in partial network isolation and cluster already have a valid leader
2024-12-07 00:41:59.471: watchdog pid 1123869: HINT:  please verify the watchdog life-check and network is working properly
2024-12-07 00:41:59.471: watchdog pid 1123869: LOG:  watchdog node state changed from [STANDING FOR LEADER] to [NETWORK ISOLATION]

VMKDB03U Logs

2024-12-07 00:40:59.469: watchdog pid 8407: LOG:  We are connected to leader node "192.168.1.1:9999 Linux VMKDB01U" and another node "192.168.1.2:9999 Linux VMKDB02U" is trying to become a leader

VMKDB01U Logs

2024-12-07 00:40:45.527: psql pid 1118963: DETAIL:  query: "INSERT INTO service (service_name) VALUES ('Sat Dec  7 00:40:45 +08 2024 VMKAPP01U');"
2024-12-07 00:40:45.530: psql pid 1118963: LOG:  Terminate message from frontend.
2024-12-07 00:40:45.539: child pid 1118951: LOG:  new connection received
2024-12-07 00:40:45.539: child pid 1118951: DETAIL:  connecting host=192.168.2.1 port=35754
2024-12-07 00:40:45.578: psql pid 1118951: LOG:  Query message from frontend.
2024-12-07 00:40:45.578: psql pid 1118951: DETAIL:  query: "SELECT * FROM service ORDER BY service_id DESC LIMIT 1;"
2024-12-07 00:40:45.583: psql pid 1118951: LOG:  Terminate message from frontend.
2024-12-07 00:40:47.342: pcp_main pid 1118971: LOG:  forked new pcp worker, pid=1122146 socket=6
2024-12-07 00:40:47.346: pcp_main pid 1118971: LOG:  PCP process with pid: 1122146 exit with SUCCESS.
2024-12-07 00:40:47.346: pcp_main pid 1118971: LOG:  PCP process with pid: 1122146 exits with status 0
2024-12-07 00:40:47.352: pcp_main pid 1118971: LOG:  forked new pcp worker, pid=1122158 socket=6
2024-12-07 00:40:47.353: pcp_main pid 1118971: LOG:  PCP process with pid: 1122158 exit with SUCCESS.
2024-12-07 00:40:47.353: pcp_main pid 1118971: LOG:  PCP process with pid: 1122158 exits with status 0
2024-12-07 00:40:47.488: heart_beat_sender pid 1118937: ERROR:  failed to send watchdog heartbeat, sendto failed
2024-12-07 00:40:47.488: heart_beat_sender pid 1118937: DETAIL:  sending packet to "192.168.1.2" failed with reason: "Operation not permitted"
2024-12-07 00:40:48.599: psql pid 1118961: LOG:  new connection received
2024-12-07 00:40:48.599: psql pid 1118961: DETAIL:  connecting host=192.168.2.1 port=35766
2024-12-07 00:40:49.488: heart_beat_sender pid 1118937: ERROR:  failed to send watchdog heartbeat, sendto failed
2024-12-07 00:40:49.488: heart_beat_sender pid 1118937: DETAIL:  sending packet to "192.168.1.2" failed with reason: "Operation not permitted"

pengbo0328 assigned codeforall Dec 5, 2024

pengbo0328 assigned pengbo0328 and unassigned codeforall Dec 6, 2024
