PGPool-II not forwarding client connection to backend during NETWORK ISOLATION #82

Open
meorkamil opened this issue Dec 3, 2024 · 4 comments
meorkamil commented Dec 3, 2024

Hi,

We found an issue while testing a split-brain scenario. We have a 3-node PGPool-II + Watchdog cluster (lifecheck in heartbeat mode) spread across different datacenters, as shown in the diagram below. To simulate split brain, we drop incoming and outgoing connections between DC1 <-> DC2, leaving connectivity to the PGPool-II node in AWS untouched.

Setup

VMKDB01U - DC1 (PGPool-II + Watchdog + Postgresql)
VMKDB02U - DC2 (PGPool-II + Watchdog + Postgresql)
VMKDB03U - AWS (PGPool-II + Watchdog) Acting as witness

[image: diagram of the three-node setup across DC1, DC2 and AWS]

Problem

We noticed that VMKDB02U changed its state to NETWORK ISOLATION, and clients connecting to our watchdog LEADER hung / buffered until network connectivity was restored, after which they resumed as normal.

Questions

VMKDB02U log

2024-11-27 18:34:58.983: watchdog pid 1850759: LOG:  watchdog node state changed from [INITIALIZING] to [STANDING FOR LEADER]
2024-11-27 18:34:58.984: watchdog pid 1850759: LOG:  our stand for coordinator request is rejected by node "192.168.1.3:9999 Linux VMKDB03U"
2024-11-27 18:34:58.984: watchdog pid 1850759: DETAIL:  we might be in partial network isolation and cluster already have a valid leader
2024-11-27 18:34:58.984: watchdog pid 1850759: HINT:  please verify the watchdog life-check and network is working properly
2024-11-27 18:34:58.984: watchdog pid 1850759: LOG:  watchdog node state changed from [STANDING FOR LEADER] to [NETWORK ISOLATION]
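
When comparing runs, the state transitions in a log like the one above can be extracted mechanically. The following is a minimal sketch that assumes the log line format shown in this issue; the name `state_transitions` is hypothetical and not part of Pgpool-II.

```python
import re

# Matches lines such as:
#   <timestamp>: watchdog pid <n>: LOG:  watchdog node state changed from [A] to [B]
TRANSITION_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): watchdog pid \d+: "
    r"LOG:\s+watchdog node state changed from \[(?P<old>[^\]]+)\] to \[(?P<new>[^\]]+)\]"
)


def state_transitions(log_text):
    """Return (timestamp, old_state, new_state) tuples found in log_text."""
    found = []
    for line in log_text.splitlines():
        match = TRANSITION_RE.match(line.strip())
        if match:
            found.append((match["ts"], match["old"], match["new"]))
    return found


# Sample lines copied from the VMKDB02U log in this issue.
sample = """\
2024-11-27 18:34:58.983: watchdog pid 1850759: LOG:  watchdog node state changed from [INITIALIZING] to [STANDING FOR LEADER]
2024-11-27 18:34:58.984: watchdog pid 1850759: LOG:  our stand for coordinator request is rejected by node "192.168.1.3:9999 Linux VMKDB03U"
2024-11-27 18:34:58.984: watchdog pid 1850759: LOG:  watchdog node state changed from [STANDING FOR LEADER] to [NETWORK ISOLATION]
"""

for ts, old, new in state_transitions(sample):
    print(f"{ts}  {old} -> {new}")
```

Lines that are not state changes, such as the rejection message, are skipped.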

VMKDB03U log

2024-11-27 18:37:12.547: watchdog pid 41337: LOG:  We are connected to leader node "192.168.1.1:9999 Linux VMKDB01U" and another node "192.168.1.2:9999 Linux VMKDB02U" is trying to become a leader
2024-11-27 18:37:13.126: sr_check_worker pid 41376: LOG:  get_query_result failed: status: -2
2024-11-27 18:37:13.126: sr_check_worker pid 41376: CONTEXT:  while checking replication time lag
2024-11-27 18:37:23.166: sr_check_worker pid 41376: LOG:  get_query_result failed: status: -2
2024-11-27 18:37:23.166: sr_check_worker pid 41376: CONTEXT:  while checking replication time lag

Watchdog Node Information
Node Name         : 192.168.1.3:9999 Linux VMKDB03U
Host Name         : 192.168.1.3
Delegate IP       : Not_Set
Pgpool port       : 9999
Watchdog port     : 9000
Node priority     : 1
Status            : 7
Status Name       : STANDBY
Membership Status : MEMBER

Node Name         : 192.168.1.1:9999 Linux VMKDB01U
Host Name         : 192.168.1.1
Delegate IP       : Not_Set
Pgpool port       : 9999
Watchdog port     : 9000
Node priority     : 3
Status            : 4
Status Name       : LEADER
Membership Status : MEMBER

Node Name         : 192.168.1.2:9999 Linux VMKDB02U
Host Name         : 192.168.1.2
Delegate IP       : Not_Set
Pgpool port       : 9999
Watchdog port     : 9000
Node priority     : 2
Status            : 12
Status Name       : NETWORK ISOLATION
Membership Status : MEMBER
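
To check which node each member reports as LEADER or NETWORK ISOLATION, the node listing above can be parsed into dictionaries. This is a rough sketch that assumes the `Key : Value` layout shown in this issue; `parse_node_blocks` is a hypothetical helper, not a Pgpool-II tool, and the sample keeps only three fields per node for brevity.

```python
def parse_node_blocks(text):
    """Split 'Key : Value' output into one dict per blank-line-separated block."""
    nodes, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current node block.
            if current:
                nodes.append(current)
                current = {}
            continue
        # Split at the first colon only, so values like "192.168.1.3:9999" stay intact.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        nodes.append(current)
    return nodes


# Subset of the fields shown in this issue.
sample = """\
Watchdog Node Information
Node Name         : 192.168.1.3:9999 Linux VMKDB03U
Host Name         : 192.168.1.3
Status Name       : STANDBY

Node Name         : 192.168.1.1:9999 Linux VMKDB01U
Host Name         : 192.168.1.1
Status Name       : LEADER

Node Name         : 192.168.1.2:9999 Linux VMKDB02U
Host Name         : 192.168.1.2
Status Name       : NETWORK ISOLATION
"""

nodes = parse_node_blocks(sample)
leaders = [n["Host Name"] for n in nodes if n["Status Name"] == "LEADER"]
isolated = [n["Host Name"] for n in nodes if n["Status Name"] == "NETWORK ISOLATION"]
print("leader:", leaders, "isolated:", isolated)
```
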
@pengbo0328
Collaborator

Thank you for reporting this issue.
Which version are you using? Does this issue occur in the latest version?

@meorkamil
Author

> Thank you for reporting this issue. Which version are you using? Does this issue occur in the latest version?

Our environment is set up as follows:

Postgres Version: postgresql16
PGPool-II Version: pgpool-II version 4.5.5 (hotooriboshi)
OS: Red Hat Enterprise Linux release 8.10 (Ootpa)

Yes, we are running the latest pgpool, 4.5.5.

@pengbo0328
Collaborator

I think it is the correct behavior.
VMKDB02U attempts to become the LEADER, but the request is rejected by VMKDB03U because VMKDB03U can connect to VMKDB01U, which is the current LEADER.
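
The rejection described here can be illustrated with a toy model. This is only a sketch of the rule as stated in this comment (a node rejects a candidate while it can still reach a valid leader), not Pgpool-II's actual election code; the function `vote_on_candidate` and the `links` set are hypothetical.

```python
def vote_on_candidate(voter, candidate, current_leader, links):
    """Toy rule: reject a candidate while the voter still reaches a different leader."""
    def reachable(a, b):
        return frozenset((a, b)) in links

    if (
        current_leader is not None
        and current_leader != candidate
        and reachable(voter, current_leader)
    ):
        return "REJECT"
    return "ACCEPT"


# Links that survive the DC1 <-> DC2 partition: both sites still reach AWS.
links = {
    frozenset(("VMKDB01U", "VMKDB03U")),
    frozenset(("VMKDB02U", "VMKDB03U")),
}

# VMKDB02U cannot see the leader, so it stands for leader and asks VMKDB03U.
reply = vote_on_candidate("VMKDB03U", "VMKDB02U", "VMKDB01U", links)
candidate_state = "NETWORK ISOLATION" if reply == "REJECT" else "LEADER"
print(reply, candidate_state)
```

In this model the witness in AWS is what prevents a second leader from appearing during the partition.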

> Problem
>
> We noticed that VMKDB02U changed its state to NETWORK ISOLATION, and clients connecting to our watchdog LEADER hung / buffered until network connectivity was restored, after which they resumed as normal.

The LEADER is VMKDB01U.
Do you mean the client cannot connect to the LEADER, VMKDB01U?
Is the client using a VIP to connect to pgpool? The VIP is normally assigned to the LEADER.

pengbo0328 assigned pengbo0328 and unassigned codeforall Dec 6, 2024
@meorkamil
Author

> I think it is the correct behavior.
> VMKDB02U attempts to become the LEADER, but the request is rejected by VMKDB03U because VMKDB03U can connect to VMKDB01U, which is the current LEADER.

Agreed. I just wanted to confirm the connectivity from the client side. We observed that pgpool was unresponsive during the NETWORK ISOLATION event on VMKDB02U, even though the client was connected via the VIP (assigned to the LEADER).

In the original post we were not using a VIP. Instead, we use service discovery to register the LEADER/STANDBY instances. The client always connects to the LEADER's host IP.

> The LEADER is VMKDB01U.
> Do you mean the client cannot connect to the LEADER, VMKDB01U?
> Is the client using a VIP to connect to pgpool? The VIP is normally assigned to the LEADER.

Moving forward, we tested with the VIP (sample configuration), where the client connected via VIP 192.168.1.4, but unfortunately, the same issue persisted.

Watchdog Cluster Information
Total Nodes              : 3
Remote Nodes             : 2
Member Remote Nodes      : 2
Alive Remote Nodes       : 1
Nodes required for quorum: 2
Quorum state             : QUORUM EXIST
Local node escalation    : NO
Leader Node Name         : 192.168.1.1:9999 Linux VMKDB01U
Leader Host Name         : 192.168.1.1

Watchdog Node Information
Node Name         : 192.168.1.3:9999 Linux VMKDB03U
Host Name         : 192.168.1.3
Delegate IP       : 192.168.1.4
Pgpool port       : 9999
Watchdog port     : 9000
Node priority     : 1
Status            : 7
Status Name       : STANDBY
Membership Status : MEMBER

Node Name         : 192.168.1.1:9999 Linux VMKDB01U
Host Name         : 192.168.1.1
Delegate IP       : 192.168.1.4
Pgpool port       : 9999
Watchdog port     : 9000
Node priority     : 100
Status            : 4
Status Name       : LEADER
Membership Status : MEMBER

Node Name         : 192.168.1.2:9999 Linux VMKDB02U
Host Name         : 192.168.1.2
Delegate IP       : 192.168.1.4
Pgpool port       : 9999
Watchdog port     : 9000
Node priority     : 2
Status            : 12
Status Name       : NETWORK ISOLATION
Membership Status : MEMBER
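
The quorum figures in the cluster information above can be checked by hand. The sketch below assumes a strict-majority rule and that the reporting node counts itself along with the remote nodes it sees as alive; both are assumptions that happen to match the numbers shown, and the function names are hypothetical.

```python
def nodes_required_for_quorum(total_nodes):
    """Strict majority of the cluster (assumes half of the votes is not enough)."""
    return total_nodes // 2 + 1


def quorum_exists(total_nodes, alive_remote_nodes):
    # The local node counts itself in addition to the remote nodes it can see.
    return 1 + alive_remote_nodes >= nodes_required_for_quorum(total_nodes)


# Values from the output above: Total Nodes 3, Alive Remote Nodes 1.
print(nodes_required_for_quorum(3))
print(quorum_exists(total_nodes=3, alive_remote_nodes=1))
```

Under these assumptions a node that sees no remote node at all would not hold quorum in a 3-node cluster.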

VMKDB02U Logs

2024-12-07 00:41:58.469: watchdog pid 1123869: LOG:  watchdog node state changed from [NETWORK ISOLATION] to [JOINING]
2024-12-07 00:41:58.469: watchdog pid 1123869: LOG:  watchdog node state changed from [JOINING] to [INITIALIZING]
2024-12-07 00:41:59.471: watchdog pid 1123869: LOG:  watchdog node state changed from [INITIALIZING] to [STANDING FOR LEADER]
2024-12-07 00:41:59.471: watchdog pid 1123869: LOG:  our stand for coordinator request is rejected by node "192.168.1.3:9999 Linux VMKDB03U"
2024-12-07 00:41:59.471: watchdog pid 1123869: DETAIL:  we might be in partial network isolation and cluster already have a valid leader
2024-12-07 00:41:59.471: watchdog pid 1123869: HINT:  please verify the watchdog life-check and network is working properly
2024-12-07 00:41:59.471: watchdog pid 1123869: LOG:  watchdog node state changed from [STANDING FOR LEADER] to [NETWORK ISOLATION]

VMKDB03U Logs

2024-12-07 00:40:59.469: watchdog pid 8407: LOG:  We are connected to leader node "192.168.1.1:9999 Linux VMKDB01U" and another node "192.168.1.2:9999 Linux VMKDB02U" is trying to become a leader

VMKDB01U Logs

2024-12-07 00:40:45.527: psql pid 1118963: DETAIL:  query: "INSERT INTO service (service_name) VALUES ('Sat Dec  7 00:40:45 +08 2024 VMKAPP01U');"
2024-12-07 00:40:45.530: psql pid 1118963: LOG:  Terminate message from frontend.
2024-12-07 00:40:45.539: child pid 1118951: LOG:  new connection received
2024-12-07 00:40:45.539: child pid 1118951: DETAIL:  connecting host=192.168.2.1 port=35754
2024-12-07 00:40:45.578: psql pid 1118951: LOG:  Query message from frontend.
2024-12-07 00:40:45.578: psql pid 1118951: DETAIL:  query: "SELECT * FROM service ORDER BY service_id DESC LIMIT 1;"
2024-12-07 00:40:45.583: psql pid 1118951: LOG:  Terminate message from frontend.
2024-12-07 00:40:47.342: pcp_main pid 1118971: LOG:  forked new pcp worker, pid=1122146 socket=6
2024-12-07 00:40:47.346: pcp_main pid 1118971: LOG:  PCP process with pid: 1122146 exit with SUCCESS.
2024-12-07 00:40:47.346: pcp_main pid 1118971: LOG:  PCP process with pid: 1122146 exits with status 0
2024-12-07 00:40:47.352: pcp_main pid 1118971: LOG:  forked new pcp worker, pid=1122158 socket=6
2024-12-07 00:40:47.353: pcp_main pid 1118971: LOG:  PCP process with pid: 1122158 exit with SUCCESS.
2024-12-07 00:40:47.353: pcp_main pid 1118971: LOG:  PCP process with pid: 1122158 exits with status 0
2024-12-07 00:40:47.488: heart_beat_sender pid 1118937: ERROR:  failed to send watchdog heartbeat, sendto failed
2024-12-07 00:40:47.488: heart_beat_sender pid 1118937: DETAIL:  sending packet to "192.168.1.2" failed with reason: "Operation not permitted"
2024-12-07 00:40:48.599: psql pid 1118961: LOG:  new connection received
2024-12-07 00:40:48.599: psql pid 1118961: DETAIL:  connecting host=192.168.2.1 port=35766
2024-12-07 00:40:49.488: heart_beat_sender pid 1118937: ERROR:  failed to send watchdog heartbeat, sendto failed
2024-12-07 00:40:49.488: heart_beat_sender pid 1118937: DETAIL:  sending packet to "192.168.1.2" failed with reason: "Operation not permitted"
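
The reason string "Operation not permitted" in the heartbeat errors above is the standard message for errno EPERM. On Linux a send call commonly fails with EPERM when a local firewall rule drops the outgoing packet, which would be consistent with the simulated DC1 <-> DC2 block, although the log alone does not prove it. A small sketch of that mapping, where `describe_send_failure` is a hypothetical helper:

```python
import errno
import os


def describe_send_failure(exc):
    """Map an OSError from a send call to a coarse, human-readable cause."""
    if exc.errno == errno.EPERM:
        return "blocked locally (for example by a firewall rule)"
    if exc.errno in (errno.ENETUNREACH, errno.EHOSTUNREACH):
        return "no route to the peer"
    return f"other error: {os.strerror(exc.errno)}"


# Build the same error a blocked send would raise, without touching the network.
err = OSError(errno.EPERM, os.strerror(errno.EPERM))
print(err.strerror, "->", describe_send_failure(err))
```
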
