Error when disconnecting the network on postgres and pgpool standby node #55

Open
hernanpezzani opened this issue Jun 19, 2024 · 17 comments

@hernanpezzani

hernanpezzani commented Jun 19, 2024

The PostgreSQL and Pgpool-II configuration is the following:
PostgreSQL 14.12 and Pgpool-II 4.4.7.
Node 0 is the primary PostgreSQL server (backend 0) and a standby pgpool node.
Node 1 is the standby PostgreSQL server (backend 1) and a standby pgpool node.
Node 2 is the pgpool LEADER holding the delegate_IP; it does not run PostgreSQL.
Watchdog is enabled with heartbeat.

PostgreSQL works perfectly: reads, writes, and connections through pgpool on port 9999 all work.

Detail of the failure:
When the network of Node 1 is TURNED DOWN (VMware network disconnect), the pgpool cluster delays connections to the database through the delegate_IP on port 9999. We have checked it, and it takes 1-2 minutes just to prompt for the password:
psql -h delegate_IP -p 9999 -U prueba -d prueba
after 1-2 min the password prompt appears
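
For reference, one way to quantify the delay is to time a trivial query (a minimal sketch; the user, database and delegate_IP placeholders are the same ones used above):

time psql -h delegate_IP -p 9999 -U prueba -d prueba -c 'SELECT 1'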

While the network is down on Node 1, direct access to PostgreSQL on Node 0 through port 5432 works correctly, replication between the nodes becomes asynchronous, and PostgreSQL operates as expected, but access through the delegate_IP on port 9999 is still delayed.

When the network is TURNED ON again, pgpool keeps delaying connections until the service is restarted on Node 2 (the pgpool LEADER).

PGPOOL STATE WHILE THE NETWORK IS DOWN:
3 3 YES nodo2.dominio.com:9999 Linux nodo2.dominio.com nodo2.dominio.com
nodo0.dominio.com:9999 Linux nodo0.dominio.com nodo0.dominio.com 9999 90000 7 STANDBY 0 MEMBER
nodo1.dominio.com:9999 Linux nodo1.dominio.com nodo1.dominio.com 9999 90000 8 LOST 0 MEMBER
nodo2.dominio.com:9999 Linux nodo2.dominio.com nodo2.dominio.com 9999 90000 4 LEADER 0 MEMBER
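
This status looks like the output of pcp_watchdog_info; assuming the default pcp port 9898, it can be obtained with something like:

pcp_watchdog_info -h delegate_IP -p 9898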

config pgpool
delegate_IP = '172.27.37.56'
if_cmd_path = '/sbin'
if_up_cmd = '/bin/sudo /sbin/ip addr add $IP$/24 dev ens192 label ens192:0'
if_down_cmd = '/bin/sudo /sbin/ip addr del $IP$/24 dev ens192'
arping_path = '/usr/sbin'
arping_cmd = '/bin/sudo /sbin/arping -U $IP$ -w 1 -I ens192'

load_balance_mode = off
connection_cache = on

backend_hostname0 = 'nodo0.dominio.com'
backend_port0 = 5432
backend_data_directory0 = '/postgresql/14/data/mdona-cpdm-pre-cluster07'
backend_weight0 = 1
backend_application_name0 = 'nodo0'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = 'nodo1.dominio.com'
backend_port1 = 5432
backend_data_directory1 = '/postgresql/14/data/mdona-cpdm-pre-cluster07'
backend_weight1 = 1
backend_application_name1 = 'nodo1'
backend_flag1 = 'ALLOW_TO_FAILOVER'

use_watchdog = on
wd_escalation_command = '/etc/pgpool-II/escalation.sh'
hostname0 = 'nodo0.dominio.com'
pgpool_port0 = 9999
wd_port0 = 9000
heartbeat_hostname0 = 'nodo0.dominio.com'
heartbeat_port0 = 9694
hostname1 = 'nodo1.dominio.com'
pgpool_port1 = 9999
wd_port1 = 9000
heartbeat_hostname1 = 'nodo1.dominio.com'
heartbeat_port1 = 9694
hostname2 = 'nodo2.dominio.com'
pgpool_port2 = 9999
wd_port2 = 9000
heartbeat_hostname2 = 'nodo2.dominio.com'
heartbeat_port2 = 9694

process_management_mode = dynamic
process_management_strategy = gentle
min_spare_children = 10
max_spare_children = 20

health_check_period = 5
health_check_timeout = 20
health_check_user = 'pgpool2'
health_check_password = ''
health_check_database = ''
health_check_max_retries = 0
health_check_retry_delay = 1
connect_timeout = 10000

num_init_children = 500
max_pool = 2

child_life_time = 120
child_max_connections = 0
connection_life_time = 120
client_idle_limit = 0

backend_clustering_mode = 'streaming_replication'
listen_addresses = '*'
port = 9999
socket_dir = '/tmp'

Hernan Pezzani

pengbo0328 self-assigned this Jun 19, 2024

@pengbo0328
Collaborator

@hernanpezzani Thank you for reporting this issue.

I tried to reproduce this issue, but I couldn't reproduce it.

Detail of the failure:
When the network of Node 1 is TURNED DOWN (VMware network disconnect), the pgpool cluster delays connections to the database through the delegate_IP on port 9999. We have checked it, and it takes 1-2 minutes just to prompt for the password:
psql -h delegate_IP -p 9999 -U prueba -d prueba
after 1-2 min the password prompt appears

Node 1 is the standby pgpool and the standby PostgreSQL.
Does this issue occur only when you TURN DOWN the network on Node 1?
Does the same issue occur if you TURN DOWN the network on other nodes?

I could not find the configuration of failover scripts.
Does pgpool manage failover of PostgreSQL?
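For reference, what I am looking for is something like the following in pgpool.conf (the script paths and placeholder arguments here are only an illustrative sketch, not your actual settings):

failover_command = '/etc/pgpool-II/failover.sh %d %h %p %D %m %H %M %P %r %R %N %S'
follow_primary_command = '/etc/pgpool-II/follow_primary.sh %d %h %p %D %m %H %M %P %r %R'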

Could you enable the debug log and reproduce this issue again?

log_min_messages = debug5

Please share:

  • pgpool.conf
  • the debug log
  • time when you turned down the network
  • time when you ran psql command

@hernanpezzani
Author

Hi @pengbo0328

Node 1 is the standby pgpool and the standby PostgreSQL. Yes.

Does this issue occur only when you TURN DOWN the network on Node 1?
Does the same issue occur if you TURN DOWN the network on other nodes?

We have not simulated that, because we focused on understanding why it failed when both standby roles (PostgreSQL and pgpool) are on the same node.
The network loss is simulated by disconnecting the virtual machine's network adapter (unchecking 'Connected'). When the services are stopped, or the node is restarted or shut down, the behavior is as expected.

I could not find the configuration of failover scripts.
Does pgpool manage failover of PostgreSQL? Yes

Could you enable the debug log and reproduce this issue again? Yes

time when you turned down the network: "approximately 5-10 min"

time when you ran psql command: "we try to connect every second from an external device"

We are preparing (waiting for the OK from security) a file with the .sh scripts, the .conf files, and only the debug5 logs of the leader pgpool node (node2). The file is about 50 MB; should we attach it to the issue?

Do you need the logs of all 3 nodes?

@hernanpezzani
Author

In the log with debug5 enabled, there are lines like the example below; what concerns me is the value after the "->" (marked in bold).

example: `postgresSELECT count() FROM pg_catalog.pg_proc AS p, pg_catalog.pg_namespace AS n WHERE p.proname = 'pg_read_file' AND n.oid = p.pronamespace AND n.nspname ~ '.' AND p.provolatile = 'v'postgres' -> '5564261919auUa8uyt9ceceb0626896e'.

Is that the user's password?

@tatsuo-ishii
Collaborator

Is that the user's password?

No.

@hernanpezzani
Author

I cannot upload files larger than 25 MB. If I split it with split -b 20M file.tar.gz "file.part", it does not allow me to upload the resulting extensions (partaa, etc.).
[screenshots: 2020-08-26 13_31_23 and 2020-08-26 16_21_38 — "Error del Servidor de Contenidos" (content server error)]
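
A possible workaround (an untested sketch, assuming GNU split and that the upload accepts .zip-suffixed names) would be to rename the parts after splitting:

split -b 20M file.tar.gz file.part
for f in file.part*; do mv "$f" "$f.zip"; done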

@pengbo0328
Collaborator

@hernanpezzani

We have not simulated that, because we focused on understanding why it failed when both standby roles (PostgreSQL and pgpool) are on the same node.
The network loss is simulated by disconnecting the virtual machine's network adapter (unchecking 'Connected'). When the services are stopped, or the node is restarted or shut down, the behavior is as expected.

I tried the same way to disconnect the network (unchecking 'Connected' in the VMware settings).
Then the pgpool node goes into LOST status, and the standby PostgreSQL goes into down status.

I cannot upload files larger than 25 MB. If I split it with split -b 20M file.tar.gz "file.part",
it does not allow me to upload the resulting extensions (partaa, etc.).

Thank you for preparing the debug log.
OK.
Because it is so large, could you share the default logs with "log_min_messages = warning"?
I will try to check whether the cause can be identified from the default logs.

If you could share the pgpool.conf and the scripts (the failover and escalation scripts), I will try to reproduce this issue and enable the debug logs myself.

@hernanpezzani
Author

[attachment: issue55]

Please rename the file to .zip.

@hernanpezzani
Author

https://drive.google.com/file/d/1Dso7q_hM6NlT2_7Ybjyd3nH0-nWyuY1t/view?usp=drive_link

Link to Google Drive.

@hernanpezzani
Author

Hello,
Have you seen any error in the configuration that could cause the failure when we lose the network on a standby node?
Can we contribute any log or local test?
Thanks

@pengbo0328
Collaborator

@hernanpezzani Sorry for the late response.

From your logs, the failover completed in one second, so I don't think the failover script is causing this issue.

Because the log file is large (1.3 GB), it is difficult to identify the child process you connected to with the psql command.
If you have a test environment, could you disable connections from other applications and connect to pgpool using only the psql command?

time when you ran psql command: "we try to connect every second from an external device"

Running psql every second will produce a lot of debug logs; could you please try it just once?

Also, it would be appreciated if you could let me know the time at which you connected to pgpool using the psql command.
That is helpful for identifying the child process connected to by the psql command.

I also noticed that wd_monitoring_interfaces_list caused pgpool to shut down on Node 1 when the network was turned down on Node 1. Could you disable this parameter and try again?

@hernanpezzani
Author

Hello

We are going to perform the tests by limiting the connection attempts to keep the log as simple as possible.

We are also going to run new tests with the parameter wd_monitoring_interfaces_list deactivated:
wd_monitoring_interfaces_list = ''

@hernanpezzani
Author

Hello,
With the parameter wd_monitoring_interfaces_list = '' the error is still reproduced.

For the test of disabling all connections, we changed the pgpool listening port to 9998 to avoid any external connection. With all application connections removed, making connections only from node3 and from an external node, everything was OK.

We changed the port back to 9999 so that the application connections (about 220) could come in again, and when we did this the failure occurred again and the connection took about 2 minutes.

We have run the test in another cluster with less load and the error was not reproduced, but it is reproduced in clusters under load: of the roughly 35 clusters we currently have, about 15 are affected by this issue.

@pengbo0328
Collaborator

@hernanpezzani Thank you for testing.
I will try to increase the load and see if it can be reproduced.
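
A sketch of how comparable load could be generated with pgbench through the delegate_IP (assuming a pgbench-initialized database named prueba; the client count and duration are only illustrative):

pgbench -i -h delegate_IP -p 9999 -U prueba prueba
pgbench -h delegate_IP -p 9999 -U prueba -c 200 -j 4 -T 600 prueba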

@hernanpezzani
Author

@pengbo0328 If there is any test or configuration change we can collaborate on, and it is within our reach, we will do it.

@pengbo0328
Collaborator

pengbo0328 commented Aug 19, 2024

@hernanpezzani Could you check if the same issue occurs when setting the following parameters?

process_management_mode = static
reserved_connections = 1

The settings above enable the static process_management_mode and make pgpool return an error message when the number of connections reaches the maximum.
It is possible that new connections were hanging because there were no idle child processes.

If the same issue occurs, could you run the following command to display the status of the connections?

pcp_proc_info -p 9898 -h <delegate_IP>
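
A full invocation would look something like this (a sketch, assuming a pcp user named pgpool is registered in pcp.conf; adjust to your environment):

pcp_proc_info -h <delegate_IP> -p 9898 -U pgpool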

@hernanpezzani
Author

hernanpezzani commented Aug 20, 2024

Hi @pengbo0328
We have changed the configuration and performed the test.
At about 12:56 we disconnected the network, and the instability lasted only about 2 minutes. After that, the cluster is up, the standby node is down, and the pgbench test keeps running. We observed errors in the log, but we cannot tell whether they come from the network downtime or from the connected applications.

Attached: log_prueba.txt

@hernanpezzani
Author

hernanpezzani commented Aug 20, 2024

Can you confirm whether log_prueba.txt was uploaded?
