Error when disconnecting the network on postgres and pgpool standby node #55
@hernanpezzani Thank you for reporting this issue. I tried to reproduce it, but I could not.
Is Node1 the standby pgpool and the standby postgres? I could not find the configuration of the failover scripts. Could you enable the debug log and reproduce this issue again?
Please share:
- the time when you turned down the network
- the time when you ran the psql command
Hi @pengbo0328

> Is Node1 the standby pgpool and the standby postgres?

Yes.

> Does this issue only occur when you TURN DOWN the network on Node1?

We have not simulated other cases because we focused on seeing why it fails when the node is standby for both postgres and pgpool.

> I could not find the configuration of the failover scripts. Could you enable the debug log and reproduce this issue again?

Yes.
- Time when we turned down the network: 5-10 min approximately.
- Time when we ran the psql command: every 1 sec we try to connect from an external device.

We are preparing (waiting for the OK from security) the file with the scripts (.sh), the .conf files and only the logs with debug5 of the primary pgpool node (node2). The file size is about 50 MB; shall we attach it to the issue? Do you need the logs of all 3 nodes?
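For reference, the probe described ("every 1 sec we try to connect from an external device") can be reproduced with a small shell loop. This is only a sketch: the delegate_IP, port, user and database below are the ones mentioned in this issue, while the password is a placeholder, and PGCONNECT_TIMEOUT is only used to bound each attempt.

```sh
#!/bin/bash
# Try a connection through pgpool once per second and log how long each attempt takes.
export PGPASSWORD='prueba'      # placeholder; use the real password or a ~/.pgpass entry
export PGCONNECT_TIMEOUT=30     # bound each attempt so the loop keeps moving
while true; do
    start=$(date +%s)
    if psql -h 172.27.37.56 -p 9999 -U prueba -d prueba -c 'SELECT 1' >/dev/null 2>&1; then
        status=ok
    else
        status=failed
    fi
    echo "$(date '+%H:%M:%S') ${status} $(( $(date +%s) - start ))s"
    sleep 1
done
```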
In the log with debug5 enabled there are lines with the format of the example below. Is what comes after the "->" (marked in bold) the user's password?
`postgresSELECT count(*) FROM pg_catalog.pg_proc AS p, pg_catalog.pg_namespace AS n WHERE p.proname = 'pg_read_file' AND n.oid = p.pronamespace AND n.nspname ~ '.' AND p.provolatile = 'v'postgres' -> '5564261919auUa8uyt9ceceb0626896e'`
No.
I tried the same way to disconnect the network (unchecking "Connected" in the VMware settings).
Thank you for preparing the debug log. If you could share the pgpool.conf and the scripts (failover and escalation scripts), I will try to reproduce this issue and enable the debug logs myself.
Hello
@hernanpezzani Sorry for the late response. From your logs, because the failover completed in one second, I don't think the failover script causes this issue. Because the log file is large (1.3 GB), it is difficult to identify the child process which you connected to using psql.
Also, it would be appreciated if you could let me know the time at which you connected to pgpool using psql. I also noticed that wd_monitoring_interfaces_list is set.
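As an aside, narrowing a log of that size down to a single child process is usually a matter of grepping for the child's PID. The sketch below assumes the default pgpool log_line_prefix ('%t: pid %p: '); the timestamp and PID are placeholders.

```sh
# Find the lines around the time the psql connection arrived, note the child PID there,
# then keep only that child's lines for analysis.
grep '2024-06-01 10:15' pgpool.log | less
grep 'pid 12345' pgpool.log > child_12345.log
```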
Hello. We are going to perform the tests limiting the connection attempts to keep the log as simple as possible. We are also going to perform new tests with the parameter wd_monitoring_interfaces_list deactivated.
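For reference, deactivating the watchdog's interface monitoring mentioned above is a single pgpool.conf change; an empty list means no interfaces are monitored.

```
# pgpool.conf -- disable watchdog network-interface monitoring
wd_monitoring_interfaces_list = ''
```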
Hello,
For the test of disabling all connections, we modified the pgpool listening port to 9998 to avoid any external connection. With all application connections removed and only connections made from node3 and from an external node, everything was OK. We changed the port back to 9999 to let the application connections come in again (about 220), and when we did this the failure occurred again and the connection took about 2 minutes.
We have run the same test in another cluster with less load and the error was not reproduced, but in clusters with load it is also reproduced: of the roughly 35 clusters we have at the moment, about 15 are affected by this case.
@hernanpezzani Thank you for testing. |
@pengbo0328 If you need any test or configuration change that we can collaborate on, we will do it if it is within our reach.
@hernanpezzani Could you check if the same issue occurs when setting the following parameters?
The settings above enable static process_management_mode and display an error message when the number of connections reaches the maximum. If the same issue occurs, could you run the following command to display the status of connections?
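The exact parameters and command were not preserved in this thread. A plausible reading of the description (static process management plus an error when connections reach the maximum) is sketched below; these values and the SHOW command are assumptions, not the maintainer's verbatim suggestion.

```
# pgpool.conf -- assumed settings matching the description above
process_management_mode = static
reserved_connections = 1    # reject with "Sorry, too many clients already" instead of blocking
```

```sh
# One way to display pgpool connection status over a normal pgpool connection (assumed command)
psql -h 172.27.37.56 -p 9999 -U prueba -d prueba -c 'show pool_pools;'
```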
Hi @pengbo0328, attached: log_prueba.txt
Can you confirm whether log_prueba.txt was uploaded?
The postgres and pgpool configuration is the following:
PostgreSQL version 14.12 and pgpool 4.4.7.
Node 0 is the primary postgresql server (backend 0) and a standby pgpool node.
Node 1 is the standby postgresql server (backend 1) and a standby pgpool node.
Node 2 is the pgpool LEADER with the delegate_IP up; it does not run postgresql.
Watchdog is enabled with heartbeat.
Postgres works perfectly: reads and writes work, and pgpool works when connecting on port 9999.
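For completeness, the backend status that pgpool reports can be checked with the SHOW commands over a normal pgpool connection; a sketch using the delegate_IP and credentials that appear in this report:

```sh
# Ask pgpool (port 9999) which backend is primary/standby and whether both are attached
psql -h 172.27.37.56 -p 9999 -U prueba -d prueba -c 'show pool_nodes;'
```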
Detail of the failure:
When the network of Node 1 is TURNED DOWN (VMware disconnect network), the pgpool cluster delays connections to the database through the delegate_IP and port 9999. We have checked it, and it takes 1-2 minutes to ask for the password.
psql -h delegate_IP -p 9999 -U prueba -d prueba
After 1-2 minutes it prompts for the password.
While the network is DOWN on node 1, direct access to node 0's postgresql through port 5432 is correct, replication between the nodes becomes asynchronous and postgres operation is as expected, but access through the delegate_IP on port 9999 is still delayed.
When the network is TURNED ON again, the pgpool service keeps delaying connections until the service is restarted on Node 2 (PGPOOL LEADER).
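A simple way to see the delay described above is to time a direct backend connection against a connection through the delegate_IP. This is a sketch; the password is a placeholder and is supplied via PGPASSWORD only so the prompt does not pause the measurement.

```sh
# Compare a direct connection to the primary backend with a connection through pgpool
# while node 1's network is down.
export PGPASSWORD='prueba'   # placeholder
time psql -h nodo0.dominio.com -p 5432 -U prueba -d prueba -c 'SELECT 1'   # direct: fast
time psql -h 172.27.37.56     -p 9999 -U prueba -d prueba -c 'SELECT 1'   # via pgpool: delayed
```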
PGPOOL STATE NETWORK DOWN
3 3 YES nodo2.dominio.com:9999 Linux nodo2.dominio.com nodo2.dominio.com
nodo0.dominio.com:9999 Linux nodo0.dominio.com nodo0.dominio.com 9999 90000 7 STANDBY 0 MEMBER
nodo1.dominio.com:9999 Linux nodo1.dominio.com nodo1.dominio.com 9999 90000 8 LOST 0 MEMBER
nodo2.dominio.com:9999 Linux nodo2.dominio.com nodo2.dominio.com 9999 90000 4 LEADER 0 MEMBER
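The status listing above appears to be pcp_watchdog_info output. For reference, a verbose listing of this kind can typically be obtained as sketched below; the PCP port 9898 (pgpool's default) and the PCP user name are assumptions, as they do not appear in this issue.

```sh
# Watchdog cluster status as seen from the current leader node
pcp_watchdog_info -h nodo2.dominio.com -p 9898 -U pcpadmin -v
```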
config pgpool
delegate_IP = '172.27.37.56'
if_cmd_path = '/sbin'
if_up_cmd = '/bin/sudo /sbin/ip addr add $IP$/24 dev ens192 label ens192:0'
if_down_cmd = '/bin/sudo /sbin/ip addr del $IP$/24 dev ens192'
arping_path = '/usr/sbin'
arping_cmd = '/bin/sudo /sbin/arping -U $IP$ -w 1 -I ens192'
load_balance_mode = off
connection_cache = on
backend_hostname0 = 'nodo0.dominio.com'
backend_port0 = 5432
backend_data_directory0 = '/postgresql/14/data/mdona-cpdm-pre-cluster07'
backend_weight0 = 1
backend_application_name0 = 'nodo0'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = 'nodo1.dominio.com'
backend_port1 = 5432
backend_data_directory1 = '/postgresql/14/data/mdona-cpdm-pre-cluster07'
backend_weight1 = 1
backend_application_name1 = 'nodo1'
backend_flag1 = 'ALLOW_TO_FAILOVER'
use_watchdog = on
wd_escalation_command = '/etc/pgpool-II/escalation.sh'
hostname0 = 'nodo0.dominio.com'
pgpool_port0 = 9999
wd_port0 = 9000
heartbeat_hostname0 = 'nodo0.dominio.com'
heartbeat_port0 = 9694
hostname1 = 'nodo1.dominio.com'
pgpool_port1 = 9999
wd_port1 = 9000
heartbeat_hostname1 = 'nodo1.dominio.com'
heartbeat_port1 = 9694
hostname2 = 'nodo2.dominio.com'
pgpool_port2 = 9999
wd_port2 = 9000
heartbeat_hostname2 = 'nodo2.dominio.com'
heartbeat_port2 = 9694
process_management_mode = dynamic
process_management_strategy = gentle
min_spare_children = 10
max_spare_children = 20
health_check_period = 5
health_check_timeout = 20
health_check_user = 'pgpool2'
health_check_password = ''
health_check_database = ''
health_check_max_retries = 0
health_check_retry_delay = 1
connect_timeout = 10000
num_init_children = 500
max_pool = 2
child_life_time = 120
child_max_connections = 0
connection_life_time = 120
client_idle_limit = 0
backend_clustering_mode = 'streaming_replication'
listen_addresses = '*'
port = 9999
socket_dir = '/tmp'
Hernan Pezzani