Wrapper with Connection Pooling #256

marlongionazwift · 2022-11-01T14:39:52Z

marlongionazwift
Nov 1, 2022

Hello!

I'm testing the wrapper with Hikari, and what I'm seeing is that the wrapper is working as expected with the connection I'm using in the moment the failover happens, but all the other connections in the pool will remain pointing to the "cluster endpoint" and will have to wait for DNS to update in order to work. Is this really what we should expect or am I probably doing something wrong ?

Just wanted to confirm because, if that's true, we won't be able to really achieve a "fast failover" when using connection pooling with the wrapper.

Regards,
Marlon

Answered by davecramer

Nov 1, 2022

@marlongionazwift Thanks for the report. I don't think you are doing anything wrong. I'm trying to figure out how to deal with this scenario though. We almost need to track which connections Hikari has open in the driver and either 1) invalidate all of them or 2) figure out how to fail them all over. Thinking this some more we would need to invalidate all of them as the session would need to be reset. Interesting!

View full answer

davecramer · 2022-11-01T23:59:56Z

davecramer
Nov 1, 2022
Maintainer

@marlongionazwift Thanks for the report. I don't think you are doing anything wrong. I'm trying to figure out how to deal with this scenario though. We almost need to track which connections Hikari has open in the driver and either 1) invalidate all of them or 2) figure out how to fail them all over. Thinking this some more we would need to invalidate all of them as the session would need to be reset. Interesting!

0 replies

marlongionazwift · 2022-11-03T12:40:51Z

marlongionazwift
Nov 3, 2022
Author

Hello, @davecramer, thanks for your response!

Regarding your first point: I was able to configure a validation query in Hikari to evict "read-only" connections, but that does not solve the problem yet, since the new connections will be opened pointing to the cluster endpoint and we will continue depending on the DNS refresh.

Regarding point 2: It seems that failing over all the current connections would solve the problem just partially, because new opened connections would still depend on DNS refresh

2 things I thought about:

Maybe we could do something with the "read only transaction error" we receive in the connections still pointing to the old writer? This event could be also a trigger to make the replacement of the physical connection in the wrapper.
What if the wrapper always started the physical connections already pointing to the writer instance (instead of doing that only in the event of a failover) and it also kept monitoring all connections in order to identify (and replace) when they are pointing to the wrong/old writer?

Regards,
Marlon

0 replies

sergiyvamz · 2022-11-04T17:00:34Z

sergiyvamz
Nov 4, 2022
Maintainer

Hello @marlongionazwift, @davecramer

I wanted to bring some ideas and I hope they might help us to choose a right direction.

Usually, connection pools perform a connection check before returning it to a user application. I can assume that such a validity check may happen when the failover process at the DB cluster isn't yet over. I'd suspect that in such a case it can trigger the failover process in the driver and lead into, sooner or later, a valid connection.

The other possible scenario is when such a validity check happens after the DB cluster failover is over. This case is a bit tricky because it's not clear whether a physical connection to a database node has survived or not. I'd expect that failover on the DB cluster will close all open connections since the DB node should be reconfigured with a new role and needs to be restarted. The validity check on a closed connection leads to connection eviction from a connection pool. The connection pool can move to another idle connection and eventually return a valid connection to a user app. If the physical connection is survived over DB failover (and I'm quite dubious about probability of such scenario), a user application may be getting a valid/healthy connection to the same node it was connected before DB failover. However, there's a high chance that a role of the node has changed and that may cause dramatic side effects.

A quick summary of the cases I mentioned above:

Connection pool is requested a connection from the pool.
Connection pool calls isValid() on an idle candidate connection.
1) Failover on the DB cluster is in progress; candidate connection starts failover process and reconnects to a proper node when it gets available.
2) Failover on the DB cluster is over and nodes have changed their roles.
a) Physical connection to a node is dead; connection pool evicts this candidate connection from the pool and continues with another one.
b) Physical connection is still alive (hardly possible); user app gets a healty connection to a node that has changed its role.
3) There's no connections in the pool and new connection needs to be open.

It seems to me that cases 1) and 2a) require no correction and they can be handled properly by both connection pool and a user app.

Case 2b) is interesting. As a quick solution I'd suggest to extend original logic of connection health check with an additional node role check. It could be easily done with a new plugin, or with an update to existing Failover plugin. Depending of the node role, connection could be invalidated that practically brings us to the case 2a).

All mentioned above is my understanding of how things work and they need a practical confirmation.

As for opening new connections with a cluster endpoint, related to case 3), there's a new plugin auroraStaleDns' available (please check the latest snapshot build). It supports cluster endpoints and verifies that an opened connection is actually established to a writer node. If needed, connection is disposed and a new one is opened to a proper node. The plugin mitigates the problem with DNS stale data. I hope that might be helpful.

About invalidating all affected connections in a pool. More investigation is needed to determine if this is a robust solution, however it depends on a particular connection pool and API it provides. I'd like to see a public method that accept a list of connections to evict from the pool, or a method to evict all connections that match some criteria like connection url. I'm not sure if any of such public API exists in popular connection pool implementations.

0 replies

davecramer · 2022-11-07T12:40:20Z

davecramer
Nov 7, 2022
Maintainer

From what I can tell HikariCP does not do anything onBorrow() it does have a keepAlive setting where it will call isValid() periodically.
This will make 2(b) very possible in my opinion. However the client shouldn't assume that the connection is still in the state they left it in when they returned it to the pool. So I think that is OK.

0 replies

marlongionazwift · 2022-11-10T19:29:53Z

marlongionazwift
Nov 10, 2022
Author

Hello, @davecramer and @sergiyvamz!

I tested the "auroraStaleDns" and its working very well. With that I was able to recover pretty nicely (and fast) from a failover.

I said that "other connections in the pool remain poiting to the cluster endpoint" but that's not really what is happening (since after the failover all connections are dead, like @sergiyvamz said), what happened is that new connections were created before DNS refresh, so they were pointing to the old writer.

If I understood correctly, those new connections are being created because it seems that the "connection override class" for Hikari is not being called when executing the validation query, only when executing a normal query from the client application, making the broken connections to be evicted from the pool and making the pool create new connections (which will be pointing to the wrong instance).

2 replies

marlongionazwift Nov 10, 2022
Author

It seems that using "isValid" as validation query, the connection is being discarded without even calling the "failover" routine of the wrapper. When specifying a query like "SELECT 1" (connectionTestQuery), then the failover is being called in the wrapper, but the "connection override class" is not being used so the connection is being evicted anyway.

davecramer Nov 10, 2022
Maintainer

Good to know, thanks for the feedback!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrapper with Connection Pooling #256

{{title}}

Replies: 5 comments 2 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Wrapper with Connection Pooling #256

marlongionazwift Nov 1, 2022

Replies: 5 comments · 2 replies

davecramer Nov 1, 2022 Maintainer

marlongionazwift Nov 3, 2022 Author

sergiyvamz Nov 4, 2022 Maintainer

davecramer Nov 7, 2022 Maintainer

marlongionazwift Nov 10, 2022 Author

marlongionazwift Nov 10, 2022 Author

davecramer Nov 10, 2022 Maintainer

marlongionazwift
Nov 1, 2022

Replies: 5 comments 2 replies

davecramer
Nov 1, 2022
Maintainer

marlongionazwift
Nov 3, 2022
Author

sergiyvamz
Nov 4, 2022
Maintainer

davecramer
Nov 7, 2022
Maintainer

marlongionazwift
Nov 10, 2022
Author

marlongionazwift Nov 10, 2022
Author

davecramer Nov 10, 2022
Maintainer