
JDBC connection does not reset #18685

Closed
silviu-negoita opened this issue Jul 14, 2021 · 23 comments
Labels
area/agroal, kind/bug (Something isn't working), triage/needs-feedback (We are waiting for feedback)

Comments

@silviu-negoita

silviu-negoita commented Jul 14, 2021

Describe the bug

  1. The application is up, with an SQS collector started at boot via ManagedExecutor:
 void onStart(@Observes(during = AFTER_COMPLETION) StartupEvent ev) {
        managedExecutor.execute(this::startProcessing);
    }

Somewhere down the line, I do some DB selects / inserts.
  2. Reset the database. Reproduced in the AWS cloud, but also locally with a Docker Postgres container. At the end of this step the database is up, ready, and accepting new connections.
  3. Got this stack trace:

Caused by: org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:350)
	at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:481)
	at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:401)
	at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:164)
	at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:114)
	at io.agroal.pool.wrapper.PreparedStatementWrapper.executeQuery(PreparedStatementWrapper.java:78)
	at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.extract(ResultSetReturnImpl.java:57)
	... 54 more
Caused by: java.io.EOFException
	at org.postgresql.core.PGStream.receiveChar(PGStream.java:445)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2057)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:323)
	... 60 more

  4. The app is still up and receives a new SQS event in the method from step 1.
  5. The app then fails on any DB interaction with the following:

Caused by: org.hibernate.exception.JDBCConnectionException: could not prepare statement
	at org.hibernate.exception.internal.SQLStateConversionDelegate.convert(SQLStateConversionDelegate.java:112)
	at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:42)
	at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:113)
	at org.hibernate.engine.jdbc.internal.StatementPreparerImpl$StatementPreparationTemplate.prepareStatement(StatementPreparerImpl.java:186)
	at org.hibernate.engine.jdbc.internal.StatementPreparerImpl.prepareQueryStatement(StatementPreparerImpl.java:151)
	at org.hibernate.loader.Loader.prepareQueryStatement(Loader.java:2103)
	at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:2040)
	at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:2018)
	at org.hibernate.loader.Loader.doQuery(Loader.java:948)
	at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:349)
	at org.hibernate.loader.Loader.doList(Loader.java:2849)
	at org.hibernate.loader.Loader.doList(Loader.java:2831)
	at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2663)
	at org.hibernate.loader.Loader.list(Loader.java:2658)
	at org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:506)
	at org.hibernate.hql.internal.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:400)
	at org.hibernate.engine.query.spi.HQLQueryPlan.performList(HQLQueryPlan.java:219)
	at org.hibernate.internal.SessionImpl.list(SessionImpl.java:1414)
	at org.hibernate.query.internal.AbstractProducedQuery.doList(AbstractProducedQuery.java:1625)
	at org.hibernate.query.internal.AbstractProducedQuery.list(AbstractProducedQuery.java:1593)
	... 42 more
Caused by: org.postgresql.util.PSQLException: This connection has been closed.
	at org.postgresql.jdbc.PgConnection.checkClosed(PgConnection.java:877)
	at org.postgresql.jdbc.PgConnection.prepareStatement(PgConnection.java:1734)
	at org.postgresql.jdbc.PgConnection.prepareStatement(PgConnection.java:428)
	at io.agroal.pool.wrapper.ConnectionWrapper.prepareStatement(ConnectionWrapper.java:659)
	at org.hibernate.engine.jdbc.internal.StatementPreparerImpl$5.doPrepare(StatementPreparerImpl.java:149)
	at org.hibernate.engine.jdbc.internal.StatementPreparerImpl$StatementPreparationTemplate.prepareStatement(StatementPreparerImpl.java:176)
	... 58 more

Expected behavior

Database connections are restored once the DB is up again.

Actual behavior

It fails indefinitely; it only recovers after the app is restarted.

Output of java -version

openjdk 11.0.11 2021-04-20

Quarkus version or git rev

1.13.7

Build tool (ie. output of mvnw --version or gradlew --version)

maven

@silviu-negoita silviu-negoita added the kind/bug Something isn't working label Jul 14, 2021
@stuartwdouglas
Member

Are you getting a new connection from the pool each time or are you holding onto one for the life of the app?

@silviu-negoita
Author

silviu-negoita commented Jul 15, 2021

Are you getting a new connection from the pool each time or are you holding onto one for the life of the app?

  1. I'm not doing anything specific related to the DB/JDBC; everything happens behind the scenes. I'm not even in a transaction, and it fails on a select operation.
  2. Not sure how to see the actual DB connections. Is that doable via debug?

@gsmet
Member

gsmet commented Jul 21, 2021

Maybe you could try adding quarkus.datasource.jdbc.validation-query-sql=select 1 to your application.properties?
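For reference, a minimal application.properties sketch of that suggestion, with background validation added on top (the interval value is only an illustration, not something from this thread):

# validate idle connections periodically so broken ones are dropped from the pool
quarkus.datasource.jdbc.validation-query-sql=select 1
quarkus.datasource.jdbc.background-validation-interval=30S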

@silviu-negoita
Author

silviu-negoita commented Jul 29, 2021

Hi again! @gsmet I tried your fix, but there is no difference.
Meanwhile, I found a better way to reproduce. While the DB connection is down, I stress the application with events (which hit the DB). After a while, the application stops receiving events altogether. I attached a thread dump for this case, and also one taken after a restart, when the app is working normally.
mem_dumps.zip

@silviu-negoita
Author

Also, I see something similar here: #15025. I can reproduce the issue on 2.0.3.Final.

@apat59

apat59 commented Sep 6, 2021

Hi,
I'm also hitting this since exactly 1.13.2, and it is still happening on 2.2.1.
However, I'm not restarting the DB; this happens almost every time I roll out a new deployment/pod.
The flow is the same:
1- Agroal creates the conn
2- conn acquired, used, returned to the pool many times (in my case by Quartz, at least 20 times) without any problem
3- conn acquired, that EOF exception happens, but the conn is returned, not destroyed
4- conn acquired, and TransactionImple.getStatus: javax.transaction.Status.STATUS_MARKED_ROLLBACK
5- many problems may happen here: conn closed, "Deferred enlistment not supported", tx not started
6- only here does "Destroyed connection org.postgresql.jdbc.PgConnection" happen; isn't it too late?

Many connections suffer this exact same flow, very often at startup but also when we set max-lifetime.

Generally that goes with corrupted metrics: as soon as this happens, the active count will be < 0.
One way to reproduce it outside of startup quite easily is to set a short max-lifetime; the problem will happen over and over (see the config sketch below).

I've been unable to implement a reproducer. I'm in touch with Luis in private.
I have a shared gdoc and TRACE logs I can share only with Red Hatters.

This looks very specific to the cloud environment, and likely to the particular cloud DB server product/setup.
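For reference, a minimal application.properties sketch of the max-lifetime reproduction described above (the value is illustrative only, not taken from this report):

# a short max-lifetime forces connections to be recycled frequently,
# which makes the flow described above repeat over and over
quarkus.datasource.jdbc.max-lifetime=2M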

@jonsalvas

jonsalvas commented Nov 24, 2021

I can confirm I am also facing this issue with Quarkus 2.0.3.Final in a Kubernetes environment, with HPA scaling pods up/down based on load.

@apat59

apat59 commented Nov 25, 2021

@jonsalvas out of curiosity, do you know the details about the DB setup and any extra pooling mechanism?

@hleb-kastseika

Any updates on this issue?

@maxandersen
Member

From a Reddit thread highlighting the issue:

I was able to reproduce the issue, but it doesn't seem very serious to me.

  1. Start the Quarkus application, send a few requests.
  2. Stop Postgres.
  3. Send requests, observe "PSQLException: This connection has been closed." errors.
  4. Start Postgres.
  5. Send requests, keep observing connection closed errors for about 30 seconds.
  6. Normal behavior resumes.

My application configuration doesn't have anything special regarding database connection validation or maximum age.

In my opinion the only scenario where this is an issue is when database connections are frequently interrupted.

@apat59

apat59 commented Feb 19, 2022

@maxandersen
Hi mate (Anthony writing),
Since:

  • the symptom appeared starting exactly with 1.13.7
  • the problem does not happen with another app built on plain Vert.x + c3p0 pooling hitting the same DB cluster

I'm pretty sure there is something really tricky and specific.
In my case, the DB is an Amazon prod cluster and the apps are deployed on OpenShift.
I do suspect some specific Amazon config, but we do not administer the DB, so we haven't managed to get any clue.

I managed to reduce this "pseudo" leak; I'll paste my settings tomorrow.
I'm also in touch with Luis and I send him info as soon as I have it.

@maxandersen
Member

@apat59 thanks for the info - you mention Luis; Luis who? (there are a handful of those around :)

@maxandersen
Member

Also @apat59, what is the latest version of Quarkus you tested this against? 1.13.7 is quite old by now :)

@maxandersen
Member

So I tried to reproduce this locally, and so far it works as I expect: when the DB is down you get errors; when the DB comes back, the pool eventually recovers and everything works again.

See https://github.com/maxandersen/quarkus18685reproduce for reproducer.

This is on Quarkus 2.7.1.Final.

@stuartwdouglas
Member

Can you try with agroal/agroal#48 ?

With this patch I see things resolving immediately after the DB is back up.

@quarkus-bot

quarkus-bot bot commented Feb 22, 2022

/cc @Sanne, @barreiro, @yrodiere

@apat59

apat59 commented Feb 22, 2022

Hi,
More on our own issue here. Please note I'm not totally sure it's the same issue as this one.
We've been stuck with 1.13.0 for a long time because of this symptom,
until 2.2.1.Final and a mix of properties which limited the "problem".
Note, we never managed to reproduce this in dev; it only happened with OpenShift + Aurora DBs.

quarkus.datasource.db-kind=postgresql
quarkus.datasource.health.enabled=true
quarkus.datasource.metrics.enabled=true
quarkus.transaction-manager.default-transaction-timeout=60
quarkus.datasource.jdbc.min-size=6
quarkus.datasource.jdbc.initial-size=6
# insane but temporary required
quarkus.datasource.jdbc.max-size=70
quarkus.datasource.jdbc.background-validation-interval=15S
quarkus.datasource.jdbc.validation-query-sql=select 1;
quarkus.datasource.jdbc.acquisition-timeout=5
quarkus.datasource.jdbc.idle-removal-interval=5M
quarkus.datasource.jdbc.detect-statement-leaks=true
quarkus.datasource.jdbc.new-connection-sql=select 1;
# NOTE, we may have to change that in the future
quarkus.datasource.jdbc.driver=org.postgresql.ds.PGSimpleDataSource

As soon as I can (heavy week here), I'll re-enable the config that used to trigger the problem:

#quarkus.datasource.jdbc.max-lifetime=150
#quarkus.datasource.jdbc.detect-statement-leaks=true

and update to the latest Quarkus release and deploy to QA, where we have Grafana enabled and can monitor the pool.
I'll let you know.

Btw Luis is Luis Barreiro :)

@maxandersen
Member

@apat59 sounds great.

If you can try with Stuart's patch that would be great, but if not, look at what error codes you get and whether any of them start with 08 (as listed in https://www.postgresql.org/docs/13/errcodes-appendix.html); for those at least you should see faster recovery.

@barreiro
Contributor

@stuartwdouglas @maxandersen from what I can understand, the problem is not just about error codes but about how the Amazon Aurora service does the DB failover.

I don't know the internals, but it looks like there is a "primary" that can fail, and when it does Agroal flushes the connections (based on the SQL error code) and new connections are created to a read-only "replica". Agroal expects that the "replica" instance will eventually become the "primary" at some point in the future (once the recovery procedure is complete), and because of that the connections to the replica are not flushed. Aurora, however, seems to keep the "replica" in read-only mode and bring the "primary" back up instead, which is not the expected behavior.

From looking around in the Amazon product documentation, they seem to acknowledge their failover problem because they do suggest a couple of 'solutions' to deal with it.

We can think of ways to handle this failover on our side, but a first step (that can be useful in other scenarios as well) would be to somehow expose the AgroalDataSource.flush(FlushMode) operation in Quarkus, to allow the pool to be flushed without restarting the application.
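For illustration, a minimal sketch of what exposing that operation could look like with the current APIs: the Agroal datasource is injected directly and flushed from a hypothetical admin endpoint (the resource class and path are made up; only AgroalDataSource.flush(FlushMode) is the actual Agroal API):

import io.agroal.api.AgroalDataSource;

import javax.inject.Inject;
import javax.ws.rs.POST;
import javax.ws.rs.Path;

@Path("/admin/datasource")
public class DataSourceFlushResource {

    @Inject
    AgroalDataSource dataSource; // Quarkus exposes the default Agroal pool as a bean

    @POST
    @Path("/flush")
    public String flush() {
        // GRACEFUL retires connections as they are returned to the pool;
        // FlushMode.ALL would close them immediately.
        dataSource.flush(AgroalDataSource.FlushMode.GRACEFUL);
        return "flush requested";
    }
}

Calling such an endpoint after an Aurora failover would let the pool re-establish connections to the new primary without restarting the application.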

@maxandersen
Member

Yeah, this sounds like a separate issue that I suggest you open separately.

Looks like Agroal might want to learn some tricks to deal with Amazon Aurora's special failover logic.

@apat59

apat59 commented Feb 24, 2022

Hi,
I've updated #16279 with metrics and updates.

@geoand
Contributor

geoand commented Sep 13, 2024

Is this still an issue?

@geoand geoand added the triage/needs-feedback We are waiting for feedback. label Sep 13, 2024
@geoand
Contributor

geoand commented Oct 2, 2024

Closing for lack of feedback

@geoand geoand closed this as not planned Oct 2, 2024