
JDBC connection does not reset #18685

Closed
silviu-negoita opened this issue Jul 14, 2021 · 23 comments
Labels
area/agroal, kind/bug (Something isn't working), triage/needs-feedback (We are waiting for feedback)

Comments

@silviu-negoita

silviu-negoita commented Jul 14, 2021

Describe the bug

  1. The application is up, with an SQS collector started at boot via ManagedExecutor:
 void onStart(@Observes(during = AFTER_COMPLETION) StartupEvent ev) {
        managedExecutor.execute(this::startProcessing);
    }

Somewhere down the line, I do some DB selects / inserts.
  2. Reset the database. Reproduced in the AWS cloud, but also locally with a Docker Postgres container. At the end of this step the database is up, ready, and accepting new connections.
  3. Got this stack trace:

Caused by: org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:350)
	at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:481)
	at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:401)
	at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:164)
	at org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:114)
	at io.agroal.pool.wrapper.PreparedStatementWrapper.executeQuery(PreparedStatementWrapper.java:78)
	at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.extract(ResultSetReturnImpl.java:57)
	... 54 more
Caused by: java.io.EOFException
	at org.postgresql.core.PGStream.receiveChar(PGStream.java:445)
	at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2057)
	at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:323)
	... 60 more

  4. The app is still up and receives a new SQS event in the method from step 1.
  5. The app then fails on any DB interaction with the following:

Caused by: org.hibernate.exception.JDBCConnectionException: could not prepare statement
	at org.hibernate.exception.internal.SQLStateConversionDelegate.convert(SQLStateConversionDelegate.java:112)
	at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:42)
	at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:113)
	at org.hibernate.engine.jdbc.internal.StatementPreparerImpl$StatementPreparationTemplate.prepareStatement(StatementPreparerImpl.java:186)
	at org.hibernate.engine.jdbc.internal.StatementPreparerImpl.prepareQueryStatement(StatementPreparerImpl.java:151)
	at org.hibernate.loader.Loader.prepareQueryStatement(Loader.java:2103)
	at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:2040)
	at org.hibernate.loader.Loader.executeQueryStatement(Loader.java:2018)
	at org.hibernate.loader.Loader.doQuery(Loader.java:948)
	at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:349)
	at org.hibernate.loader.Loader.doList(Loader.java:2849)
	at org.hibernate.loader.Loader.doList(Loader.java:2831)
	at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2663)
	at org.hibernate.loader.Loader.list(Loader.java:2658)
	at org.hibernate.loader.hql.QueryLoader.list(QueryLoader.java:506)
	at org.hibernate.hql.internal.ast.QueryTranslatorImpl.list(QueryTranslatorImpl.java:400)
	at org.hibernate.engine.query.spi.HQLQueryPlan.performList(HQLQueryPlan.java:219)
	at org.hibernate.internal.SessionImpl.list(SessionImpl.java:1414)
	at org.hibernate.query.internal.AbstractProducedQuery.doList(AbstractProducedQuery.java:1625)
	at org.hibernate.query.internal.AbstractProducedQuery.list(AbstractProducedQuery.java:1593)
	... 42 more
Caused by: org.postgresql.util.PSQLException: This connection has been closed.
	at org.postgresql.jdbc.PgConnection.checkClosed(PgConnection.java:877)
	at org.postgresql.jdbc.PgConnection.prepareStatement(PgConnection.java:1734)
	at org.postgresql.jdbc.PgConnection.prepareStatement(PgConnection.java:428)
	at io.agroal.pool.wrapper.ConnectionWrapper.prepareStatement(ConnectionWrapper.java:659)
	at org.hibernate.engine.jdbc.internal.StatementPreparerImpl$5.doPrepare(StatementPreparerImpl.java:149)
	at org.hibernate.engine.jdbc.internal.StatementPreparerImpl$StatementPreparationTemplate.prepareStatement(StatementPreparerImpl.java:176)
	... 58 more

Expected behavior

Database connections are restored once the DB is up again.

Actual behavior

It fails indefinitely; it only recovers after the app is restarted.

Output of java -version

openjdk 11.0.11 2021-04-20

Quarkus version or git rev

1.13.7

Build tool (ie. output of mvnw --version or gradlew --version)

maven

@silviu-negoita silviu-negoita added the kind/bug Something isn't working label Jul 14, 2021
@stuartwdouglas
Member

Are you getting a new connection from the pool each time or are you holding onto one for the life of the app?

@silviu-negoita
Author

silviu-negoita commented Jul 15, 2021

Are you getting a new connection from the pool each time or are you holding onto one for the life of the app?

  1. I'm not doing anything specific related to the DB/JDBC; everything happens behind the scenes. I'm not even in a transaction, and it fails on a select operation.
  2. Not sure how to see the actual DB connections. Is that doable via debug?

@gsmet
Member

gsmet commented Jul 21, 2021

Maybe you could try adding quarkus.datasource.jdbc.validation-query-sql=select 1 to your application.properties?
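For reference, a minimal application.properties sketch of that suggestion, with background validation added on top (the interval value is only an illustration, not something from this thread):

# validate idle connections periodically so broken ones are dropped from the pool
quarkus.datasource.jdbc.validation-query-sql=select 1
quarkus.datasource.jdbc.background-validation-interval=30S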

@silviu-negoita
Author

silviu-negoita commented Jul 29, 2021

Hi again! @gsmet I tried your fix, but there is no difference.
Meanwhile, I found a better way to reproduce. While the DB connection is down, I stress the application with events (which hit the DB). After a while, the application stops receiving events altogether. I attached a thread dump for this case, and also one taken after a restart, when the app is working normally.
mem_dumps.zip

@silviu-negoita
Author

Also, I see something similar here: #15025. I can reproduce the issue on 2.0.3.Final.

@apat59

apat59 commented Sep 6, 2021

Hi,
I'm also hitting this since exactly 1.13.2, and it is still happening on 2.2.1.
However, I'm not restarting the DB; this happens almost every time I roll out a new deployment/pod.
The flow is the same:
1- Agroal creates the conn
2- conn acquired, used, returned to the pool many times (in my case by Quartz, at least 20 times) without any problem
3- conn acquired, that EOF exception happens, but the conn is returned, not destroyed
4- conn acquired, and TransactionImple.getStatus: javax.transaction.Status.STATUS_MARKED_ROLLBACK
5- many problems may happen here: conn closed, "Deferred enlistment not supported", tx not started
6- only here does "Destroyed connection org.postgresql.jdbc.PgConnection" happen; isn't it too late?

Many connections suffer this exact same flow, very often at startup but also when we set max-lifetime.

Generally that goes with corrupted metrics: as soon as this happens, the active count will be < 0.
One way to reproduce it outside of startup quite easily is to set a short max-lifetime; the problem will happen over and over (see the config sketch below).

I've been unable to implement a reproducer. I'm in touch with Luis in private.
I have a shared gdoc and TRACE logs I can share only with Red Hatters.

This looks very specific to the cloud environment, and likely to the particular cloud DB server product/setup.
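For reference, a minimal application.properties sketch of the max-lifetime reproduction described above (the value is illustrative only, not taken from this report):

# a short max-lifetime forces connections to be recycled frequently,
# which makes the flow described above repeat over and over
quarkus.datasource.jdbc.max-lifetime=2M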

@jonsalvas

jonsalvas commented Nov 24, 2021

I can confirm I am also facing this issue with Quarkus 2.0.3.Final in a Kubernetes environment, with HPA scaling pods up/down based on load.

@apat59

apat59 commented Nov 25, 2021

@jonsalvas out of curiosity, do you know the details about the DB setup and any extra pooling mechanism?

@hleb-kastseika

Any updates on this issue?

@maxandersen
Member

From a Reddit thread highlighting the issue:

I was able to reproduce the issue, but it doesn't seem very serious to me.

  1. Start the Quarkus application, send a few requests.
  2. Stop Postgres.
  3. Send requests, observe "PSQLException: This connection has been closed." errors.
  4. Start Postgres.
  5. Send requests, keep observing connection closed errors for about 30 seconds.
  6. Normal behavior resumes.

My application configuration doesn't have anything special regarding database connection validation or maximum age.

In my opinion the only scenario where this is an issue is when database connections are frequently interrupted.

@apat59

apat59 commented Feb 19, 2022

@maxandersen
Hi mate (Anthony writing),
Since:

  • the symptom appeared starting exactly with 1.13.7
  • the problem does not happen with another app built on plain Vert.x + c3p0 pooling hitting the same DB cluster

I'm pretty sure there is something really tricky and specific.
In my case, the DB is an Amazon prod cluster and the apps are deployed on OpenShift.
I do suspect some specific Amazon config, but we do not administer the DB, so we haven't managed to get any clue.

I managed to reduce this "pseudo" leak; I'll paste my settings tomorrow.
I'm also in touch with Luis and I send him info as soon as I have it.

@maxandersen
Member

@apat59 thanks for the info - you mention Luis; Luis who? (there are a handful of those around :)

@maxandersen
Member

Also @apat59, what is the latest version of Quarkus you tested this against? 1.13.7 is quite old by now :)

@maxandersen
Member

So I tried to reproduce this locally, and so far it works as I expect: when the DB is down you get errors; when the DB comes back, the pool eventually recovers and everything works again.

See https://github.com/maxandersen/quarkus18685reproduce for reproducer.

This is on Quarkus 2.7.1.Final.

@stuartwdouglas
Member

Can you try with agroal/agroal#48 ?

With this patch I see things resolving immediately after the DB is back up.

@quarkus-bot

quarkus-bot bot commented Feb 22, 2022

/cc @Sanne, @barreiro, @yrodiere

@apat59

apat59 commented Feb 22, 2022

Hi,
More on our own issue here. Please note I'm not totally sure it's the same issue as this one.
We've been stuck with 1.13.0 for a long time because of this symptom,
until 2.2.1.Final and a mix of properties which limited the "problem".
Note, we never managed to reproduce this in dev; it only happened with OpenShift + Aurora DBs.

quarkus.datasource.db-kind=postgresql
quarkus.datasource.health.enabled=true
quarkus.datasource.metrics.enabled=true
quarkus.transaction-manager.default-transaction-timeout=60
quarkus.datasource.jdbc.min-size=6
quarkus.datasource.jdbc.initial-size=6
# insane but temporary required
quarkus.datasource.jdbc.max-size=70
quarkus.datasource.jdbc.background-validation-interval=15S
quarkus.datasource.jdbc.validation-query-sql=select 1;
quarkus.datasource.jdbc.acquisition-timeout=5
quarkus.datasource.jdbc.idle-removal-interval=5M
quarkus.datasource.jdbc.detect-statement-leaks=true
quarkus.datasource.jdbc.new-connection-sql=select 1;
# NOTE, we may have to change that in the future
quarkus.datasource.jdbc.driver=org.postgresql.ds.PGSimpleDataSource

As soon as I can (heavy week here), I'll re-enable the config that used to trigger the problem:

#quarkus.datasource.jdbc.max-lifetime=150
#quarkus.datasource.jdbc.detect-statement-leaks=true

and update to the latest Quarkus release and deploy to QA, where we have Grafana enabled and can monitor the pool.
I'll let you know.

Btw Luis is Luis Barreiro :)

@maxandersen
Member

@apat59 sounds great.

If you can try with Stuart's patch that would be great, but if not, look at what error codes you get and whether any of them start with 08 (as listed in https://www.postgresql.org/docs/13/errcodes-appendix.html); for those at least you should see faster recovery.

@barreiro
Contributor

@stuartwdouglas @maxandersen from what I can understand, the problem is not just about error codes but about how the Amazon Aurora service does the DB failover.

I don't know the internals, but it looks like there is a "primary" that can fail, and when it does Agroal flushes the connections (based on the SQL error code) and new connections are created to a read-only "replica". Agroal expects that the "replica" instance will eventually become the "primary" at some point in the future (once the recovery procedure is complete), and because of that the connections to the replica are not flushed. Aurora, however, seems to keep the "replica" in read-only mode and bring the "primary" back up instead, which is not the expected behavior.

From looking around in the Amazon product documentation, they seem to acknowledge their failover problem because they do suggest a couple of 'solutions' to deal with it.

We can think of ways to handle this failover on our side, but a first step (that can be useful in other scenarios as well) would be to somehow expose the AgroalDataSource.flush(FlushMode) operation in Quarkus, to allow the pool to be flushed without restarting the application.
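For illustration, a minimal sketch of what exposing that operation could look like with the current APIs: the Agroal datasource is injected directly and flushed from a hypothetical admin endpoint (the resource class and path are made up; only AgroalDataSource.flush(FlushMode) is the actual Agroal API):

import io.agroal.api.AgroalDataSource;

import javax.inject.Inject;
import javax.ws.rs.POST;
import javax.ws.rs.Path;

@Path("/admin/datasource")
public class DataSourceFlushResource {

    @Inject
    AgroalDataSource dataSource; // Quarkus exposes the default Agroal pool as a bean

    @POST
    @Path("/flush")
    public String flush() {
        // GRACEFUL retires connections as they are returned to the pool;
        // FlushMode.ALL would close them immediately.
        dataSource.flush(AgroalDataSource.FlushMode.GRACEFUL);
        return "flush requested";
    }
}

Calling such an endpoint after an Aurora failover would let the pool re-establish connections to the new primary without restarting the application.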

@maxandersen
Member

Yeah, this sounds like a separate issue that I suggest you open separately.

Looks like Agroal might want to learn some tricks to deal with Amazon Aurora's special failover logic.

@apat59

apat59 commented Feb 24, 2022

Hi,
I've updated #16279 with metrics and updates.

@geoand
Contributor

geoand commented Sep 13, 2024

Is this still an issue?

@geoand geoand added the triage/needs-feedback We are waiting for feedback. label Sep 13, 2024
@geoand
Contributor

geoand commented Oct 2, 2024

Closing for lack of feedback

@geoand geoand closed this as not planned Oct 2, 2024