Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQLState(08006) error causing HTTP 500 error upstream #31645

Closed
tim-o opened this issue Oct 19, 2018 · 8 comments · Fixed by #41451
Closed

SQLState(08006) error causing HTTP 500 error upstream #31645

tim-o opened this issue Oct 19, 2018 · 8 comments · Fixed by #41451
Labels
A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data)

Comments

@tim-o
Copy link
Contributor

tim-o commented Oct 19, 2018

Describe the problem
While conducting a routine migration, customer encountered the following:

  1. Context canceled / aborted transactions in the logs, for example:
    W181019 15:24:49.629554 61831020 internal/client/txn.go:556 [n2,client=130.211.2.195:59869,user=foo] failure aborting transaction: HandledRetryableTxnError: TransactionAbortedError: txn aborted "sql txn" id=581b6de4 key=/Table/82/1/"\xe4\x15\xd4\xe7\xab\xc4G\x1d\x9c5\xd99\x98\xab\xe2\xf5"/0 rw=true pri=0.03491282 iso=SERIALIZABLE stat=PENDING epo=0 ts=1539962689.619304580,0 orig=1539962689.619304580,0 max=1539962690.119304580,0 wto=false rop=false seq=2; abort caused by: failed to send RPC: sending to all 3 replicas failed; last error: {<nil> context canceled} I181019 15:24:56.600105 181 gossip/gossip.go:488 [n2] gossip status (ok, 3 nodes)
  2. Upstream SQLSTATE(08006) errors in HikariCP: WARN c.z.hikari.pool.ProxyConnection - roach - Connection org.postgresql.jdbc.PgConnection@594e7e23 marked as broken because of SQLSTATE(08006), ErrorCode(0)
  3. HTTP 500 errors in their application due to the broken connection & 08006 error.

Looking in our code, I see code 08006 is the result of a CodeConnectionFailureError. These are called only twice: in schema_changer.go and inbound.go:99

The comment above that line does seem to describe the condition described in the log warning. However, it's not clear why the result of this race is a 08006 error since that would usually indicate a problem with the connection while (at least as far as I can see) the underlying issue is just context cancellation.

To Reproduce
This doesn't reproduce easily, but did result in 4 errors today for the customer during routine migrations.

Environment:
Postgres JDBC 42.1.3
Hibernate 5.1.8.Final
HikariCP 3.2.0
CRDB version to be provided.

Additional context
What was the impact?
HTTP 500 errors potentially sent to end users.

@vivekmenezes, assigning to you first since @andreimatei indicated you wrote this portion of the code. If there's a better home let me know.

@tim-o tim-o added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors labels Oct 19, 2018
@andreimatei
Copy link
Contributor

I don't think the failure aborting txn is related to the rest of the problem here: the erroneous (I think) use of the CodeConnectionFailureError (I'm pretty sure that warning message wasn't related to whatever communication error DistSQL received, or at least wasn't upstream of it).
My understanding is that Postgres uses CodeConnectionFailureError to tell clients that it's about to crash and so they should bail on specific connections. That's not how we're using it, and so we're confusing connection poolers and such.

@vivekmenezes
Copy link
Contributor

Indeed, these appear to be two problems:

The first retryable error seems to be due to a context cancellation.

@andreimatei why do you think error 08006 is not being returned from distsql?

@andreimatei
Copy link
Contributor

No, I think the 08006 is being returned by DistSQL - and that DistSQL is wrong to use that error code.
I think the "failure to abort" error has nothing to do with anything and should be ignored.

@vivekmenezes
Copy link
Contributor

I think we should create a new postgresql error of "Class 58" (system) called "Network Error" for this distsql error. Thoughts?

@vivekmenezes vivekmenezes added the A-sql-execution Relating to SQL execution. label Oct 19, 2018
@andreimatei
Copy link
Contributor

andreimatei commented Oct 19, 2018 via email

@tim-o
Copy link
Contributor Author

tim-o commented Oct 19, 2018

Vivek, in general when someone encounters this error how should they interpret it? has the underlying transaction completed? should it be restarted? Should they take action?

@andreimatei
Copy link
Contributor

andreimatei commented Oct 19, 2018

No, sorry, upon looking more closely, I don't think class 58 is right. It seems to be defined as "Class 58 - System Error (errors external to PostgreSQL itself)". In our case, that stream being interrupted is most often no "external to Postgres itself", in my opinion.
I think this communication error, and many other things, fall into a "data temporary unavailable" category, which does not seem to be represented well in Postgres.
My recommendation would be to punt the question, and revert to returning an untyped error (which goes to the client with CodeInternalError - XX000) - and so do something else for the schema changer guy that wants to recognize it.

@tim-o the client should respond like they respond to the vast majority of our errors (CodeInternalError). Namely, at the moment we throw our hands in the air and offer no guidance :)
See this old thread for some past discussions: https://groups.google.com/a/cockroachlabs.com/forum/#!searchin/eng/data$20unavailability$20and$20errors%7Csort:date/eng/CZa0Sl3SLc0/UWjYEMjtEAAJ

To be only slightly more helpful - if the error does not come from a commit statement, then the transaction has not been committed; the client should generally retry - the old "please try again later" approach to error handling.
If the error comes from a commit, then I'm not sure - I think most of the time if the code is not an ambiguous one (CodeTransactionResolutionUnknownError), then the txn again has not been committed. But I think we miss some cases, so for example the cockroach-go lib treats most commit errors as ambiguous.
If the commit error is a communication error between the client and crdb - in which case crdb is out of the loop and the disposition is again ambiguous. For this last case, if the cockroach-go client library is used, a particular go error is returned: AmbiguousCommitError.

@jordanlewis
Copy link
Member

I think this should be downgraded to an S-3, as it doesn't cause nodes to fail.

@jordanlewis jordanlewis added S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) and removed S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors labels Mar 26, 2019
andreimatei added a commit to andreimatei/cockroach that referenced this issue Oct 8, 2019
Before this patch, DistSQL would use the the Postgres
ConnectionFailure code when a network stream between processors on
different nodes would break. This was the wrong code to use; Postgres
uses this code for trouble with the client connection, not internal
problems. There's evidence that middleware treats this code as a signal
to tear down a connection (cockroachdb#31645). This patch switches to a new,
CRDB-specific error code in the "internal error" class.

Fixes cockroachdb#31645

Release note: None
craig bot pushed a commit that referenced this issue Oct 17, 2019
33655: storage: skip TestWedgedReplicaDetection on test short r=andreimatei a=andreimatei

Takes 10s. I've opened #33654 asking for an investigation.

Release note: None

41391: roachtest: sort clusters in leftover clusters report r=andreimatei a=andreimatei

They used to displayed in non-deterministic map iteration order.

Release note: None

41401: roachtest: plumb the cluster id prefix from --cluster-id r=andreimatei a=andreimatei

The flag was not hooked up to anything.

Release justification: N/A

Release note: None

41451: sql: use a different error code for communication failure r=andreimatei a=andreimatei

Before this patch, DistSQL would use the the Postgres
ConnectionFailure code when a network stream between processors on
different nodes would break. This was the wrong code to use; Postgres
uses this code for trouble with the client connection, not internal
problems. There's evidence that middleware treats this code as a signal
to tear down a connection (#31645). This patch switches to a new,
CRDB-specific error code in the "internal error" class.

Fixes #31645

Release note: None

41494: util/tracing: rename FormatRecordedSpans to Recording.String() r=andreimatei a=andreimatei

Elevate the discoverability of FormatRecordedSpans() by making it the
stringer for a Recording. Remove the inferior stringer I had previously
added.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
@craig craig bot closed this as completed in c22e03f Oct 18, 2019
arulajmani pushed a commit to arulajmani/cockroach that referenced this issue Oct 29, 2019
Before this patch, DistSQL would use the the Postgres
ConnectionFailure code when a network stream between processors on
different nodes would break. This was the wrong code to use; Postgres
uses this code for trouble with the client connection, not internal
problems. There's evidence that middleware treats this code as a signal
to tear down a connection (cockroachdb#31645). This patch switches to a new,
CRDB-specific error code in the "system" class.

For consistency, it moves the code for RangeUnavailable to the same
class.

The class used is class 58, used by both Postgres and DB2 for "system"
errors. In Postgres, these errors are external to the database (e.g.
IO). In DB2 (and now in CRDB), they're also internal (to the cluster).

Fixes cockroachdb#31645

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-execution Relating to SQL execution. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants