SQLState(08006) error causing HTTP 500 error upstream #31645

tim-o · 2018-10-19T19:10:56Z

Describe the problem
While conducting a routine migration, customer encountered the following:

Context canceled / aborted transactions in the logs, for example:
W181019 15:24:49.629554 61831020 internal/client/txn.go:556 [n2,client=130.211.2.195:59869,user=foo] failure aborting transaction: HandledRetryableTxnError: TransactionAbortedError: txn aborted "sql txn" id=581b6de4 key=/Table/82/1/"\xe4\x15\xd4\xe7\xab\xc4G\x1d\x9c5\xd99\x98\xab\xe2\xf5"/0 rw=true pri=0.03491282 iso=SERIALIZABLE stat=PENDING epo=0 ts=1539962689.619304580,0 orig=1539962689.619304580,0 max=1539962690.119304580,0 wto=false rop=false seq=2; abort caused by: failed to send RPC: sending to all 3 replicas failed; last error: {<nil> context canceled}
I181019 15:24:56.600105 181 gossip/gossip.go:488 [n2] gossip status (ok, 3 nodes)
Upstream SQLSTATE(08006) errors in HikariCP: WARN c.z.hikari.pool.ProxyConnection - roach - Connection org.postgresql.jdbc.PgConnection@594e7e23 marked as broken because of SQLSTATE(08006), ErrorCode(0)
HTTP 500 errors in their application due to the broken connection & 08006 error.

Looking in our code, I see that code 08006 corresponds to CodeConnectionFailureError. It is used in only two places: schema_changer.go and inbound.go:99.

The comment above inbound.go:99 does seem to describe the condition in the log warning. However, it's not clear why the result of this race is an 08006 error, since that would usually indicate a problem with the connection, while (at least as far as I can see) the underlying issue is just context cancellation.

To Reproduce
This doesn't reproduce easily, but did result in 4 errors today for the customer during routine migrations.

Environment:
Postgres JDBC 42.1.3
Hibernate 5.1.8.Final
HikariCP 3.2.0
CRDB version to be provided.

Additional context
What was the impact?
HTTP 500 errors potentially sent to end users.

@vivekmenezes, assigning to you first since @andreimatei indicated you wrote this portion of the code. If there's a better home let me know.

andreimatei · 2018-10-19T19:22:49Z

I don't think the "failure aborting transaction" warning is related to the real problem here, which is the erroneous (I think) use of CodeConnectionFailureError. I'm pretty sure that warning message wasn't related to whatever communication error DistSQL received, or at least wasn't upstream of it.
My understanding is that Postgres uses CodeConnectionFailureError to tell clients that it's about to crash and so they should bail on specific connections. That's not how we're using it, and so we're confusing connection poolers and such.
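
To illustrate why this confuses poolers, here is a minimal Go sketch of the kind of check a connection pool might apply. It assumes the pool discards a connection whenever it sees any class 08 ("connection exception") SQLSTATE; the function name evictOnSQLState is hypothetical and is not taken from HikariCP or from CockroachDB.

```go
package main

import (
	"fmt"
	"strings"
)

// evictOnSQLState reports whether a hypothetical pool would discard a
// connection after seeing the given SQLSTATE. Assumption: the pool treats
// every class 08 ("connection exception") code as fatal to the connection.
func evictOnSQLState(sqlstate string) bool {
	return strings.HasPrefix(sqlstate, "08")
}

func main() {
	// 08006 is the code from this issue; XX000 and 40001 are shown for contrast.
	for _, code := range []string{"08006", "XX000", "40001"} {
		fmt.Printf("%s evict=%v\n", code, evictOnSQLState(code))
	}
}
```

Under that assumption, an 08006 raised for an internal DistSQL stream failure costs the application a healthy client connection, while the same failure reported under another class would not.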

vivekmenezes · 2018-10-19T19:48:19Z

Indeed, these appear to be two problems:

The first retryable error seems to be due to a context cancellation.

@andreimatei why do you think error 08006 is not being returned from distsql?

andreimatei · 2018-10-19T19:55:07Z

No, I think the 08006 is being returned by DistSQL - and that DistSQL is wrong to use that error code.
I think the "failure to abort" error has nothing to do with anything and should be ignored.

vivekmenezes · 2018-10-19T20:07:42Z

I think we should create a new postgresql error of "Class 58" (system) called "Network Error" for this distsql error. Thoughts?

andreimatei · 2018-10-19T20:18:52Z

Yeah, class 58 seems the right one. But I don't know if introducing that error code in one specific place where communication fails (as opposed to more generally on top of all gRPC errors everywhere) is a good or a bad thing... Separately, I don't know what errors the schema changer should recognize as "non-permanent". isPermanentSchemaChangeError() seems a bit out of control to me. I think we should stop recognizing specific errors + explicit schema change / job cancelation policy.

tim-o · 2018-10-19T20:19:49Z

Vivek, in general, when someone encounters this error, how should they interpret it? Has the underlying transaction completed? Should it be restarted? Should they take action?

andreimatei · 2018-10-19T20:36:43Z

No, sorry, upon looking more closely, I don't think class 58 is right. It seems to be defined as "Class 58 - System Error (errors external to PostgreSQL itself)". In our case, that stream being interrupted is most often not "external to Postgres itself", in my opinion.
I think this communication error, and many other things, fall into a "data temporarily unavailable" category, which does not seem to be represented well in Postgres.
My recommendation would be to punt the question, and revert to returning an untyped error (which goes to the client with CodeInternalError - XX000) - and so do something else for the schema changer guy that wants to recognize it.

@tim-o the client should respond like they respond to the vast majority of our errors (CodeInternalError). Namely, at the moment we throw our hands in the air and offer no guidance :)
See this old thread for some past discussions: https://groups.google.com/a/cockroachlabs.com/forum/#!searchin/eng/data$20unavailability$20and$20errors%7Csort:date/eng/CZa0Sl3SLc0/UWjYEMjtEAAJ

To be only slightly more helpful - if the error does not come from a commit statement, then the transaction has not been committed; the client should generally retry - the old "please try again later" approach to error handling.
If the error comes from a commit, then I'm not sure - I think most of the time if the code is not an ambiguous one (CodeTransactionResolutionUnknownError), then the txn again has not been committed. But I think we miss some cases, so for example the cockroach-go lib treats most commit errors as ambiguous.
If the commit error is a communication error between the client and crdb, then crdb is out of the loop and the disposition is again ambiguous. For this last case, if the cockroach-go client library is used, a particular Go error is returned: AmbiguousCommitError.
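
The guidance above can be sketched as a small Go classifier. This is only an illustration under stated assumptions: the Disposition type and the classify function are hypothetical names, not part of cockroach-go, and the sketch assumes CodeTransactionResolutionUnknownError maps to SQLSTATE 08007, as transaction_resolution_unknown does in Postgres.

```go
package main

import "fmt"

// Disposition is a hypothetical label for what a client may assume about a
// transaction after an error. It is illustrative, not a CockroachDB API.
type Disposition string

const (
	NotCommitted       Disposition = "not committed, safe to retry"
	LikelyNotCommitted Disposition = "probably not committed"
	Ambiguous          Disposition = "ambiguous, may or may not have committed"
)

// Assumption: the ambiguous-result code is SQLSTATE 08007.
const codeTxnResolutionUnknown = "08007"

// classify follows the rules described in the comment above:
//   - errors outside COMMIT mean the transaction did not commit;
//   - a lost client connection or 08007 during COMMIT is ambiguous;
//   - other COMMIT errors most likely mean the transaction did not commit.
func classify(fromCommit, clientConnLost bool, sqlstate string) Disposition {
	switch {
	case !fromCommit:
		return NotCommitted
	case clientConnLost || sqlstate == codeTxnResolutionUnknown:
		return Ambiguous
	default:
		return LikelyNotCommitted
	}
}

func main() {
	fmt.Println(classify(false, false, "XX000"))
	fmt.Println(classify(true, false, "08007"))
	fmt.Println(classify(true, true, ""))
	fmt.Println(classify(true, false, "XX000"))
}
```

A more conservative client could collapse LikelyNotCommitted into Ambiguous, which matches the note that cockroach-go treats most commit errors as ambiguous.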

jordanlewis · 2019-03-26T17:14:26Z

I think this should be downgraded to an S-3, as it doesn't cause nodes to fail.

Before this patch, DistSQL would use the Postgres ConnectionFailure code when a network stream between processors on different nodes would break. This was the wrong code to use; Postgres uses this code for trouble with the client connection, not internal problems. There's evidence that middleware treats this code as a signal to tear down a connection (cockroachdb#31645). This patch switches to a new, CRDB-specific error code in the "internal error" class. Fixes cockroachdb#31645 Release note: None

33655: storage: skip TestWedgedReplicaDetection on test short r=andreimatei a=andreimatei
Takes 10s. I've opened #33654 asking for an investigation.
Release note: None

41391: roachtest: sort clusters in leftover clusters report r=andreimatei a=andreimatei
They used to be displayed in non-deterministic map iteration order.
Release note: None

41401: roachtest: plumb the cluster id prefix from --cluster-id r=andreimatei a=andreimatei
The flag was not hooked up to anything.
Release justification: N/A
Release note: None

41451: sql: use a different error code for communication failure r=andreimatei a=andreimatei
Before this patch, DistSQL would use the Postgres ConnectionFailure code when a network stream between processors on different nodes would break. This was the wrong code to use; Postgres uses this code for trouble with the client connection, not internal problems. There's evidence that middleware treats this code as a signal to tear down a connection (#31645). This patch switches to a new, CRDB-specific error code in the "internal error" class.
Fixes #31645
Release note: None

41494: util/tracing: rename FormatRecordedSpans to Recording.String() r=andreimatei a=andreimatei
Elevate the discoverability of FormatRecordedSpans() by making it the stringer for a Recording. Remove the inferior stringer I had previously added.
Release note: None

Co-authored-by: Andrei Matei

Before this patch, DistSQL would use the Postgres ConnectionFailure code when a network stream between processors on different nodes would break. This was the wrong code to use; Postgres uses this code for trouble with the client connection, not internal problems. There's evidence that middleware treats this code as a signal to tear down a connection (cockroachdb#31645). This patch switches to a new, CRDB-specific error code in the "system" class. For consistency, it moves the code for RangeUnavailable to the same class. The class used is class 58, used by both Postgres and DB2 for "system" errors. In Postgres, these errors are external to the database (e.g. IO). In DB2 (and now in CRDB), they're also internal (to the cluster). Fixes cockroachdb#31645 Release note: None
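
For readers unfamiliar with how SQLSTATE classes work, the following Go sketch shows that the class is simply the first two characters of the five-character code. The helper names sqlstateClass and describe are hypothetical, and the table lists only the Postgres classes mentioned in this thread plus class 40 for contrast. The example code 58030 is Postgres's io_error, used here only as a sample class 58 value; the new CRDB-specific code is not shown because this thread does not state it.

```go
package main

import "fmt"

// sqlstateClass returns the two-character class prefix of a five-character
// SQLSTATE, or "" if the code does not have the expected length.
func sqlstateClass(code string) string {
	if len(code) != 5 {
		return ""
	}
	return code[:2]
}

// classNames covers only the classes discussed here; it is not exhaustive.
var classNames = map[string]string{
	"08": "connection exception",
	"40": "transaction rollback",
	"58": "system error",
	"XX": "internal error",
}

// describe maps a SQLSTATE to the name of its class.
func describe(code string) string {
	if name, ok := classNames[sqlstateClass(code)]; ok {
		return name
	}
	return "unknown class"
}

func main() {
	for _, code := range []string{"08006", "58030", "XX000"} {
		fmt.Printf("%s -> class %s (%s)\n", code, sqlstateClass(code), describe(code))
	}
}
```

Because middleware typically keys its behavior on the class prefix, moving the DistSQL communication error out of class 08 changes how clients react without changing the failure itself.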

tim-o added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors labels Oct 19, 2018

tim-o assigned vivekmenezes Oct 19, 2018

vivekmenezes added the A-sql-execution Relating to SQL execution. label Oct 19, 2018

vivekmenezes assigned vivekmenezes and unassigned vivekmenezes Oct 19, 2018

vivekmenezes added a commit to vivekmenezes/cockroach that referenced this issue Oct 23, 2018

sql: report internal distsql communication error as an internal error

f6c1397

related to cockroachdb#31645 Release note: None

vivekmenezes mentioned this issue Oct 23, 2018

sql: report internal distsql communication error as an internal error #14770

Closed

vivekmenezes added a commit to vivekmenezes/cockroach that referenced this issue Oct 24, 2018

sql: report internal distsql communication error as an internal error

dc682fd

related to cockroachdb#31645 Release note: None

timveil mentioned this issue Feb 9, 2019

CRDB: problem loading seats timveil-cockroach/oltpbench#10

Closed

jordanlewis added S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) and removed S-2 Medium-high impact: many users impacted, risks of availability and difficult-to-fix data errors labels Mar 26, 2019

andreimatei unassigned vivekmenezes Oct 8, 2019

andreimatei mentioned this issue Oct 8, 2019

sql: use a different error code for communication failure #41451

Merged

craig bot closed this as completed in c22e03f Oct 18, 2019
