SQLState(08006) error causing HTTP 500 error upstream #31645
Comments
I don't think the |
Indeed, these appear to be two problems: The first retryable error seems to be due to a context cancellation. @andreimatei why do you think error 08006 is not being returned from distsql? |
No, I think the |
I think we should create a new postgresql error of "Class 58" (system) called "Network Error" for this distsql error. Thoughts? |
Yeah, class 58 seems the right one. But I don't know if introducing that
error code in one specific place where communication fails (as opposed to
more generally on top of all gRPC errors everywhere) is a good or a bad
thing...
Separately, I don't know what errors the schema changer should recognize as
"non-permanent". isPermanentSchemaChangeError() seems a bit out of control
to me. I think we should stop recognizing specific errors + explicit
schema change / job cancelation policy.
Vivek, in general, when someone encounters this error, how should they interpret it? Has the underlying transaction completed? Should it be restarted? Should they take action? |
No, sorry, upon looking more closely, I don't think class 58 is right. It seems to be defined as "Class 58 - System Error (errors external to PostgreSQL itself)". In our case, that stream being interrupted is most often not "external to Postgres itself", in my opinion. @tim-o the client should respond the way it responds to the vast majority of our errors (CodeInternalError). Namely, at the moment we throw our hands in the air and offer no guidance :) To be only slightly more helpful - if the error does not come from a |
related to cockroachdb#31645 Release note: None
I think this should be downgraded to an S-3, as it doesn't cause nodes to fail. |
Before this patch, DistSQL would use the Postgres ConnectionFailure code when a network stream between processors on different nodes would break. This was the wrong code to use; Postgres uses this code for trouble with the client connection, not internal problems. There's evidence that middleware treats this code as a signal to tear down a connection (cockroachdb#31645). This patch switches to a new, CRDB-specific error code in the "internal error" class. Fixes cockroachdb#31645 Release note: None
33655: storage: skip TestWedgedReplicaDetection on test short r=andreimatei a=andreimatei
Takes 10s. I've opened #33654 asking for an investigation.
Release note: None

41391: roachtest: sort clusters in leftover clusters report r=andreimatei a=andreimatei
They used to be displayed in non-deterministic map iteration order.
Release note: None

41401: roachtest: plumb the cluster id prefix from --cluster-id r=andreimatei a=andreimatei
The flag was not hooked up to anything.
Release justification: N/A
Release note: None

41451: sql: use a different error code for communication failure r=andreimatei a=andreimatei
Before this patch, DistSQL would use the Postgres ConnectionFailure code when a network stream between processors on different nodes would break. This was the wrong code to use; Postgres uses this code for trouble with the client connection, not internal problems. There's evidence that middleware treats this code as a signal to tear down a connection (#31645). This patch switches to a new, CRDB-specific error code in the "internal error" class.
Fixes #31645
Release note: None

41494: util/tracing: rename FormatRecordedSpans to Recording.String() r=andreimatei a=andreimatei
Elevate the discoverability of FormatRecordedSpans() by making it the stringer for a Recording. Remove the inferior stringer I had previously added.
Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Before this patch, DistSQL would use the Postgres ConnectionFailure code when a network stream between processors on different nodes would break. This was the wrong code to use; Postgres uses this code for trouble with the client connection, not internal problems. There's evidence that middleware treats this code as a signal to tear down a connection (cockroachdb#31645). This patch switches to a new, CRDB-specific error code in the "system" class. For consistency, it moves the code for RangeUnavailable to the same class. The class used is class 58, used by both Postgres and DB2 for "system" errors. In Postgres, these errors are external to the database (e.g. IO). In DB2 (and now in CRDB), they're also internal (to the cluster). Fixes cockroachdb#31645 Release note: None
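For readers following the class discussion: the class of a SQLSTATE code is its first two characters, so 08006 falls in class 08. The Go sketch below is only an illustration of that convention. The helper names `sqlStateClass` and `describe` are hypothetical and are not part of CockroachDB or any driver, and the table lists only the Postgres classes mentioned in this thread.

```go
package main

import "fmt"

// sqlStateClass returns the two-character class prefix of a five-character
// SQLSTATE code, or "" if the code does not have the expected length.
// Hypothetical helper for illustration only.
func sqlStateClass(code string) string {
	if len(code) != 5 {
		return ""
	}
	return code[:2]
}

// classNames covers only the Postgres error classes discussed in this issue;
// it is not a complete list.
var classNames = map[string]string{
	"08": "connection exception",
	"40": "transaction rollback",
	"58": "system error",
	"XX": "internal error",
}

// describe maps a SQLSTATE code to the name of its class, if known.
func describe(code string) string {
	if name, ok := classNames[sqlStateClass(code)]; ok {
		return name
	}
	return "unknown class"
}

func main() {
	// 08006 is the code from the report; 58000 and XX000 are the generic
	// Postgres codes for the two classes proposed as replacements.
	for _, code := range []string{"08006", "58000", "XX000"} {
		fmt.Printf("%s: class %s (%s)\n", code, sqlStateClass(code), describe(code))
	}
}
```

Any code moved out of class 08 stops matching client-side checks that key on that prefix, which appears to be the point of the patch.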
Describe the problem
While conducting a routine migration, a customer encountered the following:
W181019 15:24:49.629554 61831020 internal/client/txn.go:556 [n2,client=130.211.2.195:59869,user=foo] failure aborting transaction: HandledRetryableTxnError: TransactionAbortedError: txn aborted "sql txn" id=581b6de4 key=/Table/82/1/"\xe4\x15\xd4\xe7\xab\xc4G\x1d\x9c5\xd99\x98\xab\xe2\xf5"/0 rw=true pri=0.03491282 iso=SERIALIZABLE stat=PENDING epo=0 ts=1539962689.619304580,0 orig=1539962689.619304580,0 max=1539962690.119304580,0 wto=false rop=false seq=2; abort caused by: failed to send RPC: sending to all 3 replicas failed; last error: {<nil> context canceled}
I181019 15:24:56.600105 181 gossip/gossip.go:488 [n2] gossip status (ok, 3 nodes)
WARN c.z.hikari.pool.ProxyConnection - roach - Connection org.postgresql.jdbc.PgConnection@594e7e23 marked as broken because of SQLSTATE(08006), ErrorCode(0)
Looking in our code, I see that code 08006 corresponds to CodeConnectionFailureError. It is used in only two places: schema_changer.go and inbound.go:99.
The comment above the line in inbound.go does seem to describe the condition reported in the log warning. However, it's not clear why the result of this race is an 08006 error, since that would usually indicate a problem with the connection, while (at least as far as I can see) the underlying issue is just context cancellation.
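To make the client-side impact concrete, here is a minimal Go sketch of a pool policy that discards a connection whenever an error carries a class 08 SQLSTATE. This is an assumption about the general shape of such a rule, inferred from the HikariCP warning above; it does not reproduce HikariCP's actual implementation, and `shouldEvict` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldEvict reports whether a hypothetical connection pool would discard
// a connection after an error with the given SQLSTATE. The rule here
// (evict on any class 08 code) is an assumption made for illustration.
func shouldEvict(sqlState string) bool {
	return strings.HasPrefix(sqlState, "08")
}

func main() {
	// 08006 is the code seen in the report; 58000 and XX000 are generic
	// Postgres codes from the system error and internal error classes.
	for _, code := range []string{"08006", "58000", "XX000"} {
		fmt.Printf("SQLSTATE %s: evict connection = %t\n", code, shouldEvict(code))
	}
}
```

Under a rule like this, a server-side failure reported as 08006 costs the client a healthy connection, even though the connection itself was not at fault.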
To Reproduce
This doesn't reproduce easily, but did result in 4 errors today for the customer during routine migrations.
Environment:
Postgres JDBC 42.1.3
Hibernate 5.1.8.Final
HikariCP 3.2.0
CRDB version to be provided.
Additional context
What was the impact?
HTTP 500 errors potentially sent to end users.
@vivekmenezes, assigning to you first since @andreimatei indicated you wrote this portion of the code. If there's a better home let me know.