Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server,tablemetadatacache: make spanStats failures more readable for system.table_metadata #134546

Closed
xinhaoz opened this issue Nov 7, 2024 · 2 comments
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability

Comments

@xinhaoz
Copy link
Member

xinhaoz commented Nov 7, 2024

Today, if a span stats request fails for a batch in update table metadata job we just write all errors encountered to the affected rows, as is.
Ex)


error requesting iterating nodes for span stats from node 100 (NODE_STATUS_LIVE): operation "iterate-nodes-fn" timed out after 1m0s (given timeout 1m0s): grpc: context deadline exceeded [code 4/DeadlineExceeded] ; error requesting iterating nodes for span stats from node 102 (NODE_STATUS_LIVE): operation "iterate-nodes-fn" timed out after 1m0s (given timeout 1m0s): grpc: context deadline exceeded [code 4/DeadlineExceeded] ; error requesting iterating nodes for span stats from node 87 (NODE_STATUS_LIVE): operation "iterate-nodes-fn" timed out after 1m0s (given timeout 1m0s): grpc: context deadline exceeded [code 4/DeadlineExceeded] ; error requesting iterating nodes for span stats from node 107 (NODE_STATUS_LIVE): operation "iterate-nodes-fn" timed out after 1m0s (given timeout 1m0s): grpc: context deadline exceeded [code 4/DeadlineExceeded] ;

This is bad for a couple of reasons

  • The concatenated error string can be unpredictable in length, and grows with the more errors encountered in the cluster fan out (e.g. in a 150 node cluster, we could write 150 error strings).
  • The concatenated error messages as is from span stats is not very useful or readable. They are often repetitive (e.g. multiple nodes timing out).

We should format the errors received from the span stats call to be shorter and more useful.
One option could be:

Node timeouts: 1,2,3
Unknown node errors: 4, 5

We could go one level deeper than the node and show which spans encountered errors during the req, if any, throwing away the specific error message, e.g. (might not be possible if we need to redact the span keys)

errors on spans: [a, b], [c, d]

To achieve this we'd need to modify the span stats response to track errors by node or span - today they are just an array of strings. The tablemetadata update job can then parse the returned errors into the desired format.

Jira issue: CRDB-44127

@xinhaoz xinhaoz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability labels Nov 7, 2024
@xinhaoz xinhaoz changed the title ui,server,tablemetadata: make spanStats failures more readable for system.table_metadata server,tablemetadata: make spanStats failures more readable for system.table_metadata Nov 7, 2024
@xinhaoz xinhaoz changed the title server,tablemetadata: make spanStats failures more readable for system.table_metadata server,tablemetadatacache: make spanStats failures more readable for system.table_metadata Nov 7, 2024
@kyle-a-wong
Copy link
Contributor

Made a quick change to the TMJ to store a generic "An error has occurred while fetching span stats." in system.table_,metadata if the span stats request failed for a given batch. Also updated the frontend code to not include details about why the job failed. Related PR

I think it may still be valuable to have more specific errors around span stats request failures in the table, but the work to do so is not trivial. The SpanStatsResponse proto contains errors for partial errors, but these are represented as strings on said response object. As a result, a lot of error type information is lost. As a fix, this proto should be updated to have errors typed as errorspb.EncodedError a la:

  map<int32, errorspb.EncodedError> errors = 2 [
    (gogoproto.castkey) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID",
    (gogoproto.nullable) = false
  ];

By making this change, it will be easier to group node errors by type, which will allow us to complete this ticket

@kyle-a-wong
Copy link
Contributor

kyle-a-wong commented Nov 14, 2024

Discussed with @dhartunian and have decided to close this ticket out. There is not much value in storing this error data in system.table_metadata and customers can refer to logs to see specific errors being returned.

While #134902 implements a different solution than those suggested in the issue description, it solves the issue being described and can be considered done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability
Projects
None yet
Development

No branches or pull requests

2 participants