server,tablemetadatacache: make spanStats failures more readable for system.table_metadata #134546

xinhaoz · 2024-11-07T18:40:27Z

Today, if a span stats request fails for a batch in update table metadata job we just write all errors encountered to the affected rows, as is.
Ex)


error requesting iterating nodes for span stats from node 100 (NODE_STATUS_LIVE): operation "iterate-nodes-fn" timed out after 1m0s (given timeout 1m0s): grpc: context deadline exceeded [code 4/DeadlineExceeded] ; error requesting iterating nodes for span stats from node 102 (NODE_STATUS_LIVE): operation "iterate-nodes-fn" timed out after 1m0s (given timeout 1m0s): grpc: context deadline exceeded [code 4/DeadlineExceeded] ; error requesting iterating nodes for span stats from node 87 (NODE_STATUS_LIVE): operation "iterate-nodes-fn" timed out after 1m0s (given timeout 1m0s): grpc: context deadline exceeded [code 4/DeadlineExceeded] ; error requesting iterating nodes for span stats from node 107 (NODE_STATUS_LIVE): operation "iterate-nodes-fn" timed out after 1m0s (given timeout 1m0s): grpc: context deadline exceeded [code 4/DeadlineExceeded] ;

This is bad for a couple of reasons

The concatenated error string can be unpredictable in length, and grows with the more errors encountered in the cluster fan out (e.g. in a 150 node cluster, we could write 150 error strings).
The concatenated error messages as is from span stats is not very useful or readable. They are often repetitive (e.g. multiple nodes timing out).

We should format the errors received from the span stats call to be shorter and more useful.
One option could be:

Node timeouts: 1,2,3
Unknown node errors: 4, 5

We could go one level deeper than the node and show which spans encountered errors during the req, if any, throwing away the specific error message, e.g. (might not be possible if we need to redact the span keys)

errors on spans: [a, b], [c, d]

To achieve this we'd need to modify the span stats response to track errors by node or span - today they are just an array of strings. The tablemetadata update job can then parse the returned errors into the desired format.

Jira issue: CRDB-44127

The text was updated successfully, but these errors were encountered:

kyle-a-wong · 2024-11-14T19:35:52Z

Made a quick change to the TMJ to store a generic "An error has occurred while fetching span stats." in system.table_,metadata if the span stats request failed for a given batch. Also updated the frontend code to not include details about why the job failed. Related PR

I think it may still be valuable to have more specific errors around span stats request failures in the table, but the work to do so is not trivial. The SpanStatsResponse proto contains errors for partial errors, but these are represented as strings on said response object. As a result, a lot of error type information is lost. As a fix, this proto should be updated to have errors typed as errorspb.EncodedError a la:

  map<int32, errorspb.EncodedError> errors = 2 [
    (gogoproto.castkey) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID",
    (gogoproto.nullable) = false
  ];

By making this change, it will be easier to group node errors by type, which will allow us to complete this ticket

kyle-a-wong · 2024-11-14T20:30:31Z

Discussed with @dhartunian and have decided to close this ticket out. There is not much value in storing this error data in system.table_metadata and customers can refer to logs to see specific errors being returned.

While #134902 implements a different solution than those suggested in the issue description, it solves the issue being described and can be considered done.

xinhaoz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability labels Nov 7, 2024

xinhaoz changed the title ~~ui,server,tablemetadata: make spanStats failures more readable for system.table_metadata~~ server,tablemetadata: make spanStats failures more readable for system.table_metadata Nov 7, 2024

xinhaoz changed the title ~~server,tablemetadata: make spanStats failures more readable for system.table_metadata~~ server,tablemetadatacache: make spanStats failures more readable for system.table_metadata Nov 7, 2024

exalate-issue-sync bot assigned kyle-a-wong Nov 11, 2024

exalate-issue-sync bot assigned kyle-a-wong and unassigned kyle-a-wong Nov 14, 2024

kyle-a-wong closed this as completed Nov 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server,tablemetadatacache: make spanStats failures more readable for system.table_metadata #134546

server,tablemetadatacache: make spanStats failures more readable for system.table_metadata #134546

xinhaoz commented Nov 7, 2024 •

edited

Loading

kyle-a-wong commented Nov 14, 2024

kyle-a-wong commented Nov 14, 2024 •

edited

Loading

server,tablemetadatacache: make spanStats failures more readable for system.table_metadata #134546

server,tablemetadatacache: make spanStats failures more readable for system.table_metadata #134546

Comments

xinhaoz commented Nov 7, 2024 • edited Loading

kyle-a-wong commented Nov 14, 2024

kyle-a-wong commented Nov 14, 2024 • edited Loading

xinhaoz commented Nov 7, 2024 •

edited

Loading

kyle-a-wong commented Nov 14, 2024 •

edited

Loading