-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DocDB] Correctly return error when attempting to serialize a too large message rather than crashing #22301
Closed
1 task done
Labels
2.20 Backport Required
2024.1 Backport Required
area/docdb
YugabyteDB core features
kind/bug
This issue is a bug
priority/high
High Priority
Comments
mdbridge
added
area/docdb
YugabyteDB core features
status/awaiting-triage
Issue awaiting triage
labels
May 7, 2024
yugabyte-ci
added
kind/bug
This issue is a bug
priority/medium
Medium priority issue
labels
May 7, 2024
yugabyte-ci
added
priority/high
High Priority
and removed
status/awaiting-triage
Issue awaiting triage
priority/medium
Medium priority issue
labels
May 10, 2024
rthallamko3
added
priority/medium
Medium priority issue
and removed
priority/high
High Priority
labels
Aug 12, 2024
yugabyte-ci
added
priority/high
High Priority
and removed
priority/medium
Medium priority issue
labels
Aug 12, 2024
To reproduce:
|
1 task
One of the customers is running into this and they need the fix in 2.20, so adding corresponding backport labels. |
es1024
added a commit
that referenced
this issue
Sep 6, 2024
Summary: The presence of large rows can result in very large RPC responses to read requests: - By default, we return `yb_fetch_row_limit` rows regardless of row size. - When `yb_fetch_size_limit` is set, we return responses exceeding `yb_fetch_size_limit`, by one row. - We always return at least one full row, so regardless of what `yb_fetch_size_limit` and `yb_fetch_row_limit` are set to, it is possible to have very large RPC responses. When the RPC response is sufficiently large, we do not fail gracefully and do not properly return an error back to the user: - When an RPC response exceeds `rpc_max_message_size`, we attempt to return an error back, but still attempt to use the oversized RPC response for the error, causing the error return to itself fail, resulting in a DFATAL that does not properly respond to the RPC and leaves clients hanging. - When an RPC response exceeds 2^32 bytes, we instead trigger a FATAL from `narrow_cast`, due to assumptions that RPC responses do not exceed 2^32 bytes. - We also consume (unbounded) large amounts of memory to generate and process this response, which may trigger FATAL from checked mallocs failing. This diff makes the following changes: - Change `narrow_cast`s that may FATAL to either not use `narrow_cast`, or to cap the value appropriately before performing `narrow_cast`. - Catch large responses and error earlier to avoid some unnecessary allocations. - Do not attempt to send the sidecars with an error response, to avoid the error response itself failing due to being too large. - Impose a maximum of `rpc_max_message_size` on `yb_fetch_size_limit` (importantly, also in the case where it is set to its default value of `0` for unlimited). This diff also changes protobuf_message_total_bytes_limit from int32 to uint32 (this is safe because negative values made no sense and would have prevented any RPCs from being sent) and adds gflag validators to enforce the following relationship: rpc_max_message_size < protobuf_message_total_bytes_limit < 512 MB **Upgrade/Rollback safety:** This diff only touches test only protos. Jira: DB-11216 Test Plan: Jenkins. Added test cases: - `./yb_build.sh --gtest_filter 'PgMiniTest.ReadHugeRow'` to test RPC too large error is returned up properly. - `./yb_build.sh --gtest_filter TestRpc.MaxSizeResponse --cxx-test rpc_rpc-test` to test max sized RPC, tested before and after changes. No tests were added for the narrow_cast case due to memory limitations running unit tests on Jenkins, but a modified version of the above test that runs into the narrow_cast case was run locally to confirm its fix. Reviewers: qhu, sergei Reviewed By: sergei Subscribers: rthallam, yyan, yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D37548
jasonyb
pushed a commit
that referenced
this issue
Sep 9, 2024
Summary: 82bca83 [DOC-453] ASH wait event table updates for 2.21.1 (#23595) 011830f [PLAT-14809][YBA CLI]Support Master Key Rotation bd39e84 [PLAT-15088] Add DB version check for connection pooling 9853daf [PLAT-14808][YBA CLI]Support edit KMS operations b02f6f7 [doc][yba] KMS and expiring tokens (#23754) 62d30d5 [PLAT-15091][PLAT-14092] Improve HA logging and backup metric 6cc5f6a Add more documentation on allowed preview flags (#23826) a28b3ec [#22301] docdb: Improve handling of large responses bc28ee8 [#23752] DocDB: Integrating hnswlib into the vector indexing framework and hnsw_tool c40fff2 copy Develop changes to stable (#23812) bf7c0b0 [#23828] docdb: Fix compile error in lwproto-test.cc Excluded: 9c1cc23 [#23707] Add table name to /metrics JSON endpoint 3de6206 [PLAT-15151]: Update autoflags based checks for xCluster/DR b7bc45e Revert "[PLAT-13800] skip setting imageBundle reference for YBM based clusters using machineImage" 35add07 [PLAT-14888]: Store PITR parmas during create DR and allow users to edit them. 3ba74e8 Fix docs for xcluster DDL flow for YCQL & onprem provider ssh user (#23827) 981415e [#23820] DocDB: Use shared memory for executing read and write pg client queries in release f20edbb Moving Driver reference to Develop section (#23683) a3f3bfe [PLAT-15209] Do not set optimized mem & tablet split gflags for YBM a95dc94 [PLAT-14801] add v2 API to list YBC Gflags metadata 06d2b8d [PLAT-15194]CLI | All outputs are returned as List instead of JSON for commands like describe 0375a68 [#23700] CDCSDK: Perform table removal from CDC stream via background thread 5d2a151 [#23797] YSQL: Stabilise PgExplainAnalyzeModifyTable tests when running with Connection Manager e62dcfd sbt Swaggergen on master to fix Uts Test Plan: Jenkins: rebase: pg15-cherrypicks Reviewers: jason, tfoucher Differential Revision: https://phorge.dev.yugabyte.com/D37901
es1024
added a commit
that referenced
this issue
Sep 11, 2024
Summary: Original commit: a28b3ec / D37548 The presence of large rows can result in very large RPC responses to read requests: - By default, we return `yb_fetch_row_limit` rows regardless of row size. - When `yb_fetch_size_limit` is set, we return responses exceeding `yb_fetch_size_limit`, by one row. - We always return at least one full row, so regardless of what `yb_fetch_size_limit` and `yb_fetch_row_limit` are set to, it is possible to have very large RPC responses. When the RPC response is sufficiently large, we do not fail gracefully and do not properly return an error back to the user: - When an RPC response exceeds `rpc_max_message_size`, we attempt to return an error back, but still attempt to use the oversized RPC response for the error, causing the error return to itself fail, resulting in a DFATAL that does not properly respond to the RPC and leaves clients hanging. - When an RPC response exceeds 2^32 bytes, we instead trigger a FATAL from `narrow_cast`, due to assumptions that RPC responses do not exceed 2^32 bytes. - We also consume (unbounded) large amounts of memory to generate and process this response, which may trigger FATAL from checked mallocs failing. This diff makes the following changes: - Change `narrow_cast`s that may FATAL to either not use `narrow_cast`, or to cap the value appropriately before performing `narrow_cast`. - Catch large responses and error earlier to avoid some unnecessary allocations. - Do not attempt to send the sidecars with an error response, to avoid the error response itself failing due to being too large. - Impose a maximum of `rpc_max_message_size` on `yb_fetch_size_limit` (importantly, also in the case where it is set to its default value of `0` for unlimited). This diff also changes protobuf_message_total_bytes_limit from int32 to uint32 (this is safe because negative values made no sense and would have prevented any RPCs from being sent) and adds gflag validators to enforce the following relationship: rpc_max_message_size < protobuf_message_total_bytes_limit < 512 MB **Upgrade/Rollback safety:** This diff only touches test only protos. Jira: DB-11216 Test Plan: Jenkins. Added test cases: - `./yb_build.sh --gtest_filter 'PgMiniTest.ReadHugeRow'` to test RPC too large error is returned up properly. - `./yb_build.sh --gtest_filter TestRpc.MaxSizeResponse --cxx-test rpc_rpc-test` to test max sized RPC, tested before and after changes. No tests were added for the narrow_cast case due to memory limitations running unit tests on Jenkins, but a modified version of the above test that runs into the narrow_cast case was run locally to confirm its fix. Reviewers: qhu, sergei Reviewed By: sergei Subscribers: ybase, yql, yyan, rthallam Differential Revision: https://phorge.dev.yugabyte.com/D37864
es1024
added a commit
that referenced
this issue
Sep 11, 2024
Summary: Original commit: a28b3ec / D37548 The presence of large rows can result in very large RPC responses to read requests: - By default, we return `yb_fetch_row_limit` rows regardless of row size. - When `yb_fetch_size_limit` is set, we return responses exceeding `yb_fetch_size_limit`, by one row. - We always return at least one full row, so regardless of what `yb_fetch_size_limit` and `yb_fetch_row_limit` are set to, it is possible to have very large RPC responses. When the RPC response is sufficiently large, we do not fail gracefully and do not properly return an error back to the user: - When an RPC response exceeds `rpc_max_message_size`, we attempt to return an error back, but still attempt to use the oversized RPC response for the error, causing the error return to itself fail, resulting in a DFATAL that does not properly respond to the RPC and leaves clients hanging. - When an RPC response exceeds 2^32 bytes, we instead trigger a FATAL from `narrow_cast`, due to assumptions that RPC responses do not exceed 2^32 bytes. - We also consume (unbounded) large amounts of memory to generate and process this response, which may trigger FATAL from checked mallocs failing. This diff makes the following changes: - Change `narrow_cast`s that may FATAL to either not use `narrow_cast`, or to cap the value appropriately before performing `narrow_cast`. - Catch large responses and error earlier to avoid some unnecessary allocations. - Do not attempt to send the sidecars with an error response, to avoid the error response itself failing due to being too large. - Impose a maximum of `rpc_max_message_size` on `yb_fetch_size_limit` (importantly, also in the case where it is set to its default value of `0` for unlimited). This diff also changes protobuf_message_total_bytes_limit from int32 to uint32 (this is safe because negative values made no sense and would have prevented any RPCs from being sent) and adds gflag validators to enforce the following relationship: rpc_max_message_size < protobuf_message_total_bytes_limit < 512 MB **Upgrade/Rollback safety:** This diff only touches test only protos. Jira: DB-11216 Test Plan: Jenkins. Added test cases: - `./yb_build.sh --gtest_filter 'PgMiniTest.ReadHugeRow'` to test RPC too large error is returned up properly. - `./yb_build.sh --gtest_filter TestRpc.MaxSizeResponse --cxx-test rpc_rpc-test` to test max sized RPC, tested before and after changes. No tests were added for the narrow_cast case due to memory limitations running unit tests on Jenkins, but a modified version of the above test that runs into the narrow_cast case was run locally to confirm its fix. Reviewers: qhu, sergei Reviewed By: sergei Subscribers: rthallam, yyan, yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D37865
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
2.20 Backport Required
2024.1 Backport Required
area/docdb
YugabyteDB core features
kind/bug
This issue is a bug
priority/high
High Priority
Jira Link: DB-11216
Description
Today, we have the following code:
where
However, SerializedMessageSize has:
This code will crash the TServer if full_size is greater than the maximum integer value for uint32_t. (narrow_cast crashes the process if it is unable to complete the cast.)
This task is to prevent the TServer from crashing due to narrow_cast's in serialization.cc.
At a minimum, add a test that serializing a message greater than 4 GiB returns a InvalidArgument Status rather than crashing the process.
Tthe log snippet of the crash is:
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
The text was updated successfully, but these errors were encountered: