[DocDB] Correctly return error when attempting to serialize a too large message rather than crashing #22301

mdbridge · 2024-05-07T20:32:59Z

Description

Today, we have the following code:

~/code/yugabyte-db/src/yb/rpc/serialization.cc:67:
Status SerializeMessage(
    AnyMessageConstPtr msg, size_t body_size, const RefCntBuffer& param_buf,
    size_t additional_size, size_t offset) {
  DCHECK_EQ(msg.SerializedSize(), body_size);
  auto size = SerializedMessageSize(body_size, additional_size);


  auto total_size = size + additional_size;
  if (total_size > FLAGS_rpc_max_message_size) {
    return STATUS_FORMAT(InvalidArgument, "Sending too long RPC message ($0 bytes)", total_size);
  }

where

~/code/yugabyte-db/src/yb/rpc/lightweight_message.cc:30:
DEFINE_UNKNOWN_uint64(rpc_max_message_size, 255_MB,
    "The maximum size of a message of any RPC that the server will accept. The sum of "
    "consensus_max_batch_size_bytes and 1KB should be less than rpc_max_message_size");

However, SerializedMessageSize has:

~/code/yugabyte-db/src/yb/rpc/serialization.cc:62:
size_t SerializedMessageSize(size_t body_size, size_t additional_size) {
  auto full_size = body_size + additional_size;
  return body_size + CodedOutputStream::VarintSize32(narrow_cast<uint32_t>(full_size));

This code will crash the TServer if full_size is greater than the maximum integer value for uint32_t. (narrow_cast crashes the process if it is unable to complete the cast.)

This task is to prevent the TServer from crashing due to narrow_cast's in serialization.cc.

At a minimum, add a test that serializing a message greater than 4 GiB returns a InvalidArgument Status rather than crashing the process.

Tthe log snippet of the crash is:


6 06:30:43 ulcpcxygdb004 yb-tserver: F0506 06:30:43.382462 13246 <http://casts.cc:21]|casts.cc:21]> Bad narrow cast: 5867568307 &gt; 4294967295
May  6 06:30:43 ulcpcxygdb004 yb-tserver: Fatal failure details written to /yuga01/yb-data/tserver/logs/yb-tserver.FATAL.details.2024-05-06T06_30_43.pid26522.txt
May  6 06:30:43 ulcpcxygdb004 yb-tserver: F20240506 06:30:43 ../../src/yb/gutil/casts.cc:21] Bad narrow cast: 5867568307 &gt; 4294967295
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d1354bf47  google::LogMessage::SendToLog()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d1354ce8d  google::LogMessage::Flush()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d1354d509  google::LogMessageFatal::~LogMessageFatal()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d13d53609  yb::BadNarrowCast()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d1454b0bd  yb::rpc::YBInboundCall::Respond()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d144acb2e  yb::rpc::RpcContext::RespondSuccess()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d147f309c  yb::tserver::(anonymous namespace)::ReadQuery::Complete()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d147ec2cc  yb::tserver::PerformRead()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d14844433  yb::tserver::TabletServiceImpl::Read()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d149805a4  std::__1::__function::__func&lt;&gt;::operator()()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d1498204f  yb::tserver::TabletServerServiceIf::Handle()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d14537f8a  yb::rpc::ServicePoolImpl::Handle()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d14477d7f  yb::rpc::InboundCall::InboundCallTask::Run()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d14546b23  yb::rpc::(anonymous namespace)::Worker::Execute()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x562d14c18b82  yb::Thread::SuperviseThread()
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x7fa58a181694  start_thread
May  6 06:30:43 ulcpcxygdb004 yb-tserver: @     0x7fa58a68341d  __clone

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

I confirm this issue does not contain any sensitive information.

The text was updated successfully, but these errors were encountered:

es1024 · 2024-08-22T17:01:31Z

To reproduce:

CREATE TABLE test(pk INT PRIMARY KEY, text00 TEXT, text01 TEXT, text02 TEXT, text03 TEXT, text04 TEXT, text05 TEXT, text06 TEXT, text07 TEXT, text08 TEXT, text09 TEXT, text10 TEXT, text11 TEXT, text12 TEXT, text13 TEXT, text14 TEXT, text15 TEXT, text16 TEXT);
INSERT INTO test(pk) VALUES(0);
UPDATE test SET text00 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text01 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text02 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text03 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text04 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text05 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text06 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text07 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text08 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text09 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text10 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text11 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text12 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text13 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text14 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text15 = repeat('0', 254000000) WHERE pk = 0;
UPDATE test SET text16 = repeat('0', 254000000) WHERE pk = 0;
SELECT * FROM test LIMIT 1;

rthallamko3 · 2024-08-28T22:37:24Z

One of the customers is running into this and they need the fix in 2.20, so adding corresponding backport labels.

Summary: The presence of large rows can result in very large RPC responses to read requests: - By default, we return `yb_fetch_row_limit` rows regardless of row size. - When `yb_fetch_size_limit` is set, we return responses exceeding `yb_fetch_size_limit`, by one row. - We always return at least one full row, so regardless of what `yb_fetch_size_limit` and `yb_fetch_row_limit` are set to, it is possible to have very large RPC responses. When the RPC response is sufficiently large, we do not fail gracefully and do not properly return an error back to the user: - When an RPC response exceeds `rpc_max_message_size`, we attempt to return an error back, but still attempt to use the oversized RPC response for the error, causing the error return to itself fail, resulting in a DFATAL that does not properly respond to the RPC and leaves clients hanging. - When an RPC response exceeds 2^32 bytes, we instead trigger a FATAL from `narrow_cast`, due to assumptions that RPC responses do not exceed 2^32 bytes. - We also consume (unbounded) large amounts of memory to generate and process this response, which may trigger FATAL from checked mallocs failing. This diff makes the following changes: - Change `narrow_cast`s that may FATAL to either not use `narrow_cast`, or to cap the value appropriately before performing `narrow_cast`. - Catch large responses and error earlier to avoid some unnecessary allocations. - Do not attempt to send the sidecars with an error response, to avoid the error response itself failing due to being too large. - Impose a maximum of `rpc_max_message_size` on `yb_fetch_size_limit` (importantly, also in the case where it is set to its default value of `0` for unlimited). This diff also changes protobuf_message_total_bytes_limit from int32 to uint32 (this is safe because negative values made no sense and would have prevented any RPCs from being sent) and adds gflag validators to enforce the following relationship: rpc_max_message_size < protobuf_message_total_bytes_limit < 512 MB **Upgrade/Rollback safety:** This diff only touches test only protos. Jira: DB-11216 Test Plan: Jenkins. Added test cases: - `./yb_build.sh --gtest_filter 'PgMiniTest.ReadHugeRow'` to test RPC too large error is returned up properly. - `./yb_build.sh --gtest_filter TestRpc.MaxSizeResponse --cxx-test rpc_rpc-test` to test max sized RPC, tested before and after changes. No tests were added for the narrow_cast case due to memory limitations running unit tests on Jenkins, but a modified version of the above test that runs into the narrow_cast case was run locally to confirm its fix. Reviewers: qhu, sergei Reviewed By: sergei Subscribers: rthallam, yyan, yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D37548

Summary: 82bca83 [DOC-453] ASH wait event table updates for 2.21.1 (#23595) 011830f [PLAT-14809][YBA CLI]Support Master Key Rotation bd39e84 [PLAT-15088] Add DB version check for connection pooling 9853daf [PLAT-14808][YBA CLI]Support edit KMS operations b02f6f7 [doc][yba] KMS and expiring tokens (#23754) 62d30d5 [PLAT-15091][PLAT-14092] Improve HA logging and backup metric 6cc5f6a Add more documentation on allowed preview flags (#23826) a28b3ec [#22301] docdb: Improve handling of large responses bc28ee8 [#23752] DocDB: Integrating hnswlib into the vector indexing framework and hnsw_tool c40fff2 copy Develop changes to stable (#23812) bf7c0b0 [#23828] docdb: Fix compile error in lwproto-test.cc Excluded: 9c1cc23 [#23707] Add table name to /metrics JSON endpoint 3de6206 [PLAT-15151]: Update autoflags based checks for xCluster/DR b7bc45e Revert "[PLAT-13800] skip setting imageBundle reference for YBM based clusters using machineImage" 35add07 [PLAT-14888]: Store PITR parmas during create DR and allow users to edit them. 3ba74e8 Fix docs for xcluster DDL flow for YCQL & onprem provider ssh user (#23827) 981415e [#23820] DocDB: Use shared memory for executing read and write pg client queries in release f20edbb Moving Driver reference to Develop section (#23683) a3f3bfe [PLAT-15209] Do not set optimized mem & tablet split gflags for YBM a95dc94 [PLAT-14801] add v2 API to list YBC Gflags metadata 06d2b8d [PLAT-15194]CLI | All outputs are returned as List instead of JSON for commands like describe 0375a68 [#23700] CDCSDK: Perform table removal from CDC stream via background thread 5d2a151 [#23797] YSQL: Stabilise PgExplainAnalyzeModifyTable tests when running with Connection Manager e62dcfd sbt Swaggergen on master to fix Uts Test Plan: Jenkins: rebase: pg15-cherrypicks Reviewers: jason, tfoucher Differential Revision: https://phorge.dev.yugabyte.com/D37901

Summary: Original commit: a28b3ec / D37548 The presence of large rows can result in very large RPC responses to read requests: - By default, we return `yb_fetch_row_limit` rows regardless of row size. - When `yb_fetch_size_limit` is set, we return responses exceeding `yb_fetch_size_limit`, by one row. - We always return at least one full row, so regardless of what `yb_fetch_size_limit` and `yb_fetch_row_limit` are set to, it is possible to have very large RPC responses. When the RPC response is sufficiently large, we do not fail gracefully and do not properly return an error back to the user: - When an RPC response exceeds `rpc_max_message_size`, we attempt to return an error back, but still attempt to use the oversized RPC response for the error, causing the error return to itself fail, resulting in a DFATAL that does not properly respond to the RPC and leaves clients hanging. - When an RPC response exceeds 2^32 bytes, we instead trigger a FATAL from `narrow_cast`, due to assumptions that RPC responses do not exceed 2^32 bytes. - We also consume (unbounded) large amounts of memory to generate and process this response, which may trigger FATAL from checked mallocs failing. This diff makes the following changes: - Change `narrow_cast`s that may FATAL to either not use `narrow_cast`, or to cap the value appropriately before performing `narrow_cast`. - Catch large responses and error earlier to avoid some unnecessary allocations. - Do not attempt to send the sidecars with an error response, to avoid the error response itself failing due to being too large. - Impose a maximum of `rpc_max_message_size` on `yb_fetch_size_limit` (importantly, also in the case where it is set to its default value of `0` for unlimited). This diff also changes protobuf_message_total_bytes_limit from int32 to uint32 (this is safe because negative values made no sense and would have prevented any RPCs from being sent) and adds gflag validators to enforce the following relationship: rpc_max_message_size < protobuf_message_total_bytes_limit < 512 MB **Upgrade/Rollback safety:** This diff only touches test only protos. Jira: DB-11216 Test Plan: Jenkins. Added test cases: - `./yb_build.sh --gtest_filter 'PgMiniTest.ReadHugeRow'` to test RPC too large error is returned up properly. - `./yb_build.sh --gtest_filter TestRpc.MaxSizeResponse --cxx-test rpc_rpc-test` to test max sized RPC, tested before and after changes. No tests were added for the narrow_cast case due to memory limitations running unit tests on Jenkins, but a modified version of the above test that runs into the narrow_cast case was run locally to confirm its fix. Reviewers: qhu, sergei Reviewed By: sergei Subscribers: ybase, yql, yyan, rthallam Differential Revision: https://phorge.dev.yugabyte.com/D37864

Summary: Original commit: a28b3ec / D37548 The presence of large rows can result in very large RPC responses to read requests: - By default, we return `yb_fetch_row_limit` rows regardless of row size. - When `yb_fetch_size_limit` is set, we return responses exceeding `yb_fetch_size_limit`, by one row. - We always return at least one full row, so regardless of what `yb_fetch_size_limit` and `yb_fetch_row_limit` are set to, it is possible to have very large RPC responses. When the RPC response is sufficiently large, we do not fail gracefully and do not properly return an error back to the user: - When an RPC response exceeds `rpc_max_message_size`, we attempt to return an error back, but still attempt to use the oversized RPC response for the error, causing the error return to itself fail, resulting in a DFATAL that does not properly respond to the RPC and leaves clients hanging. - When an RPC response exceeds 2^32 bytes, we instead trigger a FATAL from `narrow_cast`, due to assumptions that RPC responses do not exceed 2^32 bytes. - We also consume (unbounded) large amounts of memory to generate and process this response, which may trigger FATAL from checked mallocs failing. This diff makes the following changes: - Change `narrow_cast`s that may FATAL to either not use `narrow_cast`, or to cap the value appropriately before performing `narrow_cast`. - Catch large responses and error earlier to avoid some unnecessary allocations. - Do not attempt to send the sidecars with an error response, to avoid the error response itself failing due to being too large. - Impose a maximum of `rpc_max_message_size` on `yb_fetch_size_limit` (importantly, also in the case where it is set to its default value of `0` for unlimited). This diff also changes protobuf_message_total_bytes_limit from int32 to uint32 (this is safe because negative values made no sense and would have prevented any RPCs from being sent) and adds gflag validators to enforce the following relationship: rpc_max_message_size < protobuf_message_total_bytes_limit < 512 MB **Upgrade/Rollback safety:** This diff only touches test only protos. Jira: DB-11216 Test Plan: Jenkins. Added test cases: - `./yb_build.sh --gtest_filter 'PgMiniTest.ReadHugeRow'` to test RPC too large error is returned up properly. - `./yb_build.sh --gtest_filter TestRpc.MaxSizeResponse --cxx-test rpc_rpc-test` to test max sized RPC, tested before and after changes. No tests were added for the narrow_cast case due to memory limitations running unit tests on Jenkins, but a modified version of the above test that runs into the narrow_cast case was run locally to confirm its fix. Reviewers: qhu, sergei Reviewed By: sergei Subscribers: rthallam, yyan, yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D37865

mdbridge added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels May 7, 2024

yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels May 7, 2024

yugabyte-ci assigned yusong-yan May 10, 2024

yugabyte-ci added priority/high High Priority and removed status/awaiting-triage Issue awaiting triage priority/medium Medium priority issue labels May 10, 2024

rthallamko3 assigned es1024 and unassigned yusong-yan May 15, 2024

rthallamko3 added priority/medium Medium priority issue and removed priority/high High Priority labels Aug 12, 2024

yugabyte-ci added priority/high High Priority and removed priority/medium Medium priority issue labels Aug 12, 2024

es1024 mentioned this issue Aug 26, 2024

[DocDB] Maximums on row sizes should be added/enforced #23636

Open

1 task

rthallamko3 added 2024.1 Backport Required 2.20 Backport Required labels Aug 28, 2024

rthallamko3 closed this as completed Sep 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DocDB] Correctly return error when attempting to serialize a too large message rather than crashing #22301

[DocDB] Correctly return error when attempting to serialize a too large message rather than crashing #22301

mdbridge commented May 7, 2024 •

edited by rthallamko3

Loading

es1024 commented Aug 22, 2024

rthallamko3 commented Aug 28, 2024

[DocDB] Correctly return error when attempting to serialize a too large message rather than crashing #22301

[DocDB] Correctly return error when attempting to serialize a too large message rather than crashing #22301

Comments

mdbridge commented May 7, 2024 • edited by rthallamko3 Loading

Description

Issue Type

Warning: Please confirm that this issue does not contain any sensitive information

es1024 commented Aug 22, 2024

rthallamko3 commented Aug 28, 2024

mdbridge commented May 7, 2024 •

edited by rthallamko3

Loading