TiFlash with s3 request result is not correct under disagg arch #7576

Closed
lilinghai opened this issue May 31, 2023 · 3 comments · Fixed by #7530
lilinghai commented May 31, 2023

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  • Deploy a cluster with 3 TiFlash write nodes and 3 TiFlash compute nodes
  • Run a random DML workload and check that count(*) from TiKV and TiFlash agree (a minimal sketch of such a check follows the log below)
[2023/05/30 23:13:26.645 +08:00] [ERROR] [consistency_test.go:590] ["check consistence error"] ["db name"=db12] [sql="select count(*) from other_handle7"] ["prev succ"=4566] [error="result inconsistent tso 441833117339353090, tikv: [12100000], tiflash [6925760]"] 
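The check in consistency_test.go is not included in this issue; the following is only a minimal sketch of what such a count(*) comparison could look like. The TiDB endpoint, the db12.other_handle7 table name, and the use of the tidb_isolation_read_engines session variable to force the storage engine are assumptions for illustration, and the real test also pins a specific read tso, which is omitted here.

```go
// consistency_check.go: a minimal sketch of the count(*) consistency check.
// This is not the actual consistency_test.go; DSN, table name, and the use of
// tidb_isolation_read_engines are assumptions for illustration only.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

// countWithEngine forces the session to read only from the given engine
// ("tikv" or "tiflash") and returns count(*) of the target table.
func countWithEngine(db *sql.DB, engine, table string) (int64, error) {
	stmt := fmt.Sprintf("set session tidb_isolation_read_engines = '%s'", engine)
	if _, err := db.Exec(stmt); err != nil {
		return 0, err
	}
	var cnt int64
	err := db.QueryRow("select count(*) from " + table).Scan(&cnt)
	return cnt, err
}

func main() {
	// Hypothetical DSN and table name, matching the log above.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/db12")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	tikvCnt, err := countWithEngine(db, "tikv", "other_handle7")
	if err != nil {
		log.Fatal(err)
	}
	tiflashCnt, err := countWithEngine(db, "tiflash", "other_handle7")
	if err != nil {
		log.Fatal(err)
	}
	if tikvCnt != tiflashCnt {
		log.Fatalf("result inconsistent, tikv: [%d], tiflash [%d]", tikvCnt, tiflashCnt)
	}
	fmt.Println("consistent:", tikvCnt)
}
```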

2. What did you expect to see? (Required)

The count(*) results returned by TiKV and TiFlash should be consistent.

3. What did you see instead (Required)

There is error logging like the following in the TiFlash write nodes' log files.

[2023/05/30 23:13:14.985 +08:00] [DEBUG] [FlashService.cpp:772] ["FetchDisaggPages respond finished, task_id=DisTaskId<MPP<query:<query_ts:1685459584596285229, local_query_id:113897, server_id:1623874, start_ts:441833117339353090>,task_id:3>,executor=TableFullScan_16>"] [source="DisTaskId<MPP<query:<query_ts:1685459584596285229, local_query_id:113897, server_id:1623874, start_ts:441833117339353090>,task_id:3>,executor=TableFullScan_16>"] [thread_id=819]
[2023/05/30 23:13:18.553 +08:00] [ERROR] [FlashService.cpp:782] ["FetchDisaggPages meet exception: Check task.isValid() failed, task.err_msg = Segment task not found by segment_id, table_id=2934, segment_id=35597\n\n       0x1c2084e\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+29493326]\n                \tdbms/src/Common/Exception.h:46\n       0x81442f4\tDB::FlashService::FetchDisaggPages(grpc::ServerContext*, disaggregated::FetchDisaggPagesRequest const*, grpc::ServerWriter<disaggregated::PagesPacket>*) [tiflash+135545588]\n                \tdbms/src/Flash/FlashService.cpp:761\n       0x90b7bed\tgrpc::internal::ServerStreamingHandler<tikvpb::Tikv::Service, disaggregated::FetchDisaggPagesRequest, disaggregated::PagesPacket>::RunHandler(grpc::internal::MethodHandler::HandlerParameter const&) [tiflash+151747565]\n                \tcontrib/grpc/include/grpcpp/impl/codegen/method_handler.h:206\n       0x8a71530\tgrpc::Server::SyncRequest::ContinueRunAfterInterception() [tiflash+145167664]\n                \tcontrib/grpc/src/cpp/server/server_cc.cc:433\n       0x8a71361\tgrpc::Server::SyncRequest::Run(std::__1::shared_ptr<grpc::Server::GlobalCallbacks> const&, bool) [tiflash+145167201]\n                \tcontrib/grpc/src/cpp/server/server_cc.cc:421\n       0x8a83555\tgrpc::ThreadManager::WorkerThread::WorkerThread(grpc::ThreadManager*)::$_0::__invoke(void*) [tiflash+145241429]\n                \tcontrib/grpc/src/cpp/thread_manager/thread_manager.cc:36\n       0x8e1da1a\tgrpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::'lambda'(void*)::__invoke(void*) [tiflash+149019162]\n                \tcontrib/grpc/src/core/lib/gprpp/thd_posix.cc:110\n  0x7ff69d024802\tstart_thread [libc.so.6+653314]\n  0x7ff69cfc4450\tclone3 [libc.so.6+259152]"] [source="DisTaskId<MPP<query:<query_ts:1685459584596285229, local_query_id:113897, server_id:1623874, start_ts:441833117339353090>,task_id:3>,executor=TableFullScan_16>"] [thread_id=834]
[2023/05/30 23:13:18.553 +08:00] [ERROR] [FlashService.cpp:782] ["FetchDisaggPages meet exception: Check task.isValid() failed, task.err_msg = Segment task not found by segment_id, table_id=2934, segment_id=49271\n\n       0x1c2084e\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+29493326]\n                \tdbms/src/Common/Exception.h:46\n       0x81442f4\tDB::FlashService::FetchDisaggPages(grpc::ServerContext*, disaggregated::FetchDisaggPagesRequest const*, grpc::ServerWriter<disaggregated::PagesPacket>*) [tiflash+135545588]\n                \tdbms/src/Flash/FlashService.cpp:761\n       0x90b7bed\tgrpc::internal::ServerStreamingHandler<tikvpb::Tikv::Service, disaggregated::FetchDisaggPagesRequest, disaggregated::PagesPacket>::RunHandler(grpc::internal::MethodHandler::HandlerParameter const&) [tiflash+151747565]\n                \tcontrib/grpc/include/grpcpp/impl/codegen/method_handler.h:206\n       0x8a71530\tgrpc::Server::SyncRequest::ContinueRunAfterInterception() [tiflash+145167664]\n                \tcontrib/grpc/src/cpp/server/server_cc.cc:433\n       0x8a71361\tgrpc::Server::SyncRequest::Run(std::__1::shared_ptr<grpc::Server::GlobalCallbacks> const&, bool) [tiflash+145167201]\n                \tcontrib/grpc/src/cpp/server/server_cc.cc:421\n       0x8a83555\tgrpc::ThreadManager::WorkerThread::WorkerThread(grpc::ThreadManager*)::$_0::__invoke(void*) [tiflash+145241429]\n                \tcontrib/grpc/src/cpp/thread_manager/thread_manager.cc:36\n       0x8e1da1a\tgrpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::'lambda'(void*)::__invoke(void*) [tiflash+149019162]\n                \tcontrib/grpc/src/core/lib/gprpp/thd_posix.cc:110\n  0x7ff69d024802\tstart_thread [libc.so.6+653314]\n  0x7ff69cfc4450\tclone3 [libc.so.6+259152]"] [source="DisTaskId<MPP<query:<query_ts:1685459584596285229, local_query_id:113897, server_id:1623874, start_ts:441833117339353090>,task_id:3>,executor=TableFullScan_16>"] [thread_id=810]
[2023/05/30 23:13:18.553 +08:00] [ERROR] [FlashService.cpp:782] ["FetchDisaggPages meet exception: Check task.isValid() failed, task.err_msg = Segment task not found by segment_id, table_id=2934, segment_id=35615\n\n       0x1c2084e\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+29493326]\n                \tdbms/src/Common/Exception.h:46\n       0x81442f4\tDB::FlashService::FetchDisaggPages(grpc::ServerContext*, disaggregated::FetchDisaggPagesRequest const*, grpc::ServerWriter<disaggregated::PagesPacket>*) [tiflash+135545588]\n                \tdbms/src/Flash/FlashService.cpp:761\n       0x90b7bed\tgrpc::internal::ServerStreamingHandler<tikvpb::Tikv::Service, disaggregated::FetchDisaggPagesRequest, disaggregated::PagesPacket>::RunHandler(grpc::internal::MethodHandler::HandlerParameter const&) [tiflash+151747565]\n                \tcontrib/grpc/include/grpcpp/impl/codegen/method_handler.h:206\n       0x8a71530\tgrpc::Server::SyncRequest::ContinueRunAfterInterception() [tiflash+145167664]\n                \tcontrib/grpc/src/cpp/server/server_cc.cc:433\n       0x8a71361\tgrpc::Server::SyncRequest::Run(std::__1::shared_ptr<grpc::Server::GlobalCallbacks> const&, bool) [tiflash+145167201]\n                \tcontrib/grpc/src/cpp/server/server_cc.cc:421\n       0x8a83555\tgrpc::ThreadManager::WorkerThread::WorkerThread(grpc::ThreadManager*)::$_0::__invoke(void*) [tiflash+145241429]\n                \tcontrib/grpc/src/cpp/thread_manager/thread_manager.cc:36\n       0x8e1da1a\tgrpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::'lambda'(void*)::__invoke(void*) [tiflash+149019162]\n                \tcontrib/grpc/src/core/lib/gprpp/thd_posix.cc:110\n  0x7ff69d024802\tstart_thread [libc.so.6+653314]\n  0x7ff69cfc4450\tclone3 [libc.so.6+259152]"] [source="DisTaskId<MPP<query:<query_ts:1685459584596285229, local_query_id:113897, server_id:1623874, start_ts:441833117339353090>,task_id:3>,executor=TableFullScan_16>"] [thread_id=822]
[2023/05/30 23:13:18.553 +08:00] [ERROR] [FlashService.cpp:782] ["FetchDisaggPages meet exception: Check task.isValid() failed, task.err_msg = Segment task not found by segment_id, table_id=2934, segment_id=35552\n\n       0x1c2084e\tDB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) [tiflash+29493326]\n                \tdbms/src/Common/Exception.h:46\n       0x81442f4\tDB::FlashService::FetchDisaggPages(grpc::ServerContext*, disaggregated::FetchDisaggPagesRequest const*, grpc::ServerWriter<disaggregated::PagesPacket>*) [tiflash+135545588]\n                \tdbms/src/Flash/FlashService.cpp:761\n       0x90b7bed\tgrpc::internal::ServerStreamingHandler<tikvpb::Tikv::Service, disaggregated::FetchDisaggPagesRequest, disaggregated::PagesPacket>::RunHandler(grpc::internal::MethodHandler::HandlerParameter const&) [tiflash+151747565]\n                \tcontrib/grpc/include/grpcpp/impl/codegen/method_handler.h:206\n       0x8a71530\tgrpc::Server::SyncRequest::ContinueRunAfterInterception() [tiflash+145167664]\n                \tcontrib/grpc/src/cpp/server/server_cc.cc:433\n       0x8a71361\tgrpc::Server::SyncRequest::Run(std::__1::shared_ptr<grpc::Server::GlobalCallbacks> const&, bool) [tiflash+145167201]\n                \tcontrib/grpc/src/cpp/server/server_cc.cc:421\n       0x8a83555\tgrpc::ThreadManager::WorkerThread::WorkerThread(grpc::ThreadManager*)::$_0::__invoke(void*) [tiflash+145241429]\n                \tcontrib/grpc/src/cpp/thread_manager/thread_manager.cc:36\n       0x8e1da1a\tgrpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*, grpc_core::Thread::Options const&)::'lambda'(void*)::__invoke(void*) [tiflash+149019162]\n                \tcontrib/grpc/src/core/lib/gprpp/thd_posix.cc:110\n  0x7ff69d024802\tstart_thread [libc.so.6+653314]\n  0x7ff69cfc4450\tclone3 [libc.so.6+259152]"] [source="DisTaskId<MPP<query:<query_ts:1685459584596285229, local_query_id:113897, server_id:1623874, start_ts:441833117339353090>,task_id:3>,executor=TableFullScan_16>"] [thread_id=813]

4. What is your TiFlash version? (Required)

master

@lilinghai lilinghai added the type/bug The issue is confirmed as a bug. label May 31, 2023
@JaySon-Huang JaySon-Huang self-assigned this May 31, 2023
@JaySon-Huang JaySon-Huang added the affects-7.1 This bug affects the 7.1.x(LTS) versions. label May 31, 2023
@JaySon-Huang JaySon-Huang changed the title TiFlash with s3 request result is not correct TiFlash with s3 request result is not correct under disagg arch May 31, 2023
@JaySon-Huang

  1. One compute node dispatches its requests (R0 and R1) to 2 write nodes (w0, w1)
  2. One write node (w0) returns a Region error quickly, while the request is still running on the other write node (w1)
    (diagram: i7576_0)
  3. The compute node refreshes its RegionCache, rebuilds the request, and sends new requests (R0' and R1') with the same DisaggTaskID to both write nodes
  4. On w1 there are two Establish disagg task requests with the same DisaggTaskID, so there is a chance that the DisaggTaskSnapshot of R1 overwrites the snapshot of R1' (see the sketch after this list)
  5. Meanwhile, the compute node discards the Response of R1 and tries to process the read using the Response of R1'
    (diagram: i7576_1)
  6. The segment snapshot held by R1' is not the same as that of R1, so the FetchDisaggPages request to w1 fails
  7. When the FetchDisaggPages request returns an error, the compute node continues and returns an incorrect result to TiDB instead of the error message
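TiFlash's write node implements the disaggregated task snapshot handling in C++ (around FlashService.cpp in the stacks above); the Go sketch below is only an illustration of the race in step 4 and of one possible register-or-reject approach. Every type and function name here is hypothetical, and this is not the actual TiFlash code nor necessarily the fix in #7530.

```go
// snapshot_registry.go: an illustrative sketch (not TiFlash's actual code) of
// the race in step 4. All names are hypothetical.
package main

import (
	"fmt"
	"sync"
)

// Snapshot stands in for the DisaggTaskSnapshot: the segment IDs the write
// node pinned when it served an Establish disagg task request.
type Snapshot struct {
	SegmentIDs []uint64
}

// Registry maps DisaggTaskID -> Snapshot on a write node.
type Registry struct {
	mu    sync.Mutex
	tasks map[string]*Snapshot
}

func NewRegistry() *Registry {
	return &Registry{tasks: make(map[string]*Snapshot)}
}

// RegisterOverwrite models the buggy behavior: a stale Establish request for
// the same DisaggTaskID silently replaces the snapshot registered by the
// retried request (R1').
func (r *Registry) RegisterOverwrite(taskID string, snap *Snapshot) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.tasks[taskID] = snap // R1 can clobber R1' here
}

// RegisterIfAbsent models one possible fix: reject a duplicated DisaggTaskID
// so the stale request fails instead of corrupting the newer snapshot.
func (r *Registry) RegisterIfAbsent(taskID string, snap *Snapshot) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.tasks[taskID]; ok {
		return fmt.Errorf("duplicated DisaggTaskID %s", taskID)
	}
	r.tasks[taskID] = snap
	return nil
}

// Lookup is what the page-fetch path would use: if the registered snapshot
// does not contain the requested segment, it fails with an error like
// "Segment task not found by segment_id" in the logs above.
func (r *Registry) Lookup(taskID string, segmentID uint64) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	snap, ok := r.tasks[taskID]
	if !ok {
		return fmt.Errorf("task not found: %s", taskID)
	}
	for _, id := range snap.SegmentIDs {
		if id == segmentID {
			return nil
		}
	}
	return fmt.Errorf("Segment task not found by segment_id, segment_id=%d", segmentID)
}

func main() {
	reg := NewRegistry()
	// R1' (the retry) registers the segments it actually pinned...
	reg.RegisterOverwrite("task-3", &Snapshot{SegmentIDs: []uint64{35597, 35615}})
	// ...then the stale R1 arrives and overwrites it with a different set.
	reg.RegisterOverwrite("task-3", &Snapshot{SegmentIDs: []uint64{100, 200}})
	// The later page fetch for R1' now misses its segment.
	fmt.Println(reg.Lookup("task-3", 35597))
}
```

The point of the register-or-reject variant is that a stale Establish request should fail loudly rather than silently replace the snapshot that the retried request will later fetch pages from.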

JaySon-Huang commented Jun 6, 2023

Reopening because a similar problem happened again, but the root cause is not the same.

• Failure [6168.340 seconds]
TiFlash
/home/jenkins/agent/workspace/endless-master-build/testcase/tiflash/consistency/consistency_test.go:31
  when dml with check consistency
  /home/jenkins/agent/workspace/endless-master-build/testcase/tiflash/consistency/consistency_test.go:271
    should always consistency #004# [It]
    /home/jenkins/agent/workspace/endless-master-build/testcase/tiflash/consistency/consistency_test.go:337

    Unexpected error:
        <*fmt.wrapError | 0xc0014d0020>: {
            msg: "result inconsistent tso 441946394349535247, tikv: [9000], tiflash [9143]",
            err: <*errors.errorString | 0xc000d23430>{
                s: "result inconsistent",
            },
        }
        result inconsistent tso 441946394349535247, tikv: [9000], tiflash [9143]
    occurred

    /home/jenkins/agent/workspace/endless-master-build/testcase/tiflash/consistency/consistency_test.go:594

This time it is TiFlash that returns more rows than TiKV. The root cause is still under investigation.

@JaySon-Huang

Closing because it cannot be reproduced in the internal testing workload for a long time.
