Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coroutine leaking after one TiKV got stuck #10965

Closed
5kbpers opened this issue Sep 17, 2021 · 5 comments · Fixed by #10976
Closed

Coroutine leaking after one TiKV got stuck #10965

5kbpers opened this issue Sep 17, 2021 · 5 comments · Fixed by #10976
Assignees
Labels
affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. type/bug The issue is confirmed as a bug.

Comments

@5kbpers
Copy link
Member

5kbpers commented Sep 17, 2021

Bug Report

Affected versions

v5.1, v5.2, master

Description

For Advancing the Resolved TS, we need to verify the leadership of Raft Leaders by calling a check_leader RPC every 1s to all the other TiKVs (Here we did not filter out learners so there will call the RPC to TiFlash).
However, here we use futures::join_all to wait for all responses to get received. Then once one TiKV got stuck and the timeout is too long, some coroutines would be accumulated and probably lead to OOM.

@5kbpers 5kbpers added the type/bug The issue is confirmed as a bug. label Sep 17, 2021
@5kbpers 5kbpers self-assigned this Sep 17, 2021
@BusyJay
Copy link
Member

BusyJay commented Sep 17, 2021

How do you find the problem? Can pending futures also be added to memory trace?

Can yatp support memory trace? /cc @sticnarf

@5kbpers
Copy link
Member Author

5kbpers commented Sep 17, 2021

How do you find the problem? Can pending futures also be added to memory trace?

Can yatp support memory trace? /cc @sticnarf

Through the auto-generated jemalloc profile (start TiKV with MALLOC_CONF=prof:true,prof_active:true,lg_prof_interval:33), compared the adjacent two profiles and found the problem.

jeprof.25133.25.i25.heap.svg.zip
jeprof.25133.24.i24.heap.svg.zip

@BusyJay
Copy link
Member

BusyJay commented Sep 17, 2021

Does memory profile from dashboard work? Why it can't capture the problem before?

@hicqu
Copy link
Contributor

hicqu commented Sep 17, 2021

Does memory profile from dashboard work? Why it can't capture the problem before?

Memory profile from dashboard or wget is not strong enough to help this case. I'm working on improving it now.

@NingLin-P
Copy link
Member

The cdc component also adopt the CheckLeader mechanism and the similar usage of futures::join_all, so this issue may also happen to cluster that deployed TiCDC.

@youjiali1995 youjiali1995 added affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. labels Nov 24, 2021
ti-chi-bot added a commit that referenced this issue Feb 8, 2022
close #10965, ref #10965, ref #10976

Signed-off-by: ti-srebot <[email protected]>
Signed-off-by: Yilin Chen <[email protected]>

Co-authored-by: 5kbpers <[email protected]>
Co-authored-by: Yilin Chen <[email protected]>
Co-authored-by: Ti Chi Robot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-5.1 This bug affects 5.1.x versions. affects-5.2 This bug affects 5.2.x versions. type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants