
tidb panic at graceful shutdown #36793

Closed
glorv opened this issue Aug 2, 2022 · 9 comments
Labels: affects-5.4, affects-6.1, affects-6.3, affects-6.4, affects-6.5, affects-6.6, affects-7.0, affects-7.1, affects-7.5, may-affects-4.0, may-affects-5.0, may-affects-5.1, may-affects-5.2, may-affects-5.3, may-affects-6.0, may-affects-6.2, severity/major, sig/sql-infra, type/bug

Comments

glorv (Contributor) commented Aug 2, 2022

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Use `systemctl stop tidb-4000` to stop the tidb-server while a sysbench test is running.

2. What did you expect to see? (Required)

TiDB should stop gracefully

3. What did you see instead? (Required)

There is a panic with the following stack trace:

fatal error: concurrent map read and map write

goroutine 383 [running]:
runtime.throw({0x3bdd127?, 0xc002831380?})
..../usr/local/go/src/runtime/panic.go:992 +0x71 fp=0xc0168f57d8 sp=0xc0168f57a8 pc=0x13808f1
runtime.mapaccess1_faststr(0xc015484fc0?, 0xc010983f40?, {0xc002831380, 0x7f})
..../usr/local/go/src/runtime/map_faststr.go:22 +0x3a5 fp=0xc0168f5840 sp=0xc0168f57d8 pc=0x135b045
github.com/pingcap/tidb/util/kvcache.(*SimpleLRUCache).Delete(0xc00173a280, {0x4134ae0?, 0xc015484fc0?})
..../data4/gl/tidb/util/kvcache/simple_lru.go:163 +0x70 fp=0xc0168f58a0 sp=0xc0168f5840 pc=0x1991070
github.com/pingcap/tidb/server.(*TiDBStatement).Close(0xc00450bb90)
..../data4/gl/tidb/server/driver_tidb.go:177 +0x185 fp=0xc0168f5918 sp=0xc0168f58a0 pc=0x32bb2a5
github.com/pingcap/tidb/server.(*TiDBStatement).Close-fm()
....<autogenerated>:1 +0x26 fp=0xc0168f5930 sp=0xc0168f5918 pc=0x32f52e6
github.com/pingcap/tidb/parser/terror.Call(0xc0168f5a98?)
..../data4/gl/tidb/parser/terror/terror.go:304 +0x31 fp=0xc0168f5a70 sp=0xc0168f5930 pc=0x173f4d1
github.com/pingcap/tidb/server.(*TiDBContext).Close(0xc002246270)
..../data4/gl/tidb/server/driver_tidb.go:247 +0x90 fp=0xc0168f5b08 sp=0xc0168f5a70 pc=0x32bb850
github.com/pingcap/tidb/server.closeConn(0xc001012b00, 0xc00134cb40?)
..../data4/gl/tidb/server/conn.go:354 +0x1dc fp=0xc0168f5be8 sp=0xc0168f5b08 pc=0x32a191c
github.com/pingcap/tidb/server.(*clientConn).Close(0xc001012b00)
..../data4/gl/tidb/server/conn.go:338 +0x89 fp=0xc0168f5c18 sp=0xc0168f5be8 pc=0x32a1709
github.com/pingcap/tidb/server.(*Server).kickIdleConnection(0xc00064ba40)
..../data4/gl/tidb/server/server.go:804 +0x1e5 fp=0xc0168f5d90 sp=0xc0168f5c18 pc=0x32df8c5
github.com/pingcap/tidb/server.(*Server).GracefulDown(0x0?, {0x4154e60, 0xc00019a000}, 0x32dc020?)
..../data4/gl/tidb/server/server.go:772 +0xf8 fp=0xc0168f5e70 sp=0xc0168f5d90 pc=0x32df4b8
main.cleanup(0xc00064ba40?, {0x41758d8, 0xc0012bb3b0}, 0x0?, 0x0?)
..../data4/gl/tidb/tidb-server/main.go:788 +0x5a fp=0xc0168f5eb0 sp=0xc0168f5e70 pc=0x34a01ba
main.main.func1(0x20?)
..../data4/gl/tidb/tidb-server/main.go:217 +0x6f fp=0xc0168f5f18 sp=0xc0168f5eb0 pc=0x349bd4f
github.com/pingcap/tidb/util/signal.SetupSignalHandler.func2()
..../data4/gl/tidb/util/signal/signal_posix.go:55 +0x202 fp=0xc0168f5fe0 sp=0xc0168f5f18 pc=0x3496e22
runtime.goexit()
..../usr/local/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc0168f5fe8 sp=0xc0168f5fe0 pc=0x13b4bc1
created by github.com/pingcap/tidb/util/signal.SetupSignalHandler
..../data4/gl/tidb/util/signal/signal_posix.go:52 +0x19f

goroutine 1 [chan receive]:
main.main()
..../data4/gl/tidb/tidb-server/main.go:223 +0x41d
...

I think the root cause is that `clientConn.Close` is called twice, at:

tidb/server/server.go, lines 539 to 542 in 34d877a:

```go
defer func() {
    terror.Log(conn.Close())
    logutil.Logger(ctx).Debug("connection closed")
}()
```

and

tidb/server/server.go, lines 792 to 809 in 34d877a:

```go
func (s *Server) kickIdleConnection() {
    var conns []*clientConn
    s.rwlock.RLock()
    for _, cc := range s.clients {
        if cc.ShutdownOrNotify() {
            // Shutdowned conns will be closed by us, and notified conns will exit by themselves.
            conns = append(conns, cc)
        }
    }
    s.rwlock.RUnlock()
    for _, cc := range conns {
        err := cc.Close()
        if err != nil {
            logutil.BgLogger().Error("close connection", zap.Error(err))
        }
    }
}
```
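To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical names, not TiDB code) of the kind of race behind the trace above: one goroutine reads a shared statement-cache map while a concurrent close of the same connection mutates it.

```go
package main

import "time"

func main() {
    // stmtCache stands in for the statement/plan cache: a plain Go map with no
    // lock, a deliberate simplification of kvcache.SimpleLRUCache's internals.
    stmtCache := map[string]string{"stmt-1": "plan"}

    // "Session goroutine": keeps looking up statements while executing queries.
    go func() {
        for {
            _ = stmtCache["stmt-1"] // map read
        }
    }()

    // "Shutdown goroutine": closing the same connection deletes cached
    // statements concurrently with the reads above.
    go func() {
        for {
            delete(stmtCache, "stmt-1") // map write
            stmtCache["stmt-1"] = "plan"
        }
    }()

    // The Go runtime usually aborts with
    // "fatal error: concurrent map read and map write".
    time.Sleep(time.Second)
}
```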

4. What is your TiDB version? (Required)

nightly

glorv (Contributor, Author) commented Aug 2, 2022

I found another race that may also cause this panic:

goroutine 23990 [running]:
runtime.throw({0x3bdd127?, 0x0?})
..../usr/local/go/src/runtime/panic.go:992 +0x71 fp=0xc012496a78 sp=0xc012496a48 pc=0x13808f1
runtime.mapaccess2_fast32(0x1355be7?, 0x8?, 0x32)
..../usr/local/go/src/runtime/map_fast32.go:62 +0x176 fp=0xc012496a98 sp=0xc012496a78 pc=0x13590b6
github.com/pingcap/tidb/planner/core.GetPreparedStmt(0x0?, 0x0?)
..../data4/gl/tidb/planner/core/cache.go:264 +0x69 fp=0xc012496ad0 sp=0xc012496a98 pc=0x2cb59a9
github.com/pingcap/tidb/planner.IsReadOnly({0x41643b0?, 0xc034391680?}, 0xf4240?)
..../data4/gl/tidb/planner/optimize.go:54 +0x54 fp=0xc012496bb0 sp=0xc012496ad0 pc=0x2ecc974
github.com/pingcap/tidb/executor.(*ExecStmt).IsReadOnly(0xc003a93570?, 0xc0343f6120?)
..../data4/gl/tidb/executor/adapter.go:324 +0x45 fp=0xc012496be8 sp=0xc012496bb0 pc=0x300ae25
github.com/pingcap/tidb/session.finishStmt({0x4154e60, 0xc000054078}, 0xc003da66c0, {0x0, 0x0}, {0x4162f60, 0xc031dcd860?})
..../data4/gl/tidb/session/tidb.go:231 +0x6e fp=0xc012496c68 sp=0xc012496be8 pc=0x323310e
github.com/pingcap/tidb/session.(*execStmtResult).Close(0xc0746b07b0)
..../data4/gl/tidb/session/session.go:2186 +0xab fp=0xc012496cb8 sp=0xc012496c68 pc=0x322830b
github.com/pingcap/tidb/server.(*tidbResultSet).Close(0x0?)
..../data4/gl/tidb/server/driver_tidb.go:423 +0x3c fp=0xc012496cd8 sp=0xc012496cb8 pc=0x32bcadc
github.com/pingcap/tidb/server.ResultSet.Close-fm()
....<autogenerated>:1 +0x2b fp=0xc012496cf0 sp=0xc012496cd8 pc=0x32f52ab
github.com/pingcap/tidb/parser/terror.Call(0xc004b62341?)
..../data4/gl/tidb/parser/terror/terror.go:304 +0x31 fp=0xc012496e30 sp=0xc012496cf0 pc=0x173f4d1
github.com/pingcap/tidb/server.(*clientConn).executePreparedStmtAndWriteResult.func2()
..../data4/gl/tidb/server/conn_stmt.go:259 +0x26 fp=0xc012496e48 sp=0xc012496e30 pc=0x32b71e6
github.com/pingcap/tidb/server.(*clientConn).executePreparedStmtAndWriteResult(0xc00348a840, {0x4154ed0, 0xc0746b0210}, {0x416f8f0, 0xc004b50bd0}, {0xc0343915f0?, 0xc06b6e87ec?, 0x10?}, 0x0)
..../data4/gl/tidb/server/conn_stmt.go:262 +0x33f fp=0xc012496f10 sp=0xc012496e48 pc=0x32b6fff
github.com/pingcap/tidb/server.(*clientConn).handleStmtExecute(0xc00348a840, {0x4154e28, 0xc03439a800}, {0xc06b6e87e1, 0x1b, 0x1b})
..../data4/gl/tidb/server/conn_stmt.go:201 +0x677 fp=0xc012497130 sp=0xc012496f10 pc=0x32b65f7
github.com/pingcap/tidb/server.(*clientConn).dispatch(0xc00348a840, {0x4154ed0?, 0xc0034eee10?}, {0xc06b6e87e0, 0x1c, 0x1c})
..../data4/gl/tidb/server/conn.go:1423 +0x12a5 fp=0xc012497520 sp=0xc012497130 pc=0x32ab5e5
github.com/pingcap/tidb/server.(*clientConn).Run(0xc00348a840, {0x4154ed0, 0xc0034eee10})
..../data4/gl/tidb/server/conn.go:1146 +0x24e fp=0xc012497b20 sp=0xc012497520 pc=0x32a7bee
github.com/pingcap/tidb/server.(*Server).onConn(0xc001984000, 0xc00348a840)
..../data4/gl/tidb/server/server.go:563 +0x65d fp=0xc012497fc0 sp=0xc012497b20 pc=0x32dc6fd
github.com/pingcap/tidb/server.(*Server).startNetworkListener.func2()
..../data4/gl/tidb/server/server.go:454 +0x2a fp=0xc012497fe0 sp=0xc012497fc0 pc=0x32db60a
runtime.goexit()

glorv (Contributor, Author) commented Aug 2, 2022

I looked into the graceful-shutdown logic again and found a few defects in the current implementation (the sketch below shows the overall sequence):

1. `ShutdownOrNotify` is called periodically (every 1s) to check each connection and ask it to exit, but it skips setting the status flag when a transaction is running. So if the server is still busy, there is little chance the flag will ever be set.
2. When the graceful period (15s by default) has passed, `KillAllConnections` is called to stop all connections. It calls `conn.Close` immediately, so it may race with tasks that are still running.
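For context, the sequence described above looks roughly like this (a simplified sketch; the type and method names are illustrative, not TiDB's actual API):

```go
package main

import "time"

// Minimal stand-in types; connectionCount, kickIdleConnection and
// killAllConnections are illustrative names, not TiDB's real API.
type server struct {
    clients map[uint64]struct{}
}

func (s *server) connectionCount() int { return len(s.clients) }
func (s *server) kickIdleConnection()  { /* ShutdownOrNotify + Close idle conns */ }
func (s *server) killAllConnections()  { /* force-close everything that is left */ }

// gracefulDown sketches the sequence described above: poll once per second
// during the grace period, then force-kill whatever is still connected.
func (s *server) gracefulDown(gracePeriod time.Duration) {
    deadline := time.Now().Add(gracePeriod) // 15s by default
    for time.Now().Before(deadline) {
        if s.connectionCount() == 0 {
            return // every client exited on its own
        }
        s.kickIdleConnection()
        time.Sleep(time.Second)
    }
    // Grace period exceeded: sessions may still be executing statements, which
    // is exactly where the races described in this issue open up.
    s.killAllConnections()
}

func main() {
    s := &server{clients: map[uint64]struct{}{}}
    s.gracefulDown(15 * time.Second)
}
```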

@xhebox @tiancaiamao I'm not sure what the best way to deal with this is; what do you think?

xhebox (Contributor) commented Aug 2, 2022

> I looked into the graceful-shutdown logic again and found a few defects in the current implementation:
>
> 1. `ShutdownOrNotify` is called periodically (every 1s) to check each connection and ask it to exit, but it skips setting the status flag when a transaction is running. So if the server is still busy, there is little chance the flag will ever be set.
>
> 2. When the graceful period (15s by default) has passed, `KillAllConnections` is called to stop all connections. It calls `conn.Close` immediately, so it may race with tasks that are still running.
>
> @xhebox @tiancaiamao I'm not sure what the best way to deal with this is; what do you think?

For now, I think checking for nil pointers to prevent the panic is the best solution. A better code structure would need a big refactor.

glorv (Contributor, Author) commented Aug 2, 2022

The panic is caused by a concurrent map read and write: multiple goroutines access the plan cache at the same time. I can't find an easy way to prevent it.

xhebox (Contributor) commented Aug 2, 2022

> The panic is caused by a concurrent map read and write: multiple goroutines access the plan cache at the same time. I can't find an easy way to prevent it.

Maybe make `PreparedPlanCache()` concurrency-safe.
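A minimal sketch of that idea, assuming a hypothetical locking wrapper rather than TiDB's actual kvcache API:

```go
package main

import (
    "fmt"
    "sync"
)

// planCache is a stand-in for the interface the prepared-plan cache exposes
// (hypothetical, not TiDB's real kvcache API).
type planCache interface {
    Get(key string) (interface{}, bool)
    Put(key string, value interface{})
    Delete(key string)
}

// lockedPlanCache wraps an existing, non-thread-safe cache with an RWMutex so
// that lookups from session goroutines and deletions from the shutdown path
// no longer race on the underlying map.
type lockedPlanCache struct {
    mu    sync.RWMutex
    inner planCache
}

func (c *lockedPlanCache) Get(key string) (interface{}, bool) {
    // Note: if the inner cache reorders an LRU list on Get, this needs the
    // full Lock instead of RLock.
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.inner.Get(key)
}

func (c *lockedPlanCache) Put(key string, value interface{}) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.inner.Put(key, value)
}

func (c *lockedPlanCache) Delete(key string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.inner.Delete(key)
}

// mapCache is a trivial inner cache used only to make the example runnable.
type mapCache map[string]interface{}

func (m mapCache) Get(key string) (interface{}, bool) { v, ok := m[key]; return v, ok }
func (m mapCache) Put(key string, value interface{})  { m[key] = value }
func (m mapCache) Delete(key string)                  { delete(m, key) }

func main() {
    cache := &lockedPlanCache{inner: mapCache{}}
    cache.Put("stmt-1", "plan-1")
    if v, ok := cache.Get("stmt-1"); ok {
        fmt.Println(v)
    }
    cache.Delete("stmt-1")
}
```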

tiancaiamao (Contributor) commented

The KILL statement shares the same issue.
The only approach that works is to set a killed flag (atomically) and let the session exit by itself; the main thread just waits for that to happen.
All other solutions have defects.
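A rough sketch of that approach, with hypothetical names (this is not the actual implementation that later landed in #32111):

```go
package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

// session stands in for one client connection's session goroutine
// (hypothetical types, Go 1.19+ for atomic.Bool).
type session struct {
    killed atomic.Bool   // set by the main goroutine, checked by the session
    done   chan struct{} // closed by the session once it has fully exited
}

// run checks the killed flag between statements. Only the session goroutine
// ever touches its own statements and caches, so no extra locking is needed.
func (s *session) run() {
    defer close(s.done)
    for !s.killed.Load() {
        time.Sleep(10 * time.Millisecond) // "execute one statement"
    }
    // release statements, plan-cache entries, transactions ... from here only
}

func main() {
    sessions := []*session{
        {done: make(chan struct{})},
        {done: make(chan struct{})},
    }
    for _, s := range sessions {
        go s.run()
    }

    // Graceful shutdown / KILL: set the flag and wait. The main goroutine
    // never calls into the session's state directly, so there is no
    // concurrent Close.
    for _, s := range sessions {
        s.killed.Store(true)
    }
    for _, s := range sessions {
        <-s.done
    }
    fmt.Println("all sessions exited by themselves")
}
```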

bb7133 (Member) commented Nov 22, 2023

IMO, the case described in this issue (and in #48224) has been fixed by #32111.

> I think the root cause is that `clientConn.Close` is called twice, at tidb/server/server.go lines 539 to 542 in 34d877a (the deferred `conn.Close()` quoted above) and lines 792 to 809 (`kickIdleConnection`, also quoted above).

Yes, `Close` is called twice, by both the 'main goroutine' and the 'session goroutine'. I agree with @tiancaiamao's comment that we should set flags and let the 'session goroutine' act on them while the main goroutine just waits, and this is what #32111 implements.

However,

> When the graceful period (15s by default) has passed, `KillAllConnections` is called to stop all connections. It calls `conn.Close` immediately, so it may race with tasks that are still running.

This is not covered by #32111, but I would say that it is generally fine since:

1. It is a rare case.
2. Once the 'graceful period' has passed, the shutdown is no longer really 'graceful', so a panic there is somewhat acceptable.

But as a further enhancement, we could still try to make `closeConn` thread-safe.
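For illustration, here is a minimal sketch of an idempotent, thread-safe close using `sync.Once` (hypothetical types, not the real `closeConn`):

```go
package main

import (
    "fmt"
    "sync"
)

// clientConn stands in for the server-side connection object. closeOnce makes
// Close safe to call from both the session goroutine and the shutdown path.
type clientConn struct {
    closeOnce sync.Once
    closeErr  error
}

func (cc *clientConn) Close() error {
    cc.closeOnce.Do(func() {
        // Release statements, plan-cache entries, network resources, etc.
        // This body runs exactly once no matter how many goroutines call Close.
        cc.closeErr = nil
    })
    return cc.closeErr
}

func main() {
    cc := &clientConn{}
    var wg sync.WaitGroup
    for i := 0; i < 2; i++ { // main goroutine and session goroutine both close
        wg.Add(1)
        go func() {
            defer wg.Done()
            _ = cc.Close()
        }()
    }
    wg.Wait()
    fmt.Println("closed exactly once")
}
```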

@july2993 @tiancaiamao @glorv @xhebox WDYT?

Tema (Contributor) commented Nov 23, 2023

@bb7133 Not all load balancers are SQL-aware, so it may not be possible to drain connections gracefully from the LB side. We have to drive it from the server (TiDB) side: close connections when they become idle and stop accepting new ones, so that over time we drain as many connections as gracefully as possible.

tiancaiamao (Contributor) commented

Closed by #49073.
