PD processes heartbeat slowly after scaling within a large cluster. #7248

Closed
nolouch opened this issue Oct 24, 2023 · 4 comments · Fixed by #7252
Labels
affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. report/customer Customers have encountered this bug. severity/major type/bug The issue is confirmed as a bug.

Comments

@nolouch
Contributor

nolouch commented Oct 24, 2023

Bug Report

PD processes heartbeat slowly after scaling within a large cluster.

After scaling, 7.1.x estimates the progress of the scaling operation, and this estimation holds the lock for too long.

pd/server/cluster/cluster.go

Lines 1943 to 1951 in 3f1a688

	for _, s := range stores {
		if s.IsRemoving() || s.IsRemoved() {
			continue
		}
		if placement.MatchLabelConstraints(s, rule.LabelConstraints) {
			matchStores = append(matchStores, s)
		}
	}
	regionSize := c.core.GetRegionSizeByRange(startKey, endKey) * int64(rule.Count)

Here, GetRegionSizeByRange iterates over many regions while holding the read lock.
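One common way to keep a long range scan from monopolizing a lock is to scan in bounded batches and release the lock between batches. The following is a minimal sketch of that idea, not PD's actual code: `regionIndex`, `scanBatch`, `sizeByRangeBatched`, `newToyIndex`, and the batch limit are all hypothetical names introduced for illustration, and the toy index uses a sorted slice where PD uses its own region tree.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// region is a toy stand-in for a region: key range [start, end) plus an approximate size.
type region struct {
	start, end string
	size       int64
}

// regionIndex is a hypothetical, simplified region store guarded by an RWMutex.
// regions must be sorted by start key and must not overlap.
type regionIndex struct {
	mu      sync.RWMutex
	regions []region
}

// scanBatch returns at most limit regions whose start key lies in [startKey, endKey).
// An empty endKey means "unbounded". The read lock is held for this one batch only.
func (r *regionIndex) scanBatch(startKey, endKey string, limit int) []region {
	r.mu.RLock()
	defer r.mu.RUnlock()
	i := sort.Search(len(r.regions), func(i int) bool {
		return r.regions[i].start >= startKey
	})
	var out []region
	for ; i < len(r.regions) && len(out) < limit; i++ {
		if endKey != "" && r.regions[i].start >= endKey {
			break
		}
		out = append(out, r.regions[i])
	}
	return out
}

// sizeByRangeBatched sums region sizes over [startKey, endKey) in batches,
// so writers can acquire the lock between batches.
func (r *regionIndex) sizeByRangeBatched(startKey, endKey string, limit int) int64 {
	if limit <= 0 {
		limit = 1 // guard against a non-positive batch size
	}
	var total int64
	for {
		batch := r.scanBatch(startKey, endKey, limit)
		for _, reg := range batch {
			total += reg.size
		}
		if len(batch) < limit {
			return total // range exhausted
		}
		// Resume from the end key of the last region seen.
		startKey = batch[len(batch)-1].end
		if startKey == "" {
			return total // last region is unbounded; nothing follows it
		}
	}
}

// newToyIndex builds n contiguous regions with single-letter keys (n <= 25).
func newToyIndex(n int, size int64) *regionIndex {
	idx := &regionIndex{}
	for i := 0; i < n; i++ {
		idx.regions = append(idx.regions, region{
			start: string(rune('a' + i)),
			end:   string(rune('a' + i + 1)),
			size:  size,
		})
	}
	return idx
}

func main() {
	idx := newToyIndex(10, 10)
	fmt.Println(idx.sizeByRangeBatched("a", "", 3))  // 100
	fmt.Println(idx.sizeByRangeBatched("c", "f", 2)) // 30
}
```

The trade-off in this sketch is that the total is no longer computed from a single consistent snapshot, since regions may change between batches. For a progress estimate that is usually acceptable.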

Stack info in 7.1.1:

14708261976110 151990 @ 0xf0ada9 0x171c036 0x20d30b1 0x20d2bd6 0x20d171c 0x20c72d4 0xefb901
#	0xf0ada8	sync.(*RWMutex).RUnlock+0x28							/usr/local/go/src/sync/rwmutex.go:119
#	0x171c035	github.com/tikv/pd/pkg/core.(*RegionsInfo).GetRegionSizeByRange+0x175		/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/pkg/core/region.go:1538
#	0x20d30b0	github.com/tikv/pd/server/cluster.(*RaftCluster).calculateRange+0x2b0		/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/cluster/cluster.go:1851
#	0x20d2bd5	github.com/tikv/pd/server/cluster.(*RaftCluster).getThreshold+0x3f5		/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/cluster/cluster.go:1825
#	0x20d171b	github.com/tikv/pd/server/cluster.(*RaftCluster).checkStores+0x41b		/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/cluster/cluster.go:1752
#	0x20c72d3	github.com/tikv/pd/server/cluster.(*RaftCluster).runNodeStateCheckJob+0xd3	/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/pd/server/cluster/cluster.go:539
@nolouch nolouch added the type/bug The issue is confirmed as a bug. label Oct 24, 2023
@nolouch nolouch added the affects-7.1 This bug affects the 7.1.x(LTS) versions. label Oct 24, 2023
@rleungx
Member

rleungx commented Oct 25, 2023

It's a known problem, especially when the rule count is small.

@CabinfeverB
Member

> It's a known problem, especially when the rule count is small.

Does "the rule count is small" mean that the key range of each rule is too large?

@rleungx
Member

rleungx commented Oct 25, 2023

> > It's a known problem, especially when the rule count is small.
>
> Does "the rule count is small" mean that the key range of each rule is too large?

Yes.

@ti-chi-bot ti-chi-bot bot closed this as completed in #7252 Nov 8, 2023
ti-chi-bot bot pushed a commit that referenced this issue Nov 8, 2023
close #7248

Signed-off-by: nolouch <[email protected]>

Co-authored-by: nolouch <[email protected]>
Co-authored-by: ShuNing <[email protected]>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Nov 8, 2023
ti-chi-bot bot added a commit that referenced this issue Nov 8, 2023
close #7248

Signed-off-by: nolouch <[email protected]>

Co-authored-by: nolouch <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot pushed a commit that referenced this issue Nov 21, 2023
close #7248

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: nolouch <[email protected]>

Co-authored-by: Ryan Leung <[email protected]>
Co-authored-by: nolouch <[email protected]>
@nolouch nolouch added the affects-6.1 This bug affects the 6.1.x(LTS) versions. label Jan 10, 2024
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Jan 10, 2024
ti-chi-bot bot added a commit that referenced this issue Feb 28, 2024
close #7248

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: Ryan Leung <[email protected]>

Co-authored-by: Ryan Leung <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
@seiya-annie

/report customer

@ti-chi-bot ti-chi-bot bot added the report/customer Customers have encountered this bug. label Aug 27, 2024