-
Notifications
You must be signed in to change notification settings - Fork 225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve region reload strategy #1122
Conversation
Signed-off-by: zyguan <[email protected]>
Signed-off-by: zyguan <[email protected]>
Signed-off-by: zyguan <[email protected]>
Signed-off-by: zyguan <[email protected]>
Signed-off-by: zyguan <[email protected]>
Signed-off-by: zyguan <[email protected]>
Signed-off-by: zyguan <[email protected]>
@crazycs520 PTAL |
Signed-off-by: zyguan <[email protected]>
Signed-off-by: zyguan <[email protected]>
if reloadOnAccess { | ||
lr, err = c.loadRegion(bo, key, isEndKey) | ||
} else { | ||
lr, err = c.loadRegionByID(bo, r.GetID()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference here using by id or by key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the replacement of previous async reload introduced by #843, which uses loadRegionByID
. I just try to keep the original implementation but do not very sure about why we use loadRegionByID
actually. Maybe it's more efficient than loadRegionByKey
.
internal/locate/region_cache.go
Outdated
updated int32 = iota // region is updated and no need to reload. | ||
needSync // need sync new region info. | ||
needReloadOnAccess int32 = 1 << iota // indicates the region will be reloaded on next access | ||
needExpireAfterTTL // indicates the region will expire after RegionCacheTTL (even when it's accessed continuously) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The strategy is to expire these regions with certain conditions like down peers, we may need to consider how to make the region reloads smoother as the current TTL is a constant value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's relative fine for regions that have stale or unreachable peers, since the cache GC only scan 50 regions per seconds. For regions that have down peers, maybe we can add a random value to lastAccess
when constructing them by newRegion
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could do some improvements about smoothing region reloads in another PR, considering both expire reloading and region cache miss reloading. I rembember we've encountered lots of region reload caused by sudden TTL expire or something like it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. One idea I'm thinking about is that we may change lastAccess
to TTL
and push it to now + RegionCacheTTL + RandomJitter
on access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rest LGTM
internal/locate/region_cache.go
Outdated
@@ -1123,9 +1093,18 @@ func (c *RegionCache) findRegionByKey(bo *retry.Backoffer, key []byte, isEndKey | |||
c.insertRegionToCache(r, true, true) | |||
c.mu.Unlock() | |||
} | |||
} else if r.checkNeedReloadAndMarkUpdated() { | |||
} else if flags := r.resetSyncFlags(needReloadOnAccess | needAsyncReloadReady); flags > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the async reload becomes sync here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a potential regression. In the current implementation, reloading is divided into three levels:
- invalidate: the region cannot be used any more, reloading is required.
- reload on access: we need to reload the region on access, but we can tolerate reloading failure. (it's equivalent to the previous syncflag)
- async reload: the region needs to be reload later, but not too emergent.
So, it's not a real "async" reload and the name is somehow misleading. I've considered keeping the async reload goroutine or doing reload in cache GC, but gave up for simplicity.
Currently, we mark a region as NeedAsyncReload
only when we find it has a store which is stale but reachable or its leader store is stale. Please let me know if the real async reload matters here, I'll change it. Or maybe we just rename the flag (eg. DelayedReload)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename or comment the flag, both OK to me.
The current comment indicates it will be reloaded in async mode, which is inaccurate.
needAsyncReloadPending // indicates the region will be reloaded in async mode
needAsyncReloadReady // indicates the region is ready to be reloaded in async mode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed, PTAL.
internal/locate/region_cache.go
Outdated
continue | ||
} | ||
if syncFlags&needAsyncReloadPending > 0 { | ||
region.setSyncFlags(needAsyncReloadReady) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I understand the flag change here is to avoid too frequent meaningless reloading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it tries to limit the rate of reloading here.
Co-authored-by: ekexium <[email protected]> Signed-off-by: zyguan <[email protected]>
@zyguan |
I'll submit another PR soon, which tries to make reloading caused by TTL smooth (#1122 (comment)) and adds some metrics to observe the reason of reloading. |
This PR add a new reload strategy: if a region is marked as
needExpireAfterTTL
, we don't updatelastAccess
any more, thus the region will expire after RegionCacheTTL. Currently, we set this flag:This ensure that those region will be reloaded every RegionCacheTTL. So the issues like #879 and pingcap/tidb#35418 should be resolved.
This PR also optimize region fields and replace async-reload with delayed-reload.
Also ref: #1104
Regression test for #879
set region-cache-ttl and max-store-down-time to 5m, see the following timeline:
Regression test for #1029
set region-cache-ttl and max-store-down-time to 5m, see the following timeline:
show table regions
to reload regions manuallyshow table regions
Regression test for #843