SKS-1903: Fix deleting a PG might create multiple deletion tasks at the same time #151
Codecov Report
@@            Coverage Diff             @@
##           master     #151      +/-   ##
==========================================
- Coverage   56.77%   56.35%   -0.42%
==========================================
  Files          17       17
  Lines        3160     3208      +48
==========================================
+ Hits         1794     1808      +14
- Misses       1210     1244      +34
  Partials      156      156
@@ -630,8 +630,12 @@ func (r *ElfMachineReconciler) deletePlacementGroup(ctx *context.MachineContext)
 		return false, nil
 	}

-	if err := ctx.VMService.DeleteVMPlacementGroupsByName(ctx, *placementGroup.Name); err != nil {
+	if pgNames, err := ctx.VMService.DeleteVMPlacementGroupsByName(ctx, *placementGroup.Name); err != nil {
Let's add a func DeleteVMPlacementGroupsByNamePrefix() here, to distinguish it from DeleteVMPlacementGroupByName.
Different funcs for deleting a single PG and for deleting multiple PGs? The deletion logic should be reusable. Could DeleteVMPlacementGroupByName call DeleteVMPlacementGroupsByNamePrefix?
A single PG is deleted by Name and multiple PGs by NamePrefix; using one func for both would be confusing.
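One way to reconcile the two positions in this thread is to keep two separately named entry points that share a single matching helper. The sketch below is only an illustration of that idea: `deleteByName`, `deleteByNamePrefix`, and `selectMatching` are hypothetical names with simplified signatures, they operate on an in-memory slice of names, and they are not the project's actual VMService API.

```go
package main

import (
	"fmt"
	"strings"
)

// selectMatching is a hypothetical shared helper: it returns the names
// accepted by match. A real implementation would submit deletions instead.
func selectMatching(names []string, match func(string) bool) []string {
	var selected []string
	for _, n := range names {
		if match(n) {
			selected = append(selected, n)
		}
	}
	return selected
}

// deleteByName selects only the group whose name equals name exactly.
func deleteByName(names []string, name string) []string {
	return selectMatching(names, func(n string) bool { return n == name })
}

// deleteByNamePrefix selects every group whose name starts with prefix.
func deleteByNamePrefix(names []string, prefix string) []string {
	return selectMatching(names, func(n string) bool { return strings.HasPrefix(n, prefix) })
}

func main() {
	names := []string{"pg-a", "pg-a-1", "pg-b"}
	fmt.Println(deleteByName(names, "pg-a"))
	fmt.Println(deleteByNamePrefix(names, "pg-a"))
}
```

The two function names keep the exact-match and prefix-match semantics visible at the call site, while the loop itself lives in one place.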
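Fix duplicate deletion tasks when deleting placement groups
Cause
When placement groups are deleted in batch through the SDK, each placement group gets its own deletion task. CAPE polls synchronously, so once a deletion task times out, CAPE's next reconcile immediately tries to delete again. Tower does not guard against this, so the same placement group could end up being deleted by multiple tasks concurrently.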
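Solution
1. Query Tower for the placement groups that need to be deleted.
2. Filter out the ones that are already being deleted (to avoid creating duplicate deletion tasks).
3. Delete the placement groups that are not already being deleted.
4. The cluster controller waits until all placement groups have been deleted; otherwise it requeues.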
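The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `PlacementGroup` struct, its `Deleting` flag, and `selectForDeletion` are hypothetical stand-ins, and how CAPE actually detects an in-flight deletion is not shown in this description.

```go
package main

import (
	"fmt"
	"strings"
)

// PlacementGroup is a hypothetical, simplified stand-in for the Tower model.
type PlacementGroup struct {
	ID       string
	Name     string
	Deleting bool // assumption: true while a deletion task is still in flight
}

// selectForDeletion returns the IDs of matching groups that still need a
// deletion task (toDelete) and the names of all matching groups that still
// exist (remaining), including those already being deleted.
func selectForDeletion(groups []PlacementGroup, prefix string) (toDelete, remaining []string) {
	for _, pg := range groups {
		if !strings.HasPrefix(pg.Name, prefix) {
			continue // step 1: only groups belonging to this cluster
		}
		remaining = append(remaining, pg.Name)
		if pg.Deleting {
			continue // step 2: skip groups with a deletion already in flight
		}
		toDelete = append(toDelete, pg.ID) // step 3: submit a deletion for these
	}
	return toDelete, remaining
}

func main() {
	groups := []PlacementGroup{
		{ID: "1", Name: "cluster-a-md-0"},
		{ID: "2", Name: "cluster-a-md-1", Deleting: true},
		{ID: "3", Name: "cluster-b-md-0"},
	}
	toDelete, remaining := selectForDeletion(groups, "cluster-a-")
	fmt.Println("submit deletion for:", toDelete)
	// Step 4: requeue while any matching group still exists.
	fmt.Println("requeue:", len(remaining) > 0)
}
```

Returning the remaining names alongside the IDs lets the caller decide whether to requeue without a second query, which mirrors the `pgNames` return value added in the diff.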
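Testing
Test environment: 3-host nested cluster
1. Created a cluster with 1 CP + 10 node groups of 3 workers each, then scaled in to 1 node group of 3 workers. The node groups and placement groups were deleted normally.
2. Deleted the cluster above. All node groups were deleted normally.
3.1 Created a cluster with 1 CP + 10 node groups of 3 workers each.
3.2 Started a script that creates one placement group for the cluster every second.
3.3 Then deleted the cluster.
3.4 Repeatedly switched the MongoDB primary between hosts (by pausing the host where the primary was running).
3.5 Observed that the placement group deletion tasks ran into errors:
3.6 For one selected placement group, its multiple deletion tasks ran one after another in chronological order; no concurrent deletions occurred.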