Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix regression in the Ray plugin #4239

Merged
merged 5 commits into from
Oct 16, 2023
Merged

Fix regression in the Ray plugin #4239

merged 5 commits into from
Oct 16, 2023

Conversation

pingsutw
Copy link
Member

Tracking issue

NA

Describe your changes

Fixed a regression in Ray plugin back to this change: 66e7b64

Propeller has a panic because rayJob.Status.StartTime is nil when the status is JobStatusPending

Updates:

  • Convert the ray cluster phase to the flyte task phase if ray cluster is not running
  • Convert the ray job phase to the flyte task phase if ray cluster is running

Error:

Workflow[flytesnacks:development:.flytegen.task.ray_task] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[taskraytask]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [ray]: panic when executing a plugin [ray]. Stack: [goroutine 2818 [running]:
runtime/debug.Stack()
	/opt/homebrew/opt/go/libexec/src/runtime/debug/stack.go:24 +0x64
github.com/flyteorg/flyte/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1.1()
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/nodes/task/handler.go:382 +0xb4
panic({0x103e54b40?, 0x105c06500?})
	/opt/homebrew/opt/go/libexec/src/runtime/panic.go:914 +0x218
github.com/flyteorg/flyte/flyteplugins/go/tasks/plugins/k8s/ray.rayJobResourceHandler.GetTaskPhase({}, {0x140063a83c0?, 0x20?}, {0x1043e5ef0, 0x140063a83c0}, {0x1043fdc88?, 0x14005235180})
	/Users/jeev/Workspace/repos/flyte/flyte/flyteplugins/go/tasks/plugins/k8s/ray/ray.go:405 +0x268
github.com/flyteorg/flyte/flytepropeller/pkg/controller/nodes/task/k8s.(*PluginManager).checkResourcePhase(0x140063b6a68, {0x1043d9df8, 0x140063a6d80}, {0x1043ee780, 0x140063ac0c0}, {0x1043fdc88?, 0x14005235180?}, 0x140063a2588)
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/nodes/task/k8s/plugin_manager.go:282 +0x5f8
github.com/flyteorg/flyte/flytepropeller/pkg/controller/nodes/task/k8s.PluginManager.Handle({{0x102b18ce4, 0x3}, {0x1043da3a8, 0x105d39fa0}, {0x1043b9dc0, 0x14000778b00}, {0x14b9be8a8, 0x1400106b520}, {{0x1043fa058, 0x1400108acd0}, ...}, ...}, ...)
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/nodes/task/k8s/plugin_manager.go:344 +0x3a8
github.com/flyteorg/flyte/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin.func1(0x140063b6f88?, {0x1043d9df8, 0x140063a6a80}, {0x1043de2c0?, 0x14002214160?}, 0x9?)
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/nodes/task/handler.go:389 +0x110
github.com/flyteorg/flyte/flytepropeller/pkg/controller/nodes/task.Handler.invokePlugin({{0x1043dd4c0, 0x14000134cf0}, {0x1043bd970, 0x140001a8a00}, 0x14000d13140, 0x14000d13170, 0x14000d131a0, {0x1043de2c0, 0x14002718580}, 0x14001224d20, ...}, ...)
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/nodes/task/handler.go:391 +0x60
github.com/flyteorg/flyte/flytepropeller/pkg/controller/nodes/task.Handler.Handle({{0x1043dd4c0, 0x14000134cf0}, {0x1043bd970, 0x140001a8a00}, 0x14000d13140, 0x14000d13170, 0x14000d131a0, {0x1043de2c0, 0x14002718580}, 0x14001224d20, ...}, ...)
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/nodes/task/handler.go:572 +0x630
github.com/flyteorg/flyte/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.handleParentNode({{0x1043e76b8, 0x14001290680}, {{0x14001289840, {{...}, 0x0}, {0x140008bb130, 0xb, 0xb}}, {0x14001289860, {{...}, ...}, ...}, ...}, ...}, ...)
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/nodes/dynamic/handler.go:67 +0xa0
github.com/flyteorg/flyte/flytepropeller/pkg/controller/nodes/dynamic.dynamicNodeTaskNodeHandler.Handle({{0x1043e76b8, 0x14001290680}, {{0x14001289840, {{...}, 0x0}, {0x140008bb130, 0xb, 0xb}}, {0x14001289860, {{...}, ...}, ...}, ...}, ...}, ...)
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/nodes/dynamic/handler.go:224 +0x888
github.com/flyteorg/flyte/flytepropeller/pkg/controller/nodes.(*nodeExecutor).execute(0x140012900d0, {0x1043d9df8, 0x14006383d40}, {0x1043dd580, 0x14000790780}, {0x1043f1cf0, 0x140063ac000}, {0x1044002e0?, 0x14006394460?})
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/nodes/executor.go:822 +0xd8
github.com/flyteorg/flyte/flytepropeller/pkg/controller/nodes.(*nodeExecutor).handleQueuedOrRunningNode(0x140012900d0, {0x1043d9df8, 0x14006383d40}, {0x1043f1cf0?, 0x140063ac000?}, {0x1043dd580, 0x14000790780})
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/nodes/executor.go:1116 +0x838
github.com/flyteorg/flyte/flytepropeller/pkg/controller/nodes.(*nodeExecutor).HandleNode(0x140012900d0, {0x1043d9df8, 0x14006383d40}, {0x1043b69e0, 0x14005713900}, {0x1043f1cf0, 0x140063ac000?}, {0x1043dd580?, 0x14000790780})
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/nodes/executor.go:1384 +0x228
github.com/flyteorg/flyte/flytepropeller/pkg/controller/nodes.(*recursiveNodeExecutor).RecursiveNodeHandler(0x14000b37e00, {0x1043d9df8, 0x140063836e0}, {0x1043f6a48, 0x1400624eaf0}, {0x1043b69e0, 0x14005713900}, {0x1043dab18?, 0x14005713900?}, {0x1043f38e0, ...})
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/nodes/executor.go:229 +0x51c
github.com/flyteorg/flyte/flytepropeller/pkg/controller/nodes.(*recursiveNodeExecutor).handleDownstream(0x1043d9df8?, {0x1043d9df8, 0x140063836e0}, {0x1043f6a48, 0x1400624eaf0}, {0x1043b69e0, 0x14005713900?}, {0x1043dab18?, 0x14005713900}, {0x1043f38e0, ...})
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/nodes/executor.go:294 +0x300
github.com/flyteorg/flyte/flytepropeller/pkg/controller/nodes.(*recursiveNodeExecutor).RecursiveNodeHandler(0x14000b37e00, {0x1043d9df8, 0x140063836e0}, {0x1043f6a48, 0x1400624eaf0}, {0x1043b69e0, 0x14005713900}, {0x1043dab18?, 0x14005713900?}, {0x1043f38e0, ...})
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/nodes/executor.go:236 +0x698
github.com/flyteorg/flyte/flytepropeller/pkg/controller/workflow.(*workflowExecutor).handleRunningWorkflow(0x140000ddd50, {0x1043d9df8, 0x140063836e0}, 0x14005713900)
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/workflow/executor.go:149 +0x148
github.com/flyteorg/flyte/flytepropeller/pkg/controller/workflow.(*workflowExecutor).HandleFlyteWorkflow(0x140000ddd50, {0x1043d9df8, 0x140063836e0}, 0x14005713900)
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/workflow/executor.go:395 +0x304
github.com/flyteorg/flyte/flytepropeller/pkg/controller.(*Propeller).TryMutateWorkflow.func2(0x14000a9f580, {0x1043d9df8, 0x140063836e0}, 0x140063bb7c8, 0x103c7f4a0?)
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/handler.go:142 +0x104
github.com/flyteorg/flyte/flytepropeller/pkg/controller.(*Propeller).TryMutateWorkflow(0x14000a9f580, {0x1043d9df8, 0x14006383290}, 0x1400632ea00)
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/handler.go:143 +0x36c
github.com/flyteorg/flyte/flytepropeller/pkg/controller.(*Propeller).Handle(0x14000a9f580, {0x1043d9df8, 0x14006383290}, {0x14005707b00, 0x17}, {0x14005707b18, 0x14})
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/handler.go:259 +0xb10
github.com/flyteorg/flyte/flytepropeller/pkg/controller.(*WorkerPool).processNextWorkItem.func1(0x14000e64a20, 0x140063bbf10, {0x103c7f4a0?, 0x140063907e0})
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/workers.go:88 +0x3b4
github.com/flyteorg/flyte/flytepropeller/pkg/controller.(*WorkerPool).processNextWorkItem(0x14000e64a20, {0x1043d9df8, 0x14006383290})
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/workers.go:99 +0xc0
github.com/flyteorg/flyte/flytepropeller/pkg/controller.(*WorkerPool).runWorker(0x1043d9df8?, {0x1043d9df8, 0x14003feec60})
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/workers.go:115 +0x98
github.com/flyteorg/flyte/flytepropeller/pkg/controller.(*WorkerPool).Run.func1()
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/workers.go:150 +0x50
created by github.com/flyteorg/flyte/flytepropeller/pkg/controller.(*WorkerPool).Run in goroutine 220
	/Users/jeev/Workspace/repos/flyte/flyte/flytepropeller/pkg/controller/workers.go:147 +0x214

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Screenshots

image image image

Note to reviewers

Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: Kevin Su <[email protected]>
@codecov
Copy link

codecov bot commented Oct 16, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (d03a0bc) 58.30% compared to head (7b72f63) 59.06%.
Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4239      +/-   ##
==========================================
+ Coverage   58.30%   59.06%   +0.75%     
==========================================
  Files         552      621      +69     
  Lines       48955    53104    +4149     
==========================================
+ Hits        28542    31364    +2822     
- Misses      18080    19240    +1160     
- Partials     2333     2500     +167     
Flag Coverage Δ
unittests 59.06% <86.36%> (+0.75%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
flyteplugins/go/tasks/plugins/k8s/ray/ray.go 83.37% <86.36%> (-0.24%) ⬇️

... and 172 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jeevb
jeevb previously approved these changes Oct 16, 2023
Signed-off-by: Kevin Su <[email protected]>
@pingsutw
Copy link
Member Author

image

@pingsutw pingsutw merged commit 6cbd7bb into master Oct 16, 2023
@pingsutw pingsutw deleted the fix-ray branch October 16, 2023 21:59
squiishyy pushed a commit to squiishyy/flyte that referenced this pull request Oct 18, 2023
---------

Signed-off-by: Kevin Su <[email protected]>
Signed-off-by: squiishyy <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants