
Problem: workflows can fail with activity heartbeat timeouts #960

Open
jraddaoui opened this issue Jun 5, 2024 · 1 comment
Comments

@jraddaoui (Collaborator)

Is your feature request related to a problem? Please describe.

In some high-load scenarios, and in environments with limited resources, we have seen workflows end unexpectedly with activity heartbeat timeouts:

```
2024-05-08T20:33:12.510Z	V(2)	preprocessing-worker.temporal	log/with_logger.go:84	error	{"Namespace": "default", "TaskQueue": "preprocessing", "WorkerID": "1@preprocessing-worker-0@", "WorkflowType": "preprocessing", "WorkflowID": "preprocessing-538f2b75-3175-4730-a5ee-fdfc6a8410d3", "RunID": "b5ee427d-0f6b-4f27-a0e4-f86870854fe8", "Attempt": 1, "err": "error downloading package: activity error (type: DownloadPackageActivity, scheduledEventID: 15, startedEventID: 16, identity: ): activity Heartbeat timeout (type: Heartbeat)", "error": "Workflow completed with errors!"}
```

Describe the solution you'd like

Thanks to the detective work done by @DanielCosme, we found that increasing the activities' HeartbeatTimeout and the worker's DefaultHeartbeatThrottleInterval and MaxHeartbeatThrottleInterval may reduce the likelihood of such timeouts. These values should be configurable so they can be tuned to the expected system load and available resources.
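The three knobs mentioned above map directly onto the Temporal Go SDK: `HeartbeatTimeout` is set per activity via `workflow.ActivityOptions`, while the two throttle intervals live in `worker.Options`. A minimal sketch, assuming the Go SDK (the log above comes from a Go worker); the task queue name and duration values here are illustrative, not Enduro's actual configuration:

```go
package main

import (
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// withActivityOptions returns a workflow context whose activities
// tolerate a longer gap between heartbeats before the server fails
// them with "Heartbeat timeout (type: Heartbeat)".
func withActivityOptions(ctx workflow.Context) workflow.Context {
	return workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Hour,
		// Illustrative value; should come from configuration.
		HeartbeatTimeout: 2 * time.Minute,
	})
}

// newWorker creates a worker whose SDK-side heartbeat throttling is
// widened, reducing heartbeat traffic under high load.
func newWorker(c client.Client) worker.Worker {
	return worker.New(c, "preprocessing", worker.Options{
		// Illustrative values; should come from configuration.
		DefaultHeartbeatThrottleInterval: 30 * time.Second,
		MaxHeartbeatThrottleInterval:     60 * time.Second,
	})
}
```

The SDK already throttles outgoing heartbeats to a fraction of the heartbeat timeout; widening these intervals trades heartbeat freshness for less request pressure on the server.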

Describe alternatives you've considered

Over-provision everywhere!

Additional context

Check @DanielCosme's PR implementing this solution in artefactual-labs/enduro: artefactual-labs/enduro#612

@jraddaoui jraddaoui added this to Enduro Jun 5, 2024
@jraddaoui jraddaoui moved this to 🛠 Refining in Enduro Jun 5, 2024
@DanielCosme (Contributor)

I went deep into the rabbit hole. Being able to configure the timeout values is definitely useful, and a must-have given the variety of environments this system can run in. However, the root cause of the timeout failures at a high SIP count in a queue was different: I was able to eliminate the timeouts for up to 30k queued SIPs (I did not test beyond that) by configuring how many workflows the worker is willing to work on at a time. Check this PR: artefactual-labs/enduro#616 @jraddaoui
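In the Go SDK, the worker-side concurrency cap described above corresponds to fields in `worker.Options`. A sketch under that assumption; the specific limits below are illustrative, and which knobs the linked PR actually tunes should be checked against artefactual-labs/enduro#616:

```go
package main

import (
	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

// newThrottledWorker caps concurrent execution so that a large queue
// of SIPs cannot saturate the worker and starve activity heartbeats.
func newThrottledWorker(c client.Client) worker.Worker {
	return worker.New(c, "preprocessing", worker.Options{
		// Illustrative values; should come from configuration.
		MaxConcurrentWorkflowTaskExecutionSize: 50,
		MaxConcurrentActivityExecutionSize:     10,
	})
}
```

With an unbounded worker, tens of thousands of queued SIPs can all make progress at once, so each activity gets too little CPU to heartbeat within its timeout; capping concurrency keeps the in-flight set small enough that heartbeats stay timely.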
