
Problem: workflows can fail with activity heartbeat timeouts #960

Open
jraddaoui opened this issue Jun 5, 2024 · 1 comment
Comments

@jraddaoui (Collaborator)

Is your feature request related to a problem? Please describe.

In some high-load scenarios, and in environments with limited resources, we have seen workflows end unexpectedly with activity heartbeat timeouts:

```
2024-05-08T20:33:12.510Z	V(2)	preprocessing-worker.temporal	log/with_logger.go:84	error	{"Namespace": "default", "TaskQueue": "preprocessing", "WorkerID": "1@preprocessing-worker-0@", "WorkflowType": "preprocessing", "WorkflowID": "preprocessing-538f2b75-3175-4730-a5ee-fdfc6a8410d3", "RunID": "b5ee427d-0f6b-4f27-a0e4-f86870854fe8", "Attempt": 1, "err": "error downloading package: activity error (type: DownloadPackageActivity, scheduledEventID: 15, startedEventID: 16, identity: ): activity Heartbeat timeout (type: Heartbeat)", "error": "Workflow completed with errors!"}
```

Describe the solution you'd like

Thanks to the detective work done by @DanielCosme, we found that increasing the activities' HeartbeatTimeout and the worker's DefaultHeartbeatThrottleInterval and MaxHeartbeatThrottleInterval may reduce the likelihood of such timeouts. These values should be configurable so they can be tuned to the expected system load and available resources.
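The three knobs mentioned above map directly onto the Temporal Go SDK: `HeartbeatTimeout` is set per activity via `workflow.ActivityOptions`, while the two throttle intervals live in `worker.Options`. A minimal sketch, assuming the Go SDK (the log above comes from a Go worker); the task queue name and duration values here are illustrative, not Enduro's actual configuration:

```go
package main

import (
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
	"go.temporal.io/sdk/workflow"
)

// withActivityOptions returns a workflow context whose activities
// tolerate a longer gap between heartbeats before the server fails
// them with "Heartbeat timeout (type: Heartbeat)".
func withActivityOptions(ctx workflow.Context) workflow.Context {
	return workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Hour,
		// Illustrative value; should come from configuration.
		HeartbeatTimeout: 2 * time.Minute,
	})
}

// newWorker creates a worker whose SDK-side heartbeat throttling is
// widened, reducing heartbeat traffic under high load.
func newWorker(c client.Client) worker.Worker {
	return worker.New(c, "preprocessing", worker.Options{
		// Illustrative values; should come from configuration.
		DefaultHeartbeatThrottleInterval: 30 * time.Second,
		MaxHeartbeatThrottleInterval:     60 * time.Second,
	})
}
```

The SDK already throttles outgoing heartbeats to a fraction of the heartbeat timeout; widening these intervals trades heartbeat freshness for less request pressure on the server.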

Describe alternatives you've considered

Over-provision everywhere!

Additional context

Check @DanielCosme's PR implementing this solution in artefactual-labs/enduro: artefactual-labs/enduro#612

@jraddaoui jraddaoui added this to Enduro Jun 5, 2024
@jraddaoui jraddaoui moved this to 🛠 Refining in Enduro Jun 5, 2024
@DanielCosme (Contributor)

I went deep into the rabbit hole. Being able to configure the timeout values is definitely useful, and a must-have given the variety of environments this system can run in. However, the root cause of the timeout failures at a high SIP count in a queue was different: I was able to eliminate the timeouts for up to 30k queued SIPs (I did not test beyond that) by configuring how many workflows the worker is willing to work on at a time. Check this PR: artefactual-labs/enduro#616 @jraddaoui
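In the Go SDK, the worker-side concurrency cap described above corresponds to fields in `worker.Options`. A sketch under that assumption; the specific limits below are illustrative, and which knobs the linked PR actually tunes should be checked against artefactual-labs/enduro#616:

```go
package main

import (
	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

// newThrottledWorker caps concurrent execution so that a large queue
// of SIPs cannot saturate the worker and starve activity heartbeats.
func newThrottledWorker(c client.Client) worker.Worker {
	return worker.New(c, "preprocessing", worker.Options{
		// Illustrative values; should come from configuration.
		MaxConcurrentWorkflowTaskExecutionSize: 50,
		MaxConcurrentActivityExecutionSize:     10,
	})
}
```

With an unbounded worker, tens of thousands of queued SIPs can all make progress at once, so each activity gets too little CPU to heartbeat within its timeout; capping concurrency keeps the in-flight set small enough that heartbeats stay timely.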
