Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timer timing issue #201

Open
drewhoskins-stripe opened this issue Oct 26, 2022 · 1 comment
Open

Timer timing issue #201

drewhoskins-stripe opened this issue Oct 26, 2022 · 1 comment

Comments

@drewhoskins-stripe
Copy link
Contributor

temporal-ruby Timer timing issue

Scenario:

  • Timer started (30 minutes)
  • Sleep started by workflow code (15 minute)
  • Worker goes down
  • Sleep ends
  • Timer fires.
  • Worker comes back up
  • Worker receives a history containing the sleep finishing followed by timer firing.

The workflow thus receives two TIMER_FIRED events in its history at once. Now what should happen?

What’s happening in temporal-ruby today

It processes the timers in order. Sleep first, then the other timer. The workflow completes first, and it’s not allowed to do anything after the workflow has completed. Thus, if the timer issues a command, the task will fail with Temporal::WorkflowAlreadyCompletingError.

Attempted solution

My first attempt was to early-out in state_manager.rb

        history_window.events.each do |event|
          apply_event(event)
          break if if workflow_finished?
        end

This fixed my problem and is logical; the workflow will execute the same as it would have if the worker had never gone down. But, it will silently drop events.
I asked Maxim what other SDKs do…

Other SDKs’ behaviors

According to Maxim:

Temporal processes new events in batches. One batch per workflow task. It applies all the events before running workflow code. So in case of Java or Go, for example the Futures that correspond to timer will be set to ready before running any workflow threads. This way workflow can check if any of these Futures are ready to decide what to do next.

Signals have similar properties, but they use callbacks in Java and some other SDKs. Temporal SDK schedules callback threads before running workflow threads. This way workflow cannot complete before the callback threads were called.

Then in case of Java SDK it is going to make the main workflow thread eligible to run. But it still will run only after all other threads executed and all other Futures resolved.
Without this functionality you guaranteed to lose signals if they are delivered as callbacks.

Rust core does the same

Java implementation - we can a see that callbacks (timer in this case) have a higher priority than the main thread (which is awoken by the sleep in this case).

What now?

Changing temporal-ruby's behavior is obviously somewhat tricky and would cause version changes if not done in a version-safe way. We might take this fix on but want to work with the community.

  • Should we fix this?
  • Implementation suggestions for the fix itself?
  • Version safety. I'm guessing we should add a configuration that changes this behavior that folks can opt into.
@antstorm
Copy link
Contributor

Thanks for bringing this up!

I think there might be a relatively painless fix to this. All the concurrent execution threads in Ruby are represented by fibers, so if we can distinguish the callback fibers from the main one in which the workflow was executed and skip resuming it until the very end of the history window. We already do something very similar with event handlers ordering.

A simpler fix that you can do straight away — is to check yourself whether a workflow has finished or not from your time callback using workflow.completed?. Not pretty, but at least will save you from failing workflow tasks.

In term of version safety — I don't think it should affect anyone. At least I don't think there's a non-failure scenario where anyone would depend on this kind of behaviour. It they did they would already be seeing failures

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants