
RFC - Cache Reservation API #1461

Merged (3 commits) — Oct 11, 2021

Conversation

hamersaw (Contributor)


@hamersaw hamersaw changed the title [RFC] Cache Reservation API RFC - Cache Reservation API Sep 14, 2021
kumare3 (Contributor) commented Sep 20, 2021

@hamersaw should we merge this?

@hamersaw hamersaw force-pushed the rfc/create-reservation-api branch from 2e31f0d to c066da3 on September 23, 2021 23:58
@hamersaw hamersaw requested review from EngHabu and kumare3 September 24, 2021 00:13
hamersaw (Contributor Author)

> @hamersaw should we merge this?

@kumare3 the current implementation differs slightly from this RFC. I don't think it's necessary to update the document, as the RFC process is meant to gather input rather than serve as documentation. If I'm wrong, I'm happy to update it; otherwise it's ready to merge.

User-side functionality will require the introduction of a cacheReservation Flyte task annotation. An elementary example of how this may look is provided below:

```python
@task(cacheReservation=true)
```
Contributor

@wild-endeavor @kumare3 Can you give your opinion about this flag?

Suggested change:

```diff
-@task(cacheReservation=true)
+@task(cache_serialized=true)
```

maybe?

hamersaw (Contributor Author) commented Sep 24, 2021

This was actually one of the slight differences: the flytekit integration currently uses cache_reservable to invoke the API. The flag is used in tandem with the cache flag; on its own it does nothing. So tasks with this functionality enabled will be annotated as follows:

```python
@task(cache=True, cache_reservable=True, cache_version="1.0")
```

This seemed fairly idiomatic, since the reservation functionality introduces a layer on top of the existing cache scheme, but I'm very open to feedback. I can certainly make changes in the flytekit PR (once it's ready).
cc @eapolinario
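
For concreteness, a minimal runnable sketch of the annotation above, assuming the cache_reservable spelling under discussion (not necessarily the final flytekit API):

```python
from flytekit import task

# Sketch only: cache_reservable is the parameter name under discussion
# in this thread, not necessarily the final flytekit API.
@task(cache=True, cache_reservable=True, cache_version="1.0")
def expensive_step(x: int) -> int:
    # With a reservation, concurrent executions with identical inputs
    # wait for the first to populate the cache instead of recomputing.
    return x * 2
```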

Contributor

Yeah, I don't like cacheReservation; cache_serialized=True is better.

Is there a timeout?

hamersaw (Contributor Author)

Sounds good, I'll make the changes. Do we want to require the cache flag to be set as well, or is setting cache_serialized enough to enable caching?

Is the timeout you mention for executions waiting on the cache to be populated? Is introducing another parameter the best solution here? The existing timeout parameter covers the single executing task, but not the others that are waiting.

hamersaw (Contributor Author) commented Sep 24, 2021

Edit: Maybe this is a terrible idea. The catalog key currently includes the cache_version parameter, whose absence might promote undesired behavior where different versions of a task are executed in serial.

Is there any advantage to generalizing this further? We could change the parameter to just "serialized", meaning that only a single instance of the task (as defined by its inputs) can run at a time. The backend implementations would be very similar to what they are now, mostly with renaming to better clarify what is happening.

```python
@task(serialized=True)
```

And the obvious use case is for cacheable tasks:

```python
@task(cache=True, serialized=True)
```

And we can add a timeout which fails tasks that are waiting for others to complete:

```python
@task(serialized=True, serialize_timeout="30m")
```

wild-endeavor (Contributor)

Random comments @kumare3:

  • I'd prefer the word "reserve" to "serialize", but it's not a big deal; don't want confusion with "serializable".
  • This isn't a guarantee of at-most-once semantics, right? We should call that out explicitly in user-facing docs/docstrings, etc.
  • This seems like a general ask; should we enable this feature for non-cached tasks as well?
  • There can be temporary deadlock, right? Propeller starts running one execution, goes down without releasing the reservation, and the new one can't make another. What order of magnitude do we think the expiration should be?

hamersaw (Contributor Author)
I'm able to comment on a few of these.

> * I'd prefer the word "reserve" to "serialize", but it's not a big deal; don't want confusion with "serializable".

That was my initial intuition as well; in our domain, "serialize" is more commonly associated with data serialization than with executing a sequence of tasks in serial.

> * This isn't a guarantee of at-most-once semantics, right? We should call that out explicitly in user-facing docs/docstrings, etc.

Correct, this is not a guarantee of at-most-once semantics.

> * This seems like a general ask; should we enable this feature for non-cached tasks as well?

We should be careful with this. As I noted above, I believe we would need to add a "version" parameter to each task, similar to the existing "cache_version" parameter; otherwise different versions of a task (differing in more than input/output values) may be executed in serial. We could deprecate the existing cache_version in favor of a plain version, but I'm not sure that's feasible.

> * There can be temporary deadlock, right? Propeller starts running one execution, goes down without releasing the reservation, and the new one can't make another. What order of magnitude do we think the expiration should be?

Currently, reservation timeouts are handled using a grace period multiplier (configured in datacatalog) applied to the reservation-extension heartbeat interval. This heartbeat interval is the same as the workflow re-evaluation loop. For my local testing, the re-evaluation interval is set to 5s and the multiplier to 3, so a 15-second deadlock. Production re-evaluation intervals are commonly on the order of 30s; with the same multiplier of 3, that gives 1m30s.
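
To make the arithmetic above concrete, a small illustrative calculation (the values are the examples from this thread, not Flyte defaults):

```python
# Worst-case deadlock window = heartbeat interval * grace period multiplier.
# The heartbeat interval matches the workflow re-evaluation loop.
def deadlock_window_s(reeval_interval_s: int, grace_multiplier: int) -> int:
    return reeval_interval_s * grace_multiplier

print(deadlock_window_s(5, 3))   # local testing: 15 seconds
print(deadlock_window_s(30, 3))  # typical production: 90 seconds (1m30s)
```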

EngHabu (Contributor) commented Oct 8, 2021

@hamersaw have we reached a resolution here?

hamersaw (Contributor Author) commented Oct 8, 2021

> @hamersaw have we reached a resolution here?

I believe so; I updated the RFC to reflect the current state. @wild-endeavor says that as long as you and @kumare3 are in consensus, he's happy (https://unionai.slack.com/archives/C02CMUNT4PQ/p1633111921106700). The only two issues were:
(1) naming: we are going with "cacheSerialize".
(2) serializing non-cache tasks: we are only enabling this for cache tasks, because otherwise all tasks would need a version.

As long as that sounds good, it's ready to move forward.
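
As a summary of the resolution, a hedged sketch of the agreed-upon user-facing API; the exact flytekit parameter spelling (cache_serialize below) is assumed here rather than confirmed by this thread:

```python
from flytekit import task

# Assumed spelling of the "cacheSerialize" flag agreed above; only
# meaningful in tandem with cache=True and a cache_version.
@task(cache=True, cache_serialize=True, cache_version="1.0")
def expensive_step(x: int) -> int:
    return x * 2
```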

EngHabu (Contributor) commented Oct 8, 2021

Awesome! Let's get it merged then.

@hamersaw hamersaw merged commit d37321f into flyteorg:master Oct 11, 2021
@hamersaw hamersaw deleted the rfc/create-reservation-api branch October 11, 2021 14:23