
RFC - Cache Reservation API #1461

Merged (3 commits) — Oct 11, 2021

Conversation

hamersaw (Contributor)


@hamersaw hamersaw changed the title [RFC] Cache Reservation API RFC - Cache Reservation API Sep 14, 2021
kumare3 (Contributor) commented Sep 20, 2021

@hamersaw should we merge this?

@hamersaw hamersaw force-pushed the rfc/create-reservation-api branch from 2e31f0d to c066da3 on September 23, 2021 23:58
@hamersaw hamersaw requested review from EngHabu and kumare3 September 24, 2021 00:13
hamersaw (Contributor Author)

> @hamersaw should we merge this?

@kumare3 the current implementation differs slightly from this RFC. I don't think it's necessary to update the document, as the RFC process is meant to gather input rather than serve as documentation. If I'm wrong, I'm happy to update it; otherwise it's ready to merge.

User-side functionality will require the introduction of a cacheReservation Flyte task annotation. An elementary example of how this may look is provided below:

```python
@task(cacheReservation=true)
```
Contributor

@wild-endeavor @kumare3 Can you give your opinion about this flag?

Suggested change:

```diff
-@task(cacheReservation=true)
+@task(cache_serialized=true)
```

maybe?

hamersaw (Contributor Author) commented Sep 24, 2021

This was actually one of the slight differences: the flytekit integration currently uses cache_reservable to invoke the API. The flag is used in tandem with the cache flag; on its own it does nothing. So tasks with this functionality enabled will be annotated as follows:

```python
@task(cache=True, cache_reservable=True, cache_version="1.0")
```

This seemed fairly idiomatic, since the reservation functionality introduces a layer on top of the existing cache scheme, but I'm very open to feedback. I can certainly make changes in the flytekit PR (once it's ready).
cc @eapolinario
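
For concreteness, a minimal runnable sketch of the annotation above, assuming the cache_reservable spelling under discussion (not necessarily the final flytekit API):

```python
from flytekit import task

# Sketch only: cache_reservable is the parameter name under discussion
# in this thread, not necessarily the final flytekit API.
@task(cache=True, cache_reservable=True, cache_version="1.0")
def expensive_step(x: int) -> int:
    # With a reservation, concurrent executions with identical inputs
    # wait for the first to populate the cache instead of recomputing.
    return x * 2
```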

Contributor

Yeah, I don't like cacheReservation; cache_serialized=True is better.

Is there a timeout?

hamersaw (Contributor Author)

Sounds good, I'll make the changes. Do we want to require the cache flag to be set as well, or is setting cache_serialized enough to enable caching?

Is the timeout you mention for executions waiting on the cache to be populated? Is introducing another parameter the best solution here? The existing timeout parameter covers the single executing task, but not the others that are waiting.

hamersaw (Contributor Author) commented Sep 24, 2021

Edit: Maybe this is a terrible idea. The catalog key currently includes the cache_version parameter, whose absence might promote undesired behavior where different versions of a task are executed in serial.

Is there any advantage to generalizing this further? We could change the parameter to just "serialized", meaning that only a single instance of the task (as defined by its inputs) can run at a time. The backend implementations would be very similar to what they are now, mostly with renaming to better clarify what is happening.

```python
@task(serialized=True)
```

And the obvious use case is for cacheable tasks:

```python
@task(cache=True, serialized=True)
```

And we can add a timeout which fails tasks that are waiting for others to complete:

```python
@task(serialized=True, serialize_timeout="30m")
```

wild-endeavor (Contributor)

Random comments @kumare3:

  • I'd prefer the word "reserve" to "serialize", but it's not a big deal; don't want confusion with "serializable".
  • This isn't a guarantee of at-most-once semantics, right? We should call that out explicitly in user-facing docs/docstrings, etc.
  • This seems like a general ask; should we enable this feature for non-cached tasks as well?
  • There can be temporary deadlock, right? Propeller starts running one execution, goes down without releasing the reservation, and the new one can't make another. What order of magnitude do we think the expiration should be?

hamersaw (Contributor Author)
I'm able to comment on a few of these.

> * I'd prefer the word "reserve" to "serialize", but it's not a big deal; don't want confusion with "serializable".

That was my initial intuition as well; in our domain, "serialize" is more commonly associated with data serialization than with executing a sequence of tasks in serial.

> * This isn't a guarantee of at-most-once semantics, right? We should call that out explicitly in user-facing docs/docstrings, etc.

Correct, this is not a guarantee of at-most-once semantics.

> * This seems like a general ask; should we enable this feature for non-cached tasks as well?

We should be careful with this. As I noted above, I believe we would need to add a "version" parameter to each task, similar to the existing "cache_version" parameter; otherwise different versions of a task (differing in more than input/output values) may be executed in serial. We could deprecate the existing cache_version in favor of a plain version, but I'm not sure that's feasible.

> * There can be temporary deadlock, right? Propeller starts running one execution, goes down without releasing the reservation, and the new one can't make another. What order of magnitude do we think the expiration should be?

Currently, reservation timeouts are handled using a grace period multiplier (configured in datacatalog) applied to the reservation-extension heartbeat interval. This heartbeat interval is the same as the workflow re-evaluation loop. For my local testing, the re-evaluation interval is set to 5s and the multiplier to 3, so a 15-second deadlock. Production re-evaluation intervals are commonly on the order of 30s; with the same multiplier of 3, that gives 1m30s.
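
To make the arithmetic above concrete, a small illustrative calculation (the values are the examples from this thread, not Flyte defaults):

```python
# Worst-case deadlock window = heartbeat interval * grace period multiplier.
# The heartbeat interval matches the workflow re-evaluation loop.
def deadlock_window_s(reeval_interval_s: int, grace_multiplier: int) -> int:
    return reeval_interval_s * grace_multiplier

print(deadlock_window_s(5, 3))   # local testing: 15 seconds
print(deadlock_window_s(30, 3))  # typical production: 90 seconds (1m30s)
```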

EngHabu (Contributor) commented Oct 8, 2021

@hamersaw have we reached a resolution here?

hamersaw (Contributor Author) commented Oct 8, 2021

> @hamersaw have we reached a resolution here?

I believe so; I updated the RFC to reflect the current state. @wild-endeavor says that as long as you and @kumare3 are in consensus, he's happy (https://unionai.slack.com/archives/C02CMUNT4PQ/p1633111921106700). The only two issues were:
(1) naming: we are going with "cacheSerialize".
(2) serializing non-cache tasks: we are only enabling this for cache tasks, because otherwise all tasks would need a version.

As long as that sounds good, it's ready to move forward.
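
As a summary of the resolution, a hedged sketch of the agreed-upon user-facing API; the exact flytekit parameter spelling (cache_serialize below) is assumed here rather than confirmed by this thread:

```python
from flytekit import task

# Assumed spelling of the "cacheSerialize" flag agreed above; only
# meaningful in tandem with cache=True and a cache_version.
@task(cache=True, cache_serialize=True, cache_version="1.0")
def expensive_step(x: int) -> int:
    return x * 2
```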

EngHabu (Contributor) commented Oct 8, 2021

Awesome! Let's get it merged then.

@hamersaw hamersaw merged commit d37321f into flyteorg:master Oct 11, 2021
@hamersaw hamersaw deleted the rfc/create-reservation-api branch October 11, 2021 14:23