Specify recovery semantics #116

mmusgrov · 2019-03-17T21:31:48Z

Section 3.2.9. Recovery Requirements of the spec document discusses recovery but it is under specified. This issue is to formalise our stance on recovery. The preference is to leave it unspecified since the spec is already clear that the outcome of closing/cancelling an LRA MUST either have an eventual end state otherwise it MUST be logged.

tomjenkinson · 2019-03-18T10:00:23Z

I think it is clear from 3.2.9 that the responsibility of infrastructure and applications is to be able to satisfy the requirements of the normal LRA life cycle in the presence of failure.

mmusgrov · 2019-03-18T11:13:15Z

Note that there is a test in the TCK called acceptTest that tests that participants which return 202 Accepted are eventually completed/compensated. The test does this by triggering a "recovery scan" on a "recovery coordinator endpoint" and the config for this endpoint need removing and an alternative way of checking for completion/compensation is required (such as waiting for a configurable period).

mmusgrov · 2019-03-18T11:14:47Z

@tomjenkinson you are right, I think this issue can be closed as "already fixed"

rdebusscher · 2019-03-27T11:36:11Z

@mmusgrov @tomjenkinson
Is the LRA_HTTP_RECOVERY_HEADER linked to the recovery requirements described in 3.2.9? The header is only briefly mentioned in 3.2.2 Start and ending LRA but not explaining what the Participant should do with the Recovery Header.

mmusgrov · 2019-03-27T16:26:07Z

The LRA_HTTP_RECOVERY_HEADER was added to the Java client API (and was in the original REST part of the spec when the project was in the sandbox) to allow participants to tell the coordinator to contact them on different URLs. This was to support participant recovery. Since we no longer have the REST endpoints and Java Client API it is not so simple to incorporate the same functionality.

If we loose the LRA_HTTP_RECOVERY_HEADER in milestone 1.0 then we loose participant recovery but I am fine with moving it out to 1.x.

mmusgrov · 2019-03-29T14:01:46Z

BTW even without participant recovery functionality the LRA_HTTP_RECOVERY_HEADER header is still required - it is returned on responses to client requests that result in the enlistment of a participant - it enables the caller and the participant to know that a participant has been enlisted in the LRA. Also note that the header is included when participant callbacks are issued (compensate/complete/etc). I will check that this is clear in the spec.

Actually participant recovery functionality is quite easy to keep if we add an @Recover annotation:

@Recover
Response participantHasMoved(
    @HeaderParam(LRA_HTTP_RECOVERY_HEADER) String recoveryId) {...}

@Compensate
aMethod(...) {}

etc

The implementation sees the presence of the recovery header and replaces the participant (ie uses this new resource to complete/compensate instead of the old one) and puts the new recovery URI in the LRA_HTTP_RECOVERY_HEADER response header so that the caller can move it again if required.

This would also fix the requirement I mentioned above about needing something to trigger a recovery attempt (the TCK test called acceptTest): if the LRA is completing then the implementation SHOULD replay the completion notification on this participant (similarly for compensating).

tomjenkinson · 2019-04-01T09:17:11Z

+1, except we must not (nor should we need to) be changing the recovery header ID

tomjenkinson · 2019-04-01T09:18:25Z

My understanding is that business logic iterates outstanding recoveryIds and calls them during the processes recovery sequence.

mmusgrov · 2019-04-01T16:14:34Z

@rdebusscher @ochaloup @mstefank and others, I'd like to get agreement on this before Rudy goes on PTO (which I think means by the end of the week).

During this weeks hangout the primary objection came from Ondra stating that recovery should not be exposed in the business logic. But this should not be a problem since the method does not need to be exposed to other services because issue #35 can be used to hide it from such users.

Martin's objection was that applications should be using service URLs so should never need to change their endpoints. But unless we have good reason we should not restrict which endpoints a service can be contacted on.

Note that the reason I have called it recovery and raised the proposal on this github issue is that the feature would typically be used if the participant crashes during the final phase of the protocol (ie when the LRA is in the state completing/compensating) and wishes to come up on a different IP address or port. The fact that participants can also use the feature to trigger replay of the end phase of the protocol on the participant in order to discover the outcome is just a bonus.

ochaloup · 2019-04-01T17:31:53Z

My objection is about giving the user a new annotation. It does not make any difference if the @Recover is used as the JAX-RS endpoint or if it's defined as standalone @Recover annotation without the JAX-RS endpoint being defined.

I think the ability to redefine the location for the participant is reasonable. But it would be better not enforcing it with annotation at the participant side. The programmatic API would be a better fit for this.

With this proposal the behaviour of the coordinator regarding LRA_HTTP_RECOVERY_HEADER should be clarified. If the LRA_HTTP_RECOVERY_HEADER will be left in the spec then it would be good to say that coordinator has to return when participant is enlisted. Then it should be clarified what is the REST call format which the coordinator expects from the participant when invoked to LRA_HTTP_RECOVERY_HEADER url. Then only the direct REST call could be used to invoke to functionality at the coordinator.

It could be worthy to consider splitting the capability of the participant location enlistment into two - one for changing location and the other for recovery invocation.

rdebusscher · 2019-04-02T08:24:50Z

@mmusgrov @ochaloup @xstefank and all,

I have created a scenario using the Participant recovery in this gist https://gist.github.com/rdebusscher/ad914ba528dbe638f926e6bf2fdbd3e9

I cannot see how the LRA gets properly cancelled without specific code in the client (which is not desired)

mmusgrov · 2019-04-02T09:50:05Z

In response to Ondra:

To your first comment:

The programmatic API would be a better fit for this.

The two APIs (programmatic and annotations) are intended to be semantically equivalent.

And to the remainder of your comments, we do already do what you are asking for:

Your second comment:

would be good to say that coordinator has to return [the recovery header] when participant is enlisted

is already covered in the specification document

If the method is to run in the context of an LRA and the annotated class
also contains methods annotated with @Compensate and @Complete then the bean
(aka participant or compensator) will be enlisted with the LRA. Enlisting with
an LRA means that the bean will be notified when the current LRA is
subsequently cancelled or closed. In addition the implementation should generate a
URI for this participant that can be used
for recovery purposes and it should make it available to the participant via a JAX-RS
header param with name defined by the Java constant LRA_HTTP_RECOVERY_HEADER as defined

and the LRA javadoc:

/**
 * the name of the HTTP header field that contains a recovery URI corresponding
 * to a participant enlistment in an LRA
 */
String LRA_HTTP_RECOVERY_HEADER = "Long-Running-Action-Recovery";

The Red Hat implementation exposes this header whenever our coordinator invokes participant callbacks (though I have not said that in the spec yet since we still have this current open issue).

Your third comment is invalid:

it should be clarified what is the REST call format which the coordinator expects from the participant when invoked

The LRA specification is focused purely on the annotations and the coordinator is private to our implementation of the spec. All that needs to happen is for the implementation (of the LRA spec) to notice the Recover annotation and check two things 1) that the recovery header is present on the incoming request and 2) that the enclosing resource conforms to our requirements to be a participant (ie has a Compensate method) - it is very similar to how the implementation should enlist a participant (the only difference would be that it should update it private mapping of participants to recovery ids which it can do because the incoming JAX-RS call to the Recover method will carry the recovery header that was provided by the implementation when the original participant was enlisted).

To your last comment:

It could be worthy to consider splitting the capability of the participant location enlistment into two - one for changing location and the other for recovery invocation.

I don't see why that would be necessary: the application is saying "I would like to be asked to compensate at a new location please". Since typically the application uses this recovery behaviour when the LRA is in the state compensating or completing, then after a crash it makes eminent sense for the implementation to replay the termination phase at the new location (it would be surprising if we specified it otherwise).

In response to Rudy:

I am not altogether clear about your scenario so my response may be off. You say

service A is crashed at that moment and restored at a different URL.
How is the cancelStepA() called eventually?

If service A crashes and comes back up on a different endpoint the method annotated with @recover should be called by something. This will cause the coordinator/implementation to replay the compensation invocation on the participant where the @recover annotation is present.

ochaloup · 2019-04-02T10:20:13Z

@mmusgrov :

The two APIs (programmatic and annotations) are intended to be semantically equivalent.

I don't know what is behind this intention but I don't see it as a good idea here. If there is a programmatic API then there is no much reason to provide this functionality with the annotation.

Your second comment is already covered in the specification document

I see. When I was reading it I thought it would be good to be clarified but as you summarized that for me I have no more objections.

The LRA specification is focused purely on the annotations and the coordinator is private to our implementation of the spec.

I do understand. I wrote my comment rather for a consideration and discussion. I understand that this comment nearly touches the interoperability. Here we talk about API not about the message communication protocol.

I don't see why that would be necessary: the application is saying "I would like to be asked to compensate at a new location please".

Ok. It makes sense to me.

mmusgrov · 2019-04-02T15:03:43Z

So are there any objections if I go ahead and raise a PR for this so that people can see:

how it works;
what the TCK test for it looks like;
the replacement for the current TCK feature for triggering recovery which we need for tests where participants return the 202 Accepted HTTP response

Also I don't think we should be specifying optional behaviour unless there are good reasons to do so.

rdebusscher · 2019-04-03T12:22:59Z

@mmusgrov

the method annotated with @recover should be called by something.
The client application? How does it know the ServiceA is now available on other URL? Does it need to keep track of services it calls and check with the Discovery service to see if they are moved?

The only option I see is that the participant informs the coordinator that it is up (and thus each participant needs some kind of unique ID - the Client Id ?) and that recovery is 'automatic' in the sense it doesn't needs an action from any other system.

tomjenkinson · 2019-04-03T12:50:03Z

IIRC the idea is that:

During @lra calls the service is expected to remember all @HeaderParam(LRA_HTTP_RECOVERY_HEADER) it has been passed.
When it knows it has @Complete or @Compensate the related work it is able to remove them
During recovery, it is expected that this service "SELECT * from myHTTPRecoveryHeaderTable" (style behaviour) and involve it's own @recover method for each of these headers. The LRA vendors implementation might have a servlet filter that therefore is able to intercept this call and (if implemented using client/server approach) can send a message to its (possible implementation choice of) coordinator to say that the participant that mapped to @HeaderParam(LRA_HTTP_RECOVERY_HEADER) is now on a new address (that it would have read from it's configuration).

@mmusgrov - is that correct?

mmusgrov · 2019-04-03T13:09:15Z

@tomjenkinson yes that is how I would imagine it would be used. From a spec implementation point of view the behaviour is very similar to how participants are enlisted, ie the implementation is fully capable of re-registering the participant. It is more burdensome for the application but the presence of the annotation on participants is optional. Note that the implementation has to support the new annotation but it is optional whether or not participants use it.

@rdebusscher I think Tom's response addresses your question but I will create a PR to show how I would imagine it works.

rdebusscher · 2019-04-03T15:15:35Z

During @lra calls the service is expected to remember all @HeaderParam(LRA_HTTP_RECOVERY_HEADER) it has been passed.

I don't like the idea that each JAX-RS resource needs to be stateful and even store that info on a persistent store so that it can retrieve that info when it is started up on a different server/Location. This means also that some kind of management is needed to know where to store and retrieve this type of data.

to say that the participant that mapped to @HeaderParam(LRA_HTTP_RECOVERY_HEADER) is now on a new address

This means that the participant needs to be able to contact the Service Registry System to exactly know on which address it is available. Since there is no standard around this, interfaces with all existing and future new Systems needs to be developed/maintained. This is not viable.

rdebusscher · 2019-04-03T15:16:44Z

But since you all agree that this seems the only possible way of solving this problem, please go ahead and include it as an OPTIONAL feature.
Then vendors can decide if they are willing to implement it or not.

mmusgrov · 2019-04-03T15:29:22Z

I don't think we should be specifying optional behaviour unless there are good reasons to do so.

mmusgrov · 2019-04-03T15:36:23Z

If a resource does not want to take advantage of this annotation then that is fine for it to omit it, ie the use of the annotation on a participant is optional.

Since it is simple for the LRA spec implementation to implement the new behaviour and since there is no compelling reason to make it optional for the spec it should be mandatory.

rdebusscher · 2019-04-04T04:14:48Z

I experimented last night more with this recovery scenario and slightly adjusted the Payara POC to allow participant recovery, even in the case where the service is brought up on another host with another URL.

When the participant registers itself at startup of the service with an ID like - the coordinator perfectly know when and where each participant is, also after a crash.

So there is no need for the LRA_HTTP_RECOVERY_HEADER or any other annotation within the API. Within the spec we just need to mention that implementation must guarantee that in case of a crash, the compensate and compete calls will occur, also in the case the service is started somewhere else after a crash.
Optionally we can indicate that a participant will register itself at startup of the service with the coordinator.

xstefank · 2019-04-04T11:33:40Z

Sorry for the late entry to this discussion.

@rdebusscher I like the idea but the problem is that we don't want to require that there is a coordinator. In routing slip based solution, I don't think the registration on startup is possible to implement. It depends if this should be something to consider as all available implementations now use coordination.

On the question of the new annotation I have still not cleared my mind.

mmusgrov · 2019-04-07T22:16:45Z

The @recover feature is useful for couple of reasons:

to ask the implementation what the final outcome was - ie the same participant comes back up on the same endpoint and wants to know what the final outcome of the LRA was
to ask to be recovered on different endpoint

In case 1 the implementation does not need the recovery id since it can work out what the end point is.
In case 2 the recovery header is required since the implementation needs to lookup the original enlistment.

This new suggestion of a "client id" is changing the spec and we already have a solution for the problem in hand.

mmusgrov · 2019-08-01T16:34:51Z

I think this feature could be a useful addition so I am moving it to the next milestone (rather than close it as will not fix).

mmusgrov added this to the 1.0 milestone Mar 17, 2019

mmusgrov added the question Further information is requested label Mar 18, 2019

xstefank mentioned this issue Apr 8, 2019

issue122 context propagation tests #132

Merged

mmusgrov modified the milestones: 1.0, 1.x Aug 1, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specify recovery semantics #116

Specify recovery semantics #116

mmusgrov commented Mar 17, 2019

tomjenkinson commented Mar 18, 2019

mmusgrov commented Mar 18, 2019

mmusgrov commented Mar 18, 2019

rdebusscher commented Mar 27, 2019

mmusgrov commented Mar 27, 2019 •

edited

Loading

mmusgrov commented Mar 29, 2019 •

edited

Loading

tomjenkinson commented Apr 1, 2019 •

edited

Loading

tomjenkinson commented Apr 1, 2019

mmusgrov commented Apr 1, 2019

ochaloup commented Apr 1, 2019

rdebusscher commented Apr 2, 2019

mmusgrov commented Apr 2, 2019

ochaloup commented Apr 2, 2019 •

edited

Loading

mmusgrov commented Apr 2, 2019

rdebusscher commented Apr 3, 2019

tomjenkinson commented Apr 3, 2019

mmusgrov commented Apr 3, 2019 •

edited

Loading

rdebusscher commented Apr 3, 2019

rdebusscher commented Apr 3, 2019

mmusgrov commented Apr 3, 2019

mmusgrov commented Apr 3, 2019 •

edited

Loading

rdebusscher commented Apr 4, 2019

xstefank commented Apr 4, 2019

mmusgrov commented Apr 7, 2019

mmusgrov commented Aug 1, 2019

Specify recovery semantics #116

Specify recovery semantics #116

Comments

mmusgrov commented Mar 17, 2019

tomjenkinson commented Mar 18, 2019

mmusgrov commented Mar 18, 2019

mmusgrov commented Mar 18, 2019

rdebusscher commented Mar 27, 2019

mmusgrov commented Mar 27, 2019 • edited Loading

mmusgrov commented Mar 29, 2019 • edited Loading

tomjenkinson commented Apr 1, 2019 • edited Loading

tomjenkinson commented Apr 1, 2019

mmusgrov commented Apr 1, 2019

ochaloup commented Apr 1, 2019

rdebusscher commented Apr 2, 2019

mmusgrov commented Apr 2, 2019

ochaloup commented Apr 2, 2019 • edited Loading

mmusgrov commented Apr 2, 2019

rdebusscher commented Apr 3, 2019

tomjenkinson commented Apr 3, 2019

mmusgrov commented Apr 3, 2019 • edited Loading

rdebusscher commented Apr 3, 2019

rdebusscher commented Apr 3, 2019

mmusgrov commented Apr 3, 2019

mmusgrov commented Apr 3, 2019 • edited Loading

rdebusscher commented Apr 4, 2019

xstefank commented Apr 4, 2019

mmusgrov commented Apr 7, 2019

mmusgrov commented Aug 1, 2019

mmusgrov commented Mar 27, 2019 •

edited

Loading

mmusgrov commented Mar 29, 2019 •

edited

Loading

tomjenkinson commented Apr 1, 2019 •

edited

Loading

ochaloup commented Apr 2, 2019 •

edited

Loading

mmusgrov commented Apr 3, 2019 •

edited

Loading

mmusgrov commented Apr 3, 2019 •

edited

Loading