Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify recovery semantics #116

Open
mmusgrov opened this issue Mar 17, 2019 · 25 comments
Open

Specify recovery semantics #116

mmusgrov opened this issue Mar 17, 2019 · 25 comments
Labels
question Further information is requested
Milestone

Comments

@mmusgrov
Copy link
Contributor

Section 3.2.9. Recovery Requirements of the spec document discusses recovery but it is under specified. This issue is to formalise our stance on recovery. The preference is to leave it unspecified since the spec is already clear that the outcome of closing/cancelling an LRA MUST either have an eventual end state otherwise it MUST be logged.

@mmusgrov mmusgrov added this to the 1.0 milestone Mar 17, 2019
@mmusgrov mmusgrov added the question Further information is requested label Mar 18, 2019
@tomjenkinson
Copy link
Contributor

I think it is clear from 3.2.9 that the responsibility of infrastructure and applications is to be able to satisfy the requirements of the normal LRA life cycle in the presence of failure.

@mmusgrov
Copy link
Contributor Author

Note that there is a test in the TCK called acceptTest that tests that participants which return 202 Accepted are eventually completed/compensated. The test does this by triggering a "recovery scan" on a "recovery coordinator endpoint" and the config for this endpoint need removing and an alternative way of checking for completion/compensation is required (such as waiting for a configurable period).

@mmusgrov
Copy link
Contributor Author

@tomjenkinson you are right, I think this issue can be closed as "already fixed"

@rdebusscher
Copy link
Member

@mmusgrov @tomjenkinson
Is the LRA_HTTP_RECOVERY_HEADER linked to the recovery requirements described in 3.2.9? The header is only briefly mentioned in 3.2.2 Start and ending LRA but not explaining what the Participant should do with the Recovery Header.

@mmusgrov
Copy link
Contributor Author

mmusgrov commented Mar 27, 2019

The LRA_HTTP_RECOVERY_HEADER was added to the Java client API (and was in the original REST part of the spec when the project was in the sandbox) to allow participants to tell the coordinator to contact them on different URLs. This was to support participant recovery. Since we no longer have the REST endpoints and Java Client API it is not so simple to incorporate the same functionality.

If we loose the LRA_HTTP_RECOVERY_HEADER in milestone 1.0 then we loose participant recovery but I am fine with moving it out to 1.x.

@mmusgrov
Copy link
Contributor Author

mmusgrov commented Mar 29, 2019

BTW even without participant recovery functionality the LRA_HTTP_RECOVERY_HEADER header is still required - it is returned on responses to client requests that result in the enlistment of a participant - it enables the caller and the participant to know that a participant has been enlisted in the LRA. Also note that the header is included when participant callbacks are issued (compensate/complete/etc). I will check that this is clear in the spec.

Actually participant recovery functionality is quite easy to keep if we add an @Recover annotation:

@Recover
Response participantHasMoved(
    @HeaderParam(LRA_HTTP_RECOVERY_HEADER) String recoveryId) {...}

@Compensate
aMethod(...) {}

etc

The implementation sees the presence of the recovery header and replaces the participant (ie uses this new resource to complete/compensate instead of the old one) and puts the new recovery URI in the LRA_HTTP_RECOVERY_HEADER response header so that the caller can move it again if required.

This would also fix the requirement I mentioned above about needing something to trigger a recovery attempt (the TCK test called acceptTest): if the LRA is completing then the implementation SHOULD replay the completion notification on this participant (similarly for compensating).

@tomjenkinson
Copy link
Contributor

tomjenkinson commented Apr 1, 2019

+1, except we must not (nor should we need to) be changing the recovery header ID

@tomjenkinson
Copy link
Contributor

My understanding is that business logic iterates outstanding recoveryIds and calls them during the processes recovery sequence.

@mmusgrov
Copy link
Contributor Author

mmusgrov commented Apr 1, 2019

@rdebusscher @ochaloup @mstefank and others, I'd like to get agreement on this before Rudy goes on PTO (which I think means by the end of the week).

During this weeks hangout the primary objection came from Ondra stating that recovery should not be exposed in the business logic. But this should not be a problem since the method does not need to be exposed to other services because issue #35 can be used to hide it from such users.

Martin's objection was that applications should be using service URLs so should never need to change their endpoints. But unless we have good reason we should not restrict which endpoints a service can be contacted on.

Note that the reason I have called it recovery and raised the proposal on this github issue is that the feature would typically be used if the participant crashes during the final phase of the protocol (ie when the LRA is in the state completing/compensating) and wishes to come up on a different IP address or port. The fact that participants can also use the feature to trigger replay of the end phase of the protocol on the participant in order to discover the outcome is just a bonus.

@ochaloup
Copy link
Contributor

ochaloup commented Apr 1, 2019

My objection is about giving the user a new annotation. It does not make any difference if the @Recover is used as the JAX-RS endpoint or if it's defined as standalone @Recover annotation without the JAX-RS endpoint being defined.

I think the ability to redefine the location for the participant is reasonable. But it would be better not enforcing it with annotation at the participant side. The programmatic API would be a better fit for this.

With this proposal the behaviour of the coordinator regarding LRA_HTTP_RECOVERY_HEADER should be clarified. If the LRA_HTTP_RECOVERY_HEADER will be left in the spec then it would be good to say that coordinator has to return when participant is enlisted. Then it should be clarified what is the REST call format which the coordinator expects from the participant when invoked to LRA_HTTP_RECOVERY_HEADER url. Then only the direct REST call could be used to invoke to functionality at the coordinator.

It could be worthy to consider splitting the capability of the participant location enlistment into two - one for changing location and the other for recovery invocation.

@rdebusscher
Copy link
Member

@mmusgrov @ochaloup @xstefank and all,

I have created a scenario using the Participant recovery in this gist https://gist.github.com/rdebusscher/ad914ba528dbe638f926e6bf2fdbd3e9

I cannot see how the LRA gets properly cancelled without specific code in the client (which is not desired)

@mmusgrov
Copy link
Contributor Author

mmusgrov commented Apr 2, 2019

In response to Ondra:

To your first comment:

The programmatic API would be a better fit for this.

The two APIs (programmatic and annotations) are intended to be semantically equivalent.

And to the remainder of your comments, we do already do what you are asking for:

Your second comment:

would be good to say that coordinator has to return [the recovery header] when participant is enlisted

is already covered in the specification document

If the method is to run in the context of an LRA and the annotated class
also contains methods annotated with @Compensate and @Complete then the bean
(aka participant or compensator) will be enlisted with the LRA. Enlisting with
an LRA means that the bean will be notified when the current LRA is
subsequently cancelled or closed. In addition the implementation should generate a
URI for this participant that can be used
for recovery purposes and it should make it available to the participant via a JAX-RS
header param with name defined by the Java constant LRA_HTTP_RECOVERY_HEADER as defined

and the LRA javadoc:

/**
 * the name of the HTTP header field that contains a recovery URI corresponding
 * to a participant enlistment in an LRA
 */
String LRA_HTTP_RECOVERY_HEADER = "Long-Running-Action-Recovery";

The Red Hat implementation exposes this header whenever our coordinator invokes participant callbacks (though I have not said that in the spec yet since we still have this current open issue).

Your third comment is invalid:

it should be clarified what is the REST call format which the coordinator expects from the participant when invoked

The LRA specification is focused purely on the annotations and the coordinator is private to our implementation of the spec. All that needs to happen is for the implementation (of the LRA spec) to notice the Recover annotation and check two things 1) that the recovery header is present on the incoming request and 2) that the enclosing resource conforms to our requirements to be a participant (ie has a Compensate method) - it is very similar to how the implementation should enlist a participant (the only difference would be that it should update it private mapping of participants to recovery ids which it can do because the incoming JAX-RS call to the Recover method will carry the recovery header that was provided by the implementation when the original participant was enlisted).

To your last comment:

It could be worthy to consider splitting the capability of the participant location enlistment into two - one for changing location and the other for recovery invocation.

I don't see why that would be necessary: the application is saying "I would like to be asked to compensate at a new location please". Since typically the application uses this recovery behaviour when the LRA is in the state compensating or completing, then after a crash it makes eminent sense for the implementation to replay the termination phase at the new location (it would be surprising if we specified it otherwise).

In response to Rudy:

I am not altogether clear about your scenario so my response may be off. You say

service A is crashed at that moment and restored at a different URL.
How is the cancelStepA() called eventually?

If service A crashes and comes back up on a different endpoint the method annotated with @recover should be called by something. This will cause the coordinator/implementation to replay the compensation invocation on the participant where the @recover annotation is present.

@ochaloup
Copy link
Contributor

ochaloup commented Apr 2, 2019

@mmusgrov :

The two APIs (programmatic and annotations) are intended to be semantically equivalent.

I don't know what is behind this intention but I don't see it as a good idea here. If there is a programmatic API then there is no much reason to provide this functionality with the annotation.

Your second comment is already covered in the specification document

I see. When I was reading it I thought it would be good to be clarified but as you summarized that for me I have no more objections.

The LRA specification is focused purely on the annotations and the coordinator is private to our implementation of the spec.

I do understand. I wrote my comment rather for a consideration and discussion. I understand that this comment nearly touches the interoperability. Here we talk about API not about the message communication protocol.

I don't see why that would be necessary: the application is saying "I would like to be asked to compensate at a new location please".

Ok. It makes sense to me.

@mmusgrov
Copy link
Contributor Author

mmusgrov commented Apr 2, 2019

So are there any objections if I go ahead and raise a PR for this so that people can see:

  • how it works;
  • what the TCK test for it looks like;
  • the replacement for the current TCK feature for triggering recovery which we need for tests where participants return the 202 Accepted HTTP response

Also I don't think we should be specifying optional behaviour unless there are good reasons to do so.

@rdebusscher
Copy link
Member

@mmusgrov

the method annotated with @recover should be called by something.
The client application? How does it know the ServiceA is now available on other URL? Does it need to keep track of services it calls and check with the Discovery service to see if they are moved?

The only option I see is that the participant informs the coordinator that it is up (and thus each participant needs some kind of unique ID - the Client Id ?) and that recovery is 'automatic' in the sense it doesn't needs an action from any other system.

@tomjenkinson
Copy link
Contributor

IIRC the idea is that:

  1. During @lra calls the service is expected to remember all @HeaderParam(LRA_HTTP_RECOVERY_HEADER) it has been passed.
  2. When it knows it has @Complete or @Compensate the related work it is able to remove them
  3. During recovery, it is expected that this service "SELECT * from myHTTPRecoveryHeaderTable" (style behaviour) and involve it's own @recover method for each of these headers. The LRA vendors implementation might have a servlet filter that therefore is able to intercept this call and (if implemented using client/server approach) can send a message to its (possible implementation choice of) coordinator to say that the participant that mapped to @HeaderParam(LRA_HTTP_RECOVERY_HEADER) is now on a new address (that it would have read from it's configuration).

@mmusgrov - is that correct?

@mmusgrov
Copy link
Contributor Author

mmusgrov commented Apr 3, 2019

@tomjenkinson yes that is how I would imagine it would be used. From a spec implementation point of view the behaviour is very similar to how participants are enlisted, ie the implementation is fully capable of re-registering the participant. It is more burdensome for the application but the presence of the annotation on participants is optional. Note that the implementation has to support the new annotation but it is optional whether or not participants use it.

@rdebusscher I think Tom's response addresses your question but I will create a PR to show how I would imagine it works.

@rdebusscher
Copy link
Member

During @lra calls the service is expected to remember all @HeaderParam(LRA_HTTP_RECOVERY_HEADER) it has been passed.

I don't like the idea that each JAX-RS resource needs to be stateful and even store that info on a persistent store so that it can retrieve that info when it is started up on a different server/Location. This means also that some kind of management is needed to know where to store and retrieve this type of data.

to say that the participant that mapped to @HeaderParam(LRA_HTTP_RECOVERY_HEADER) is now on a new address

This means that the participant needs to be able to contact the Service Registry System to exactly know on which address it is available. Since there is no standard around this, interfaces with all existing and future new Systems needs to be developed/maintained. This is not viable.

@rdebusscher
Copy link
Member

But since you all agree that this seems the only possible way of solving this problem, please go ahead and include it as an OPTIONAL feature.
Then vendors can decide if they are willing to implement it or not.

@mmusgrov
Copy link
Contributor Author

mmusgrov commented Apr 3, 2019

I don't think we should be specifying optional behaviour unless there are good reasons to do so.

@mmusgrov
Copy link
Contributor Author

mmusgrov commented Apr 3, 2019

If a resource does not want to take advantage of this annotation then that is fine for it to omit it, ie the use of the annotation on a participant is optional.

Since it is simple for the LRA spec implementation to implement the new behaviour and since there is no compelling reason to make it optional for the spec it should be mandatory.

@rdebusscher
Copy link
Member

I experimented last night more with this recovery scenario and slightly adjusted the Payara POC to allow participant recovery, even in the case where the service is brought up on another host with another URL.

When the participant registers itself at startup of the service with an ID like - the coordinator perfectly know when and where each participant is, also after a crash.

So there is no need for the LRA_HTTP_RECOVERY_HEADER or any other annotation within the API. Within the spec we just need to mention that implementation must guarantee that in case of a crash, the compensate and compete calls will occur, also in the case the service is started somewhere else after a crash.
Optionally we can indicate that a participant will register itself at startup of the service with the coordinator.

@xstefank
Copy link
Member

xstefank commented Apr 4, 2019

Sorry for the late entry to this discussion.

@rdebusscher I like the idea but the problem is that we don't want to require that there is a coordinator. In routing slip based solution, I don't think the registration on startup is possible to implement. It depends if this should be something to consider as all available implementations now use coordination.

On the question of the new annotation I have still not cleared my mind.

@mmusgrov
Copy link
Contributor Author

mmusgrov commented Apr 7, 2019

The @recover feature is useful for couple of reasons:

  1. to ask the implementation what the final outcome was - ie the same participant comes back up on the same endpoint and wants to know what the final outcome of the LRA was
  2. to ask to be recovered on different endpoint

In case 1 the implementation does not need the recovery id since it can work out what the end point is.
In case 2 the recovery header is required since the implementation needs to lookup the original enlistment.

This new suggestion of a "client id" is changing the spec and we already have a solution for the problem in hand.

@mmusgrov
Copy link
Contributor Author

mmusgrov commented Aug 1, 2019

I think this feature could be a useful addition so I am moving it to the next milestone (rather than close it as will not fix).

@mmusgrov mmusgrov modified the milestones: 1.0, 1.x Aug 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

5 participants