475 last operation description #619

Merged
merged 2 commits into from
Feb 17, 2018

Conversation

maleck13
Contributor

@maleck13 maleck13 commented Jan 8, 2018

Describe what this PR does and why we need it:
broker implementation for issue 475 last operation description
Note
This PR is based on top of the previous PR #610, so it is probably better to review it once that one is closed out.
Changes proposed in this pull request

  • Add a new channel that sends JobState changes asynchronously, so the state is already updated when the last operation polling endpoint reads it.
  • Add tests for the places that have changed.
  • Change the pod watch to use the watch API.
  • Add StartSyncJob to unify the broker Job API.
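
The first change above can be pictured with a minimal Go sketch. It is not the code from this PR: `JobState` is trimmed to three fields, and `StateStore`, `Subscribe`, and `RunJob` are hypothetical names standing in for the broker's DAO, subscriber, and job. It only shows the flow described in the list, where a job pushes state changes onto a channel and a subscriber persists them so a polling endpoint can read the latest description.

```go
package main

import (
	"fmt"
	"sync"
)

// JobState is a hypothetical, trimmed-down stand-in for the broker's job state.
type JobState struct {
	Token       string
	State       string
	Description string
}

// StateStore is a hypothetical in-memory replacement for the broker's DAO.
type StateStore struct {
	mu     sync.RWMutex
	states map[string]JobState
}

func NewStateStore() *StateStore {
	return &StateStore{states: make(map[string]JobState)}
}

// Set records the latest state for a job token.
func (s *StateStore) Set(js JobState) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.states[js.Token] = js
}

// Get is what a last operation handler would call while polling.
func (s *StateStore) Get(token string) (JobState, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	js, ok := s.states[token]
	return js, ok
}

// Subscribe persists every state change until updates is closed,
// then closes the returned channel to signal that it has finished.
func Subscribe(updates <-chan JobState, store *StateStore) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		for js := range updates {
			store.Set(js)
		}
	}()
	return done
}

// RunJob simulates a provision job reporting its progress, then closes the channel.
func RunJob(token string, updates chan<- JobState) {
	defer close(updates)
	for _, d := range []string{"created postgres service", "created keycloak route"} {
		updates <- JobState{Token: token, State: "in progress", Description: d}
	}
	updates <- JobState{Token: token, State: "succeeded", Description: "provisioned successfully"}
}

func main() {
	store := NewStateStore()
	updates := make(chan JobState)
	done := Subscribe(updates, store)
	go RunJob("job-1", updates)
	<-done
	last, _ := store.Get("job-1")
	fmt.Printf("%s: %s\n", last.State, last.Description)
}
```

Because the job never touches the store directly, the sketch keeps persistence inside the subscriber, which is the division of roles the discussion later in this thread wants to preserve.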

Does this PR depend on another PR (Use this to track when PRs should be merged)
depends-on #610
depends-on #671

Which issue this PR fixes (This will close that issue when PR gets merged)
fixes #475

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 8, 2018
@coveralls

Coverage Status

Changes Unknown when pulling 0f253da on maleck13:475-last-operation-description into openshift:master.

@maleck13
Contributor Author

maleck13 commented Jan 9, 2018

@jmrodri I know you are working on async bind at the moment. I think this PR should wait and be rebased on your work once it lands. I will likely need to make some changes to support last operation updates in async bind operations as well.

@rthallisey rthallisey added the 3.10 | release-1.2 Kubernetes 1.10 | Openshift 3.10 | Broker release-1.2 label Jan 9, 2018
@rthallisey
Contributor

rthallisey commented Jan 9, 2018

Related #542

@coveralls

Coverage Status

Changes Unknown when pulling 674aebb on maleck13:475-last-operation-description into openshift:master.

@jmrodri jmrodri self-requested a review January 12, 2018 04:21
@maleck13 maleck13 mentioned this pull request Jan 23, 2018
@maleck13 maleck13 force-pushed the 475-last-operation-description branch from 674aebb to 3e8ca31 Compare January 23, 2018 12:30
@maleck13 maleck13 changed the title WIP 475 last operation description 475 last operation description Jan 23, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 23, 2018
@maleck13
Contributor Author

The broker code is in good shape I think (at least ready for review). Still need help understanding how to test the module

@maleck13 maleck13 force-pushed the 475-last-operation-description branch from 3e8ca31 to 0175a9a Compare January 23, 2018 13:11
@maleck13
Contributor Author

maleck13 commented Jan 23, 2018

Test Case:

Build the broker image from this branch
Point it at the maleck13 org
Provision the keycloak apb
In the namespace where the keycloak instance was provisioned you should see the service instance. If you expand it, you can see the details of the last operation.
Another option is to watch the serviceinstance. You should see something like:

status: Provision request for ServiceInstance in-flight to Broker
status: The instance is being provisioned asynchronously
status: The instance is being provisioned asynchronously (created postgres service)
status: The instance is being provisioned asynchronously (created keycloak route)
status: The instance was provisioned successfully

Example code for the apb is here https://github.com/maleck13/keycloak-apb/blob/add-last-ops/roles/provision-keycloak-apb/tasks/provision-keycloak.yml#L11
Note the above code is only example code as I had to add the last_operation.py file manually to test it. However once PR ansibleplaybookbundle/ansible-asb-modules#9 is merged I think it will then be available to all new APBs

@maleck13 maleck13 force-pushed the 475-last-operation-description branch 2 times, most recently from 4507c4c to 384da98 Compare January 23, 2018 18:05
Contributor

@rthallisey rthallisey left a comment

Nice! Lots of new tests!

I'm going to look into pulling last_operation info during the $actions. It would be cool to see this running all the time in the gate.

}
send(state)
podStatus := pod.Status
log.Debugf("pod [%s] in phase %s", podName, podStatus.Phase)
switch podStatus.Phase {
case apiv1.PodFailed:
if errorPullingImage(podStatus.ContainerStatuses) {
Contributor

I think we want a w.Stop() here too before we return.


time.Sleep(time.Duration(apbWatchInterval) * time.Second)
if podEvent.Type == watch.Deleted {
w.Stop()
Contributor

Will we only get in here if the pod is deleted before either PodSucceeded or PodFailed? In other words, I think you can produce this when you delete the apb during execution. If that's the case, I think we should either return an error or log that the apb was deleted. What do you think?

Contributor Author

Yes, I think you are correct. In most cases the pod would succeed or fail. So if it was somehow deleted, it would likely be an error and it should be reported.
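
The two review points on the pod watch (stop the watch before every return, and treat a deleted pod as an error) can be sketched as follows. This is an illustration under assumptions, not the PR's code: `Watcher`, `PodEvent`, `WatchPod`, and the error values are hypothetical local types that only mirror the `ResultChan()`/`Stop()` shape of a Kubernetes watch, so the example runs without client-go.

```go
package main

import (
	"errors"
	"fmt"
)

// EventType and PodPhase are hypothetical local stand-ins for the
// watch event type and pod phase used by the real watch API.
type EventType string
type PodPhase string

const (
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"

	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// PodEvent is a hypothetical event carrying only what this sketch needs.
type PodEvent struct {
	Type  EventType
	Phase PodPhase
}

// Watcher is a hypothetical minimal watch handle.
type Watcher interface {
	ResultChan() <-chan PodEvent
	Stop()
}

var (
	ErrPodDeleted  = errors.New("apb pod was deleted before it completed")
	ErrPodFailed   = errors.New("apb pod failed")
	ErrWatchClosed = errors.New("watch closed before the pod reached a terminal phase")
)

// WatchPod blocks until the pod succeeds, fails, or is deleted.
// The deferred Stop releases the watch on every return path.
func WatchPod(w Watcher) error {
	defer w.Stop()
	for ev := range w.ResultChan() {
		if ev.Type == Deleted {
			// Deletion before a terminal phase is reported, not swallowed.
			return ErrPodDeleted
		}
		switch ev.Phase {
		case PodSucceeded:
			return nil
		case PodFailed:
			return ErrPodFailed
		}
	}
	return ErrWatchClosed
}

// fakeWatcher replays a fixed list of events and records whether Stop was called.
type fakeWatcher struct {
	ch      chan PodEvent
	stopped bool
}

func newFakeWatcher(events ...PodEvent) *fakeWatcher {
	ch := make(chan PodEvent, len(events))
	for _, e := range events {
		ch <- e
	}
	close(ch)
	return &fakeWatcher{ch: ch}
}

func (f *fakeWatcher) ResultChan() <-chan PodEvent { return f.ch }
func (f *fakeWatcher) Stop()                       { f.stopped = true }

func main() {
	w := newFakeWatcher(
		PodEvent{Type: Modified, Phase: PodRunning},
		PodEvent{Type: Deleted, Phase: PodRunning},
	)
	fmt.Println(WatchPod(w), "stopped:", w.stopped)
}
```

Using a single `defer` avoids having to remember a `Stop()` call in each branch, which is the kind of omission the first review comment caught.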


log.Debug("bindjob: returned from apb.Bind")
//read our status updates and send on updated JobMsgs for the subscriber to persist
for su := range stateUpdates {
Contributor

If we're blocking on reading the bind status updates, is this really async? Do we have to block on this?

Contributor Author

@maleck13 maleck13 Jan 24, 2018

So previously the bind function itself was synchronous and would block the job function. Now we have put that into its own goroutine, but we still need to wait until the Job is complete within the Run function. So we block on the status updates channel, as this is only closed once the Bind job is complete.
The Job itself, though, is started in its own goroutine within the work engine and so is still asynchronous.
https://github.com/openshift/ansible-service-broker/pull/619/files#diff-bcf626f58af278748816aca2379c1928R53
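
A small Go sketch may make the answer above easier to follow. It assumes hypothetical names (`bind`, `Run`, `StartJob`, `JobMsg`) rather than the broker's real ones. `Run` blocks while ranging over the status channel, which is closed only when the bind work returns, while `StartJob` plays the part of the work engine and keeps the caller asynchronous by running `Run` in its own goroutine.

```go
package main

import "fmt"

// JobMsg is a hypothetical message type passed on to subscribers.
type JobMsg struct{ State, Description string }

// bind is a hypothetical stand-in for the synchronous bind work.
func bind(stateUpdates chan<- string) {
	stateUpdates <- "creating credentials"
	stateUpdates <- "binding complete"
}

// Run blocks until the bind work has finished. The range loop ends only
// when stateUpdates is closed, which happens after bind returns.
func Run(msgs chan<- JobMsg) {
	stateUpdates := make(chan string)
	go func() {
		defer close(stateUpdates)
		bind(stateUpdates)
	}()
	for su := range stateUpdates {
		msgs <- JobMsg{State: "in progress", Description: su}
	}
	msgs <- JobMsg{State: "succeeded", Description: "bind finished"}
}

// StartJob stands in for the work engine: it returns immediately and
// runs the blocking Run function in a separate goroutine.
func StartJob() <-chan JobMsg {
	msgs := make(chan JobMsg)
	go func() {
		defer close(msgs)
		Run(msgs)
	}()
	return msgs
}

func main() {
	for m := range StartJob() {
		fmt.Printf("%s: %s\n", m.State, m.Description)
	}
}
```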


go func() {
defer func() {
metrics.UnbindingJobFinished()
Contributor

There seems to be the same general workflow for all the actions. Maybe at some point we can break apart the Run function into multiple functions to make the API more granular. It's something we can tackle down the road.

Contributor Author

Or potentially abstract it into a single run function as they all effectively do the same thing
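
One possible shape for such a shared run function is sketched below. Everything in it is hypothetical (`Action`, `runJob`, `JobMsg`, and the `onFinish` hook that stands in for a metrics call such as the one in the snippet above); it only illustrates that provision, bind, and unbind could share one workflow and differ in the action they pass in.

```go
package main

import (
	"errors"
	"fmt"
)

// JobMsg is a hypothetical message describing a job's progress.
type JobMsg struct{ Method, State, Description string }

// Action is a hypothetical signature shared by every apb action in this sketch.
type Action func(updates chan<- string) error

// runJob is the shared workflow: run the action in a goroutine, relay its
// updates, then report success or failure. onFinish always runs on return.
func runJob(method string, action Action, onFinish func()) []JobMsg {
	defer onFinish()
	updates := make(chan string)
	errc := make(chan error, 1) // buffered so the worker never blocks on it
	go func() {
		defer close(updates)
		errc <- action(updates)
	}()
	var msgs []JobMsg
	for u := range updates {
		msgs = append(msgs, JobMsg{Method: method, State: "in progress", Description: u})
	}
	if err := <-errc; err != nil {
		return append(msgs, JobMsg{Method: method, State: "failed", Description: err.Error()})
	}
	return append(msgs, JobMsg{Method: method, State: "succeeded"})
}

func main() {
	unbind := func(updates chan<- string) error {
		updates <- "removing credentials"
		return errors.New("secret not found")
	}
	for _, m := range runJob("unbind", unbind, func() { fmt.Println("unbind job finished") }) {
		fmt.Printf("%s %s: %s\n", m.Method, m.State, m.Description)
	}
}
```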

@maleck13
Contributor Author

@rthallisey I have addressed your requested changes and comments

@rthallisey
Contributor

@maleck13 can you rebase this to pull in some fixes to travis?

@maleck13 maleck13 force-pushed the 475-last-operation-description branch from 344cfa1 to a3291d9 Compare January 24, 2018 18:22
@eriknelson
Contributor

Looks like this needs a rebase @maleck13. Trying to get through outstanding PRs this week, happy to come back to this when it's green.

@maleck13 maleck13 force-pushed the 475-last-operation-description branch 2 times, most recently from 61df7eb to 4f0301d Compare February 10, 2018 22:20
@rthallisey
Contributor

@eriknelson can you have another look at this?

@eriknelson
Contributor

@maleck13 @rthallisey grabbing lunch and I will review. Thanks guys.

log.Debugf(
"Watching pod [ %s ] in namespace [ %s ] for completion",
podName,
namespace,
)

k8scli, err := clients.Kubernetes()
Contributor

Why don't we get the v1.PodInterface in this method?

I think that is better because, as it stands, every caller has to get the client for the sole purpose of passing it into a function. Getting it here encapsulates that behaviour, and if it errors we bubble up that error.

If the only reason we are doing it is for testing, then I think we need to do a better job of mocking out the core components rather than passing values into functions. I think we got bitten by doing that before.

Contributor Author

@maleck13 maleck13 Feb 12, 2018

If we want unit tests for a method like this, then at some point we will need to use an interface to allow us to mock the interactions with external dependencies.
Having a package level function used internally as part of this method, means we cannot mock it out.
Really this is just a simple form of dependency injection. The package level functions can make this tricky. I would prefer to have a unit test for the logic here, than not have one? I took a similar approach to the provision function here
https://github.com/maleck13/ansible-service-broker/blob/4f0301dac3dae580cb5f393baafaa70a2d859073/pkg/broker/broker.go#L377
Except there it is passed in via the constructor. In this case it would be a bigger refactor to have the dependency passed as part of a constructor, as it would likely mean changes all the way up the call chain.

I think we need to do a better job of mocking out the core components rather than passing values into functions

On that point though, longer term it may be worth discussing whether to change some of the package functions to be struct methods and to use kubernetes.Interface in place of the *client.ClientSet concrete type.
If we use the kubernetes.Interface and refactor to allow it to be injected as part of the main method via constructors and dependency injection, it will allow us to have a very powerful set of unit tests as we can control the behavior of the tests by mocking out just the client as generally this is what underpins all of the external communications. This becomes even more true if we move to CRDs instead of a separate etcd.

I can create a follow on issue for this.
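
The dependency injection argument above can be illustrated with a short Go sketch. To stay runnable without client-go it does not use `kubernetes.Interface`; `PodGetter`, `Monitor`, `NewMonitor`, and `fakePods` are hypothetical names, and the narrow interface is an assumption made for the example. The point is only that logic depending on an interface can be unit tested with a fake, whereas logic calling a package-level function cannot.

```go
package main

import (
	"errors"
	"fmt"
)

// PodGetter is a hypothetical narrow interface over the one cluster call
// this logic needs. A real implementation would wrap a Kubernetes client.
type PodGetter interface {
	GetPhase(namespace, name string) (string, error)
}

// Monitor receives its dependency through the constructor.
type Monitor struct{ pods PodGetter }

func NewMonitor(pods PodGetter) *Monitor { return &Monitor{pods: pods} }

// Completed holds the logic under test: it reports whether the pod succeeded
// and wraps any error coming from the cluster call.
func (m *Monitor) Completed(namespace, name string) (bool, error) {
	phase, err := m.pods.GetPhase(namespace, name)
	if err != nil {
		return false, fmt.Errorf("getting pod %s/%s: %w", namespace, name, err)
	}
	return phase == "Succeeded", nil
}

// fakePods is the test double. It returns canned values and needs no cluster.
type fakePods struct {
	phase string
	err   error
}

func (f fakePods) GetPhase(_, _ string) (string, error) { return f.phase, f.err }

// errUnavailable is a canned failure used to exercise the error path.
var errUnavailable = errors.New("api unavailable")

func main() {
	ok, err := NewMonitor(fakePods{phase: "Succeeded"}).Completed("demo", "apb-pod")
	fmt.Println(ok, err)

	_, err = NewMonitor(fakePods{err: errUnavailable}).Completed("demo", "apb-pod")
	fmt.Println(err)
}
```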

@eriknelson eriknelson added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 16, 2018
@eriknelson
Contributor

eriknelson commented Feb 16, 2018

@maleck13 Thanks again for your contribution. Had a chance to sit down with @shurley this morning and talk through this one. Overall, it's looking pretty good and it's absolutely a feature we need to bring in for 3.10. Have a couple of high level requests for change:

  1. A statusUpdates channel has been introduced into the various apb.<action>
    signatures as a way for them to push status update events out to subscribers.
    Instead of callers creating the channels and passing them as arguments,
    I think the function parameters should remain the same as they are today,
    and the function should return a channel that accepts a StatusMessage interface
    that lives in the apb pkg:
// apb/types.go
type StatusMessage interface {
  String() string
}

--

// Example provision.go
// Provision - will run the apb with the provision action.
func Provision(
        instance *ServiceInstance,
) (string, *ExtractedCredentials, error, <-chan apb.StatusMessage){
  // Create channel
  // Do work
  // Return previous args + channel
}

The reasoning is that in the near future, we're expecting to break the apb
package out into its own vendorable library for other clients, and we would
like the apb pkg (and the actions) to own the channel resource, not the caller.
Additionally, it's reasonable that callers may not care about statusUpdates
at all, so forcing a caller to deal with that at the function signature level
is undesirable. With the proposed change, the caller can simply ignore the
channel with a _ variable. The StatusMessage interface decouples the apb
pkg from the broker pkg, which is also important (as opposed to a JobState).

  2. There does not appear to be much use for the StartNewSyncJob variant.
    I can understand that there might be a desire for status updates to occur with
    sync requests as well, but in practice, the job is done with a completion message
    that overrides any incremental status updates before the token is even returned
    to the client. Ultimately, we're unable to imagine a use-case for sync updates.
    To reduce unnecessary complexity, let's revert the split and keep the existing
    approach, with a single async start job, and have the sync branches of the broker.go
    methods simply call the apb.<action>. They can ignore the returned channel from 1).

Happy to continue the discussion on these points, otherwise I'll make sure we stay on top of updates so this isn't sitting in a PR for too much longer.

EDIT: @shawn-hurley raised an important point re: the return channels that requires an amendment to this. Hashing that out and I will update this comment.

Contributor

@eriknelson eriknelson left a comment

See above.

Contributor

@jmrodri jmrodri left a comment

Based on some discussions.

@eriknelson
Contributor

eriknelson commented Feb 16, 2018

Background: The issue with 1) is that the channel is not going to be returned to a caller until the function has finished, which suggests the need to wrap the work in a go routine...which requires a channel to allow the caller to synchronize the work.

So, this PR has opened a really productive dialogue around what, exactly, we want the public interface for libapb to look like. The conclusion is that we would like 2 suites of actions that allow folks to run APBs either asynchronously, or synchronously: Provision(...) <-chan StatusMessage and ProvisionSync(...). The former being our preferred, recommended way for starting long running work, and getting event-driven updates as to their status.
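
The two suites described above could look roughly like the following Go sketch. It is not the libapb interface, which this thread leaves for a follow-up proposal; the parameter list, the `textStatus` type, and the step names are assumptions. `Provision` starts the work in a goroutine so the channel is returned before the work finishes, which is the issue raised in the background note, and `ProvisionSync` is built on top of it by draining the channel.

```go
package main

import "fmt"

// StatusMessage matches the interface proposed earlier in this thread.
type StatusMessage interface {
	String() string
}

// textStatus is a hypothetical concrete message type.
type textStatus string

func (t textStatus) String() string { return string(t) }

// Provision returns immediately. The work runs in its own goroutine and the
// channel is closed when the work is done, so callers can range over it.
func Provision(instanceID string) <-chan StatusMessage {
	out := make(chan StatusMessage)
	go func() {
		defer close(out)
		steps := []string{
			"created postgres service",
			"created keycloak route",
			"provisioned successfully",
		}
		for _, step := range steps {
			out <- textStatus(instanceID + ": " + step)
		}
	}()
	return out
}

// ProvisionSync blocks until the work is complete and returns the final status.
func ProvisionSync(instanceID string) StatusMessage {
	var last StatusMessage
	for msg := range Provision(instanceID) {
		last = msg
	}
	return last
}

func main() {
	for msg := range Provision("instance-a") {
		fmt.Println(msg)
	}
	fmt.Println(ProvisionSync("instance-b"))
}
```

One design caveat with this sketch: the channel is unbuffered, so a caller that discards it without draining would leave the worker goroutine blocked on its first send. A real API would need a buffered channel, a cancellation mechanism, or a documented requirement to drain.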

This really isn't in the scope of this PR, so in an effort to bring in your work, it makes sense to merge this, log some issues, and fix them as a follow up.

We do still need a rebase.

@eriknelson
Contributor

@maleck13 please rebase and we're ready to bring this into master.

fix type in SubscriberDAO after code review
rebase and update the unbind and bind jobs to accept state updates add the sync and async job start to unbind and bind. Minor updates

address review feedback add a w.Stop on pod failed and return error on pod deleted
@maleck13 maleck13 force-pushed the 475-last-operation-description branch from 4f0301d to 1cfb140 Compare February 17, 2018 21:45
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 17, 2018
@maleck13
Contributor Author

@eriknelson Great, excited to have this finally merged.

Thanks for your detailed comments. Adding some thoughts of my own here:

Background: The issue with 1) is that the channel is not going to be returned to a caller until the function has finished, which suggests the need to wrap the work in a go routine...which requires a channel to allow the caller to synchronize the work.

It's possible I am restating the problem.

I generally agree that it would be nice to return a channel from the Job. I played around with this option early on when working on the feature, but it presented me with some issues:

  1. The caller in this case is the broker, which would mean the broker would need to read the status from the channel and write the state via the DAO, this seemed counter to the current design and role of the subscribers.
  2. If not option 1, then the broker.go methods would need access to the workmsg channel and would have to funnel the status messages to it so the subscribers could do their work. This also seemed counter to the current design. It could still be an option, but as you call out, I think changing the Jobs to return a channel needs some more thought (perhaps a proposal)?

So I went with what seemed the least disruptive approach that kept the broker design the same.

There does not appear to be much use for the StartNewSyncJob variant.
I can understand that there might be a desire for status updates to occur with
sync requests as well, but in practice, the job is done with a completion message
that overrides any incremental status updates before the token is even returned
to the client. Ultimately, we're unable to imagine a use-case for sync updates.
To reduce unnecessary complexity, let's revert the split and keep the existing
approach, with a single async start job, and have the sync branches of the broker.go
methods simply call the apb.<action>. They can ignore the returned channel from 1).

Interesting, I had actually hoped that change would reduce the complexity. The status updates are of no value here, you are correct, as there is no opportunity for them to be seen.
The goal of the change was to have the same approach to both sync and async jobs and to have the work engine be the only place that jobs were handled. It seemed that broker.go needed to know too much about apb methods (it had to know that calling apb.Provision was a synchronous action). So adding a StartSyncJob method seemed to ensure that broker.go didn't know too much about the internals of the apb package but instead relied on the API provided.

Also happy to chat further or jump on call.

@eriknelson
Contributor

The caller in this case is the broker, which would mean the broker would need to read the status from the channel and write the state via the DAO, this seemed counter to the current design and role of the subscribers.

I actually initially thought we weren't using the subscribers for this, but I was incorrect, so I have to thank you for preserving that. Definitely an architecture we want to keep, especially with a revamped engine. I'm not sure I understand this statement though; regardless of whether or not the caller creates the channel and passes it in, or if the apb package returns the channel, the caller (broker job) is going to have to monitor that channel for messages, wrap them in JobMessages, and pass along to the Subscriber?

It seemed the broker.go needed to know too much about apb methods (it had to know that calling apb.Provision was a synchronous action).

I'm personally not too concerned about this. As we continue, the OSB domain is going to get pushed into some kind of a Broker Framework, and the APB domain is getting pushed into a libapb. Ultimately, this repo is going to turn into the glue between those two worlds, so IMO, that's exactly what this binary is going to be concerned with.

Anyway, we're of the opinion this PR adds quite a bit of value as-is, and most of the feedback is more appropriate for a follow up that's part of the work moving in the direction I described. I'm planning on posting a proposal with some code on top of this for review from the community, since I think I can better explain myself with actual code. Thanks again for your help!

@eriknelson eriknelson merged commit b7e109d into openshift:master Feb 17, 2018
@maleck13 maleck13 deleted the 475-last-operation-description branch February 18, 2018 09:01
jianzhangbjz pushed a commit to jianzhangbjz/ansible-service-broker that referenced this pull request May 17, 2018
* add test cases for provision subscriber and job

fix type in SubscriberDAO after code review

* initial pass at last operation description implementation

rebase and update the unbind and bind jobs to accept state updates add the sync and async job start to unbind and bind. Minor updates

address review feedback add a w.Stop on pod failed and return error on pod deleted
Labels
3.10 | release-1.2 Kubernetes 1.10 | Openshift 3.10 | Broker release-1.2 size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Last Operation description
8 participants