Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart preempted jobs #33

Closed
wants to merge 39 commits into from

Conversation

ingwarsw
Copy link
Contributor

@ingwarsw ingwarsw commented Mar 1, 2019

This PR is "working prototype" for restarting jobs after being preempted by google.

Im testing in on our system for more than 2 months and its working, Its restarting all jobs that was preempted, but there are few things that should be done to make it production ready.

Missing features:

  • Pass parameters to restarted jobs
  • Work with other types of jobs (only pipeline tested)
  • Pass original owner of build, and add some Cause about restart
  • Add checkbox in GUI to actually dont restart failed jobs
  • Tests

I can finish it but maybe you have some ideas how to do some of needed things.

Copy link
Contributor

@rachely3n rachely3n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far. I just had a few comments.
Also, can we have method headers and descriptions for the functions/classes? Will create less confusion down the line. I will need some time to think about the use case and what other features to include.


boolean getPreempted() {
try {
return preemptedFuture != null && preemptedFuture.isDone() && preemptedFuture.get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this logic a bit further? Just want to make sure we are both thinking the same thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It basically checks if node was preempted :)

First we check if we have feature (if node is not preemptive we may not have it set)
Then we check if feature is done.
And if its done check if its true == preempted

@@ -89,13 +89,10 @@ protected void _terminate(TaskListener listener) throws IOException, Interrupted

// If the instance is running, attempt to terminate it. This is an asynch call and we
// return immediately, hoping for the best.
cloud.client.terminateInstanceWithStatus(cloud.projectId, zone, name, "RUNNING");
cloud.getClient().terminateInstance(cloud.getProjectId(), zone, name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* @param retentionTimeMinutes number of minutes of idleness after which to kill the slave; serves a backup in case the strategy fails to detect the end of a task
* @param oneShot create one shot instance strategy
*/
ComputeEngineRetentionStrategy(int retentionTimeMinutes, boolean oneShot) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this is package scoped and not public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It dont need to be public? :)
Any reason to make it public?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been the convention throughout the rest of the code base. That said, I don't have any strong opinions one way or the other :)

@craigdbarber
Copy link
Contributor

Thanks for digging in to this. Would like to see test coverage for this.

Copy link
Contributor

@craigdbarber craigdbarber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few inline comments.

*/
@Override
public Boolean call() throws IOException {
HttpTransport transport = new NetHttpTransport();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method's content should exist within a client abstraction similar to the compute client. Also, it's client should be created with the ClientFactory. This de-duplicates much of this logic, and ensures the user agent string for the plugin is preserved. Also, a generated java client lib for the API should be used if it's available.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we will never use this methods in any different places..
So we dont need extra (and rather dont want) client for it..
And the smaller its the faster it works, cause everything here must be serialised and send to slave node (cause its called from slave itself)

I didnt found any lib for that purpose.. I tried but i failed.. if you know any let me know..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked with our libraries team, and there currently doesn't exist a client lib for this API. That said, it would still be preferable to maintain the consistency of client abstractions we've established throughout the code base, and move this API logic into its own client class. It allows for better testing and separation of concerns.

* - setting one shot instances {@link OnceRetentionStrategy}
* - in case of preemption of GCP instance to restart preempted tasks
*/
public class ComputeEngineRetentionStrategy extends RetentionStrategy<ComputeEngineComputer> implements ExecutorListener {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this class. We'll need to make sure it has proper unit test coverage.

return parent;
}

private void checkPreempted(Executor executor, Queue.Task task) {
Copy link
Contributor

@craigdbarber craigdbarber Mar 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method seems to encapsulate two separate concerns: 1. Checking if a task was preempted 2. Rescheduling the task if 1. is true. Suggest separating the concerns for improved orthogonality and easier testing.

@craigdbarber
Copy link
Contributor

Pinging this to make sure it doesn't get dropped. Please feel free to reach out if there are any questions regarding the feedback.

@ingwarsw
Copy link
Contributor Author

@craigatgoogle I didnt forget, just dont have time to finish it..
And first need to close #54
We should talk on slack how we would like to do it..
At least few concepts here are really helpful..
And there few ones I have no idea how to finish, but even partially working they should be helpfull..

@joshk0
Copy link

joshk0 commented May 25, 2019

Looks like #54 is merged. How can we help with this change? I could take a whirl at resolving the conflicts if that helps.

@ingwarsw
Copy link
Contributor Author

New PR covering same #113 ..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants