Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restart preempted jobs #33

Closed
wants to merge 39 commits into from
Closed
Changes from 1 commit
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
5496e06
One shot instances
ingwarsw Dec 3, 2018
c600518
Fix merge commit
ingwarsw Dec 3, 2018
8f877a5
Cleaning lost nodes job
ingwarsw Dec 22, 2018
bdb265f
Do not try stop already stopping instances
ingwarsw Dec 24, 2018
6a7ccf8
Dont kill terminated
ingwarsw Dec 27, 2018
e5545af
Post review changes
ingwarsw Jan 2, 2019
c20b914
Merge remote-tracking branch 'jenkins/master' into clean_lost_nodes_pr
ingwarsw Jan 3, 2019
ade6622
Javadoc for added methods
ingwarsw Jan 3, 2019
44b5b7f
Merge remote-tracking branch 'jenkins/master' into pr-one-shot-squash
ingwarsw Jan 5, 2019
f3a46f2
Tests for One shot instances
ingwarsw Jan 9, 2019
2fa4322
Fix all tests
ingwarsw Jan 9, 2019
b265d5d
After review changes
ingwarsw Jan 9, 2019
31d9d27
Merge branch 'clean_lost_nodes_pr' into develop
ingwarsw Jan 10, 2019
6aec866
Delete preemptied instances
ingwarsw Jan 10, 2019
80c9be2
Add restarting tasks in case of preemption
ingwarsw Jan 10, 2019
70ea6c5
Not working version because of lags in operations
ingwarsw Jan 11, 2019
0219669
tests
ingwarsw Jan 13, 2019
cfd851b
Semi working
ingwarsw Jan 15, 2019
7cc3292
Ugly but working
ingwarsw Jan 15, 2019
6708acf
Almost almost
ingwarsw Jan 15, 2019
9a576d4
Minor fixes
ingwarsw Jan 15, 2019
ecf5483
Terminating preempted jobs
ingwarsw Jan 18, 2019
31cd6e0
Nimor fixes
ingwarsw Jan 27, 2019
7a2cc4d
Merge remote-tracking branch 'jenkins/master' into pr-preemptive
ingwarsw Jan 29, 2019
029486e
Merge conflicts
ingwarsw Jan 30, 2019
bcbe1ed
Cleanup
ingwarsw Jan 30, 2019
376a1fa
Fixes after merge
ingwarsw Jan 30, 2019
b607d99
Remove bit of old logs
ingwarsw Jan 30, 2019
ba1a795
Merge remote-tracking branch 'jenkins/master' into pr-preemptive-merge
ingwarsw Mar 1, 2019
3fb8957
Fix mirror issues
ingwarsw Mar 1, 2019
9583f45
Fix mirror compile issues
ingwarsw Mar 1, 2019
5b51374
Merge branch 'master' into pr-preemptive-merge
ingwarsw Mar 12, 2019
702ede8
Merge remote-tracking branch 'jenkins/master' into pr-preemptive-merge
ingwarsw Mar 19, 2019
022c596
Merge branch 'pr-preemptive-merge' of https://github.com/ingwarsw/goo…
ingwarsw Mar 19, 2019
ae8fd88
Review changes
ingwarsw Mar 19, 2019
007abb4
Fix message
ingwarsw Mar 19, 2019
226970e
Fix message
ingwarsw Mar 19, 2019
c5401dd
Merge remote-tracking branch 'jenkins/master' into pr-preemptive-merge
ingwarsw Mar 19, 2019
4df1339
Fix license headers
ingwarsw Mar 19, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Review changes
ingwarsw committed Mar 19, 2019
commit ae8fd88a95dc530d61e868446a77feb152283bfb
Original file line number Diff line number Diff line change
@@ -54,9 +54,9 @@ void onConnected(TaskListener listener) throws IOException {
getChannel().addListener(new Channel.Listener() {
@Override
public void onClosed(Channel channel, IOException cause) {
LOGGER.log(Level.INFO, "Got channel close event");
LOGGER.log(Level.FINE, "Got channel close event");
if (getPreempted()) {
LOGGER.log(Level.INFO, "Goc channel close and its preempted");
LOGGER.log(Level.FINE, "Preempted node channel closed, terminating all executors");
getExecutors().forEach(executor -> interruptExecutor(executor, nodeName));
}
}
Original file line number Diff line number Diff line change
@@ -23,7 +23,6 @@

import java.io.IOException;


@Extension
public class ComputeEngineComputerListener extends ComputerListener {
@Override
Original file line number Diff line number Diff line change
@@ -29,9 +29,9 @@

import java.io.IOException;
import java.util.Collections;
import java.util.Optional;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.Optional;

public class ComputeEngineInstance extends AbstractCloudSlave {
private static final long serialVersionUID = 1;
@@ -100,6 +100,7 @@ protected void _terminate(TaskListener listener) throws IOException, Interrupted

/**
* Based on the instance configuration, whether to create snapshot for an instance with failed builds at deletion time.
*
* @return Whether or not to create the snapshot.
*/
public boolean isCreateSnapshot() {
@@ -109,7 +110,7 @@ public boolean isCreateSnapshot() {
public String getCloudName() {
return cloudName;
}

public void onConnected() {
this.connected = true;
}
Original file line number Diff line number Diff line change
@@ -23,20 +23,22 @@
import hudson.model.ExecutorListener;
import hudson.model.Job;
import hudson.model.Queue;
import hudson.model.queue.Tasks;
import hudson.security.ACL;
import hudson.security.ACLContext;
import hudson.slaves.RetentionStrategy;
import jenkins.model.Jenkins;
import org.jenkinsci.plugins.durabletask.executors.OnceRetentionStrategy;

import javax.annotation.Nonnull;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

ingwarsw marked this conversation as resolved.
Show resolved Hide resolved
/**
* A strategy that allows:
* - setting one shot instances {@link OnceRetentionStrategy}
* - in case of preemption of GCP instance to restart preempted tasks
*/
public class ComputeEngineRetentionStrategy extends RetentionStrategy<ComputeEngineComputer> implements ExecutorListener {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this class. We'll need to make sure it has proper unit test coverage.


private static final Logger LOGGER = Logger.getLogger(ComputeEngineRetentionStrategy.class.getName());

private final OnceRetentionStrategy delegate;
@@ -54,12 +56,12 @@ public class ComputeEngineRetentionStrategy extends RetentionStrategy<ComputeEng
}

@Override
public long check(@Nonnull ComputeEngineComputer c) {
public long check(ComputeEngineComputer c) {
return delegate.check(c);
}

@Override
public void start(@Nonnull ComputeEngineComputer c) {
public void start(ComputeEngineComputer c) {
delegate.start(c);
}

@@ -87,10 +89,10 @@ public void taskCompletedWithProblems(Executor executor, Queue.Task task, long d
}

private Queue.Task getBaseTask(Queue.Task task) {
Queue.Task parent = Tasks.getOwnerTaskOf(task);
Queue.Task parent = task.getOwnerTask();
while (task != parent) {
task = parent;
parent = Tasks.getOwnerTaskOf(task);
parent = task.getOwnerTask();
}
return parent;
}
@@ -103,8 +105,8 @@ private void checkPreempted(Executor executor, Queue.Task task) {
if (preemptible && preempted) {
LOGGER.log(Level.INFO, baseTask + " is preemptive and was preempted");
List<Action> actions = generateActionsForTask(task);
try (ACLContext context = ACL.as(task.getDefaultAuthentication())) {
Jenkins.getInstance().getQueue().schedule2(baseTask, 0, actions);
try (ACLContext notUsed = ACL.as(task.getDefaultAuthentication())) {
Jenkins.get().getQueue().schedule2(baseTask, 0, actions);
}
} else if (preemptible) {
LOGGER.log(Level.INFO, baseTask + " is preemptive and was NOT preempted");
@@ -116,21 +118,20 @@ private List<Action> generateActionsForTask(Queue.Task task) {
try {
final Job job = (Job) baseTask;
final List causes = job.getLastBuild().getCauses();
System.out.println("Causes: " + causes);
LOGGER.log(Level.FINE, "Causes: " + causes);
} catch (Exception e) {
System.out.println("Exception for " + baseTask);
e.printStackTrace();
LOGGER.log(Level.WARNING, "Exception for " + baseTask, e);
}
return ImmutableList.of(
new CauseAction(new Cause.UserIdCause()),
new CauseAction(new RebuildCause())
);
}

public static class RebuildCause extends Cause {
@Override
public String getShortDescription() {
return "Rebuilding preempted job";
return Messages.RebuildCause_ShortDescription();
}
}
}
Original file line number Diff line number Diff line change
@@ -28,23 +28,40 @@

import java.io.IOException;

/**
* Slave callback class checking if instance was preempted.
* All of code here is serialized and executed on node.
*/
final class PreemptedCheckCallable extends MasterToSlaveCallable<Boolean, IOException> {
ingwarsw marked this conversation as resolved.
Show resolved Hide resolved
private static final String METADATA_SERVER_URL = "http://metadata.google.internal/computeMetadata/v1/instance/preempted?wait_for_change=true";

private final TaskListener listener;

/**
* Callback constructor.
* @param listener Node listener on which we can add extra information from slave side.
*/
PreemptedCheckCallable(TaskListener listener) {
this.listener = listener;
}

/**
* Actual callback code, executed on node side.
* Checks in Google metadata server if instance was preempted.
*
* See https://cloud.google.com/compute/docs/instances/create-start-preemptible-instance#detecting_if_an_instance_was_preempted
*
* @return True if node was preempted.
* @throws IOException Exception when calling Google metadata API
*/
@Override
public Boolean call() throws IOException {
HttpTransport transport = new NetHttpTransport();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method's content should exist within a client abstraction similar to the compute client. Also, it's client should be created with the ClientFactory. This de-duplicates much of this logic, and ensures the user agent string for the plugin is preserved. Also, a generated java client lib for the API should be used if it's available.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we will never use this methods in any different places..
So we dont need extra (and rather dont want) client for it..
And the smaller its the faster it works, cause everything here must be serialised and send to slave node (cause its called from slave itself)

I didnt found any lib for that purpose.. I tried but i failed.. if you know any let me know..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked with our libraries team, and there currently doesn't exist a client lib for this API. That said, it would still be preferable to maintain the consistency of client abstractions we've established throughout the code base, and move this API logic into its own client class. It allows for better testing and separation of concerns.

GenericUrl metadata = new GenericUrl(METADATA_SERVER_URL);
HttpRequest request = transport.createRequestFactory().buildGetRequest(metadata);
request.setHeaders(new HttpHeaders().set("Metadata-Flavor", "Google"));
request.setReadTimeout(Integer.MAX_VALUE);
listener.getLogger().println("Preemptible instance, listening metadata for preemption event");
listener.getLogger().println("Preemptive instance, listening metadata for preemption event");
HttpResponse response = request.execute();
final String result = IOUtils.toString(response.getContent(), Charsets.UTF_8);
listener.getLogger().println("Got preemption event " + result);
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
ComputeEngineCloud.DisplayName=Google Compute Engine
ComputeEngineAgent.DisplayName=Google Compute Engine
InstanceConfiguration.SnapshotConfigError=One-shot must be enabled to create snapshots
InstanceConfiguration.SnapshotConfigError=One-shot must be enabled to create snapshots
RebuildCause.ShortDescription=Rebuilding preempted job