-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
allow re-use of existing pods for restarts #244
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from the one remark about final field converted to non-final everything else looks great to me.
I'd like to better understand the reason behind that change, but still I am ok with merging.
|
||
@SuppressFBWarnings(value = "SE_TRANSIENT_FIELD_NOT_RESTORED", justification = "not needed on deserialization") | ||
private final transient List<Closeable> closables = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the pipeline is resumed, since it is transient, it will not be deserialized...therefore when we come to add to array, it will NPE...therefore, we check before we add to it and create if necessary. Therefore it cannot be final.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an error with a dependency, so tests haven't run yet
@@ -43,7 +43,7 @@ | |||
<!-- jenkins plugins versions --> | |||
<jenkins-basic-steps.version>2.3</jenkins-basic-steps.version> | |||
<jenkins-credentials.version>2.1.7</jenkins-credentials.version> | |||
<jenkins-durable-task.version>1.13</jenkins-durable-task.version> | |||
<jenkins-durable-task.version>1.16</jenkins-durable-task.version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.16 does not exist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I am waiting for the new release that will include jenkinsci/durable-task-plugin#49
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use 1.16-20171108.031054-1
in the meantime. Based on jenkinsci/durable-task-plugin@9b0b723.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jglick
pod = client.pods().inNamespace(namespace).create(pod); | ||
LOGGER.log(Level.INFO, "Created Pod: {0} in namespace {1}", new Object[]{podId, namespace}); | ||
listener.getLogger().printf("Created Pod: %s in namespace %s", podId, namespace); | ||
Pod checkPod = client.pods().inNamespace(namespace).withName(podId).get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC when the master restarts, agents reconnect and this method would not be called and the agents already reconnect.
There may be multiple pods with same podId running different builds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A simple test (https://paste.fedoraproject.org/paste/rwMXdUC670i5PIZFAokjlQ) using the current master gives this in the logs when a restart happens:
https://paste.fedoraproject.org/paste/BRTBZEMvyPxsOo2BJJf6uw
With or without idleMinutes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess a simple check of launched
value at the beginning of the method would be enough. If the pod has been deleted in the mean time we may risk launching another one, leading to an inconsistent state anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Vlatombe so you are ok with the approach in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Vlatombe but how is launched
being persisted across restarts? on a restart, the attempt to reconnect fails with a handshake error:
https://paste.fedoraproject.org/paste/yHv3FoJ35r7vG8J3hWwYEQ
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scoheb Should be in the slave's config.xml > launcher > launched.
Dunno about the handshake for now since the real error is likely in the master's logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the slave config.xml once launched and while running a build:
https://paste.fedoraproject.org/paste/cp~cnrkWdoDkoZS~OLkWLg
Looks like the launched property (when true) is not saved. Do we need a slave.save()
after at the end of the launch()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like the right move.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With master and ensuring 'launched == true' in config.xml (via direct save() in launch()), here is Jenkins log after restart:
https://paste.fedoraproject.org/paste/UhqADZdp010vXMAk2HI8nw
No trace of why the slave is not connecting properly...
@@ -40,7 +46,9 @@ | |||
@Override | |||
public boolean start() throws Exception { | |||
|
|||
Cloud cloud = Jenkins.getInstance().getCloud(step.getCloud()); | |||
LOGGER.info("starting"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this more descriptive and probably just FINE
level ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove this.
semaphore 'wait' | ||
sh ''' | ||
#!/bin/bash | ||
for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sh 'for i in `seq 1 20`; do echo $i; sleep 5; done'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was having issues with 'seq' at one time...I thought it was due to the busybox image. I will update my sh call.
KubernetesNodeContext nodeContext = new KubernetesNodeContext(getContext()); | ||
client = nodeContext.connectToCloud(); | ||
} catch (Exception e) { | ||
ContainerLogStepExecution.this.getContext().onFailure(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Include a TODO comment mentioning JENKINS-40161.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
client = nodeContext.connectToCloud(); | ||
decorator.setKubernetesClient(client); | ||
} catch (Exception e) { | ||
ContainerStepExecution.this.getContext().onFailure(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
// run each step inside its own JenkinsRule | ||
for (Statement step : steps) { | ||
j = new JenkinsRuleNonLocalhost().with(loader); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could probably use jenkinsci/jenkins-test-harness#75 and avoid a lot of code duplication, no?
@@ -42,8 +40,6 @@ | |||
import org.eclipse.jetty.webapp.Configuration; | |||
import org.eclipse.jetty.webapp.WebAppContext; | |||
import org.eclipse.jetty.webapp.WebXmlConfiguration; | |||
import org.jvnet.hudson.test.JenkinsRule; | |||
import org.jvnet.hudson.test.ThreadPoolImpl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why these changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IntelliJ showed them as unused.
15f629a
to
dd33690
Compare
@Vlatombe Since we were not checking |
@carlossg Would you be able to provide some of the test logs from https://jenkins.gke.csanchez.org/job/jenkinsci/job/kubernetes-plugin/job/PR-244/10/ ? I can run the |
776dafa
to
94f8d71
Compare
package org.csanchez.jenkins.plugins.kubernetes.pipeline | ||
|
||
podTemplate(label: 'mypod', containers: [ | ||
containerTemplate(name: 'busybox', image: 'startx/fedora', ttyEnabled: true, command: '/bin/cat'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not busybox
image ?
package org.csanchez.jenkins.plugins.kubernetes.pipeline | ||
|
||
podTemplate(label: 'mypod', namespace: 'default', idleMinutes: 0, containers: [ | ||
containerTemplate(name: 'busybox', image: 'startx/fedora', ttyEnabled: true, command: '/bin/cat'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not busybox image ?
for reference, the stack when agents try to reconnect is
|
@carlossg updated podTemplates in *.groovy to use If the tests fail again, can you help me and provide any logs from k8s side? |
now test logs are captured and removed the explicit namespace definition, that was the reason for failure, let's see |
@carlossg Thanks Seems like tests are failing due to leftover pods...can you take a look please? |
@carlossg added to |
BTW as a matter of courtesy to reviewers, it is nice to hyperlink to JIRA JENKINS-47476 in the PR description, and then also mark the JIRA In Review. |
(assuming this PR purports to fix the defect) |
@jglick Thanks. I will properly link to the JIRA and mark it as In Review in the future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not understand the K8S aspects, but there are some minor mistakes in generic Pipeline usage (not related to jenkinsci/durable-task-plugin#49).
} | ||
}); | ||
story.addStep(new Statement() { | ||
@SuppressWarnings("SleepWhileInLoop") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No such sleep
here AFAICT?
|
||
@Test | ||
public void getContainerLogWithRestart() throws Exception { | ||
story.addStep(new Statement() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW you are on java.level=8
(even though the supposed jenkins.version=2.32.1
runs on Java 7 and thus people actually using 2.32.x or 2.46.x on Java 7 and installing this would see Jenkins crash and burn if they installed this plugin), so you may as well use
story.then(r -> {
configureCloud();
r.jenkins.addNode(…);
// …
});
pom.xml
Outdated
@@ -3,7 +3,7 @@ | |||
<parent> | |||
<groupId>org.jenkins-ci.plugins</groupId> | |||
<artifactId>plugin</artifactId> | |||
<version>2.32</version> | |||
<version>2.37</version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or 3.0
pom.xml
Outdated
@@ -46,12 +46,12 @@ | |||
<jenkins.version>2.32.1</jenkins.version> | |||
|
|||
<kubernetes-client.version>2.6.1</kubernetes-client.version> | |||
<slf4j.version>1.7.13</slf4j.version> | |||
<slf4j.version>1.7.25</slf4j.version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this override should not be here at all; cf. jenkinsci/plugin-pom#73.
@@ -120,6 +120,12 @@ public void launch(SlaveComputer computer, TaskListener listener) { | |||
if (slave == null) { | |||
throw new IllegalStateException("Node has been removed, cannot launch " + computer.getName()); | |||
} | |||
if (launched) { | |||
LOGGER.info("Agent: " + slave.getNodeName() + " has already been launched. Activating..."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW it does not really matter for INFO
and above, but it is good to get in the habit of using either message formats or lambdas when logging nonconstant messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, I'd also prefer
LOGGER.log(INFO, "Agent has already been launched, activating: {}", slave.getNodeName());
SemaphoreStep.success("wait/1", false); | ||
// we need to wait until we are sure that the sh | ||
// step has started... | ||
Thread.sleep(10000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
, true)); | ||
WorkflowRun b = p.scheduleBuild2(0).waitForStart(); | ||
SemaphoreStep.waitForStart("wait/1", b); | ||
SemaphoreStep.success("wait/1", false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
WorkflowRun b = p.scheduleBuild2(0).waitForStart(); | ||
SemaphoreStep.waitForStart("wait/1", b); | ||
SemaphoreStep.success("wait/1", false); | ||
Thread.sleep(1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
@@ -24,8 +24,6 @@ | |||
|
|||
package org.jvnet.hudson.test; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eek! Cf. jenkinsci/jenkins-test-harness#84
} | ||
|
||
@Override | ||
protected JenkinsRule createJenkinsRule(Description description) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good to see jenkinsci/jenkins-test-harness#75 getting used at least.
@carlossg I believe I have addressed all of Jesse's comments. |
Latest test run shows this in logs:
and
|
@carlossg Still getting this warning which I think is causing tests to fail...: WARNING o.c.j.p.k.KubernetesCloud#provision: Failed to count the # of live instances on Kubernetes |
@carlossg Sorry. my bad. Some of the test groovy files had a namespace defined in them...I removed them and now tests are passing. |
private transient KubernetesClient client; | ||
|
||
@Override | ||
// TODO Revisit for JENKINS-40161 | ||
public void onResume() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced this is the proper implementation for this step. The parent class SynchronousNonBlockingStepExecution
makes it clear that resuming such a step is unsupported, the bulk of the execution being run in a different thread, which is not resumed on restart.
As I understand it, the current onResume
is just a no-op, so you may as well remove injection of 'client' as it won't be used in later calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you @Vlatombe ... I was just going for completeness (i.e. to have all pipeline steps resume) but as I look at it, the step is not a long running one like container() I will amend my PR |
- container step Includes: - update to parent pom to get latest jenkins-test-harness - update to durable-task plugin v1.16 [FIXES JENKINS-47476]
@@ -16,6 +16,8 @@ | |||
|
|||
package org.csanchez.jenkins.plugins.kubernetes.pipeline; | |||
|
|||
import com.google.inject.Inject; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some unused imports to clean up here.
@@ -4,30 +4,47 @@ | |||
import java.util.logging.Level; | |||
import java.util.logging.Logger; | |||
|
|||
import com.google.inject.Inject; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
@@ -5,12 +5,15 @@ | |||
import java.util.logging.Level; | |||
import java.util.logging.Logger; | |||
|
|||
import com.google.inject.Inject; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. CI is failing due to rebase I believe.
tests passing so 🎉 thanks for the hard work! |
This doesn't seem to have fixed anything for me. Jenkins is still rejecting old agents because it's an unknown client name. |
It turns out my plugin upgrade got reverted while I was testing by someone. This is working great! |
JENKINS-47476
[FIXES JENKINS-47476]