Restart preempted jobs v2 #113

ingwarsw · 2019-06-13T11:32:15Z

src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineComputer.java

src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineInstance.java

src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineRetentionStrategy.java

src/main/java/com/google/jenkins/plugins/computeengine/PreemptedCheckCallable.java

src/main/java/com/google/jenkins/plugins/computeengine/client/ComputeClient.java

...m/google/jenkins/plugins/computeengine/integration/ComputeEngineCloudRestartPreemptedIT.java

craigdbarber · 2019-06-13T20:06:00Z

Thanks for adding this functionality! Please be sure to run mvn verify to ensure the code formatter is executed against this code.

src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineRetentionStrategy.java

...m/google/jenkins/plugins/computeengine/integration/ComputeEngineCloudRestartPreemptedIT.java

ingwarsw · 2019-06-15T09:20:58Z

Can we proceed with this?
Let's not wait until another merge is needed.

craigdbarber · 2019-06-19T17:48:35Z

Ran through ITs and encountered a failure:

[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 391.28 s <<< FAILURE! - in com.google.jenkins.plugins.computeengine.integration.ComputeEngineCloudRestartPreemptedIT
[ERROR] testIfNodeWasPreempted(com.google.jenkins.plugins.computeengine.integration.ComputeEngineCloudRestartPreemptedIT) Time elapsed: 368.561 s <<< ERROR!
org.awaitility.core.ConditionTimeoutException: Condition with lambda expression in com.google.jenkins.plugins.computeengine.integration.ComputeEngineCloudRestartPreemptedIT that uses com.google.jenkins.plugins.computeengine.ComputeEngineComputer was not fulfilled within 5 minutes.
at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:136)
at org.awaitility.core.CallableCondition.await(CallableCondition.java:79)
at org.awaitility.core.CallableCondition.await(CallableCondition.java:27)
at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:840)
at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:802)
at com.google.jenkins.plugins.computeengine.integration.ComputeEngineCloudRestartPreemptedIT.testIfNodeWasPreempted(ComputeEngineCloudRestartPreemptedIT.java:127)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:553)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)

ingwarsw · 2019-06-20T10:08:27Z

That's a new test, and it takes longer than all the other ones because it creates 2 instances in sequence.
But I set its timeout to 15 minutes.

I'm not sure where that 5 minute timeout comes from.
But it should be longer.

stephenashank · 2019-06-20T19:23:01Z

Were you able to get the test running with a timeout of 15 minutes @ingwarsw? I just fetched this PR and the error message says that the preempted test timed out after 15 minutes.

stephenashank · 2019-06-20T19:36:34Z

It appears that the 5 minute timeout that Craig experienced came from the condition on line 127 failing to be true within 5 minutes after the simulated maintenance event. Given that 15 minutes is the sum of all those timeouts, it might be good to add a buffer of a few minutes to the class rule timeout.
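To make the arithmetic behind that suggestion concrete, here is a minimal sketch in plain Java. The individual wait durations are hypothetical placeholders chosen only so that they add up to 15 minutes, and `TimeoutBudget` is not a class from this PR; the real values are whatever the awaits in the IT specify.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: derives an overall test timeout from the per-step waits. */
class TimeoutBudget {

  /** Sums the per-step timeouts and adds a safety buffer on top. */
  static Duration overallTimeout(List<Duration> stepTimeouts, Duration buffer) {
    Duration total = Duration.ZERO;
    for (Duration step : stepTimeouts) {
      total = total.plus(step);
    }
    return total.plus(buffer);
  }

  public static void main(String[] args) {
    // Placeholder values, not taken from the PR: three 5 minute waits.
    List<Duration> steps =
        Arrays.asList(Duration.ofMinutes(5), Duration.ofMinutes(5), Duration.ofMinutes(5));
    Duration timeout = overallTimeout(steps, Duration.ofMinutes(3));
    System.out.println("Overall timeout: " + timeout.toMinutes() + " minutes");
  }
}
```

With an outer timeout that is strictly larger than the sum of the inner waits, a failing inner await reports its own, more specific error before the outer limit cuts the test off.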

For me, this is what the log output from the test looks like once connecting to the agent by SSH is successful. Hopefully this helps with debugging:

  92.569 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: Connecting to 34.83.84.9 on port 22, with timeout 10000.
  92.755 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: Connected via SSH.
  92.844 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: connect fresh as root
  93.092 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: Connecting to 34.83.84.9 on port 22, with timeout 10000.
  93.206 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: Connected via SSH.
  93.336 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: Verifying: java -fullversion
  93.419 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: Copying agent.jar to: /tmp
  93.637 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: Launching Jenkins agent via plugin SSH: java -jar /tmp/agent.jar
  97.752 [id=59]	INFO	c.g.j.p.c.ComputeEngineComputer#onConnected: Instance integration-q0riqj is preemptive, setting up preemption listener
  99.164 [id=78]	INFO	c.g.j.p.c.ComputeEngineCloud#lambda$getPlannedNodeFuture$0: 86734ms elapsed waiting for node integration-q0riqj to connect
 636.983 [id=119]	INFO	hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel integration-q0riqj.
java.util.concurrent.TimeoutException: Ping started at 1561057859628 hasn't completed by 1561058099629
	at hudson.remoting.PingThread.ping(PingThread.java:134)
	at hudson.remoting.PingThread.run(PingThread.java:90)
 703.635 [id=135]	INFO	hudson.model.AsyncPeriodicWork$1#run: Started Connection Activity monitoring to agents
 703.637 [id=135]	INFO	hudson.model.AsyncPeriodicWork$1#run: Finished Connection Activity monitoring to agents. 2 ms
 900.009 [id=14]	INFO	c.g.j.p.c.integration.ITUtil#teardownResources: teardown

craigdbarber · 2019-06-20T20:07:27Z

...m/google/jenkins/plugins/computeengine/integration/ComputeEngineCloudRestartPreemptedIT.java

+        .until(() -> computer.getLog().contains("listening to metadata for preemption event"));
+
+    client.simulateMaintenanceEvent(PROJECT_ID, ZONE, name);
+    Awaitility.await().timeout(5, TimeUnit.MINUTES).until(computer::getPreempted);


This 5 minute timeout is causing the IT to fail.

It seems that this is system dependent, as it doesn't fail here when I run the tests.
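For readers unfamiliar with timed awaits, the behaviour under discussion can be sketched with a small polling loop in plain Java. This is a hypothetical stand-in (`PollingAwait` is an invented name) and not Awaitility's implementation; it only illustrates that a longer timeout helps when the condition is slow to become true, and does not help when the condition never becomes true.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for a polling await; not Awaitility's implementation. */
class PollingAwait {

  /**
   * Polls the condition until it holds.
   * Returns true if the condition was fulfilled, false if the timeout elapsed
   * first or the waiting thread was interrupted.
   */
  static boolean awaitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out before the condition became true
      }
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    long start = System.nanoTime();
    // A slow condition: becomes true after roughly 50 ms, inside the 2 second limit.
    boolean slow = awaitUntil(
        () -> System.nanoTime() - start > 50_000_000L,
        Duration.ofSeconds(2), Duration.ofMillis(10));
    // A condition that never holds: no timeout is long enough.
    boolean never = awaitUntil(() -> false, Duration.ofMillis(100), Duration.ofMillis(10));
    System.out.println("slow condition fulfilled: " + slow);
    System.out.println("impossible condition fulfilled: " + never);
  }
}
```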

rachely3n · 2019-06-20T20:12:03Z

I'm planning to cut a release today, so in the event that we can merge today, let's wait until the release.

stephenashank · 2019-06-20T20:28:54Z

I tried it with a 20 minute timeout and the same thing happened. After both runs, the preempted machine still shows up in my list of Compute Engine instances in a stopped state. On this second run, the hudson.model.AsyncPeriodicWork#run method doesn't show up in the logs; otherwise it was nearly identical, apart from the teardown happening after 20 minutes rather than 15.

ingwarsw · 2019-06-20T23:15:02Z

All the ITs in this project are highly dependent on the network speed between the machine running the tests and GCP.

Most of the other tests don't need to fully start an instance.
Here we need to send all the jobs that will run on the slave (because the task that listens for the GCP preemption event is that kind of job).
Any idea how to make these tests more stable?

craigdbarber · 2019-06-21T17:15:16Z

Agree with the idea of spending some cycles on improving IT reliability. Suggest creating an issue to track the work. In the immediate term, let's extend the timeouts hard-coded into the awaits in the offending IT so that we can get this PR merged. Thanks.

ingwarsw · 2019-06-22T09:12:21Z

@craigdbarber I increased the timeouts a bit.
But if you have a slow connection (especially upload), it will not be enough.

We should create a pipeline to run the tests automatically.
If you run them manually, it will always be a lot of pain.

stephenashank · 2019-06-24T23:32:27Z

Have you been able to run the integration test successfully @ingwarsw ?

I was debugging the test and confirmed that it reached line 129; however, it times out while waiting for the call to taskFuture.get() to finish. See my logs above, as those have not changed.

ingwarsw · 2019-06-25T08:15:16Z

@stephenashank Yup, it works for me.

Can you check whether your instance receives the preemption event?
The best place to look is here:
https://console.cloud.google.com/compute/operations

ingwarsw · 2019-06-25T08:17:08Z

@stephenashank And from the log I see it should break 2 lines up,
not on the line that you're showing.

It seems like you have a really slow upload speed.
Can you try it on a GCP machine?

stephenashank · 2019-06-26T00:45:25Z

I attempted to do this on a few different GCP machines such that they were in the same zone as the instances being created. I also disabled all other integration tests while running to keep the network and processing resources dedicated to this test. Despite this, not much has changed in terms of where the test times out.

From my most recent run, the operations I can see are "Create an Instance", "simulateMaintenanceEvent", and "Instance preempted". The machine was never brought back up, it remained in the stopped state, and was never deleted afterwards.

After running into firewall issues on my earlier runs I'm beginning to suspect there's a difference in the network setup inherent to the projects or organizations we're using outside of the instance configuration we specify in the test.

craigdbarber · 2019-06-26T22:17:52Z

src/main/java/com/google/jenkins/plugins/computeengine/PreemptedCheckCallable.java

+  }
+
+  private HttpRequest createMetadataRequest() throws IOException {
+    HttpTransport transport = new NetHttpTransport();


Taking another look at this. This won't work as is. This logic needs to be consolidated with the ClientFactory, and the request needs to be built using the GoogleClientRequestInitializer similar to the approach used for the ComputeClient. A number of our customers are running their masters on-prem and thus won't have the GCP SA baked into the VM metadata, which is what would be required for this to work as is.

This is what is causing the ITs to fail for me btw, as I'm running them on a VM not in GCE.

But this code runs on the slave side, not on the master, so it should work.
Could you explain how you would like it to run?

Anything other than calling the metadata server will not work, because all other sources have a delay.

And I run the tests from my own computer, definitely not in GCE, and it works.
But maybe there is something wrong here,
because it seems like it's not catching what it should.
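For context on the metadata call being debated, below is a minimal sketch of a request for the Compute Engine metadata server's preemption flag. It is not the PR's PreemptedCheckCallable: it uses the JDK's HttpURLConnection instead of the Google HTTP client so that it stays self-contained, the class and method names are hypothetical, and it assumes the documented `instance/preempted` endpoint, the `wait_for_change` query parameter, the required `Metadata-Flavor: Google` header, and the `TRUE`/`FALSE` response values. The connection is only prepared, never opened, because the metadata server is reachable only from inside a GCE VM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical sketch of a request for the GCE metadata server's preemption flag. */
class PreemptionMetadataRequest {

  // wait_for_change=true asks the server to hold the request open until the
  // value changes, so the caller learns about preemption without polling.
  static final String PREEMPTED_URL =
      "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
          + "?wait_for_change=true";

  /** Prepares the connection without opening it. */
  static HttpURLConnection prepare() {
    try {
      HttpURLConnection connection =
          (HttpURLConnection) URI.create(PREEMPTED_URL).toURL().openConnection();
      connection.setRequestMethod("GET");
      // The metadata server rejects requests that lack this header.
      connection.setRequestProperty("Metadata-Flavor", "Google");
      return connection;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Interprets the response body; assumes the server answers TRUE or FALSE. */
  static boolean isPreempted(String responseBody) {
    return "TRUE".equals(responseBody.trim());
  }

  public static void main(String[] args) {
    HttpURLConnection connection = prepare();
    System.out.println("URL: " + connection.getURL());
    System.out.println("Metadata-Flavor: " + connection.getRequestProperty("Metadata-Flavor"));
  }
}
```

Because the metadata hostname resolves only on the VM itself, a request like this has to be issued from the agent instance, which is the point made in the reply above.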

@craigdbarber Did you find out why the tests are failing on your side?

Not yet. Both @stephenashank and I are encountering the same problem. Could you do us a favor and run this command:

gcloud projects get-iam-policy \
  --flatten="bindings[].members" \
  --format='table(bindings.role)' \
  --filter="bindings.members:"

And paste the results into this thread. Perhaps that will help us get to the bottom of this.

ROLE
roles/editor

Any luck? I can try to get this going in my own environment.

stephenashank · 2019-07-25T17:59:09Z

src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineComputer.java

+                  getChannel().close();
+                }
+                return value;
+              } catch (InterruptedException|ExecutionException|IOException e) {


When I ran the tests, this code from your recent commit was reformatted, so just run mvn compile and commit the result.

stephenashank

LGTM after reformatting, and any final comments @craigdbarber might have.

craigdbarber

ITs are now passing, thanks!
LGTM

ingwarsw added 3 commits June 13, 2019 13:19

Restart preempted jobs v2

ef2a2ac

Moving to lombok logger

052d59a

Add integration tests

f6ba90e