Restart preempted jobs v2 #113

Merged: 14 commits, Jul 25, 2019

Conversation

ingwarsw
Contributor

@craigdbarber
Contributor

Thanks for adding this functionality! Please be sure to run mvn verify to ensure the code formatter is executed against this code.

@ingwarsw ingwarsw mentioned this pull request Jun 14, 2019
@ingwarsw ingwarsw force-pushed the pr-preemptive-merge3 branch from 825053f to a3d15ee Compare June 14, 2019 14:19
@ingwarsw ingwarsw marked this pull request as ready for review June 14, 2019 14:20
@ingwarsw
Contributor Author

Can we move this forward, rather than waiting until it needs another merge?

@craigdbarber
Contributor

Ran through ITs and encountered a failure:

[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 391.28 s <<< FAILURE! - in com.google.jenkins.plugins.computeengine.integration.ComputeEngineCloudRestartPreemptedIT
[ERROR] testIfNodeWasPreempted(com.google.jenkins.plugins.computeengine.integration.ComputeEngineCloudRestartPreemptedIT) Time elapsed: 368.561 s <<< ERROR!
org.awaitility.core.ConditionTimeoutException: Condition with lambda expression in com.google.jenkins.plugins.computeengine.integration.ComputeEngineCloudRestartPreemptedIT that uses com.google.jenkins.plugins.computeengine.ComputeEngineComputer was not fulfilled within 5 minutes.
at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:136)
at org.awaitility.core.CallableCondition.await(CallableCondition.java:79)
at org.awaitility.core.CallableCondition.await(CallableCondition.java:27)
at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:840)
at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:802)
at com.google.jenkins.plugins.computeengine.integration.ComputeEngineCloudRestartPreemptedIT.testIfNodeWasPreempted(ComputeEngineCloudRestartPreemptedIT.java:127)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:553)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)

@ingwarsw
Contributor Author

That's a new test, and it takes longer than all the others because it creates 2 instances in sequence.
But I set its timeout to 15 minutes.

I'm not sure where your 5-minute timeout comes from.
It should be longer than that.

@stephenashank
Contributor

Were you able to get the test running with a timeout of 15 minutes @ingwarsw? I just fetched this PR and the error message says that the preempted test timed out after 15 minutes.

@stephenashank
Contributor

It appears that the 5 minute timeout that Craig experienced came from the condition on line 127 failing to be true within 5 minutes after the simulated maintenance event. Given that 15 minutes is the sum of all those timeouts, it might be good to add a buffer of a few minutes to the class rule timeout.
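
The arithmetic behind that suggestion can be sketched as follows. This is only an illustration: `TimeoutBudget` is a hypothetical helper that is not part of the plugin, and the three 5-minute awaits and the 3-minute buffer are assumed values (the thread confirms only one 5-minute await and the 15-minute class rule).

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: derives a class-level timeout from the per-step awaits. */
final class TimeoutBudget {
    private TimeoutBudget() {}

    /** Sums the per-step await timeouts and adds a safety buffer on top. */
    static Duration classTimeout(List<Duration> awaitTimeouts, Duration buffer) {
        Duration total = Duration.ZERO;
        for (Duration timeout : awaitTimeouts) {
            total = total.plus(timeout);
        }
        return total.plus(buffer);
    }

    public static void main(String[] args) {
        // Assumed values for illustration only.
        List<Duration> awaits = Arrays.asList(
                Duration.ofMinutes(5), Duration.ofMinutes(5), Duration.ofMinutes(5));
        Duration timeout = classTimeout(awaits, Duration.ofMinutes(3));
        System.out.println("Class rule timeout: " + timeout.toMinutes() + " minutes");
    }
}
```

With a class timeout derived this way, a failing await reports its own condition timeout before the class rule cuts the whole test off.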

For me, this is what the log output from the test looks like once the SSH connection to the agent succeeds. Hopefully this helps with debugging:

  92.569 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: Connecting to 34.83.84.9 on port 22, with timeout 10000.
  92.755 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: Connected via SSH.
  92.844 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: connect fresh as root
  93.092 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: Connecting to 34.83.84.9 on port 22, with timeout 10000.
  93.206 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: Connected via SSH.
  93.336 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: Verifying: java -fullversion
  93.419 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: Copying agent.jar to: /tmp
  93.637 [id=59]	INFO	c.g.j.p.c.ComputeEngineCloud#log: Launching Jenkins agent via plugin SSH: java -jar /tmp/agent.jar
  97.752 [id=59]	INFO	c.g.j.p.c.ComputeEngineComputer#onConnected: Instance integration-q0riqj is preemptive, setting up preemption listener
  99.164 [id=78]	INFO	c.g.j.p.c.ComputeEngineCloud#lambda$getPlannedNodeFuture$0: 86734ms elapsed waiting for node integration-q0riqj to connect
 636.983 [id=119]	INFO	hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel integration-q0riqj.
java.util.concurrent.TimeoutException: Ping started at 1561057859628 hasn't completed by 1561058099629
	at hudson.remoting.PingThread.ping(PingThread.java:134)
	at hudson.remoting.PingThread.run(PingThread.java:90)
 703.635 [id=135]	INFO	hudson.model.AsyncPeriodicWork$1#run: Started Connection Activity monitoring to agents
 703.637 [id=135]	INFO	hudson.model.AsyncPeriodicWork$1#run: Finished Connection Activity monitoring to agents. 2 ms
 900.009 [id=14]	INFO	c.g.j.p.c.integration.ITUtil#teardownResources: teardown

.until(() -> computer.getLog().contains("listening to metadata for preemption event"));

client.simulateMaintenanceEvent(PROJECT_ID, ZONE, name);
Awaitility.await().timeout(5, TimeUnit.MINUTES).until(computer::getPreempted);
Contributor

This 5 minute timeout is causing the IT to fail.

Contributor

It seems that this is system dependent, as it doesn't fail here when I run the tests.

@rachely3n
Contributor

I'm planning to cut a release today, so in the event that we can merge today, let's wait until the release.

@stephenashank
Contributor

I tried it with a 20 minute timeout and the same thing happened. After both runs, the preempted machine still shows up in my list of compute engine instances in a stopped state. On this second run, the hudson.model.AsyncPeriodicWork#run method doesn't show up in the logs, otherwise it was nearly identical apart from the teardown happening after 20 minutes rather than 15.

@ingwarsw
Contributor Author

All ITs in this project depend heavily on the network speed between the machine running the tests and GCP.

Most of the other tests don't need to fully start an instance.
Here we need to send every job that will run on the agent (because the task that listens for the GCP preemption event is that kind of job).
Any ideas on how to make these tests more stable?

@craigdbarber
Contributor

Agree with the idea of spending some cycles on improving IT reliability. Suggest creating an issue to track the work. In the immediate term, let's extend the timeouts hard-coded into the awaits in the offending IT so that we can get this PR merged. Thanks.
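
One way to avoid hard-coding those timeouts is to read them from a system property with a default. The sketch below is a standard-library stand-in, not the test's actual Awaitility code: `ConfigurableAwait`, its methods, and the property name `it.preemption.timeout.minutes` are all hypothetical.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in for an await whose timeout can be overridden per machine. */
final class ConfigurableAwait {
    private ConfigurableAwait() {}

    /** Reads a timeout in minutes from a system property, falling back to a default. */
    static Duration timeoutFromProperty(String property, Duration fallback) {
        String raw = System.getProperty(property);
        if (raw == null || raw.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Duration.ofMinutes(Long.parseLong(raw.trim()));
        } catch (NumberFormatException e) {
            return fallback; // Unparseable override: keep the default.
        }
    }

    /** Polls the condition until it holds or the timeout elapses; returns whether it held. */
    static boolean awaitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingMillis = (deadline - System.nanoTime()) / 1_000_000;
            if (remainingMillis <= 0) {
                return false;
            }
            try {
                Thread.sleep(Math.min(pollInterval.toMillis(), remainingMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // Preserve the interrupt for callers.
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical property name; run with -Dit.preemption.timeout.minutes=20 to override.
        Duration timeout = timeoutFromProperty("it.preemption.timeout.minutes", Duration.ofMinutes(5));
        long start = System.currentTimeMillis();
        boolean met = awaitUntil(
                () -> System.currentTimeMillis() - start > 100, timeout, Duration.ofMillis(20));
        System.out.println("Condition met: " + met);
    }
}
```

A reviewer on a slow connection could then raise the limit from the command line without editing the test.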

@ingwarsw
Contributor Author

@craigdbarber Timeouts increased a bit.
But if you have a slow connection (especially upload), it still won't be enough.

We should create a pipeline to run the tests automatically.
If you run them manually, it will always be a lot of pain.

@stephenashank
Contributor

stephenashank commented Jun 24, 2019

Have you been able to run the integration test successfully @ingwarsw ?

I was debugging the test and confirmed that it reached line 129, however it times out while waiting for the call to taskFuture.get() to finish. See my logs above as those have not changed.

@ingwarsw
Contributor Author

@stephenashank Yup, it works for me.

Can you check whether your instance catches the preemption event?
The best place to look is here:
https://console.cloud.google.com/compute/operations

@ingwarsw
Contributor Author

@stephenashank Also, from the log I see it should break 2 lines up,
not on the line that you're showing.

It seems like you have a really slow upload speed.
Can you try it on a GCP machine?

@stephenashank
Contributor

I attempted to do this on a few different GCP machines such that they were in the same zone as the instances being created. I also disabled all other integration tests while running to keep the network and processing resources dedicated to this test. Despite this, not much has changed in terms of where the test times out.

From my most recent run, the operations I can see are "Create an Instance", "simulateMaintenanceEvent", and "Instance preempted". The machine was never brought back up, it remained in the stopped state, and was never deleted afterwards.

After running into firewall issues on my earlier runs I'm beginning to suspect there's a difference in the network setup inherent to the projects or organizations we're using outside of the instance configuration we specify in the test.

}

private HttpRequest createMetadataRequest() throws IOException {
HttpTransport transport = new NetHttpTransport();
Contributor

Taking another look at this. This won't work as is. This logic needs to be consolidated with the ClientFactory, and the request needs to be built using the GoogleClientRequestInitializer, similar to the approach used for the ComputeClient. A number of our customers are running their masters on-prem and thus won't have the GCP SA baked into the VM metadata, which is what would be required for this to work as is.

Contributor

This is what is causing the ITs to fail for me btw, as I'm running them on a VM not in GCE.

Contributor Author

But this code runs on the agent side, not the master, so it should work.
Could you explain how you would like it to run?

Nothing except calling the metadata server will work, because all other sources have a delay.
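
For readers unfamiliar with the metadata approach being discussed, here is a rough sketch of such a request. It is not the PR's `createMetadataRequest()`: it swaps the google-http-client transport for plain `java.net`, and `PreemptionMetadataSketch` is a hypothetical name. The endpoint and the `Metadata-Flavor: Google` header follow the Compute Engine metadata server conventions, and that server is only reachable from inside a GCE instance, so the sketch builds the request without sending it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical sketch of a blocking "preempted" metadata query. */
final class PreemptionMetadataSketch {
    /** wait_for_change=true asks the server to hold the request until the value changes. */
    static final String PREEMPTED_URL =
            "http://metadata.google.internal/computeMetadata/v1/instance/preempted?wait_for_change=true";

    private PreemptionMetadataSketch() {}

    /** Builds the request without connecting; reading the response would block until a change. */
    static HttpURLConnection buildPreemptedRequest() {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(PREEMPTED_URL).toURL().openConnection();
            connection.setRequestMethod("GET");
            // The metadata server rejects requests that lack this header.
            connection.setRequestProperty("Metadata-Flavor", "Google");
            // A read timeout of 0 means wait indefinitely, which suits a hanging GET.
            connection.setReadTimeout(0);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection request = buildPreemptedRequest();
        // Only prints the target; no network traffic happens here.
        System.out.println(request.getRequestMethod() + " " + request.getURL());
    }
}
```

Because the query is answered by the instance's own metadata server, it needs no service account credentials on the machine that issues it, but it cannot succeed on a machine outside GCE.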

Contributor Author

And I run the tests from my own computer, definitely not in GCE, and it works.
But maybe there is something wrong here,
because it seems like it's not catching what it should.

Contributor Author

@craigdbarber Did you find out why the tests are failing on your side?

Contributor

Not yet. Both @stephenashank and I are encountering the same problem. Could you do us a favor and run this command: gcloud projects get-iam-policy
--flatten="bindings[].members"
--format='table(bindings.role)'
--filter="bindings.members:"
And paste the results into this thread. Perhaps that will help us get to the bottom of this.

Contributor Author

ROLE
roles/editor

Any luck? I can try to get this going in my own environment.

getChannel().close();
}
return value;
} catch (InterruptedException|ExecutionException|IOException e) {
Contributor

When I ran the tests this code from your recent commit was reformatted, so just run mvn compile and commit the result.

Contributor

@stephenashank stephenashank left a comment

LGTM after reformatting, and any final comments @craigdbarber might have.

Contributor

@craigdbarber craigdbarber left a comment

ITs are now passing, thanks!
LGTM

5 participants