-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restart preempted jobs v2 #113
Restart preempted jobs v2 #113
Conversation
ingwarsw
commented
Jun 13, 2019
src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineComputer.java
Show resolved
Hide resolved
src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineComputer.java
Show resolved
Hide resolved
src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineComputer.java
Show resolved
Hide resolved
src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineInstance.java
Show resolved
Hide resolved
src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineRetentionStrategy.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineRetentionStrategy.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineRetentionStrategy.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineRetentionStrategy.java
Show resolved
Hide resolved
src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineRetentionStrategy.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/jenkins/plugins/computeengine/PreemptedCheckCallable.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/jenkins/plugins/computeengine/client/ComputeClient.java
Show resolved
Hide resolved
...m/google/jenkins/plugins/computeengine/integration/ComputeEngineCloudRestartPreemptedIT.java
Outdated
Show resolved
Hide resolved
...m/google/jenkins/plugins/computeengine/integration/ComputeEngineCloudRestartPreemptedIT.java
Show resolved
Hide resolved
Thanks for adding this functionality! Please be sure to run mvn verify to ensure the code formatter is executed against this code. |
825053f
to
a3d15ee
Compare
src/main/java/com/google/jenkins/plugins/computeengine/ComputeEngineRetentionStrategy.java
Outdated
Show resolved
Hide resolved
...m/google/jenkins/plugins/computeengine/integration/ComputeEngineCloudRestartPreemptedIT.java
Outdated
Show resolved
Hide resolved
...m/google/jenkins/plugins/computeengine/integration/ComputeEngineCloudRestartPreemptedIT.java
Outdated
Show resolved
Hide resolved
...m/google/jenkins/plugins/computeengine/integration/ComputeEngineCloudRestartPreemptedIT.java
Outdated
Show resolved
Hide resolved
Can we proceed further? |
Ran through ITs and encountered a failure: [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 391.28 s <<< FAILURE! - in com.google.jenkins.plugins.computeengine.integration.ComputeEngineCloudRestartPreemptedIT |
Thats new test, and its longer than all other ones cause its creating 2 instances in sequence... Im not sure from where you have that 5 minutes timeout set.. |
Were you able to get the test running with a timeout of 15 minutes @ingwarsw? I just fetched this PR and the error message says that the preempted test timed out after 15 minutes. |
It appears that the 5 minute timeout that Craig experienced came from the condition on line 127 failing to be true within 5 minutes after the simulated maintenance event. Given that 15 minutes is the sum of all those timeouts, it might be good to add a buffer of a few minutes to the class rule timeout. For me, this is what the log output from the test looks like once connecting to the agent by SSH is successful, hopefully this helps with debugging:
|
.until(() -> computer.getLog().contains("listening to metadata for preemption event")); | ||
|
||
client.simulateMaintenanceEvent(PROJECT_ID, ZONE, name); | ||
Awaitility.await().timeout(5, TimeUnit.MINUTES).until(computer::getPreempted); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This 5 minute timeout is causing the IT to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this is system dependent, as it doesn't fail here when I run the tests.
I'm planning to cut a release today, so in the event that we can merge today, let's wait until the release. |
I tried it with a 20 minute timeout and the same thing happened. After both runs, the preempted machine still shows up in my list of compute engine instances in a stopped state. On this second run, the hudson.model.AsyncPeriodicWork#run method doesn't show up in the logs, otherwise it was nearly identical apart from the teardown happening after 20 minutes rather than 15. |
All IT tests in this project are highly dependant from network speed from machine on which tests are running to GCP.. Most of other tests dont need to fully start instance.. |
Agree with the idea of spending some cycles on improving IT reliability. Suggest creating an issue to track the work. In the immediate, let's extend the timeouts hard-coded into the awaits in the offending IT so that we can get this PR merged. Thanks. |
@craigdbarber Timeouts increased a bit.. We should create pipeline to run tests automatically.. |
Have you been able to run the integration test successfully @ingwarsw ? I was debugging the test and confirmed that it reached line 129, however it times out while waiting for the call to taskFuture.get() to finish. See my logs above as those have not changed. |
@stephenashank Yup works for me.. Can you see if your instance catches preemptive event? |
@stephenashank And from log I see it should break 2 lines UP.. Seems like you have really slow upload speed.. |
I attempted to do this on a few different GCP machines such that they were in the same zone as the instances being created. I also disabled all other integration tests while running to keep the network and processing resources dedicated to this test. Despite this, not much has changed in terms of where the test times out. From my most recent run, the operations I can see are "Create an Instance", "simulateMaintenanceEvent", and "Instance preempted". The machine was never brought back up, it remained in the stopped state, and was never deleted afterwards. After running into firewall issues on my earlier runs I'm beginning to suspect there's a difference in the network setup inherent to the projects or organizations we're using outside of the instance configuration we specify in the test. |
} | ||
|
||
private HttpRequest createMetadataRequest() throws IOException { | ||
HttpTransport transport = new NetHttpTransport(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking another look at this. This won't work as is. This logic needs to be consolidated with the ClientFactory, and the request needs to be built using the GoogleClientRequestInitializer similar to the approached used for the ComputeClient. A number of our customers are running their masters on-prem and thus won't have the GCP SA baked into the VM metadata, which is what would be required for this to work as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what is causing the ITs to fail for me btw, as I'm running them on a VM not in GCE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this code is running on slave
side.. not master.. so it should work..
Could you explain how would you like it to run?
Anything except calling metadata will not work cause all other sources have delay..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I run tests from my own computer.. not in GCE for sure and it works..
But maybe there is something wrong here..
Cause it seems like its not catching what it should..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@craigdbarber Did you found why tests on your side are failing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet. Both @stephenashank and I are encountering the same problem. Could you do us a favor an run this command: gcloud projects get-iam-policy
--flatten="bindings[].members"
--format='table(bindings.role)'
--filter="bindings.members:"
And paste the results into this thread. Perhaps that will help us get to the bottom of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ROLE
roles/editor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any luck? can try to get this going on my own environment
getChannel().close(); | ||
} | ||
return value; | ||
} catch (InterruptedException|ExecutionException|IOException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I ran the tests this code from your recent commit was reformatted, so just run mvn compile
and commit the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM after reformatting, and any final comments @craigdbarber might have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ITs are now passing, thanks!
LGTM