fix(Jenkinsfile): Fix jenkins pipline for CI #67
Conversation
I think I need to add the params and get rid of the RIOT submodule change.
Jenkinsfile
Outdated
stash name: 'sources'
script {
  for (i = 0; i < nodes.size(); i++) {
    echo "${nodes[i]}"
Is this `script` block a left-over from debugging, or do you want to keep it for informational purposes?
Debugging, thanks!
Also, it seems like I am not getting the notifications... I don't know why... yet!
I am wondering if the overall timeout is a good idea, as it may cause problems when there are many jobs in the queue: the nodes can be blocked for a long time, but the master ticker may still be going...
I guess having that global timeout throughout the job lifetime is good. It's very unlikely, but if we observe hangs in the setup or notification phase (or future stages), the global timeout seems to be our only rescue. Of course, we could also wrap each of those stages in a local timeout, but the global one is more convenient.
Good; the problem is the timing. If I start 10 jobs at once, the last one has to wait for the nodes to free up, meaning my timeout would need to be some function of the number of running jobs (it would take at least 3 hours to run through 10 jobs). Anyway, it is currently set at 1 hour, which I think is fine if we don't have to wait for other jobs to finish with the node, but that is not the case right now. What would be a good balance?
Darn, it also seems like catching the errors prevents timeouts and aborts. Maybe for the time being I'll increase everything to something that should work, and we can tune it later once I figure out how to capture error types (i.e. whether a timeout occurred or a stop message occurred).
Oh man... the timeout actually seems not too nice...
Force-pushed from 2ed734d to 3196db6
Maybe it is ready. It could still use some work, but there was at least one case where the timeouts and exiting worked out well. It would be nice to get this in by the end of the day.
Jenkinsfile
Outdated
])
def runParallel(args) {
  parallel args.items.collectEntries { name -> [ "${name}": {
    // We want to timeout of a node doesn't respond in 15 mins
s/of/if
Jenkinsfile
Outdated
    stepFlash(tests[i])
    stepTest(tests[i])
    stepArchiveTestResults(tests[i])
  } catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e) {
why this particular exception?
It is the timeout or abort exception. Without it, a timeout would only cancel one test on a node.
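For context, here is a hedged sketch of the pattern being described (the loop structure is inferred from the diff snippets above; the failure-logging branch is an assumption). `FlowInterruptedException` is what Jenkins delivers to a running step on a build abort or an expired `timeout`, so rethrowing it breaks out of the whole per-node loop instead of only failing the current test:

```groovy
// Sketch: run each test on this node; a timeout/abort stops the whole
// loop, while any other failure only affects the current test.
for (int i = 0; i < tests.size(); i++) {
    try {
        stepFlash(tests[i])                // step names taken from the diff above
        stepTest(tests[i])
        stepArchiveTestResults(tests[i])
    } catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e) {
        throw e                            // timeout/abort: stop everything on this node
    } catch (e) {
        echo "Test ${tests[i]} failed: ${e.message}"  // assumed: continue with next test
    }
}
```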
Jenkinsfile
Outdated
}
if (caughtException) {
  // This should exit out of the node that failed
  error caughtException.message
Why don't we move this line into the `catch` statement? The surrounding `if` seems to be a bit verbose... (below, too)
Then it gets caught by the `catchError`, which sets the build status and stage status. This is what is required to exit out.
I am willing to say there is a better way than using the `catchError` call, though. I haven't tested it.
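One way to read this pattern (a hedged sketch of the mechanism described above, not the actual diff; `runTestsOnNode` is a hypothetical helper): the try/catch stores the exception so the per-test loop can finish cleanly, the trailing `error` re-raises it, and the enclosing `catchError` converts that into build and stage status:

```groovy
// Sketch of the catchError pattern discussed above (names hypothetical).
catchError(buildResult: 'FAILURE', stageResult: 'FAILURE') {
    def caughtException = null
    try {
        runTestsOnNode()            // hypothetical helper running the test loop
    } catch (e) {
        caughtException = e         // remember the error instead of failing immediately
    }
    if (caughtException) {
        // Re-raise so catchError sees the failure and marks the build/stage.
        error caughtException.message
    }
}
```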
Is there a test run that I can look at? Back on Jenkins I couldn't find any.
Looks like there is still some work. Darn. Will take care of it tomorrow.
I tried to simplify the error handling. Should I just call it quits and use a `catchError` with a try/catch that lets me throw the caught error outside the `catchError` context? Or can we accept that things look like they are passing when they are not (we still get correct test results)? Or should I continue to search for a way to try/catch and only fail that stage? For some reason the robot-test fail case seems to function properly, as the unstable status is showing up.
It appears that the current stash and unstash have some issues with asynchronous behaviour. For example, an unstash occurs on a node in a working directory different from what is expected when running a test. It also appears that some directories are not being cleaned. The following commit makes a number of changes to fix that:
- make the Jenkinsfile declarative, as this is better supported
- run all tests on a node before releasing it, to fix shared-workspace problems
- clean up function names and steps to make them more readable
- add a timeout to the overall process and a per-node timeout after the node starts
- handle errors so that a failed unstash only stops that node
- allow a timeout/stop to exit the whole set of tests
Force-pushed from eacc5f9 to a72ab97
I confirmed the node timeout only starts ticking after the node is acquired. I set it to 1 hour and the whole process to 3 hours. There are still some strange things happening when we try to stop while it is changing states, but it just requires an additional stop and then it seems fine. I think we can leave it for now, as we have not had too many lockup problems yet.
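The behaviour confirmed above follows from where the `timeout` step is nested (a sketch under stated assumptions: the `parallel`/`node` structure and `runTests` helper are hypothetical; the 1-hour and 3-hour values are the ones mentioned):

```groovy
// Sketch of the two timeout scopes discussed above.
timeout(time: 3, unit: 'HOURS') {          // global: covers queueing and all stages
    parallel nodes.collectEntries { name -> [ (name): {
        node(name) {
            // This inner timeout only starts ticking once the executor is
            // acquired, so time spent waiting in the queue does not count.
            timeout(time: 1, unit: 'HOURS') {
                runTests(name)             // hypothetical helper
            }
        }
    } ] }
}
```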
This rework greatly improves the pipeline design and reduces the overall build time. We can address the remaining minor irks and quirks in follow-up PRs to keep the diff minimal. ACK!
Thanks for all the help!
Contribution Description
It appears that the current stash and unstash have some issues with asynchronous behaviour. For example, an unstash occurs on a node in a working directory different from what is expected when running a test. It also appears that some directories are not being cleaned.
The following PR makes a number of changes to fix that:
- make the Jenkinsfile declarative, as this is better supported
- run all tests on a node before releasing it, to fix shared-workspace problems
- clean up function names and steps to make them more readable
- add a timeout to the overall process and a per-node timeout after the node starts
- handle errors so that a failed unstash only stops that node
- allow a timeout/stop to exit the whole set of tests
Testing Procedure
Check the CI. I don't know how the params will affect everything, but it is better than the current situation, and we can always fix it later if an issue occurs.
Related Issues
Checks some boxes on #66