
fix(Jenkinsfile): Fix Jenkins pipeline for CI #67

Merged: 1 commit merged into master from test/jenkinsfile0 on May 18, 2020

Conversation

MrKevinWeiss
Collaborator

Contribution Description

It appears that the current stash and unstash have some issues with asynchronous behaviour. For example, an unstash occurs on a node in a working directory that is different from what is expected when running a test. It also appears that some directories are not being cleaned.

The following PR makes a number of changes to fix that:

  • Make the Jenkinsfile declarative, as this is better supported (see the sketch after this list)
  • Run all tests on a node before releasing it, to fix any shared-workspace problems
  • Clean up function names and steps to make the pipeline more readable
  • Add timeouts: 1 hour for the overall process and 45 minutes per node
  • Handle errors so that if an unstash fails, only that node is stopped
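
For illustration, a minimal sketch of the declarative shape described above, assuming hypothetical stage names and agent labels and using the timeout values from the list (this is not the exact contents of the PR's Jenkinsfile; runParallel and nodes stand in for the scripted helper discussed further down):

pipeline {
    agent none
    options {
        // global limit for the whole run
        timeout(time: 1, unit: 'HOURS')
    }
    stages {
        stage('setup') {
            agent { label 'master' }   // hypothetical label
            steps {
                // keep the checked-out sources available for the test nodes
                stash name: 'sources'
            }
        }
        stage('tests') {
            steps {
                script {
                    // scripted helper that runs every test on a node before
                    // releasing it, each node guarded by a 45 minute timeout
                    runParallel(items: nodes)
                }
            }
        }
    }
}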

Testing Procedure

Check the CI. I don't know how the params will affect everything, but it is better than the current state, and we can always fix it later if an issue occurs.

Related Issues

Checks some boxes on #66

@MrKevinWeiss added the enhancement label on May 15, 2020
@MrKevinWeiss self-assigned this on May 15, 2020
@MrKevinWeiss
Collaborator Author

I think I need to add the params and get rid of the RIOT submodule change.

Jenkinsfile Outdated
stash name: 'sources'
script {
for (i = 0; i < nodes.size(); i++) {
echo "${nodes[i]}"
Member

Is this script block a leftover from debugging, or do you want to keep it for informational purposes?

Collaborator Author

Debugging, thanks!

@MrKevinWeiss
Collaborator Author

Also it seems like I am not getting the notifications... I don't know why... yet!

@MrKevinWeiss
Collaborator Author

I am wondering if the overall timeout is a good idea. It may cause some problems if there are many jobs in the queue, since the nodes can be blocked for a long time while the master's timer keeps ticking...

@cgundogan
Member

I am wondering if the overall timeout is a good idea

I guess having that global timeout throughout the job lifetime is good. It's very unlikely, but if we observe hangs in the setup or notification phase (or future stages), then the global timeout seems to be our only rescue? Of course, we could also wrap each of those stages in a local timeout... but the global one is more convenient.
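
To make the trade-off concrete, a rough sketch of the local-timeout alternative mentioned here; the stage name and the five-minute value are illustrative assumptions, not taken from this PR:

stage('notify') {
    steps {
        // local safeguard for just this stage, instead of relying only on
        // the global timeout in the pipeline options
        timeout(time: 5, unit: 'MINUTES') {
            echo 'sending notifications'
        }
    }
}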

@MrKevinWeiss
Collaborator Author

I guess having that global timeout throughout the job lifetime is good.

Good, the problem is the timing. If I start 10 jobs at once, the last one would have to wait for the nodes to finish the other jobs, meaning my timeout would need to be some function of the number of running jobs or something (it would take at least 3 hours to run through 10 jobs).

Anyway, currently it is set to 1 hour. I think that is fine if we don't have to wait for other jobs to finish with the node, but that is currently not the case. What would be a good balance?

@MrKevinWeiss
Collaborator Author

Darn, it also seems like the catching of the errors prevents timeouts and aborts. Maybe for the time being I will increase everything to something that should work, and we can tune it later once I figure out how to capture error types (i.e. whether a timeout occurred or a stop message occurred).

@MrKevinWeiss
Collaborator Author

Oh man... the timeout actually seems not too nice...

@MrKevinWeiss
Collaborator Author

Maybe it is ready. It could still use some work, but there was at least one case where the timeouts and exiting worked out well. It would be nice to get this in by the end of the day.

Jenkinsfile Outdated
])
def runParallel(args) {
parallel args.items.collectEntries { name -> [ "${name}": {
// We want to timeout of a node doesn't respond in 15 mins
Member

s/of/if
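
For context, one plausible completion of the runParallel pattern quoted above, purely as a sketch: the per-node steps and the 15-minute value follow the comment in the excerpt and are not the final Jenkinsfile.

def runParallel(args) {
    parallel args.items.collectEntries { name -> [ "${name}": {
        // we want to time out if a node doesn't respond in 15 mins
        timeout(time: 15, unit: 'MINUTES') {
            node(name) {
                unstash 'sources'
                // flash/test/archive steps for this node would go here
            }
        }
    }] }
}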

Jenkinsfile Outdated
stepFlash(tests[i])
stepTest(tests[i])
stepArchiveTestResults(tests[i])
} catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e) {
Member

why this particular exception?

Collaborator Author

It is the timeout or abort exception. Without it, a timeout will only cancel one test on a node.
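
A rough illustration of the pattern being discussed: FlowInterruptedException is what Jenkins raises on a timeout or manual abort, so catching it separately lets the loop stop all remaining tests on this node instead of just the current one. The helper names come from the excerpt above; the surrounding loop is an assumption.

def caughtException = null
for (int i = 0; i < tests.size(); i++) {
    try {
        stepFlash(tests[i])
        stepTest(tests[i])
        stepArchiveTestResults(tests[i])
    } catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e) {
        // timeout or abort: remember it and stop running further tests
        caughtException = e
        break
    } catch (Exception e) {
        // any other failure only affects this single test
        echo "Test ${tests[i]} failed: ${e.message}"
    }
}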

Jenkinsfile Outdated
}
if (caughtException) {
// This should exit out of the node that failed
error caughtException.message
Member

Why don't we move this line into the catch statement? The surrounding if seems to be a bit verbose... (below, too)

Collaborator Author

Then it gets caught by the catchError, which sets the build status and stage status. This is what is required to exit out.

I am willing to say there is a better way than using the catchError call, though. I haven't tested it.
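
For readers following along, a hedged sketch of how these pieces can fit together: catchError marks the build and stage as failed without aborting the surrounding code, so the recorded exception is re-raised with error only after leaving that context. runTestsOnNode is a hypothetical helper standing in for the flash/test/archive steps above.

def caughtException = null
catchError(buildResult: 'FAILURE', stageResult: 'FAILURE') {
    try {
        runTestsOnNode()   // hypothetical helper for the per-node test steps
    } catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e) {
        caughtException = e   // remember the timeout/abort
    }
}
if (caughtException) {
    // outside the catchError context, this actually exits the node
    error caughtException.message
}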

@cgundogan
Member

Is there a test run that I can look at? Over at Jenkins I couldn't find any.

@MrKevinWeiss
Collaborator Author

Looks like there is still some work to do. Darn. I will take care of it tomorrow.

@MrKevinWeiss
Collaborator Author

I tried to simplify the catchError command since I need to use try/catch anyway. The problem is that now I cannot see failures in the stages. catchError allowed me to set buildResult and stageResult, but it appears I don't have that control with the currentBuild global variable. I guess I am really struggling with the documentation on what I have access to.

Should I just call it quits and use a catchError with a try/catch that allows me to throw the caught error outside the catchError context? Or can we accept that things look like they are passing when they are not (we still get correct test results)? Or should I continue to search for a way to try/catch and fail only that stage?

For some reason, the robot-test fail case seems to function properly, as the unstable setting is showing up.

It appears that the current stash and unstash have some issues with asynchronous behaviour.
For example, an unstash occurs on a node in a working directory that is different from what is expected when running a test.
It also appears that some directories are not being cleaned.

The following commit makes a number of changes to fix that:

- Make the Jenkinsfile declarative, as this is better supported
- Run all tests on a node before releasing it, to fix any shared workspace problems
- Clean up function names and steps to make the pipeline more readable
- Add timeouts to the overall process, and a per-node timeout that starts after the node is acquired
- Handle errors so that if an unstash fails, only that node is stopped
- Allow a timeout/stop to exit the whole set of tests
@MrKevinWeiss
Collaborator Author

I confirmed the node timeout only starts ticking after the node is acquired. I set it to 1 hour and the whole process to 3 hours.

There are still some strange things happening when we try to stop while it is changing states, but it just requires an additional stop and then it seems fine. I think we can leave it for now, as we have not had many lockup problems yet.
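
A small sketch of the layout this implies, using the values from this comment: because the timeout step only starts counting when it runs, putting it inside the node block means the one-hour clock starts after the executor is acquired, while the three-hour limit on the whole pipeline also covers queueing time.

node(name) {
    // clock starts here, after the node has been acquired
    timeout(time: 1, unit: 'HOURS') {
        // per-node flash/test/archive steps would run here
    }
}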

Member

@cgundogan left a comment

This rework greatly improves the pipeline design and reduces the overall build time. We can address the remaining minor irks and quirks in follow-up PRs to keep the diff minimal. ACK!

@cgundogan merged commit 563ec26 into master on May 18, 2020
@MrKevinWeiss
Collaborator Author

Thanks for all the help!

@MrKevinWeiss deleted the test/jenkinsfile0 branch on May 18, 2020 at 19:41