feat: Job deadlines #88
Conversation
This allows bpftrace some extra time to process and print BPF map data. For very large maps, or in some edge cases, the user may want to override this value.
This is configurable now :)
Thanks @leodido. BTW, one side effect of this is that anything that trips the job deadline will appear as a failed job. That's not ideal, but if we check the exit status of the container (i.e., whether it actually succeeded) before the pod is GC'd, we should be able to recover the true exit status. For now I don't think it matters much if a job that passed its deadline reports as failed, even when that was expected. One way around this would be to have the trace job runner time out itself, so that the job exits cleanly and shows as completed when this happens. I think that can be addressed in a separate PR, though, unless anyone feels strongly.
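The "have the trace job runner time out" idea can be sketched as wrapping the bpftrace invocation with `/bin/timeout` so it is interrupted before the Kubernetes deadline fires. This is only an illustrative sketch: `buildTraceCommand`, its parameters, and the example probe are hypothetical names, not the actual kubectl-trace API.

```go
package main

import (
	"fmt"
	"strconv"
)

// buildTraceCommand wraps a bpftrace program with /bin/timeout so the
// tracer receives SIGINT at the deadline and can dump its maps before
// exiting, letting the job complete instead of being killed by Kubernetes.
// (Hypothetical helper for illustration, not the real kubectl-trace code.)
func buildTraceCommand(deadlineSeconds int64, program string) []string {
	return []string{
		"/bin/timeout",
		"-s", "INT", // SIGINT makes bpftrace print its maps on exit
		strconv.FormatInt(deadlineSeconds, 10),
		"bpftrace", "-e", program,
	}
}

func main() {
	cmd := buildTraceCommand(60, "tracepoint:syscalls:sys_enter_open { @[comm] = count(); }")
	fmt.Println(cmd)
}
```

With this shape, a trace that reaches its deadline still exits cleanly, so the job shows as completed rather than failed.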
Ok, the pipeline job is not green for the reason explained by @dalehamel, but this PR looks amazing!
Nevertheless, this can be merged, since that adjustment/fix will be addressed in another PR soon (ideally before receiving other PRs from non-maintainers).
Before merging I want to try to let the trace runner die gracefully; I don't feel right making failed jobs the norm.
@@ -184,6 +187,11 @@ func (t *TraceJobClient) DeleteJobs(nf TraceJobFilter) error {

func (t *TraceJobClient) CreateJob(nj TraceJob) (*batchv1.Job, error) {

	bpfTraceCmd := []string{
		"/bin/timeout",
I'm pretty happy with this solution. I tested it out, and it allows trace jobs to complete within their deadline and print their maps. It's also a more reliable way to ensure maps actually do get printed, and to access their log data.
@leodido can you take another look please? I basically just added the timeout command and bumped the k8s deadline to include the grace period. This should give the process plenty of time to shut down cleanly, so it doesn't need to rely on the pre-stop hook. If a job does time out, we will have a failed job that is past its activity deadline, matching exit code 124 from the timeout command, indicating the job actually passed its deadline and didn't exit as it was supposed to. This should ideally be a rare case. In most cases, we should see that the job is able to complete and we can get the output from the logs, even if it is a map or histogram.
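The exit-124 behaviour mentioned above can be demonstrated in isolation. A minimal sketch, assuming GNU coreutils `timeout` is on the PATH, using `sleep` as a stand-in for bpftrace:

```go
package main

import (
	"fmt"
	"os/exec"
)

// runWithTimeout runs a command under /usr/bin/timeout and returns its exit
// code. When the wrapped command outlives the deadline, timeout kills it and
// exits with code 124 - the same signal the trace job uses to detect that a
// job passed its deadline without exiting cleanly.
func runWithTimeout(seconds string, args ...string) int {
	cmd := exec.Command("timeout", append([]string{seconds}, args...)...)
	if err := cmd.Run(); err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok {
			return exitErr.ExitCode()
		}
	}
	return 0
}

func main() {
	// "sleep 10" outlives a 1-second deadline, so timeout reports 124.
	fmt.Println("exit code:", runWithTimeout("1", "sleep", "10"))
}
```

A command that finishes within its deadline (e.g. `runWithTimeout("5", "true")`) returns 0, which is why completed trace jobs are unaffected by the wrapper.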
💣 I like it!
Fixes #80 and paves the way for #11
I think this is pretty elegant, and I have to give credit for both ideas to @jerr.
This ensures that bpftrace is signalled with SIGINT so it dumps its maps before exiting.
I've tested this, and it works more reliably than the interactive traces via TTY attach.
A nice benefit is that you can now collect data for a pre-set interval before exiting 😂