
feat: Job deadlines #88

Merged: 8 commits into master from job-deadlines, Sep 18, 2019
Conversation

@dalehamel (Member)

Fixes #80 and paves the way for #11

I think this is pretty elegant, and I have to give credit for both ideas to @jerr.

This ensures that bpftrace is signalled with SIGINT so it can dump its maps before exiting.

I've tested this, and it works more reliably than interactive traces via TTY attach.

A nice side benefit is that you can now collect data for a pre-set interval before exiting 😂

@dalehamel dalehamel requested review from leodido and fntlnz September 16, 2019 23:58
@dalehamel dalehamel changed the title Job deadlines Feat - Job deadlines Sep 16, 2019
@ghost left a comment

Found some fixes!

P.S. share your ideas, feedbacks or issues with us at https://github.com/fixmie/feedback (this message will be removed after the beta stage).

Co-Authored-By: fixmie[bot] <44270338+fixmie[bot]@users.noreply.github.com>
This allows for bpftrace to take some extra time to process and print
bpf map data.

For very large maps, or in some edge cases, the user may want to override this value.
This is configurable now :)
@dalehamel (Member Author)

Thanks @leodido!

BTW, one side effect of this is that anything that trips the job deadline will appear as a failed job. That's not ideal, but if we check the exit status of the container (i.e., whether it actually succeeded) before the pod is GC'd, we should be able to rescue the true exit status.

For now I don't think it much matters if a job that passed its deadline reports as failed, even when this was expected. One way around this would be to have the trace job runner time out itself, so that the job exits cleanly and shows as completed when this happens.

I think that can be addressed in a separate PR though unless anyone feels strongly.

@leodido leodido self-requested a review September 17, 2019 00:49
leodido previously approved these changes Sep 17, 2019

@leodido (Member) left a comment

OK, the pipeline job is not green, for the reason explained by @dalehamel, but this PR looks amazing!

Nevertheless, this can be merged, since that adjustment/fix will be addressed in another PR soon (ideally before receiving other PRs from non-maintainers).

@dalehamel (Member Author)

Before merging I want to try to let the trace runner die gracefully; I don't feel right making failed jobs the norm.

@@ -184,6 +187,11 @@ func (t *TraceJobClient) DeleteJobs(nf TraceJobFilter) error {
func (t *TraceJobClient) CreateJob(nj TraceJob) (*batchv1.Job, error) {

bpfTraceCmd := []string{
"/bin/timeout",
@dalehamel (Member Author)

I'm pretty happy with this solution. I tested it out, and it allows trace jobs to complete within their deadline and print their maps. It's also a more reliable way to ensure that maps actually do get printed, and to access their log data.

@dalehamel (Member Author)

@leodido can you take another look please?

I basically just added the timeout command and upped the k8s deadline to include the grace period. This should give the process plenty of time to shut down cleanly, so it doesn't need to rely on the pre-stop hook.

So, if a job times out, we will have a failed job that is past its activity deadline, analogous to exit status 124 from the timeout command, indicating that the job actually passed its deadline and didn't exit as it was supposed to. This should ideally be a rare case.

In most cases, we should see that the job is able to complete, and get the output from the logs, even if it is a map or histogram.

@leodido leodido self-requested a review September 18, 2019 23:13
@leodido (Member) left a comment

💣 I like it!

@dalehamel dalehamel changed the title Feat - Job deadlines feat: Job deadlines Sep 18, 2019
@dalehamel dalehamel merged commit 7da686a into master Sep 18, 2019
@dalehamel dalehamel deleted the job-deadlines branch September 18, 2019 23:16
Successfully merging this pull request may close these issues:

Feat: A mechanism to set activeDeadlineSeconds