Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use prometheus to annotate pod/node data onto Job model #723

Merged
merged 19 commits into from
Jan 19, 2024

Conversation

jjnesbitt
Copy link
Collaborator

@jjnesbitt jjnesbitt commented Jan 10, 2024

Supersedes #697

This PR collects much more data every time job data is collected by the job webhook. This allows us to derive "cost per job" metrics.

The following notable changes are included in this PR:

  • build_timings_processor.py has been moved/renamed to the job_processor folder, with several files, including one for build timings. This is because at this point, a lot more is going on than just uploading build timings.
  • A new Node model, which stores data relating to a node that a job ran on, including available cpu/memory, its spot price, etc.
  • A new JobPod model, which stores data about the pod a job ran on in the cluster, including resource requests, limits, and usage.

The above two models are referenced by Job, but are null-able, since jobs that weren't run in the cluster (uo, etc.), as well as historical aws jobs, won't have any available data.

There are a few more minor changes worth mentioning:

  • The Job.duration field has been changed from a FloatField to a DurationField, to better represent the data and make it easier to work with.
  • The Job.aws field has been changed to be no longer null-able. The field was made null-able in Handle jobs with missing runners #724, to get around a quirk of the gitlab API. This PR circumvents that part of the API entirely, and so null values shouldn't be allowed. The amount of rows in the database that have a null value in the aws column is only a handful at the moment, and so those rows are deleted as a part of the migration, to clean things up.

@jjnesbitt jjnesbitt force-pushed the job-node-data-prometheus branch 2 times, most recently from 0b9f9ac to 5aa73fa Compare January 12, 2024 16:57
@jjnesbitt jjnesbitt force-pushed the job-node-data-prometheus branch from 5aa73fa to 13381d5 Compare January 12, 2024 17:01
@jjnesbitt jjnesbitt force-pushed the job-node-data-prometheus branch 4 times, most recently from fe9d564 to 147e0a4 Compare January 13, 2024 19:34
Additionally, break apart job model to hold data pertaining to the node and pod. Since a job isn't always run in the cluster, this will de-clutter the model and prevent a lot of null rows on non-aws jobs.
@jjnesbitt jjnesbitt force-pushed the job-node-data-prometheus branch from 147e0a4 to cdf18a1 Compare January 15, 2024 18:03
@jjnesbitt jjnesbitt force-pushed the job-node-data-prometheus branch from 3781335 to 0bd349b Compare January 15, 2024 19:27
@jjnesbitt jjnesbitt marked this pull request as ready for review January 16, 2024 16:31
@jjnesbitt
Copy link
Collaborator Author

@alecbcs @cmelone Let me know if anything seems awry or jumps out at you here.

@jjnesbitt jjnesbitt force-pushed the job-node-data-prometheus branch from cd8879d to 8fbcc15 Compare January 17, 2024 17:57
danlamanna
danlamanna previously approved these changes Jan 17, 2024
danlamanna
danlamanna previously approved these changes Jan 18, 2024
danlamanna
danlamanna previously approved these changes Jan 18, 2024
danlamanna
danlamanna previously approved these changes Jan 18, 2024
@jjnesbitt jjnesbitt force-pushed the job-node-data-prometheus branch from 5df2ea8 to e8b3696 Compare January 18, 2024 23:11
@jjnesbitt jjnesbitt requested a review from danlamanna January 18, 2024 23:24
@jjnesbitt
Copy link
Collaborator Author

I've been testing this and am seeing times of around 3-5 seconds for the process_job task.

@jjnesbitt jjnesbitt merged commit 09ffdca into main Jan 19, 2024
15 checks passed
@jjnesbitt jjnesbitt deleted the job-node-data-prometheus branch January 19, 2024 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants