Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update stats output to include data for failed matches #1187

Merged
merged 3 commits into from
May 1, 2024

Conversation

milroy
Copy link
Member

@milroy milroy commented Apr 26, 2024

Performance testing with Fluxion has revealed the need to save and output data on failed matches. This PR updates the resource module to track and report these stats.

Comment on lines 2249 to 2278
if (flux_respond_pack (h, msg, "{s:I s:I s:o s:f s:I s:I s:I s:I s:f s:f"
" s:f s:f s:I s:I s:f s:f s:f s:f}",
"V", num_vertices (ctx->db->resource_graph),
"E", num_edges (ctx->db->resource_graph),
"by_rank", o,
"load-time", ctx->perf.load,
"graph-uptime", graph_uptime_s,
"time-since-reset", time_since_reset_s,
"njobs", ctx->perf.njobs,
"njobs-reset", ctx->perf.njobs_reset,
"min-match", min,
"max-match", ctx->perf.max,
"avg-match", mean,
"match-variance", variance,
"njobs-failed", ctx->perf.njobs_failed,
"njobs-reset-failed", ctx->perf.njobs_reset_failed,
"min-match-failed", min_failed,
"max-match-failed", ctx->perf.max_failed,
"avg-match-failed", mean_failed,
"match-variance-failed", variance_failed) < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quck thought - it might make sense to break the flat list up into some objects like

match.success.min
match.success.avg
match.success.max
match.failed.min
match.failed.avg
match.failed.max

etc or wherever grouping makes sense. That would make it a little easier to access specific data with queries e.g.

flux module stats sched-fluxion-resource | jq .match.failed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! Let me know if the implementation makes sense.

@milroy milroy force-pushed the stats-update branch 2 times, most recently from 02587f1 to 9497cdd Compare April 27, 2024 06:33
@milroy milroy requested review from garlick and trws April 27, 2024 06:34
goto error;
}

if (flux_respond_pack (h, msg, "{s:I s:I s:o s:f s:I s:I s:{s:o s:o}}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you but it might be nice to have the min/max/avg values in sub-objects.

Also it looks like the sub-objects could be leaked on error.

Packing objects with little o (which "steals" the reference) complicates object reference count management on error because the original reference is not always restored when pack fails (at least that was found to be the case in some jansson versions). So a suggestion is to either pack the response in one go or use big O to pack the sub objects and decrement their reference counts on both success and failure paths.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@garlick, I think I implemented your first suggestion as you intended. Let me know what you think and I'll discard or squash the fixup commits.

@milroy milroy force-pushed the stats-update branch 2 times, most recently from 48cfae7 to 4f75e31 Compare April 29, 2024 00:23
src/cmd/flux-ion-resource.py Outdated Show resolved Hide resolved
"avg-match", mean,
"match-variance", variance) < 0) {

if (flux_respond_pack (h, msg, "{s:I s:I s:o s:f s:I s:I"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get why the indentation, and why this is like this, but my goodness this is complex. I'm guessing you had it making two sub-objects earlier and backed away based on @garlick's comment. This is clearly right, so I'm ok with taking it, but breaking deep ones like this down would be nice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I updated this with sub-objects. @trws and @garlick let me know if this implementation looks better.

@milroy
Copy link
Member Author

milroy commented May 1, 2024

Thanks for the second round of feedback!

Copy link
Member

@trws trws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy. Thanks for the updates @milroy!

resource/modules/resource_match.cpp Show resolved Hide resolved
errno = ENOMEM;
goto error_free;
}
if (!(match_failed = json_pack ("{s:I s:I s:{s:f s:f s:f s:f}}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot easier for me to follow, good stuff.

@milroy milroy force-pushed the stats-update branch 3 times, most recently from 6c8b606 to 98cbee2 Compare May 1, 2024 07:10
@trws
Copy link
Member

trws commented May 1, 2024

@garlick, would you like one more pass here or are we ready for MWP?

milroy added 2 commits May 1, 2024 09:34
Problem: failed matches for job specifications with unsatisfiable
constraints have been observed restricting the rate at which valid
requests are scheduled. The resource module does not currently track
stats on failed matches.

Add tracking and reporting of stats on failed matches.
Problem: flux ion does not report the stats on failed matches collected
and returned via RPC from the resource module.

Add the ability to output the stats.
@garlick
Copy link
Member

garlick commented May 1, 2024

Go ahead. Nice work!

@trws
Copy link
Member

trws commented May 1, 2024

Ready when you are then @milroy!

@milroy
Copy link
Member Author

milroy commented May 1, 2024

Thanks for the reviews!

I added a tiny commit to test populating the failed stats path. Once that succeeds I'll set MWP.

Problem: the current stats tests don't test populating failed stats.

Submit two unsatisfiable jobs to generate failed job stats.
@milroy milroy added the merge-when-passing mergify.io - merge PR automatically once CI passes label May 1, 2024
Copy link

codecov bot commented May 1, 2024

Codecov Report

Attention: Patch coverage is 91.42857% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 73.9%. Comparing base (bb686df) to head (9013583).

Additional details and impacted files
@@          Coverage Diff           @@
##           master   #1187   +/-   ##
======================================
  Coverage    73.9%   73.9%           
======================================
  Files         102     102           
  Lines       14565   14595   +30     
======================================
+ Hits        10766   10790   +24     
- Misses       3799    3805    +6     
Files Coverage Δ
resource/modules/resource_match.cpp 69.2% <91.4%> (+0.2%) ⬆️

@mergify mergify bot merged commit 7a45c25 into flux-framework:master May 1, 2024
23 checks passed
@milroy milroy deleted the stats-update branch May 1, 2024 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-when-passing mergify.io - merge PR automatically once CI passes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants