
traverser: fix default job duration when no duration specified #1104

Merged: 2 commits merged into flux-framework:master on Nov 7, 2023

Conversation

@grondo (Contributor) commented Nov 1, 2023

When a resource graph has a fixed duration (i.e. a non-zero expiration), Fluxion currently assigns the graph duration to job requests that do not specify a duration. This results in job expirations that exceed the instance lifetime.

This PR fixes that by adjusting unspecified jobspec durations to the remaining graph duration, instead of the entire graph duration. This results in the expiration of these jobs matching that of the instance.

The existing test that checked this behavior was also updated and fixed.
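
As orientation, here is a minimal standalone sketch (not the actual Fluxion code; the function and parameter names are illustrative) of the computation this PR changes:

    #include <chrono>
    #include <cstdint>

    // Illustrative only: contrast the old and the proposed default-duration
    // computations for a jobspec that omits a duration.
    int64_t default_job_duration (std::chrono::system_clock::time_point graph_start,
                                  std::chrono::system_clock::time_point graph_end)
    {
        using namespace std::chrono;
        const auto now = system_clock::now ();
        // Old behavior: the entire graph window, which can exceed the time
        // the enclosing instance has left.
        const int64_t whole_window =
            duration_cast<seconds> (graph_end - graph_start).count ();
        (void) whole_window;  // shown only for contrast
        // Proposed behavior: only the remaining graph duration.
        return duration_cast<seconds> (graph_end - now).count ();
    }

As the review below notes, exactly where "now" is sampled in this computation matters.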

@milroy (Member) previously requested changes Nov 3, 2023

This is a good catch and the fix is elegant.

I'm requesting a few minor changes: a slight change to the now computation, a nit on the second commit message, and also that the commit order be swapped so the testsuite changes come after what they're testing.

@@ -70,8 +70,9 @@ struct jobmeta_t {
         now = t;
         jobid = id;
         alloc_type = alloc;
+        const auto now = std::chrono::system_clock::now();
@milroy (Member):

I think redeclaring now in this way will cause slight drift in the time when the updating traversal determines idata availability:

int64_t avail = planner_multi_avail_resources_at (subplan,

versus the at time passed via meta here:

rc = detail::dfu_impl_t::update (root, writers, meta);

The drift would probably be very small, but it could conceivably cause the avail variable to be set incorrectly here:

if (avail == 0 && planner_multi_span_size (subplan) == 0)
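
To make the drift concern concrete, here is a small standalone sketch (not Fluxion code): any work between two independent reads of the system clock shows up as a difference between the resulting timestamps, so a query made at the second timestamp can land after the time recorded at the first.

    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    int main ()
    {
        using namespace std::chrono;
        // First sample: analogous to the time stored in the job metadata.
        const int64_t t_meta = duration_cast<seconds>
            (system_clock::now ().time_since_epoch ()).count ();

        // ... traversal work happens here ...

        // Second sample: analogous to a separately computed "now" used when
        // querying availability. Any elapsed time (or a crossed second
        // boundary) between the two samples shows up as drift.
        const int64_t t_query = duration_cast<seconds>
            (system_clock::now ().time_since_epoch ()).count ();

        std::printf ("drift = %lld seconds\n",
                     static_cast<long long> (t_query - t_meta));
        return 0;
    }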

         int64_t g_duration = std::chrono::duration_cast<std::chrono::seconds>
-            (graph_duration.graph_end - graph_duration.graph_start).count ();
+            (graph_duration.graph_end - now).count ();

@milroy (Member):

In light of the above, my preference is to do something like this:

Suggested change:
-            (graph_duration.graph_end - now).count ();
+            (graph_duration.graph_end - std::chrono::system_clock::now ()).count ();

resource/traversers/dfu_impl.hpp: outdated review comment (resolved)
@grondo force-pushed the issue#1103 branch 2 times, most recently from 0336899 to 48b1444, on November 3, 2023 13:54
@grondo (Contributor, Author) commented Nov 3, 2023

Thanks @milroy! I've made the suggested changes and pushed the result.

Comment on lines 74 to 75
(graph_duration.graph_end -
std::chrono::system_clock::now()).count ();
@milroy (Member):

Sorry @grondo, but after more thought my previous suggestion results in a small drift, too. Can you do this instead:

Suggested change:
-            (graph_duration.graph_end -
-             std::chrono::system_clock::now()).count ();
+            (graph_duration.graph_end - now);

@grondo (Contributor, Author):

Will do! However, for the uninitiated, can you describe the problem this change fixes? I only understand enough of std::chrono to make this one change, and a change this subtle seems like it might be fairly brittle. At the very least I could add a comment here for future code spelunkers.

@milroy (Member) commented Nov 4, 2023:

The motivation for my request was to reduce the number of independent measurements of when now is throughout Fluxion which I hope makes drift debugging easier.

Your question is good and caused me to reconsider the g_duration computation and the downstream use. I think there's a bigger problem with the logic of aligning the job duration with the graph_end time. Issue #1103 reveals that the check here (which was designed to catch the problem) isn't working correctly:

meta.duration = graph_end - *at;

@grondo (Contributor, Author):

I only just realized what you said initially: that I was redeclaring now in the original patch. Somehow I missed that there was already a now declared in the function. 🤦 Sorry about that!

However, using the existing int64_t now results in a compilation error. I'm guessing some kind of cast is needed, but being pretty unfamiliar with std::chrono I'm not sure what to use.

Since g_duration and now are already of type int64_t, would it make sense to just convert graph_duration.graph_end to int64_t as well before taking the difference, e.g.

diff --git a/resource/traversers/dfu_impl.hpp b/resource/traversers/dfu_impl.hpp
index 098fcd41..f9fa8f9b 100644
--- a/resource/traversers/dfu_impl.hpp
+++ b/resource/traversers/dfu_impl.hpp
@@ -70,9 +70,9 @@ struct jobmeta_t {
         now = t;
         jobid = id;
         alloc_type = alloc;
-        int64_t g_duration = std::chrono::duration_cast<std::chrono::seconds>
-            (graph_duration.graph_end -
-             std::chrono::system_clock::now()).count ();
+        int64_t g_end = std::chrono::system_clock::to_time_t
+                          (graph_duration.graph_end);
+        int64_t g_duration = g_end - now;
 
         if (g_duration <= 0) {
             errno = EINVAL;

I really know nothing about std::chrono and was honestly having trouble casting now to the appropriate type otherwise. Please feel free to suggest anything you'd like (or even go ahead and push the correct fix to this branch).

@grondo (Contributor, Author):

> Your question is good and caused me to reconsider the g_duration computation and the downstream use. I think there's a bigger problem with the logic of aligning the job duration with the graph_end time. Issue #1103 reveals that the check here (which was designed to catch the problem) isn't working correctly:

Ah, yes, I see now that the "fix" attempted in this PR won't really fix anything. If the start of the job is delayed, then its expiration may still be greater than the instance expiration because real time has passed between when build() is called and the job's actual starttime.

I think the real fix is in the code you've called out above, or to have a way to just set the job expiration to the instance expiration, regardless of the job's duration (though I don't know the code well enough to determine how to do that).

Also, unless I'm mistaken, it seems like setting the job's duration to the instance duration in build() isn't doing anything, and that code should be removed to avoid confusion.

Ideally, we'd like to get this and the follow-on PR #1105 fixed before the tag tomorrow. Maybe I'll look into why the duration adjustment in run() isn't working (as best I can), but I could use help if you're available.

@grondo (Contributor, Author) commented Nov 6, 2023:

Well, when I print some debug from that line, graph_end seems to have the wrong scale:

at=1699291701 meta.duration=120 graph_end=1699291819000000000

Does graph_end need one of those crazy std::chrono::duration_cast<std::chrono::seconds> casts?
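
(Side note: the huge value above is consistent with system_clock's tick period being nanoseconds on this platform. time_since_epoch().count() returns ticks in that implementation-defined period, so a duration_cast to seconds is needed before the value can be compared with `*at`. A minimal standalone sketch, with a hypothetical 120-second window:)

    #include <chrono>
    #include <cstdio>

    int main ()
    {
        using namespace std::chrono;
        // Hypothetical graph_end, 120 seconds from now.
        const auto graph_end = system_clock::now () + seconds (120);

        // Raw count: ticks in system_clock's implementation-defined period
        // (often nanoseconds), hence the very large value in the debug output.
        const long long raw = static_cast<long long>
            (graph_end.time_since_epoch ().count ());

        // Cast to seconds first to get a value comparable with `*at`.
        const long long secs = static_cast<long long>
            (duration_cast<seconds> (graph_end.time_since_epoch ()).count ());

        std::printf ("raw=%lld seconds=%lld\n", raw, secs);
        return 0;
    }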

@grondo (Contributor, Author):

@milroy: FYI I've dropped the initial "fix" here (which wasn't a fix at all as you pointed out) in favor of this:

diff --git a/resource/traversers/dfu.cpp b/resource/traversers/dfu.cpp
index 4f7e7c73..061b6844 100644
--- a/resource/traversers/dfu.cpp
+++ b/resource/traversers/dfu.cpp
@@ -290,7 +290,9 @@ int dfu_traverser_t::run (Jobspec::Jobspec &jobspec,
     }
 
     int rc = -1;
-    int64_t graph_end = graph_duration.graph_end.time_since_epoch ().count ();
+    int64_t graph_end = std::chrono::duration_cast<std::chrono::seconds>
+                            (graph_duration.graph_end
+                                .time_since_epoch ()).count ();
     detail::jobmeta_t meta;
     vtx_t root = get_graph_db ()->metadata.roots.at (dom);
     bool x = detail::dfu_impl_t::exclusivity (jobspec.resources, root);

I may have guessed wrong at the correct fix here, so feel free to correct me again.

I've also fixed up the test itself, which wasn't even testing with Fluxion modules in the previous version 🤦 Basically, this is a whole new PR...

@grondo changed the title from "resource: fix default jobspec duration when no duration specified" to "traverser: fix default jobspec duration when no duration specified" on Nov 6, 2023
@grondo (Contributor, Author) commented Nov 6, 2023

I also noticed that the added test here does not work because flux alloc and flux batch in the flux-sched testsuite do not load the Fluxion modules by default. (Other tests that use flux batch load the Fluxion modules manually.)

So, I'll be force-pushing this branch with a completely different fix and an improved testsuite.

Problem: In dfu_traverser_t::run() the current graph_end is compared
to the current `*at` time to ensure the job duration won't overrun
the graph expiration. However, the local graph_end variable is created
using the default graph_duration.graph_end internal count, which may
not be in seconds.

Explicitly cast graph_end to std::chrono::seconds so that the value
of the local graph_end variable is in the same units as `*at` and
`duration`, which are in seconds.

Fixes flux-framework#1103

Problem: The test in t4011-match-duration.t checks that the duration
of a job is inherited from the enclosing instance duration, but we
should really be testing that the *expiration* of a child job is
inherited from the parent when no duration is specified. Otherwise,
a job submitted when the instance has only a few minutes remaining
could have its expiration set for long after the instance is
terminated.

Update the test to ensure that expiration of a job with no duration
specified is inherited from the enclosing instance expiration.
Use flux-alloc(1) to launch the test instance instead of running
standalone flux-start(1) instances since this better simulates the
intended use case.

codecov bot commented Nov 6, 2023

Codecov Report

Merging #1104 (641a82a) into master (8da547c) will increase coverage by 0.0%.
The diff coverage is 100.0%.

❗ Current head 641a82a differs from pull request most recent head 218f7ec. Consider uploading reports for the commit 218f7ec to get more accurate results

@@          Coverage Diff           @@
##           master   #1104   +/-   ##
======================================
  Coverage    71.8%   71.8%           
======================================
  Files          88      88           
  Lines       11524   11525    +1     
======================================
+ Hits         8279    8280    +1     
  Misses       3245    3245           
File                               Coverage          Δ
resource/traversers/dfu_impl.hpp   94.7% <100.0%>    (+0.1%) ⬆️

@grondo changed the title from "traverser: fix default jobspec duration when no duration specified" to "traverser: fix default job duration when no duration specified" on Nov 6, 2023
@garlick (Member) left a comment

I'll add my approval FWIW since we're on a tight timeline for a tag today, we need this fixed, and there's a regression test. Perhaps if @milroy thinks follow-up is needed, it can be done in another PR?

@grondo dismissed milroy's stale review on November 7, 2023 16:49

Reimplemented the change here with a different fix

@grondo (Contributor, Author) commented Nov 7, 2023

Thanks! I'll set MWP.

@grondo added the merge-when-passing label (mergify.io: merge PR automatically once CI passes) on Nov 7, 2023
@mergify (bot) merged commit 7d63065 into flux-framework:master on Nov 7, 2023
20 checks passed
@grondo deleted the issue#1103 branch on November 7, 2023 16:51