Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-info: support annotations (part 2) #3037

Closed
wants to merge 15 commits into from

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Jul 7, 2020

This is built upon PR #2960, part #2 in the annotations work. Annotations are distributed from the job-manager to job-info and available via the job-info.list service.

I think this is a good stopping point for part 2. Adding the ability for users to create their own annotations could possibly go into this PR as well, but I decided to hold off and maybe that'll be part 3 (as it will introduce security checks, thus security tests). Presumably part #4 will be the final flux-jobs support.

I don't think there's anything too spectacular to mention about the PR. The biggest decision, as discussed in #2608, is that annotations should be sent via an event from the job-manager, vs many of the other options discussed/prototyped.

As an FYI, the additions in this PR are the last 7 commits in this series.

@chu11
Copy link
Member Author

chu11 commented Jul 8, 2020

hmm hit this potentially racy fail?

expecting success: /usr/src/t/issues/t2686-shell-input-race.sh
{"[0-3]": {"Package": 1, "Core": 16, "PU": 16, "cpuset": "0-15"}}
2020-07-08T01:16:58.768930Z sched-simple.debug[0]: ready: 4 of 4 cores: rank[0-3]/core0
16290676736
16307453952
16324231168
16341008384
16357785600
16357785601
16374562816
16391340032
16408117248
16424894464
16441671680
16458448896
16760438784
16777216000
16793993216
16810770432
16827547648
16844324864
16944988160
17012097024
17028874240
17028874241
17028874242
17045651456
flux-start: 0 (pid 12908) Killed
not ok 9 - t2686-shell-input-race.sh

@grondo
Copy link
Contributor

grondo commented Jul 8, 2020

hmm hit this potentially racy fail?

We see this error frequently in various tests:

flux-start: 0 (pid 12908) Killed
not ok 9 - t2686-shell-input-race.sh

This means one broker was slow to exit and was killed by flux-start. I've never seen it outside of Travis-CI so not sure how to debug.

@chu11 chu11 force-pushed the issue2608_annotate_part2 branch from ecbb4ee to 5715c57 Compare July 12, 2020 15:48
@chu11
Copy link
Member Author

chu11 commented Jul 12, 2020

just rebasing now that #2960 is in

@chu11 chu11 changed the title job-info: support annotations job-info: support annotations (part 2) Jul 16, 2020
@chu11 chu11 force-pushed the issue2608_annotate_part2 branch 2 times, most recently from 06cd560 to d361bc6 Compare July 20, 2020 17:50
@chu11
Copy link
Member Author

chu11 commented Jul 20, 2020

So I've rebased & added the following on top of this PR:

  • re-worked annotations to store "namespace" keys within an object of a key. i.e.

    {"sched.reason_pending": "no cores"}

    is now

    {"sched":{"reason_pending": "no cores"}}

  • re-worked annotations updates to be "recursive". e.g.

    updating

    {"sched": { "reason_pending": "no cores"}}

    with

    {"sched": { "t_estimate": 1234}}

    should result in

    {"sched": { "reason_pending": "no cores", "t_estimate": 1234}}

  • support initial annotations in flux-jobs. flux-jobs requires some refactoring to allow arbitrary annotation keys to be listed by
    the user i.e. annotations.user.foo.bar. So for this initial implementation I only supported the defined keys in RFC27.

    sched.reason_pending
    sched.resource_summary
    sched.t_estimate

    I'll follow up with generic support in a later PR (@grondo has done some refactoring in his PR Add support for RFC19 F58 encoded JOBIDs #3045)

  • added tests

@chu11 chu11 force-pushed the issue2608_annotate_part2 branch from c5b3ca1 to a523295 Compare July 20, 2020 21:30
grondo and others added 14 commits July 20, 2020 15:53
Problem: flux-jobs header row is formatted with the same Formatter as
JobInfo rows, which is at odds with simply registering valid field
names and their corresponding header strings. E.g. exception info
header and valid field names have to inserted into the headers dict
in multiple steps.

Add a custom HeaderFormatter class that overrides the Formatter
get_field() method. This function directly looks up the header name
from kwargs[field_name] instead of calling getattr() recursively as
is done by the default Formatter.

This means "dotted" fields can be literally registered in the
headers dict.

Update the exception info registration using the new scheme.
If there are no state transitions to publish, do not publish a
state transitions event.

Fixes flux-framework#3054
If job ID does not exist, error should be ENOENT not EINVAL.
Rename functions to make it clear they are getting / checking
via the job manager.

Update callers appropriately.
Add context checks, which were missing when reconstructing job info
from the eventlog.
Support a new job-annotations event so other modules, such as job-info,
can be aware of changes to job annotations.
Listen to the newly created job-annotations event and make
job annotations available to be returned by job listing services.
Add flux job list annotation tests to t2203-job-manager-dummysched-single
and t2204-job-manager-dummysched-unlimited.

Add list attribute tests to t/t2230-job-info-list.
Allow annotation objects to be updated recursively, rather than
only at the first level.
Instead of annotating `sched` namespace annotations with a period
notation (i.e. `sched.resource_summary`), store annotations within
an object (i.e. sched is { "resource_summary" : ... })

Update tests appropriately.
In sched-dummy, annotate `sched` namespace annotations in a
sub-object instead of using period notation.

Update tests as necessary.
Support listing initial annotations of "annotations", "annotations.sched",
"annotations.sched.reason_pending", "annotations.sched.resource_summary",
and "annotations.sched.t_estimate".
@grondo
Copy link
Contributor

grondo commented Jul 20, 2020

Ugh, I should have had you wait on tests. I started down reworking t2800-jobs-cmd.t and had to end up rewriting most tests. At least your new tests will be self-contained, so merging in either way should not be too bad.

Add flux jobs annotation tests to t2203-job-manager-dummysched-single
and t2204-job-manager-dummysched-unlimited.

Add output header tests to t2800-jobs-cmd.t.
@chu11 chu11 force-pushed the issue2608_annotate_part2 branch from a523295 to c3822c8 Compare July 21, 2020 00:01
@chu11
Copy link
Member Author

chu11 commented Jul 21, 2020

Ugh, I should have had you wait on tests. I started down reworking t2800-jobs-cmd.t and had to end up rewriting most tests. At least your new tests will be self-contained, so merging in either way should not be too bad.

No worries and good timing on your comment. I just realized I forgot to add some tests to t2800-jobs-cmd.t, and went ahead and yoinked your refactoring patch for flux-jobs's header output. So this PR is now based on your PR #3045 anyways :-)

@grondo
Copy link
Contributor

grondo commented Jul 21, 2020

Ok, a lot more refactoring coming to flux-jobs...

@codecov-commenter
Copy link

Codecov Report

Merging #3037 into master will increase coverage by 0.00%.
The diff coverage is 72.88%.

@@           Coverage Diff            @@
##           master    #3037    +/-   ##
========================================
  Coverage   81.16%   81.16%            
========================================
  Files         285      285            
  Lines       44035    44140   +105     
========================================
+ Hits        35740    35826    +86     
- Misses       8295     8314    +19     
Impacted Files Coverage Δ
src/cmd/flux-job.c 85.08% <ø> (ø)
src/modules/job-info/job-info.c 78.20% <ø> (ø)
src/modules/job-info/list.c 68.31% <ø> (ø)
src/modules/sched-simple/sched.c 72.14% <ø> (ø)
src/modules/job-info/job_state.c 69.96% <58.69%> (-0.71%) ⬇️
src/modules/job-manager/event.c 75.92% <72.22%> (-0.55%) ⬇️
src/modules/job-manager/alloc.c 80.36% <78.78%> (-0.42%) ⬇️
src/cmd/flux-jobs.py 92.61% <93.75%> (+0.05%) ⬆️
src/modules/job-info/job_util.c 93.45% <100.00%> (+0.25%) ⬆️
src/modules/job-manager/raise.c 79.80% <100.00%> (ø)
... and 5 more

@chu11
Copy link
Member Author

chu11 commented Jul 21, 2020

Doh! I just realized I got a bug here:

class AnnotationsInfo:                                                                                                                      
    def __init__(self, annotationsDict):                                                                                                    
        self.annotationsDict = annotationsDict                                                                                              
        self.atuple = namedtuple("X", annotationsDict.keys())(                                                                              
            *(                                                                                                                              
                AnnotationsInfo(v) if isinstance(v, dict) else v                                                                            
                for v in annotationsDict.values()                                                                                           
            )                                                                                                                               
        )                                                                                                                                   
                                                                                                                                            
    def __repr__(self):                                                                                                                     
        # Special case, empty dict return empty string                                                                                      
        if self.annotationsDict:                                                                                                            
            return json.dumps(self.annotationsDict)                                                                                         
        return ""                                                                                                                           
                                                                                                                                            
    def __getattr__(self, attr):                                                                                                            
        try:                                                                                                                                
            return object.__getattribute__(self.atuple, attr)                                                                               
        except AttributeError:                                                                                                              
            return "" 

the getattr trick works if the last field of annotations.sched.reason_pending doesn't exist, but it doesn't work if annotations.sched doesn't exist. Perhaps I need to recursively create empty tuples in this recursive case.

@chu11
Copy link
Member Author

chu11 commented Jul 21, 2020

Given PR #3060 and some other things I discovered, I'm going to re-do PR ordering. Nixing this PR.

@chu11 chu11 closed this Jul 21, 2020
@grondo
Copy link
Contributor

grondo commented Jul 21, 2020

Ugh, sorry for the inconvenience @chu11! ☹️

@chu11
Copy link
Member Author

chu11 commented Jul 21, 2020

Ugh, sorry for the inconvenience @chu11! frowning_face

No worries, its also due to other things I'm uncovering as I do more development!

@chu11 chu11 deleted the issue2608_annotate_part2 branch June 5, 2021 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants