Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[enhancement] Need to emit R_lite #312

Closed
dongahn opened this issue Apr 11, 2018 · 12 comments
Closed

[enhancement] Need to emit R_lite #312

dongahn opened this issue Apr 11, 2018 · 12 comments

Comments

@dongahn
Copy link
Member

dongahn commented Apr 11, 2018

Scheduler needs to emit R_lite. WIP is at here. This is related to flux-framework/flux-core#1439 as well.

@dongahn
Copy link
Member Author

dongahn commented Apr 24, 2018

@grondo: I integrated resrc_tree_serialize_lite with the scheduler framework module. It now emits R_lite like this, but somehow the job is not running but gets stuck in runrequest.

What was the expected R_lite format? I found our discussion at flux-framework/flux-core#1378 (comment) and also vaguely remember we wanted to use core instead of cores.

BTW, as part of post processing hostname to rank resolution, the rank key appears after the children key in my R_lite. Could this be a problem?

quartz770{dahn}24: flux kvs dir -R lwj.0.0.1
lwj.0.0.1.R_lite = [ { "children": { "core": "0" }, "rank": 0 } ]
lwj.0.0.1.cmdline = ["sleep", "0"]
lwj.0.0.1.create-time = 1524544985.712936
lwj.0.0.1.cwd = /g/g0/dahn/workspace/flux-cancel/flux-sched
lwj.0.0.1.environ = {"BASH_ENV": "/usr/share/lmod/lmod/init/bash", "CVS_RSH": "ssh", "DISPLAY": "192.168.131.8:16.0", "DK_NODE": "/usr/global/tools/dotkit", "DK_ROOT"...
lwj.0.0.1.input.config = [{"dst": "*", "src": "stdin"}]
lwj.0.0.1.ncores = 1
lwj.0.0.1.ngpus = 0
lwj.0.0.1.nnodes = 1
lwj.0.0.1.ntasks = 1
lwj.0.0.1.opts.nnodes = 1
lwj.0.0.1.opts.tasks-per-node = 1
lwj.0.0.1.rank.0.cores = 1
lwj.0.0.1.rank.0.gpus = 0
lwj.0.0.1.state = runrequest
lwj.0.0.1.walltime = 0

@grondo
Copy link
Contributor

grondo commented Apr 24, 2018 via email

@dongahn
Copy link
Member Author

dongahn commented Apr 24, 2018

I think the issue is related to flux-framework/flux-core#1378 (comment)

Because we wanted to remove json type from the public interfaces, we converted json type to string type in JSC. So I was writing R_lite as a string, and rcalc doesn't seem to like it.

I will see how I can avoid this conversion. But it seems we will need a big surgery to JSC to use "getter/setter" pattern to avoid excessive json to string type conversions..

@garlick
Copy link
Member

garlick commented Apr 24, 2018

Do we have an encoded JSON object being stored as a JSON string?
This may just be question of using the right KVS API function to store it.
Let me have a look...

@garlick
Copy link
Member

garlick commented Apr 24, 2018

@dongahn - just realized I'm not sure what to look at. Where is the the code that is writing R_lite to the KVS?

@dongahn
Copy link
Member Author

dongahn commented Apr 24, 2018

Still in my repo. I will spend a bit more time this afternoon before creating a PR. If I need you to look at it, i will call out @garlick. Thanks!

@grondo
Copy link
Contributor

grondo commented Apr 24, 2018

Still in my repo.

You can also push to a branch on your fork and point us to it here.

Do we have an encoded JSON object being stored as a JSON string?

I was going to ask the same question. That was how the rdl key was stored so if you are using the same interface that could be the problem.

@dongahn
Copy link
Member Author

dongahn commented Apr 24, 2018

Yes I was using the same interface as rdl:

From this sched code, the rdl is encoded into a JSON string (Jtostr (jcb)).

    if (resrc_tree_serialize (ro, job->resrc_tree)) {
        flux_log (h, LOG_ERR, "%"PRId64" resource serialization failed: %s",
                  job->lwj_id, strerror (errno));
        goto done;
    }
    json_object_set_new (jcb, JSC_RDL, ro);
    jcbstr = Jtostr (jcb);
    if (jsc_update_jcb (h, job->lwj_id, JSC_RDL, jcbstr) != 0) {
        flux_log (h, LOG_ERR, "error jsc udpate: %"PRId64" (%s)", job->lwj_id,
                  strerror (errno));
        goto done;
    }
    Jput (jcb);

Then, in this JSC code, it is stored as a string.

    if (flux_kvs_txn_pack (txn, 0, key, "s", rs) < 0) {
        flux_log_error (h, "update %s", key);
        goto done;
    }
    if (!(f = flux_kvs_commit (h, 0, txn)) || flux_future_get (f, NULL) < 0) {
        flux_log_error (h, "commit failed");
        goto done;
    }

@garlick
Copy link
Member

garlick commented Apr 24, 2018

If the rs parameter is always serialized JSON, then it seems like the flux_kvs_txn_pack(fmt="s") should be replaced with flux_kvs_txn_put().

If this JSC function is also used to update JSON string values, then it seems like it should be split or something similar to the "pack" interface would be more appropriate?

@dongahn
Copy link
Member Author

dongahn commented Apr 24, 2018

@garlick: I actually reached the same conclusion with some more testing and I now have working R_lite. Thanks.

@garlick
Copy link
Member

garlick commented Apr 24, 2018

Ok great!

@dongahn
Copy link
Member Author

dongahn commented Apr 27, 2018

Both flux-framework/flux-core#1485 and #321 have just been merged.

@dongahn dongahn closed this as completed Apr 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants