-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
broker: add config file boot method #1320
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1320 +/- ##
==========================================
+ Coverage 78.3% 78.55% +0.25%
==========================================
Files 156 162 +6
Lines 28407 29606 +1199
==========================================
+ Hits 22245 23258 +1013
- Misses 6162 6348 +186
|
This does in fact complete #1236. I am not at all clear on what you think was "proposed as necessary in #1236". You seem to be assuming intentions that are nowhere present in the work statement. I think it is necessary to point out too that you did not open a detailed ticket explaining your approach to solving the problem before creating this code and making a pull request. But you did demand that of me. I am not suggesting that you should be doing anything different in this work. I am just pointing out that if you had allowed me to work the same way that you are working, then the work would already have been complete and you wouldn't have needed to spend your own time on this current coding effort. |
Towards the top of #1236 you mentioned modularization that was independent of the bootstrap method. I kept that to a minimum in this PR. I tried to add enough of a high level description of the bootstrap method in #1236 before starting so you would have an idea of high level plan and scope and could comment. A new issue wasn't necessary. |
7c81bf4
to
3e34a0e
Compare
01f3bc0
to
3d0e84c
Compare
@garlick, I see what might be a problem in Travis here. The sharness --chain-lint tests are not compatible in Travis with sending processes into the background for some reason. You might try the following patch and see if this fixes the current failure? diff --git a/t/t0013-config-file.t b/t/t0013-config-file.t
index cebca9f..764db08 100755
--- a/t/t0013-config-file.t
+++ b/t/t0013-config-file.t
@@ -80,7 +80,7 @@ test_expect_success 'start size=1 with private config file' '
# size=2 boot from config file
#
-test_expect_success 'start size=2 with private config files' '
+test_expect_success NO_CHAIN_LINT 'start size=2 with private config files' '
flux broker -Sboot.method=config \
-Sboot.config_file=${TCONFDIR}/priv2.1.conf &
run_timeout 5 flux broker -Sboot.method=config \ |
Thanks! Trying it on my test branch. |
That was it! Thanks. |
3ccca5f
to
5ab0d92
Compare
There's a bit of libtomlc99 unicode support code that is not covered by testing (also notice this in flux-security). I wonder if we should just remove that code or at least make sure it works...? I'm running through this PR on my desktop now. Otherwise, so far seems to check out ok. |
Import TOML parser library and TAP test from flux-security. Originally: https://github.com/cktan/tomlc99 cktan/tomlc99@a0b45225 Plus fix for a segfaulting test: cktan/tomlc99@01ecb88d cktan/tomlc99#3 TAP test replaces upstream test. It runs the BurntSushi test inputs through the parser and checks for expected pass/fail. See also: https://github.com/BurntSushi/toml-test Add 'tar-pax' option to AM_INIT_AUTOMAKE so that long filenames of test input files can be tarred by the dist target.
Import TOML utility library and unit test from flux-security.
Import prototype config class and unit test from flux-security.
Problem: create_rundir() is called both within boot_pmi(), and after bootstrap. The call within boot_pmi() was added so that the session-id obtained from PMI could be used in the directory name, and the rundir could be used in ipc:// socket paths. However the session-id is no longer used in the path, so we can make single call before bootstrap.
Problem: ipaddr_getprimary() calls log_err_exit() on error. Change error handling behavior. Have it return an error string in an optional error buffer, and return -1 on failure, 0 on success. Update broker user.
Add function to return argz list of addresses of all local network interfaces, in stringified IP/IPv6 form.
Problem: no test coverage for ipaddr.c Add a trivial TAP test.
Problem: the default values fo tbon.endpoint and mcast.endpoint are not appropriate for boot methods other than PMI. Move the code for ensuring that these attrs are set into boot_pmi(). Refactor some repeated code into a helper function.
If a "boot.method" attribute is set to "config", bootstrap overlay parameters from a config file instead of PMI. If it is set to PMI or is unset, the previous PMI boot method is executed. If config is selected, the attribute "boot.config_file" attribute must point to a TOML configuration file. See boot_config.h.
Problem: PMI code and its helpers are confusing to isolate and understand when they live in broker.c with unrelated code. Move boot_pmi() and its helper functions to boot_pmi.[ch]
Add a config file that can be used to boot a single-node Flux configuration. Modify the systemd unit file so that it uses this config file to start the local session.
Problem: fatal bootstrap errors sometimes cause segfaults when libzmq atexit handlers are called. The libzmq worker thread and zauth actor are started when flux_sec_comms_init() is called. Defer this until overlay_bind() or overlay_connect() is called for the first time. This should also make sharness tests that cover fatal bootstrap errors run a bit faster.
Problem: sharness tests that run a broker under run_timeout() cannot actually terminate the broker if the timer runs out. Unblock SIGALRM in the broker.
Problem: four subtests of t1105-proxy.t fail when run on a system that is also running a Flux system instance. These tests use the construct FLUX_URI=value VAR=value ... test_must_fail command It looks like the test_must_fail shell function actually does not transmit the environment variable settings to "command", and since FLUX_URI is unset by the test script, the command attempts to connect to the system instance and succeeds. This example, which prints nothing and exits with a code of 1, demonstrates the problem with the construct: #!/bin/sh foo() { "$@" } X=1 foo printenv X Solution: run the command directly and test its exit code, instead of running it with test_must_fail().
Rebased on current master and resynced the libtomlc99 stuff from flux-security, after flux-framework/flux-security#54 improved test coverage there. On
Finally, The overlay module does need some refactoring in general, and also to support multiple overlay topologies, but that's a topic for another PR I think. |
Thanks. To be clear, regarding refactoring I was only talking about this block of code, duplicated between if (!ov->sec_initialized) {
if (flux_sec_comms_init (ov->sec) < 0) {
log_msg ("flux_sec_comms_init: %s", flux_sec_errstr (ov->sec));
goto done;
}
ov->sec_initialized = true;
} E.g. move this block to if (overlay_sec_init (ov) < 0)
goto done; I only noticed this because of the duplicated error handling code that is not covered by tests. |
Ah, OK that makes sense! I'll fix. |
Problem: duplicate code for initializing security context exists in overlay_connect() and overlay_bind(). Factor out to shared private function.
Thanks, looks good to me. After travis reports back is this ready for merging? |
Yes. At the doctor but will check back on Travis when I get done
…On Feb 21, 2018 10:43 AM, "Mark Grondona" ***@***.***> wrote:
Thanks, looks good to me. After travis reports back is this ready for
merging?
—
You are receiving this because you were assigned.
Reply to this email directly, view it on GitHub
<#1320 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKX2wY1mKO4u7WCC5ydf3KhkGJDgezMks5tXGO1gaJpZM4RjTs->
.
|
Looks good! |
OK to merge IMHO. |
I'm putting this PR up early to get any feedback since I'm about to go offline for the rest of the day. It's not ready for merging yet (needs some hand testing on a cluster, as well as a sharness test).
This adds a new boot method for brokers where size, rank, session-id, and overlay endpoints are taken from a TOML config file rather than exchanged with PMI. The idea is to allow a multi-node system instance to bootstrap via systemd. Such a system instance still has many hurdles to overcome to be useful. As of now, it will not function unless all nodes listed in the config file are up, and it has no ability to recover its state if something goes wrong, or even survive an orderly shutdown.
Other than pulling boot_pmi.c out of broker.c and creating another boot method example, I didn't tackle any of the refactoring proposed as necessary in #1236. That problem to me seems co-mingled with the challenge of supporting multiple overlay topologies, since much of the current boot code is specific to TBON. Anyway maybe having this second example method on hand will help.
This pulls in the TOML config file support from
flux-security
. Since that still may undergo some change, I didn't try to introduce a general flux config file. Instead, this is just a config file for the boot process, and it does not contain any shortcuts that would make configuring a cluster easier. I would expect it to evolve:Each node is expected to have an identical config file, and finds its own endpoint address and that of its parent from the tbon-endpoints array.
The systemd unit file will need to be modified to run the broker something like this:
(I haven't tried that yet and maybe we will actually want systemd to start the broker with
flux start
)Feedback welcome.