Move libzmq dependent functions out of libflux-core into libzmqutil #3797

Merged: 12 commits, Jul 31, 2021

@chu11 chu11 commented Jul 24, 2021

It's been a massive journey to try and de-link libzmq from libflux-core (#3620 #3622, #3623, #3624, #3697, #3701, #3742, #3746, #3773, #3776), but hopefully this is the end :-)

This PR will move the few libzmq dependent functions out of libflux-core and into a new convenience library libzmqutil. The functions are:

flux_msg_sendzsock() / flux_msg_sendzsock_ex()
flux_msg_recvzsock()
flux_zmq_watcher_create()
flux_zmq_watcher_get_zsock()

Only the few dependent modules/binaries that need them (broker, shmem, libtestutil) will use libzmqutil and link to libzmq. We have to create a few private helper headers along the way (message_private.h, reactor_private.h) that export a few macros, helper functions, structs, etc.

You'll notice two fixup commits in this PR. They both rename functions.

flux_msg_sendzsock -> zmqutil_msg_sendzsock
flux_msg_recvzsock -> zmqutil_msg_recvzsock
flux_zmq_watcher_create -> zmqutil_zmq_watcher_create
flux_zmq_watcher_get_zsock -> zmqutil_zmq_watcher_get_zsock

Renaming the functions is of course completely optional; I went back and forth on it. So they are just fixup patches for now, which we can squash or remove based on discussion. Or perhaps folks don't like how I renamed them; I struggled with that as well. I didn't want to prefix anything with zmq_, to avoid potential confusion with libzmq functions. But zmqutil_ is not the prettiest prefix either.

Performance tests on fluke (multiple runs showed consistent results similar to these)

current master

Running throughput.py 1024 jobs per iteration, 1 iterations
throughput:     22.9 job/s (script:  22.7 job/s)
1: 46 seconds

Running throughput.py 2048 jobs per iteration, 1 iterations
throughput:     21.9 job/s (script:  21.8 job/s)
1: 94 seconds

this PR branch

>src/cmd/flux start ./throughput_loop.sh 1024 1
Running throughput.py 1024 jobs per iteration, 1 iterations
throughput:     25.3 job/s (script:  25.1 job/s)
1: 41 seconds


Running throughput.py 2048 jobs per iteration, 1 iterations
throughput:     24.0 job/s (script:  23.9 job/s)
1: 85 seconds

generally speaking, around a 10% improvement across many runs
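
As a quick sanity check on the "around 10%" figure, the throughput numbers quoted above can be compared directly. This is only a minimal sketch, assuming awk is available; the pct_gain helper name is hypothetical and not part of flux-core.

```shell
#!/bin/sh
# pct_gain BASE NEW: print the percent improvement of NEW over BASE,
# rounded to one decimal place. (Hypothetical helper, for illustration only.)
pct_gain() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# Throughput in job/s as reported above: current master vs. this PR branch.
pct_gain 22.9 25.3   # 1024 jobs per iteration
pct_gain 21.9 24.0   # 2048 jobs per iteration
```

For the two runs shown this works out to roughly 10.5% and 9.6%, which is in line with the summary above.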

@chu11 chu11 force-pushed the issue3617_zmqutil branch 2 times, most recently from 8503937 to a25d905 Compare July 24, 2021 04:23

garlick commented Jul 24, 2021

Whew! Nice.

Quick comment on function naming. What about:

flux_msg_sendzsock -> zmqutil_msg_send
flux_msg_recvzsock -> zmqutil_msg_recv
flux_zmq_watcher_create -> zmqutil_watcher_create
flux_zmq_watcher_get_zsock -> zmqutil_watcher_get_zsock

just reducing the redundancy a bit. Same with the file names - could probably drop the zmqutil_ prefix since they are already in a zmqutil directory.

@chu11 chu11 force-pushed the issue3617_zmqutil branch from a25d905 to e982a49 Compare July 25, 2021 02:22

chu11 commented Jul 25, 2021

Re-pushed per comments. The suggested function renames were liked, so I removed the separate fixup commits (folding the renames in) and adjusted the commit messages as a result.

@chu11 chu11 force-pushed the issue3617_zmqutil branch from e982a49 to 8bb0eba Compare July 25, 2021 02:26

garlick commented Jul 26, 2021

I checked src/shell/.libs/flux-shell and it is still linked against libzmq and friends, but I can't for the life of me figure out how!


chu11 commented Jul 26, 2021

@garlick I'm not seeing it on a TOSS3 cluster. Are you ldd-ing .libs/flux-shell or .libs/lt-flux-shell? b/c the former will link to the installed /usr/lib64/libflux-core.so, which (probably) still has the czmq dependency.

Edit: oh, I just re-read your comment. I think you should be checking .libs/lt-flux-shell.


garlick commented Jul 26, 2021

Ah! that was the problem. I'm not getting an lt-flux-shell in my build (Ubuntu 20.04 LTS) but after installing this version over the top of the old one I had installed, no more libzmq!

$ ldd .libs/flux-shell
	linux-vdso.so.1 (0x00007fffba7e3000)
	libflux-core.so.2 => /usr/local/lib/libflux-core.so.2 (0x00007f0321102000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f03210df000)
	libflux-optparse.so.1 => /usr/local/lib/libflux-optparse.so.1 (0x00007f03210c9000)
	liblua5.1.so.0 => /lib/x86_64-linux-gnu/liblua5.1.so.0 (0x00007f0321098000)
	libhwloc.so.15 => /lib/x86_64-linux-gnu/libhwloc.so.15 (0x00007f0321047000)
	libjansson.so.4 => /lib/x86_64-linux-gnu/libjansson.so.4 (0x00007f0321038000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f0320ee7000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0320cf5000)
	libuuid.so.1 => /lib/x86_64-linux-gnu/libuuid.so.1 (0x00007f0320cec000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f0320ce6000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f03211ee000)
	libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007f0320cb9000)
	libltdl.so.7 => /lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f0320cae000)

Yay!


chu11 commented Jul 26, 2021

@garlick i think you need to run src/shell/flux-shell one time for it to generate the lt-flux-shell.


garlick commented Jul 26, 2021

i think you need to run src/shell/flux-shell one time

I think that's dependent on the libtool version? My system doesn't produce executables with the lt- prefix. They are the same name as the original. This is libtool 2.4.6-14.


grondo commented Jul 26, 2021

I think that's dependent on the libtool version?

It is dependent on whether "fast-install" mode is enabled in libtool. I guess on some distros fast-install mode isn't needed, so libtool turns it off by default? I could have sworn it was enabled on my system (Ubuntu 18.04), but now I see it is set to needless:

enable_fast_install='needless'
fast_install=$enable_fast_install

I did verify that on CentOS 7 (and thus presumably TOSS 3), fast-install is enabled by default:

enable_fast_install='yes'
fast_install=$enable_fast_install
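
For anyone wanting to script this check, here is a rough sketch that pulls the setting out of lines shaped like the ones above and guesses which file under .libs/ to inspect. Both helper names are hypothetical, and the mapping from setting to file name only encodes the behavior reported in this thread (an lt- prefix when fast-install is 'yes'); it is an assumption, not something taken from libtool's documentation.

```shell
#!/bin/sh
# fast_install_setting: read shell-style assignments on stdin and print the
# single-quoted value assigned to enable_fast_install.
fast_install_setting() {
    awk -F"'" '/^enable_fast_install=/ { print $2 }'
}

# uninstalled_binary NAME SETTING: guess which file under .libs/ holds the
# uninstalled executable, following the behavior described in this thread.
uninstalled_binary() {
    case "$2" in
        yes) echo ".libs/lt-$1" ;;
        *)   echo ".libs/$1" ;;
    esac
}

# Feed in the two lines seen on the Ubuntu system above.
setting=$(printf '%s\n' "enable_fast_install='needless'" \
                        'fast_install=$enable_fast_install' | fast_install_setting)
uninstalled_binary flux-shell "$setting"
```

With the 'needless' setting this prints .libs/flux-shell, matching what was observed on Ubuntu 20.04 earlier in the conversation.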

@garlick garlick left a comment

Generally very pleased with how this turned out! Just the one question below.

static int iovec_to_msg (flux_msg_t *msg,
struct msg_iovec *iov,
int iovcnt)
int flux_iovec_to_msg (flux_msg_t *msg,
Member

Why do these need to have the flux_ prefix? This will make them available publicly.

@chu11 chu11 Jul 27, 2021

This was done b/c libflux-core.so doesn't export any non-flux_ prefixed functions, so the iovec functions weren't available in some places. But now that I think about it, any code that needs the iovec functions could ldadd src/common/libflux/libflux.la instead of src/common/libflux-core.la. Let me see how that works out.

@chu11 chu11 Jul 27, 2021

Well, this experiment didn't go as well as I thought it would. There are so many symbols to resolve, and a nice chunk of "what to ldadd first" to ensure dependencies get resolved accordingly. Here's an example:

diff --git a/src/common/librouter/Makefile.am b/src/common/librouter/Makefile.am
index 756f5bb..4dcdd41 100644
--- a/src/common/librouter/Makefile.am
+++ b/src/common/librouter/Makefile.am
@@ -63,9 +63,15 @@ test_ldadd = \
         $(top_builddir)/src/common/librouter/librouter.la \
         $(top_builddir)/src/common/libtestutil/libtestutil.la \
         $(top_builddir)/src/common/libflux-internal.la \
-        $(top_builddir)/src/common/libflux-core.la \
-        $(top_builddir)/src/common/libtap/libtap.la \
+        $(top_builddir)/src/common/libflux/libflux.la \
+        $(top_builddir)/src/common/liblsd/liblsd.la \
+        $(top_builddir)/src/common/libev/libev.la \
+        $(top_builddir)/src/common/libutil/libutil.la \
+        $(top_builddir)/src/common/libccan/libccan.la \
+        $(top_builddir)/src/common/libtomlc99/libtomlc99.la \
         $(top_builddir)/src/common/libzmqutil/libzmqutil.la \
+        $(top_builddir)/src/common/libczmqcontainers/libczmqcontainers.la \
+        $(top_builddir)/src/common/libtap/libtap.la \
        $(ZMQ_LIBS)
 
 test_cppflags = \

Worse yet, I'm hitting segfaults in sharness tests. I haven't been able to figure it out yet, although the backtrace of the segfault points to an unknown function. A symbol collision wouldn't surprise me given what's going on above.

Hmmm. Let me mull over this tonight and maybe something obviously simpler can come to mind.

@chu11 chu11 Jul 27, 2021

i can't think of an obviously clean solution here. We don't want to move iovec into yet another convenience lib. Would it be worthwhile to prefix the functions to something to indicate they are private and make sure they are accessible in the libflux-core.map file? Like prefix them with fprivate_ or something instead?

Although unintended, we get away with this on the reactor side b/c we static inline those two simple functions. We'd have this problem there too if the functions were not inlined.

Member

Can we just add libflux/libflux.la right after libflux-core.la in the LDADD rules? I didn't try to work through any tests, but the following patch resulted in a working broker and shmem connector for me:

diff --git a/src/broker/Makefile.am b/src/broker/Makefile.am
index f25cd2eaf..3c02fd731 100644
--- a/src/broker/Makefile.am
+++ b/src/broker/Makefile.am
@@ -67,6 +67,7 @@ flux_broker_LDADD = \
        $(builddir)/libbroker.la \
        $(top_builddir)/src/common/libflux-core.la \
        $(top_builddir)/src/common/libzmqutil/libzmqutil.la \
+       $(top_builddir)/src/common/libflux/libflux.la \
        $(top_builddir)/src/common/libpmi/libpmi_client.la \
        $(top_builddir)/src/common/libflux-internal.la \
        $(top_builddir)/src/common/libflux-optparse.la \
diff --git a/src/common/libflux/message.c b/src/common/libflux/message.c
index 51e402d2e..c6f0db869 100644
--- a/src/common/libflux/message.c
+++ b/src/common/libflux/message.c
@@ -384,7 +384,7 @@ static int msg_append_route (flux_msg_t *msg,
     return 0;
 }
 
-int flux_iovec_to_msg (flux_msg_t *msg,
+int iovec_to_msg (flux_msg_t *msg,
                        struct msg_iovec *iov,
                        int iovcnt)
 {
@@ -507,7 +507,7 @@ flux_msg_t *flux_msg_decode (const void *buf, size_t size)
         iovcnt++;
         p += n;
     }
-    if (flux_iovec_to_msg (msg, iov, iovcnt) < 0)
+    if (iovec_to_msg (msg, iov, iovcnt) < 0)
         goto error;
     free (iov);
     return msg;
@@ -1622,7 +1622,7 @@ void flux_msg_fprint (FILE *f, const flux_msg_t *msg)
     flux_msg_fprint_ts (f, msg, -1);
 }
 
-int flux_msg_to_iovec (const flux_msg_t *msg,
+int msg_to_iovec (const flux_msg_t *msg,
                        uint8_t *proto,
                        int proto_len,
                        struct msg_iovec **iovp,
diff --git a/src/common/libflux/message_private.h b/src/common/libflux/message_private.h
index 7923b8c10..8872d3082 100644
--- a/src/common/libflux/message_private.h
+++ b/src/common/libflux/message_private.h
@@ -61,11 +61,11 @@ struct msg_iovec {
     void *transport_data;
 };
 
-int flux_iovec_to_msg (flux_msg_t *msg,
+int iovec_to_msg (flux_msg_t *msg,
                        struct msg_iovec *iov,
                        int iovcnt);
 
-int flux_msg_to_iovec (const flux_msg_t *msg,
+int msg_to_iovec (const flux_msg_t *msg,
                        uint8_t *proto,
                        int proto_len,
                        struct msg_iovec **iovp,
diff --git a/src/common/libzmqutil/msg_zsock.c b/src/common/libzmqutil/msg_zsock.c
index 42e5b9c54..ad71f51fe 100644
--- a/src/common/libzmqutil/msg_zsock.c
+++ b/src/common/libzmqutil/msg_zsock.c
@@ -40,7 +40,7 @@ int zmqutil_msg_send_ex (void *sock, const flux_msg_t *msg, bool nonblock)
         return -1;
     }
 
-    if (flux_msg_to_iovec (msg, proto, PROTO_SIZE, &iov, &iovcnt) < 0)
+    if (msg_to_iovec (msg, proto, PROTO_SIZE, &iov, &iovcnt) < 0)
         goto error;
 
     if (nonblock)
@@ -117,7 +117,7 @@ flux_msg_t *zmqutil_msg_recv (void *sock)
 
     if (!(msg = flux_msg_create (FLUX_MSGTYPE_ANY)))
         goto error;
-    if (flux_iovec_to_msg (msg, iov, iovcnt) < 0)
+    if (iovec_to_msg (msg, iov, iovcnt) < 0)
         goto error;
     rv = msg;
 error:
diff --git a/src/connectors/shmem/Makefile.am b/src/connectors/shmem/Makefile.am
index 82ade229b..1c64c6553 100644
--- a/src/connectors/shmem/Makefile.am
+++ b/src/connectors/shmem/Makefile.am
@@ -22,6 +22,7 @@ shmem_la_LDFLAGS = -module $(san_ld_zdef_flag) \
 shmem_la_LIBADD = \
        $(top_builddir)/src/common/libflux-internal.la \
        $(top_builddir)/src/common/libflux-core.la \
+       $(top_builddir)/src/common/libflux/libflux.la \
        $(top_builddir)/src/common/libzmqutil/libzmqutil.la \
        $(ZMQ_LIBS)

Member Author

I don't know internally how a binary "looks up" a symbol when it calls a function, but I'm wondering if the binary linked symbol and the symbol in the shared object are both looked up at times. Doing another debugging run, the function address of flux_msglist_pollevents() also changes.

@garlick garlick Jul 29, 2021

Yeah, no further enlightenment here.

Is it possible to inline those iovec functions as a quick solution? <--EDIT: I see that's likely not the case.

Also, do the PROTO offsets and stuff need to be in that header? Seems like only PROTO_SIZE needs to be exported?

Member Author

Yeah, the proto export is just for the PROTO_SIZE.

@chu11 chu11 Jul 29, 2021

Speaking to some DEG folks, there is some scanning that is done internally to find symbols, and that scanning can change over time, such as when libraries are used or loaded in a certain order. Setting LD_DEBUG=all, I noticed this:

    161371:	binding file /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/.libs/libflux-core.so.2 [0] to /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/.libs/libflux-core.so.2 [0]: normal symbol `flux_msglist_create'
    161452:	binding file /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/.libs/libflux-core.so.2 [0] to /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/.libs/libflux-core.so.2 [0]: normal symbol `flux_msglist_create'
    161452:	binding file /g/g0/achu/chaos/git/flux-framework/flux-core/src/broker/.libs/lt-flux-broker [0] to /g/g0/achu/chaos/git/flux-framework/flux-core/src/common/.libs/libflux-core.so.2 [0]: normal symbol `flux_msglist_create'

I don't know exactly how to read this output, but it does look like flux_msglist_create() was found in the .so file and the lt-flux-broker at certain points in time.
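
One way to make output like this easier to read is to reduce each "binding file A to B" line to a pair of paths. As I understand glibc's LD_DEBUG format, A is the object that references the symbol and B is the object whose definition was chosen, so a sketch like the following may help. The bindings_for helper is hypothetical, the /build paths are shortened stand-ins for the ones above, and the field positions assume no spaces in paths.

```shell
#!/bin/sh
# bindings_for SYMBOL: read LD_DEBUG output on stdin and print
# "<referencing object> -> <defining object>" for each binding of SYMBOL.
bindings_for() {
    awk -v sym="$1" -v q="'" '
        $2 == "binding" && $3 == "file" && $6 == "to" {
            name = $11
            sub("^`", "", name)    # drop the opening backquote
            sub(q "$", "", name)   # drop the closing quote
            if (name == sym)
                print $4 " -> " $7
        }'
}

# Two lines in the same shape as the output above (paths abbreviated).
bindings_for flux_msglist_create <<'EOF'
    161452: binding file /build/src/common/.libs/libflux-core.so.2 [0] to /build/src/common/.libs/libflux-core.so.2 [0]: normal symbol `flux_msglist_create'
    161452: binding file /build/src/broker/.libs/lt-flux-broker [0] to /build/src/common/.libs/libflux-core.so.2 [0]: normal symbol `flux_msglist_create'
EOF

# Against a live process, input could come from something like:
#   LD_DEBUG=bindings ./some-program 2>&1 | bindings_for some_symbol
```

Read this way, both sample lines resolve to the definition in libflux-core.so.2, and lt-flux-broker appears only as a caller. That reading is my interpretation of the format rather than something established in this thread.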

@chu11 chu11 Jul 29, 2021

So I'm trying the libmessageprivate.la approach now. Unfortunately we have to put more stuff into message_private.h and message_private.c for things to work, but it seems the more correct solution.

Comment on lines +21 to +33
struct flux_reactor {
struct ev_loop *loop;
int usecount;
unsigned int errflag:1;
};

struct flux_watcher {
flux_reactor_t *r;
flux_watcher_f fn;
void *arg;
struct flux_watcher_ops *ops;
void *data;
};
Member

It's a bit unfortunate that we need to expose these things outside of reactor.c, given that we theoretically are providing an API to implement custom watchers. I guess it's because ev_zmq was implemented as a pure libev watcher, and libev isn't exposed directly. We might be able to reimplement this in terms of the exported functions, but I'm thinking that could be hairy and time consuming, and this works.

I guess it's for the greater good, and it's still private to flux-core :-)

Member Author

i didn't think about the idea of implementing via exposed functions. But yeah, the fact libev wasn't public did create problems.

Member

Not worth the effort at this time IMHO.

@@ -23,6 +23,7 @@
#endif
@garlick garlick Jul 27, 2021

Commit message for c9096f7

This might be the commit that people will find in the history when they ask "why isn't libzmqutil part of libflux"? So it might be good to mention the FIPS startup penalty and its adverse effect on job throughput in the commit message problem statement, even though the issue reference does provide that background. Something like:

Problem: on systems configured for FIPS compliance, the crypto libs used by libzmq perform a self-test in the dso init function, which slows down library loading. When linking with libzmq is widespread, this significantly impairs job throughput.

Isolate 0MQ functions that depend on libzmq in a non-exported convenience library, to make it possible to minimize the number of components linked with libzmq.


grondo commented Jul 27, 2021

Sorry, I'm not up to date on this PR. However, I was going to ask if it would be useful to add some kind of test that ensures the build (maybe just flux-shell) hasn't linked against libcrypto? It would be trivial to add and might catch an accidental regression down the line (though I'm not sure how likely that actually is).

Just thought I'd throw that idea out there.


chu11 commented Jul 28, 2021

Sorry, I'm not up to date on this PR. However, I was going to ask if it would be useful to add some kind of test that ensures the build (maybe just flux-shell) hasn't linked against libcrypto? It would be trivial to add and might catch an accidental regression down the line (though I'm not sure how likely that actually is).

Good idea. I just added a simple:

if ldd ${FLUX_BUILD_DIR}/src/shell/.libs/lt-flux-shell | grep -q zmq
then
    exit 1
fi
exit 0

regression test, until I remembered our conversation above regarding the default of enable_fast_install. I'm not sure the above test is portable. But presumably checking ldd src/shell/.libs/flux-shell isn't portable or safe either. hmmm.


garlick commented Jul 28, 2021

presumably checking ldd src/shell/.libs/flux-shell isn't portable or safe

Would libtool e ldd src/shell/flux-shell work here?
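
A rough sketch of how such a check could be structured so that it does not care whether the binary is named flux-shell or lt-flux-shell: keep the zmq test as a filter on ldd output, and let libtool's execute mode find the real binary. The links_zmq helper name and the ./libtool location are assumptions. The sample input is abbreviated from the ldd output earlier in this thread.

```shell
#!/bin/sh
# links_zmq: read ldd output on stdin; succeed (exit 0) if any line
# mentions zmq, fail otherwise.
links_zmq() {
    grep -q zmq
}

# Sample ldd lines, abbreviated from the output posted above.
sample='	libflux-core.so.2 => /usr/local/lib/libflux-core.so.2
	libjansson.so.4 => /lib/x86_64-linux-gnu/libjansson.so.4'

if printf '%s\n' "$sample" | links_zmq; then
    echo "FAIL: libzmq is linked"
else
    echo "ok: no libzmq dependency"
fi

# Against a real build tree the input might instead come from something like
#   ./libtool --mode=execute ldd "${FLUX_BUILD_DIR}/src/shell/flux-shell"
# where execute mode is expected to translate the wrapper script into the
# uninstalled binary, whichever file under .libs/ that happens to be.
```

Separating the filter from the ldd invocation also lets the filter be exercised with canned input on systems where the build layout differs.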

@chu11 chu11 force-pushed the issue3617_zmqutil branch from 8bb0eba to b1c8e09 Compare July 29, 2021 04:14

chu11 commented Jul 29, 2021

re-pushed, fixing up the commit message per comments above and adding the extra test. This re-push is mostly to see if the new test works correctly under the docker images. None of the non-prefixing of the iovec functions is in this re-push.

@chu11 chu11 force-pushed the issue3617_zmqutil branch from b1c8e09 to ccac7ca Compare July 30, 2021 05:03

chu11 commented Jul 30, 2021

re-pushed. To resolve not exporting iovec functions, I created a libmessageprivate.la convenience library within libflux. This will export the iovec functions and a collection of helper functions that are needed. Unfortunately, this approach did increase the size of message private more than before. struct flux_msg and struct route_id and msg_proto_setup() also had to be added to message private.

note I added these functions to message private.

int msg_route_push (flux_msg_t *msg,
                    const char *id,
                    unsigned int id_len);

int msg_route_append (flux_msg_t *msg,
                      const char *id,
                      unsigned int id_len);

void msg_route_clear (flux_msg_t *msg);

int msg_route_delete_last (flux_msg_t *msg);

basically I had to make a call about making route_id_create() and route_id_destroy() callable in message_private OR to hide all of their usage behind "route id helper functions". I did the latter.


garlick commented Jul 30, 2021

OK, finally getting close to done I think. The message_private.c seems like maybe it should be split up topically. Without going too crazy with refactoring, I do have a modest set of changes I was playing with on a copy of your branch. Let me try to clean that up and push it somewhere that you can have a look at.


garlick commented Jul 30, 2021

Well FWIW I moved some things around on my issue3617_zmqutil_refactor branch.

The changes are just on top of this branch and do not have proper commit messages since they probably would just be squashed if you decide to include them. The changes are

  • rename libmessageprivate.la to libmessage.la
  • include it in libflux-internal.la
  • split message_private.[ch] into _iovec, _proto, and _route source modules

Let me know what you think @chu11. I'm open to whatever - we need to get this done and move on.


chu11 commented Jul 30, 2021

@garlick Ahhh, I didn't think about putting it into libflux-internal. Makes sense. I think it works. Let me yoink your commits and squash em where appropriate and make sure all commits rebuild/pass tests accordingly.

chu11 added 5 commits July 30, 2021 11:36
Problem: The message API currently has an internal function
called flux_msg_create_common().  This function would make it
difficult to migrate some functions out of libflux that depend
on it.

Solution: Allow flux_msg_create() to take the type FLUX_MSGTYPE_ANY,
indicating that the type for this message is not yet known.  As
a result, flux_msg_create_common() can be removed.  Update all prior
callers of flux_msg_create_common().  Add checks to ensure a message
type has been set before the message is encoded/sent.  Add unit tests.
In preparation for future refactoring, place route creation
or destruction code inside wrapper functions.  Have flux_msg_route_clear(),
flux_msg_route_push(), and flux_msg_route_delete_last() call these
new helper functions.
Problem: msg_append_route() is not named consistently to other
functions.

Solution: Rename to msg_route_append().
@chu11 chu11 force-pushed the issue3617_zmqutil branch from ccac7ca to 7b46472 Compare July 31, 2021 00:47

chu11 commented Jul 31, 2021

re-pushed, squashing and adjusting commits in the process, including the commit messages.


chu11 commented Jul 31, 2021

hmmm, one builder failed with.

  FAIL: test_plugin.t 100 - flux_plugin_load worked
  #   Failed test 'flux_plugin_load worked'
  #   at test/plugin.c line 419.
  Bail out!  Failed to load test plugin: dlopen: test/.libs/plugin_foo.so: undefined symbol: flux_kvs_commit
  ERROR: test_plugin.t - Bail out! Failed to load test plugin: dlopen: test/.libs/plugin_foo.so: undefined symbol: flux_kvs_commit

i'm perplexed, as nothing in the unit tests or src/common/libflux uses the kvs. And it's only one builder. Going to restart.

Edit: it failed again on centos8 - py3.7 ... super confused right now.

Edit2: Well, I guess absorbing libflux-internal instead of a bunch of individual util libs did bring in a few kvs symbols. But I have no idea why tests pass on some builders but not others. Some subtlety of the centos8 linker defaults, I guess?

chu11 added 7 commits July 30, 2021 22:22
Problem: We wish to move some messaging functions out of libflux, but there are some
shared structs / functions that will be needed if we do so.

Solution: Create a message convenience library within src/common/libflux
and add it to libflux-internal so it can be used internally by other flux-core code.

The following structs/functions were migrated out of message.c into their own
individual files.

struct flux_msg -> message_private.h
struct msg_iovec, msg_to_iovec(), iovec_to_msg() -> message_iovec.[ch]
PROTO macros, msg_proto_setup(), proto_get_u32() -> message_proto.[ch]
struct route_id, msg_route_push/append/clear/delete_last() -> message_route.[ch]

Update several linker dependencies to ensure all symbols are available for compilation.
Problem: We would like to remove the libzmq dependency on libflux-core.
Doing so requires us to put libzmq dependent functions somewhere else.

Solution: Create a new internal library libzmqutil and put
flux_msg_sendzsock() and flux_msg_recvzsock() in there.  Rename functions
to zmqutil_msg_send() and zmqutil_msg_recv() respectively.  Move related
unit tests as well.  Update all callers, include new headers,
and link to new internal library.
Problem: In an effort to consolidate zmq dependent code into
one library, ev_zmq in libutil is out of place.

Solution: Move ev_zmq from libutil to libzmqutil.  As a result,
libutil is no longer dependent on libzmq, and thus libflux-internal
is no longer dependent on libzmq.  Remove libzmq dependency
in libflux-internal.  Any binaries/libraries that need libzmq
and previously acquired it via libflux-internal, add libzmq dependency
directly.
Problem: We wish to move some reactor functions out of libflux, but there
are some shared structs / functions that will be needed if we do so.

Solution: Create a reactor_private.h header file that can be used internally by
flux-core utility libraries.  Move struct flux_reactor, struct flux_watcher,
events_to_libev, and libev_to_events into the header.
Problem: On systems configured for FIPS compliance, the crypto libs
used by libzmq perform a self-test in the dso init function, which slows down
library loading.  When linking with libzmq is widespread, this significantly
impairs job throughput.

Solution: Isolate functions that depend on libzmq in the non-exported convenience
library libzmqutil.  This will make it possible to minimize the number of components
linked to libzmq.

The remaining functions that must be isolated are flux_zmq_watcher_create()
and flux_zmq_watcher_get_zsock().  Move them into libzmqutil and rename them
to zmqutil_watcher_create() and zmqutil_watcher_get_zsock().  Update callers
accordingly.

As a result of this change, libflux is no longer dependent on libzmq
and the library dependency can be removed.  Add dependency to any libraries/binaries
previously dependent on libflux's dependency on libzmq.

Fixes flux-framework#3617
@chu11 chu11 force-pushed the issue3617_zmqutil branch from 889b389 to e4b64a4 Compare July 31, 2021 05:22

codecov bot commented Jul 31, 2021

Codecov Report

Merging #3797 (889b389) into master (31bb96c) will decrease coverage by 0.00%.
The diff coverage is 92.14%.

❗ Current head 889b389 differs from pull request most recent head e4b64a4. Consider uploading reports for the commit e4b64a4 to get more accurate results

@@            Coverage Diff             @@
##           master    #3797      +/-   ##
==========================================
- Coverage   83.34%   83.33%   -0.01%     
==========================================
  Files         342      348       +6     
  Lines       50997    51014      +17     
==========================================
+ Hits        42502    42515      +13     
- Misses       8495     8499       +4     
Impacted Files Coverage Δ
src/common/libflux/reactor.c 93.07% <ø> (-0.08%) ⬇️
src/common/libzmqutil/ev_zmq.c 87.75% <ø> (ø)
src/common/libzmqutil/ev_zmq.h 100.00% <ø> (ø)
src/modules/job-manager/wait.c 77.27% <ø> (ø)
src/broker/module.c 76.45% <83.33%> (ø)
src/common/libflux/message_iovec.c 86.66% <86.66%> (ø)
src/common/libzmqutil/reactor.c 90.00% <90.00%> (ø)
src/common/libzmqutil/msg_zsock.c 91.52% <91.52%> (ø)
src/common/libflux/message.c 95.06% <92.59%> (+0.72%) ⬆️
src/broker/overlay.c 88.97% <100.00%> (-0.56%) ⬇️
... and 14 more


chu11 commented Jul 31, 2021

Re-pushed with a fix for the builder failure described above; I just had to add a libkvs/libkvs.la dependency.

@garlick garlick left a comment

I think we're good here. Thanks for the long slog! Set MWP when you're ready.

@mergify mergify bot merged commit a3f5f71 into flux-framework:master Jul 31, 2021
@chu11 chu11 deleted the issue3617_zmqutil branch August 1, 2021 04:06