Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

log: Use RTLD_NOLOADwhen checking symbols #310

Merged
merged 1 commit into from
May 14, 2018
Merged

log: Use RTLD_NOLOADwhen checking symbols #310

merged 1 commit into from
May 14, 2018

Conversation

chrissie-c
Copy link
Contributor

On FreeBSD 11 call dlopen on a shared library causes the constructors
to run again. As we're just getting symbols we don't need this to
happen.

Actually we don't WANT it to happen because it can cause qb_log_init to
be called twice (recursively) and the dlnames list gets corrupted. This
causes corosync (at least) to crash at startup.

Signed-off-by: Christine Caulfield [email protected]

on FreeBSD 11 call dlopen on a shared library causes the constructors
to run again. As we're just getting symbols we don't need this to
happen.

Actually we don't WANT it to happen because it can cause qb_log_init to
be called twice (recursively) and the dlnames list gets corrupted. This
causess corosync (at leasT0 to crash at startup.

Signed-off-by: Christine Caulfield <[email protected]>
@kgaillot
Copy link
Contributor

I'm not very familiar with the dl*() functions. Doesn't the library need to be loaded to use dlsym()? It looks like RTLD_NOLOAD is a glibc thing, I guess we don't support anything else?

@chrissie-c
Copy link
Contributor Author

I think by 'load' it means activate ready for execution - which is the usual use for dlopen of course. It certainly means the constructors are called (which is the problem here). with NOLOAD I suspect it just loads it into memory so you can get the symbols from it. But because we are already walking the tree of loaded libraries we don't need to 'LOAD' them again, the symbols are already there (I have checked this).

It's certainly supported by FreeBSD (which is where I see the problem) and Linux, but if you think it might be a problem I can set it at ./configure time.

@wferi
Copy link
Contributor

wferi commented May 11, 2018

According to dlopen(3), RTLD_NOLOAD does not load the shared object; dlopen() returns NULL if the object is not loaded yet. The Linux manual also states that "constructor functions are executed before dlopen() returns", but I guess it's really meant only when it actually loads a shared object (and also: "any initialization returns are called just once"). The BSD manual only mentions that "when an object is first loaded into the address space in this way, its function _init(), if any, is called by the dynamic linker."
The Linux manual calls RTLD_NOLOAD a GNU extension, which is also present on Solaris. Apparently, FreeBSD 11 supports it as well.

Beyond this, the algorithm seems racy: what if a shared object is replaced after the application initially loads it but before this dlopen() is executed? I'd expect dlopen() to return NULL in this case, but is this an error, or will the subsequent dlerror() return NULL, too (which gets passed to qb_log())?
The alternative outlined in https://stackoverflow.com/a/2694373/1142595 is somewhat hairy, though, see https://stackoverflow.com/questions/23809102/print-the-symbol-table-of-an-elf-file for example.

@chrissie-c
Copy link
Contributor Author

We're not looking for the whole symbol table, just for the logging segments inside each loaded object so I think dlsym() is fine. Also adding NOLOAD removes the potential for the race as will only look inside the objects that are already loaded in memory and in use by the program - which is what we want.

If the shared object is replaced we most certainly do NOT want the new symbols as they might not be correct for what is in RAM.

@wferi
Copy link
Contributor

wferi commented May 11, 2018

Sure, but with RTLD_NOLOAD added dlopen() will return NULL if the shared object is replaced (even though its previous instance is already loaded), thus you won't be able to use dlsym() on it.

@chrissie-c
Copy link
Contributor Author

TBH that's probably preferable to getting the wrong symbol, which would be a guaranteed segfault :)

@wferi
Copy link
Contributor

wferi commented May 11, 2018

Probably so, though I'm not sure what happens on symbol lookup failure.
And I guess it would be best to get the correct symbol anyway.

@chrissie-c
Copy link
Contributor Author

It's a messy area I agree, thanks for your input. I'll see if anyone else has any comments before I commit it. Just to be on the safe side :)

@wferi
Copy link
Contributor

wferi commented May 11, 2018

To be explicit, I support merging this PR. We're probably better off with it, even though the code might not be perfect still. I'm unsure because the quite insightful #266 (comment) and its followups do not mention this supposed race.

@chrissie-c
Copy link
Contributor Author

Thanks Feri. I've only seen the race on FreeBSD but it could potentially be more common

@chrissie-c chrissie-c merged commit c235284 into ClusterLabs:master May 14, 2018
@chrissie-c chrissie-c deleted the dlopen-noload branch May 14, 2018 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants