Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use closefrom and remove our fds max limit. #4872

Merged
merged 3 commits into from
Oct 22, 2021

Conversation

ZmnSCPxj
Copy link
Contributor

@ZmnSCPxj ZmnSCPxj commented Oct 19, 2021

Fixes #4868

This also inadvertently fixes a latent bug: before this patch, in the
subd function in lightningd/subd.c, we would close execfail[1]
before doing an exec.
We use an EOF on execfail[1] as a signal that exec succeeded (the
fd is marked CLOEXEC), and otherwise use it to pump errno to the
parent.
The intent is that this fd should be kept open until exec, at which
point CLOEXEC triggers and close that fd and sends the EOF, or if
exec fails we can send the errno to the parent process vua that
pipe-end.

However, in the previous version, we end up closing that fd before
reaching exec, either in the loop which dup2s passed-in fds (by
overwriting execfail[1] with a dup2) or in the "close everything"
loop, which does not guard against execfail[1], only
dev_disconnect_fd.

Ping @whitslack.

Also ping @NicolasDorier and @cdecker --- on Linux < 5.9, this new code uses /proc/$$/fd to see the open file descriptors (and avoid looping all the way to 1048576), but I do not know if /proc is limited or inaccessible within Docker, and the original #2977 was triggered by Docker on root. Cursory search suggests it should be accessible but I would prefer expert opinion.

We can argue that we can just "kick the can" and bump up the max limit to 4096, but at some point @whitslack is going to have 10,000 channels all by himself, and besides closefrom is "the future" and should be implementable on more systems at some point, maybe. More to the point: if we have N channels, the limit has to be at least N, and we need to launch N subdaemons (one for each channel), and iterating from 0 to N takes N steps, which we do N times at each subdaemon, meaning kicking the can and just raising the limit is O( n^2 ).

Created as draft for now; in particular the code is only minimally tested on Linux 5.11 (and I tried disabling HAVE_NR_CLOSE_RANGE to test the /proc scanning, which should work on a lot of systems that are not running lightningd in a chroot jail). It should be tested on more systems and container types, and while I can try to figure out how to get the tests running on FreeBSD in a VM, I do not know if I can legally run a MacOS VM if I refuse to purchase anything from Apple.

@ZmnSCPxj ZmnSCPxj force-pushed the use-closefrom branch 2 times, most recently from 99b0ce6 to 5682ea0 Compare October 19, 2021 05:03
ccan/ccan/closefrom/_info Outdated Show resolved Hide resolved
@rustyrussell
Copy link
Contributor

I think increasing to 4k in the worst case is also the Right Thing, which means this won't make things worse,

I say move it out of draft (with the simple fixes preferably) and we use the -rcs for further testing...

@rustyrussell rustyrussell added this to the v0.10.2 milestone Oct 20, 2021
@ZmnSCPxj ZmnSCPxj removed this from the v0.10.2 milestone Oct 20, 2021
@ZmnSCPxj
Copy link
Contributor Author

ZmnSCPxj commented Oct 20, 2021

Removed from 0.10.2 milestone for now: the CI FreeBSD smoke test shows this line: https://github.com/ElementsProject/lightning/pull/4872/checks?check_run_id=3935023800#step:3:4379

   checking for closefrom() offered by system... no

FreeBSD should support closefrom, so this should be "yes", meaning something is broken in the configurator entry for this. Lemme debug the FreeBSD bit.

@ZmnSCPxj
Copy link
Contributor Author

Right right, closefrom returns void and I was doing int res = closefrom(STDERR_FILENO + 1); in the configurator. Hope the FreeBSD smoke test shows closefrom is detected now!

Also did the closefrom_limit as suggested, and use bool everywhere it made sense. closefrom_may_be_slow is still exposed so we can print something to logs.

@ZmnSCPxj
Copy link
Contributor Author

Configurator test on FreeBSD now works! https://github.com/ElementsProject/lightning/pull/4872/checks?check_run_id=3958492505#step:3:4385

Will bring out of draft.

@ZmnSCPxj ZmnSCPxj marked this pull request as ready for review October 21, 2021 01:43
@ZmnSCPxj ZmnSCPxj added this to the v0.10.2 milestone Oct 21, 2021
Signed-off-by: ZmnSCPxj jxPCSnmZ <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
This also inadvertently fixes a latent bug: before this patch, in the
`subd` function in `lightningd/subd.c`, we would close `execfail[1]`
*before* doing an `exec`.
We use an EOF on `execfail[1]` as a signal that `exec` succeeded (the
fd is marked CLOEXEC), and otherwise use it to pump `errno` to the
parent.
The intent is that this fd should be kept open until `exec`, at which
point CLOEXEC triggers and close that fd and sends the EOF, *or* if
`exec` fails we can send the `errno` to the parent process vua that
pipe-end.

However, in the previous version, we end up closing that fd *before*
reaching `exec`, either in the loop which `dup2`s passed-in fds (by
overwriting `execfail[1]` with a `dup2`) or in the "close everything"
loop, which does not guard against `execfail[1]`, only
`dev_disconnect_fd`.
Fixes: ElementsProject#4868

ChangeLog-Fixed: We now no longer self-limit the number of file descriptors (which limits the number of channels) in sufficiently modern systems, or where we can access `/proc` or `/dev/fd`.  We still self-limit on old systems where we cannot find the list of open files on `/proc` or `/dev/fd`, so if you need > ~4000 channels, upgrade or mount `/proc`.
@rustyrussell
Copy link
Contributor

Included in upstream ccan/ trivial rebase.

@ZmnSCPxj
Copy link
Contributor Author

ZmnSCPxj commented Oct 21, 2021

This PR currently still fails an edge case (which the existing version would also fail at):

  • Suppose we want to pass two fds to the subdaemon.
  • By complete chance, the second fd being passed happens to be fd 3.
  • In both the old code and the current PR, fd 3 will be overwritten (by the first fd), and then moved to fd 4 (when our loop reaches the second fd).
    • The subademon will then see fd 3 as closed, and fd 4 being the file descriptor that was passed first instead of second.

I think the PR can be pushed as-is, since it does no worse than the existing code (and maybe we have some protection already against the above...? Do we pass more than one fd to a subdaemon?)

The correct solution would be No it is still incorrect, darn it!:

  • Loop over the passed in fds.
    • For each one, open /dev/null and associate the passed-in fds with the newly opened /dev/null fd.
  • Loop over the passed in fds again (note! this must be a separate loop!):
    • For each one, move_fd the original fd to the /dev/null we allocated for it and ovewrite it with the new fd
  • Loop over the destination /dev/null fds.
    • For each one, move_fd to fdnum++.

The above is very technical and complex and stuff, so I think it should be a different PR; this PR, as noted, does no worse than the current code anyway (and we might already avoid, deliberately or accidentally, the mentioned edge case), and is strictly an improvement. WDYT @rustyrussell ?

--

Edit: This is the actual correct solution

  • Loop over the passed in fds:
    • Check if fdnum == execfail[1] or dev_disconnect_fd, if so move_fd_any them out of the way (i.e. this is the current code).
    • va_copy the current ap, then loop over the copy:
      • Check if fdnum ==``*va_read(ap_copy), if move_fd_any it out of the way -- fortunately the passed-in fds are int * so we can actually use the passed-in storage.

@ZmnSCPxj
Copy link
Contributor Author

CI failure seems unrelated, it is pointing at hsmtool, which should not be affected by this (does hsmtool use pipecmd?). Is there a way to restart just one test? The only button I can find seems to restart all of them.

I have code now to implement the fix to the really edgy edge case in the previous comment, but am wary of adding it currently, because I think we will not hit that edge case in practice anyway (because if so lightningd would already be currently failing a lot). Maybe. Hmmmmmm.

@vincenzopalazzo
Copy link
Contributor

Is there a way to restart just one test? The only button I can find seems to restart all of them.

yep, I think there is any method to restart only one check

@rustyrussell rustyrussell mentioned this pull request Oct 22, 2021
@rustyrussell
Copy link
Contributor

Is there a way to restart just one test? The only button I can find seems to restart all of them.

yep, I think there is any method to restart only one check

No, you have to restart the entire thing :( known GH actions limitation it seems...

@cdecker
Copy link
Member

cdecker commented Oct 22, 2021

ACK 9d5fbc1

@cdecker cdecker merged commit e733fdf into ElementsProject:master Oct 22, 2021
@ZmnSCPxj ZmnSCPxj deleted the use-closefrom branch October 22, 2021 16:43
@ZmnSCPxj
Copy link
Contributor Author

If anyone is interested and has a MacOS, here is some code I would like to have compiled and executed:

#include<errno.h>
#include<libproc.h>
#include<stdbool.h>
#include<stdio.h>
#include<stdlib.h>
#include<sys/types.h>
#include<unistd.h>

static
void closefrom(int fromfd)
{
	int saved_errno = errno;

	int maxfd;

/* #ifdef HAVE_PROC_PIDINFO */
	int orig_sz, sz, i, num;
	pid_t pid = getpid();
	struct proc_fdinfo *infos;

	orig_sz = proc_pidinfo(pid, PROC_PIDLISTFDS, 0, NULL, 0);
{ /* debug */
if (orig_sz >= 0)
	printf("proc_pidinfo(NULL, 0) ok\n");
} /* debug */
	if (orig_sz == 0)
		return;
	else if ((orig_sz > 0) && ((infos = malloc(orig_sz)) != NULL)) {
		sz = proc_pidinfo(pid, PROC_PIDLISTFDS, 0,
				  infos, orig_sz);
		if (sz >= 0 && sz <= orig_sz) {
{ /* debug */
	printf("proc_pidinfo(infos, orig_sz) ok\n");
} /* debug */
			num = sz / sizeof(struct proc_fdinfo);
			for (i = 0; i < sz; ++i) {
				if (infos[i].proc_fd >= fromfd)
					close(infos[i].proc_fd);
			}
			free(infos);
			goto quit;
		} else
			free(infos);
	}
/* #endif // HAVE_PROC_PIDINFO */

{ /* debug */
if (orig_sz >= 0)
	printf("fell back oh no!\n");
} /* debug */

	maxfd = sysconf(_SC_OPEN_MAX);

	for (; fromfd < maxfd; ++fromfd)
		close(fromfd);

quit:
	errno = saved_errno;
}

static
bool can_proc_pidinfo(void)
{
/* #ifdef HAVE_PROC_PIDINFO */
	int sz = proc_pidinfo(getpid(), PROC_PIDLISTFDS, 0, NULL, 0);
	if (sz >= 0)
		return true;
/* #endif // HAVE_PROC_PIDINFO */
	return false;
}

static
bool closefrom_may_be_slow(void)
{
	if (can_proc_pidinfo())
		return false;
	else
		return true;
}

int main(void)
{
	int fds[0];
	ssize_t wres;

	char buf = '\0';

	printf("closefrom_may_be_slow: %s\n",
	       closefrom_may_be_slow() ? "true" : "false");

	if (pipe(fds) < 0) {
		perror("pipe");
		return 1;
	}

	closefrom(STDERR_FILENO + 1);

	/* Writing to the write end should fail.  */
	do {
		wres = write(fds[1], &buf, 1);
	} while ((wres < 0) && (errno == EINTR));
	if ((wres < 0) && (errno == EBADF)) {
		printf("successfully closed by closefrom!\n");
		return 0;
	} else if (wres < 0) {
		perror("write");
		return 1;
	} else {
		printf("unexpected success on supposedly-closed socket!\n");
		return 1;
	}
}

On MacOS we use /dev/fd, which should work, according to my research, but note that it may not be accessible of lightningd is run in a chroot jail. On the other hand, the above uses malloc, which has problems. First, it can fail, meaning we end up with the fallback. Second, in a few rare contexts (not in lightningd, but this is ccan code and intended to be much more general) you do not want to call malloc ever --- e.g. you are implementing your own programming language with a GC and want your GC to use sbrk directly with the assumption that everything the GC controls is a single contiguous memory area: a random malloc risks a random uncontrolled sbrk which causes a malloc-managed memory block in the middle of your GC heap (and remember, free rarely, if ever, puts memory back with a reversing sbrk, most malloc implementations will just retain the memory for later reuse in a later malloc and always sbrk more memory, never less).

Unfortunately, allocating off the stack is difficult. Variable-length arrays are in C99, but ccan is intended to be compatible with pre-C99 compilers. alloca may require the occasional alloca(0) on some systems --- and in those systems, it is emulated by a linked-list of malloced blocks anyway, so bleah. Static-sized stack arrays are not general --- what if the number of kept fds below fromfd exceeds the static size? So malloc is the most portable solution (i.e. in most contexts where closefrom makes sense, malloc use should be safe as long as you ensure free, and hardly anyone makes programming language implementations with a GC that assumes contiguous memory area anyway, most GC algos will want to split up memory into large chunks and you can just malloc those at the C level).

@whitslack
Copy link
Collaborator

@ZmnSCPxj: If you don't want to use malloc, why don't you just mmap an anonymous private chunk of memory and munmap it when you're finished with it?

@ZmnSCPxj
Copy link
Contributor Author

Certainly a possibility if we consider the GC-with-contiguous-address-space thing, but that is marginal; mmap can still fail and force us to the slow close loop.

@NicolasDorier
Copy link
Collaborator

/proc seems to work fine inside containers

@ZmnSCPxj
Copy link
Contributor Author

ZmnSCPxj commented Nov 8, 2021

@NicolasDorier thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

**BROKEN** Could not subdaemon channel: Too many open files
6 participants