[bug:1421649] When using a fuse mount for client, EC volumes do not mount. #924

gluster-ant · 2020-03-12T12:40:15Z

URL: https://bugzilla.redhat.com/1421649
Creator: nigelb at redhat
Time: 20170213T10:54:56

We recently fixed bug 1402661 by using the /usr/libexec/glusterfs folder for creating the executable files. This folder is created when you have the glusterfs-server package installed.

However, when you only have the client installed, it doesn't create the /usr/libexec/glusterfs folder. Therefore, disperse volumes cannot be mounted correctly. This is from the mount logs on the client.

[2017-02-13 09:53:14.583515] I [MSGID: 100030] [glusterfsd.c:2460:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.11dev (args: /usr/sbin/glusterfs --volfile-server=172.19.2.81 --volfile-id=/testvol_dispersed /mnt/testvol_dispersed_glusterfs)
[2017-02-13 09:53:14.593506] I [MSGID: 101190] [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2017-02-13 09:53:14.597531] I [MSGID: 122067] [ec-code.c:1025:ec_code_detect] 0-testvol_dispersed-disperse-0: Using 'sse' CPU extensions
[2017-02-13 09:53:14.597580] E [MSGID: 122074] [ec-code.c:425:ec_code_space_create] 0-testvol_dispersed-disperse-0: Unable to create a temporary file for the ec dynamic code [No such file or directory]
[2017-02-13 09:53:14.597598] E [MSGID: 122073] [ec.c:654:init] 0-testvol_dispersed-disperse-0: Failed to initialize matrix management [No such file or directory]
[2017-02-13 09:53:16.597684] E [MSGID: 101019] [xlator.c:503:xlator_init] 0-testvol_dispersed-disperse-0: Initialization of volume 'testvol_dispersed-disperse-0' failed, review your volfile again
[2017-02-13 09:53:16.597723] E [MSGID: 101066] [graph.c:324:glusterfs_graph_init] 0-testvol_dispersed-disperse-0: initializing translator failed
[2017-02-13 09:53:16.597770] E [MSGID: 101176] [graph.c:680:glusterfs_graph_activate] 0-graph: init failed
[2017-02-13 09:53:16.598251] W [glusterfsd.c:1329:cleanup_and_exit] (-->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x3c1) [0x7fbf765006f1] -->/usr/sbin/glusterfs(glusterfs_process_volfp+0x1b1) [0x7fbf764fa8c1] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7fbf764f9dfb] ) 0-: received signum (1), shutting down
[2017-02-13 09:53:16.598281] I [fuse-bridge.c:5802:fini] 0-fuse: Unmounting '/mnt/testvol_dispersed_glusterfs'.
[2017-02-13 09:53:16.598575] W [glusterfsd.c:1329:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7fbf74e61dc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7fbf764f9f85] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x7fbf764f9dfb] ) 0-: received signum (15), shutting down

gluster-ant · 2020-03-12T12:40:17Z

Time: 20170213T10:58:51
nigelb at redhat commented:
We have a follow up issue from the mmap changes for SELinux

gluster-ant · 2020-03-12T12:40:18Z

Time: 20170213T13:58:58
bugzilla-bot at gluster.org commented:
REVIEW: https://review.gluster.org/16612 (build: Add %{_libexecdir}/glusterfs to base package) posted (#1) for review on master by Anoop C S ([email protected])

gluster-ant · 2020-03-12T12:40:18Z

Time: 20170213T14:40:51
ndevos at redhat commented:
Files under /usr/libexec/glusterfs should not be writable by processes after the initial installation. Anything that generates files should use /tmp or /run. For this particular case /run/gluster/lib looks most suitable.

This new /run/gluster/lib directory should be added to extras/run-gluster.tmpfiles.in so that it is created upon boot. There also needs to be a selinux-policy update (most likely) to enforce the correct context on the directory.

gluster-ant · 2020-03-12T12:40:19Z

Time: 20170213T14:55:42
ndevos at redhat commented:
Also, EC is client-side... How is this supposed to work for gfapi applications that do not run as root?

I think the location of the generated files needs to be detected during runtime:

/run/gluster/lib when running as root
/run/user//gluster/lib (maybe?) when running as non-root

For non-root, the directory needs to be created on demand, and have the correct SELinux context set after that (type:bin_t)?

gluster-ant · 2020-03-12T12:40:20Z

Time: 20170216T07:23:16
jahernan at redhat commented:
Nigel, do we really need to block 3.10.0 for this problem ?

There's already a patch to avoid failures when mmap() or anything else related to the dynamic code generation fails (https://bugzilla.redhat.com/show_bug.cgi?id=1421955). It simply falls back to the precompiled code instead of causing fatal failures.

This is not the best solution but allows disperse to be used, at least until a proper solution for the selinux problem is found and implemented. It's also a good solution for any possible future change or OS problem that could cause this to fail again.

Another workaround is to manually disable dynamic code generation by using option disperse.cpu-extensions = none.

gluster-ant · 2020-03-12T12:40:21Z

Time: 20170216T07:35:04
anoopcs at redhat commented:
(In reply to Xavier Hernandez from comment #5)

Nigel, do we really need to block 3.10.0 for this problem ?

There's already a patch to avoid failures when mmap() or anything else
related to the dynamic code generation fails
(https://bugzilla.redhat.com/show_bug.cgi?id=1421955). It simply falls back
to the precompiled code instead of causing fatal failures.

This is not the best solution but allows disperse to be used, at least until
a proper solution for the selinux problem is found and implemented. It's
also a good solution for any possible future change or OS problem that could
cause this to fail again.

Another workaround is to manually disable dynamic code generation by using
option disperse.cpu-extensions = none.

Additionally we also need to make sure that these are clearly documented in the release notes for 3.10.

gluster-ant · 2020-03-12T12:40:22Z

Time: 20170216T11:31:30
sisharma at redhat commented:

Why does it have to generate dynamic binary at runtime?
Can user be added to group which has permissions to generate this dynamic binary and then drop privileges, so there is no chance to escalate privileges at any later stage.
Solution in comment 5 in this bug seems to be fine, but what are the effects of re-using pre-compiled code ? That brings me to first question 'why dynamic binary generation is required' ?

gluster-ant · 2020-03-12T12:40:23Z

Time: 20170216T14:45:35
srangana at redhat commented:
(In reply to Anoop C S from comment #6)

(In reply to Xavier Hernandez from comment #5)

Nigel, do we really need to block 3.10.0 for this problem ?

There's already a patch to avoid failures when mmap() or anything else
related to the dynamic code generation fails
(https://bugzilla.redhat.com/show_bug.cgi?id=1421955). It simply falls back
to the precompiled code instead of causing fatal failures.

This is not the best solution but allows disperse to be used, at least until
a proper solution for the selinux problem is found and implemented. It's
also a good solution for any possible future change or OS problem that could
cause this to fail again.

Another workaround is to manually disable dynamic code generation by using
option disperse.cpu-extensions = none.

Additionally we also need to make sure that these are clearly documented in
the release notes for 3.10.

If https://bugzilla.redhat.com/show_bug.cgi?id=1421955 is taken in, what additional release notes are required? I need them ASAP so that the same can be updated.

Further, removing this from the 3.10 blocker list as per the discussion here.

gluster-ant · 2020-03-12T12:40:24Z

Time: 20170217T16:48:07
anoopcs at redhat commented:
(In reply to Shyamsundar from comment #8)

(In reply to Anoop C S from comment #6)

(In reply to Xavier Hernandez from comment #5)

Nigel, do we really need to block 3.10.0 for this problem ?

There's already a patch to avoid failures when mmap() or anything else
related to the dynamic code generation fails
(https://bugzilla.redhat.com/show_bug.cgi?id=1421955). It simply falls back
to the precompiled code instead of causing fatal failures.

This is not the best solution but allows disperse to be used, at least until
a proper solution for the selinux problem is found and implemented. It's
also a good solution for any possible future change or OS problem that could
cause this to fail again.

Another workaround is to manually disable dynamic code generation by using
option disperse.cpu-extensions = none.

Additionally we also need to make sure that these are clearly documented in
the release notes for 3.10.

If https://bugzilla.redhat.com/show_bug.cgi?id=1421955 is taken in, what
additional release notes are required? I need them ASAP so that the same can
be updated.

Further, removing this from the 3.10 blocker list as per the discussion here.

I thought it would be better to add a note about this fallback mechanism which we do internally instead of explicitly setting the specified volume set option. As you know I am not an expert in EC. I will leave it for Xavi to have a final word.

Xavi,
What do you think of adding a note in release notes for 3.10 regarding the internal fallback we do with https://review.gluster.org/#/c/16614/?

gluster-ant · 2020-03-12T12:40:25Z

Time: 20170220T07:20:18
jahernan at redhat commented:
If patch https://review.gluster.org/16614/ is accepted in 3.10, then I wouldn't say anything about the disperse.cpu-extensions workaround but add a comment stating that in some cases (basically if /usr/libexec/glusterfs doesn't exist or if it's not writable by the process using the volume) the disperse volumes won't use the any acceleration (it will fallback to the same computation method used in 3.8 and earlier).

Not sure how to word it or how many details to tell.

gluster-ant · 2020-03-12T12:40:26Z

Time: 20170220T10:54:53
jahernan at redhat commented:
(In reply to Siddharth Sharma from comment #7)

Why does it have to generate dynamic binary at runtime?

Basically because the amount of possible combinations that may be needed to encode/decode the data is too big (in the order of tens of thousands). On normal circumstances, only a bunch of combinations are really needed, but in relatively big configurations, things can fail in multiple ways and all of them need to be supported though not at the same time (as an example, in a 16+4 configuration, there are 4845 different ways in which 4 of the 20 bricks can fail, and each combination needs a different matrix to be computed).

It can be done without dynamic code (it has been working this way till recently) using more generic functions, but the performance is significantly worse.

The addition of SSE and AVX support makes this problem even worse because it would need to build specific functions for each of the possible extensions for each combination.

Can user be added to group which has permissions to generate this dynamic
binary and then drop privileges, so there is no chance to escalate
privileges at any later stage.

I don't think so. The code doesn't really need to be stored anywhere. In a previous patch it was created only on memory but, because of selinux, it needed to be stored into a file. The file is volatile though: the code creates the file and then immediately deletes it, keeping only an open fd. The generated code only needs to be there while the process is running.

Since creating all possible combinations would be impractical, only the needed combinations are created as they are needed on run-time, so we cannot create a file with all the code and then revoke privileges.

Solution in comment 5 in this bug seems to be fine, but what are the
effects of re-using pre-compiled code ? That brings me to first question
'why dynamic binary generation is required' ?

Well, solution in comment 5 basically disables dynamic code generation if something fails, so we are at the starting point again. The performance benefits are lost, though it's true that it works in all cases.

The detailed problem is this:

First solution:

An mmap() call was used to create an anomymous memory area where dynamic code was generated. It had read/write/execute privileges and it was not backed by any file. The problem here is that selinux forbids this.

Second solution:

Looking at how some internet browsers solve this problem (they generate dynamic code for javascript), I tried to use the double mmap() approach, that also seems to be the recommended solution to support selinux. With this approach I need a backing file mmapped to two distinct memory regions, one with read/write access and another one with read/execute access.

This works fine and selinux really allows this, but the problem here is that the backing file must have the bin_t selinux's type. Otherwise selinux doesn't allow the file to be mmaped to an executable region.

With the help of Anoop, we found that /usr/libexec/glusterfs has this type set, so we used it and everything worked fine with selinux activated.

The problem here (I didn't know that) is that this directory can be read-only in some cases, so it cannot be used. Also, it requires root privileges, while some gfapi applications may not run as root.

So the questions are:

Is there any way to improve security in this approach ?
Where can we store the backing file with bin_t type ? Niels proposed /run/glusterfs for root and /run//glusterfs for regular users. Is this ok ?
Can an selinux policy be distributed with gluster to make the previous directories of type bin_t ?

Thanks

gluster-ant · 2020-03-12T12:40:27Z

Time: 20170314T15:48:26
sgrubb at redhat commented:
You know that selinux policy can be adjusted to allow generating code in memory for this specific process.

gluster-ant · 2020-03-12T12:40:27Z

Time: 20170314T19:36:12
ndevos at redhat commented:
(In reply to Steve Grubb from comment #12)

You know that selinux policy can be adjusted to allow generating code in
memory for this specific process.

If that is acceptable, we should make that happen. Siddarth, what is your take on this?

gluster-ant · 2020-03-12T12:40:28Z

Time: 20170315T04:29:47
sisharma at redhat commented:
There are two ways

As Steve Suggested
or
I looked into it again yesterday, there is /run/usr/$gid which gets re-created after every reboot. suppose if user has gid (1000), there will be /run/usr/1000/.. only readable and writable by user only.

looking at other places, there doesnt seem to be a place which is recommended for such type of files. So if this file does not remain for long time I am OK with tweaking selinux policy to do so.

gluster-ant · 2020-03-12T12:40:29Z

Time: 20170315T09:36:25
jahernan at redhat commented:
After having seen some documentation, I like the dual mapping solution as it seems more robust and safe to me. This requires the creation of a file. This file is created and immediately deleted, so it only exists while the process keeps it open.

If it's ok to create an selinux policy that allows the process to create that file in an already existing directory (this directory should be writable by the owner of the process, not necessarily root. Maybe /tmp ?), that's fine to me.

If the best place to put that file is inside /run/usr/$gid (is it really gid or uid ?), I can write the necessary code. In this case, would we need to create a /run/usr/$gid/glusterfs ? or we can directly use /run/usr/$gid ?

Does /run/usr/$gid exist in all distributions ?

gluster-ant · 2020-03-12T12:40:30Z

Time: 20170315T11:05:12
sisharma at redhat commented:
(In reply to Xavier Hernandez from comment #15)

After having seen some documentation, I like the dual mapping solution as it
seems more robust and safe to me. This requires the creation of a file. This
file is created and immediately deleted, so it only exists while the process
keeps it open.

If it's ok to create an selinux policy that allows the process to create
that file in an already existing directory (this directory should be
writable by the owner of the process, not necessarily root. Maybe /tmp ?),
that's fine to me.

If the best place to put that file is inside /run/usr/$gid (is it really gid
or uid ?), I can write the necessary code. In this case, would we need to
create a /run/usr/$gid/glusterfs ? or we can directly use /run/usr/$gid ?

Does /run/usr/$gid exist in all distributions ?

/run/usr/$gid is created by pam, and it is not persistent. so the problem with writing such file with predictable filename to /tmp is that it will become vulnerable to symlink attack. So I am not in favor of it being written to /tmp.

gluster-ant · 2020-03-12T12:40:31Z

Time: 20170315T11:28:51
ndevos at redhat commented:
(In reply to Xavier Hernandez from comment #15)

Does /run/usr/$gid exist in all distributions ?

This would be /run/user/$uid. Note that it is "user", not "usr" and "$uid" not "$gid".

Modern distributions (Fedora, CentOS 7, all systemd ones?) have this directory. It is not available on CentOS 6 though, and that is also a target we aim to support as a Gluster community.

We can do some ./configure stuff to select different directories, but it still requires the correct SELinux context (bin_t) to be set too. Depending on the directory that gets picked for a distribution, the SELinux policy needs those modifications. Suggestions for a non-systemd distribution directory welcome.

gluster-ant · 2020-03-12T12:40:32Z

Time: 20170317T15:48:48
sgrubb at redhat commented:
Another approach might be to compile into mmap() memory that is rw, and then mprotect() it to be rx. That assumes you are using a language (javascript) that does not do runtime optimizing which requires continued writing as the app runs.

gluster-ant · 2020-03-12T12:40:33Z

Time: 20170317T16:38:37
ssekidde at redhat commented:
(In reply to Niels de Vos from comment #17)

We can do some ./configure stuff to select different directories, but it
still requires the correct SELinux context (bin_t) to be set too. Depending
on the directory that gets picked for a distribution, the SELinux policy
needs those modifications. Suggestions for a non-systemd distribution
directory welcome.

bin_t is used to identify a file as an ordinary program type. If this is shell script then it should be shell_exec_t.

gluster-ant · 2020-03-12T12:40:34Z

Time: 20170320T07:27:54
jahernan at redhat commented:
(In reply to Steve Grubb from comment #18)

Another approach might be to compile into mmap() memory that is rw, and then
mprotect() it to be rx. That assumes you are using a language (javascript)
that does not do runtime optimizing which requires continued writing as the
app runs.

The dynamic code here is used because the number of combinations that would be needed grows easily to thousands. However in normal circumstances only a small subset is used and tends to remain stable until some event happens (brick failure or similar).

The program mmaps regions in blocks of 64KB to store multiple fragments of code. However the fragments are generated as they are needed. At the same time some other fragments might be executing, so the mprotect() solution is not practical here.

Using an independent mmap() for each fragment of code seems a waste of memory to me (normally each fragment of code requires few hundreds of bytes).

(In reply to Simon Sekidde from comment #19)

bin_t is used to identify a file as an ordinary program type. If this is
shell script then it should be shell_exec_t.

The dynamic code generated and stored into the temporary backend file is machine code, not shell code. The bin_t type works fine here.

gluster-ant · 2020-03-12T12:40:35Z

Time: 20181031T13:10:45
jahernan at redhat commented:
I'll try to fix this problem.

Is it ok to use /run/gluster/bin (for example) to store dynamic code for root user and /run/user/$uid/gluster for other users ?

If so, do I need to also update https://github.com/gluster/glusterfs-selinux ?

If that's the correct option, I'll implement it.

gluster-ant · 2020-03-12T12:40:36Z

Time: 20181112T13:26:17
ndevos at redhat commented:
(In reply to Xavi Hernandez from comment #21)

I'll try to fix this problem.

Is it ok to use /run/gluster/bin (for example) to store dynamic code for
root user and /run/user/$uid/gluster for other users ?

If so, do I need to also update https://github.com/gluster/glusterfs-selinux
?

If that's the correct option, I'll implement it.

Siddarth or Simon, could you answer this question?

gluster-ant · 2020-03-12T12:40:37Z

Time: 20181120T08:22:24
sisharma at redhat commented:
This looks OK to me, /run/gluster/bin is not writable by non-root user. From SELinux perspective this is not standard path to have binary. It would be good to get opinion from Simon as well.

gluster-ant · 2020-03-12T12:40:40Z

Time: 20191118T07:00:38
anoopcs at redhat commented:
Re-visiting..

Are we going to address this any time soon? I remember that we have the fall back mechanism implemented but wanted to fix the real problem by putting dynamic code satisfying SELinux needs.

gluster-ant · 2020-03-12T12:40:41Z

Time: 20191122T14:04:42
jahernan at redhat commented:
If the proposed change is ok (which I don't know for sure yet), I'll do it when I've time. It's not urgent and there's a fallback for the particular cases where the client is not running as root (only for gfapi) so it can get delayed.

The change is quite simple though. Only select a different path depending on the user (and maybe creating some directory), so if someone wants to take it, I'll be happy to help.

stale · 2020-10-08T20:32:50Z

Thank you for your contributions.
Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity.
It will be closed in 2 weeks if no one responds with a comment here.

stale · 2020-10-24T01:57:29Z

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.

gluster-ant added Migrated Type:Bug labels Mar 12, 2020

gluster-ant assigned aspandey Mar 12, 2020

stale bot added the wontfix Managed by stale[bot] label Oct 8, 2020

stale bot closed this as completed Oct 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bug:1421649] When using a fuse mount for client, EC volumes do not mount. #924

[bug:1421649] When using a fuse mount for client, EC volumes do not mount. #924

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

stale bot commented Oct 8, 2020

stale bot commented Oct 24, 2020

[bug:1421649] When using a fuse mount for client, EC volumes do not mount. #924

[bug:1421649] When using a fuse mount for client, EC volumes do not mount. #924

Comments

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

gluster-ant commented Mar 12, 2020

stale bot commented Oct 8, 2020

stale bot commented Oct 24, 2020