High memory consumption depending on volume bricks count #23

Open
vreutskyi opened this issue Sep 12, 2018 · 3 comments

Comments

@vreutskyi

vreutskyi commented Sep 12, 2018

We've observed very high memory usage from gfapi.Volume when mounted to a big volume (one with a large brick count). Here are a few experiment results showing the memory used by a Python process mounted to different environments (VSZ / RSS, in KB):
Before mount: 212376 / 8932
12 bricks (2 nodes): 631644 / 21440
384 bricks (6 nodes): 861648 / 276516
600 bricks (10 nodes): 987116 / 432028

Almost half a GB per process just at start! And even more when actively used. As we are planning to run about 100 client nodes with 50 processes per node, the amount of memory needed becomes fantastic.
Is there any reason for gfapi to use so much memory just to mount the volume?
Does that mean that scaling up the server side requires a corresponding scale-up of the client side?
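
For reference, a minimal sketch of how such a measurement can be reproduced (assuming the libgfapi-python binding installed as `gfapi`; the host and volume names are placeholders):

```python
# Minimal sketch: measure process memory before and after mounting a
# GlusterFS volume via libgfapi-python (Linux only).
from gfapi import Volume

def vsz_rss_kb():
    """Return (VmSize, VmRSS) of the current process in KB from /proc."""
    mem = {}
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split(":", 1)
                mem[key] = int(value.split()[0])  # values are reported in kB
    return mem["VmSize"], mem["VmRSS"]

print("before mount (VSZ / RSS):", vsz_rss_kb())

vol = Volume("gluster-node", "myvol")  # placeholder host and volume name
vol.mount()                            # this is where the per-brick memory gets allocated

print("after mount  (VSZ / RSS):", vsz_rss_kb())
vol.unmount()
```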

@prashanthpai
Contributor

The Python bindings are a very thin wrapper around the libgfapi C library. The client connects to each brick, and 600 is a lot of bricks. There will be one instance of the client xlator (a loaded shared object) per brick.

You'll find some more information in this related issue: gluster/glusterfs#325

@abulkavs

@prashanthpai, it's not about the Python binding, it's about the usage of gfapi, whether via native C or via Python. FUSE is a singleton within a system, but gfapi is instantiated within a process, possibly even many times, so if FUSE takes 0.5 GB of memory just once, then N instances of gfapi multiply that memory usage by N. In the case of 2000 instances, 2000 * 0.5 GB is around 1 TB of RAM (not disk).
Then imagine we can somehow afford to give up this 1 TB (for whatever xlators are in there) and we go into production, and then we suddenly decide to double the capacity of the Gluster volume by simply doubling the number of nodes and, correspondingly, bricks. The most important thing here is that the client nodes would then implicitly use 2 TB of RAM. We would have to revise the memory consumption of the existing client nodes and possibly even redeploy them.
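
To spell the arithmetic out (a sketch; the 0.5 GB figure is the per-instance footprint measured above, the instance count is only the example used in this comment):

```python
# Back-of-envelope arithmetic for client-side memory (illustrative only).
gb_per_instance = 0.5      # observed RSS of one gfapi instance on a ~600-brick volume
gfapi_instances = 2000     # assumed total number of client processes using gfapi

total_tb = gb_per_instance * gfapi_instances / 1024
print(f"~{total_tb:.1f} TB of client RAM")              # ~1.0 TB

# If doubling the bricks roughly doubles the per-instance footprint,
# the same fleet would then need roughly twice as much:
print(f"~{2 * total_tb:.1f} TB after doubling bricks")  # ~2.0 TB
```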
So my conclusions are:

  1. This is the most critical architectural issue in distributed and scalable systems: server-side scale is implicitly exposed to the client side.
  2. When the product description states that GlusterFS "is designed to scale capacity", please don't forget to mention that horizontal scaling of server nodes implicitly requires vertical scaling of client nodes in terms of memory consumption.
  3. Clearly state in the product description that the memory consumption of the FUSE native client depends on the number of bricks in the GlusterFS volume and can be as much as 1 GB per 1000 bricks.
  4. Clearly state in the product description that gfapi is better than the FUSE native client in terms of performance, but its memory consumption is at the same level, multiplied by the number of gfapi instances on a single node. For instance, if 10 nginx worker processes each use gfapi, e.g. just to call list_with_stats, the memory consumption would be 10 GB. Spinning up 100 Docker containers with such nginx processes would result in 1 TB of RAM.
  5. Please state in the documentation what an affordable number of volume clients is.
  6. It is totally unclear why gfapi as a client is that "fat". Why does it consume 3 MB of memory per brick? What is this memory used for? If it's metadata, the numbers are just insane. If it's used for buffers, why does a client that is supposed to transfer data between application memory and a socket need any extra buffers at all? If it's inode specifics, why does gfapi need to know about inodes at all? Shouldn't that be a FUSE-related concern?

Is this all because the gfapi implementation is derived from the FUSE native client and inherited unnecessary stuff from it?

So I'd kindly ask you to comment on my concerns, so we could make the right decision on whether and how to use gfapi, or even GlusterFS in general.

@prashanthpai
Contributor

prashanthpai commented Sep 24, 2018

but gfapi is instantiated within a process, possibly even many times

Although that may be a valid use case, gfapi instances are isolated and cannot be shared across processes, unlike a FUSE mount. Usually gfapi consumers (NFS-Ganesha, QEMU, SMB) will have only one instance per volume per process throughout the lifetime of the process.
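
A minimal sketch of that pattern with the Python binding (assuming libgfapi-python; host, volume and path are placeholders):

```python
# Sketch of the usage described above: one gfapi instance per volume per
# process, created once and reused for the lifetime of the process.
from gfapi import Volume

# Created and mounted once at process start-up, not per request.
VOL = Volume("gluster-node", "myvol")
VOL.mount()

def handle_request(path):
    # Each request reuses the already-mounted instance instead of
    # paying the per-brick memory cost again with a new Volume.
    return VOL.listdir(path)
```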

Improvements have been made to make the client lightweight (less "fat", as you put it). See gluster/glusterfs#242. In that model, the client stack is split into two parts: a thin client and a proxy daemon. You'll have a very thin client residing in each process and one local daemon per machine which talks to the bricks. This is a trade-off that adds one additional hop. You can try that out.

If it's inode specifics, why does gfapi need to know about inodes at all? Shouldn't that be a FUSE-related concern?

Kindly note that libgfapi instances do not talk to FUSE. They maintain an inode table in memory and cache metadata for faster access.

Is this all because the gfapi implementation is derived from the FUSE native client and inherited unnecessary stuff from it?

The gfapi implementation isn't derived from FUSE, although they share some code.

I'm cc'ing libgfapi maintainers to see if they can better answer your queries and help you with other recommendations.
cc @nixpanic @poornimag @ShyamsundarR
