[bug:1478411] Directory listings on fuse mount are very slow due to small number of getdents() entries #910
Comments
Time: 20170804T14:02:07
Using the example program for getdents() from http://man7.org/linux/man-pages/man2/getdents.2.html and running it on my directory, I got this output (file names blacked out with "a"):

getdents(3, /* 16 entries */, 10240) = 1552

It seems that when the file names are longer (first block), getdents() returns fewer results: 16 in the above case instead of the usual 20. So I wonder if there's some fuse-related buffer that gets filled which results in getdents() returning so few entries, and whether I can adjust it somehow. |
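For reference, a minimal sketch of such a counter, along the lines of the man-page example: it calls getdents64() directly and prints bytes and entry count per syscall. BUF_SIZE, the hard-coded defaults and the on-disk struct layout follow getdents64(2); adapt as needed.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define BUF_SIZE 131072          /* a "large" buffer, like find uses */

struct linux_dirent64 {          /* record layout used by getdents64(2) */
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;
    char           d_name[];
};

int main(int argc, char *argv[])
{
    int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd < 0) { perror("open"); return 1; }

    char buf[BUF_SIZE];
    for (;;) {
        long nread = syscall(SYS_getdents64, fd, buf, sizeof(buf));
        if (nread <= 0)          /* 0 = end of directory, <0 = error */
            break;

        int entries = 0;
        for (long pos = 0; pos < nread; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + pos);
            pos += d->d_reclen;
            entries++;
        }
        /* One line per syscall, to compare with the strace output above. */
        printf("getdents64: %ld bytes, %d entries\n", nread, entries);
    }
    close(fd);
    return 0;
}
```

Running this against the FUSE mount and against the brick directory should show the same entries-per-call pattern discussed in this thread.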
Time: 20170804T14:39:53
[pid 18266] 1501856999.667820 read(10, "\27\3\3\0\34", 5) = 5

The read() syscalls that get the file names over from the gluster server happen in bursts, and the bursts are about 300 ms apart. (I'm using SSL, so the file names aren't visible in the read calls.) They happen in 16 KB buffers (but there are some strange read()s of size 5 in between, not sure what those are for), and looking at the timings they don't seem to be blocked on network roundtrips, so that's good. But the roughly 4K sized writev() syscalls each have a network roundtrip in between. That seems strange to me, given that we've just read() all the data before in a batch. What's going on there? |
Time: 20170804T14:46:38
[pid 18266] 1501857629.747928 read(10, "\357U>\245\325n\200#\360\22/9\370\205lL\322\226gk\233\255\2633[\10R\34j\334,'"..., 16408) = 16408 |
Time: 20170906T04:00:28 |
Time: 20170906T06:07:11 *** This bug has been marked as a duplicate of bug 1356453 *** |
Time: 20170906T23:01:18
$ gluster --version
$ for x in
$ strace find /myglustermountdir |
Time: 20170915T07:38:42
Thanks for the data. So here is my analysis and possible solution:
I would suggest we close this bug on Gluster and raise it with the FUSE kernel developers instead. As an alternative, we could implement glfs_getdents in libgfapi (the GlusterFS equivalent of the system call, which gets rid of the FUSE interference); that could be integrated into other applications, or we could write a wrapper and provide it as a command. |
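For anyone who wants to experiment with the libgfapi route before a dedicated glfs_getdents() wrapper exists, a directory listing that bypasses FUSE might look roughly like the sketch below. The volume name, server host and directory path are placeholders, and the existing glfs_readdir_r() from glfs.h is used in place of the proposed new call.

```c
/* Sketch: list a directory via libgfapi, bypassing the FUSE mount.
 * Build roughly like: gcc list-gfapi.c -o list-gfapi $(pkg-config --cflags --libs glusterfs-api)
 * "myvol", "gluster-server" and "/largenames-10k" are placeholders. */
#include <dirent.h>
#include <stdio.h>
#include <glusterfs/api/glfs.h>   /* header path may differ by version/distribution */

int main(void)
{
    glfs_t *fs = glfs_new("myvol");
    if (!fs) { perror("glfs_new"); return 1; }

    glfs_set_volfile_server(fs, "tcp", "gluster-server", 24007);
    if (glfs_init(fs) != 0) { perror("glfs_init"); return 1; }

    glfs_fd_t *fd = glfs_opendir(fs, "/largenames-10k");
    if (!fd) { perror("glfs_opendir"); return 1; }

    struct dirent de, *res;
    long count = 0;
    /* glfs_readdir_r() returns 0 and sets *res to NULL at end of directory. */
    while (glfs_readdir_r(fd, &de, &res) == 0 && res != NULL)
        count++;

    printf("%ld entries\n", count);

    glfs_closedir(fd);
    glfs_fini(fs);
    return 0;
}
```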
Time: 20170915T13:53:27
I think we should definitely bring this up as a FUSE issue; I can imagine Gluster and other FUSE-based software would have much better performance if they weren't limited to 4 KB per syscall. Do you know where this PAGE_SIZE limit is implemented, or would you even be able to file this issue? I know very little about FUSE internals and don't feel prepared to make a high-quality issue report on this topic with them yet.
You are right. For a test I just increased the latency tenfold. I had assumed it was a network roundtrip because the time spent per syscall is roughly my LAN network roundtrip (0.2 ms), but that just happened to be how slow the syscalls were, independent of the network.
Personally I would prefer if we could keep it open until getdents() performance is fixed; from a Gluster user's perspective, it is a Gluster problem that directory listings are slow, and the fact that FUSE plays a role in it is an implementation detail. I did a couple more measurements that suggest that there are still large integer factors unexplained.

Using:

$ gcc getdents-listdir.c -O2 -o listdir

Results for BUF_SIZE = 10240:

gluster fuse: getdents(3, /* 20 entries */, 10240) = 1040 <0.000199>

Results for BUF_SIZE = 131072:

gluster fuse: getdents(3, /* 20 entries */, 131072) = 1040 <0.000199>

This shows that, almost independent of BUF_SIZE, computing bytes per time:
That's almost a 40x performance difference (and, as you say, no networking is involved). Even when taking into account the mentioned 5x space overhead, why might an individual getdents() call be that much slower on FUSE? |
Time: 20170917T00:01:08
I have patched my kernel so that FUSE readdir can use 32 pages instead of 1 (on my branch https://github.com/nh2/linux/compare/v4.9-fuse-large-readdir). For sshfs (another FUSE program) this brings an immediate improvement.

With 1M files prepared like this:

mkdir sshfsmount-1M sshfsdir-1M

and my example program adapted from the getdents() man page (code here: https://gist.github.com/nh2/6ebd9d5befe130fd6faacd1024ead3d7), I get an immediate improvement for listing the directory.

Without kernel patch (1 page): 0.267021

(You have to run twice in quick succession to get these results, because sshfs discards its cache very quickly and we want to measure FUSE syscall overhead, not it fetching the data over sshfs. If you wait too long it may take a minute for a full fetch.)

That's a 1.5x speedup for the entire program run; but because sshfs does some initialisation work, we should look at the actual strace outputs instead.

Without kernel patch (1 page):

strace -tttT -f -e getdents ./listdir-silent sshfsmount-1M/
1505605898.572720 getdents(3, /* 128 entries */, 131072) = 4064 <47.414827>

With 32-page kernel patch:

strace -tttT -f -e getdents ./listdir-silent sshfsmount-1M/
1505605890.435250 getdents(3, /* 4096 entries */, 131072) = 130776 <60.054614>

Here you can see first the initial fetching work (depends on what's in SSHFS's cache at that point), and then the real syscalls. Using 32 pages has increased the bytes per syscall by 32x, and the time by ~5x, so it's approximately 6x faster. So landing such a patch seems beneficial to remove syscall overhead. It would certainly help SSHFS cached directory listings. Next, back to gluster. |
Time: 20170917T00:38:24
As for glusterfs, another possible issue came up. The buffer that a given xlator fills with dirents is of fixed size (during the handling of a given readdir[p] fop). However, various xlators operate with various dirent flavors (e.g. posix with system dirents, fuse with fuse dirents), so when the dirent-holding buffer is passed around between xlators, the converted dirents won't fill the next xlator's buffer optimally. In practice, the fuse dirent is bigger, so not all of the entries received from the underlying xlator will fit in fuse's dirent buffer after conversion. The rest will be discarded and will have to be read again on the next getdents call.
Alas, the phenomenon described above defeats this too: because of the re-read constraint, the directory offsets of subsequent getdents calls won't be monotonic, upon which readdir-ahead deactivates itself. Whether, and at what rate, this occurs might depend on the configuration. @nh2: We'd therefore like to kindly ask you to share your volume info and TRACE-level logs, from mount until you observe the small getdents() calls. |
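To put rough numbers on "the fuse dirent is bigger": assuming the FUSE readdirplus wire format (a 128-byte struct fuse_entry_out plus a 24-byte dirent header per name, 8-byte aligned) versus the linux_dirent64 records returned by getdents64() (about 19 bytes plus the NUL-terminated name, 8-byte aligned), a back-of-the-envelope calculation shows how differently a fixed-size buffer fills for the two dirent flavors. Treat the constants as approximations of those struct layouts, not as authoritative protocol documentation.

```c
/* Back-of-the-envelope: how many directory entries of a given name length
 * fit into a 4096-byte buffer, for two dirent flavors.  Header sizes
 * (19 bytes for linux_dirent64, 24 + 128 bytes for a FUSE readdirplus
 * entry) are approximations of the respective struct layouts. */
#include <stdio.h>

static size_t align8(size_t n) { return (n + 7) & ~(size_t)7; }

int main(void)
{
    const size_t buf = 4096;     /* one page, the old FUSE readdir limit */
    for (size_t namelen = 10; namelen <= 50; namelen += 20) {
        size_t ldirent = align8(19 + namelen + 1);    /* getdents64 record     */
        size_t fplus   = align8(128 + 24 + namelen);  /* fuse_direntplus record */
        printf("namelen %2zu: %3zu getdents64 entries vs %2zu readdirplus entries per 4 KB\n",
               namelen, buf / ldirent, buf / fplus);
    }
    return 0;
}
```

With ~50-character names this gives roughly 19-20 readdirplus entries per 4 KB, which is in the ballpark of the ~20 entries per getdents() call observed on the mount.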
Time: 20170917T01:57:39
Now running a similar test on gluster, for time's sake with only 10K files. My gluster config:

Volume Name: myvol

Files created like:

touch /glustermount/largenames-10k/1234567890123456789012345678901234567890-file{1..10000}

Also doing quickly repeated runs to obtain these numbers.

Without kernel patch (1 page):

strace -w -c -f -e getdents ./listdir-silent /glustermount/largenames-10k/
% time     seconds  usecs/call     calls    errors syscall
100.00    0.213868         384       557           getdents

With 32-page kernel patch:

strace -w -c -f -e getdents ./listdir-silent /glustermount/largenames-10k
% time     seconds  usecs/call     calls    errors syscall
100.00    0.211732       11763        18           getdents

Almost no improvement! Let's look at the individual getdents() invocations.

Without kernel patch (1 page):

strace -Tttt -f -e getdents ./listdir-silent /glustermount/largenames-10k/
1505608150.612771 getdents(3, /* 19 entries */, 131072) = 1272 <0.007789>

With 32-page kernel patch:

strace -Tttt -f -e getdents ./listdir-silent /glustermount/largenames-10k/
1505608076.391872 getdents(3, /* 604 entries */, 131072) = 43392 <0.022552>

Observations:
So I started investigating what it (the glusterfs FUSE mount process) is doing. First thing I noticed: I was wrong when I said, after my earlier latency test, that no networking was involved.
What is certainly happening is localhost networking! If I use:

tc qdisc add dev lo root netem delay 200ms

thus making my getdents() calls much slower:

strace -Tttt -f -e getdents ./listdir-silent /mount/glustermount-10k
1505580070.060286 getdents(3, /* 604 entries */, 131072) = 43392 <0.824180>

So there is networking happening; in my case it just happened over the loopback interface.

So I ran strace against the glusterfs FUSE mount process:

strace -tttT -f -p THEPID

What I could see was a FUSE request/response loop: as a result, the fulfillment of each getdents() request is bracketed between a read() of the FUSE request and the write of the reply. I have posted one of those brackets at https://gist.github.com/nh2/163ffea5bdc16b3a509c4b262b1d382a

Each such getdents() fulfillment takes ~29 ms. I find that quite a lot just to fill in the ~40000 bytes above, so I analysed what the major time sinks are within these 29 ms in this strace output: https://gist.github.com/nh2/163ffea5bdc16b3a509c4b262b1d382a#gistcomment-2205164

Quite some CPU time is spent (again, seemingly way too much for providing 40000 bytes), but also significant time is spent waiting on communication with the glusterfsd socket (which is FD 10), with poll() and apparently a blocking write(). (There are also some precisely sized reads, such as those 5-byte reads, for which I wonder if they could be combined with larger reads to reduce the number of syscalls, but they are so fast compared to the rest of what's going on that they don't matter for the current investigation.)

As there's lots of waiting for glusterfsd in there, I started to strace glusterfsd instead, while running the ./listdir-silent test. That immediately revealed why getdents() on glusterfs is so much slower than on SSHFS (sorry, long lines):

[pid 972] 1505589830.074176 lstat("/data/brick/.glusterfs/00/00/00000000-0000-0000-0000-000000000001/largenames-10k/1234567890123456789012345678901234567890-file7552", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.000007>

Apparently when gluster tries to fulfil a getdents() FUSE request, each individual file returned by my XFS brick file system is stat()ed afterwards and also gets 6 lgetxattr() calls (all returning ENODATA). I guess that's just part of how gluster works (e.g. to determine based on xattrs whether a file shall actually be shown in the output or not), but it wasn't obvious to me when I started debugging this, and it certainly makes the business much slower than just forwarding some underlying XFS getdents() results.

As mentioned in my Github gist from before, in contrast to stracing, perf has lower overhead, so I switched to measuring with:

perf record -e 'syscalls:sys_*' -p 918

which had 6x overhead for an entire ./listdir-silent run over the 10k files, and then, for lower overhead, specific syscalls (since by then I already knew what syscalls were going on, so I couldn't miss any in between):

perf record -e 'syscalls:sys_enter_newlstat' -e 'syscalls:sys_exit_newlstat' -e 'syscalls:sys_enter_lgetxattr' -e 'syscalls:sys_exit_lgetxattr' -p 918
perf script

which shows much less overhead in the profile:

26239.494406: syscalls:sys_enter_newlstat

and also increased total run time only by ~2x, to ~0.5 seconds.

I suspect that the profile shows that the number of syscalls gluster makes on each file (and the fact that it has to make per-file syscalls at all for getdents()) is somewhat problematic: each file eats ~18 us, that's roughly 50K files per second, or 100K per second assuming 2x profiling overhead. I wonder if some future version of Gluster should not store all its metadata info in extended attributes, as putting it there requires lots of syscalls to retrieve it?
I guess this is one reason for Gluster's low small-file and directory performance? I suspect that if it had all this info in memory, or could obtain it via a method that uses few system calls and large reads (e.g. a single index file or even mmap), it could do better at this. |
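To get a feel for that ~18 µs-per-file figure independently of Gluster, here is a small microbenchmark that replays the per-file pattern seen in the brick strace (one lstat() plus six lgetxattr() calls) against an arbitrary directory. The attribute name user.nonexistent is a placeholder; the real brick queries gluster-specific xattrs not shown in this thread, but for timing purposes any lookup that returns ENODATA will do.

```c
/* Microbenchmark: replay the per-file syscall pattern seen in the brick
 * strace (one lstat() plus six lgetxattr() calls) over a directory and
 * report the average cost per file. */
#define _GNU_SOURCE
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/xattr.h>
#include <time.h>

int main(int argc, char *argv[])
{
    const char *dir = argc > 1 ? argv[1] : ".";
    DIR *dp = opendir(dir);
    if (!dp) { perror("opendir"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    long files = 0;
    char path[4096];
    char value[256];
    struct dirent *de;
    struct stat st;
    while ((de = readdir(dp)) != NULL) {
        if (de->d_name[0] == '.')          /* skip ".", ".." and hidden files */
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        lstat(path, &st);
        for (int i = 0; i < 6; i++)        /* six xattr lookups, as in the strace */
            lgetxattr(path, "user.nonexistent", value, sizeof(value));
        files++;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    closedir(dp);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%ld files, %.3f s total, %.1f us per file\n",
           files, secs, files ? secs * 1e6 / files : 0.0);
    return 0;
}
```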
Time: 20170917T01:59:41 |
Time: 20170917T02:02:18
strace -wc -f -e getdents,lstat ls -1U /glustermount/largenames-10k/ > /dev/null
% time     seconds  usecs/call     calls    errors syscall
100.00    0.208856       11603        18           getdents
100.00    0.208856                    18           total

strace -wc -f -e getdents,lstat ls -lU /glustermount/largenames-10k/ > /dev/null
% time     seconds  usecs/call     calls    errors syscall
 74.99    0.168202          17     10001           lstat
100.00    0.224291                 10019           total

Notice the difference between ls -1U and ls -lU: for some weird reason, when I also stat() as above with ls -lU, much less time is spent in getdents(). It would make sense to me if the presence of getdents() made the stat()s faster (e.g. as part of some prefetching), but not the other way around. What might be going on here? |
Time: 20170917T02:10:44
@csaba: OK, I have read your comment now. My volume info is in the post further above; would you mind briefly describing, or linking to, how I can set the log level to TRACE, and which exact file(s) I should provide? |
Time: 20170918T06:20:28 Meanwhile, you mentioned stat and getxattrs are being performed, so we have stat-prefetch which caches the stat and xattrs of a file/dir. Please execute the following on your test system and see if you get improvements: $ gluster vol set group metadata-cache Also there are few patches(WIP), that were a result of debugging performance issues in readdir: I suggest to try these patches if possible and the metadata-cache to see if the performance reaches any satisfactory levels? |
Time: 20170918T08:17:43
Please set the following options to get logs at TRACE log-level:

gluster volume set diagnostics.client-log-level TRACE
gluster volume set diagnostics.brick-log-level TRACE

After this, run your tests and attach the logfiles of the gluster mount and brick processes (usually found in /var/log/glusterfs/ and /var/log/glusterfs/bricks). |
Time: 20170918T10:38:47 |
Time: 20170918T10:42:18 |
Time: 20170918T11:20:06
Note that since the volume configuration used here has just one dht subvolume (1x3), patch #18312 won't affect the test case used in this bz. Since I happened to notice the issue in dht while looking into this bz, I used this bz. With hindsight, I should've realized this bug is not a duplicate of bz 1356453. But again, I couldn't find the volume info. As long as dht has a single subvolume, it's unlikely that dht affects readdir(p) performance. |
Time: 20170919T09:44:32
As you've noted earlier, this will be equal to the default readdir-ahead request size. I wonder what the behaviour would be if we increased the readdir-ahead request size by a factor of 10. You can use the following option:

Option: performance.rda-request-size

gluster volume set performance.rda-request-size 1310720 |
Time: 20171121T00:26:56 |
Time: 20171206T09:57:18 |
Time: 20171206T10:09:38 |
Time: 20171208T05:35:27 ... in readdirp response if dentry points to a directory inode. This
(cherry picked from commit 59d1cc7) |
Time: 20171208T05:37:16 Superfluous dentries that cannot fit in the buffer size provided by
So, the best strategy would be to fill the buffer optimally - neither
|
Time: 20171219T07:17:49
glusterfs-3.12.4 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-devel/2017-December/054093.html |
Time: 20171229T02:03:05
I still get getdents() calls returning only ~20 entries:

getdents(3, /* 20 entries */, 131072) = 1040

The patch above also speaks of reduced CPU usage, which I haven't tested yet. I also haven't tested with my kernel patch that allows FUSE readdir() to use more than 1 page; will retest with that. |
Time: 20171229T03:36:07 |
Time: 20171229T03:38:57
This is during the following run:

strace -ftttT -e getdents ls -1U largenames-10k/ > /dev/null
1514518214.946014 getdents(3, /* 604 entries */, 131072) = 43392 <0.220178>

This operation is bracketed between setting the logging level from INFO to TRACE, and back again. |
Time: 20171229T04:41:20
None of them have an effect on reducing the total time spent in getdents() below what's shown in the strace above. |
Time: 20180102T06:43:31
Thanks for your persistent efforts. We do acknowledge that readdirplus on bricks is a bottleneck, but the effort required to improve that is significant, and hence no major effort is currently in progress. However, the translator readdir-ahead tries to hide the readdirplus latency on bricks from the application by initiating a stream of readdirplus requests (starting when the application does an opendir), even before the application has done a readdir. So, I am very much interested in your results from tests done in a setup with both of the following suggestions:
|
Time: 20180102T10:49:03
I can't set rda-request-size that high, because:

volume set: failed: '1310720' in 'option rda-request-size 1310720' is out of range [4096 - 131072]

Can I just bump that limit in readdir-ahead.c? |
Time: 20180102T12:20:17 OK, so with the rda-request-size limit bumped in readdir-ahead.c, and set to 1310720 on the volume, I get:
That seems to do something! The getdents() calls still return the same ~40k bytes as before, but the durations of the calls have changed: where before every ~2nd getdents took 0.1 seconds, a long call now happens only at the beginning for largenames-10k/. However, that initial call now takes much longer, so that the total time for all getdents() calls together remains the same as in comment #28 and there is no overall time improvement (see another strace further down this post that shows this). With largenames-100k:
That has changed too: only around every 10th call takes 0.1 seconds. But in sum, this still takes as much time as it did before, just that it's now spent in larger clusters. Another thing I notice is that performance goes bad when the directory is written to in parallel with the getdents calls.

I have another question about what you said: you mentioned that the "translator readdir-ahead tries to hide the readdirplus latency on bricks". Is readdir latency on bricks really an issue? Comparing directly the impact of rda-request-size:

With rda-request-size 1310720 (10x higher than default):
With rda-request-size 131072 (default):
In the above we can see that the same total time is spent, but with the larger request size it is concentrated in fewer, longer calls. That begs the question: where does the time go?

My suspicion: Gluster is burning this time as CPU cycles. The current TIME+ shown in htop for glusterfs and glusterfsd is:

5:47.62 glusterfs

and not moving. After I run the listing:

5:48.51 glusterfs

In relative times, the increase is:

0.89 glusterfs

This is a very large chunk of the 3.28 seconds that the whole listing takes. Also consider: the total amount of data returned by getdents() for largenames-100k/ is around 7 MB. glusterfsd spends 1.81 seconds of CPU time to produce these 7 MB. That makes ~3.8 MB/s CPU throughput, which is not a reasonable number for CPU processing throughput. Gluster must be doing something extremely inefficient to spend so much CPU time on so little data. |
Time: 20180117T08:00:42
Glusterfs does the following things on a brick during a readdirplus call (a readdirplus - like NFS readdirplus - is a readdir plus collection of metadata like xattrs and stat for each dentry, so that the kernel doesn't have to do a lookup/stat for each dentry; by default every readdir is converted to readdirplus):
So, there is list processing involved here. Note also that this list processing happens more than once, at different layers. This list processing could be the reason for the CPU usage.

Experiment 1 - Isolate the overhead required to access metadata of each dentry:
Experiment 2 - Isolate the overhead of iterating through the list of dentries:
In this experiment we'll fetch the metadata of each dentry, but without using readdirplus. As I mentioned above, different xlators traverse the list in the scope of a single readdirplus call, so there are multiple iterations. However, we can still access the metadata of each dentry without using readdirplus. In this case the kernel initiates lookup/stat calls for each dentry it fetched through readdir. The crucial difference is that the list processing in glusterfs is minimal, as most of the xlators are not interested in plain dentries (they are interested in the inode pointed to by the dentry; since readdir doesn't carry inode/stat information, they just pass the list on to higher layers without processing it).
We can compare results from these two experiments with the information provided in comment #35. [1] http://lists.gluster.org/pipermail/gluster-users/2017-March/030148.html |
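A rough client-side analogue of the two experiments, sketched below under the assumption that the access patterns are what matters (it is not the exact procedure described above): pass 1 reads only the dentries, pass 2 additionally stat()s every entry, so T2 - T1 approximates the per-dentry metadata cost (the "n * lookup latency" term used later in this thread). As discussed earlier, run it twice in quick succession so caches are warm, otherwise the numbers are dominated by the first fetch.

```c
/* Two timed passes over a (mounted) directory: dentries only, then
 * dentries plus a stat() per entry. */
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

static double timed_scan(const char *dir, int do_stat)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    DIR *dp = opendir(dir);
    if (!dp) { perror("opendir"); return -1; }

    struct dirent *de;
    struct stat st;
    while ((de = readdir(dp)) != NULL) {
        if (do_stat)
            fstatat(dirfd(dp), de->d_name, &st, AT_SYMLINK_NOFOLLOW);
    }
    closedir(dp);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(int argc, char *argv[])
{
    const char *dir = argc > 1 ? argv[1] : ".";
    double t1 = timed_scan(dir, 0);   /* dentries only       (T1) */
    double t2 = timed_scan(dir, 1);   /* dentries + stat     (T2) */
    printf("T1 (readdir only)   = %.3f s\n"
           "T2 (readdir + stat) = %.3f s\n"
           "T2 - T1             = %.3f s\n", t1, t2, t2 - t1);
    return 0;
}
```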
Time: 20180117T08:02:47
We can use "find . > /dev/null" to do this.
|
Time: 20180117T08:19:44
But this adds extra latency for a lookup call/response to traverse the glusterfs stack and network for each dentry, and hence skews the results. So the delta (as compared to the numbers we saw in comment #35) that we actually see is:

delta (D) = latency of readdirplus (R) + n * lookup latency (L) - time spent processing the dentry list in various xlators (I)

where n is the number of dentries in the list. We can deduce n * L using the results from experiment 1:

n * L = total time taken in experiment 2 (T2) - total time taken in experiment 1 (T1)

So:

I = R + (T2 - T1) - D
|
Time: 20181023T14:54:32 |
Thank you for your contributions. |
I'm no longer using gluster, so I cannot tell whether this bug I originally filed still exists, but in the interest of helping the project (given that I used it for a while and it brought me great value): auto-closing bots are dumb, an anti-productivity feature, and an admission of not having a proper bug triage process. Design problems like this do not disappear because nobody has commented on them, and nobody will bother commenting every 6 months to keep them alive. Using bots like this, you will lose accurate information about what the unsolved problems are, and frustrate community contributors (I've seen it many times; it is common for active Github users to be subscribed to thousands of issues, and if a bot pings each of them every 6 months, this creates tens of useless emails per day). |
Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it. |
That's not true, I posted...
Normal people cannot reopen issues, so I cannot. Whoever decided to use the stale bot could have considered how Github works first. |
Not sure why stalebot didn't consider your comment! Will surely fix that part! |
Thanks for the feedback. I will raise this in community meeting |
I still have this issue, any hope of fixing it? |
Thank you for your contributions. |
Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it. |
URL: https://bugzilla.redhat.com/1478411
Creator: nh2-redhatbugzilla at deditus.de
Time: 20170804T13:56:14
I have a GlusterFS 3.10 volume and mounted it with the fuse mount (mount -t glusterfs), both on Linux. On it I have a directory with 1 million files in it. It takes very long to find /that/directory. Using strace, I believe I discovered (at least part of) the reason:

1501854600.235524 getdents(4, /* 20 entries */, 131072) = 1048
1501854600.235727 getdents(4, /* 20 entries */, 131072) = 1032
1501854600.235922 getdents(4, /* 20 entries */, 131072) = 1032

Despite find issuing getdents() with a large buffer size of 128K, glusterfs always only fills in 20 directory entries. Each of those takes a network roundtrip (seemingly).
I also strace'd the brick on the server, where everything seems fine: There getdents() returns typically 631 entries, filling the 32KB buffer which the brick implementation uses for getdents().
If the find on the fuse mount could also get ~631 entries per call, my directory listing would probably be 30x faster!
So it seems like something in gluster or fuse caps the number of getdents results per call to roughly 20.
What could that be?