Reduce the overhead of IndexInput#prefetch when data is cached in RAM. #13381
Conversation
As Robert pointed out and benchmarks confirmed, there is some (small) overhead to calling `madvise` via the foreign function API; benchmarks suggest it is on the order of 1-2µs. This is not much for a single call, but may become non-negligible across many calls. Until now, we have only looked into using prefetch() for terms, skip data and postings start pointers, which amount to a single prefetch() operation per segment per term. But we may want to start using it in cases that could result in many more calls to `madvise`, e.g. if we start using it for stored fields and a user requests 10k documents. In apache#13337, Robert wondered if we could take advantage of `mincore()` to reduce the overhead of `IndexInput#prefetch()`, which is what this PR does. For now, it avoids adding new APIs. Instead, `IndexInput#prefetch` tracks consecutive hits on the page cache and calls `madvise` less and less frequently under the hood as the number of cache hits increases.
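To make the counter-based backoff concrete, here is a hedged sketch (the `NativeAccess` interface and all names below are assumptions for self-containedness, not the PR's exact code):

import java.io.IOException;
import java.lang.foreign.MemorySegment;

// Illustrative sketch only: the NativeAccess shape (mincore, madviseWillNeed)
// is assumed, not copied from this PR.
final class PrefetchBackoffSketch {
  private int consecutiveCacheHits; // consecutive prefetches that found pages resident

  void prefetch(NativeAccess nativeAccess, MemorySegment segment, long offset, long length)
      throws IOException {
    // Re-check residency only when the counter reaches one below a power of two
    // (0, 1, 3, 7, 15, ...), so madvise/mincore overhead shrinks as hits accumulate.
    if (Integer.bitCount(consecutiveCacheHits + 1) != 1) {
      consecutiveCacheHits++;
      return;
    }
    MemorySegment slice = segment.asSlice(offset, length);
    if (nativeAccess.mincore(slice)) {
      consecutiveCacheHits++; // already in the page cache, no need to advise the kernel
    } else {
      consecutiveCacheHits = 0; // cache miss: reset the backoff and prefetch
      nativeAccess.madviseWillNeed(slice);
    }
  }

  // Assumed interface so the sketch is self-contained.
  interface NativeAccess {
    boolean mincore(MemorySegment segment) throws IOException; // true if pages are resident
    void madviseWillNeed(MemorySegment segment) throws IOException;
  }
}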
I slightly modified the benchmark from #13337:

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.MMapDirectory;
public class PrefetchBench {
private static final int NUM_TERMS = 3;
private static final long FILE_SIZE = 100L * 1024 * 1024 * 1024; // 100GB
private static final int NUM_BYTES = 16;
public static int DUMMY;
public static void main(String[] args) throws IOException {
Path filePath = Paths.get(args[0]);
Path dirPath = filePath.getParent();
String fileName = filePath.getFileName().toString();
Random r = ThreadLocalRandom.current();
try (Directory dir = new MMapDirectory(dirPath)) {
if (Arrays.asList(dir.listAll()).contains(fileName) == false) {
try (IndexOutput out = dir.createOutput(fileName, IOContext.DEFAULT)) {
byte[] buf = new byte[8196];
for (long i = 0; i < FILE_SIZE; i += buf.length) {
r.nextBytes(buf);
out.writeBytes(buf, buf.length);
}
}
}
for (boolean dataFitsInCache : new boolean[] { false, true}) {
try (IndexInput i0 = dir.openInput(fileName, IOContext.DEFAULT)) {
byte[][] b = new byte[NUM_TERMS][];
for (int i = 0; i < NUM_TERMS; ++i) {
b[i] = new byte[NUM_BYTES];
}
IndexInput[] inputs = new IndexInput[NUM_TERMS];
if (dataFitsInCache) {
// 16MB slice that should easily fit in the page cache
inputs[0] = i0.slice("slice", 0, 16 * 1024 * 1024);
} else {
inputs[0] = i0;
}
for (int i = 1; i < NUM_TERMS; ++i) {
inputs[i] = inputs[0].clone();
}
final long length = inputs[0].length();
List<Long>[] latencies = new List[2];
latencies[0] = new ArrayList<>();
latencies[1] = new ArrayList<>();
for (int iter = 0; iter < 100_000; ++iter) {
final boolean prefetch = (iter & 1) == 0;
final long start = System.nanoTime();
for (IndexInput ii : inputs) {
final long offset = r.nextLong(length - NUM_BYTES);
ii.seek(offset);
if (prefetch) {
ii.prefetch(offset, 1);
}
}
for (int i = 0; i < NUM_TERMS; ++i) {
inputs[i].readBytes(b[i], 0, b[i].length);
}
final long end = System.nanoTime();
// Prevent the JVM from optimizing away the reads
DUMMY = Arrays.stream(b).mapToInt(Arrays::hashCode).sum();
latencies[iter & 1].add(end - start);
}
latencies[0].sort(null);
latencies[1].sort(null);
System.out.println("Data " + (dataFitsInCache ? "fits" : "does not fit") + " in the page cache");
long prefetchP50 = latencies[0].get(latencies[0].size() / 2);
long prefetchP90 = latencies[0].get(latencies[0].size() * 9 / 10);
long prefetchP99 = latencies[0].get(latencies[0].size() * 99 / 100);
long noPrefetchP50 = latencies[1].get(latencies[1].size() / 2);
long noPrefetchP90 = latencies[1].get(latencies[1].size() * 9 / 10);
long noPrefetchP99 = latencies[1].get(latencies[1].size() * 99 / 100);
System.out.println(" With prefetching: P50=" + prefetchP50 + "ns P90=" + prefetchP90 + "ns P99=" + prefetchP99 + "ns");
System.out.println(" Without prefetching: P50=" + noPrefetchP50 + "ns P90=" + noPrefetchP90 + "ns P99=" + noPrefetchP99 + "ns");
}
}
}
}
}

It gives the following results. Before the change:
After the change:
Numbers look great. I like the simple solution here to lower the overhead for when things fit in RAM. Let's try MemorySegment.isLoaded(), and if performance is similar, we can avoid maintaining our own native mincore plumbing.
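For illustration, a minimal sketch of that substitution (the helper and its placement are assumptions, not code from this PR):

import java.lang.foreign.MemorySegment;

class ResidencyCheck {
  // MemorySegment#isLoaded() is a best-effort residency check; on Unix the JDK
  // implements it with mincore(), so no custom native plumbing is needed.
  static boolean isResident(MemorySegment segment, long offset, long length) {
    return segment.asSlice(offset, length).isLoaded();
  }
}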
    }
    return true;
  }
}
This can be replaced with MemorySegment.isLoaded(), which does the exact same thing in OpenJDK via C code?
      // on the next power of two of the counter.
      return;
    }

    if (NATIVE_ACCESS.isEmpty()) {
I would move the native access check to the top.
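i.e. roughly this shape (a sketch; the rest of the method body is elided):

public void prefetch(long offset, long length) throws IOException {
  if (NATIVE_ACCESS.isEmpty()) {
    return; // no native access on this platform: prefetch is a no-op, do no other work
  }
  // ... bounds checks, cache-hit backoff, madvise ...
}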
@@ -344,7 +354,11 @@ public void prefetch(long offset, long length) throws IOException {
    }

    final MemorySegment prefetchSlice = segment.asSlice(offset, length);
    nativeAccess.madviseWillNeed(prefetchSlice);
    if (nativeAccess.mincore(prefetchSlice) == false) {
Like Robert said: MemorySegment::isLoaded is doing that. It also works on zero-length / without-address segments (see the "unmapper" check; if there is no unmapper it is not an mmapped segment and it has no address). The actual SCOPED_MEMORY_ACCESS.isLoaded() call uses mincore():
- https://github.com/openjdk/jdk/blob/b92bd671835c37cff58e2cdcecd0fe4277557d7f/src/java.base/share/classes/jdk/internal/misc/X-ScopedMemoryAccess.java.template#L239-L258
- https://github.com/openjdk/jdk/blob/b92bd671835c37cff58e2cdcecd0fe4277557d7f/src/java.base/share/classes/java/nio/Buffer.java#L903-L906
- https://github.com/openjdk/jdk/blob/b92bd671835c37cff58e2cdcecd0fe4277557d7f/src/java.base/share/classes/java/nio/MappedMemoryUtils.java#L36-L46
- and finally the C code, e.g. for the UNIX flavour: https://github.com/openjdk/jdk/blob/b92bd671835c37cff58e2cdcecd0fe4277557d7f/src/java.base/unix/native/libnio/MappedMemoryUtils.c#L43-L80
So it is effectively a copy of your code, in a mix of native and Java code!
Should we look into replacing our native code with a call to MemorySegment#load() in a virtual thread?
> Should we look into replacing our native code with a call to MemorySegment#load() in a virtual thread?
Let's keep it with pure madvise. A virtual thread is not a good idea because the "touch every page" code inside the JVM is not suitable for a virtual thread, as it is CPU-bound.
@@ -32,6 +32,9 @@ abstract class NativeAccess {
   */
  public abstract void madviseWillNeed(MemorySegment segment) throws IOException;

  /** Returns {@code true} if pages from the given {@link MemorySegment} are resident in RAM. */
revert
@@ -17,6 +17,7 @@
 package org.apache.lucene.store;

 import java.io.IOException;
 import java.lang.foreign.Arena;
revert
P.S.: Actually, when looking at the code, the only problem with MemorySegment#load() is this: after doing the madvise, it touches a byte in each page to actually trigger the load synchronously. So we have to stay with our direct native call here.
Somewhat related: I was playing around with the new cachestat() system call: https://lwn.net/Articles/917096/

You can play around with it easily on Linux 6.x from the command line:

$ fincore --output-all myindexdir/*
Maybe if we didn't close the fd in MMapDirectory we could eventually think about making use of this on modern Linux. It doesn't have a glibc wrapper yet... here is minimal sample code:

#include <sys/syscall.h>
#include <linux/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int
cachestat(int fd, struct cachestat_range *range, struct cachestat *stats, int flags) {
return syscall(SYS_cachestat, fd, range, stats, flags);
}
int main(int argc, char **argv) {
int fd;
if (argc != 2) {
printf("usage: %s <file>\n", argv[0]);
return 2;
}
if ((fd = open(argv[1], O_RDONLY)) < 0) {
perror("couldn't open");
return 1;
}
struct cachestat_range range = { 0, 0 };
struct cachestat cstats;
if (cachestat(fd, &range, &cstats, 0) != 0) {
perror("couldn't cachestat");
return 1;
}
printf("cached: %llu\ndirty: %llu\nwriteback: %llu\nevicted: %llu\nrecently_evicted: %llu\n",
cstats.nr_cache, cstats.nr_dirty, cstats.nr_writeback, cstats.nr_evicted, cstats.nr_recently_evicted);
return 0;
}
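For reference, assuming a Linux 6.1+ kernel whose headers define SYS_cachestat, the sample above should build and run with something like `cc -o cachestat_demo cachestat.c && ./cachestat_demo some-index-file` (names illustrative). The zero-length range asks the kernel about the whole file, so it reports cached/dirty/writeback/evicted page counts for all of it.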
Looks good now. Let's just think about the counter and clones.
This is an interesting idea! I was discussing this potential problem with @tveasey the other day. With terms and postings, we're currently only looking into loading a few pages in parallel per search thread, and we then use them immediately. With GBs of capacity for the page cache, it would be extremely unlikely for these pages to get evicted in the meantime. But if/when we start looking into using
I added "search" concurrency to the benchmark to make it a bit more realistic:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.util.ThreadInterruptedException;
public class PrefetchBench {
private static final int CONCURRENCY = 10;
private static final int NUM_TERMS = 3;
private static final long FILE_SIZE = 100L * 1024 * 1024 * 1024; // 100GB
private static final int NUM_BYTES = 16;
public static int DUMMY;
public static void main(String[] args) throws Exception {
Path filePath = Paths.get(args[0]);
Path dirPath = filePath.getParent();
String fileName = filePath.getFileName().toString();
Random r = ThreadLocalRandom.current();
try (Directory dir = new MMapDirectory(dirPath)) {
if (Arrays.asList(dir.listAll()).contains(fileName) == false) {
try (IndexOutput out = dir.createOutput(fileName, IOContext.DEFAULT)) {
byte[] buf = new byte[8196];
for (long i = 0; i < FILE_SIZE; i += buf.length) {
r.nextBytes(buf);
out.writeBytes(buf, buf.length);
}
}
}
for (boolean dataFitsInCache : new boolean[] { false, true}) {
try (IndexInput i0 = dir.openInput(fileName, IOContext.DEFAULT)) {
final IndexInput input;
if (dataFitsInCache) {
// 16MB slice that should easily fit in the page cache
input = i0.slice("slice", 0, 16 * 1024 * 1024);
} else {
input = i0;
}
final CountDownLatch latch = new CountDownLatch(1);
RandomReader[] readers = new RandomReader[CONCURRENCY];
for (int i = 0; i < readers.length; ++i) {
IndexInput[] inputs = new IndexInput[NUM_TERMS];
for (int j = 0; j < inputs.length; ++j) {
inputs[j] = input.clone();
}
readers[i] = new RandomReader(inputs, latch);
readers[i].start();
}
latch.countDown();
List<Long> prefetchLatencies = new ArrayList<>();
List<Long> noPrefetchLatencies = new ArrayList<>();
for (RandomReader reader : readers) {
reader.join();
prefetchLatencies.addAll(reader.latencies[0]);
noPrefetchLatencies.addAll(reader.latencies[1]);
}
prefetchLatencies.sort(null);
noPrefetchLatencies.sort(null);
System.out.println("Data " + (dataFitsInCache ? "fits" : "does not fit") + " in the page cache");
long prefetchP50 = prefetchLatencies.get(prefetchLatencies.size() / 2);
long prefetchP90 = prefetchLatencies.get(prefetchLatencies.size() * 9 / 10);
long prefetchP99 = prefetchLatencies.get(prefetchLatencies.size() * 99 / 100);
long noPrefetchP50 = noPrefetchLatencies.get(noPrefetchLatencies.size() / 2);
long noPrefetchP90 = noPrefetchLatencies.get(noPrefetchLatencies.size() * 9 / 10);
long noPrefetchP99 = noPrefetchLatencies.get(noPrefetchLatencies.size() * 99 / 100);
System.out.println(" With prefetching: P50=" + prefetchP50 + "ns P90=" + prefetchP90 + "ns P99=" + prefetchP99 + "ns");
System.out.println(" Without prefetching: P50=" + noPrefetchP50 + "ns P90=" + noPrefetchP90 + "ns P99=" + noPrefetchP99 + "ns");
}
}
}
}
private static class RandomReader extends Thread {
private final IndexInput[] inputs;
private final CountDownLatch latch;
private final byte[][] b = new byte[NUM_TERMS][];
final List<Long>[] latencies = new List[2];
RandomReader(IndexInput[] inputs, CountDownLatch latch) {
this.inputs = inputs;
this.latch = latch;
latencies[0] = new ArrayList<>();
latencies[1] = new ArrayList<>();
for (int i = 0; i < NUM_TERMS; ++i) {
b[i] = new byte[NUM_BYTES];
}
}
@Override
public void run() {
try {
latch.await();
final ThreadLocalRandom r = ThreadLocalRandom.current();
final long length = inputs[0].length();
for (int iter = 0; iter < 100_000; ++iter) {
final boolean prefetch = (iter & 1) == 0;
final long start = System.nanoTime();
for (IndexInput ii : inputs) {
final long offset = r.nextLong(length - NUM_BYTES);
ii.seek(offset);
if (prefetch) {
ii.prefetch(offset, 1);
}
}
for (int i = 0; i < NUM_TERMS; ++i) {
inputs[i].readBytes(b[i], 0, b[i].length);
}
final long end = System.nanoTime();
// Prevent the JVM from optimizing away the reads
DUMMY = Arrays.stream(b).mapToInt(Arrays::hashCode).sum();
latencies[iter & 1].add(end - start);
}
} catch (IOException e) {
throw new UncheckedIOException(e);
} catch (InterruptedException e) {
throw new ThreadInterruptedException(e);
}
}
}
}

On the latest version of this PR, it reports:
vs. the following on
Looks good, let's make incremental progress!
Reduce the overhead of IndexInput#prefetch when data is cached in RAM. (apache#13381)

As Robert pointed out and benchmarks confirmed, there is some (small) overhead to calling `madvise` via the foreign function API; benchmarks suggest it is on the order of 1-2µs. This is not much for a single call, but may become non-negligible across many calls. Until now, we have only looked into using prefetch() for terms, skip data and postings start pointers, which amount to a single prefetch() operation per segment per term. But we may want to start using it in cases that could result in many more calls to `madvise`, e.g. if we start using it for stored fields and a user requests 10k documents. In apache#13337, Robert wondered if we could take advantage of `mincore()` to reduce the overhead of `IndexInput#prefetch()`, which is what this PR does via `MemorySegment#isLoaded()`. `IndexInput#prefetch` tracks consecutive hits on the page cache and calls `madvise` less and less frequently under the hood as the number of consecutive cache hits increases.