Add IndexInput isLoaded #13998
@ChrisHegarty this will be a very useful thing. Can we also figure out how much data is loaded with this API? So let's say an IndexInput is 30GB and only 10GB is loaded/mapped in memory, can it return that too?
Indeed.
While possible, it's not straightforward and would require some native access. For now, let's go with the basic loaded / not-loaded, since this is useful as is.
You would need to call for non-mmapped i/o you can do similar with syscalls such as |
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java
Yeah, we can look at how to call |
yes, agreed about i'm very much against using |
Also for debugging these issues, you can get this information at non-java level using
++
Yeah, optionally being able to For now, I mostly wanna be able to:
This works for me. Maybe implement this API on our in-memory index inputs to return true, e.g. ByteBuffersIndexInput?
* hint because the operating system may have paged out some of the data by the time this method
* returns. If the optional is true, then it's likely that the contents of this input are resident
* in physical memory. A value of false does not imply that the contents are not resident in
* physical memory. An empty optional is returned if it is not possible to determine.
It looks like an empty optional and false mostly mean the same thing, which makes me wonder if this should return a boolean directly?
It may also be worth pointing out that this method runs in linear time with the amount of data that this IndexInput exposes (as opposed to constant-time). So it makes little sense to use it to do something like "if (isLoaded() == false) { prefetch(); }"
I added a note about the time complexity. I'd like to keep the tri-state return type, at least for now, since I think it will be useful to know whether the loaded state can be determined at all.
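To make the tri-state concrete, here is a minimal sketch of how diagnostic code might interpret the `Optional<Boolean>` described in the Javadoc above. The `LoadedState` class and its labels are hypothetical and not part of Lucene; only the three meanings (true, false, empty) come from the Javadoc under review.

```java
import java.util.Optional;

/** Hypothetical helper (not part of Lucene) that labels the tri-state hint. */
final class LoadedState {
  private LoadedState() {}

  /** Maps the tri-state result to a label suitable for diagnostic output. */
  static String describe(Optional<Boolean> isLoaded) {
    if (isLoaded.isEmpty()) {
      // The implementation could not determine residency at all.
      return "unknown";
    }
    // true: contents are likely resident; false: no claim either way.
    return isLoaded.get() ? "likely resident" : "not known to be resident";
  }

  public static void main(String[] args) {
    System.out.println(describe(Optional.of(true)));
    System.out.println(describe(Optional.of(false)));
    System.out.println(describe(Optional.empty()));
  }
}
```

Keeping "unknown" separate from "not known to be resident" lets a report distinguish inputs that cannot answer the question from inputs that answered false.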
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java
yeah, I think that this prob makes sense. Lemme satisfy myself that it will always be true.
it won't be in core if currently swapped out, no? I don't think a hardcoded |
Yeah. I was taking a little time to consider if it might be worth casting to |
Sorry for derailing the PR, let's not implement it on ByteBuffersIndexInput then. We can look into it in a separate PR if we want.
This commit adds IndexInput::isLoaded to help determine if the contents of an input are resident in physical memory. The intent of this new method is to help build inspection and diagnostic infrastructure on top. The initial requirement is to help understand if vector data, and more specifically the HNSW graph, are in memory. For search use cases, performance drops off significantly if, at least, the graph is not resident. This is not a perfect API, more of a hint, but along with read advice like MADV_WILLNEED it may be used to diagnose performance issues when searching vectors.
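For readers who want to experiment with the same kind of residency hint outside Lucene, the sketch below uses the JDK's `MappedByteBuffer.isLoaded()` as a stand-in. It is not the implementation in this PR; the `ResidencyProbe` class and its method names are hypothetical, and the result is only a hint, with the same caveats as the Javadoc discussed above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Optional;

/** Hypothetical diagnostic helper; a stand-in analogue, not Lucene code. */
final class ResidencyProbe {
  private ResidencyProbe() {}

  /**
   * Maps the file read-only and asks the JDK whether the mapping is likely
   * resident. Returns empty when the question cannot be answered here.
   */
  static Optional<Boolean> isLoaded(Path file) {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      long size = ch.size();
      // A single MappedByteBuffer cannot exceed Integer.MAX_VALUE bytes.
      if (size == 0 || size > Integer.MAX_VALUE) {
        return Optional.empty();
      }
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
      // Checks the mapped pages, so cost grows with the mapping size.
      return Optional.of(buf.isLoaded());
    } catch (IOException e) {
      return Optional.empty();
    }
  }

  /** Convenience for experiments: writes a temp file and probes it. */
  static Optional<Boolean> probeTempFile(int sizeInBytes) {
    try {
      Path tmp = Files.createTempFile("residency-probe", ".bin");
      tmp.toFile().deleteOnExit();
      Files.write(tmp, new byte[sizeInBytes]);
      return isLoaded(tmp);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("loaded hint: " + probeTempFile(1 << 20));
  }
}
```

The printed value depends on the operating system and page cache state, so it should be treated as a diagnostic signal rather than a guarantee.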