Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: cache loaded index metadata #1831

Merged
merged 9 commits into from
Jan 16, 2024
Merged

perf: cache loaded index metadata #1831

merged 9 commits into from
Jan 16, 2024

Conversation

eddyxu
Copy link
Contributor

@eddyxu eddyxu commented Jan 14, 2024

Closes #1657

@eddyxu eddyxu self-assigned this Jan 15, 2024
@eddyxu eddyxu added vector Vector Search performance labels Jan 15, 2024
Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debugging this deadlock in GCS has made me a little paranoid I think.

Comment on lines 100 to 101
/// Cache index metadadta of a version of the dataset.
pub(crate) index_metadata_cache: Arc<futures::lock::Mutex<Option<CachedIndexMetadata>>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not include this as part of session?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Was thinking that this cache has the same lifetime with the Dataset.

But if put into Session, we can actually maintain the cache for multiple opened dataset (i.e., per request). Seems a better deal. I will update it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you put it in session, you can then just re-use the existing metadata cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The session does have the same lifetime as Dataset doesn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Session object can be passed in as external managed memory (outside of Dataset)
https://github.com/lancedb/lance/blob/main/rust/lance/src/dataset/builder.rs#L162

let manifest_file = self.manifest_file(self.version().version).await?;
read_manifest_indexes(&self.object_store, &manifest_file, &self.manifest).await
async fn load_indices<'a>(&'a self) -> Result<Vec<IndexMetadata>> {
let mut indices = self.index_metadata_cache.lock().await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Async locks make me a little uneasy. Could we avoid it by doing something like...

Grab lock
If available and correct version then return
Release lock
Load indices from file
Grab lock
If not set, then set
Release lock

This could lead to multiple threads loading the indices but that should be rare and harmless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg.

self.misses.fetch_add(1, Ordering::Relaxed);
}
}

#[derive(Clone)]
pub struct IndexCache {
// TODO: Can we merge these two caches into one for uniform memory management?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Downcasting/upcasting is weird. You can try. I got spun around and gave up on it at the time simply because the PR had lots of other stuff.

@eddyxu eddyxu merged commit 27dbc79 into main Jan 16, 2024
17 checks passed
@eddyxu eddyxu deleted the lei/load_index_cache branch January 16, 2024 00:00
eddyxu added a commit that referenced this pull request Jan 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cache the result of read_manifest_indexes
3 participants