Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: optimize indices #1841

Merged
merged 11 commits into from
Jan 17, 2024
Merged

feat: optimize indices #1841

merged 11 commits into from
Jan 17, 2024

Conversation

eddyxu
Copy link
Contributor

@eddyxu eddyxu commented Jan 17, 2024

Allow users to speicfy how many delta indices to be merged

Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions to help me understand how this operation works

python/src/dataset.rs Outdated Show resolved Hide resolved
rust/lance/src/index.rs Outdated Show resolved Hide resolved
Comment on lines +54 to +60
let mut indices = Vec::with_capacity(old_indices.len());
for idx in old_indices {
let index = dataset
.open_generic_index(&column.name, &idx.uuid.to_string())
.await?;
indices.push(index);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, could maybe use Vec::from_iter here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I change to

    let indices = stream::iter(old_indices.iter())
        .zip(repeat((dataset.clone(), col_name.clone())))
        .map(|(meta, (ds, col_name))| async move {
            ds.open_generic_index(&col_name, &meta.uuid.to_string())
                .await
        })
        .buffered(10)
        .try_collect::<Vec<_>>()
        .await?;

it complaints

error: higher-ranked lifetime error
   --> lance/src/index.rs:300:5
    |
300 |     #[instrument(skip_all)]
    |     ^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: could not prove `Pin<Box<{async block@lance/src/index.rs:300:5: 300:28}>>: CoerceUnsized<Pin<Box<(dyn futures::Future<Output = Result<(), lance_core::Error>> + std::marker::Send + 'i)>>>`
    = note: this error originates in the attribute macro `instrument` (in Nightly builds, run with -Z macro-backtrace for more info)

error: could not compile `lance` (lib) due to previous error

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the return of the dreaded rust-lang/rust#102211

Don't worry too much about it but I I've had luck with something like this...

    let indices = stream::iter(old_indices.iter())
        .zip(repeat((dataset.clone(), col_name.clone())))
        .map(|(meta, (ds, col_name))| async move {
            ds.open_generic_index(&col_name, &meta.uuid.to_string())
                .await
        })
        .collect::<Vec<_>>();
   let indices = stream::iter(indices)
        .buffered(10)
        .try_collect::<Vec<_>>()
        .await?;

...or you can just remove the call to buffered (that's the method that usually introduces the bogus error).

rust/lance/src/index/append.rs Outdated Show resolved Hide resolved
Comment on lines +20 to +23
/// If `num_indices_to_merge` is 0, a new delta index will be created.
/// If `num_indices_to_merge` is 1, the delta updates will be merged into the latest index.
/// If `num_indices_to_merge` is more than 1, the delta updates and latest N indices
/// will be merged into one single index.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to explain to the user why they would change this parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the doc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't worry too much. We can tweak this if/when the parameter becomes public. I'm guessing that there is some kind of cost-to-compute/accuracy-of-index tradeoff here? Or is it a cost-to-compute/cost-to-search tradeoff?

In other words, "why wouldn't I merge the indicies into one big index every time?" or "why wouldn't I make every index a delta index?"

I'm still not sure it is clear from the comment.

@eddyxu eddyxu merged commit 1671ea0 into main Jan 17, 2024
17 checks passed
@eddyxu eddyxu deleted the lei/opt_idx branch January 17, 2024 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants