introduce options for reducing the overhead for a clustering procedure #3731

Conversation

@alexanderguzhva commented Aug 7, 2024

Several changes:

  1. Introduce `ClusteringParameters::check_input_data_for_NaNs`, which can suppress the check for NaN values in the input data.
  2. Introduce `ClusteringParameters::use_faster_subsampling`, which uses a newly added SplitMix64-based RNG (`SplitMix64RandomGenerator`) and may pick duplicate points from the original input dataset. Surprisingly, `rand_perm()` can incur noticeable costs in certain scenarios.
  3. A negative value for `ClusteringParameters::seed` initializes the internal clustering RNG from a high-resolution clock on every run, so the clustering procedure picks different subsamples each time. I decided not to use `std::random_device` in order to avoid possible negative effects.

Useful for future ProductResidualQuantizer improvements.

@mdouze commented Aug 12, 2024

Can you say a bit more about when `rand_perm` is too slow? It is surprising to me that the RNG could be a perf bottleneck.
I assume it's when k (the number of samples needed) is much smaller than n (the total number of ids), because `rand_perm` is O(n).
When k << n we could sample k elements and check with an `std::set` that each element hasn't been selected before.

@mdouze commented Aug 12, 2024

Still importing...

@facebook-github-bot

@mdouze has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@alexanderguzhva (Author)

I use a 1048576x768 dataset for PRQ experiments with beam_width = 16. In this case, with my new candidate code, the two pieces mentioned (the check for NaNs and `rand_perm`) may take up to 10-15% of the total training time. For `rand_perm()`, it is required to generate a permutation of, say, 524288 points out of the 16M available.

@facebook-github-bot

@mnorris11 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@mnorris11 merged this pull request in afe9c40.

ketor pushed a commit to dingodb/faiss that referenced this pull request Aug 20, 2024
introduce options for reducing the overhead for a clustering procedure (facebookresearch#3731)

Summary:
Several changes:
1. Introduce `ClusteringParameters::check_input_data_for_NaNs`, which can suppress the check for NaN values in the input data.
2. Introduce `ClusteringParameters::use_faster_subsampling`, which uses a newly added SplitMix64-based RNG (`SplitMix64RandomGenerator`) and may pick duplicate points from the original input dataset. Surprisingly, `rand_perm()` can incur noticeable costs in certain scenarios.
3. A negative value for `ClusteringParameters::seed` initializes the internal clustering RNG from a high-resolution clock on every run, so the clustering procedure picks different subsamples each time. I decided not to use `std::random_device` in order to avoid possible negative effects.

Useful for future `ProductResidualQuantizer` improvements.

Pull Request resolved: facebookresearch#3731

Reviewed By: asadoughi

Differential Revision: D61106105

Pulled By: mnorris11

fbshipit-source-id: 072ab2f5ce4f82f9cf49d678122f65d1c08ce596
aalekhpatel07 pushed a commit to aalekhpatel07/faiss that referenced this pull request Oct 17, 2024
introduce options for reducing the overhead for a clustering procedure (facebookresearch#3731)