introduce options for reducing the overhead for a clustering procedure #3731

Conversation

@alexanderguzhva commented Aug 7, 2024

Several changes:

  1. Introduce `ClusteringParameters::check_input_data_for_NaNs`, which can suppress the check for NaN values in the input data.
  2. Introduce `ClusteringParameters::use_faster_subsampling`, which uses a newly added SplitMix64-based RNG (`SplitMix64RandomGenerator`) and may pick duplicate points from the original input dataset. Surprisingly, `rand_perm()` can incur noticeable costs in certain scenarios.
  3. A negative value for `ClusteringParameters::seed` initializes the internal clustering RNG from a high-resolution clock on every run, so the clustering procedure picks different subsamples each time. I decided not to use `std::random_device` in order to avoid possible negative effects.

Useful for future ProductResidualQuantizer improvements.

@mdouze commented Aug 12, 2024

Can you say a bit more about when `rand_perm` is too slow? It is surprising to me that the RNG could be a perf bottleneck.
I assume it's when k (the number of samples needed) is much smaller than n (the total number of ids), because `rand_perm` is O(n).
When k << n we could sample k elements and check with an `std::set` that each element hasn't been selected before.

@mdouze commented Aug 12, 2024

Still importing...

@facebook-github-bot

@mdouze has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@alexanderguzhva (Author)

I use a 1048576x768 dataset for PRQ experiments with beam_width = 16. In this case, with my new candidate code, the two pieces mentioned (the check for NaNs and `rand_perm`) may take up to 10-15% of the total training time. For `rand_perm()`, it is required to generate a permutation of, say, 524288 points out of the 16M available.

@facebook-github-bot

@mnorris11 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@mnorris11 merged this pull request in afe9c40.

ketor pushed a commit to dingodb/faiss that referenced this pull request Aug 20, 2024
introduce options for reducing the overhead for a clustering procedure (facebookresearch#3731)

Summary:
Several changes:
1. Introduce `ClusteringParameters::check_input_data_for_NaNs`, which can suppress the check for NaN values in the input data.
2. Introduce `ClusteringParameters::use_faster_subsampling`, which uses a newly added SplitMix64-based RNG (`SplitMix64RandomGenerator`) and may pick duplicate points from the original input dataset. Surprisingly, `rand_perm()` can incur noticeable costs in certain scenarios.
3. A negative value for `ClusteringParameters::seed` initializes the internal clustering RNG from a high-resolution clock on every run, so the clustering procedure picks different subsamples each time. I decided not to use `std::random_device` in order to avoid possible negative effects.

Useful for future `ProductResidualQuantizer` improvements.

Pull Request resolved: facebookresearch#3731

Reviewed By: asadoughi

Differential Revision: D61106105

Pulled By: mnorris11

fbshipit-source-id: 072ab2f5ce4f82f9cf49d678122f65d1c08ce596
aalekhpatel07 pushed a commit to aalekhpatel07/faiss that referenced this pull request Oct 17, 2024
introduce options for reducing the overhead for a clustering procedure (facebookresearch#3731)