introduce options for reducing the overhead for a clustering procedure #3731
Signed-off-by: Alexandr Guzhva <[email protected]>
Can you say a bit more about when the rand_perm is too slow? It is surprising to me that the rng could be a perf bottleneck.
@mdouze has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
I use a 1048576x768 dataset for PRQ experiments with beam_width = 16. In this case, with my new candidate code, these two mentioned pieces (the check for NaNs and
@mnorris11 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@mnorris11 merged this pull request in afe9c40.
facebookresearch#3731)

Summary:
Several changes:
1. Introduce `ClusteringParameters::check_input_data_for_NaNs`, which may suppress checks for NaN values in the input data
2. Introduce `ClusteringParameters::use_faster_subsampling`, which uses a newly added SplitMix64-based rng (`SplitMix64RandomGenerator`) and also may pick duplicate points from the original input dataset. Surprisingly, `rand_perm()` may involve noticeable non-zero costs for certain scenarios.
3. Negative values for `ClusteringParameters::seed` initialize internal clustering rng with high-resolution clock each time, making clustering procedure to pick different subsamples each time. I've decided not to use `std::random_device` in order to avoid possible negative effects.

Useful for future `ProductResidualQuantizer` improvements.

Pull Request resolved: facebookresearch#3731
Reviewed By: asadoughi
Differential Revision: D61106105
Pulled By: mnorris11
fbshipit-source-id: 072ab2f5ce4f82f9cf49d678122f65d1c08ce596
Several changes:
1. Introduce `ClusteringParameters::check_input_data_for_NaNs`, which may suppress checks for NaN values in the input data.
2. Introduce `ClusteringParameters::use_faster_subsampling`, which uses a newly added SplitMix64-based rng (`SplitMix64RandomGenerator`) and may also pick duplicate points from the original input dataset. Surprisingly, `rand_perm()` may incur noticeable costs in certain scenarios.
3. Negative values for `ClusteringParameters::seed` initialize the internal clustering rng from the high-resolution clock each time, so that the clustering procedure picks different subsamples on each run. I've decided not to use `std::random_device` in order to avoid possible negative effects.

Useful for future `ProductResidualQuantizer` improvements.
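To make the three ideas concrete, here is a minimal standalone sketch. It is not the code from this PR: the names `SplitMix64Sketch`, `resolve_seed`, `sample_with_replacement` and `has_nan` are hypothetical, and only the SplitMix64 mixing constants follow the published algorithm. It illustrates why sampling with replacement costs O(k) for k samples (and may return duplicates), whereas a full permutation of n points costs O(n) regardless of k, and why an optional NaN scan touches every input value.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a SplitMix64-based generator (not Faiss's class).
struct SplitMix64Sketch {
    uint64_t state;
    explicit SplitMix64Sketch(uint64_t seed) : state(seed) {}

    // One SplitMix64 step: advance the state, then mix it.
    uint64_t next() {
        uint64_t z = (state += 0x9E3779B97F4A7C15ULL);
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
        return z ^ (z >> 31);
    }
};

// Assumed convention: a non-negative seed is used as-is, a negative seed is
// replaced by a high-resolution clock reading, so each run differs.
inline uint64_t resolve_seed(int64_t seed) {
    if (seed >= 0) {
        return static_cast<uint64_t>(seed);
    }
    auto now = std::chrono::high_resolution_clock::now();
    return static_cast<uint64_t>(now.time_since_epoch().count());
}

// Draw k indices from [0, n) with replacement. Requires n > 0.
// Work is O(k) and no O(n) permutation buffer is needed, but the same
// index may be drawn more than once.
inline std::vector<size_t> sample_with_replacement(
        size_t n, size_t k, int64_t seed) {
    SplitMix64Sketch rng(resolve_seed(seed));
    std::vector<size_t> idx(k);
    for (size_t i = 0; i < k; i++) {
        // Modulo bias is negligible when n is far below 2^64.
        idx[i] = static_cast<size_t>(rng.next() % n);
    }
    return idx;
}

// Linear scan over all `count` floats; this is the kind of O(n * d) pass
// that an option such as check_input_data_for_NaNs would let a caller skip.
inline bool has_nan(const float* x, size_t count) {
    for (size_t i = 0; i < count; i++) {
        if (std::isnan(x[i])) {
            return true;
        }
    }
    return false;
}
```

With a fixed non-negative seed, `sample_with_replacement(n, k, seed)` returns the same indices on every call; with a negative seed the result is expected to vary from run to run.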