[REVIEW] Deterministic UMAP with floating point rounding. #3848

Conversation

@trivialfis trivialfis (Member) commented May 11, 2021

Use floating-point rounding to make the UMAP optimization deterministic. This is a breaking change, as the batch size parameter is removed.

  • Add a procedure for rounding the gradient updates (see the sketch below).
  • Add a buffer for the gradient updates.
  • Add an internal parameter `deterministic`, which should be set to `true` when `random_state` is set.

The test file is removed due to #3849.
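The rounding procedure follows the standard trick for making floating-point accumulation order-independent: pre-round every addend to a fixed absolute precision, chosen from a bound on the total sum, so that every partial sum is exact in float32. Below is a minimal NumPy sketch of the idea; `rounding_factor` and `truncate` are illustrative names, not this PR's actual API.

import math
import numpy as np

def rounding_factor(max_abs, n):
    # Smallest power of two bounding the worst-case absolute sum of n addends.
    return np.float32(2.0 ** math.ceil(math.log2(max_abs * n)))

def truncate(x, factor):
    # (factor + x) - factor snaps x onto a fixed grid whose spacing is set by
    # factor's exponent; float32 sums of such grid values are exact.
    return (factor + x.astype(np.float32)) - factor

rng = np.random.default_rng(0)
grads = rng.standard_normal(100_000).astype(np.float32)  # stand-in gradient updates
factor = rounding_factor(float(np.abs(grads).max()), grads.size)
rounded = truncate(grads, factor)

def serial_sum(values):
    s = np.float32(0.0)
    for v in values:
        s = np.float32(s + v)
    return s

perm = rng.permutation(grads.size)
assert serial_sum(rounded) == serial_sum(rounded[perm])  # bit-identical in any order
print(serial_sum(grads) == serial_sum(grads[perm]))      # usually False for raw float32

Because the pre-rounded updates sum exactly, concurrent atomic additions into the gradient buffer can produce bit-identical results regardless of scheduling order, at the cost of a small, bounded loss of precision per update.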

@trivialfis trivialfis changed the title [WIP] Deterministic UMAP with floating point roundng. [WIP] Deterministic UMAP with floating point rounding. May 11, 2021
@github-actions github-actions bot added CUDA/C++ Cython / Python Cython or Python issue labels May 11, 2021
@trivialfis trivialfis added breaking Breaking change feature request New feature or request labels May 11, 2021
@trivialfis trivialfis (Member, Author) commented May 11, 2021

Related:

I will look into them once the issue in optimize_layout is resolved.

@trivialfis trivialfis requested a review from cjnolet May 11, 2021 13:38
@cjnolet cjnolet (Member) commented May 11, 2021

@trivialfis, I compiled and executed both of your approaches against the current approach to reproducibility. So far, this approach does look faster, and I've verified that it appears to be reproducible. For completeness it would still be nice to do some profiling and isolate where the bottlenecks are in the warp-reduction approach, but this approach has the benefit of requiring fewer changes to the code.

Here are some preliminary results from a very informal benchmark on a V100 (using defaults). Notice that your truncation approach comes in at about the same timing as the non-reproducible path in the current UMAP implementation.

>>> def do_it():
...   import time
...   s = time.time()
...   m = UMAP().fit(X)
...   print("TooK %ss" % (time.time() - s))

Current UMAP (non-reproducible)
>>> X, y = make_blobs(100000, 256)
>>> do_it()
TooK 0.7839245796203613s
>>> do_it()
TooK 0.790290355682373s

Current UMAP (reproducible)
>>> X, y = make_blobs(100000, 256)
>>> do_it()
TooK 1.065580129623413s
>>> do_it()

Warp-level reductions:
>>> X, y = make_blobs(100000, 128)
>>> do_it()
TooK 1.012941837310791s
>>> do_it()
TooK 0.9855947494506836s
>>> X, y = make_blobs(100000, 256)
>>> do_it()
TooK 1.2152369022369385s

Truncation: 
>>> X, y = make_blobs(100000, 256)
>>> do_it()
TooK 1.2795426845550537s
>>> do_it()
TooK 0.7870500087738037s
>>> do_it()
TooK 0.7900457382202148s
>>> do_it()
TooK 0.7811644077301025s
>>> def do_it():
...   import time
...   s = time.time()
...   m = UMAP(random_state=42).fit(X)
...   print("TooK %ss" % (time.time() - s))
...   print(m.embedding_)
... 
>>> do_it()
TooK 0.9100887775421143s
[[-8.367687  -3.012333 ]
 [-9.082779  -4.595929 ]
 [-0.6999092  1.0279074]
 ...
 [ 8.140747   3.4932442]
 [ 8.890211   3.0183897]
 [ 8.434332   2.4182014]]
>>> do_it()
TooK 0.7889752388000488s
[[-8.367687  -3.012333 ]
 [-9.082779  -4.595929 ]
 [-0.6999092  1.0279074]
 ...
 [ 8.140747   3.4932442]
 [ 8.890211   3.0183897]
 [ 8.434332   2.4182014]]
>>> do_it()
TooK 0.7994036674499512s
[[-8.367687  -3.012333 ]
 [-9.082779  -4.595929 ]
 [-0.6999092  1.0279074]
 ...
 [ 8.140747   3.4932442]
 [ 8.890211   3.0183897]
 [ 8.434332   2.4182014]]
>>> do_it()
TooK 0.8132762908935547s
[[-8.367687  -3.012333 ]
 [-9.082779  -4.595929 ]
 [-0.6999092  1.0279074]
 ...
 [ 8.140747   3.4932442]
 [ 8.890211   3.0183897]
 [ 8.434332   2.4182014]]

@cjnolet cjnolet (Member) left a comment

Just providing some initial feedback. I'll go through another round when you're ready. So far I'm excited by the timings I'm seeing for both approaches.

Review threads (resolved):
  • cpp/src/umap/simpl_set_embed/optimize_batch_kernel.cuh (outdated)
  • cpp/src/umap/simpl_set_embed/algo.cuh (outdated; 2 threads)
  • cpp/test/sg/umap_parametrizable_test.cu
  • python/cuml/test/test_umap.py (outdated)
@trivialfis trivialfis marked this pull request as ready for review May 12, 2021 07:44
@trivialfis trivialfis requested review from a team as code owners May 12, 2021 07:44
@trivialfis trivialfis changed the title [WIP] Deterministic UMAP with floating point rounding. [REVIEW] Deterministic UMAP with floating point rounding. May 12, 2021
@cjnolet cjnolet added the 4 - Waiting on Author Waiting for author to respond to review label May 12, 2021
@trivialfis trivialfis (Member, Author) commented May 13, 2021

It seems the mnmg test is flaky.

Update: NVM, fixed.

@trivialfis trivialfis removed the 4 - Waiting on Author Waiting for author to respond to review label May 13, 2021
@dantegd dantegd added the 4 - Waiting on Reviewer Waiting for reviewer to review or respond label May 13, 2021
@mdemoret-nv mdemoret-nv linked an issue May 13, 2021 that may be closed by this pull request: [BUG] UMAP test is not built.
@cjnolet cjnolet (Member) left a comment

Finished the last review round. Have you had a chance to run these changes on larger datasets such as Fashion-MNIST or the Google News embeddings? It would help to test against a few more datasets just to verify there aren't any violated assumptions in the rounding (a minimal version of such a check is sketched below). Otherwise, these changes are looking great.

Review threads (resolved):
  • cpp/src/umap/simpl_set_embed/algo.cuh (outdated)
  • cpp/test/sg/umap_parametrizable_test.cu (outdated)
  • python/cuml/manifold/umap.pyx
  • cpp/src/umap/simpl_set_embed/optimize_batch_kernel.cuh (outdated)
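A minimal sketch of such a check, assuming Fashion-MNIST is fetched via scikit-learn's `fetch_openml` (the loading path is illustrative; any sufficiently large dataset works):

import numpy as np
from sklearn.datasets import fetch_openml
from cuml.manifold import UMAP

# Fashion-MNIST: 70,000 x 784, large enough to stress the rounding bound.
X, _ = fetch_openml("Fashion-MNIST", version=1, return_X_y=True, as_frame=False)
X = X.astype(np.float32)

# Setting random_state selects the deterministic code path; repeated fits
# should then produce bit-identical embeddings.
e1 = UMAP(random_state=42).fit_transform(X)
e2 = UMAP(random_state=42).fit_transform(X)
assert np.array_equal(e1, e2)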
@trivialfis trivialfis (Member, Author) commented May 14, 2021

[Figure: Fashion-MNIST embeddings compared, branch-0.20 vs. this PR]

@trivialfis trivialfis force-pushed the fea-deterministic-umap-truncation-max branch 2 times, most recently from b850a33 to bc26e07 on May 14, 2021 17:07
@codecov-commenter commented

Codecov Report

Merging #3848 (bc26e07) into branch-0.20 (46174b7) will decrease coverage by 8.60%.
The diff coverage is 52.66%.


@@               Coverage Diff               @@
##           branch-0.20    #3848      +/-   ##
===============================================
- Coverage        85.96%   77.35%   -8.61%     
===============================================
  Files              225      214      -11     
  Lines            16986    16552     -434     
===============================================
- Hits             14602    12804    -1798     
- Misses            2384     3748    +1364     
Flag       Coverage Δ
dask       ?
non-dask   77.35% <52.66%> (-0.46%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
python/cuml/benchmark/nvtx_benchmark.py 0.00% <0.00%> (ø)
python/cuml/common/memory_utils.py 76.82% <ø> (-1.93%) ⬇️
python/cuml/dask/common/dask_arr_utils.py 27.77% <0.00%> (-68.00%) ⬇️
python/cuml/dask/common/utils.py 28.15% <0.00%> (-15.54%) ⬇️
python/cuml/dask/ensemble/base.py 19.55% <0.00%> (-64.36%) ⬇️
python/cuml/ensemble/randomforestclassifier.pyx 83.61% <ø> (ø)
python/cuml/linear_model/logistic_regression.pyx 89.21% <ø> (ø)
python/cuml/neighbors/nearest_neighbors.pyx 93.11% <ø> (-0.03%) ⬇️
python/cuml/common/base.pyx 74.10% <29.41%> (-6.23%) ⬇️
python/cuml/model_selection/_split.py 88.99% <75.67%> (-1.87%) ⬇️
... and 100 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4d06991...bc26e07.

@trivialfis trivialfis force-pushed the fea-deterministic-umap-truncation-max branch from bc26e07 to 3945822 on May 17, 2021 17:00
@trivialfis trivialfis (Member, Author) commented May 17, 2021

Rebased onto branch-21.06. Not sure why conda is failing.

@trivialfis trivialfis (Member, Author) commented

rerun tests

@trivialfis trivialfis changed the title [REVIEW] Deterministic UMAP with floating point rounding. [REVIEW] Deterministic UMAP with floating point rounding May 17, 2021
@trivialfis trivialfis changed the title [REVIEW] Deterministic UMAP with floating point rounding [REVIEW] Deterministic UMAP with floating point rounding. May 17, 2021
@trivialfis trivialfis requested a review from cjnolet May 18, 2021 00:52
@trivialfis trivialfis (Member, Author) commented

rerun tests

@cjnolet cjnolet (Member) left a comment

LGTM. The evaluations on the datasets I've seen look great.

@cjnolet cjnolet (Member) commented May 20, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 99a80c8 into rapidsai:branch-21.06 May 20, 2021
@trivialfis trivialfis deleted the fea-deterministic-umap-truncation-max branch May 20, 2021 18:53
@trivialfis trivialfis (Member, Author) commented
@cjnolet Thanks for all the advice! Learned a lot during this.

rapids-bot bot pushed a commit that referenced this pull request May 13, 2022
Closes #4725

#3848 removes the usage of `optim_batch_size` in code. This PR removes the parameter from the docstring and from `UMAPParams`.

Authors:
  - Thomas J. Fan (https://github.com/thomasjpfan)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #4732