umap fit() implementation and tests #321

Merged — 41 commits merged into NVIDIA:branch-23.08 on Jul 21, 2023

Conversation

@rishic3 (Collaborator) commented Jul 6, 2023

  • Future additions:
    • OOM detection for the data subsample (requires running a mini Spark job to query GPU memory on the node; a rough sketch follows below this list)
    • "convert_dtype" fit() param support
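A minimal sketch of how the GPU-memory query behind the OOM-detection idea might look, assuming pynvml is available on the executors; the helper names and the one-task-per-worker approximation are illustrative, not part of this PR:

```python
# Hypothetical sketch: run a mini Spark job that reports free GPU memory on the
# workers, so the driver can bound the size of the data subsample before fit().
# The pynvml calls are real; everything else (names, partitioning) is an assumption.
from pyspark.sql import SparkSession


def _free_gpu_memory_bytes(_: int) -> int:
    # Executed on an executor: report free memory of the GPU visible to this task.
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        return pynvml.nvmlDeviceGetMemoryInfo(handle).free
    finally:
        pynvml.nvmlShutdown()


def min_free_gpu_memory(spark: SparkSession, num_workers: int) -> int:
    # Mini Spark job: roughly one task per worker; the smallest reported
    # free-memory figure becomes the budget for the subsample.
    sc = spark.sparkContext
    frees = (
        sc.parallelize(range(num_workers), numSlices=num_workers)
        .map(_free_gpu_memory_bytes)
        .collect()
    )
    return min(frees)
```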

@rishic3 rishic3 marked this pull request as ready for review July 7, 2023 16:04
Review thread on python/tests/test_umap.py (resolved)
@wbo4958 (Collaborator) left a comment


It would be great if another test could be added to cover UMAP estimator persistence.
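For concreteness, a rough sketch of what such a persistence test might look like, assuming the estimator follows the standard PySpark MLWritable/MLReadable save/load pattern; the constructor arguments and the cuml_params attribute are assumptions, not taken from this PR:

```python
# Sketch of a persistence test: round-trip the UMAP estimator through disk and
# check it comes back intact. Parameter names and cuml_params are assumptions.
from spark_rapids_ml.umap import UMAP


def test_umap_estimator_persistence(tmp_path) -> None:
    est = UMAP(n_neighbors=10, num_workers=2)
    path = str(tmp_path / "umap_estimator")

    # Save and reload the estimator (standard PySpark writer/reader pattern assumed).
    est.write().overwrite().save(path)
    loaded = UMAP.load(path)

    assert isinstance(loaded, UMAP)
    assert loaded.cuml_params == est.cuml_params  # assumed attribute holding cuML-side params
```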

@rishic3 rishic3 marked this pull request as draft July 10, 2023 01:51
@rishic3 rishic3 changed the base branch from branch-23.06 to branch-23.08 July 10, 2023 17:42
@rishic3 rishic3 marked this pull request as ready for review July 12, 2023 22:50
@rishic3 (Collaborator, Author) commented Jul 12, 2023

Idea to avoid overriding _call_cuml_fit_func():

Currently, the override of _call_cuml_fit_func is necessitated only by the yield statement.

Alternatively, add a class method def _per_row_fit_return(self) -> bool (similar to _fit_array_order or _require_nccl_ucx), and add a conditional within core._call_cuml_fit_func() to optionally return data row by row as a bona fide RDD based on this class flag. (This may be useful if future algorithms need it too.)

Open to other (cleaner?) suggestions.
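A minimal sketch of the flag idea under a simplified view of core.py; the class and function names other than _per_row_fit_return are stand-ins for the real spark_rapids_ml.core internals, not the actual implementation:

```python
# Sketch only: _CumlEstimatorBase and _emit_fit_result are simplified stand-ins.
from typing import Any, Dict, Iterator


class _CumlEstimatorBase:
    def _per_row_fit_return(self) -> bool:
        # Default: each worker returns its fit result as a single row.
        return False


class UMAP(_CumlEstimatorBase):
    def _per_row_fit_return(self) -> bool:
        # UMAP's fit output (embeddings plus raw data) can exceed the JVM's
        # 2 GB byte-array limit if returned as one row, so yield it row by row.
        return True


def _emit_fit_result(est: _CumlEstimatorBase, fit_result: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    # Conditional that core._call_cuml_fit_func() could apply when yielding
    # results from the fit task, keyed off the class flag above.
    if est._per_row_fit_return():
        n_rows = len(next(iter(fit_result.values())))
        for i in range(n_rows):
            yield {key: values[i] for key, values in fit_result.items()}
    else:
        yield fit_result
```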

@wbo4958 (Collaborator) commented Jul 14, 2023

> [quoting @rishic3's proposal above]

I would suggest the second way, which can reuse code.

@lijinf2 (Collaborator) commented Jul 14, 2023

> [quoting @rishic3's proposal above]

Is it possible to set a PySpark configuration to increase this serialization limit? One option is to check the expected serialized size and hint the user to increase the corresponding PySpark configuration parameter.

Another option is to return row by row in core.py for all algorithms. (Any downside?)

@rishic3 (Collaborator, Author) commented Jul 14, 2023

> [quoting @rishic3's proposal and @lijinf2's questions above]

There is spark.driver.maxResultSize, which can be set to unlimited, but I don't think there is a way to increase the JVM's 2 GB byte-array limit.

As for returning row by row for all algorithms, I don't think this would work as-is: UMAP does not implement the typical get_cuml_fit_func and instead has its own generator function, so we would also have to make that change for the other algorithms.
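For reference, a minimal way to set the result-size limit mentioned above when building the session (spark.driver.maxResultSize; "0" means unlimited). This only lifts Spark's cap on results collected to the driver, not the JVM's 2 GB single byte-array limit:

```python
# spark.driver.maxResultSize caps the total serialized result size collected to
# the driver; "0" removes that cap (the 2 GB byte-array limit still applies).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("spark-rapids-ml-umap")
    .config("spark.driver.maxResultSize", "0")  # or a bounded value such as "8g"
    .getOrCreate()
)
```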

@eordentlich (Collaborator) left a comment


Some additional comments; looks good overall.

Review threads on python/src/spark_rapids_ml/umap.py, python/src/spark_rapids_ml/core.py, and python/tests/test_umap.py (resolved)
@rishic3 (Collaborator, Author) commented Jul 20, 2023

build

@rishic3 (Collaborator, Author) commented Jul 21, 2023

build

@eordentlich (Collaborator) left a comment


LGTM

@rishic3 rishic3 merged commit d4cc698 into NVIDIA:branch-23.08 Jul 21, 2023
1 check passed
@rishic3 rishic3 deleted the umap-fit branch September 26, 2024 21:35