umap fit() implementation and tests #321

Merged — 41 commits merged into NVIDIA:branch-23.08 on Jul 21, 2023

Conversation

@rishic3 (Collaborator) commented Jul 6, 2023

  • Future additions:
    • OOM detection for the data subsample (requires running a mini Spark job to query GPU memory on the node; a rough sketch follows below this list)
    • "convert_dtype" fit() param support
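A minimal sketch of how the GPU-memory query behind the OOM-detection idea might look, assuming pynvml is available on the executors; the helper names and the one-task-per-worker approximation are illustrative, not part of this PR:

```python
# Hypothetical sketch: run a mini Spark job that reports free GPU memory on the
# workers, so the driver can bound the size of the data subsample before fit().
# The pynvml calls are real; everything else (names, partitioning) is an assumption.
from pyspark.sql import SparkSession


def _free_gpu_memory_bytes(_: int) -> int:
    # Executed on an executor: report free memory of the GPU visible to this task.
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        return pynvml.nvmlDeviceGetMemoryInfo(handle).free
    finally:
        pynvml.nvmlShutdown()


def min_free_gpu_memory(spark: SparkSession, num_workers: int) -> int:
    # Mini Spark job: roughly one task per worker; the smallest reported
    # free-memory figure becomes the budget for the subsample.
    sc = spark.sparkContext
    frees = (
        sc.parallelize(range(num_workers), numSlices=num_workers)
        .map(_free_gpu_memory_bytes)
        .collect()
    )
    return min(frees)
```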

@rishic3 rishic3 marked this pull request as ready for review July 7, 2023 16:04
Review thread on python/tests/test_umap.py (resolved)
@wbo4958 (Collaborator) left a comment


It would be great if another test could be added to cover UMAP estimator persistence.
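For concreteness, a rough sketch of what such a persistence test might look like, assuming the estimator follows the standard PySpark MLWritable/MLReadable save/load pattern; the constructor arguments and the cuml_params attribute are assumptions, not taken from this PR:

```python
# Sketch of a persistence test: round-trip the UMAP estimator through disk and
# check it comes back intact. Parameter names and cuml_params are assumptions.
from spark_rapids_ml.umap import UMAP


def test_umap_estimator_persistence(tmp_path) -> None:
    est = UMAP(n_neighbors=10, num_workers=2)
    path = str(tmp_path / "umap_estimator")

    # Save and reload the estimator (standard PySpark writer/reader pattern assumed).
    est.write().overwrite().save(path)
    loaded = UMAP.load(path)

    assert isinstance(loaded, UMAP)
    assert loaded.cuml_params == est.cuml_params  # assumed attribute holding cuML-side params
```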

@rishic3 rishic3 marked this pull request as draft July 10, 2023 01:51
@rishic3 rishic3 changed the base branch from branch-23.06 to branch-23.08 July 10, 2023 17:42
@rishic3 rishic3 marked this pull request as ready for review July 12, 2023 22:50
@rishic3 (Collaborator, Author) commented Jul 12, 2023

Idea to avoid overriding _call_cuml_fit_func():

Currently, the override of _call_cuml_fit_func is necessitated only by the yield statement.

Alternatively, add a class method def _per_row_fit_return(self) -> bool (similar to _fit_array_order or _require_nccl_ucx), and add a conditional within core._call_cuml_fit_func() to optionally return data row by row as a bona fide RDD based on this class flag. (This may be useful if future algorithms need it too.)

Open to other (cleaner?) suggestions.
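A minimal sketch of the flag idea under a simplified view of core.py; the class and function names other than _per_row_fit_return are stand-ins for the real spark_rapids_ml.core internals, not the actual implementation:

```python
# Sketch only: _CumlEstimatorBase and _emit_fit_result are simplified stand-ins.
from typing import Any, Dict, Iterator


class _CumlEstimatorBase:
    def _per_row_fit_return(self) -> bool:
        # Default: each worker returns its fit result as a single row.
        return False


class UMAP(_CumlEstimatorBase):
    def _per_row_fit_return(self) -> bool:
        # UMAP's fit output (embeddings plus raw data) can exceed the JVM's
        # 2 GB byte-array limit if returned as one row, so yield it row by row.
        return True


def _emit_fit_result(est: _CumlEstimatorBase, fit_result: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    # Conditional that core._call_cuml_fit_func() could apply when yielding
    # results from the fit task, keyed off the class flag above.
    if est._per_row_fit_return():
        n_rows = len(next(iter(fit_result.values())))
        for i in range(n_rows):
            yield {key: values[i] for key, values in fit_result.items()}
    else:
        yield fit_result
```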

@wbo4958 (Collaborator) commented Jul 14, 2023

> [quoting @rishic3's proposal above]

I would suggest the second way, which can reuse code.

@lijinf2 (Collaborator) commented Jul 14, 2023

> [quoting @rishic3's proposal above]

Is it possible to set a PySpark configuration to increase this serialization limit? One option is to check the expected serialized size and hint the user to increase the corresponding PySpark configuration parameter.

Another option is to return row by row in core.py for all algorithms. (Any downside?)

@rishic3 (Collaborator, Author) commented Jul 14, 2023

> [quoting @rishic3's proposal and @lijinf2's questions above]

There is spark.driver.maxResultSize, which can be set to unlimited, but I don't think there is a way to increase the JVM's 2 GB byte-array limit.

As for returning row by row for all algorithms, I don't think this would work as-is: UMAP does not implement the typical get_cuml_fit_func and instead has its own generator function, so we would also have to make that change for the other algorithms.
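For reference, a minimal way to set the result-size limit mentioned above when building the session (spark.driver.maxResultSize; "0" means unlimited). This only lifts Spark's cap on results collected to the driver, not the JVM's 2 GB single byte-array limit:

```python
# spark.driver.maxResultSize caps the total serialized result size collected to
# the driver; "0" removes that cap (the 2 GB byte-array limit still applies).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("spark-rapids-ml-umap")
    .config("spark.driver.maxResultSize", "0")  # or a bounded value such as "8g"
    .getOrCreate()
)
```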

@eordentlich (Collaborator) left a comment


Some additional comments; looks good overall.

Review threads on python/src/spark_rapids_ml/umap.py, python/src/spark_rapids_ml/core.py, and python/tests/test_umap.py (resolved)
@rishic3 (Collaborator, Author) commented Jul 20, 2023

build

@rishic3 (Collaborator, Author) commented Jul 21, 2023

build

@eordentlich (Collaborator) left a comment


LGTM

@rishic3 rishic3 merged commit d4cc698 into NVIDIA:branch-23.08 Jul 21, 2023
1 check passed
@rishic3 rishic3 deleted the umap-fit branch September 26, 2024 21:35