Reproducibility #62
Comments
Hi Gregor, Glad to hear you like RSC. For UMAP I have bad news for you. This is what cuML says about it
For Harmony I'll take a look and see if I can improve reproducibility. I think for this one it might be possible. Thank you for providing the test code.
Thank you for pointing out the lack of reproducibility. As it turns out, even with some of the setups I used, 100% reproducibility will still not happen due to the async nature of GPU computing and floats being floats. With 32-bit floats I got the assertion error too:
With 64-bit floats the error is:
Now the question is how to move on from here. Would you like the option in
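To illustrate the "floats being floats" point above, here is a minimal sketch (plain NumPy on the CPU, not code from rapids-singlecell or cuML). It shows that floating-point addition is not associative, so a parallel reduction that adds the same numbers in a different order can return a different result. The helper name `sum_both_orders` and the chosen values are made up for illustration.

```python
import numpy as np

def sum_both_orders(dtype):
    # Same three numbers, two groupings. A parallel reduction may end up
    # using either grouping depending on how the work is scheduled.
    a, b, c = dtype(1e8), dtype(-1e8), dtype(1.0)
    return float((a + b) + c), float(a + (b + c))

print(sum_both_orders(np.float32))  # (1.0, 0.0)
print(sum_both_orders(np.float64))  # (1.0, 1.0)
```

In 32-bit precision the `1.0` is lost when it is added to `-1e8` first, because neighbouring float32 values near 1e8 are 8 apart. With 64-bit floats both groupings agree for these values, which is in line with the smaller discrepancies reported for 64-bit runs, although 64-bit arithmetic is still not associative in general.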
This doesn't sound too bad if they are referring to three digits of precision of the output coordinates. Or am I misunderstanding this?
The new results seem pretty similar indeed. I wonder if they still impact the neighborhood graph and the Leiden clustering, as this is what matters to me (manually annotating clusters is even more painful if the cluster labels change all the time).
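One way to check whether two clustering runs really differ, or only had their cluster labels shuffled, is to test whether the two labelings describe the same partition. The following is a minimal sketch in plain Python. The function name `same_partition` is hypothetical and not part of rapids-singlecell or scanpy.

```python
def same_partition(labels_a, labels_b):
    """Return True if both labelings group the cells identically,
    ignoring which name each cluster was given."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Every label in run A must map to exactly one label in run B,
        # and the other way round.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True

print(same_partition(["0", "0", "1", "2"], ["2", "2", "0", "1"]))  # True
print(same_partition(["0", "0", "1", "1"], ["0", "1", "1", "1"]))  # False
```

If this returns True for two runs, the manual annotations can be carried over by renaming clusters. If it returns False, at least one cell changed its cluster.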
So if you use Leiden clustering, I would in general advise using rapids-23.04. From 23.06 onwards it's beyond broken (#44).
I will have a look at the impact this has on the neighbourhood search and clustering.
I set the default
Ok, so with the next release of rapids-singlecell v0.9.0 the Most calculations were done with 64-bit floats anyway, so it won't affect performance and memory usage too much.
Describe the bug
Hi @Intron7,
many thanks for putting together this package! The speedup is really mind-blowing for larger datasets and makes my day-to-day analyses a lot more fun!
I know that reproducibility for this kind of algorithm is hard, and there's probably no way to get reproducibility between different devices, driver versions etc., but I think it should be possible to run the same code in the same environment on the same machine and get exactly the same results for all algorithms.
I tested this for some example data and found the following:
rsc.tl.harmony
rsc.tl.neighbors
rsc.tl.umap
rsc.tl.leiden
Are there maybe any additional seeds that need to be set to make this work consistently?
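For readers who want to run this kind of same-machine check themselves, here is a minimal sketch of a repeat-and-compare helper. It is not the original test code from this issue. The names `is_reproducible`, `seeded_embedding` and `unseeded_embedding` are hypothetical, and the two NumPy functions are only stand-ins for a real analysis step.

```python
import numpy as np

def is_reproducible(step, n_runs=3):
    """Call step() n_runs times and return True only if every output
    is exactly equal to the first one."""
    reference = np.asarray(step())
    return all(
        np.array_equal(reference, np.asarray(step()))
        for _ in range(n_runs - 1)
    )

def seeded_embedding():
    # Stand-in for a step with a fixed seed.
    rng = np.random.default_rng(0)
    return rng.standard_normal((100, 2))

def unseeded_embedding():
    # Stand-in for a step whose random state is not controlled.
    rng = np.random.default_rng()
    return rng.standard_normal((100, 2))

print(is_reproducible(seeded_embedding))    # True
print(is_reproducible(unseeded_embedding))  # False
```

In practice, `step` would presumably wrap one of the functions listed above, applied to a fresh copy of the data on each call, and return the array that the step produces.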
Steps/Code to reproduce bug
Environment details (please complete the following information):