Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reproducibility #62

Closed
grst opened this issue Sep 4, 2023 · 6 comments
Closed

Reproducibility #62

grst opened this issue Sep 4, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@grst
Copy link

grst commented Sep 4, 2023

Describe the bug
Hi @Intron7,

many thanks for putting together this package! The speedup is really mindblowing for larger datasets and makes my day-to-day analyses a lot more fun!

I know that reproducibility for this kind of algorithms is hard, and there's probably no way to get reproducibility between different devices, driver versions etc. but I think it should be possible to run the same code in the same environment on the same machine and get exactly the same results for all algorithms.

I tested this for some example data and found the following

reproducible function
rsc.tl.harmony
✔️ rsc.tl.neighbors
rsc.tl.umap
✔️ rsc.tl.leiden

Are there maybe any additional seeds that need to be set to make this work consistently?

Steps/Code to reproduce bug

import rapids_singlecell as rsc
import scanpy as sc
import pooch
import numpy.testing as npt
import anndata as ad

# Get example data and build AnnData object with two full samples
# Source: https://scverse-tutorials.readthedocs.io/en/latest/notebooks/basic-scrna-tutorial.html
EXAMPLE_DATA = pooch.create(
    path=pooch.os_cache("scverse_tutorials"),
    base_url="doi:10.6084/m9.figshare.22716739.v1/",
)
EXAMPLE_DATA.load_registry_from_doi()
samples = {
    "s1d1": "s1d1_filtered_feature_bc_matrix.h5",
    "s1d3": "s1d3_filtered_feature_bc_matrix.h5",
}
adatas = {}

for sample_id, filename in samples.items():
    path = EXAMPLE_DATA.fetch(filename)
    sample_adata = sc.read_10x_h5(path)
    sample_adata.var_names_make_unique()
    adatas[sample_id] = sample_adata

adata = ad.concat(adatas, label="sample")
adata.obs_names_make_unique()

# basic preprocessing in scanpy
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.tl.pca(adata)
adata2 = adata.copy()

for ad in [adata, adata2]:
    rsc.tl.harmony_integrate(ad, key="sample")
    # don't use harmony -- we want to know if neighbors gives different result, even given the same PCA
    rsc.tl.neighbors(ad, use_rep="X_pca")
    rsc.tl.umap(ad)
    rsc.tl.leiden(ad)

npt.assert_array_equal(adata.obsm["X_pca_harmony"], adata2.obsm["X_pca_harmony"])
npt.assert_array_equal(adata.obsp["connectivities"].data, adata2.obsp["connectivities"].data)
npt.assert_array_equal(adata.obsm["X_umap"], adata2.obsm["X_umap"])
npt.assert_array_equal(adata.obs["leiden"].values, adata.obs["leiden"].values)

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Linux Distro/Architecture: Scientific Linux 7 (3.10.0-1160.11.1.el7.x86_64)
  • GPU Model/Driver: Tesla V100-SXM2-32GB, 470.129.06
  • CUDA: 11.4
  • Method of Rapids install: conda, tried both the 23.04 and 23.06 yml files provided in this repo.
@grst grst added the bug Something isn't working label Sep 4, 2023
@Intron7
Copy link
Member

Intron7 commented Sep 4, 2023

Hi Gregor,

Glad to hear you like RSC. For UMAP I have bad news for you. This is what cuML says about it

random_state is the seed used by the random number generator during embedding initialization and during sampling used by the optimizer. Note: Unfortunately, achieving a high amount of parallelism during the optimization stage often comes at the expense of determinism, since many floating-point additions are being made in parallel without a deterministic ordering. This causes slightly different results across training sessions, even when the same seed is used for random number generation. Setting a random_state will enable consistency of trained embeddings, allowing for reproducible results to 3 digits of precision, but will do so at the expense of potentially slower training and increased memory usage.

For Harmony I'll take a look and see if I can improve reproducibility. I think for this one it might be possible.

Thank you for providing the test code.

@Intron7
Copy link
Member

Intron7 commented Sep 4, 2023

Thank you for pointing lack of reproducibility. As it turns out some of the setup ups I used for the random_state was faulty. I'll fix this in the anndata_PR #60.

100% reproducibility will still not happen due to the async nature of GPU-computing and floats being floats.

With 32-bit floats i got the assertion error too:

Mismatched elements: 113906 / 856250 (13.3%)
Max absolute difference: 8.106232e-06
Max relative difference: 0.00205672

With 64-bit floats the error is:

Mismatched elements: 850179 / 856250 (99.3%)
Max absolute difference: 4.16378043e-12
Max relative difference: 1.41103855e-08

Now the question is how to move on from here. Would you like the option in harmony_integrate to use fp64?

@grst
Copy link
Author

grst commented Sep 5, 2023

Setting a random_state will enable consistency of trained embeddings, allowing for reproducible results to 3 digits of precision, but will do so at the expense of potentially slower training and increased memory usage

This doesn't sound too bad if they are referring to three digits of precision of the output coordinates. Or am I misunderstanding this?

100% reproducibility will still not happen due to the async nature of GPU-computing and floats being floats.

The new results seem pretty similar indeed. I wonder if they still impact the neighborhood graph and the leiden clustering, as this is what matters to me (manually annotating clusters is even more painful if the cluster labels change all the time).
Since the max abs. difference is in the e-6 also for float32 I also wonder what happens if we just round the values to 5 digits?

@Intron7
Copy link
Member

Intron7 commented Sep 5, 2023

So if you use leiden clustering I would in general advice to use rapids-23.04. In 23.06 onwards it's beyond broken #44.

Since the max abs. difference is in the e-6 also for float32 I also wonder what happens if we just round the values to 5 digits?

I will have a look at the Impacted this has on the neighbourhood search and clustering.

This doesn't sound too bad if they are referring to three digits of precision of the output coordinates. Or am I misunderstanding this?

I set the default random_state of the umap to 0. So this should already apply in theory. I can try changing it to 42 or something else.

@grst
Copy link
Author

grst commented Sep 5, 2023

I set the default random_state of the umap to 0. So this should already apply in theory. I can try changing it to 42 or something else.

I see! I doubt changing the seed makes a difference. It's then either their bug or a bit unfortunate wording of the doc section you shared and it's just not possible.

Here's what it looks like in practice:
image
image

Probably not equal to 3 significant digits, but given that it's UMAP and the clustering is reproducible I don't even care that much.

@Intron7
Copy link
Member

Intron7 commented Sep 5, 2023

Ok so with the next release of rapids-singlecell v0.9.0 the rsc.pp.harmony_integrate will be reproducible for neighbors and clustering. I switched the calculations to 64 bit.

Most calculations were anyway done with 64 bit floats so it won't affect performance and memory usage too much.

@Intron7 Intron7 closed this as completed Sep 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants