Precompute clustering #18
base: main
Conversation
Clustering code looks good, with just minor requests. In order for the metrics to recognise the precomputed clusters, I updated the scib version and adjusted the metrics to look for the precomputed clusters, assuming they are stored in .obs.
try:
    import subprocess
    assert subprocess.run('nvidia-smi', shell=True, stdout=subprocess.DEVNULL).returncode == 0
    from rapids_singlecell.tl import leiden
    USE_GPU = True
except Exception as e:
    USE_GPU = False  # no usable GPU; fall back to the CPU implementation
Nice to see that you implemented rapids_singlecell; however, we found that the clustering results are quite different from the CPU implementation and often don't make sense. According to this thread, this bug may have been fixed recently. But for simplicity, I would go with the igraph implementation, which will be the default in scanpy in the future.
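For reference, a minimal sketch of the igraph-flavored call, assuming the PR's `input` AnnData, its `key` variable, and the neighbors graph stored under the "knn" key:

```python
import scanpy as sc

# Leiden via the igraph backend, scanpy's future default flavor
sc.tl.leiden(
    input,
    resolution=par["resolution"],
    flavor="igraph",
    n_iterations=2,
    neighbors_key="knn",
    key_added=key,
)
```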
input = ad.read_h5ad(par["input"])

key = f'leiden_r{par["resolution"]}'
I would set the CPU algorithm to igraph, as per the FutureWarning message from scanpy.
Suggested change:

kwargs = dict()
if not USE_GPU:
    kwargs |= dict(
        flavor='igraph',
        n_iterations=2,
    )
    input,
    resolution=par["resolution"],
    neighbors_key="knn",
    key_added=key,
Suggested change:

    key_added=key,
    **kwargs,
n_cell_cpu = 300_000

print("Read input", flush=True)
input = ad.read_h5ad(par["input"])
You could use the partial reading function to only load .obs and .obsp.
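A hedged sketch of what that could look like, assuming anndata's experimental read_elem API and that an AnnData can be assembled from just the two loaded elements:

```python
import h5py
import anndata as ad
from anndata.experimental import read_elem

# Load only .obs and .obsp instead of the full AnnData object
with h5py.File(par["input"], "r") as f:
    obs = read_elem(f["obs"])    # cell metadata
    obsp = read_elem(f["obsp"])  # pairwise graphs, e.g. the KNN graph

input = ad.AnnData(obs=obs, obsp=obsp)
```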
del adata.uns["knn"]
del adata.obsp["knn_connectivities"]
del adata.obsp["knn_distances"]
sc.pp.neighbors(adata, use_rep="X_pca", n_neighbors=30, key_added="knn")
Definitely good to compute the neighbors as a precomputation step, since it's also used for other metrics. We might want to make n_neighbors configurable in the future.
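A sketch of what that could look like; the "n_neighbors" parameter name is an assumption, it is not yet in the config:

```python
# Fall back to 30 neighbors when the (hypothetical) parameter is unset
n_neighbors = par.get("n_neighbors", 30)
sc.pp.neighbors(adata, use_rep="X_pca", n_neighbors=n_neighbors, key_added="knn")
```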
print("Read input", flush=True) | ||
input = ad.read_h5ad(par["input"]) | ||
|
||
key = f'leiden_r{par["resolution"]}' |
I would change the key pattern to `f'leiden_{par["resolution"]}'`, since that's what the optimal clustering function is expecting: https://github.com/theislab/scib/blob/d35727305501f905424f19a0e95b61dcda61c460/scib/metrics/clustering.py#L56-L58
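In other words, a one-line sketch of the expected pattern, without the "r" prefix:

```python
# scib's optimal clustering helper looks up f'{cluster_key}_{resolution}'
key = f'leiden_{par["resolution"]}'
```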
clusterings = []
for clus_file in par["clusterings"]:
    adata = ad.read_h5ad(clus_file)
    obs_filt = adata.obs.filter(regex='leiden_r[0-9.]+')
I would adjust the pattern as described in the clustering script
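A sketch of the adjusted filter, assuming the `leiden_{resolution}` key pattern suggested above:

```python
# Match keys like 'leiden_0.2' instead of 'leiden_r0.2'
obs_filt = adata.obs.filter(regex=r'leiden_[0-9.]+')
```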
- name: "--resolutions" | ||
type: double | ||
multiple: true | ||
default: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8] |
These resolutions must be kept consistent between precomputation and the metrics function. The default range from scib is [0.1, 0.2, ..., 1.8, 1.9, 2.0].
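A sketch reproducing that default range, so the config default stays in sync with scib:

```python
import numpy as np

# 0.1 to 2.0 in steps of 0.1; rounding avoids floating-point noise
resolutions = [round(r, 1) for r in np.linspace(0.1, 2.0, 20)]
```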
print("Read input", flush=True) | ||
input = ad.read_h5ad(par["input"]) | ||
|
||
input.obsm["clustering"] = merged |
I assume you don't store the clusters under .obs. However, the metric will look in .obs for those clusters, so I'd recommend storing them there.
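A sketch of moving the clusterings into .obs, assuming `merged` is a DataFrame with one column per resolution:

```python
# Copy each clustering column into .obs so the metrics can find it
for col in merged.columns:
    input.obs[col] = merged[col].values
```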
Describe your changes

- Adds `.obsm["clustering"]` to the solution so it can be used by the metrics

Example: Running the process_dataset workflow results in:
Checklist before requesting a review

- I have performed a self-review of my code

Check the correct box. Does this PR contain:

- Proposed changes are described in the CHANGELOG.md
- CI Tests succeed and look good!