Precompute clustering #18

Open · wants to merge 7 commits into main

Conversation

@rcannood (Member) commented Dec 10, 2024

Describe your changes

  • Add .obsm["clustering"] to the solution so it can be used by the metrics

Example:

input_solution$obsm[["clustering"]] |> head()
                            leiden_r0.2 leiden_r0.4 leiden_r0.3 leiden_r0.7 leiden_r0.6 leiden_r0.8 leiden_r0.5
Sample_787                            1           1           1           1           1           1           1
D74_82                                0           3           0           3           3           3           3
D1713_42                              2           2           2           2           2           2           2
human4_lib3.final_cell_0092           0           0           0           0           0           0           0
Sample_716                            1           1           1           1           1           1           1
Sample_834                            1           1           1           1           1           1           1

Running the process_datasets workflow results in:

$ scripts/create_resources/test_resources.sh
Nextflow 24.10.2 is available - Please consider updating your version to it
N E X T F L O W  ~  version 23.10.0
Launching `target/nextflow/workflows/process_datasets/main.nf` [agitated_lamarr] DSL2 - revision: 9ea5522fa8
executor >  local (11)
[af/ae7c7b] process > process_datasets:run_wf:check_dataset_with_schema:processWf:check_dataset_with_schema_process (run)      [100%] 1 of 1 ✔
[52/3c6e9b] process > process_datasets:run_wf:precompute_clustering_run:processWf:precompute_clustering_run_process (run_r0.7) [100%] 7 of 7 ✔
[87/3fe18f] process > process_datasets:run_wf:precompute_clustering_merge:processWf:precompute_clustering_merge_process (run)  [100%] 1 of 1 ✔
[0a/db4c96] process > process_datasets:run_wf:process_dataset:processWf:process_dataset_process (run)                          [100%] 1 of 1 ✔
[d1/df9244] process > process_datasets:publishStatesSimpleWf:publishStatesProc (run)                                           [100%] 1 of 1 ✔

Checklist before requesting a review

  • I have performed a self-review of my code

  • Check the correct box. Does this PR contain:

    • Breaking changes
    • New functionality
    • Major changes
    • Minor changes
    • Bug fixes
  • Proposed changes are described in the CHANGELOG.md

  • CI Tests succeed and look good!

@rcannood requested a review from @mumichae on December 10, 2024 at 14:37

@mumichae (Contributor) left a comment


Clustering code looks good, with just minor requests. In order for the metrics to recognise the precomputed clusters, I updated the scib version and adjusted the metrics to look for the precomputed clusters, assuming they are stored in .obs.

Comment on lines +5 to +10

try:
    import subprocess
    assert subprocess.run('nvidia-smi', shell=True, stdout=subprocess.DEVNULL).returncode == 0
    from rapids_singlecell.tl import leiden
    USE_GPU = True
except Exception as e:

Nice to see that you implemented rapids_singlecell; however, we found that the clustering results are quite different from the CPU implementation and often don't make sense. According to this thread, the bug may have been fixed recently. But for simplicity, I would go with the igraph implementation, which will be the default in scanpy in the future.
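
For reference, a minimal sketch of that CPU path, assuming scanpy >= 1.10 and the input/par/key names used in this PR's snippets:

import scanpy as sc

# Leiden via the igraph backend, which scanpy will make the default flavor
sc.tl.leiden(
    input,
    resolution=par["resolution"],
    neighbors_key="knn",
    key_added=key,
    flavor="igraph",
    n_iterations=2,
)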

input = ad.read_h5ad(par["input"])

key = f'leiden_r{par["resolution"]}'


I would set the CPU algorithm to igraph, as per the FutureWarning message from scanpy.

Suggested change

kwargs = dict()
if not USE_GPU:
    kwargs |= dict(
        flavor='igraph',
        n_iterations=2,
    )

    input,
    resolution=par["resolution"],
    neighbors_key="knn",
    key_added=key,

Suggested change

-    key_added=key,
+    key_added=key,
+    **kwargs,

n_cell_cpu = 300_000

print("Read input", flush=True)
input = ad.read_h5ad(par["input"])

You could use the partial reading function to only load obs and obsp
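
For example, a hedged sketch using anndata's element reader (exposed under anndata.experimental in anndata 0.8-0.10; moved into the main namespace in later releases):

import h5py
import anndata as ad
from anndata.experimental import read_elem

# Load only .obs and .obsp from the h5ad file, skipping X, layers, etc.
with h5py.File(par["input"], "r") as f:
    adata = ad.AnnData(
        obs=read_elem(f["obs"]),
        obsp={k: read_elem(f["obsp"][k]) for k in f["obsp"]},
    )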

del adata.uns["knn"]
del adata.obsp["knn_connectivities"]
del adata.obsp["knn_distances"]
sc.pp.neighbors(adata, use_rep="X_pca", n_neighbors=30, key_added="knn")

Definitely good to compute the neighbors as a precomputation step, since it's also used for other metrics. We might want to make n_neighbors configurable in the future.
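
A possible sketch, assuming a hypothetical --n_neighbors parameter surfaced through par:

# "n_neighbors" is a hypothetical parameter name; 30 keeps today's behaviour
n_neighbors = par.get("n_neighbors", 30)
sc.pp.neighbors(adata, use_rep="X_pca", n_neighbors=n_neighbors, key_added="knn")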

print("Read input", flush=True)
input = ad.read_h5ad(par["input"])

key = f'leiden_r{par["resolution"]}'

I would change the resolution key pattern to f'leiden_{par["resolution"]}' (i.e. drop the r), since that's what the optimal clustering function is expecting: https://github.com/theislab/scib/blob/d35727305501f905424f19a0e95b61dcda61c460/scib/metrics/clustering.py#L56-L58
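
Concretely, a sketch of the renamed key, assuming scib composes cluster keys as f"{cluster_key}_{resolution}" per the linked lines:

# drop the 'r' so the key matches scib's optimal-clustering lookup
key = f'leiden_{par["resolution"]}'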

clusterings = []
for clus_file in par["clusterings"]:
    adata = ad.read_h5ad(clus_file)
    obs_filt = adata.obs.filter(regex='leiden_r[0-9.]+')

I would adjust the pattern as described in the clustering script
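
That is, a sketch of the adjusted filter, assuming the key rename proposed above:

# match the renamed 'leiden_{resolution}' keys (no 'r' prefix)
obs_filt = adata.obs.filter(regex=r'leiden_[0-9.]+')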

  - name: "--resolutions"
    type: double
    multiple: true
    default: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

These resolutions must be kept consistent between the precomputation and the metrics function. The default range from scib is [0.1, 0.2, ..., 1.8, 1.9, 2.0].
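
If the defaults were aligned with scib's, the sweep could be generated rather than hard-coded; a sketch:

# scib's default resolution sweep: 0.1 through 2.0 in steps of 0.1
resolutions = [round(0.1 * i, 1) for i in range(1, 21)]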

print("Read input", flush=True)
input = ad.read_h5ad(par["input"])

input.obsm["clustering"] = merged

I assume you store the clusters under .obsm rather than .obs. However, the metric will look in .obs for those clusters, so I'd recommend storing them there.
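
A sketch of that change in the merge script, where merged is the combined clustering data frame built from the per-resolution files:

# write each clustering into .obs, where the scib metrics look up cluster keys
for col in merged.columns:
    input.obs[col] = merged[col].astype("category")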
