Precompute clustering #18

Open · wants to merge 7 commits into main

Conversation

@rcannood (Member) commented Dec 10, 2024

Describe your changes

  • Add .obsm["clustering"] to the solution so it can be used by the metrics

Example:

input_solution$obsm[["clustering"]] |> head()
                            leiden_r0.2 leiden_r0.4 leiden_r0.3 leiden_r0.7 leiden_r0.6 leiden_r0.8 leiden_r0.5
Sample_787                            1           1           1           1           1           1           1
D74_82                                0           3           0           3           3           3           3
D1713_42                              2           2           2           2           2           2           2
human4_lib3.final_cell_0092           0           0           0           0           0           0           0
Sample_716                            1           1           1           1           1           1           1
Sample_834                            1           1           1           1           1           1           1

Running the process_datasets workflow results in:

$ scripts/create_resources/test_resources.sh
Nextflow 24.10.2 is available - Please consider updating your version to it
N E X T F L O W  ~  version 23.10.0
Launching `target/nextflow/workflows/process_datasets/main.nf` [agitated_lamarr] DSL2 - revision: 9ea5522fa8
executor >  local (11)
[af/ae7c7b] process > process_datasets:run_wf:check_dataset_with_schema:processWf:check_dataset_with_schema_process (run)      [100%] 1 of 1 ✔
[52/3c6e9b] process > process_datasets:run_wf:precompute_clustering_run:processWf:precompute_clustering_run_process (run_r0.7) [100%] 7 of 7 ✔
[87/3fe18f] process > process_datasets:run_wf:precompute_clustering_merge:processWf:precompute_clustering_merge_process (run)  [100%] 1 of 1 ✔
[0a/db4c96] process > process_datasets:run_wf:process_dataset:processWf:process_dataset_process (run)                          [100%] 1 of 1 ✔
[d1/df9244] process > process_datasets:publishStatesSimpleWf:publishStatesProc (run)                                           [100%] 1 of 1 ✔

Checklist before requesting a review

  • I have performed a self-review of my code

  • Check the correct box. Does this PR contain:

    • Breaking changes
    • New functionality
    • Major changes
    • Minor changes
    • Bug fixes
  • Proposed changes are described in the CHANGELOG.md

  • CI Tests succeed and look good!

@rcannood requested a review from @mumichae on December 10, 2024 at 14:37

@mumichae (Contributor) left a comment


Clustering code looks good, with just minor requests. In order for the metrics to recognise the precomputed clusters, I updated the scib version and adjusted the metrics to look for the precomputed clusters, assuming they are stored in .obs.

Comment on lines +5 to +10

try:
    import subprocess
    assert subprocess.run('nvidia-smi', shell=True, stdout=subprocess.DEVNULL).returncode == 0
    from rapids_singlecell.tl import leiden
    USE_GPU = True
except Exception as e:

Nice to see that you implemented rapids_singlecell; however, we found that the clustering results are quite different from the CPU implementation and often don't make sense. According to this thread, the bug may have been fixed recently. But for simplicity, I would go with the igraph implementation, which will be the default in scanpy in the future.
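
For reference, a minimal sketch of that CPU path, assuming scanpy >= 1.10 and the input/par/key names used in this PR's snippets:

import scanpy as sc

# Leiden via the igraph backend, which scanpy will make the default flavor
sc.tl.leiden(
    input,
    resolution=par["resolution"],
    neighbors_key="knn",
    key_added=key,
    flavor="igraph",
    n_iterations=2,
)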

input = ad.read_h5ad(par["input"])

key = f'leiden_r{par["resolution"]}'


I would set the CPU algorithm to igraph, as per the FutureWarning message from scanpy.

Suggested change

kwargs = dict()
if not USE_GPU:
    kwargs |= dict(
        flavor='igraph',
        n_iterations=2,
    )

    input,
    resolution=par["resolution"],
    neighbors_key="knn",
    key_added=key,

Suggested change

-    key_added=key,
+    key_added=key,
+    **kwargs,

n_cell_cpu = 300_000

print("Read input", flush=True)
input = ad.read_h5ad(par["input"])

You could use the partial reading function to only load obs and obsp
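
For example, a hedged sketch using anndata's element reader (exposed under anndata.experimental in anndata 0.8-0.10; moved into the main namespace in later releases):

import h5py
import anndata as ad
from anndata.experimental import read_elem

# Load only .obs and .obsp from the h5ad file, skipping X, layers, etc.
with h5py.File(par["input"], "r") as f:
    adata = ad.AnnData(
        obs=read_elem(f["obs"]),
        obsp={k: read_elem(f["obsp"][k]) for k in f["obsp"]},
    )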

del adata.uns["knn"]
del adata.obsp["knn_connectivities"]
del adata.obsp["knn_distances"]
sc.pp.neighbors(adata, use_rep="X_pca", n_neighbors=30, key_added="knn")

Definitely good to compute the neighbors as a precomputation step, since it's also used for other metrics. We might want to make n_neighbors configurable in the future.
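
A possible sketch, assuming a hypothetical --n_neighbors parameter surfaced through par:

# "n_neighbors" is a hypothetical parameter name; 30 keeps today's behaviour
n_neighbors = par.get("n_neighbors", 30)
sc.pp.neighbors(adata, use_rep="X_pca", n_neighbors=n_neighbors, key_added="knn")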

print("Read input", flush=True)
input = ad.read_h5ad(par["input"])

key = f'leiden_r{par["resolution"]}'

I would change the resolution key pattern to f'leiden_{par["resolution"]}' (i.e. drop the r), since that's what the optimal clustering function is expecting: https://github.com/theislab/scib/blob/d35727305501f905424f19a0e95b61dcda61c460/scib/metrics/clustering.py#L56-L58
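
Concretely, a sketch of the renamed key, assuming scib composes cluster keys as f"{cluster_key}_{resolution}" per the linked lines:

# drop the 'r' so the key matches scib's optimal-clustering lookup
key = f'leiden_{par["resolution"]}'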

clusterings = []
for clus_file in par["clusterings"]:
    adata = ad.read_h5ad(clus_file)
    obs_filt = adata.obs.filter(regex='leiden_r[0-9.]+')

I would adjust the pattern as described in the clustering script
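
That is, a sketch of the adjusted filter, assuming the key rename proposed above:

# match the renamed 'leiden_{resolution}' keys (no 'r' prefix)
obs_filt = adata.obs.filter(regex=r'leiden_[0-9.]+')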

  - name: "--resolutions"
    type: double
    multiple: true
    default: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

These resolutions must be kept consistent between the precomputation and the metrics function. The default range from scib is [0.1, 0.2, ..., 1.8, 1.9, 2.0].
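
If the defaults were aligned with scib's, the sweep could be generated rather than hard-coded; a sketch:

# scib's default resolution sweep: 0.1 through 2.0 in steps of 0.1
resolutions = [round(0.1 * i, 1) for i in range(1, 21)]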

print("Read input", flush=True)
input = ad.read_h5ad(par["input"])

input.obsm["clustering"] = merged

I assume you store the clusters under .obsm rather than .obs. However, the metric will look in .obs for those clusters, so I'd recommend storing them there.
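
A sketch of that change in the merge script, where merged is the combined clustering data frame built from the per-resolution files:

# write each clustering into .obs, where the scib metrics look up cluster keys
for col in merged.columns:
    input.obs[col] = merged[col].astype("category")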
