Precompute clustering #18
base: main
Changes from all commits: 026e765, 9392ab6, 013b54c, b075b3c, 0f99a8d, a4404ca, 54c0fd9
@@ -0,0 +1,30 @@

```yaml
name: precompute_clustering_merge
namespace: data_processors
label: Merge clustering precomputations
summary: Merge the precompute results of clustering on the input dataset
arguments:
  - name: --input
    type: file
    direction: input
    required: true
  - name: --output
    type: file
    direction: output
    required: true
  - type: file
    name: --clusterings
    description: Clustering results to merge
    direction: input
    required: true
    multiple: true
resources:
  - type: python_script
    path: script.py
engines:
  - type: docker
    image: openproblems/base_python:1.0.0
runners:
  - type: executable
  - type: nextflow
    directives:
      label: [midtime, midmem, lowcpu]
```
@@ -0,0 +1,28 @@

```python
import anndata as ad
import pandas as pd

## VIASH START
par = {
    "input": "resources_test/task_batch_integration/cxg_immune_cell_atlas/dataset.h5ad",
    "clusterings": ["output.h5ad", "output2.h5ad"],
    "output": "output3.h5ad",
}
## VIASH END

print("Read clusterings", flush=True)
clusterings = []
for clus_file in par["clusterings"]:
    adata = ad.read_h5ad(clus_file)
    obs_filt = adata.obs.filter(regex="leiden_r[0-9.]+")
    clusterings.append(obs_filt)

print("Merge clusterings", flush=True)
merged = pd.concat(clusterings, axis=1)

print("Read input", flush=True)
input = ad.read_h5ad(par["input"])

input.obsm["clustering"] = merged

print("Store outputs", flush=True)
input.write_h5ad(par["output"], compression="gzip")
```

> **Review comment** (on `input.obsm["clustering"] = merged`): I assume you don't store the clusters under …
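The merge script above relies on pandas aligning the per-file `leiden_r*` columns on the shared cell index when concatenating along `axis=1`. A minimal sketch of that behaviour, using small hypothetical cluster-label frames in place of the real `obs` tables read from each clustering h5ad:

```python
import pandas as pd

# Hypothetical per-run clustering results, indexed by cell barcode
# (stand-ins for the obs tables read from each clustering h5ad file).
run1 = pd.DataFrame({"leiden_r0.2": ["0", "1", "0"]}, index=["cellA", "cellB", "cellC"])
run2 = pd.DataFrame({"leiden_r0.8": ["0", "1", "2"]}, index=["cellA", "cellB", "cellC"])

# filter(regex=...) keeps only the clustering columns, as in the script above
clusterings = [df.filter(regex=r"leiden_r[0-9.]+") for df in (run1, run2)]

# axis=1 concatenation aligns rows on the shared cell index
merged = pd.concat(clusterings, axis=1)
print(merged.columns.tolist())  # → ['leiden_r0.2', 'leiden_r0.8']
```

Because `pd.concat` aligns on the index, the merged frame stays correct even if the per-run files list cells in different orders; cells missing from one run would become NaN rather than being silently misassigned.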
@@ -0,0 +1,34 @@

```yaml
name: precompute_clustering_run
namespace: data_processors
label: Run clustering precomputations
summary: Run clustering on the input dataset
arguments:
  - name: --input
    type: file
    direction: input
    required: true
  - name: --output
    type: file
    direction: output
    required: true
  - type: double
    name: --resolution
    default: 0.8
    description: Resolution parameter for clustering
resources:
  - type: python_script
    path: script.py
engines:
  - type: docker
    image: openproblems/base_python:1.0.0
    setup:
      - type: python
        pypi:
          - scanpy
          - igraph
          - leidenalg
runners:
  - type: executable
  - type: nextflow
    directives:
      label: [midtime, midmem, lowcpu]
```
@@ -0,0 +1,39 @@

```python
import anndata as ad

# check if we can use GPU
USE_GPU = False
try:
    import subprocess
    assert subprocess.run("nvidia-smi", shell=True, stdout=subprocess.DEVNULL).returncode == 0
    from rapids_singlecell.tl import leiden
    USE_GPU = True
except Exception:
    from scanpy.tl import leiden

## VIASH START
par = {
    "input": "resources_test/task_batch_integration/cxg_immune_cell_atlas/dataset.h5ad",
    "output": "output.h5ad",
    "resolution": 0.8,
}
## VIASH END

n_cell_cpu = 300_000

print("Read input", flush=True)
input = ad.read_h5ad(par["input"])

key = f'leiden_r{par["resolution"]}'

leiden(
    input,
    resolution=par["resolution"],
    neighbors_key="knn",
    key_added=key,
)

print("Store outputs", flush=True)
output = ad.AnnData(
    obs=input.obs[[key]],
)
output.write_h5ad(par["output"], compression="gzip")
```

> **Review comment** (on lines +5 to +10): Nice to see that you implemented rapids_singlecell; however, we found that the clustering results are quite different from the CPU implementation and often don't make sense. According to this thread, this bug could have been fixed recently. But for simplicity, I would go with the igraph implementation, which will be the default in scanpy in the future.

> **Review comment** (on reading the input): You could use the partial reading function to only load …

> **Review comment** (on the key pattern): I would change the resolution pattern to f…

> **Review comment** (on the `leiden` call): I would set the CPU algorithm to igraph, as per the future warning message from scanpy.
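The run script's key pattern and the merge component's `filter(regex="leiden_r[0-9.]+")` form an implicit contract: every key the run step writes must be selectable by the merge step's regex. A quick stdlib check of that contract (the resolution values here are illustrative):

```python
import re

# Regex used by the merge script to select clustering columns
CLUSTER_KEY_RE = re.compile(r"leiden_r[0-9.]+")

# Keys produced by the run script's f-string pattern, for example resolutions
keys = [f"leiden_r{res}" for res in (0.2, 0.5, 0.8)]

for key in keys:
    # every generated key must be matched by the merge regex
    assert CLUSTER_KEY_RE.fullmatch(key), key

print(keys)  # → ['leiden_r0.2', 'leiden_r0.5', 'leiden_r0.8']
```

If the key pattern is ever changed (as one reviewer suggests), the merge regex has to change in lockstep, otherwise the merge step silently drops the clustering columns.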
```diff
@@ -12,8 +12,8 @@
     "output": "output.h5ad"
 }
 meta = {
-    "config": "target/nextflow/batch_integration/process_dataset/.config.vsh.yaml",
-    "resources_dir": "src/common/helper_functions"
+    "config": "target/nextflow/data_processors/process_dataset/.config.vsh.yaml",
+    "resources_dir": "target/nextflow/data_processors/process_dataset"
 }
 ## VIASH END
```

```diff
@@ -80,6 +80,12 @@ def compute_batched_hvg(adata, n_hvgs):
         "variance_ratio": variance_ratio
     }
 
+print(">> Recompute neighbors", flush=True)
+del adata.uns["knn"]
+del adata.obsp["knn_connectivities"]
+del adata.obsp["knn_distances"]
+sc.pp.neighbors(adata, use_rep="X_pca", n_neighbors=30, key_added="knn")
+
 print(">> Create output object", flush=True)
 output_dataset = subset_h5ad_by_format(
     adata,
```

> **Review comment** (on the recomputed neighbors): definitely good to compute the neighbors as a precomputation step, since it's also used for other metrics. We might want to make …
```diff
@@ -18,6 +18,13 @@ argument_groups:
         __merge__: /src/api/file_solution.yaml
         required: true
         direction: output
+  - name: Clustering
+    arguments:
+      - name: "--resolutions"
+        type: double
+        multiple: true
+        default: [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
+        description: Resolution parameter for clustering
 
 resources:
   - type: nextflow_script
```

```diff
@@ -29,6 +36,8 @@ dependencies:
   - name: validation/check_dataset_with_schema
     repository: openproblems
   - name: data_processors/process_dataset
+  - name: data_processors/precompute_clustering_run
+  - name: data_processors/precompute_clustering_merge
 
 runners:
   - type: nextflow
```

> **Review comment** (on the default resolutions): These resolutions must be kept consistent between precomputation and the metrics function. The default range from scib is …
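The workflow fans `precompute_clustering_run` out over each value of `--resolutions` and then merges the per-resolution results with `precompute_clustering_merge`. A rough pure-Python sketch of that fan-out/merge shape, with a stubbed clustering step standing in for the real component:

```python
import pandas as pd

def run_clustering_stub(resolution, cells):
    # Stand-in for precompute_clustering_run: one labelled column per resolution.
    key = f"leiden_r{resolution}"
    return pd.DataFrame({key: ["0"] * len(cells)}, index=cells)

cells = ["cellA", "cellB"]
resolutions = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # the workflow default above

# fan-out: one clustering result per resolution value
results = [run_clustering_stub(res, cells) for res in resolutions]

# merge: concatenate the per-resolution columns, as precompute_clustering_merge does
merged = pd.concat(results, axis=1)
print(merged.shape)  # → (2, 7)
```

This is only a shape sketch, not the Nextflow semantics: in the actual workflow each run is an independent process writing its own h5ad, and the merge component reads those files back before concatenating.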
> **Review comment:** I would adjust the pattern as described in the clustering script.