NeurIPS 2021 migration match modality #65

Draft: wants to merge 88 commits into base: main
Changes from 1 commit of 88 commits
0db12b5
add mask_dataset
KaiWaldrant Dec 14, 2022
bc99112
debug mask_dataset test
KaiWaldrant Dec 14, 2022
262a1ed
add masked anddata api
KaiWaldrant Dec 14, 2022
3f367e1
add random_embed negative control
KaiWaldrant Dec 14, 2022
861072e
update control_method api
KaiWaldrant Dec 14, 2022
8749e2a
add zeros_embed control
KaiWaldrant Dec 15, 2022
7c89329
add lmds method
KaiWaldrant Dec 15, 2022
0d29dd0
add mnn method
KaiWaldrant Dec 15, 2022
3c46c4d
add newwave method
KaiWaldrant Dec 15, 2022
4ddb315
add pca method
KaiWaldrant Dec 15, 2022
3ae1855
Add totalVI method
KaiWaldrant Dec 16, 2022
87f84cb
add umap method
KaiWaldrant Dec 16, 2022
7cc07bf
add metric ari
KaiWaldrant Dec 16, 2022
caff25d
update comp_metric
KaiWaldrant Dec 16, 2022
f7e0e0b
update ari metric
KaiWaldrant Dec 16, 2022
22c7f46
add asw_batch metric
KaiWaldrant Dec 16, 2022
d7e03de
add asw_label metric
KaiWaldrant Dec 16, 2022
1b47472
add cc_cons metric
KaiWaldrant Dec 16, 2022
ea82ca5
remove DI docker because of old anndata package
KaiWaldrant Jan 4, 2023
16ce776
add check_format metric
KaiWaldrant Jan 4, 2023
4bce62c
add graph connectivity metric
KaiWaldrant Jan 4, 2023
bdbdbfd
add latent mixing metric
KaiWaldrant Jan 4, 2023
5457a6c
add nmi metric
KaiWaldrant Jan 4, 2023
6d50fc4
add rfoob metric
KaiWaldrant Jan 4, 2023
82ae20e
add ti_cons metric
KaiWaldrant Jan 4, 2023
acfb631
add ti_cons_batch metric
KaiWaldrant Jan 4, 2023
71ae0e9
add metric unit test
KaiWaldrant Jan 5, 2023
ed38c11
add task_info.yaml
KaiWaldrant Jan 5, 2023
88419ae
Merge remote-tracking branch 'origin/main' into neurips2021/joint_emb…
KaiWaldrant Jan 6, 2023
b6d5bbd
create NF workflow
KaiWaldrant Jan 6, 2023
99b0524
update changelog
KaiWaldrant Jan 6, 2023
c8ae601
update changelog
KaiWaldrant Jan 6, 2023
10f75d4
fix typo in changelog
KaiWaldrant Jan 6, 2023
e0aef20
fix typo in changelog
KaiWaldrant Jan 6, 2023
8327637
convert sparse matrix to array
KaiWaldrant Jan 9, 2023
1b2dd90
use denormalized counts data
KaiWaldrant Jan 9, 2023
3f02cbe
Add api yaml files
KaiWaldrant Jan 10, 2023
b64c964
add mask_dataset
KaiWaldrant Jan 10, 2023
8bf8833
add constant control method
KaiWaldrant Jan 10, 2023
db7ec21
add random_pairing control method
KaiWaldrant Jan 10, 2023
f850c10
add semi_solution control method
KaiWaldrant Jan 10, 2023
6b17f90
add solution control_method
KaiWaldrant Jan 10, 2023
b30a10e
add dr_knn_cbf method
KaiWaldrant Jan 10, 2023
abb8286
add dr_knnr_knn method
KaiWaldrant Jan 10, 2023
204426a
add linear method
KaiWaldrant Jan 10, 2023
8c54017
add newwave_knnr_cbf method
KaiWaldrant Jan 10, 2023
65dd490
add newwave_knnr_knn method
KaiWaldrant Jan 10, 2023
334dea9
add procrusted_knn method
KaiWaldrant Jan 10, 2023
6482c5d
add babel_knn method
KaiWaldrant Jan 11, 2023
46c5466
add aupr metrics
KaiWaldrant Jan 12, 2023
6014e24
add check_format metric
KaiWaldrant Jan 12, 2023
33f21a0
add match_probability metric
KaiWaldrant Jan 12, 2023
3070f53
add resources and resources_test scripts
KaiWaldrant Jan 12, 2023
4e134e5
add NF workflow
KaiWaldrant Jan 12, 2023
8aad000
fix directives
KaiWaldrant Jan 13, 2023
f3a0017
fix configs
KaiWaldrant Jan 13, 2023
a8895dc
fix directive labels
KaiWaldrant Jan 13, 2023
a849f0b
update configs to align with v1 metadata
KaiWaldrant Jan 13, 2023
399a316
add readme
KaiWaldrant Jan 13, 2023
be3e175
update readme
KaiWaldrant Jan 13, 2023
4d749cd
Merge remote-tracking branch 'origin/main' into neurips2021/joint_emb…
KaiWaldrant Jan 16, 2023
4f7b485
Merge remote-tracking branch 'origin/neurips2021/joint_embedding' int…
KaiWaldrant Jan 16, 2023
78c0db7
Merge remote-tracking branch 'origin/main' into neurips2021/joint_emb…
KaiWaldrant Jan 24, 2023
0bce137
update task info and readme
KaiWaldrant Jan 24, 2023
b4855ea
Merge remote-tracking branch 'origin/neurips2021/joint_embedding' int…
KaiWaldrant Jan 24, 2023
f7d8cbb
add readme and task info
KaiWaldrant Jan 24, 2023
4460310
Add api yaml files
KaiWaldrant Jan 10, 2023
fd038d2
add mask_dataset
KaiWaldrant Jan 10, 2023
43d2db1
add constant control method
KaiWaldrant Jan 10, 2023
f276571
add random_pairing control method
KaiWaldrant Jan 10, 2023
7653329
add semi_solution control method
KaiWaldrant Jan 10, 2023
9e8ecb1
add solution control_method
KaiWaldrant Jan 10, 2023
ebfd94d
add dr_knn_cbf method
KaiWaldrant Jan 10, 2023
77a8db6
add dr_knnr_knn method
KaiWaldrant Jan 10, 2023
db5491e
add linear method
KaiWaldrant Jan 10, 2023
a5b7803
add newwave_knnr_cbf method
KaiWaldrant Jan 10, 2023
b70bf42
add newwave_knnr_knn method
KaiWaldrant Jan 10, 2023
7134f44
add procrusted_knn method
KaiWaldrant Jan 10, 2023
91e41c1
add babel_knn method
KaiWaldrant Jan 11, 2023
eb1e5f5
add aupr metrics
KaiWaldrant Jan 12, 2023
7b3e5ae
add check_format metric
KaiWaldrant Jan 12, 2023
594febc
add match_probability metric
KaiWaldrant Jan 12, 2023
017c203
add resources and resources_test scripts
KaiWaldrant Jan 12, 2023
186125e
add NF workflow
KaiWaldrant Jan 12, 2023
4eaeff9
fix directives
KaiWaldrant Jan 13, 2023
da7a2eb
fix configs
KaiWaldrant Jan 13, 2023
215ecd6
add readme and task info
KaiWaldrant Jan 24, 2023
065d50e
Merge branch 'neurips2021/match_modality' of github.com:openproblems-…
KaiWaldrant Jan 24, 2023
create NF workflow
KaiWaldrant committed Jan 6, 2023
commit b6d5bbdcfb5c7aed7ad3858df47006876f6edb3b
8 changes: 7 additions & 1 deletion src/joint_embedding/methods/mnn/config.vsh.yaml
@@ -16,9 +16,15 @@ functionality:
     path: script.R
   platforms:
     - type: docker
-      image: dataintuitive/randpy:r4.0_py3.8_bioc3.12
+      image: eddelbuettel/r2u:22.04
       setup:
+        - type: r
+          cran: [ anndata, lmds, tidyverse, bioconductor ]
         - type: r
           bioc: [ SingleCellExperiment, batchelor, proxyC ]
+        - type: apt
+          packages: [ libhdf5-dev, libgeos-dev, python3, python3-pip, python3-dev, python-is-python3 ]
+        - type: python
+          pip: [ anndata>=0.8 ]
     - type: nextflow
       directives: [ lowmem, lowtime, lowcpu ]
8 changes: 7 additions & 1 deletion src/joint_embedding/methods/newwave/config.vsh.yaml
@@ -25,9 +25,15 @@ functionality:
     path: script.R
   platforms:
     - type: docker
-      image: dataintuitive/randpy:r4.0_py3.8_bioc3.12
+      image: eddelbuettel/r2u:22.04
       setup:
+        - type: r
+          cran: [ anndata, lmds, tidyverse, bioconductor ]
         - type: r
           bioc: [ SingleCellExperiment, NewWave, proxyC ]
+        - type: apt
+          packages: [ libhdf5-dev, libgeos-dev, python3, python3-pip, python3-dev, python-is-python3 ]
+        - type: python
+          pip: [ anndata>=0.8 ]
     - type: nextflow
       directives: [ highmem, hightime, highcpu ]
8 changes: 6 additions & 2 deletions src/joint_embedding/methods/pca/config.vsh.yaml
@@ -21,9 +21,13 @@ functionality:
     path: script.R
   platforms:
     - type: docker
-      image: dataintuitive/randpy:r4.0_py3.8_bioc3.12
+      image: eddelbuettel/r2u:22.04
       setup:
         - type: r
-          packages: [ irlba, proxyC ]
+          cran: [ anndata, lmds, tidyverse, bioconductor, irlba, proxyC ]
+        - type: apt
+          packages: [ libhdf5-dev, libgeos-dev, python3, python3-pip, python3-dev, python-is-python3 ]
+        - type: python
+          pip: [ anndata>=0.8 ]
     - type: nextflow
       directives: [ lowmem, lowtime, lowcpu ]
10 changes: 5 additions & 5 deletions src/joint_embedding/methods/totalvi/script.py
@@ -17,12 +17,12 @@
 }
 ## VIASH END

-print("Load and prepare data")
+print("Load and prepare data", flush=True)
 adata_mod1 = anndata.read_h5ad(par['input_mod1'])
 adata_mod2 = anndata.read_h5ad(par['input_mod2'])
 adata_mod1.obsm['protein_expression'] = adata_mod2.X.toarray()

-print('Select highly variable genes')
+print('Select highly variable genes', flush=True)
 sc.pp.highly_variable_genes(
     adata_mod1,
     n_top_genes=par['hvg_number'],
@@ -31,18 +31,18 @@
     subset=True
 )

-print("Set up model")
+print("Set up model", flush=True)
 TOTALVI.setup_anndata(
     adata_mod1,
     batch_key="batch",
     protein_expression_obsm_key="protein_expression"
 )

-print('Train totalVI with', par['max_epochs'], 'epochs')
+print('Train totalVI with', par['max_epochs'], 'epochs', flush=True)
 vae = TOTALVI(adata_mod1, latent_distribution="normal")
 vae.train(max_epochs = par['max_epochs'])

-print("Postprocessing and saving output")
+print("Postprocessing and saving output", flush=True)
 adata_out = anndata.AnnData(
     X=vae.get_latent_representation(),
     obs=adata_mod1.obs[['batch']],
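The flush=True change matters because Nextflow and Docker capture stdout through a pipe, where Python block-buffers output; without flushing, progress messages may only show up after the process exits. A minimal sketch (not part of the PR) of that buffering behaviour, using an in-memory stream to stand in for a pipe:

```python
import io

# Simulate a block-buffered stdout, as when stdout is a pipe
# under Nextflow/Docker rather than an interactive terminal.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, write_through=False)

print("Load and prepare data", file=stream)
# Without flush, the message sits in Python's text-layer buffer.
assert raw.getvalue() == b""

print("Load and prepare data", file=stream, flush=True)
# flush=True pushes everything through to the underlying stream.
assert b"Load and prepare data" in raw.getvalue()
```

This is why each print in the script gained flush=True rather than relying on exit-time flushing.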
10 changes: 7 additions & 3 deletions src/joint_embedding/methods/umap/config.vsh.yaml
@@ -1,6 +1,6 @@
 __merge__: ../../api/comp_method.yaml
 functionality:
-  name: umam
+  name: umap
   namespace: joint_embedding/methods
   version: dev
   description: UMAP dimensionality reduction on the Euclidean distance.
@@ -33,9 +33,13 @@ functionality:
     path: script.R
   platforms:
     - type: docker
-      image: dataintuitive/randpy:r4.0_py3.8_bioc3.12
+      image: eddelbuettel/r2u:22.04
       setup:
         - type: r
-          packages: [ uwot, irlba, proxyC ]
+          cran: [ anndata, lmds, tidyverse, irlba, proxyC, uwot ]
+        - type: apt
+          packages: [ libhdf5-dev, libgeos-dev, python3, python3-pip, python3-dev, python-is-python3 ]
+        - type: python
+          pip: [ anndata>=0.8 ]
     - type: nextflow
       directives: [ lowmem, lowtime, lowcpu ]
64 changes: 64 additions & 0 deletions src/joint_embedding/resources_scripts/mask_datasets.sh
@@ -0,0 +1,64 @@
#!/bin/bash

# get the root of the directory
REPO_ROOT=$(git rev-parse --show-toplevel)

# ensure that the command below is run from the root of the repository
cd "$REPO_ROOT"

COMMON_DATASETS="resources/datasets/openproblems_v1"
OUTPUT_DIR="resources/joint_embedding/datasets/openproblems_v1"

if [ ! -d "$OUTPUT_DIR" ]; then
mkdir -p "$OUTPUT_DIR"
fi

params_file="$OUTPUT_DIR/params.yaml"

if [ ! -f $params_file ]; then
python << HERE
import anndata as ad
import glob
import yaml

h5ad_files = glob.glob("$COMMON_DATASETS/**.h5ad")

# this task doesn't use normalizations
param_list = {}

for h5ad_file in h5ad_files:
    print(f"Checking {h5ad_file}")
    adata = ad.read_h5ad(h5ad_file, backed=True)
    if "counts" in adata.layers:
        dataset_id = adata.uns["dataset_id"].replace("/", ".")
        obj = {
            'id': dataset_id,
            'input': h5ad_file,
            'dataset_id': dataset_id,
        }
        param_list[dataset_id] = obj

output = {
    "param_list": list(param_list.values()),
    "seed": 123,
    "output_train": "\$id.train.h5ad",
    "output_test": "\$id.test.h5ad"
}

with open("$params_file", "w") as file:
    yaml.dump(output, file)
HERE
fi

export NXF_VER=22.04.5
nextflow \
run . \
-main-script target/nextflow/denoising/split_dataset/main.nf \
-profile docker \
-resume \
-params-file $params_file \
--publish_dir "$OUTPUT_DIR"

bin/tools/docker/nextflow/process_log/process_log \
--output "$OUTPUT_DIR/nextflow_log.tsv"
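For reference, the params.yaml this heredoc generates would take roughly the following shape (the dataset id below is hypothetical, and yaml.dump sorts top-level keys alphabetically):

```yaml
output_test: $id.test.h5ad
output_train: $id.train.h5ad
param_list:
  - id: dataset_a
    input: resources/datasets/openproblems_v1/dataset_a.h5ad
    dataset_id: dataset_a
seed: 123
```

Each param_list entry feeds one invocation of the split_dataset workflow, with `$id` expanded per entry by Nextflow.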
74 changes: 74 additions & 0 deletions src/joint_embedding/resources_scripts/run_benchmarks.sh
@@ -0,0 +1,74 @@
#!/bin/bash

# get the root of the directory
REPO_ROOT=$(git rev-parse --show-toplevel)

# ensure that the command below is run from the root of the repository
cd "$REPO_ROOT"

set -e

export TOWER_WORKSPACE_ID=53907369739130

DATASETS_DIR="resources/denoising/datasets/openproblems_v1"
OUTPUT_DIR="resources/denoising/benchmarks/openproblems_v1"

if [ ! -d "$OUTPUT_DIR" ]; then
mkdir -p "$OUTPUT_DIR"
fi

params_file="$OUTPUT_DIR/params.yaml"

if [ ! -f $params_file ]; then
python << HERE
import yaml
import os

dataset_dir = "$DATASETS_DIR"
output_dir = "$OUTPUT_DIR"

# read split datasets yaml
with open(dataset_dir + "/params.yaml", "r") as file:
    split_list = yaml.safe_load(file)
datasets = split_list['param_list']

# figure out where train/test files were stored
param_list = []

for dataset in datasets:
    id = dataset["id"]
    input_train = dataset_dir + "/" + id + ".train.h5ad"
    input_test = dataset_dir + "/" + id + ".test.h5ad"

    if os.path.exists(input_test):
        obj = {
            'id': id,
            'dataset_id': dataset["dataset_id"],
            'input_train': input_train,
            'input_test': input_test
        }
        param_list.append(obj)

# write as output file
output = {
    "param_list": param_list,
}

with open(output_dir + "/params.yaml", "w") as file:
    yaml.dump(output, file)
HERE
fi

export NXF_VER=22.04.5
nextflow \
run . \
-main-script src/denoising/workflows/run/main.nf \
-profile docker \
-params-file "$params_file" \
--publish_dir "$OUTPUT_DIR" \
-with-tower

bin/tools/docker/nextflow/process_log/process_log \
--output "$OUTPUT_DIR/nextflow_log.tsv"
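The param-list logic in the heredoc above can be isolated as a standalone function, which makes the filtering rule explicit: a dataset is benchmarked only if its test split exists on disk. This is a sketch under the same naming conventions (the paths and ids are hypothetical):

```python
import os

def build_param_list(datasets, dataset_dir):
    """Pair each dataset id with its train/test files, keeping only
    datasets whose test split was actually produced."""
    param_list = []
    for dataset in datasets:
        did = dataset["id"]
        input_train = os.path.join(dataset_dir, f"{did}.train.h5ad")
        input_test = os.path.join(dataset_dir, f"{did}.test.h5ad")
        if os.path.exists(input_test):
            param_list.append({
                "id": did,
                "dataset_id": dataset["dataset_id"],
                "input_train": input_train,
                "input_test": input_test,
            })
    return param_list
```

Datasets for which the split workflow failed (no `.test.h5ad`) are silently skipped rather than aborting the benchmark run.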
57 changes: 57 additions & 0 deletions src/joint_embedding/resources_test_scripts/bmmc_cite.sh
@@ -0,0 +1,57 @@
#!/bin/bash
#
#make sure the following command has been executed
#bin/viash_build -q 'denoising|common'

# get the root of the directory
REPO_ROOT=$(git rev-parse --show-toplevel)

# ensure that the command below is run from the root of the repository
cd "$REPO_ROOT"

MOD_1_DATA=resources_test/common/openproblems_bmmc_cite_starter/openproblems_bmmc_cite_starter.output_rna.h5ad
MOD_2_DATA=resources_test/common/openproblems_bmmc_cite_starter/openproblems_bmmc_cite_starter.output_mod2.h5ad
DATASET_DIR=resources_test/joint_embedding/bmmc_cite

if [ ! -f $MOD_1_DATA ]; then
echo "Error! Could not find raw data"
exit 1
fi

mkdir -p $DATASET_DIR

# split dataset
bin/viash run src/joint_embedding/mask_dataset/config.vsh.yaml -- \
--input_mod1 $MOD_1_DATA \
--input_mod2 $MOD_2_DATA \
--output_mod1 $DATASET_DIR/cite_mod1.h5ad \
--output_mod2 $DATASET_DIR/cite_mod2.h5ad \
--output_solution $DATASET_DIR/cite_solution.h5ad

# run one method
bin/viash run src/joint_embedding/methods/pca/config.vsh.yaml -- \
--input_mod1 $DATASET_DIR/cite_mod1.h5ad \
--input_mod2 $DATASET_DIR/cite_mod2.h5ad \
--output $DATASET_DIR/pca.h5ad

# run one metric
bin/viash run src/joint_embedding/metrics/ari/config.vsh.yaml -- \
--input_prediction $DATASET_DIR/pca.h5ad \
--input_solution $DATASET_DIR/cite_solution.h5ad \
--output $DATASET_DIR/ari.h5ad

# run benchmark
export NXF_VER=22.04.5

bin/nextflow \
run . \
-main-script src/joint_embedding/workflows/run/main.nf \
-profile docker \
-resume \
--id bmmc_cite \
--dataset_id bmmc_cite \
--input_mod1 $DATASET_DIR/cite_mod1.h5ad \
--input_mod2 $DATASET_DIR/cite_mod2.h5ad \
--input_solution $DATASET_DIR/cite_solution.h5ad \
--output scores.tsv \
--publish_dir $DATASET_DIR/
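The ari metric component invoked above is not shown in this diff, but the adjusted Rand index it is named after compares two clusterings up to label permutation, scoring 1.0 for identical partitions and near 0 for unrelated ones. A minimal illustration via scikit-learn (assuming the metric follows the standard definition):

```python
from sklearn.metrics import adjusted_rand_score

# The same grouping with swapped labels is still a perfect match.
labels_true = [0, 0, 1, 1]
labels_pred = [1, 1, 0, 0]
assert adjusted_rand_score(labels_true, labels_pred) == 1.0
```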
26 changes: 26 additions & 0 deletions src/joint_embedding/workflows/run/config.vsh.yaml
@@ -0,0 +1,26 @@
functionality:
  name: "run_benchmark"
  namespace: "joint_embedding/workflows"
  argument_groups:
    - name: Inputs
      arguments:
        - name: "--id"
          type: "string"
          description: "The ID of the dataset"
          required: true
        - name: "--input_mod1"
          type: "file" # todo: replace with includes
        - name: "--input_mod2"
          type: "file" # todo: replace with includes
        - name: "--input_solution"
          type: "file" # todo: replace with includes
    - name: Outputs
      arguments:
        - name: "--output"
          direction: "output"
          type: file
  resources:
    - type: nextflow_script
      path: main.nf
platforms:
  - type: nextflow