Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add methylation pipelines #184

Open
wants to merge 25 commits into
base: main
Choose a base branch
from
Open

feat: add methylation pipelines #184

wants to merge 25 commits into from

Conversation

adthrasher
Copy link
Member

Add a methylation process workflow that generates unfiltered, normalized Beta values for a sample. Add a second pipeline that consumes an array of unfiltered, normalized beta values for a set of sample. Applies filtering and generates UMAP coordinates.

@adthrasher adthrasher self-assigned this Nov 8, 2024
Comment on lines +9 to +19
mset: "MethylSet object",
rgset: "RGSet object",
rset: "RatioSet object",
annotation: "Annotation object",
beta: "Beta values",
cn_values: "Copy number values",
m_values: "M values",
pheno: "Phenotype data",
pheno_data: "Phenotype data",
probe_names: "Probe names",
sample_names: "Sample names",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these need to be removed. Charlie's original script output a lot of intermediate files, which we don't really need.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we create an IntermediateFiles struct like we have for QC? IDK if that would be useful, just an idea. Is there a context where a user would want the intermediate files? Maybe even just for debugging?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll give it some thought, but I don't think there is any value to most of these files. Certainly not sample_names because that will have one entry with the name of the current sample. I can see outputting the M value and CN value as those could be useful.

Int addl_memory_gb = 0
}

Int memory_gb = ceil(size(unfiltered_normalized_beta, "GiB") * 2) + addl_memory_gb
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be adjusted.

}
}

task plot_umap {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This task is primarily for testing purposes. I'm not sure we want to plot the UMAPs here. They'll be hosted on Pecan and use the ProteinPaint scatter plot library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be useful (if not necessary) for non-Pecan users of the workflow. It is also quite a lightweight task, so I'm not concerned about wasted compute. Maybe a boolean toggle for skipping this step?

docker/minfi/1.48.0-1/Dockerfile Show resolved Hide resolved
Comment on lines 7 to 8
combined_beta: "Combined beta values for all samples",
filtered_beta: "Filtered beta values for all samples",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get some detail on both of these? It's not clear to me how they're different.

outputs: {
combined_beta: "Combined beta values for all samples",
filtered_beta: "Filtered beta values for all samples",
filtered_probes: "Probes that were retained after filtering",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't love this output name, but can't think of anything better. At first glance, I read filtered_probes and expected this to be a list of the probes that were filtered out. Also would like some more detail on this description as well.

workflows/methylation/methylation-cohort.wdl Outdated Show resolved Hide resolved
workflows/methylation/methylation-cohort.wdl Outdated Show resolved Hide resolved
File filtered_probes = "filtered_probes.csv"
}

#@ except: ContainerValue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same concern here

#@ except: ContainerValue
runtime {
container: "quay.io/biocontainers/pandas:2.2.1"
memory: "28 GB"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

28 is an odd number. Isn't this going to depend on the size of the input?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

28 is a bit random. I'll take another look. It won't depend on the size of the input, but rather the width of the matrix, which is a bit harder to calculate upfront. It's the width because I'm reading a single row in to memory at a time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking of the TSV functions, you could try reading into an Array[Array[String]] and getting the length() of the inner array?

}
}

task plot_umap {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be useful (if not necessary) for non-Pecan users of the workflow. It is also quite a lightweight task, so I'm not concerned about wasted compute. Maybe a boolean toggle for skipping this step?

Comment on lines +9 to +19
mset: "MethylSet object",
rgset: "RGSet object",
rset: "RatioSet object",
annotation: "Annotation object",
beta: "Beta values",
cn_values: "Copy number values",
m_values: "M values",
pheno: "Phenotype data",
pheno_data: "Phenotype data",
probe_names: "Probe names",
sample_names: "Sample names",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we create an IntermediateFiles struct like we have for QC? IDK if that would be useful, just an idea. Is there a context where a user would want the intermediate files? Maybe even just for debugging?

library(IlluminaHumanMethylationEPICmanifest)
library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)

set.seed(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there ever a reason for setting another seed?

Copy link
Member Author

@adthrasher adthrasher Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These arrays have two types of probes (helpfully Infinium type 1 and Infinium type 2). The probes also cover a varying number of CpG sites. The normalization method works by taking N probes of each type that each have 1, 2, or 3 CpGs. So it selects 6N probes in total. This selection is "random" if the seed is not fixed, of course. Since we're processing samples individually in this step instead of as a cohort, I want the seed to be consistent so the same set of probes is chosen for each sample.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Can you add that to the documentation somewhere? Maybe it should go in meta.help? Or just an embedded "normal" comment, not sure. Your call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants