-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add methylation pipelines #184
base: main
Are you sure you want to change the base?
Conversation
mset: "MethylSet object", | ||
rgset: "RGSet object", | ||
rset: "RatioSet object", | ||
annotation: "Annotation object", | ||
beta: "Beta values", | ||
cn_values: "Copy number values", | ||
m_values: "M values", | ||
pheno: "Phenotype data", | ||
pheno_data: "Phenotype data", | ||
probe_names: "Probe names", | ||
sample_names: "Sample names", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of these need to be removed. Charlie's original script output a lot of intermediate files, which we don't really need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we create an IntermediateFiles
struct like we have for QC? IDK if that would be useful, just an idea. Is there a context where a user would want the intermediate files? Maybe even just for debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll give it some thought, but I don't think there is any value to most of these files. Certainly not sample_names
because that will have one entry with the name of the current sample. I can see outputting the M value and CN value as those could be useful.
Int addl_memory_gb = 0 | ||
} | ||
|
||
Int memory_gb = ceil(size(unfiltered_normalized_beta, "GiB") * 2) + addl_memory_gb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be adjusted.
} | ||
} | ||
|
||
task plot_umap { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This task is primarily for testing purposes. I'm not sure we want to plot the UMAPs here. They'll be hosted on Pecan and use the ProteinPaint scatter plot library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be useful (if not necessary) for non-Pecan users of the workflow. It is also quite a lightweight task, so I'm not concerned about wasted compute. Maybe a boolean toggle for skipping this step?
combined_beta: "Combined beta values for all samples", | ||
filtered_beta: "Filtered beta values for all samples", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get some detail on both of these? It's not clear to me how they're different.
outputs: { | ||
combined_beta: "Combined beta values for all samples", | ||
filtered_beta: "Filtered beta values for all samples", | ||
filtered_probes: "Probes that were retained after filtering", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't love this output name, but can't think of anything better. At first glance, I read filtered_probes
and expected this to be a list of the probes that were filtered out. Also would like some more detail on this description as well.
File filtered_probes = "filtered_probes.csv" | ||
} | ||
|
||
#@ except: ContainerValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same concern here
#@ except: ContainerValue | ||
runtime { | ||
container: "quay.io/biocontainers/pandas:2.2.1" | ||
memory: "28 GB" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
28 is an odd number. Isn't this going to depend on the size of the input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
28 is a bit random. I'll take another look. It won't depend on the size of the input, but rather the width of the matrix, which is a bit harder to calculate upfront. It's the width because I'm reading a single row in to memory at a time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking of the TSV functions, you could try reading into an Array[Array[String]]
and getting the length()
of the inner array?
} | ||
} | ||
|
||
task plot_umap { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be useful (if not necessary) for non-Pecan users of the workflow. It is also quite a lightweight task, so I'm not concerned about wasted compute. Maybe a boolean toggle for skipping this step?
mset: "MethylSet object", | ||
rgset: "RGSet object", | ||
rset: "RatioSet object", | ||
annotation: "Annotation object", | ||
beta: "Beta values", | ||
cn_values: "Copy number values", | ||
m_values: "M values", | ||
pheno: "Phenotype data", | ||
pheno_data: "Phenotype data", | ||
probe_names: "Probe names", | ||
sample_names: "Sample names", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we create an IntermediateFiles
struct like we have for QC? IDK if that would be useful, just an idea. Is there a context where a user would want the intermediate files? Maybe even just for debugging?
library(IlluminaHumanMethylationEPICmanifest) | ||
library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19) | ||
|
||
set.seed(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there ever a reason for setting another seed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These arrays have two types of probes (helpfully Infinium type 1 and Infinium type 2). The probes also cover a varying number of CpG sites. The normalization method works by taking N
probes of each type that each have 1, 2, or 3 CpGs. So it selects 6N probes in total. This selection is "random" if the seed is not fixed, of course. Since we're processing samples individually in this step instead of as a cohort, I want the seed to be consistent so the same set of probes is chosen for each sample.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Can you add that to the documentation somewhere? Maybe it should go in meta.help
? Or just an embedded "normal" comment, not sure. Your call.
Co-authored-by: Andrew Frantz <[email protected]>
Co-authored-by: Andrew Frantz <[email protected]>
Co-authored-by: Andrew Frantz <[email protected]>
Co-authored-by: Andrew Frantz <[email protected]>
Add a methylation process workflow that generates unfiltered, normalized Beta values for a sample. Add a second pipeline that consumes an array of unfiltered, normalized beta values for a set of sample. Applies filtering and generates UMAP coordinates.