feat: add methylation pipelines #184

Open

adthrasher wants to merge 25 commits into main from feat/methylation

+585 −0

Member

adthrasher commented Nov 8, 2024

Add a methylation processing workflow that generates unfiltered, normalized beta values for a single sample. Add a second pipeline that consumes an array of unfiltered, normalized beta values for a set of samples, applies filtering, and generates UMAP coordinates.


          feat: add methylation pipelines

d9be045

adthrasher self-assigned this

adthrasher added 17 commits

November 8, 2024 10:28


          chore: update docker containers

a130388


          fix: recursive merge of large data sets

0482e52


          fix: iterative merge

bfc6988


          wip: try stepwise filtering

026a441


          wip: try stepwise filtering

8c41e1e


          fix: add pandas to umap image

2d816dd


          fix: add pandas to umap image

2d3e794


          fix: fix merge bugs

992451a


          feat: add genomic-only beta values

a39f9ed


          feat: add genomic-only beta values

89d0f0c


          refactor: rework filter step to read line-by-line

d203c32


          chore: fix output names

2b4b0ae


          fix outputs, add filtered probes

3c3b50f


          chore: clean up lint warnings

bd76e69


          chore: clean up lint warnings

447c59f


          ci: use main for sprocket-action

8f45924


          Merge branch 'main' into feat/methylation

1f44c1c

adthrasher commented

workflows/methylation/methylation-preprocess.wdl

Comment on lines +9 to +19

+                          mset: "MethylSet object",
+                          rgset: "RGSet object",
+                          rset: "RatioSet object",
+                          annotation: "Annotation object",
+                          beta: "Beta values",
+                          cn_values: "Copy number values",
+                          m_values: "M values",
+                          pheno: "Phenotype data",
+                          pheno_data: "Phenotype data",
+                          probe_names: "Probe names",
+                          sample_names: "Sample names",

Member Author

adthrasher Nov 20, 2024

Some of these need to be removed. Charlie's original script output a lot of intermediate files, which we don't really need.

Member

a-frantz Nov 20, 2024

Should we create an IntermediateFiles struct like we have for QC? IDK if that would be useful, just an idea. Is there a context where a user would want the intermediate files? Maybe even just for debugging?

Member Author

adthrasher Nov 20, 2024

I'll give it some thought, but I don't think there is any value to most of these files. Certainly not sample_names because that will have one entry with the name of the current sample. I can see outputting the M value and CN value as those could be useful.

adthrasher commented

workflows/methylation/methylation-cohort.wdl Outdated

+                      Int addl_memory_gb = 0
+                  }
+                  Int memory_gb = ceil(size(unfiltered_normalized_beta, "GiB") * 2) + addl_memory_gb

Member Author

adthrasher Nov 20, 2024

This needs to be adjusted.

adthrasher commented

workflows/methylation/methylation-cohort.wdl

+                  }
+              }
+              task plot_umap {

Member Author

adthrasher Nov 20, 2024

This task is primarily for testing purposes. I'm not sure we want to plot the UMAPs here. They'll be hosted on Pecan and use the ProteinPaint scatter plot library.

Member

a-frantz Nov 20, 2024

I think it would be useful (if not necessary) for non-Pecan users of the workflow. It is also quite a lightweight task, so I'm not concerned about wasted compute. Maybe a boolean toggle for skipping this step?

adthrasher requested a review from a-frantz

November 20, 2024 14:49

a-frantz reviewed

docker/minfi/1.48.0-1/Dockerfile (resolved)

workflows/methylation/methylation-cohort.wdl Outdated

Comment on lines 7 to 8

		combined_beta: "Combined beta values for all samples",
		filtered_beta: "Filtered beta values for all samples",

Member

a-frantz Nov 20, 2024

Can we get some detail on both of these? It's not clear to me how they're different.

workflows/methylation/methylation-cohort.wdl Outdated

+                      outputs: {
+                          combined_beta: "Combined beta values for all samples",
+                          filtered_beta: "Filtered beta values for all samples",
+                          filtered_probes: "Probes that were retained after filtering",

Member

a-frantz Nov 20, 2024

Don't love this output name, but can't think of anything better. At first glance, I read filtered_probes and expected this to be a list of the probes that were filtered out. Also would like some more detail on this description as well.

workflows/methylation/methylation-cohort.wdl Outdated

+                      File filtered_probes = "filtered_probes.csv"
+                  }
+                  #@ except: ContainerValue

Member

a-frantz Nov 20, 2024

same concern here

workflows/methylation/methylation-cohort.wdl

+                  #@ except: ContainerValue
+                  runtime {
+                      container: "quay.io/biocontainers/pandas:2.2.1"
+                      memory: "28 GB"

Member

a-frantz Nov 20, 2024

28 is an odd number. Isn't this going to depend on the size of the input?

Member Author

adthrasher Nov 20, 2024

28 is a bit random. I'll take another look. It won't depend on the size of the input, but rather on the width of the matrix, which is a bit harder to calculate upfront. It's the width because I'm reading a single row into memory at a time.
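To illustrate why memory tracks the matrix width here, a minimal Python sketch of a row-at-a-time filter follows. It is not the filter from this PR: the function names, the CSV layout (probe ID in the first column), and the `has_no_missing` rule are all assumptions made for the example.

```python
import csv
import io


def filter_rows(src, dst, keep_row):
    """Copy the header and every row for which keep_row(values) is true.

    Rows are handled one at a time, so peak memory tracks the length of a
    single row rather than the number of rows in the file.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader, None)
    if header is None:
        return 0
    writer.writerow(header)
    kept = 0
    for row in reader:
        if keep_row(row[1:]):  # row[0] is assumed to be the probe ID
            writer.writerow(row)
            kept += 1
    return kept


def has_no_missing(values):
    """Hypothetical filter rule: keep rows with no empty or NA cells."""
    return all(v not in ("", "NA") for v in values)


# Made-up example data; probe IDs and beta values are placeholders.
source = io.StringIO(
    "probe,sample_a,sample_b\n"
    "cg_example_1,0.10,0.20\n"
    "cg_example_2,NA,0.80\n"
    "cg_example_3,0.55,0.45\n"
)
target = io.StringIO()
kept = filter_rows(source, target, has_no_missing)
```

Only one row is held at a time, so adding more rows lengthens the run without raising the peak; adding more columns makes every row larger.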

Member

a-frantz Nov 20, 2024

Speaking of the TSV functions, you could try reading into an Array[Array[String]] and getting the length() of the inner array?
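The column-count idea can also be sketched outside WDL. The Python example below reads only the header line of a delimited file, so its cost does not grow with the number of rows. `matrix_width` and the example data are hypothetical and not part of this PR.

```python
import csv
import io


def matrix_width(handle, delimiter=","):
    """Return the number of columns in the first row of a delimited file.

    Only the header line is consumed, so the remaining rows are never
    loaded into memory.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, None)
    return 0 if header is None else len(header)


# Made-up three-sample matrix; probe IDs and values are placeholders.
example = io.StringIO(
    "probe,sample_a,sample_b,sample_c\n"
    "cg_example_1,0.10,0.20,0.30\n"
    "cg_example_2,0.90,0.80,0.70\n"
)
width = matrix_width(example)
```

For a tab-separated file, pass `delimiter="\t"`.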

workflows/methylation/methylation-preprocess.wdl

+                          library(IlluminaHumanMethylationEPICmanifest)
+                          library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
+                          set.seed(1)

Member

a-frantz Nov 20, 2024

Is there ever a reason for setting another seed?

Member Author

adthrasher Nov 20, 2024 (edited)

These arrays have two types of probes (helpfully named Infinium type 1 and Infinium type 2). The probes also cover a varying number of CpG sites. The normalization method works by taking N probes of each type for each CpG count (1, 2, or 3), so it selects 6N probes in total (2 probe types × 3 CpG counts × N). Without a fixed seed, this selection differs from run to run. Since we're processing samples individually in this step instead of as a cohort, I want the seed to be consistent so the same set of probes is chosen for each sample.
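The effect of a fixed seed on this kind of stratified selection can be shown with a small Python sketch. It is an illustration only, not the normalization code that the R script runs; `select_probes`, the stratum keys, and the probe IDs are invented for the example.

```python
import random


def select_probes(strata, n, seed=1):
    """Pick n probe IDs from each stratum with a generator seeded per call.

    strata maps (probe_type, cpg_count) to a list of probe IDs. Strata are
    visited in sorted key order so the random draws happen in a fixed
    sequence.
    """
    rng = random.Random(seed)
    return {key: sorted(rng.sample(ids, n)) for key, ids in sorted(strata.items())}


# Hypothetical probe pool: 2 probe types x 3 CpG counts, 50 probes each.
strata = {
    (probe_type, cpgs): [f"t{probe_type}_c{cpgs}_p{i}" for i in range(50)]
    for probe_type in (1, 2)
    for cpgs in (1, 2, 3)
}

n = 5
first_run = select_probes(strata, n, seed=1)
second_run = select_probes(strata, n, seed=1)  # e.g. a different sample's run
total_selected = sum(len(ids) for ids in first_run.values())
```

With the same seed, `first_run` and `second_run` contain the same probes even though they are independent calls, which mirrors processing each sample in its own task. `total_selected` is 6 × n, matching the 6N count described above.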

Member

a-frantz Nov 20, 2024

Makes sense. Can you add that to the documentation somewhere? Maybe it should go in meta.help? Or just an embedded "normal" comment, not sure. Your call.

adthrasher and others added 6 commits

November 20, 2024 10:40


          Update workflows/methylation/methylation-cohort.wdl

340bb31

Co-authored-by: Andrew Frantz <[email protected]>


          Update workflows/methylation/methylation-cohort.wdl

fc815d1

Co-authored-by: Andrew Frantz <[email protected]>


          Update workflows/methylation/methylation-cohort.wdl

1baa6b7

Co-authored-by: Andrew Frantz <[email protected]>


          Update workflows/methylation/methylation-cohort.wdl

90ec645

Co-authored-by: Andrew Frantz <[email protected]>


          chore: apply PR feedback

08e3344


          chore: apply feedback from PR

6ab794e


          chore: fix omission

d3ffcf1
