Include gene annotation versions in multiqc report #75

ChristopherBarrington · 2019-11-25T09:25:22Z

After using the atacseq pipeline, I checked the multiqc report and found relevant information recorded, such as the reference genome. The paths to the bed/gtf files are included, but a useful extra piece of information would be the annotation version used. The pipeline uses iGenomes, so I checked the README in the Annotation subdirectory and found that the included files were release-81 (2015).

For downstream analysis, this information would be nice to include in the multiqc report if possible.

Thanks,

Chris

apeltzer · 2019-11-25T09:39:02Z

Hi @ChristopherBarrington ! Just put this over to the nf-core/atacseq repository as this is pipeline specific - thanks for the suggestion, which makes sense I believe.

ChristopherBarrington · 2019-11-25T09:41:53Z

@apeltzer Ok, I thought that the section of the multiqc report was nf-core/tools so included it there - apologies. Thanks for looking into it.

apeltzer · 2019-11-25T09:59:52Z

No worries, it's something to consider for ATACseq as you experienced it there, but other pipelines using iGenomes could do this too. Once we have evaluated it over here, we might open another issue for the nf-core/tools template so that this can be reported in the multiqc report 👍

drpatelh · 2019-11-25T12:27:49Z

@apeltzer I asked @ChristopherBarrington to add the issue to nf-core/tools because I was hoping this could have some oversight during the Hackathon in Stockholm. We haven't actually synced AWS iGenomes at the Crick yet so I don't know what the file structure looks like properly. Still awaiting approval from IT 😓

@ChristopherBarrington the versions we are using at the Crick could be different because we are still using an old version of Illumina iGenomes which is why I want to update this ASAP.

Is it ok to move this back to nf-core/tools?

apeltzer · 2019-11-25T14:16:33Z

Ok with me - I had already wondered whether it's specific enough to ATACseq to only discuss it there - please move it back then 👍

ewels · 2019-11-25T16:33:55Z

Is this information in the GTF files themselves anywhere, or only in the README?

drpatelh · 2019-11-25T17:32:26Z

It's only in the README, I think, or it can be worked out from which version of the annotation directory is soft-linked as "current". It will be a bit tricky to get this information, but it would be nice for complete transparency if possible - annotations change even in iGenomes.
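
For a local copy, the soft-link route could be sketched in shell roughly as below. This is only a sketch: the `Archives/archive-current` link name and the `annotation_archive` helper are assumptions for illustration, not something confirmed in this thread, so check the layout of your own iGenomes copy first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of the archive directory that the
# "current" soft link inside an Annotation directory points to.
# ASSUMPTION: the link is called Archives/archive-current.
annotation_archive() {
    local annotation_dir="$1"
    local link="${annotation_dir}/Archives/archive-current"

    # Report "unknown" if the link is not there (e.g. a different layout).
    if [ ! -L "$link" ]; then
        echo "unknown"
        return 1
    fi

    # readlink prints the stored link target; basename drops any leading path.
    basename "$(readlink "$link")"
}
```

It would be called with the path to an Annotation directory, e.g. `annotation_archive "$IGENOMES/Homo_sapiens/Ensembl/GRCh37/Annotation"`, where `IGENOMES` is a placeholder for wherever the local copy lives.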

ewels · 2019-11-26T09:17:40Z

Honestly, I can't see this happening in nf-core. It sounds highly specific to AWS-iGenomes which I don't like very much and I can imagine all kinds of nastiness with varying filesystems. I have alarm bells ringing for a potential pit of despair project here 🚨

I appreciate the motivation though.. 🤔

drpatelh · 2019-12-05T11:56:19Z

I think this might be quite easy to add in. We could just have a separate readme entry in igenomes.config for all genomes e.g.

'GRCh37' {
      fasta       = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa"
      bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa"
      bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
      star        = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/"
      bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
      gtf         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf"
      bed12       = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed"
      mito_name   = "MT"
      macs_gsize  = "2.7e9"
      blacklist   = "${baseDir}/assets/blacklists/GRCh37-blacklist.bed"
      readme      = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt"
    }

And some logic to evaluate that parameter in the pipeline:
params.readme = params.genome ? params.genomes[ params.genome ].readme ?: false : false

This should then only work if using AWS iGenomes in which case we can just stage the file and output it in the results directory.

ewels · 2019-12-10T15:40:31Z

Ah and just copying in that one file? Ok then yes that is doable. I had envisaged trying to parse out specific strings within a global iGenomes readme file or other horrible stuff. Nice! 👍

Maybe annotation_readme instead of just readme though..? 😉 Though I guess this is under the genome parameter namespace so not such a big deal... 🤷‍♂

drpatelh · 2019-12-10T15:50:12Z

Yep 👍 @ChristopherBarrington do you fancy making a pull request to nf-core/tools adding this in for all genomes? 😉 This will then get propagated to all other pipelines via the automated syncing when tools is released.

ChristopherBarrington · 2019-12-11T09:05:59Z

OK @drpatelh I'll give it a go!

drpatelh · 2019-12-18T11:53:25Z

Ok. This might cause problems 😕 The README file isn't currently part of the AWS syncing script, which means that this file won't be found by those of us who have used that script to obtain local copies of AWS-iGenomes. Therefore, the path in igenomes.config won't exist and could break the pipeline. This won't be a problem for those who are downloading the genomes directly via AWS-iGenomes at run-time.

So we could update the aws-igenomes.sh and then re-sync to get these files or we just keep things as they are and point the user to download the README file if and when they require further information. I think this is probably best for now.

drpatelh · 2019-12-18T11:59:45Z

I'm going to close this for now but feel free to re-open if you find another path of least resistance.

ewels · 2019-12-19T06:17:09Z

Can’t you just try to include it and use a try block to catch the error if it doesn’t exist? Or is there some Nextflow magic to allow missing files?

drpatelh · 2019-12-19T09:28:21Z

Maybe this would work:

try {
   ch_readme = Channel.fromPath(params.readme, checkIfExists: true)
} catch (all) {
   ch_readme = Channel.empty() 
}

and then in the process:

input:
file readme from ch_readme.collect().ifEmpty([])

i.e. using a try block to catch a missing file 😅 Feels a bit hacky though...

drpatelh · 2020-01-24T11:01:51Z

So if you would like to obtain the version specified in the README.txt then you will need to have the aws-cli package installed locally.

You can then obtain a listing of all the files hosted on AWS iGenomes from this file:
https://raw.githubusercontent.com/ewels/AWS-iGenomes/master/ngi-igenomes_file_manifest.txt

Find the README.txt path for your organism and run a command like the one below:
aws s3 --no-sign-request --region eu-west-1 sync s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Annotation/ ./ --exclude "*" --include "README.txt"
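
To pick out the README.txt path for one genome, the manifest can be filtered locally. A rough sketch, assuming the manifest holds one path per line and has been saved as a local file; `find_readme` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list README.txt entries for one genome build.
# ASSUMPTION: the manifest holds one path per line.
find_readme() {
    local manifest="$1"   # local copy of ngi-igenomes_file_manifest.txt
    local genome="$2"     # e.g. GRCh37

    # -F treats the pattern as a fixed string, so dots are not wildcards.
    grep -F "/${genome}/Annotation/README.txt" "$manifest"
}
```

For example, after saving the manifest as `manifest.txt`, `find_readme manifest.txt GRCh37` would print any matching entries, and the directory part of a match can then be passed to the `aws s3 sync` command above.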

ewels · 2020-01-26T06:02:38Z

Surely Nextflow can handle that for you? It already has built-in support for staging S3 files...

ewels · 2020-01-26T06:04:03Z

And I thought that the idea was to include the readme path in the iGenomes config for each species? That would be better as many people will have downloaded this offline and should also have the readmes there..

drpatelh · 2020-01-26T09:48:21Z

Yep. Would be good for Nextflow to do this for us but a number of things would need to happen before this will work properly.

As I mentioned in an earlier comment, we would need to update the AWS syncing script to add this file and re-sync for everyone using offline iGenomes, otherwise a checkIfExists will fail if the README.txt isn't present.

ewels · 2020-01-26T10:34:38Z

Could we just not checkIfExists? 😉

drpatelh · 2020-01-26T20:27:44Z

Absolutely! But it feels a bit hacky 😓

I've added the README.txt paths to igenomes.config (see nf-core/tools@d1051be). The file doesn't exist in AWS iGenomes for every genome, but it does for most of them, so this should be fine.

The paths will now be rolled out to all pipelines via the automated synchronisation when we release tools. I'm assuming we don't want the implementation for retrieving the file in the pipeline template, so I'll transfer this issue to atacseq and implement it there.

drpatelh · 2020-01-31T17:31:21Z

This was implemented in #77 whereby README.txt is saved in results/reference_genome/ if it exists. Hopefully, that should suffice.
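
For anyone who then wants the release string itself for downstream analysis, something like the following could pull it out of the saved file. This is a sketch only: the `release-NN` wording is an assumption based on the release-81 mentioned at the top of this issue, and `annotation_release` is a hypothetical helper, not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first "release-NN" string in the README
# saved by the pipeline, or "unknown" if the file is absent or has no match.
# ASSUMPTION: the README states the annotation version as "release-<number>".
annotation_release() {
    local readme="${1:-results/reference_genome/README.txt}"
    local release=""

    if [ -f "$readme" ]; then
        # -o prints only the matching text, -E enables the + quantifier.
        release="$(grep -o -E 'release-[0-9]+' "$readme" | head -n 1)"
    fi

    echo "${release:-unknown}"
}
```

Because the README is only saved if it exists, the helper falls back to `unknown` rather than failing when the file is missing.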

apeltzer transferred this issue from nf-core/tools Nov 25, 2019

drpatelh transferred this issue from nf-core/atacseq Nov 25, 2019

drpatelh closed this as completed Dec 18, 2019

drpatelh transferred this issue from nf-core/tools Jan 26, 2020

drpatelh reopened this Jan 27, 2020

drpatelh added the enhancement New feature or request label Jan 30, 2020

drpatelh self-assigned this Jan 30, 2020

drpatelh mentioned this issue Jan 31, 2020

Remove .travis.yml and fix a couple of issues #77

Merged

8 tasks

drpatelh mentioned this issue Jan 31, 2020

Remove .travis.yml and fix a couple of issues nf-core/chipseq#137

Merged

8 tasks

drpatelh closed this as completed Jan 31, 2020
