Include gene annotation versions in multiqc report #75

Closed
ChristopherBarrington opened this issue Nov 25, 2019 · 23 comments

@ChristopherBarrington

After using the atacseq pipeline, I checked the MultiQC report and found relevant information recorded, such as the reference genome. The paths to the BED/GTF files are included, but a useful addition would be the annotation version used. The pipeline uses iGenomes, so I checked the README in the Annotation subdirectory and found that the included files were release-81 (2015).

For downstream analysis, this information would be nice to include in the multiqc report if possible.

Thanks,

Chris

@apeltzer apeltzer transferred this issue from nf-core/tools Nov 25, 2019
@apeltzer
Member

Hi @ChristopherBarrington! I've just moved this over to the nf-core/atacseq repository as it is pipeline-specific. Thanks for the suggestion, which I believe makes sense.

@ChristopherBarrington
Author

ChristopherBarrington commented Nov 25, 2019

@apeltzer Ok, I thought that section of the MultiQC report came from nf-core/tools, so I opened the issue there. Apologies, and thanks for looking into it.

@apeltzer
Member

No worries. It's something to consider for ATACseq, since that's where you ran into it, but other pipelines using iGenomes could do the same. Once we've evaluated it over here, we might open another issue for the nf-core/tools template so that this can be reported in the MultiQC report 👍

@drpatelh
Member

@apeltzer I asked @ChristopherBarrington to add the issue to nf-core/tools because I was hoping this could have some oversight during the Hackathon in Stockholm. We haven't actually synced AWS iGenomes at the Crick yet so I don't know what the file structure looks like properly. Still awaiting approval from IT 😓

@ChristopherBarrington the versions we are using at the Crick could be different because we are still using an old version of Illumina iGenomes which is why I want to update this ASAP.

Is it ok to move this back to nf-core/tools?

@apeltzer
Member

Ok with me. I had already wondered whether it's specific enough to ATACseq to discuss it only there. Please move it back then 👍

@drpatelh drpatelh transferred this issue from nf-core/atacseq Nov 25, 2019
@ewels
Member

ewels commented Nov 25, 2019

Is this information in the GTF files themselves anywhere, or only in the README?

@drpatelh
Member

It's only in the README, I think, or it can be worked out from which version of the annotation directory is soft-linked as "current". It will be a bit tricky to get this information, but it would be nice for complete transparency if possible: annotations change even in iGenomes.
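
For anyone with a local copy, the soft-link route could be sketched in shell roughly as follows. This is only an illustration: the function name `annotation_archive` is made up, and the `Archives/archive-current` location is an assumption about how the local iGenomes copy is laid out, so adjust the path to whatever your copy actually contains.

```shell
# Print the name of the archive directory that the "current" soft link resolves to.
# ASSUMPTION: <annotation_dir>/Archives/archive-current is the soft link in question;
# this layout is not confirmed anywhere in this thread.
annotation_archive() {
    local annotation_dir="$1"
    local link="${annotation_dir}/Archives/archive-current"
    # Return non-zero if the link is absent rather than printing something misleading
    [ -L "$link" ] || return 1
    # readlink -f follows the link chain to an absolute path; keep only the last component
    basename "$(readlink -f "$link")"
}
```

Called as `annotation_archive /path/to/Homo_sapiens/Ensembl/GRCh37/Annotation`, it prints the resolved directory name, or returns a non-zero status when no such link exists.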

@ewels
Member

ewels commented Nov 26, 2019

Honestly, I can't see this happening in nf-core. It sounds highly specific to AWS-iGenomes, which I don't like very much, and I can imagine all kinds of nastiness with varying filesystems. I have alarm bells ringing for a potential pit-of-despair project here 🚨

I appreciate the motivation though.. 🤔

@drpatelh
Member

drpatelh commented Dec 5, 2019

I think this might be quite easy to add in. We could just have a separate readme entry in igenomes.config for all genomes e.g.

'GRCh37' {
  fasta       = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/WholeGenomeFasta/genome.fa"
  bwa         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa"
  bowtie2     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/"
  star        = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/"
  bismark     = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Sequence/BismarkIndex/"
  gtf         = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf"
  bed12       = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed"
  mito_name   = "MT"
  macs_gsize  = "2.7e9"
  blacklist   = "${baseDir}/assets/blacklists/GRCh37-blacklist.bed"
  readme      = "${params.igenomes_base}/Homo_sapiens/Ensembl/GRCh37/Annotation/README.txt"
}

And some logic to evaluate that parameter in the pipeline:
params.readme = params.genome ? params.genomes[ params.genome ].readme ?: false : false

This would then only take effect when using AWS iGenomes, in which case we can just stage the file and output it in the results directory.

@ewels
Member

ewels commented Dec 10, 2019

Ah and just copying in that one file? Ok then yes that is doable. I had envisaged trying to parse out specific strings within a global iGenomes readme file or other horrible stuff. Nice! 👍

Maybe annotation_readme instead of just readme though..? 😉 Though I guess this is under the genome parameter namespace so not such a big deal... 🤷‍♂

@drpatelh
Member

Yep 👍 @ChristopherBarrington do you fancy making a pull request to nf-core/tools adding this in for all genomes? 😉 This will then get propagated to all other pipelines via the automated syncing when tools is released.

@ChristopherBarrington
Author

OK @drpatelh I'll give it a go!

@drpatelh
Member

drpatelh commented Dec 18, 2019

Ok. This might cause problems 😕 The README file isn't currently part of the AWS syncing script, which means that it won't be found by those of us who have used that script to obtain local copies of AWS-iGenomes. Therefore, the path in igenomes.config won't exist and could break the pipeline. This won't be a problem for those who are downloading the genomes directly via AWS-iGenomes at run-time.

So we could either update aws-igenomes.sh and then re-sync to get these files, or keep things as they are and point users to download the README file if and when they require further information. I think the latter is probably best for now.

@drpatelh
Copy link
Member

I'm going to close this for now but feel free to re-open if you find another path of least resistance.

@ewels
Member

ewels commented Dec 19, 2019

Can’t you just try to include it and use a try block to catch the error if it doesn’t exist? Or is there some Nextflow magic to allow missing files?

@drpatelh
Member

drpatelh commented Dec 19, 2019

Maybe this would work:

try {
   ch_readme = Channel.fromPath(params.readme, checkIfExists: true)
} catch (all) {
   ch_readme = Channel.empty() 
}

and then in the process:

input:
file readme from ch_readme.collect().ifEmpty([])

i.e. using a try block to catch a missing file 😅 Feels a bit hacky though...
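
Outside Nextflow, the same "use the file if it is there, carry on if it is not" idea might look like the shell sketch below. It is an illustration only and not the pipeline's implementation; the function name `stage_optional_readme` and the staged/skipped messages are made up for this example.

```shell
# Copy an optional README into an output directory.
# Always returns 0, so a missing README never fails the caller.
# ASSUMPTION: both arguments are plain local paths (no S3 URLs).
stage_optional_readme() {
    local readme="$1"
    local outdir="$2"
    if [ -n "$readme" ] && [ -f "$readme" ]; then
        # Only create the output directory when there is something to put in it
        mkdir -p "$outdir"
        cp "$readme" "$outdir/"
        echo "staged"
    else
        echo "skipped"
    fi
}
```

The point of the design is that the missing-file case is an ordinary branch rather than an error, which is what the `ifEmpty([])` in the process input is aiming for.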

@drpatelh
Member

So if you would like to obtain the version specified in the README.txt then you will need to have the aws-cli package installed locally.

You can then obtain a listing of all the files hosted on AWS iGenomes from this file:
https://raw.githubusercontent.com/ewels/AWS-iGenomes/master/ngi-igenomes_file_manifest.txt

Find the README.txt path for your organism and run a command like the one below:
aws s3 --no-sign-request --region eu-west-1 sync s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Annotation/ ./ --exclude "*" --include "README.txt"
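
Finding the right line in the manifest can be scripted too. The sketch below assumes the manifest has already been downloaded to a local file (for example with `curl -sSL <manifest URL> -o manifest.txt`) and that it lists one path per line containing `<build>/Annotation/README.txt`; that format is an assumption, and `find_readme_path` is a made-up helper name.

```shell
# Print the manifest entries for the annotation README of one genome build.
# ASSUMPTION: the manifest is a local text file with one path per line.
find_readme_path() {
    local manifest="$1"
    local build="$2"
    # -F matches a fixed string, so dots in build names are not treated as regex wildcards
    grep -F "/${build}/Annotation/README.txt" "$manifest"
}
```

For example, `find_readme_path manifest.txt GRCh37` prints any matching entries and returns a non-zero status if the build has no README listed. The directory part of a match can then be passed to the `aws s3 sync` command above.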

@ewels
Member

ewels commented Jan 26, 2020

Surely Nextflow can handle that for you? It already has built-in support for staging S3 files...

@ewels
Member

ewels commented Jan 26, 2020

And I thought that the idea was to include the readme path in the iGenomes config for each species? That would be better as many people will have downloaded this offline and should also have the readmes there..

@drpatelh
Member

Yep. Would be good for Nextflow to do this for us but a number of things would need to happen before this will work properly.

As I mentioned in this comment, we would need to update the AWS syncing script to add this file and re-sync for everyone using offline iGenomes; otherwise a checkIfExists will fail if the README.txt isn't present.

@ewels
Member

ewels commented Jan 26, 2020

Could we just not checkIfExists? 😉

@drpatelh
Member

Absolutely! But it feels a bit hacky 😓

I've added the README.txt paths to igenomes.config (see nf-core/tools@d1051be). This file doesn't exist in AWS iGenomes for every genome, but it does for most of them, so this should be fine.

The paths will now be rolled out to all pipelines via the automated synchronisation when we release tools. I'm assuming we don't want the implementation for retrieving the file in the pipeline template, so I'll transfer this issue to atacseq and implement it there.

@drpatelh drpatelh transferred this issue from nf-core/tools Jan 26, 2020
@drpatelh drpatelh reopened this Jan 27, 2020
@drpatelh drpatelh added the enhancement New feature or request label Jan 30, 2020
@drpatelh drpatelh self-assigned this Jan 30, 2020
@drpatelh
Member

This was implemented in #77: README.txt is saved in results/reference_genome/ if it exists. Hopefully, that should suffice.
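
For downstream use, the release could then be pulled out of the saved file with something like the sketch below. It assumes the README mentions the annotation as a `release-NN` token, as with the release-81 reported at the top of this thread; other READMEs may be worded differently, and `annotation_release` is a made-up helper name.

```shell
# Print the first "release-<number>" token found in a README, if any.
# ASSUMPTION: the README names the annotation version in the form "release-NN".
annotation_release() {
    local readme="$1"
    # Return non-zero if the README was never saved (e.g. it is absent for that genome)
    [ -f "$readme" ] || return 1
    # -E enables extended regex, -o prints only the matching part of each line
    grep -Eo 'release-[0-9]+' "$readme" | head -n 1
}
```

Usage would be along the lines of `annotation_release results/reference_genome/README.txt`.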
