--macs_gsize option #280

Closed
StevenWingett opened this issue May 26, 2022 · 3 comments

@StevenWingett

Following on from a discussion on Slack with Harshil Patel, I wanted to raise an issue regarding the genome size needed for the --macs_gsize option when running the nf-core/chipseq pipeline (and others).

I shall be using genomes not available in iGenomes and so needed to calculate this value myself. To check I could do this correctly, I tried to get the same values for human and mouse as reported at: https://github.com/nf-core/chipseq/blob/master/conf/igenomes.config

To perform the calculation, I ran the script unique-kmers.py as described at:
https://deeptools.readthedocs.io/en/develop/content/feature/effectiveGenomeSize.html

My result for human38 (assuming 100-bp k-mers) was similar to the value reported in iGenomes: 2.8e9 vs 2.7e9 respectively.
However, my result for mouse38 was substantially different: 2.47e9 (my calculation) vs 1.87e9 (iGenomes).
(As might be expected, my calculations agree with the values displayed at https://deeptools.readthedocs.io/en/develop/content/feature/effectiveGenomeSize.html for 100-bp k-mers.)
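
For illustration only, the idea behind the k-mer approach can be sketched in a few lines of Python: the number of distinct k-mers in the reference is taken as a proxy for the effective genome size at read length k. This is not the unique-kmers.py script itself. It counts exactly with an in-memory set, so it only suits small inputs, and it makes two simplifying assumptions that are mine rather than the thread's: k-mers containing N are skipped, and reverse complements are not merged. The function names are hypothetical.

```python
def read_fasta(lines):
    """Yield one sequence string per FASTA record from an iterable of lines."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if chunks:
                yield "".join(chunks)
                chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def count_unique_kmers(sequences, k):
    """Count distinct k-mers across sequences (exact, small inputs only).

    Assumptions: k-mers containing N are skipped, and a k-mer and its
    reverse complement are counted separately.
    """
    seen = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                seen.add(kmer)
    return len(seen)


# Toy usage on an in-memory FASTA; a real genome would be read from a file.
toy_fasta = [">chr_toy", "ACGTACGT", ">chr_gap", "ACGNACGT"]
toy_count = count_unique_kmers(read_fasta(toy_fasta), k=4)
```

Because the count depends on k, the same genome gives a different effective size for each read length, which is why a single value per genome can be misleading.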

I spoke to the MACS developers and they appear to agree that these values need updating:
macs3-project/MACS#508 (comment)

I believe the relevant nf-core documentation should be updated to show these new values.

(It may be of interest to you to know that I have been putting together a script that automates the reference genome downloading. I shall incorporate the DeepTools k-mer estimation of genome size into the automated download process. I shall share this with you when it is ready, in case it is of any use.)

Many thanks,

Steven

@JoseEspinosa
Member

Thanks a lot for reporting @StevenWingett !
If you don't mind, it would be great if you could share a pointer to the script 😄

@JoseEspinosa JoseEspinosa added this to the 2.0 milestone Jun 1, 2022
@StevenWingett
Author

Hi Jose,

My genome reference download script is here:

https://github.com/StevenWingett/lmb-nextflow/tree/main/ancillary_scripts/downloading_genomes

The Python3 script 'download_genomes.py' downloads the relevant files. It takes as input the 'genomes_to_download.csv' file, which lists the genomes to download (as referenced in Ensembl). The CSV file also lists the processing that should be done once the genomes have been downloaded, e.g. creating Bowtie2 index files.

I shall add code so DeepTools processing may be performed on the downloaded genome to ascertain the mappable genome size.

Please let me know if you have any questions.

All the best,
Steven

@JoseEspinosa
Member

We updated macs_gsize in the igenomes config, which now provides a value for each of several read lengths, and added logic to calculate it if the genome is not present; see #283. You can give it a try using the dev branch, so I will close the issue now.
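
As a rough sketch of the behaviour described above (not the pipeline's actual code, which lives in #283), a lookup keyed by genome and read length with a calculated fallback might look like this in Python. The table holds only the two 100-bp values quoted earlier in this thread, and the function and parameter names are hypothetical.

```python
# Illustrative table: only the two 100-bp values calculated earlier in
# this thread. The real config covers more genomes and read lengths.
MACS_GSIZE = {
    "human38": {100: 2.8e9},
    "mouse38": {100: 2.47e9},
}


def resolve_macs_gsize(genome, read_length, table=MACS_GSIZE, fallback=None):
    """Return the effective genome size for a genome and read length.

    Uses the stored value when one exists; otherwise calls `fallback`
    (a hypothetical callable that calculates the size from the reference)
    or raises KeyError when no fallback is supplied.
    """
    by_length = table.get(genome, {})
    if read_length in by_length:
        return by_length[read_length]
    if fallback is not None:
        return fallback(genome, read_length)
    raise KeyError(f"no macs_gsize for {genome!r} at read length {read_length}")


# Stored value for a known genome and read length.
mouse_gsize = resolve_macs_gsize("mouse38", 100)
```

Keying on read length as well as genome matters because, as the mouse38 discrepancy above suggests, the effective size changes noticeably with the k-mer length used to estimate it.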
