Add t2t hsa #99

hoelzer · 2024-08-07T09:57:57Z

The T2T human genome is more complete than GRCh38. In the paper, they show that it has 200 Mbp more sequence and closes gaps, and a more complete Y chromosome was added recently.

I implemented this as another option for auto-download. Tested the pipeline on a small nanopore example data set.

Currently, Christian in Jena is running tests using some spike-in human data so we will see soon if a more complete human T2T genome helps to decontaminate more human reads (also for the paper).

If you don't see any problems, we can then also merge this (into dev first, and later main, right?)

hoelzer · 2024-08-07T09:59:42Z

Ah, this will fail on the RKI HPC due to restrictions on downloading from the AWS bucket. I tested this on private hardware.

Also, I added some error handling for fastqc in case it runs out of RAM. Actually, I think this was just an RKI HPC hiccup, but it also does not hurt to have that.

matthuska · 2024-08-08T08:20:07Z

I'm not 100% sure but you might be able to try this link instead:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz

NCBI should be whitelisted in most bioinformatics environments, and definitely is on the HPC (I tested with wget).

hoelzer · 2024-08-08T10:43:22Z

> I'm not 100% sure but you might be able to try this link instead:
>
> https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
>
> NCBI should be whitelisted in most bioinformatics environments, and definitely is on the HPC (I tested with wget).

Oh, great!

Yes, that would be much better. But the NCBI one seems to be missing the mitochondrial contig. I would like to have that because quite a lot of DNA/RNA usually comes from the mt genome.

Ah but the GenBank version seems to have the mtDNA as well:

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/GCA_009914755.4_T2T-CHM13v2.0/GCA_009914755.4_T2T-CHM13v2.0_genomic.fna.gz

This should do the trick.

I will update the PR.

hoelzer · 2024-08-08T10:57:38Z

For documentation, assembly stats of

GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz (missing mtDNA)

sum = 3117275501, n = 24, ave = 129886479.21, largest = 248387328
N50 = 150617247, n = 9
N60 = 135127769, n = 11
N70 = 133324548, n = 13
N80 = 99753195, n = 16
N90 = 80542538, n = 19
N100 = 45090682, n = 24
N_count = 0
Gaps = 0

chm13v2.0.fa (incl. mtDNA, from the AWS)

sum = 3117292070, n = 25, ave = 124691682.80, largest = 248387328
N50 = 150617247, n = 9
N60 = 135127769, n = 11
N70 = 133324548, n = 13
N80 = 99753195, n = 16
N90 = 80542538, n = 19
N100 = 16569, n = 25
N_count = 0
Gaps = 0

GCA_009914755.4_T2T-CHM13v2.0_genomic.fna.gz (incl mtDNA, from NCBI)

sum = 3117292070, n = 25, ave = 124691682.80, largest = 248387328
N50 = 150617247, n = 9
N60 = 135127769, n = 11
N70 = 133324548, n = 13
N80 = 99753195, n = 16
N90 = 80542538, n = 19
N100 = 16569, n = 25
N_count = 0
Gaps = 0

Thus, the T2T AWS genome (chm13v2.0.fa) and the T2T NCBI GenBank genome (GCA_009914755.4_T2T-CHM13v2.0_genomic.fna) should be the same, and both include the mtDNA contig.
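For reference, the Nx values above follow the usual definition: sort the contigs by length, then report the length at which the running total first reaches x% of the assembly size. The Python sketch below uses made-up toy lengths, not the real contigs, and the function names are only illustrative. It shows why adding one short contig changes `sum`, `n`, `ave` and N100 but leaves N50 untouched. Matching summary stats are strong evidence that two files hold the same assembly, but only a per-contig checksum comparison would prove it.

```python
def nx(lengths, fraction):
    """Return (Nx length, contigs needed) for a fraction such as 0.5 for N50."""
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("no contig lengths given")


def summarize(lengths):
    """Collect the same kind of summary values as listed in the stats above."""
    total = sum(lengths)
    return {
        "sum": total,
        "n": len(lengths),
        "ave": round(total / len(lengths), 2),
        "largest": max(lengths),
        "N50": nx(lengths, 0.5),
        "N100": nx(lengths, 1.0),
    }


# Toy contig lengths (hypothetical, not the real chromosomes).
without_mt = [100, 80, 60, 40, 20]
with_mt = without_mt + [1]  # one extra short contig, standing in for mtDNA

print(summarize(without_mt))
print(summarize(with_mt))

# With the real numbers reported above, the difference between the two
# sums equals the N100 of the 25-contig assemblies, i.e. the extra contig.
print(3117292070 - 3117275501)  # 16569
```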

hoelzer · 2024-08-08T11:04:46Z

In this context, I also checked whether all the other genomes we provide have the mtDNA contig (yes, all have it).
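A check like this could be scripted roughly as follows. This is only a sketch, not how the check was actually done: `contig_lengths` and `short_contigs` are hypothetical helper names, and the 20 kb default cutoff is an assumption chosen because the mtDNA contig in the stats above is 16569 bp. The toy FASTA is made up and compressed in memory so the example runs without downloading anything.

```python
import gzip
import io


def contig_lengths(handle):
    """Map each FASTA record name (first word of the header) to its length."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def short_contigs(lengths, max_len=20_000):
    """Names of contigs at or below max_len (assumed cutoff for mtDNA-sized records)."""
    return [name for name, size in lengths.items() if size <= max_len]


# Tiny in-memory stand-in for a gzipped genome; names and sequences are made up.
toy_fasta = ">chr1 toy\nACGTACGTAC\nACGTAC\n>chrM toy\nACGT\n"
compressed = gzip.compress(toy_fasta.encode())

with gzip.open(io.BytesIO(compressed), "rt") as handle:
    lengths = contig_lengths(handle)

print(lengths)
print(short_contigs(lengths, max_len=5))
```

For a real genome, the same functions would be called on `gzip.open(path, "rt")` with the path of the downloaded `.fna.gz` file. A short contig is only a hint, so the header of any hit should still be inspected by eye.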

hoelzer · 2024-08-08T11:08:35Z

PR ready and also RKI HPC compatible now

matthuska

LGTM

matthuska · 2024-08-08T12:14:55Z

I just added quotes around all URLs. The specific ones we're using are fine (except the SC2 one), but it's a good habit to avoid problems with URLs that contain characters that the shell wants to do things with (e.g. & and ?).
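To illustrate the point about shell-special characters, here is a small Python sketch using the standard library's `shlex.quote`, which adds quotes only when a string contains characters the shell would interpret. The second URL is a hypothetical example and not one used by the pipeline.

```python
import shlex

# URL from this PR: contains only characters the shell treats literally.
plain_url = (
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/914/755/"
    "GCA_009914755.4_T2T-CHM13v2.0/GCA_009914755.4_T2T-CHM13v2.0_genomic.fna.gz"
)

# Hypothetical URL with a query string; '?' is a glob character and '&'
# would send the command to the background if left unquoted.
query_url = "https://example.org/download?id=abc&format=fasta"

for url in (plain_url, query_url):
    print("wget " + shlex.quote(url))
```

The first URL comes back unchanged and the second is wrapped in single quotes. Quoting every URL unconditionally, as done in this PR, avoids having to make that decision case by case.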

hoelzer added 4 commits August 6, 2024 17:17

add T2T v2.0 human genome as download option

8bff893

add T2T v2.0 human genome as download option

8254386

add T2T v2.0 human genome as download option

87a4f6f

error strategy for fastqc

bb19fc1

hoelzer requested review from matthuska and MarieLataretu August 7, 2024 09:57

switch from AWS to NCBI GenBank download, checked all genomes for mtDNA (yes all have)

610ac75

matthuska approved these changes Aug 8, 2024

Quote all urls. Good to avoid problems with special characters being interpreted by the shell

9a9193b

hoelzer merged commit 57f4317 into dev Aug 8, 2024
10 of 12 checks passed

hoelzer deleted the add-t2t-hsa branch August 8, 2024 12:23
