Telomere-to-telomere consortium

We have sequenced the CHM13hTERT human cell line with a number of technologies. Human genomic DNA was extracted from the cultured cell line. As the DNA is native, modified bases will be preserved. The data includes 30x PacBio HiFi, 120x coverage of Oxford Nanopore, 70x PacBio CLR, 50x 10X Genomics, as well as BioNano DLS and Arima Genomics HiC. Most raw data is available from this site, with the exception of the PacBio data which was generated by the University of Washington/PacBio and is available from NCBI SRA.

A UCSC browser is available for v2.0 (as well as legacy v1.0 and v1.1 versions). An interactive dotplot visualization of all genomic repeats is also available from resgen.io. Known issues identified in the assembly are tracked at CHM13 issues.

Data reuse and license

All data is released to the public domain (CC0) and we encourage its reuse. We would appreciate it if you would acknowledge and cite the "Telomere-to-Telomere" (T2T) Consortium for the creation of this data. More information about our consortium can be found on the T2T homepage and a list of related citations is available below:

The complete sequence of a human genome and companion papers:

  1. Nurk S, Koren S, Rhie A, Rautiainen M, et al. The complete sequence of a human genome. bioRxiv, 2021.
  2. Vollger MR, et al. Segmental duplications and their variation in a complete human genome. bioRxiv, 2021.
  3. Gershman A, et al. Epigenetic Patterns in a Complete Human Genome. bioRxiv, 2021.
  4. Aganezov S, Yan SM, Soto DC, Kirsche M, Zarate S, et al. A complete reference genome improves analysis of human genetic variation. bioRxiv, 2021.
  5. Hoyt SJ, et al. From telomere to telomere: the transcriptional and epigenetic state of human repeat elements. bioRxiv, 2021.
  6. Altemose N, et al. Complete genomic and epigenetic maps of human centromeres. bioRxiv, 2021.
  7. Wagner J, et al. Towards a Comprehensive Variation Benchmark for Challenging Medically-Relevant Autosomal Genes. bioRxiv, 2021.
  8. McCartney AM, Shafin K, Alonge M, et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. bioRxiv, 2021.
  9. Jain C, et al. A long read mapping method for highly repetitive reference sequences. bioRxiv, 2021.
  10. Formenti G, Rhie A, et al. Merfin: improved variant filtering and polishing via k-mer validation. bioRxiv, 2021.

Earlier citations:

  1. Vollger MR, et al. Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads. Annals of Human Genetics, 2019.
  2. Miga KH, Koren S, et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature, 2020.
  3. Nurk S, Walenz BP, et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Research, 2020.
  4. Logsdon GA, et al. The structure, function, and evolution of a complete human chromosome 8. Nature, 2021.

Assembly releases

v2.0

Changes from v1.1 include the addition of a finished ChrY from the GIAB HG002 sample, sequenced both by GIAB and HPRC. Analysis sets for mapping-based research are available on AWS with a README.

  • chm13v2.0.fa.gz: T2T-CHM13v2.0 assembly with sequences soft-masked using the repeat models discovered by the T2T team. Sequence names are converted to chr*. chrY in this file was assembled from sample HG002/NA24385. The original sequence accession numbers are shown in the FASTA header.
  • chm13v2.0_noY.fa.gz: excluding the Y chromosome. This file contains only sequences derived from the CHM13 cell line. If you are benchmarking assemblies of CHM13, use this file.
  • chm13v2.0_PAR.bed: pseudoautosomal regions (PARs)
  • chm13v2.0_maskedY.fa.gz: PARs on chrY hard masked to "N"
  • chm13v2.0_maskedY.rCRS.fa.gz: PARs on chrY hard masked to "N" and mitochondrion replaced with rCRS (AC:NC_012920.1)
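
As a rough illustration of what "hard masked" means for the files above, the sketch below replaces every base inside a set of BED intervals with "N". It is a minimal example and not the procedure the consortium used; the function names and the toy sequence are hypothetical, and the only format assumption is the standard BED convention of 0-based, half-open intervals.

```python
def parse_bed(text):
    """Parse BED text into (chrom, start, end) tuples.

    BED coordinates are 0-based and half-open: [start, end).
    """
    intervals = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and header lines
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def hard_mask(seq, intervals):
    """Return seq with every base in each [start, end) interval set to 'N'."""
    masked = list(seq)
    for start, end in intervals:
        start = max(0, min(start, len(masked)))  # clamp to the sequence
        end = max(start, min(end, len(masked)))
        masked[start:end] = "N" * (end - start)
    return "".join(masked)


# Toy example (hypothetical coordinates, NOT the real PAR positions).
toy_bed = "chrY\t2\t5\n"
toy_seq = "ACGTACGTAC"
regions = [(s, e) for chrom, s, e in parse_bed(toy_bed) if chrom == "chrY"]
print(hard_mask(toy_seq, regions))  # ACNNNCGTAC
```

For whole-genome FASTA files, an established tool is likely a better choice than loading sequences into Python strings; the sketch only shows the coordinate logic.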

This genome is also available at NCBI (GCA_009914755.4).

v1.1

Complete T2T reconstruction of a human genome. Changes from v1.0 include filled rDNA gaps and improved polishing within telomeres. One rare heterozygous variant causing a premature stop codon was changed at chr9:134589924 to the more common allele. Also available at NCBI. Changes made from v1.0 to v1.1 are available as a VCF.

v1.0

Complete T2T reconstruction of a human genome, with the exception of 5 known gaps within the rDNA arrays. Polished assembly based on v0.9. Introduces 4 structural corrections and 993 small variant corrections, including a 4 kb telomere extension on chr18. Polishing was performed using a conservative custom pipeline based on DeepVariant calls and structural corrections were manually curated. Consensus quality exceeds Q60. Prior to a preprint being drafted, a brief summary can be found at this blog post. Also available at NCBI. Changes made from v0.9 to v1.0 are available as a VCF.
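
The consensus quality values quoted for these releases (Q60 here, Q37 for the v0.7 draft) are on the Phred scale, where QV = -10 * log10(per-base error probability). The helper below is a small sketch for converting between the two; the function names are illustrative and not part of any T2T tooling.

```python
import math


def phred_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_phred(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


# Q60 corresponds to roughly one error per million bases.
print(phred_to_error_rate(60))
# One error per 10,000 bases corresponds to Q40.
print(error_rate_to_phred(1e-4))
```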

v0.9

T2T reconstruction of all 23 chromosomes of CHM13 based on a custom assembly pipeline, briefly featuring:

  1. Homopolymer-compression and self-correction of PacBio HiFi reads
  2. Rescoring of overlaps to account for recurrent PacBio HiFi errors
  3. Construction and custom pruning of a string graph built over 100% identical overlaps
  4. Manual reconstruction of chromosomal paths through the graph, if necessary aided by ultra-long Nanopore reads
  5. Layout/consensus of original HiFi reads, corresponding to the resulting paths
  6. Patching of regions absent from HiFi data with v0.7 draft sequences
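
To make the first step of the list above concrete, the sketch below shows homopolymer compression in its simplest form: every run of identical bases is collapsed to a single base. This is only an illustration of the idea, not the consortium's pipeline code, and the function name is hypothetical.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical bases to one base (e.g. 'AAA' -> 'A')."""
    return "".join(base for base, _run in groupby(seq))


# Two reads that differ only in the length of a homopolymer run
# become identical after compression.
read_a = "ACGGGGT"
read_b = "ACGGGT"
print(homopolymer_compress(read_a))  # ACGT
print(homopolymer_compress(read_a) == homopolymer_compress(read_b))  # True
```

Because errors in homopolymer run length disappear after compression, overlaps between compressed reads can be held to a much stricter identity threshold.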

Consensus quality exceeds Q60. Mitochondrial sequence DNA included. Centers of the 5 rDNA arrays are represented by N-gaps.

v0.7

Assembly draft v0.7 was generated with Canu v1.7.1 including rel1 data up to 2018/11/15 and incorporating the previously released PacBio data. Two gaps on the X plus the centromere were manually resolved. Contigs with low coverage support were split and the assembly was scaffolded with BioNano. The assembly was polished with two rounds of nanopolish and two rounds of arrow. The X polishing was done using unique markers matched between the assembly and the raw read data; the rest of the genome used traditional polishing. Finally, the assembly was polished with 10X Genomics data. We validated the assembly using independent BACs. The overall QV is estimated to be Q37 (Q42 in unique regions) and the assembly resolves over 80% of available CHM13 BACs (280/341). The assembly is 2.94 Gbp in size with 359 scaffolds (448 contigs) and an NG50 of 83 Mbp (70 Mbp). Outside of Chr8 and ChrX, this should be considered a draft and likely has mis-assemblies. Older unpolished assemblies are available for benchmarking purposes, but are of lower quality and should not be used for analyses. Also available at NCBI.
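
For readers unfamiliar with the NG50 statistic quoted above, the sketch below computes it alongside the more common N50. Both are the length of the sequence at which the running total of lengths, sorted longest first, reaches half of a target size; N50 uses the total assembly size as the target, while NG50 uses an (estimated) genome size. The function names and toy lengths are illustrative only.

```python
def nx50(lengths, target_size):
    """Length at which the running total (longest first) reaches half the target."""
    half = target_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # the sequences cover less than half of the target size


def n50(lengths):
    """N50: the target is the total assembly size."""
    return nx50(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: the target is the genome size, independent of assembly size."""
    return nx50(lengths, genome_size)


toy_lengths = [50, 40, 30, 20, 10]         # toy values, not real contig lengths
print(n50(toy_lengths))                    # 40 (half of 150 is reached at 50 + 40)
print(ng50(toy_lengths, genome_size=200))  # 30 (half of 200 is reached at 50 + 40 + 30)
```

Because NG50 is tied to the genome size rather than the assembly size, it stays comparable between assemblies of the same genome that differ in total length.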

HG002 Chromosome X

Finished sequence is available from NCBI. An earlier draft v0.7 was generated with the same methods used for CHM13 asm v0.9, using HG002 HiFi data available from the HPRC HG002 data freeze. Due to HiFi coverage gaps which were not patched, the draft is missing approximately 2 Mbp on the p-arm (including the PAR).

Downloads

Sequencing Data

HiFi Data

A total of 100 Gbp of data (32.4x coverage) in HiFi 20 kbp libraries (used for v0.9-v1.1 assemblies) is available from NCBI. An additional 76 Gbp of data (24.4x coverage) is available in HiFi 10 kbp libraries at NCBI. The raw subreads for the 20 kbp libraries are available below.
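
The coverage figures above follow from dividing the sequenced bases by the genome size. The sketch below shows that arithmetic and, run in reverse, the genome size implied by the quoted numbers. The function names are illustrative, and the README does not state which genome size was used for its coverage estimates.

```python
def fold_coverage(total_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


def implied_genome_size(total_bases, coverage):
    """Genome size that a (bases, coverage) pair implies."""
    return total_bases / coverage


# Figures quoted in the text: 100 Gbp at 32.4x and 76 Gbp at 24.4x.
print(round(implied_genome_size(100e9, 32.4) / 1e9, 2))  # 3.09 (Gbp)
print(round(implied_genome_size(76e9, 24.4) / 1e9, 2))   # 3.11 (Gbp)
```

Both pairs imply a genome size of about 3.1 Gbp, so the two coverage figures are consistent with each other.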

raw subreads (genomic DNA) (NOTE: these are the individual raw subreads, NOT HiFi reads. Most users will want to download the HiFi reads from the links above.)

Oxford Nanopore Data

Nanopore sequencing was performed using Josh Quick's ultra-long read (UL) protocol and modifications as described in The structure, function, and evolution of a complete human chromosome 8.

We sequenced a total of 390 Gbp of data (126x coverage). The read N50 is 58 kbp and there are 219 Gbp in reads >50 kbp (71x). The longest full-length mapping read is 1.3 Mbp. Sequencing data was generated from three lines of CHM13 (NHGRI, UW, UCD), which all originate from the original line established by Urvashi Surti. Only the NHGRI line was karyotyped and confirmed to be stable prior to sequencing.

For the NHGRI line, NHGRI (PI: Phillippy) and University of Nottingham (PI: Loose) contributed approximately 140 flowcells of UL data using Quick's ultra-long protocol; 199 Gbp (64x, 1.4 Gbp/flowcell). The read N50 is 71 kbp and there are 128 Gbp of data in reads >50 kbp (41x).

For the UW line, University of Washington (PI: Eichler) contributed 106 flowcells of UL data using a new UL protocol developed by Glennis Logsdon; 69 Gbp (22x, 0.6 Gbp/flowcell). The read N50 is 133 kbp and there are 57 Gbp of data in reads >50 kbp (18x).

For the UCD line, UC Davis (PI: Dennis) contributed two PromethION cells using a ligation prep; 114 Gbp (37x, 57 Gbp/flowcell). The read N50 is 36 kbp and there are 25 Gbp of data in reads >50 kbp (8x).
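
The "data in reads >50 kbp" figures above can be reproduced from a list of read lengths. The sketch below is a minimal example of that tally, not the script used to produce the published numbers; the function name, the dictionary keys, and the toy read lengths are all hypothetical.

```python
def long_read_yield(read_lengths, min_length=50_000, genome_size=None):
    """Summarize how much data sits in reads strictly longer than min_length."""
    total = sum(read_lengths)
    long_bases = sum(n for n in read_lengths if n > min_length)
    summary = {
        "total_bases": total,
        "long_bases": long_bases,
        "long_fraction": long_bases / total if total else 0.0,
    }
    if genome_size:
        # Coverage contributed by the long reads alone.
        summary["long_coverage"] = long_bases / genome_size
    return summary


# Toy read lengths in bp (not real data); the 50 kbp read is excluded
# because the threshold is strict (> 50 kbp).
toy_reads = [10_000, 60_000, 120_000, 50_000]
print(long_read_yield(toy_reads))
```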

Read ids broken out by sequencing location are available for NHGRI, U of Nottingham, UW, and UCD.

rel8 (genomic DNA)

rel8 is the full dataset as of 2020/10/01. All data was re-called using Guppy 5.0.7.

Downloads

rel7 (genomic DNA)

rel7 is the full dataset as of 2020/10/01. All data was re-called using Bonito v0.3.1.

Downloads

rel6 (genomic DNA)

rel6 is the full dataset as of 2020/10/01, adding UW data from partitions 232-243. All data was re-called using Guppy 3.6.0 with the HAC model.

Downloads

rel5 (genomic DNA)

rel5 is the full dataset as of 2019/09/01. All data was re-called using Guppy 3.6.0 with the HAC model.

Downloads

rel4 (genomic DNA)

rel4 is the full dataset as of 2019/09/01. All data was re-called using Guppy 3.4.5 with the HAC model.

Downloads

rel3 (genomic DNA)

rel3 is the full dataset as of 2019/09/01. All data was re-called using Guppy 3.1.5 with the HAC model. We have provided mappings both to our current draft assembly and to GRCh38 with decoys in CRAM format, using minimap2.

Downloads

rel2 (genomic DNA)

rel2 is the same data as rel1 but recalled with the latest generation callers (Guppy flip-flop 2.3.1). We have provided mappings both to our current draft assembly and to the GRCh38 with decoys in cram format, using minimap2.

Downloads

rel1 (genomic DNA)

The full dataset as of 2019/01/09. These basecalls were generated on-instrument and use older versions of Guppy (depending on when the flowcell ran on the instrument).

Downloads

fast5 data

The raw fast5 data, without basecalls, is available for completeness. The data is grouped into 243 sets.

  • Partitions 1-94 were sequenced at NHGRI

  • Partitions 95-98 were sequenced at University of Nottingham

  • Partitions 99-144 were sequenced at NHGRI

  • Partitions 145-224 were sequenced at University of Washington

  • Partitions 225-226 were sequenced at UC Davis

  • Partitions 227-231 were sequenced at NHGRI

  • Partitions 232-243 were sequenced at University of Washington

  • Note that when the tgz files were grouped and uploaded, some inadvertently included more than a single partition. These are denoted as partition ranges in the downloads (e.g. 145-149).
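
The partition ranges listed above can be encoded as a small lookup table, which is handy when deciding which fast5 archives to download for a given sequencing site. The sketch below uses only the ranges given in the list; the names `PARTITION_SITES` and `site_for_partition` are hypothetical.

```python
# (first partition, last partition, sequencing site), as listed above.
PARTITION_SITES = [
    (1, 94, "NHGRI"),
    (95, 98, "University of Nottingham"),
    (99, 144, "NHGRI"),
    (145, 224, "University of Washington"),
    (225, 226, "UC Davis"),
    (227, 231, "NHGRI"),
    (232, 243, "University of Washington"),
]


def site_for_partition(partition):
    """Return the sequencing site for a fast5 partition number (1-243)."""
    for first, last, site in PARTITION_SITES:
        if first <= partition <= last:
            return site
    raise ValueError(f"unknown partition: {partition}")


def partitions_per_site():
    """Count how many partitions each site contributed."""
    counts = {}
    for first, last, site in PARTITION_SITES:
        counts[site] = counts.get(site, 0) + (last - first + 1)
    return counts


print(site_for_partition(96))   # University of Nottingham
print(partitions_per_site())
```

The per-site counts sum to 243, matching the number of sets stated in the text.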

Downloads

Illumina PCRFree Data

A total of >300 Gbp of data (105x coverage) in PCR-Free Illumina libraries is available from NCBI.

10X Genomics Data

Raw fastq files

Approximately 50x of data was generated on a NovaSeq instrument. Based on the summary output of Supernova, there are 1.2 billion reads with 41x effective coverage. The mean molecule length is 130 kbp, with an N50 of 864 reads per barcode.

Downloads

BioNano DLS Data

Approximately 430x of data was generated using the Saphyr instrument and the DLE-1 enzyme. There are 15.2 M molecules with an N50 molecule length of 115.9 kbp and a max of 2.3 Mbp (2 M molecules > 150 kbp, N50 218 kbp). The assembly of the molecules is 2.97 Gbp in size with 255 contigs and an NG50 of 59.6 Mbp.

Downloads

  • BNX (md5: 59a7a5583e900e1e5cecb08a34b5b0dc)
  • CMAP (md5: cf1a6fbcf006a26673499b9297664fdb)
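
The md5 values listed above can be used to check that a download completed intact. The sketch below computes a file's MD5 digest in chunks, so large files need not fit in memory, and compares it with an expected value. The function names and the local file name in the usage comment are hypothetical; substitute the path you actually downloaded to.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_md5):
    """True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Usage (hypothetical local file name):
# matches_md5("chm13.bnx.gz", "59a7a5583e900e1e5cecb08a34b5b0dc")
```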

Hi-C Data

A library was generated using an Arima genomics kit and sequenced to approximately 40x on an Illumina HiSeq X.

Downloads

RNA-seq data

Two separate poly-A prep libraries were generated at UC Davis, and 2x150 bp RNA-seq reads were generated on an Illumina NovaSeq (~25 million PE reads each).

Downloads

Previously generated PacBio data

The PacBio data (both CLR and HiFi) was previously generated and is available from the SRA. The cells used for arrow polishing the v0.7 assembly are listed here.

Notes on downloading files.

Files are generously hosted by Amazon Web Services. Although the files are available as straightforward HTTP links, download performance is improved by using the Amazon Web Services command-line interface. To use it, amend references to the s3:// addressing scheme, i.e. replace https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/ with s3://human-pangenomics/T2T/. For example, to download CHM13_prep5_S13_L002_I1_001.fastq.gz to the current working directory, use the following command.

aws s3 --no-sign-request cp s3://human-pangenomics/T2T/CHM13/10x/CHM13_prep5_S13_L002_I1_001.fastq.gz .

or to download the full dataset use the following command.

aws s3 --no-sign-request sync s3://human-pangenomics/T2T/CHM13/ .

The s3 command can also be used to get information on the dataset, for example reporting the size of every file in human-readable format.

aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/ 

or to obtain technology-specific sizes.

aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/nanopore/fast5
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/nanopore/rel2
aws s3 --no-sign-request ls --recursive --human-readable --summarize s3://human-pangenomics/T2T/CHM13/assemblies

Amending the max_concurrent_requests etc. settings as per this guide will improve download performance further.

You can also browse all the files available on S3 via web interface.
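
The HTTPS-to-s3:// rewrite described in this section is a simple prefix swap, and the sketch below automates it for scripted downloads. It only builds the argument list for the `aws s3 cp` invocation shown earlier and does not run it; the function names are hypothetical, and the two prefixes are the ones quoted in the text.

```python
HTTPS_PREFIX = "https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/"
S3_PREFIX = "s3://human-pangenomics/T2T/"


def to_s3_uri(https_url):
    """Rewrite a T2T HTTPS download link to its s3:// equivalent."""
    if not https_url.startswith(HTTPS_PREFIX):
        raise ValueError(f"not a T2T HTTPS link: {https_url}")
    return S3_PREFIX + https_url[len(HTTPS_PREFIX):]


def aws_cp_args(https_url, destination="."):
    """Build the argument list for an unsigned `aws s3 cp` of one file."""
    return ["aws", "s3", "--no-sign-request", "cp",
            to_s3_uri(https_url), destination]


if __name__ == "__main__":
    url = HTTPS_PREFIX + "CHM13/10x/CHM13_prep5_S13_L002_I1_001.fastq.gz"
    # Prints the same cp command shown in the section above.
    print(" ".join(aws_cp_args(url)))
```

The returned list can be passed to `subprocess.run` if the AWS command-line interface is installed.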

Contact

Please raise issues concerning this dataset on this GitHub repository.

History

* rel1 and 2: 2nd March 2019. Initial release.
* asm v0.6 and canu rel2 assembly: 28th May 2019. Assembly update.
* Hi-C data added: 25th July 2019. Data update.
* asm v0.6 alignments of rel2 added: 30th Aug 2019. Data Update
* rel3: 16th Sept 2019. Data update.
* chrX v0.7, canu 1.9 and flye 2.5 rel3 assembly: 24th Oct 2019. Assembly update.
* shasta rel3 assembly: 20th Dec 2019. Assembly update.
* chr8 v3, rel4 data: 21 Feb 2020. Data and assembly update.
* update rel3 partition names since some tars included more than a single partition. 16 Apr 2020.
* add CLR/HiFi mappings to chrX v0.7. 8 May 2020.
* update partitions 23,28,30,53,55 and add 227-231 (data was missing from upload). 13 May 2020. Data update.
* add rel5 guppy 3.6.0 data: 4 Jun 2020. Data update.
* add chr8 v9. Aug 26 2020. Assembly update.
* add v0.9/v1.0 genome releases. Sept 22 2020. Assembly update.
* add v0.9/v1.0 alignment files. Sept 29 2020. Assembly update.
* add new UW data. Oct 6 2020. Data update.
* add rna-seq data. Dec 4 2020. Data update.
* add repeat and telomere annotations for v1.0. Dec 17 2020. Assembly annotation update.
* add v1.1 assembly and related files. May 7 2021. Assembly update.