Whole Human Genome Sequencing Project

Introduction

We have sequenced the CEPH1463 (NA12878/GM12878, Ceph/Utah pedigree) human genome reference standard on the Oxford Nanopore MinION using 1D ligation kits (450 bp/s) with R9.4 chemistry (FLO-MIN106).

Human genomic DNA from the GM12878 human cell line (Ceph/Utah pedigree) was either purchased from Coriell (cat no NA12878), referred to below as "DNA", or extracted from the cultured cell line, referred to below as "cells". As the DNA is native, modified bases will be preserved.

Data reuse and license

This data is released under the Creative Commons CC-BY license, and we encourage its reuse in your own analyses and publications. If you do reuse it, we would be grateful if you would cite the reference below.

Citation

Miten Jain, Sergey Koren, Karen H Miga, Josh Quick, Arthur C Rand, Thomas A Sasani, John R Tyson, Andrew D Beggs, Alexander T Dilthey, Ian T Fiddes, Sunir Malla, Hannah Marriott, Tom Nieto, Justin O'Grady, Hugh E Olsen, Brent S Pedersen, Arang Rhie, Hollian Richardson, Aaron R Quinlan, Terrance P Snutch, Louise Tee, Benedict Paten, Adam M Phillippy, Jared T Simpson, Nicholas J Loman & Matthew Loose. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature Biotechnology. doi:10.1038/nbt.4060.

Preprint

Miten Jain, Sergey Koren, Josh Quick, Arthur C Rand, Thomas A Sasani, John R Tyson, Andrew D Beggs, Alexander T Dilthey, Ian T Fiddes, Sunir Malla, Hannah Marriott, Karen H Miga, Tom Nieto, Justin O'Grady, Hugh E Olsen, Brent S Pedersen, Arang Rhie, Hollian Richardson, Aaron Quinlan, Terrance P Snutch, Louise Tee, Benedict Paten, Adam M. Phillippy, Jared T Simpson, Nicholas James Loman, Matthew Loose. Nanopore sequencing and assembly of a human genome with ultra-long reads. bioRxiv. doi: https://doi.org/10.1101/128835.

rel3

Basecalls

The rel3 release consists of the full dataset, and has two new rapid kit runs with a new long DNA extraction method:

  • 39 flowcells
  • 91240120433 bases
  • 14183584 reads

| flowcell_id | reads | bases | Date | Centre | SampleType | Kit | Pore | Links |
|---|---|---|---|---|---|---|---|---|
| FAB23716 | 356209 | 1409812422 | 14/07/16 | UBC | DNA | Rapid | R9 | FASTQ |
| FAB39088 | 658224 | 3287994454 | 19/09/16 | Notts | DNA | Ligation | R9.4 | FASTQ |
| FAB39075 | 466329 | 2439355478 | 20/09/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB39043 | 436976 | 2273008592 | 23/09/16 | Bham | DNA | Ligation | R9.4 | FASTQ |
| FAB42706 | 430660 | 1966505502 | 12/10/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB41174 | 117057 | 687394987 | 13/10/16 | Bham | DNA | Ligation | R9.4 | FASTQ |
| FAB42260 | 267644 | 1399557161 | 13/10/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB42804 | 16669 | 75062609 | 14/10/16 | Bham | DNA | Ligation | R9.4 | FASTQ |
| FAB42316 | 572838 | 3275026637 | 14/10/16 | Notts | DNA | Ligation | R9.4 | FASTQ |
| FAB42205 | 317654 | 1686630108 | 14/10/16 | Notts | DNA | Ligation | R9.4 | FASTQ |
| FAB42561 | 233678 | 1520513556 | 19/10/16 | Notts | DNA | Ligation | R9.4 | FASTQ |
| FAB42473 | 644869 | 3357548938 | 19/10/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB42395 | 38291 | 179704035 | 20/10/16 | Norwich | DNA | Ligation | R9.4 | FASTQ |
| FAB42476 | 435158 | 2363036522 | 27/10/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB42451 | 817629 | 4530477841 | 28/10/16 | Notts | DNA | Ligation | R9.4 | FASTQ |
| FAB42704 | 276152 | 1750149482 | 28/10/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB42828 | 33527 | 163405138 | 01/11/16 | Norwich | DNA | Ligation | R9.4 | FASTQ |
| FAB42810 | 322058 | 2020615256 | 02/11/16 | Norwich | DNA | Ligation | R9.4 | FASTQ |
| FAB42798 | 193551 | 1339441522 | 03/11/16 | Norwich | DNA | Ligation | R9.4 | FASTQ |
| FAB45280 | 128234 | 799554798 | 11/11/16 | Norwich | DNA | Ligation | R9.4 | FASTQ |
| FAB46664 | 491346 | 2038018797 | 15/11/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB46683 | 72605 | 286275511 | 17/11/16 | Bham | DNA | Ligation | R9.4 | FASTQ |
| FAB45332 | 530938 | 2864140853 | 17/11/16 | UBC | DNA | Ligation | R9.4 | FASTQ |
| FAB43577 | 426941 | 2539015084 | 18/11/16 | UCSC | DNA | Ligation | R9.4 | FASTQ |
| FAB44989 | 558224 | 3443824633 | 18/11/16 | UCSC | DNA | Ligation | R9.4 | FASTQ |
| FAF01169 | 339447 | 2913892142 | 22/11/16 | Bham | Cells | Ligation | R9.4 | FASTQ |
| FAF01441 | 254705 | 2203636947 | 22/11/16 | Bham | Cells | Ligation | R9.4 | FASTQ |
| FAB45277 | 53547 | 445641679 | 22/11/16 | Notts | Cells | Ligation | R9.4 | FASTQ |
| FAB45321 | 299174 | 2584017112 | 22/11/16 | Notts | Cells | Ligation | R9.4 | FASTQ |
| FAF01127 | 632728 | 4972081712 | 25/11/16 | Bham | Cells | Ligation | R9.4 | FASTQ |
| FAF01132 | 689781 | 5455971336 | 25/11/16 | Bham | Cells | Ligation | R9.4 | FASTQ |
| FAB49712 | 632158 | 4906148911 | 28/11/16 | Bham | Cells | Ligation | R9.4 | FASTQ |
| FAF01253 | 471698 | 3695661984 | 28/11/16 | Bham | Cells | Ligation | R9.4 | FASTQ |
| FAB45321* | 123037 | 1043504055 | 28/11/16 | Notts | Cells | Ligation | R9.4 | FASTQ |
| FAB49914 | 309175 | 2841008085 | 28/11/16 | Notts | Cells | Ligation | R9.4 | FASTQ |
| FAB45271 | 472656 | 3689043164 | 28/11/16 | Notts | Cells | Ligation | R9.4 | FASTQ |
| FAB49164 | 746333 | 4438258089 | 06/12/16 | UCSC | DNA | Ligation | R9.4 | FASTQ |
| FAB49908 | 224380 | 3141600861 | 09/12/16 | Bham | Cells | Rapid | R9.4 | FASTQ |
| FAF04090 | 91304 | 1213584440 | 09/12/16 | Bham | Cells | Rapid | R9.4 | FASTQ |

Please verify downloads against MD5 hashes.

[*] This flowcell ID was input incorrectly.
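
The MD5 check requested above can be done with any standard tool. As a minimal sketch in Python, assuming the published hash is available as a hexadecimal string (the function names here are illustrative and not part of this project):

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte FASTQ or BAM files
        # never need to fit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's MD5 against a published hash (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A download would then be accepted only if `matches_md5(local_path, published_hash)` returns `True`.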

rel4

Rel4 adds 23140190547 bases in 1415868 reads, predominantly using the new ultra-long read protocol.

| asic_id | nreads | mn | count | n50 | flowcell | centre | kit | date | sequencedate |
|---|---|---|---|---|---|---|---|---|---|
| 16056159 | 82138 | 21998 | 1806857522 | 114375 | FAF15665 | Notts | Ultra | 10/03/2017 | FASTQ |
| 17958431 | 53723 | 23321 | 1252868852 | 77045 | FAF13748 | Notts | Ultra | 10/03/2017 | FASTQ |
| 2901545329 | 41385 | 20506 | 848632752 | 54473 | FAF10039 | Bham | Ultra | 01/03/2017 | FASTQ |
| 3439856925 | 19674 | 30217 | 594496244 | 121393 | FAF09968 | Bham | Ultra | 03/03/2017 | FASTQ |
| 3709819546 | 73755 | 26946 | 1987434656 | 117805 | FAF09277 | Bham | Ultra | 03/06/2017 | FASTQ |
| 3976726082 | 75692 | 24191 | 1831031405 | 88882 | FAF14035 | Notts | Ultra | 08/03/2017 | FASTQ |
| 4109802543 | 61227 | 25048 | 1533616061 | 104528 | FAF15694 | Bham | Ultra | 06/03/2017 | FASTQ |
| 4111860526 | 65142 | 25171 | 1639658993 | 93299 | FAF09713 | Bham | Ultra | 07/03/2017 | FASTQ |
| 4178920553 | 270189 | 10106 | 2730589684 | 24848 | FAF18554 | UBC | Rapid | 06/03/2017 | FASTQ |
| 4244782843 | 9663 | 33401 | 322753214 | 102804 | FAF15630 | Notts | Ultra | 09/03/2017 | FASTQ |
| 4245291640 | 72936 | 20524 | 1496943560 | 92109 | FAF09640 | Bham | Ultra | 07/03/2017 | FASTQ |
| 4249180049 | 68169 | 25394 | 1731054841 | 119444 | FAF09701 | Bham | Ultra | 03/03/2017 | FASTQ |
| 82266371 | 71155 | 24602 | 1750584936 | 118548 | FAF15586 | Bham | Ultra | 08/03/2017 | FASTQ |
| 87644245 | 451020 | 8012 | 3613667827 | 13920 | FAF05869 | UBC | Ligation | 08/03/2017 | FASTQ |
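
The rel4 table reports a mean (`mn`) and an `n50` per flowcell. The `mn` values appear to be `count / nreads` (for FAF15665, 1806857522 / 82138 rounds to 21998). The sketch below shows one common way to compute both statistics from a list of read lengths; it assumes the usual N50 definition (the length L such that reads of length at least L contain half of all bases) and is not necessarily the script used to build the table.

```python
def mean_length(lengths):
    """Arithmetic mean read length."""
    if not lengths:
        raise ValueError("no read lengths given")
    return sum(lengths) / len(lengths)


def n50(lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_of_bases = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_of_bases:
            return length
    raise ValueError("no read lengths given")
```

Because long reads carry a disproportionate share of the bases, the N50 is usually well above the mean, which is the pattern seen in the ultra-long runs.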

Alignments by flowcell

Reads for rel3 (the 30x coverage dataset) were aligned against the pre-computed 1000 Genomes GRCh38 BWA database with decoys at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/GRCh38_reference_genome/ using BWA MEM (commit: 5961611c358e480110793bbf241523a3cfac049b) with parameters -x ont2d. Alignment statistics were calculated using samtools stats (samtools version 1.3.1).
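
As a rough sketch of the alignment step described above, the following Python helpers assemble the corresponding command lines. Only `bwa mem -x ont2d` and `samtools stats` come from the description; the file names, the thread count and the `samtools sort` step used to produce a BAM file are assumptions made for illustration.

```python
import shlex


def bwa_mem_command(reference, fastq, threads=4):
    """bwa mem with the ont2d preset; `reference` is the BWA index prefix."""
    return ["bwa", "mem", "-x", "ont2d", "-t", str(threads), reference, fastq]


def samtools_stats_command(bam):
    """samtools stats writes its text report to standard output."""
    return ["samtools", "stats", bam]


def shell_pipeline(reference, fastq, out_bam, threads=4):
    """Render align-then-sort as a single shell pipeline string.

    The sort step is an assumption: the source only names bwa mem
    and samtools stats.
    """
    align = bwa_mem_command(reference, fastq, threads)
    sort = ["samtools", "sort", "-o", out_bam, "-"]
    return " | ".join(
        " ".join(shlex.quote(arg) for arg in command) for command in (align, sort)
    )
```

The lists returned by the first two helpers can be passed directly to `subprocess.run`, while `shell_pipeline` is meant for printing or for pasting into a shell script.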

| FileID | Sequences | Mapped | Mapped MQ0 | Unmapped | Bases Mapped | Avg Length | Link |
|---|---|---|---|---|---|---|---|
| FAB23716 | 356209 | 319259 | 26702 | 36950 | 1165998694 | 3957 | BAM BAI |
| FAB39088 | 658224 | 613044 | 35394 | 45180 | 3007307322 | 4995 | BAM BAI |
| FAB39075 | 466329 | 425117 | 28167 | 41212 | 2146453407 | 5230 | BAM BAI |
| FAB39043 | 436976 | 415389 | 21043 | 21587 | 2113140439 | 5201 | BAM BAI |
| FAB42706 | 430660 | 375374 | 17378 | 55286 | 1867123361 | 4566 | BAM BAI |
| FAB41174 | 117057 | 114520 | 4186 | 2537 | 652217119 | 5872 | BAM BAI |
| FAB42260 | 267644 | 246982 | 15624 | 20662 | 1263089767 | 5229 | BAM BAI |
| FAB42804 | 16669 | 13311 | 1755 | 3358 | 53666089 | 4503 | BAM BAI |
| FAB42316 | 572838 | 512994 | 18985 | 59844 | 3100596254 | 5717 | BAM BAI |
| FAB42205 | 317654 | 282502 | 12561 | 35152 | 1601397762 | 5309 | BAM BAI |
| FAB42561 | 233678 | 225141 | 10255 | 8537 | 1420740185 | 6506 | BAM BAI |
| FAB42473 | 644869 | 611138 | 32539 | 33731 | 3112342902 | 5206 | BAM BAI |
| FAB42395 | 38291 | 36477 | 2059 | 1814 | 167168840 | 4693 | BAM BAI |
| FAB42476 | 435158 | 416969 | 20908 | 18189 | 2214880871 | 5430 | BAM BAI |
| FAB42451 | 817629 | 779328 | 36986 | 38301 | 4178966543 | 5540 | BAM BAI |
| FAB42704 | 276152 | 263722 | 12926 | 12430 | 1619875186 | 6337 | BAM BAI |
| FAB42828 | 33527 | 27843 | 2442 | 5684 | 146819837 | 4873 | BAM BAI |
| FAB42810 | 322058 | 305070 | 16802 | 16988 | 1808343119 | 6274 | BAM BAI |
| FAB42798 | 193551 | 185739 | 8749 | 7812 | 1232035338 | 6920 | BAM BAI |
| FAB45280 | 128234 | 122219 | 6336 | 6015 | 743280816 | 6235 | BAM BAI |
| FAB46664 | 491346 | 456247 | 27622 | 35099 | 1862427349 | 4147 | BAM BAI |
| FAB46683 | 72605 | 64739 | 5307 | 7866 | 269213160 | 3942 | BAM BAI |
| FAB45332 | 530938 | 497862 | 26392 | 33076 | 2620752139 | 5394 | BAM BAI |
| FAB43577 | 426941 | 410137 | 19835 | 16804 | 2344990054 | 5946 | BAM BAI |
| FAB44989 | 558224 | 536572 | 25936 | 21652 | 3161900821 | 6169 | BAM BAI |
| FAF01169 | 339447 | 315489 | 16481 | 23958 | 2677881316 | 8584 | BAM BAI |
| FAF01441 | 254705 | 238834 | 12458 | 15871 | 2010117898 | 8651 | BAM BAI |
| FAB45277 | 53547 | 51957 | 2132 | 1590 | 426639054 | 8322 | BAM BAI |
| FAB45321 | 299174 | 283355 | 15165 | 15819 | 2366003310 | 8637 | BAM BAI |
| FAF01127 | 632728 | 605633 | 27192 | 27095 | 4640355789 | 7858 | BAM BAI |
| FAF01132 | 689781 | 655357 | 33564 | 34424 | 4966810089 | 7909 | BAM BAI |
| FAB49712 | 632158 | 612752 | 26264 | 19406 | 4594356245 | 7760 | BAM BAI |
| FAF01253 | 471698 | 454434 | 20639 | 17264 | 3430678969 | 7834 | BAM BAI |
| FAB45321 | 123037 | 118311 | 5891 | 4726 | 952851126 | 8481 | BAM BAI |
| FAB49914 | 309175 | 296250 | 12281 | 12925 | 2673848960 | 9188 | BAM BAI |
| FAB45271 | 472656 | 450702 | 20148 | 21954 | 3468377327 | 7804 | BAM BAI |
| FAB49164 | 746333 | 718351 | 32664 | 27982 | 4107087899 | 5946 | BAM BAI |
| FAB49908 | 224380 | 211060 | 11903 | 13320 | 2898563539 | 14001 | BAM BAI |
| FAF04090 | 91304 | 83164 | 6072 | 8140 | 1085757398 | 13291 | BAM BAI |
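
The columns above look like values from the summary-number (`SN`) section of `samtools stats` output, where each line is tab-separated as `SN`, a label ending in a colon, and a value. The mapping of table columns to labels such as `sequences`, `reads mapped`, `reads MQ0`, `reads unmapped`, `bases mapped` and `average length` is an assumption, and the function names are illustrative. A minimal parser sketch:

```python
def parse_summary_numbers(stats_text):
    """Collect the SN lines of a samtools stats report into a dict."""
    summary = {}
    for line in stats_text.splitlines():
        if not line.startswith("SN\t"):
            continue  # skip comments and the other report sections
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        key = fields[1].rstrip(":")
        raw = fields[2].strip()  # any trailing "# comment" field is ignored
        try:
            summary[key] = int(raw)
        except ValueError:
            try:
                summary[key] = float(raw)
            except ValueError:
                summary[key] = raw
    return summary


def mapped_fraction(summary):
    """Fraction of sequences that were mapped."""
    return summary["reads mapped"] / summary["sequences"]
```

Applied to a report for one flowcell, `mapped_fraction` gives a quick per-run check of how much of the data aligned.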

Alignments by chromosome

Flowcell alignments were separated into individual chromosomes using samtools merge.
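
One way to do this kind of split is the `-R` option of `samtools merge`, which restricts the merge to a single region. Whether this exact invocation was used here is an assumption, as are the output file naming and the helper names below; region merging is generally expected to need coordinate-sorted, indexed input BAM files.

```python
# Chromosome names as they appear in the per-chromosome table.
CHROMOSOMES = [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY", "chrM"]


def merge_region_command(region, output_bam, input_bams):
    """samtools merge limited to one region across all flowcell BAMs."""
    return ["samtools", "merge", "-R", region, output_bam, *input_bams]


def per_chromosome_commands(input_bams, prefix="rel3"):
    """One merge command per chromosome; the prefix is a hypothetical name."""
    return [
        merge_region_command(chrom, f"{prefix}.{chrom}.bam", input_bams)
        for chrom in CHROMOSOMES
    ]
```

Each returned list can be passed to `subprocess.run`, giving one merged BAM per chromosome from the full set of flowcell alignments.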

| Chrom | Mapped # | Mapped MQ0 | Bases Mapped | Avg Length | BAM | BAI |
|---|---|---|---|---|---|---|
| chr1 | 1075867 | 43397 | 6829526262 | 6744 | BAM | BAI |
| chr2 | 1062314 | 31802 | 6755642896 | 6842 | BAM | BAI |
| chr3 | 858643 | 24189 | 5487703898 | 6757 | BAM | BAI |
| chr4 | 845677 | 30723 | 5395140705 | 6890 | BAM | BAI |
| chr5 | 774613 | 23499 | 4953273570 | 6821 | BAM | BAI |
| chr6 | 723047 | 24496 | 4618883250 | 6762 | BAM | BAI |
| chr7 | 696473 | 28231 | 4382999832 | 6772 | BAM | BAI |
| chr8 | 617988 | 23361 | 3968911801 | 6844 | BAM | BAI |
| chr9 | 539660 | 25898 | 3428430670 | 6764 | BAM | BAI |
| chr10 | 594688 | 20787 | 3805443564 | 6845 | BAM | BAI |
| chr11 | 583055 | 17748 | 3710684724 | 6855 | BAM | BAI |
| chr12 | 586663 | 17891 | 3734922623 | 6840 | BAM | BAI |
| chr13 | 440615 | 17662 | 2844212242 | 6904 | BAM | BAI |
| chr14 | 383777 | 15752 | 2439119767 | 6713 | BAM | BAI |
| chr15 | 359853 | 19556 | 2268233023 | 6838 | BAM | BAI |
| chr16 | 386401 | 22680 | 2425913744 | 6787 | BAM | BAI |
| chr17 | 369036 | 22907 | 2302471086 | 6661 | BAM | BAI |
| chr18 | 339094 | 13053 | 2172098564 | 6807 | BAM | BAI |
| chr19 | 257039 | 10926 | 1472760724 | 6266 | BAM | BAI |
| chr20 | 291960 | 13226 | 1829244829 | 6659 | BAM | BAI |
| chr21 | 192383 | 24988 | 1207807437 | 6792 | BAM | BAI |
| chr22 | 172934 | 10514 | 1041347396 | 6665 | BAM | BAI |
| chrX | 658347 | 28769 | 4210769167 | 7076 | BAM | BAI |
| chrY | 23378 | 5292 | 133803203 | 7869 | BAM | BAI |
| chrM | 59363 | 658 | 91949786 | 1628 | BAM | BAI |

FAST5 (Signal Level files)

FAST5 files for the 30x dataset have been split by chromosome according to the above alignments, so some files may appear in multiple archives (they can be made non-redundant by filename). Each complete 'part' contains 100,000 reads and should be roughly in sort order along the chromosome to aid region-by-region analysis.

chrom
chr1 part1 (391 G) part2 (291 G) part3 (284 G) part4 (265 G) part5 (265 G) part6 (242 G) part7 (269 G) part8 (202 G) part9 (205 G)
chr2 part1 (395 G) part2 (311 G) part3 (279 G) part4 (287 G) part5 (288 G) part6 (300 G) part7 (266 G) part8 (247 G) part9 (223 G)
chr3 part1 (338 G) part2 (310 G) part3 (308 G) part4 (249 G) part5 (290 G) part6 (265 G) part7 (278 G) part8 (220 G) part9 (236 G)
chr4 part1 (423 G) part2 (346 G) part3 (344 G) part4 (245 G) part5 (321 G) part6 (237 G) part7 (379 G) part8 (214 G) part9 (213 G)
chr5 part1 (385 G) part2 (393 G) part3 (286 G) part4 (286 G) part5 (264 G) part6 (298 G) part7 (259 G) part8 (215 G) part9 (207 G)
chr6 part1 (313 G) part2 (319 G) part3 (298 G) part4 (318 G) part5 (263 G) part6 (258 G) part7 (264 G) part8 (230 G) part9 (207 G)
chr7 part1 (366 G) part2 (332 G) part3 (308 G) part4 (335 G) part5 (299 G) part6 (243 G) part7 (231 G) part8 (242 G) part9 (238 G)
chr8 part1 (354 G) part2 (309 G) part3 (303 G) part4 (265 G) part5 (274 G) part6 (247 G) part7 (261 G) part8 (214 G) part9 (177 G)
chr9 part1 (352 G) part2 (308 G) part3 (247 G) part4 (278 G) part5 (263 G) part6 (301 G) part7 (226 G) part8 (146 G)
chr10 part1 (367 G) part2 (337 G) part3 (296 G) part4 (282 G) part5 (280 G) part6 (245 G) part7 (233 G) part8 (258 G) part9 (45 G)
chr11 part1 (363 G) part2 (309 G) part3 (290 G) part4 (266 G) part5 (287 G) part6 (306 G) part7 (232 G) part8 (239 G) part9 (10 G)
chr12 part1 (386 G) part2 (323 G) part3 (259 G) part4 (278 G) part5 (290 G) part6 (271 G) part7 (242 G) part8 (256 G) part9 (62 G)
chr13 part1 (307 G) part2 (326 G) part3 (335 G) part4 (327 G) part5 (306 G) part6 (244 G) part7 (123 G)
chr14 part1 (356 G) part2 (363 G) part3 (306 G) part4 (235 G) part5 (292 G) part6 (149 G)
chr15 part1 (322 G) part2 (328 G) part3 (322 G) part4 (262 G) part5 (259 G)
chr16 part1 (347 G) part2 (327 G) part3 (276 G) part4 (308 G) part5 (259 G) part6 (120 G)
chr17 part1 (330 G) part2 (281 G) part3 (273 G) part4 (263 G) part5 (310 G) part6 (19 G)
chr18 part1 (386 G) part2 (315 G) part3 (337 G) part4 (264 G) part5 (320 G)
chr19 part1 (417 G) part2 (320 G) part3 (286 G) part4 (228 G)
chr20 part1 (352 G) part2 (285 G) part3 (281 G) part4 (300 G) part5 (06 G)
chr21 part1 (329 G) part2 (395 G) part3 (290 G)
chrX part1 (592 G) part2 (284 G) part3 (285 G) part4 (274 G) part5 (280 G) part6 (309 G) part7 (227 G) part8 (261 G) part9 (228 G)
chrY part1 (584 G)
chrM part1 (33 G)
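
Since the same FAST5 file can appear in more than one chromosome archive, a combined file list can be made non-redundant by filename. A minimal sketch, assuming the archives have been unpacked and that identical basenames always denote the same read file (the directory layout in the usage note is hypothetical):

```python
import posixpath


def unique_by_filename(paths):
    """Keep the first path seen for each distinct file name."""
    seen = set()
    unique = []
    for path in paths:
        name = posixpath.basename(path)
        if name not in seen:
            seen.add(name)
            unique.append(path)
    return unique
```

For example, given `chr1/part1/read_a.fast5` and `chr2/part1/read_a.fast5`, only the first path is kept, and input order is otherwise preserved.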

Alternative basecalls

Scrappie

De novo assemblies

MHC haplotypes

Read lengths

Cellular library read length distribution

Figure: A typical read length distribution from a flowcell where we have run a cell-extracted DNA library. The y-axis shows the count of bases. Mean read length ~8.6kb with N50 of ~12.5kb (vertical line). Reads longer than 60kb are not expected due to limitations of the QIAGEN extraction kit employed.

Acknowledgements

We would like to acknowledge the support of Oxford Nanopore Technologies in generating this dataset, with particular thanks to Rosemary Dokos, Oliver Hartwell, Jonathan Pugh and Clive Brown. We would like to thank Radoslaw Poplawski and Simon Thompson for technical assistance with configuring and optimising the CLIMB platform file system. We are grateful to Angel Pizarro and Jed Sundwall at Amazon Web Services for hosting this dataset as an AWS Open Data set.