Day_3a_scavenger_hunt.md

File metadata and controls

134 lines (83 loc) · 4.5 KB

Human Pangenome Reference Consortium scavenger hunt

If you have extra time while waiting for tools to run, you can look for the answers to these questions in the HPRC's marker paper. This activity is entirely optional.

Questions

How many individuals' genomes were assembled in this phase of the HPRC, and how many will ultimately be assembled?

See answer

The current phase assembled 47 individuals, and the goal is to assemble a total of 350.

What assembly algorithm was used to construct the HPRC assemblies?

See answer

Trio-Hifiasm

What platforms were the sequencing data generated from?

See answer
  1. HiFi
  2. ONT ultralong
  3. Bionano optical maps
  4. Illumina (child and parents)
  5. 10x Genomics
  6. Hi-C (Dovetail Omni-C)

How do the HPRC assemblies compare to the GRCh38 reference in terms of contiguity?

See answer

Overall similar. Most of the assemblies are slightly less contiguous, but some are slightly more contiguous. However, this result does not factor in the scaffolding of GRCh38, which connects separated contigs that are believed to be on the same chromosome (by using a region of all N characters). With the scaffolding taken into account, GRCh38 would be almost as contiguous as the CHM13 assembly.
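To make the contig versus scaffold distinction concrete, here is a minimal Python sketch that computes N50 twice for the same sequences: once treating each scaffold as a single piece, and once after splitting at runs of N. The toy sequences and the helper names (`n50`, `contig_lengths`) are made up for illustration and are not taken from the paper.

```python
import re


def n50(lengths):
    """Return the length L such that pieces of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def contig_lengths(scaffold):
    """Split a scaffold at runs of N (gap placeholders) and return the contig lengths."""
    return [len(piece) for piece in re.split(r"[Nn]+", scaffold) if piece]


# Toy scaffolds (hypothetical sequences, not real data).
scaffolds = ["ACGT" * 5 + "N" * 10 + "ACGT" * 3, "ACGT" * 2]

scaffold_n50 = n50([len(s) for s in scaffolds])
contig_n50 = n50([n for s in scaffolds for n in contig_lengths(s)])

print("scaffold N50:", scaffold_n50)
print("contig N50:", contig_n50)
```

In this toy case the scaffold N50 is larger than the contig N50, because the N-filled gap joins two contigs into one longer piece.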

Approximately what fraction of the bases in the assemblies did Flagger identify as dubious?

See answer

Most are <1%

How many misassemblies were corrected manually?

See answer

3 haplotypes in 3 separate samples: HG01358, HG01123, and HG002

How would you interpret the 4 different categories shown by the colors in Figure 1j?

See answer

These categories could have different interpretations depending on whether they are real or artifactual. If they are real, the "Not covered" category indicates a deletion, the "2 Alignments" and ">2 Alignments" categories indicate duplications, and the "Only 1 Alignment" category indicates no change in copy number relative to CHM13. All of these categories could also be artifactually produced by assembly errors.
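As a rough illustration of how such categories can arise, the sketch below counts how many alignments cover each reference position and bins the positions using the four labels from the figure. The interval coordinates and function names are hypothetical. This is not the paper's pipeline, only the counting idea.

```python
from collections import Counter


def depth_per_base(ref_length, alignments):
    """Count how many alignments (half-open [start, end) intervals) cover each position."""
    depth = [0] * ref_length
    for start, end in alignments:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


def classify(alignment_count):
    """Map an alignment count to one of the four category labels."""
    if alignment_count == 0:
        return "Not covered"
    if alignment_count == 1:
        return "Only 1 Alignment"
    if alignment_count == 2:
        return "2 Alignments"
    return ">2 Alignments"


# Hypothetical alignments of assembly contigs onto a 12-base reference region.
alignments = [(0, 6), (4, 10), (4, 8)]
categories = Counter(classify(d) for d in depth_per_base(12, alignments))
print(categories)
```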

What accounts for the discrepancy in the size of the paternal male haplotype in Figure 1c?

See answer

The paternal haplotype of a male sample carries a Y chromosome instead of an X, and the human Y chromosome is much smaller than the X chromosome.

How many genes had observed copy number changes?

See answer

3,210

How concordant are the PGGB and Minigraph-Cactus pangenomes' predictions of the number of variant loci among the HPRC assemblies?

See answer

Mostly concordant. PGGB identifies 21 million small variants to Minigraph-Cactus' 22 million. Likewise, PGGB identifies 73,000 structural variants to Minigraph-Cactus' 67,000.

What tool was used to identify variant loci in the pangenome graph?

See answer

vg deconstruct

How does the performance of the pangenomic methods differ between the entire genome and the Genome in a Bottle's "Challenging Medically Relevant Gene" regions?

See answer

The HPRC pangenome methods improve calling over alternatives in both benchmark sets, but the improvement is proportionally greater in the challenging medically relevant genes.

In the structural variant genotyping experiments, why are the haplotypes removed from the pangenome before genotyping them?

See answer

Leaving them in would "double dip" on the data: using it both to parameterize the model and to evaluate its performance. This artificially inflates performance by allowing the model to "memorize" the data. Leave-one-out analysis is a type of data-splitting design that mitigates this effect.
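The splitting logic behind a leave-one-out design can be sketched in a few lines of Python. The sample names and the `leave_one_out` helper below are hypothetical. The genotyping and scoring steps are deliberately left out, so this shows only how each evaluation target is kept out of the panel it is evaluated against.

```python
def leave_one_out(samples):
    """Yield (held_out, panel) pairs where the panel excludes the held-out sample."""
    for i, held_out in enumerate(samples):
        panel = samples[:i] + samples[i + 1:]
        yield held_out, panel


# Hypothetical sample identifiers, for illustration only.
samples = ["sampleA", "sampleB", "sampleC", "sampleD"]

splits = list(leave_one_out(samples))
for held_out, panel in splits:
    # In a real experiment: build the pangenome from `panel`,
    # genotype `held_out` against it, then compare to the held-out truth.
    print(held_out, "evaluated against", panel)
```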

What accounts for the spike in indels of size ~300 base pairs in Figure 6e?

See answer

This is the length of the Alu SINE transposon, which makes up >10% of the human genome
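A length histogram like the one in the figure can be approximated by taking the difference between the REF and ALT allele lengths of each variant and binning the result. The sketch below uses made-up alleles and a hypothetical `length_histogram` helper, with two events near 300 bp standing in for Alu insertions or deletions.

```python
from collections import Counter


def length_histogram(variants, bin_size=50):
    """Bin absolute indel sizes (|len(ALT) - len(REF)|); same-length alleles are skipped."""
    histogram = Counter()
    for ref, alt in variants:
        size = abs(len(alt) - len(ref))
        if size > 0:
            histogram[(size // bin_size) * bin_size] += 1
    return histogram


# Made-up (REF, ALT) allele pairs, not real calls.
variants = [
    ("A", "A" + "C" * 300),  # 300 bp insertion
    ("A" + "G" * 310, "A"),  # 310 bp deletion
    ("A", "AT"),             # 1 bp insertion
    ("A", "G"),              # SNV, no length change
]

histogram = length_histogram(variants)
print(sorted(histogram.items()))
```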

How many novel structural variants were called by PanGenie using the HPRC data?

See answer

Trick question: none. PanGenie genotypes known, previously observed SVs. It does not discover SVs de novo.