
Genomic Features metrics

Clay McLeod edited this page Sep 24, 2022 · 7 revisions

The Genomic Features metrics facet reports statistics about the genomic features contained in a GFF file (e.g., GENCODE GFF). Currently, it only supports counting records within gene regions (intronic, exonic, intergenic) and exonic translation regions (five prime UTR, three prime UTR, coding sequence). The report is delivered under the features key within the results.json file. You can examine the output of this facet using jq:

cat results.json | jq .features

IMPORTANT: to enable this facet, you must provide the -f/--feature-gff flag! The facet is automatically enabled when this flag is provided. Otherwise, it is disabled and the value for features in results.json will be null.
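For scripted checks, the same lookup can be done in Python. The sketch below assumes only what is stated above: a results.json file whose features key is null when the facet is disabled. The function name load_features is illustrative and not part of the tool.

```python
import json


def load_features(path):
    """Return the `features` section of results.json, or None if the facet was disabled."""
    with open(path) as handle:
        results = json.load(handle)
    # JSON null is parsed as None; a missing key is treated the same way here.
    return results.get("features")
```

Calling load_features("results.json") and checking the result against None is enough to tell whether the run was started with -f/--feature-gff.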

Overview

First, an interval lookup data structure is built using rust_lapper to store all of the genomic features from the GFF. Next, each record is processed as follows:

  • If the record is unmapped, the record is ignored.
  • The cigar string is used to calculate the length of the record.
  • The start and end genomic coordinates of the record are then used to find all features which intersect with the record.
    • For records that fall within a 5' UTR, 3' UTR, or CDS, the appropriate counters are incremented. Note that a record may span more than one of these categories, but it will only increment each category by one at most.
    • The intronic, exonic, and intergenic counters are incremented appropriately for the record. Here, records are classified as one and only one category:
      • If the record falls within a gene and within an exon, it's classified as exonic.
      • If the record falls within a gene but outside an exon, it's classified as intronic.
      • If the record falls outside a gene, it's classified as intergenic.
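
The length and lookup steps above can be sketched in Python. This is a minimal illustration, not the tool's implementation: it assumes that "length" means the number of reference bases the alignment spans, uses 0-based half-open coordinates, and replaces the rust_lapper interval structure with a linear scan. The function names and the (start, stop, kind) tuple layout are hypothetical.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
REFERENCE_CONSUMING = set("MDN=X")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_length(cigar):
    """Number of reference bases covered by an alignment with this CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REFERENCE_CONSUMING)


def overlapping_features(features, start, end):
    """Return features whose half-open interval intersects [start, end).

    Each feature is a (start, stop, kind) tuple. A linear scan stands in for
    the interval lookup structure; it is slower but returns the same matches.
    """
    return [f for f in features if f[0] < end and start < f[1]]


# Hypothetical features and a record starting at position 100.
features = [(0, 2000, "gene"), (0, 120, "exon"), (1150, 1300, "exon"), (5000, 6000, "gene")]
record_start = 100
record_end = record_start + reference_length("50M1000N50M")
hits = overlapping_features(features, record_start, record_end)
```

Here the record spans positions 100 to 1200, so hits contains the first gene and both exons but not the gene at 5000.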

Outputs

This facet has the following top-level keys:

Key                          Description
exonic_translation_regions   Contains metrics about which exonic translation regions a record overlaps with.
gene_regions                 Contains the counts of exonic, intronic, and intergenic records.
records                      Contains statistics regarding how many records were processed, how many records were ignored, and for what reasons.
summary                      Contains summary statistics regarding this QC facet, most notably percentages regarding how many records were ignored and for what reasons.

Exonic translation regions

Contains metrics about which exonic translation regions a record overlaps with. Namely, a record can overlap with an untranslated region (at either the 5' or 3' end) or a coding sequence.
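
The "at most once per category" rule from the overview can be illustrated with a short Python sketch. The feature type names below are the GFF3 types used in GENCODE annotations and are an assumption about the input. The function name and input shape are hypothetical, not the tool's API.

```python
from collections import Counter

# GFF3 feature types as used in GENCODE annotations; treat these names as assumptions.
TRANSLATION_TYPES = {"five_prime_UTR", "three_prime_UTR", "CDS"}


def count_translation_regions(records):
    """Tally translation-region overlaps, at most once per category per record.

    `records` is an iterable in which each item lists the feature types
    that one record overlaps.
    """
    counts = Counter()
    for overlapped_types in records:
        # A set removes duplicates, so overlapping two CDS features still counts once.
        counts.update(set(overlapped_types) & TRANSLATION_TYPES)
    return counts
```

A record that overlaps both a 5' UTR and a CDS increments both counters, so the category totals can sum to more than the number of records.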

Gene regions

Contains the counts for exonic, intronic, and intergenic records. As outlined in the overview, each record is currently assigned to exactly one category for this metric:

  • If any part of the record falls within a gene and within an exon, it's classified as exonic.
  • If any part of the record falls within a gene but outside an exon, it's classified as intronic.
  • If the record falls outside a gene, it's classified as intergenic.
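
The three rules above amount to a small decision function. The Python sketch below is one reading of them, not the tool's code: it assumes the overlapping features have already been found and reduced to their GFF types ("gene" and "exon" are assumed type names), and the function name is hypothetical.

```python
def classify_gene_region(overlapped_types):
    """Assign exactly one gene-region label from the feature types a record overlaps."""
    types = set(overlapped_types)
    if "gene" not in types:
        # No gene overlap at all, regardless of any other feature types present.
        return "intergenic"
    if "exon" in types:
        return "exonic"
    # Inside a gene but touching no exon.
    return "intronic"
```

Because exonic is checked before intronic, a record that straddles an exon boundary is counted as exonic in this sketch.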

Records

Contains metrics regarding how many records were processed, how many records were ignored, and the reasons they were ignored.

Summary

Contains summary statistics about which records were ignored and why.