lohrd edited this page Jul 21, 2020 · 14 revisions

SQL Schema for Summary Reports

The following image shows the format of the summary report that is generated by each Serratus run:

[Image: Summary report example]

The SQL Schema for accessing each summary report is made up of four tables: Runs, FamilySections, AccessionSections, and FastaSections.

Runs

'Runs' corresponds to the first line of the summary file, which holds the SRA, reference genome, and date. This table has a one-to-many relationship with each of the three following tables, all linked by the SRA and the auto-generated primary key (PK) RunId.
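
As a rough illustration of the description above, the following is a minimal sketch of what the Runs table could look like, written against SQLite through Python's standard sqlite3 module. Only RunId and Sra are named on this page; the Genome and RunDate column names, all column types, and the sample values are assumptions made for the example and may not match the actual Serratus database.

```python
import sqlite3

# Hypothetical DDL: only RunId and Sra are named on this page.
# Genome, RunDate, and every column type are assumptions.
RUNS_DDL = """
CREATE TABLE Runs (
    RunId   INTEGER PRIMARY KEY AUTOINCREMENT,  -- auto-generated PK
    Sra     TEXT NOT NULL,                      -- SRA from the first summary line
    Genome  TEXT,                               -- reference genome (assumed name)
    RunDate TEXT                                -- date of the run (assumed name/type)
)
"""


def create_runs(conn):
    """Create the Runs table on the given connection."""
    conn.execute(RUNS_DDL)


def insert_run(conn, sra, genome, run_date):
    """Insert one run and return the RunId the database generated for it."""
    cur = conn.execute(
        "INSERT INTO Runs (Sra, Genome, RunDate) VALUES (?, ?, ?)",
        (sra, genome, run_date),
    )
    return cur.lastrowid


# Placeholder values, not taken from a real summary file.
conn = sqlite3.connect(":memory:")
create_runs(conn)
run_id = insert_run(conn, "SRR0000001", "example_genome", "2020-07-21")
row = conn.execute(
    "SELECT Sra, Genome, RunDate FROM Runs WHERE RunId = ?", (run_id,)
).fetchone()
```

The returned RunId is the value that rows in the other three tables would carry as their foreign key.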

FamilySections

'FamilySections' corresponds to the next section of the summary report, which holds the pan-genome data. The columns on this table are as follows:

  • FamilySectionId: This is the PK for the table, auto-generated when the row is entered into the database.
  • FamilySectionLineId: This is a number indicating the position of the family line in the summary file.
  • RunId: This is a FK linking back to the Runs table.
  • Sra: This is also a FK linking back to the Runs table, added here for easier query construction.
  • Family: This is the name of the pan-genome family that is being analyzed.
  • Score: This is the score given for the quality of the alignment.
  • PctId: This is the percent identity of the aligned sequences (with respect to the reference genome).
  • Aln: This is the number of aligned reads.
  • Glb: This is the number of globally aligned reads.
  • PanLen: This is the pan-genome length.
  • Cvg: This is the generated coverage cartoon, which gives a picture of the alignment quality along the specific sequence.
  • Top: This is the top accession.
  • TopAln: This is the number of aligned reads for the top accession.
  • TopName: This is the study name associated with the top accession.
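
The column list above can be sketched as a table definition plus a simple query. This is a minimal example using SQLite through Python's standard sqlite3 module, not the project's actual DDL: the column types, the composite foreign key on (RunId, Sra), the trimmed-down Runs table, the helper names build_db and families_for_sra, and the sample values are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: column names follow the list above, but all types
# and constraints are assumptions. Runs is reduced to its two key columns.
SCHEMA = """
CREATE TABLE Runs (
    RunId INTEGER PRIMARY KEY AUTOINCREMENT,
    Sra   TEXT NOT NULL,
    UNIQUE (RunId, Sra)  -- lets child tables reference both columns together
);
CREATE TABLE FamilySections (
    FamilySectionId     INTEGER PRIMARY KEY AUTOINCREMENT,
    FamilySectionLineId INTEGER NOT NULL,  -- position of the line in the file
    RunId               INTEGER NOT NULL,
    Sra                 TEXT NOT NULL,     -- duplicated for easier queries
    Family              TEXT,
    Score               INTEGER,
    PctId               REAL,
    Aln                 INTEGER,
    Glb                 INTEGER,
    PanLen              INTEGER,
    Cvg                 TEXT,              -- coverage cartoon stored as text
    Top                 TEXT,
    TopAln              INTEGER,
    TopName             TEXT,
    FOREIGN KEY (RunId, Sra) REFERENCES Runs (RunId, Sra)
);
"""


def build_db():
    """Return an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(SCHEMA)
    return conn


def families_for_sra(conn, sra, min_score=0):
    """List the family lines of one SRA in summary-file order."""
    return conn.execute(
        "SELECT Family, Score, PctId, Aln FROM FamilySections "
        "WHERE Sra = ? AND Score >= ? ORDER BY FamilySectionLineId",
        (sra, min_score),
    ).fetchall()


# Placeholder values, not taken from a real summary file.
conn = build_db()
run_id = conn.execute(
    "INSERT INTO Runs (Sra) VALUES (?)", ("SRR0000001",)
).lastrowid
conn.execute(
    "INSERT INTO FamilySections "
    "(FamilySectionLineId, RunId, Sra, Family, Score, PctId, Aln) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, run_id, "SRR0000001", "ExampleFamily", 90, 97.0, 1200),
)
rows = families_for_sra(conn, "SRR0000001", min_score=50)
```

Because Sra is repeated on FamilySections, the query filters by SRA without joining back to Runs, which is the convenience the Sra column description refers to.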