SQL Schema
lohrd edited this page Jul 21, 2020 · 14 revisions
The following image shows the format of the summary report that is generated by each Serratus run:
The SQL Schema for accessing each summary report is made up of four tables: Runs, FamilySections, AccessionSections, and FastaSections.
'Runs' corresponds to the first line of the summary file, which holds the SRA, the reference genome, and the date. This table has a one-to-many relationship with each of the three following tables, which link back to it through the SRA and the auto-generated PK RunId.
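As a minimal sketch of this one-to-many relationship, the following uses Python's built-in sqlite3 module with an in-memory database. Only the table names, `RunId`, `Sra`, and `Family` come from this page; the column types, the `Genome` and `RunDate` column names, the sample values, and the choice of SQLite are assumptions made for illustration. The child table is trimmed to its key columns.

```python
import sqlite3

# Hypothetical DDL: column types and the Genome/RunDate names are assumptions.
SCHEMA = """
CREATE TABLE Runs (
    RunId   INTEGER PRIMARY KEY AUTOINCREMENT,  -- auto-generated PK
    Sra     TEXT NOT NULL,                      -- SRA accession of the run
    Genome  TEXT,                               -- reference genome (assumed name)
    RunDate TEXT                                -- date of the run (assumed name)
);
CREATE TABLE FamilySections (
    FamilySectionId INTEGER PRIMARY KEY AUTOINCREMENT,
    RunId  INTEGER NOT NULL REFERENCES Runs(RunId),  -- many rows per run
    Sra    TEXT NOT NULL,   -- copied from Runs; kept as a plain column here
    Family TEXT
);
"""

def build_demo():
    """Create the two tables and load one run with two family rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    cur = conn.execute(
        "INSERT INTO Runs (Sra, Genome, RunDate) VALUES (?, ?, ?)",
        ("SRR0000001", "example_genome", "2020-07-21"),  # made-up values
    )
    run_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO FamilySections (RunId, Sra, Family) VALUES (?, ?, ?)",
        [(run_id, "SRR0000001", "FamilyA"), (run_id, "SRR0000001", "FamilyB")],
    )
    conn.commit()
    return conn

def families_per_run(conn):
    """Count the child rows attached to each run through RunId."""
    return conn.execute(
        "SELECT r.Sra, COUNT(f.FamilySectionId) "
        "FROM Runs r LEFT JOIN FamilySections f ON f.RunId = r.RunId "
        "GROUP BY r.RunId, r.Sra"
    ).fetchall()
```

Calling `families_per_run(build_demo())` returns one row per run with the number of family rows linked to it. AccessionSections and FastaSections would reference Runs in the same way.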
'FamilySections' corresponds to the next section of the summary report, where the data for the pan-genome is present. The columns present on this table are as follows:
- FamilySectionId: This is the PK for the table, autogenerated when entered into the database.
- FamilySectionLineId: This is a number indicating the position of the family line in the summary file.
- RunId: This is a FK linking back to the Runs table.
- Sra: This is also a FK linking back to the Runs table, added here for easier query construction
- Family: This is the name of the family of the pan-genome that is being analyzed
- Score: This is the score given for the quality of the alignment
- PctId: This is the percent identity of the aligned sequences (with respect to the reference genome)
- Aln: This is the number of aligned reads
- Glb: This is the number of globally aligned reads
- PanLen: This is the pangenome length
- Cvg: This is the coverage cartoon generated, giving a picture of the quality of alignment throughout the specific sequence
- Top: This is the top accession
- TopAln: This is the number of reads aligned to the top accession
- TopName: This is the study name linked to the top accession
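The column list above can be sketched as a table definition plus a lookup by SRA. This is an illustration only, written with Python's built-in sqlite3 module: the column names come from the list, while every column type, the trimmed `Runs` stub, the helper names `open_db`, `demo`, and `families_for_sra`, and all sample values are assumptions rather than the project's actual schema.

```python
import sqlite3

# Hypothetical DDL: column names follow the list above, types are assumptions.
SCHEMA = """
CREATE TABLE Runs (
    RunId INTEGER PRIMARY KEY AUTOINCREMENT,
    Sra   TEXT NOT NULL
);
CREATE TABLE FamilySections (
    FamilySectionId     INTEGER PRIMARY KEY AUTOINCREMENT,
    FamilySectionLineId INTEGER NOT NULL,  -- position of the line in the summary
    RunId   INTEGER NOT NULL REFERENCES Runs(RunId),
    Sra     TEXT NOT NULL,                 -- duplicated from Runs for easy filtering
    Family  TEXT NOT NULL,
    Score   INTEGER,
    PctId   INTEGER,
    Aln     INTEGER,
    Glb     INTEGER,
    PanLen  INTEGER,
    Cvg     TEXT,                          -- coverage cartoon stored as text
    Top     TEXT,
    TopAln  INTEGER,
    TopName TEXT
);
"""

INSERT_FAMILY = """
INSERT INTO FamilySections
    (FamilySectionLineId, RunId, Sra, Family, Score, PctId, Aln, Glb,
     PanLen, Cvg, Top, TopAln, TopName)
VALUES
    (:line, :run_id, :sra, :family, :score, :pctid, :aln, :glb,
     :panlen, :cvg, :top, :topaln, :topname)
"""

def open_db():
    """Create an in-memory database holding the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def demo():
    """Load one run with two made-up family lines."""
    conn = open_db()
    run_id = conn.execute(
        "INSERT INTO Runs (Sra) VALUES (?)", ("SRR0000001",)
    ).lastrowid
    base = {"run_id": run_id, "sra": "SRR0000001", "glb": 0, "panlen": 30000,
            "cvg": "placeholder", "top": "ACC0001", "topaln": 10,
            "topname": "placeholder name"}
    conn.execute(INSERT_FAMILY, {**base, "line": 1, "family": "FamilyA",
                                 "score": 90, "pctid": 97, "aln": 1200})
    conn.execute(INSERT_FAMILY, {**base, "line": 2, "family": "FamilyB",
                                 "score": 20, "pctid": 85, "aln": 15})
    conn.commit()
    return conn

def families_for_sra(conn, sra, min_score=0):
    """Return (Family, Score, PctId, Aln) rows for one SRA, best score first."""
    return conn.execute(
        "SELECT Family, Score, PctId, Aln FROM FamilySections "
        "WHERE Sra = ? AND Score >= ? "
        "ORDER BY Score DESC, FamilySectionLineId",
        (sra, min_score),
    ).fetchall()
```

Because `Sra` is stored on FamilySections as well as on Runs, a lookup such as `families_for_sra(demo(), "SRR0000001", min_score=50)` filters on the accession directly without a join to Runs.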