
Pisces VCF Specifications

tamsen edited this page Nov 12, 2016 · 1 revision

The following specification document is valid for both somatic VCF and gVCF formatted files.

SDS ID Specification
VCF-1 The application shall write a header section at the top of the VCF file with the following header lines. These lines shall have the format “##{key}={value}”. The keys and their descriptions are given below:
Key | Description
fileformat | Version of the VCF format, which is “VCFv4.1”.
fileDate | Date in YYYYMMDD format.
Source | Application name and version, e.g. “CallSomaticVariants 1.0.0.0”.
CallSomaticVariants_cmdline | Command line call for the program, including all arguments.
Reference | File name of the reference genome FASTA file.
INFO | Description of INFO fields used in the file. There is one INFO header line for each field in the file.
FILTER | Description of FILTER fields used in the file. There is one FILTER header line for each field in the file.
FORMAT | Description of FORMAT fields used in the file. There is one FORMAT header line for each field in the file.
contig | List of processed chromosomes and their lengths. There is one contig header line for each chromosome. The format is “##contig=<ID={chrName},length={length}>”.
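As an illustration of VCF-1, the following is a minimal Python sketch of how the key/value header lines could be assembled. The function name `build_meta_header` and its parameters are hypothetical, the key casing is copied from the table above, and the descriptive INFO, FILTER and FORMAT lines are left out here. It is not taken from the application's source.

```python
from datetime import date

def build_meta_header(source, cmdline, reference, contigs, today=None):
    """Return the '##{key}={value}' lines listed in VCF-1 (hypothetical helper)."""
    today = today or date.today()
    lines = [
        "##fileformat=VCFv4.1",
        f"##fileDate={today:%Y%m%d}",               # YYYYMMDD
        f"##Source={source}",                       # key casing as written in the table
        f"##CallSomaticVariants_cmdline={cmdline}",
        f"##Reference={reference}",
    ]
    # One contig line per processed chromosome, closed with '>' as in standard VCF.
    lines += [f"##contig=<ID={name},length={length}>" for name, length in contigs]
    return lines

# Example values are made up for illustration.
header = build_meta_header(
    source="CallSomaticVariants 1.0.0.0",
    cmdline="(command line as invoked)",
    reference="genome.fa",
    contigs=[("chr1", 249250621), ("chrM", 16571)],
    today=date(2016, 11, 12),
)
print("\n".join(header))
```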
SDS ID Specification
VCF-2 The application shall write the following INFO and FORMAT lines to the VCF header, if the associated configuration rule is satisfied. These lines shall have the format: “##{Key}=<ID={FieldName},Number={Number},Type={Type},Description={Description}>”.
Key | Field name | Number | Type | Description | Configuration Rule
INFO | DP | 1 | Integer | Total Depth | None
FORMAT | GT | 1 | String | Genotype | None
FORMAT | GQ | 1 | Integer | Genotype Quality | None
FORMAT | AD | . | Integer | Allele Depth | None
FORMAT | DP | . | Integer | Total Depth Used For Variant Calling | None
FORMAT | VF | . | Float | Variant Frequency. One number if 0/0 or 0/1. Two numbers for 1/2 | None
FORMAT | NL | 1 | Integer | Applied BaseCall Noise Level | Debug mode enabled, or outputting bias files, or strand bias threshold < 1
FORMAT | SB | 1 | Float | StrandBias Score | Debug mode enabled, or outputting bias files, or strand bias threshold < 1
FORMAT | NC | 1 | Float | Fraction of bases which were uncalled or with basecall quality below the minimum threshold | Report no calls enabled
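The configuration rules in VCF-2 can be sketched as follows. This is a hedged Python illustration only: the `Config` class and its attribute names are hypothetical stand-ins for the application's settings, and the double quotes around the description follow the usual VCF 4.1 convention rather than anything stated in this table.

```python
from dataclasses import dataclass

@dataclass
class Config:
    """Hypothetical settings object; attribute names are assumptions."""
    debug_mode: bool = False
    output_bias_files: bool = False
    strand_bias_threshold: float = 1.0
    report_no_calls: bool = False

# Fields whose configuration rule is "None" are always written.
ALWAYS = [
    ("INFO", "DP", "1", "Integer", "Total Depth"),
    ("FORMAT", "GT", "1", "String", "Genotype"),
    ("FORMAT", "GQ", "1", "Integer", "Genotype Quality"),
    ("FORMAT", "AD", ".", "Integer", "Allele Depth"),
    ("FORMAT", "DP", ".", "Integer", "Total Depth Used For Variant Calling"),
    ("FORMAT", "VF", ".", "Float", "Variant Frequency"),
]

def field_header_lines(cfg):
    """Return the INFO and FORMAT header lines selected by the rules in VCF-2."""
    rows = list(ALWAYS)
    if cfg.debug_mode or cfg.output_bias_files or cfg.strand_bias_threshold < 1:
        rows.append(("FORMAT", "NL", "1", "Integer", "Applied BaseCall Noise Level"))
        rows.append(("FORMAT", "SB", "1", "Float", "StrandBias Score"))
    if cfg.report_no_calls:
        rows.append(("FORMAT", "NC", "1", "Float",
                     "Fraction of bases which were uncalled or with basecall "
                     "quality below the minimum threshold"))
    return [
        f'##{key}=<ID={fid},Number={num},Type={typ},Description="{desc}">'
        for key, fid, num, typ, desc in rows
    ]

default_lines = field_header_lines(Config())
print("\n".join(default_lines))
```

With the default settings assumed here, only the six unconditional lines are produced; enabling `report_no_calls` adds the NC line, and a strand bias threshold below 1 adds NL and SB.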
SDS ID Specification
VCF-3 The application shall write the following FILTER lines to the VCF header, if the associated configuration rule is satisfied. FILTER lines shall have the format “##{Key}=<ID={FieldName},Description={Description}>”.
Key | FieldName | Description | Configuration Rule
FILTER | q{threshold}, e.g. “q20” | Quality below {threshold} | Minimum variant score configured > 0.
FILTER | LowDP | Low coverage (DP tag), therefore no genotype called | Minimum coverage configured > 0.
FILTER | SB | One of the following, depending on the rule: | Three possible rules:
FILTER | SB | A) Variant strand bias too high | A) Strand bias threshold configured > 0.
FILTER | SB | B) Variant support on only one strand | B) Filter variants on only one strand enabled.
FILTER | SB | C) Variant strand bias too high or coverage on only one strand | C) Both A and B apply.
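The selection of FILTER header lines in VCF-3 might look like the sketch below. The function and parameter names are hypothetical. Treating description C as the case where both strand-related settings are active is an assumption made for this example, as are the quoted descriptions (standard VCF 4.1 style).

```python
def filter_header_lines(min_qscore=0, min_coverage=0,
                        sb_threshold=0.0, single_strand_filter=False):
    """Return the '##FILTER=...' header lines for the given (hypothetical) settings."""
    filters = []
    if min_qscore > 0:
        filters.append((f"q{min_qscore}", f"Quality below {min_qscore}"))
    if min_coverage > 0:
        filters.append(("LowDP", "Low coverage (DP tag), therefore no genotype called"))
    # At most one SB line is written; its description depends on the active rules.
    if sb_threshold > 0 and single_strand_filter:      # assumed meaning of case C
        filters.append(("SB", "Variant strand bias too high or coverage on only one strand"))
    elif sb_threshold > 0:                             # case A
        filters.append(("SB", "Variant strand bias too high"))
    elif single_strand_filter:                         # case B
        filters.append(("SB", "Variant support on only one strand"))
    return [f'##FILTER=<ID={fid},Description="{desc}">' for fid, desc in filters]

print("\n".join(filter_header_lines(min_qscore=20, min_coverage=10, sb_threshold=0.5)))
```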
SDS ID Specification
VCF-4 The application shall write a data section to the VCF file as a tab-delimited table below the header section.
VCF-5 The application shall write a column header line, as below, at the top of the data section of a VCF file. The column header line shall be prefixed by a single “#” and have the following format. The SampleName is set to the input BAM file name (without extension).
#CHROM POS    ID     REF    ALT    QUAL   FILTER INFO   FORMAT {SampleName}
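A short Python sketch of VCF-5 follows, showing one way to derive the sample name from the BAM file name and build the tab-delimited column header line. The helper name `column_header` and the example path are hypothetical.

```python
from pathlib import PurePosixPath

COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]

def column_header(bam_path):
    """Return the column header line, with the BAM file name (no extension) as sample name."""
    # .stem drops only the final extension, so "a.sorted.bam" becomes "a.sorted".
    sample_name = PurePosixPath(bam_path).stem
    return "\t".join(COLUMNS + [sample_name])

print(column_header("/data/run1/Sample1.bam"))
```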
SDS ID Specification
VCF-6 In the default VCF mode, after the column header line, the data section of a VCF file shall have one line per variant allele.
VCF-7 If gVCF mode is selected, after the column header line, the data section of a VCF file shall have one line per allele (reference or variant).
VCF-8 If CrushVcf mode is selected, after the column header line, the data section of a VCF file shall have one line per genomic locus.
VCF-9 For each data line item, the application shall write the following values to the data section of a VCF file, as below:
Column Name | Value
CHROM | Chromosome name
POS | Reference position
ID | Source ID for variant, always “.”. This column is provided for downstream annotators to update as appropriate.
REF | Reference allele
ALT | Alternate (variant) allele
QUAL | Variant Quality Score
FILTER | “PASS” if no filters. Otherwise, semicolon-separated list of filter names, e.g. “LowDP;SB”.
INFO | Semicolon-separated list of INFO name and value pairs, in the format “{name}={value}”. Currently only the DP INFO field is supported, e.g. “DP=500”.
FORMAT | Colon-separated list of field names, e.g. “GT:GQ:AD”.
{SampleName} | Colon-separated list of FORMAT field values.
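To make VCF-9 concrete, here is a minimal Python sketch that assembles one tab-delimited data line. The function name `data_line` is hypothetical and all values in the example call are made up. Only the single-filter and single-INFO-field case is shown.

```python
def data_line(chrom, pos, ref, alt, qual, depth, sample_fields, filter_col="PASS"):
    """Build one data line (hypothetical helper).

    sample_fields: ordered (name, value) pairs for the FORMAT and sample columns.
    """
    format_col = ":".join(name for name, _ in sample_fields)
    sample_col = ":".join(str(value) for _, value in sample_fields)
    columns = [
        chrom, str(pos),
        ".",                # ID is always "."
        ref, alt, str(qual),
        filter_col,         # "PASS" when no filter applies
        f"DP={depth}",      # DP is the only INFO field
        format_col, sample_col,
    ]
    return "\t".join(columns)

# Illustrative values only.
line = data_line("chr7", 140453136, "A", "T", 100, 500,
                 [("GT", "0/1"), ("GQ", 100), ("AD", "450,50")])
print(line)
```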
SDS ID Specification
VCF-10 For each data line item, the application shall write the following FORMAT fields to the data section of a VCF file, if the associated configuration rule is satisfied:
FORMAT Field Name | Field Value | Configuration Rule
GT | Genotype | None
GQ | Genotype Quality Score | None
AD | If variant call, value is “{X},{Y}” where X is the reference depth and Y is the allele depth. If reference call, value is allele depth | None
DP | Total coverage depth used in variant calling | None
VF | Variant frequency | None
NL | Estimated basecall quality | Debug mode enabled, or outputting bias files, or strand bias threshold < 1
SB | Strand bias score | Debug mode enabled, or outputting bias files, or strand bias threshold < 1
NC | No call frequency or fraction | Report no calls enabled
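The AD rule and the colon-separated layout in VCF-10 can be sketched in Python as below. The helper names `ad_value` and `sample_columns` are hypothetical, the numbers are illustrative, and the caller is assumed to have already decided which of NL, SB and NC apply under the configuration rules.

```python
def ad_value(allele_depth, ref_depth=None):
    """AD value: '{X},{Y}' for a variant call, a single allele depth for a reference call."""
    if ref_depth is not None:
        # Variant call: reference depth first, then allele depth.
        return f"{ref_depth},{allele_depth}"
    return str(allele_depth)

def sample_columns(gt, gq, ad, dp, vf, optional=()):
    """Return (FORMAT column, sample column) as colon-separated strings.

    optional: (name, value) pairs for NL, SB and NC, pre-filtered by the caller.
    """
    fields = [("GT", gt), ("GQ", gq), ("AD", ad), ("DP", dp), ("VF", vf), *optional]
    names = ":".join(name for name, _ in fields)
    values = ":".join(str(value) for _, value in fields)
    return names, values

# Illustrative values only: a heterozygous call with "report no calls" enabled.
names, values = sample_columns("0/1", 100, ad_value(50, ref_depth=450), 500, 0.1,
                               optional=[("NC", 0.0)])
print(names)
print(values)
```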
