This project parses a VCF file and writes a table of the fields that are of interest to a biologist.
This is an R script, so R must be installed on your computer. The script uses three third-party packages (httr, jsonlite, and magrittr), which must be installed as well.
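As a minimal sketch, the following checks whether the three packages are available before the script is run. The variable names `required`, `installed`, and `missing` are illustrative and do not come from the script. The install line is left commented out so that nothing is installed unintentionally.

```r
# Packages named above as dependencies of the script.
required <- c("httr", "jsonlite", "magrittr")

# requireNamespace() returns FALSE (instead of raising an error)
# when a package is not installed.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing <- required[!installed]

if (length(missing) > 0) {
  message("Missing packages: ", paste(missing, collapse = ", "))
  # install.packages(missing)  # uncomment to install them from CRAN
} else {
  message("All required packages are installed.")
}
```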
This script was written specifically for the coding challenge, so the input and output file names are hard-coded into the script:

    input_file <- "input_coding_challenge_final.vcf"
    output_file <- "output_vcf_useful_info.txt"

To use the script with other input files in the same format, uncomment the statements in the script that read the file names from the command-line arguments:

    # Read the input and output file names from the command line.
    args <- commandArgs(TRUE)
    if (length(args) < 2) {
        print("usage: Rscript parse_vcf.R <input_file> <output_file>")
        stop("incorrect arguments.")
    }
    input_file <- args[1]
    output_file <- args[2]

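The two modes above (hard-coded names versus command-line arguments) could also be combined so that nothing has to be commented or uncommented. The sketch below is one possible way to do that and is not part of the script; `resolve_files` is a hypothetical helper name, and falling back to the defaults when no arguments are given is an assumption about the desired behavior.

```r
# Hypothetical helper: choose the file names from the command-line
# arguments, falling back to the hard-coded challenge defaults when
# no arguments are supplied.
resolve_files <- function(args,
                          default_input = "input_coding_challenge_final.vcf",
                          default_output = "output_vcf_useful_info.txt") {
  if (length(args) == 0) {
    return(list(input_file = default_input, output_file = default_output))
  }
  if (length(args) < 2) {
    stop("usage: Rscript parse_vcf.R <input_file> <output_file>")
  }
  list(input_file = args[1], output_file = args[2])
}

# Example with explicit arguments (the file names here are made up).
files <- resolve_files(c("other_sample.vcf", "other_sample_table.txt"))
input_file <- files$input_file
output_file <- files$output_file
```

Inside the script, the helper would be called as `resolve_files(commandArgs(trailingOnly = TRUE))`.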
The coding challenge asks for each variant to be annotated with the following information:
- Variant type (e.g. insertion, deletion, etc.).
- Variant effect (e.g. missense, synonymous, etc.). Note: If multiple variant types exist in the ExAC database, annotate with the most deleterious possibility.
- Read depth at the site of variation.
- Number of reads supporting the variant.
- Percentage of reads supporting the variant versus those supporting reference reads.
- Allele frequency of variant.
- Any other information from ExAC that you feel might be relevant.
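
Three of the items above can be illustrated with small R helpers. This is a rough sketch under stated assumptions, not the script's actual implementation. The function names (`variant_type`, `read_support`, `most_deleterious`) are hypothetical, a single alternate allele per record is assumed, percentages are computed over reference plus variant reads only, and the `severity` vector is an abbreviated ranking that loosely follows the Ensembl consequence ordering.

```r
# Classify a variant by comparing the lengths of its REF and ALT alleles.
# Assumes one ALT allele; complex substitutions are not handled separately.
variant_type <- function(ref, alt) {
  if (nchar(ref) == nchar(alt)) {
    if (nchar(ref) == 1) "snp" else "mnp"
  } else if (nchar(ref) < nchar(alt)) {
    "insertion"
  } else {
    "deletion"
  }
}

# Percentage of reads supporting the variant and the reference allele.
# Assumption: the denominator is reference reads plus variant reads.
read_support <- function(ref_reads, alt_reads) {
  total <- ref_reads + alt_reads
  if (total == 0) {
    return(c(variant = NA_real_, reference = NA_real_))
  }
  c(variant = 100 * alt_reads / total,
    reference = 100 * ref_reads / total)
}

# Abbreviated severity ranking, most deleterious first (an assumption).
severity <- c("splice_acceptor_variant", "splice_donor_variant",
              "stop_gained", "frameshift_variant", "stop_lost",
              "start_lost", "inframe_insertion", "inframe_deletion",
              "missense_variant", "splice_region_variant",
              "synonymous_variant", "5_prime_UTR_variant",
              "3_prime_UTR_variant", "intron_variant",
              "intergenic_variant")

# Pick the most deleterious effect among those reported for a variant.
# Effects missing from the ranking are ignored; NA if none are ranked.
most_deleterious <- function(effects) {
  ranks <- match(effects, severity)
  if (all(is.na(ranks))) {
    return(NA_character_)
  }
  effects[which.min(ranks)]
}

variant_type("A", "AT")                                      # "insertion"
read_support(ref_reads = 30, alt_reads = 10)                 # variant 25, reference 75
most_deleterious(c("synonymous_variant", "missense_variant"))  # "missense_variant"
```

The read counts themselves would come from the VCF record; which INFO or FORMAT keys hold them depends on the variant caller that produced the file, so that lookup is left out of the sketch.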