--genfile its value and generating the file #67

GuyReeves · 2022-06-23T14:17:21Z

Hi

I am thinking about trying the option --genfile. Of my 7500 samples, 50 have a sequencing coverage of >15x.

I was thinking of taking the file produced with --outputInputInVCFFormat=TRUE and using the data from only the 50 high-coverage samples to make the required file (tab-separated columns, with 0 = hom ref, 1 = het, 2 = hom alt and NA).

Does that sound like a plan that might work? The number of reads removed during the "generate inputs" step looks pretty modest for all samples.

Do you think that --genfile might be an option worth trying? I have a large number of trios and my Mendelian error rate is very low; I was just curious whether it might be further improved, if it is easy for me to have a go.
Thanks
Guy

rwdavies · 2022-06-24T07:54:04Z

This would work, though it is perhaps cleanest from a software evaluation perspective to use external software. I normally use the good old-fashioned GATK 3 UnifiedGenotyper (now many years old!), as it is fast and I can tell it to genotype sites given specific reference and alt alleles. I assume HaplotypeCaller can do the same thing, but am not entirely sure. I think samtools/bcftools can do the same thing. But this should work as well. To be clear on phrasing: are you suggesting (2 = hom alt) and (NA = missing), or (2 = hom alt or missing)? I assume the former, though the phrasing was unclear.

GuyReeves · 2022-06-24T11:33:18Z

Hi

Yes, it totally makes sense to use a real genotype caller rather than (mistakenly) rely on sufficient coverage at every site.
Particularly as the --outputInputInVCFFormat output does not really have anything to filter out weak sites:
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">

I actually have individual .vcf files, so I just need to merge them; I was being too lazy.

As for the phrasing, I took the text from the STITCH help. I understand it as 2 = hom alt and NA = missing genotype (pretty close to the "vcftools --012" option, but where "-1" needs to be replaced by "NA").
Thanks

Guy

genfile
Path to gen file with high coverage results. Empty for no genfile. File has a header row with a name for each sample, matching what is found in the bam file. Each subject is then a tab separated column, with 0 = hom ref, 1 = het, 2 = hom alt and NA indicating missing genotype, with rows corresponding to rows of the posfile. Note therefore this file has one more row than posfile which has no header.
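
For readers following along, the layout described in the help text can be sketched in Python. This is only an illustration under assumptions: the function names `gt_to_dosage` and `build_genfile_text` are hypothetical and not part of STITCH, the input is assumed to be biallelic VCF-style GT strings already ordered to match the rows of the posfile, and anything that is not a clean 0/1 diploid call is written as NA.

```python
import re


def gt_to_dosage(gt: str) -> str:
    """Map a biallelic VCF GT string to the genfile coding.

    "0/0" -> "0", "0/1" or "1|0" -> "1", "1/1" -> "2".
    Missing ("./.", ".") or non-biallelic calls ("1/2") -> "NA" (an assumption).
    """
    alleles = re.split(r"[/|]", gt)
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return "NA"
    return str(int(alleles[0]) + int(alleles[1]))


def build_genfile_text(sample_names, genotypes_by_site):
    """Return genfile contents: one header row, then one row per posfile site.

    genotypes_by_site is a list of rows; each row holds one GT string per
    sample, in the same order as sample_names.
    """
    lines = ["\t".join(sample_names)]
    for row in genotypes_by_site:
        if len(row) != len(sample_names):
            raise ValueError("each site needs one genotype per sample")
        lines.append("\t".join(gt_to_dosage(gt) for gt in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two hypothetical samples at two sites.
    print(build_genfile_text(["s1", "s2"], [["0/0", "0/1"], ["./.", "1|1"]]), end="")
```

The sample names in the header would need to match the names in the bam files, and the result has one more row than the posfile because of that header.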

rwdavies · 2022-06-24T12:27:21Z

I mean, STITCH should do what GATK3 UG does, I think exactly, though I'm probably missing some things. And yeah, STITCH won't filter sites, so if you want annotations to do that, you'll need something else.

OK cool re: NA, sounds good
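
The "vcftools --012" route mentioned earlier in the thread could be sketched as below. This is a rough illustration, not a tested pipeline: it assumes the `.012` matrix is tab separated with one row per individual and a leading index column (worth checking against your own output), that sample names come from the matching `.012.indv` file, and that the sites are already in posfile order. The function name `convert_012_to_genfile` is hypothetical.

```python
def convert_012_to_genfile(matrix_text, sample_names):
    """Transpose a vcftools-style 012 matrix into genfile layout.

    Input rows are individuals (first column assumed to be an index and
    dropped); output rows are sites, with "-1" rewritten as "NA".
    """
    rows = [line.split("\t") for line in matrix_text.splitlines() if line.strip()]
    calls = [row[1:] for row in rows]  # drop the assumed leading index column
    if len(calls) != len(sample_names):
        raise ValueError("number of matrix rows must match number of samples")
    if len({len(c) for c in calls}) > 1:
        raise ValueError("all individuals must have the same number of sites")

    lines = ["\t".join(sample_names)]
    # zip(*calls) turns per-individual rows into per-site rows.
    for site_calls in zip(*calls):
        lines.append("\t".join("NA" if c == "-1" else c for c in site_calls))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two hypothetical individuals, three sites.
    example = "0\t0\t1\t-1\n1\t2\t-1\t0\n"
    print(convert_012_to_genfile(example, ["a", "b"]), end="")
```

Only 0, 1, 2 and NA should appear in the output; anything else in the matrix is passed through unchanged here, so a stricter version might reject it.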
