Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--genfile its value and generating the file #67

Open
GuyReeves opened this issue Jun 23, 2022 · 3 comments
Open

--genfile its value and generating the file #67

GuyReeves opened this issue Jun 23, 2022 · 3 comments

Comments

@GuyReeves
Copy link

Hi

I am thinking about trying the option --genfile. As of my 7500 samples 50 of them have a sequencing coverage of >x15.

I was thinking of taking the file from --outputInputInVCFFormat = TRUE. and using the data from only the 50 high coverage samples to make the required file ( tab separated column, with 0 = hom ref, 1 = het, 2 = hom alt and NA).

Does that sound like a plan that might work? The number of reads removed during the "generate inputs" look pretty modest for all samples.

Do you think that --genfile might be an option work trying? I have a large number of trios and my Mendelian error rate is very low; I was jus curious if it might be further improved - if it is easy for me to have a go--.
Thanks
Guy

@rwdavies
Copy link
Owner

This would work, though perhaps it's cleanest from a software evaluation perspective to use external software. I normally use the good old fashioned GATK 3 UnifiedGenotyper (now many years old!), as it is fast, and I can tell it to genotype sites given specific reference and alt alleles. I assume HaplotypeCaller can do the same thing, but am not entirely sure. I think samtools/bcftools can do the same thing. But this should work as well (though to be clear on phrasing: are you suggesting (2 = hom alt) (NA = missing) or (2 = (hom alt or missing)), I assume the former, thought phrasing unclear

@GuyReeves
Copy link
Author

Hi

Yes it totally makes sense to use a real genotype caller rather than (mistakenly) rely on sufficient coverage at ever genotype.
Particularly, as the --outputInputInVCFFormat does not really have anything to filter out weak sites
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">

I actually have individual .vcf files, so I just need to merge them, I was trying to be too lazy.

As far as phrasing, I took tthe text from the STITCH help, I understand it as 2 = hom alt and NA i= indicating missing genotype (pretty close to the "vcftools --012" option , but where "-1" needs to be replaced by "NA")
Thanks

Guy

genfile
Path to gen file with high coverage results. Empty for no genfile. File has a header row with a name for each sample, matching what is found in the bam file. Each subject is then a tab seperated column, with 0 = hom ref, 1 = het, 2 = hom alt and NA indicating missing genotype, with rows corresponding to rows of the posfile. Note therefore this file has one more row than posfile which has no header

@rwdavies
Copy link
Owner

I mean, STITCH should do what GATK3 UG does, I think exactly, though I'm probably missing some things. And yeah, STITCH won't filter sites, so if you want annotations to do that, you'll need something else.

OK cool re: NA, sounds good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants