bcftools view silently ignores long lines #1614
Comments
This comes from a limitation of the BCF format, which cannot represent lines longer than that. BCFtools and HTSlib use BCF as their internal format: even when the input is VCF, it is first converted into BCF structures. The big lines are usually caused by Number=G tags, such as genotype likelihoods (FORMAT/PL), so in practice we store their localized versions as described here: Lines 1744 to 1753 in 756e636
If your large VCF is a product of |
OK, thank you for the explanation. I will try using the FORMAT/LPL and FORMAT/LAA fields.
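To illustrate why the localized fields mentioned above shrink a record, here is a minimal sketch that counts values per sample. It assumes, based on my reading of the VCF specification rather than anything stated in this thread, that FORMAT/LAA lists only the ALT alleles relevant to a sample and that FORMAT/LPL holds likelihoods over REF plus those local alleles. The function names are hypothetical and are not part of bcftools or HTSlib.

```python
from math import comb


def genotype_count(n_alleles: int, ploidy: int = 2) -> int:
    """Number of unordered genotypes (multisets of size `ploidy`) over n_alleles."""
    return comb(n_alleles + ploidy - 1, ploidy)


def pl_values(n_alt: int) -> int:
    """Values per sample in FORMAT/PL: one per genotype over REF + all ALT alleles."""
    return genotype_count(n_alt + 1)


def localized_values(n_local_alt: int) -> int:
    """Values per sample with localized fields (assumed semantics):
    one LAA index per local ALT allele, plus one LPL value per genotype
    over REF + the local ALT alleles only.
    """
    return n_local_alt + genotype_count(n_local_alt + 1)


if __name__ == "__main__":
    # Hypothetical site with 100 ALT alleles where a sample carries 2 of them.
    print("PL values per sample:      ", pl_values(100))
    print("LAA + LPL values per sample:", localized_values(2))
```

The per-sample count then depends on how many alleles that sample actually carries, not on the total number of ALT alleles at the site.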
I am closing this as it 1) cannot be fixed due to a BCF spec limitation and 2) has a functional workaround.
Hello, thank you for your work on htslib and bcftools.
I am working with VCF files containing more than a hundred thousand samples, and I have noticed that "bcftools view" silently ignores any VCF record that is very long. I believe the problem starts when the uncompressed line length exceeds 2^31 bytes (2 GiB). I have a script that reproduces the problem by creating a dummy VCF file with a long line: https://gist.github.com/hannespetur/b8d44ba90deb3f4510753262a9db94ee
The first argument is the number of samples in the file; the second is the number of alternative alleles in the record. The VCF line has a PL value for each possible diploid genotype, so the line size grows very quickly with the number of alternative alleles.
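As a rough illustration of that growth, the sketch below counts diploid genotypes for a given number of alternative alleles and estimates the text size of the PL values in one line. The sample count of 150,000 and the figure of two bytes per PL value (one digit plus one separator) are assumptions chosen for illustration, not values taken from the gist.

```python
def diploid_genotype_count(n_alt: int) -> int:
    """Number of unordered diploid genotypes for 1 REF + n_alt ALT alleles."""
    n_alleles = n_alt + 1
    return n_alleles * (n_alleles + 1) // 2


def estimated_pl_bytes(n_samples: int, n_alt: int, bytes_per_value: int = 2) -> int:
    """Rough lower bound on the text size of all PL values in one VCF line.

    bytes_per_value is an assumption: a single digit plus a separator.
    """
    return n_samples * diploid_genotype_count(n_alt) * bytes_per_value


LIMIT = 2**31  # line length at which the problem is suspected to start

if __name__ == "__main__":
    n_samples = 150_000  # hypothetical cohort size
    for n_alt in (1, 10, 50, 100, 150):
        size = estimated_pl_bytes(n_samples, n_alt)
        print(n_alt, diploid_genotype_count(n_alt), size, size >= LIMIT)
```

Because the genotype count grows quadratically with the number of alleles, a moderate increase in alternative alleles is enough to push a single line past the limit at this cohort size.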
I have noticed that bgzip can handle lines of this length; however, bcftools index/tabix cannot index the file.
Do you have any workarounds for working with large VCF lines like this?
Tested on
Red Hat Enterprise Linux 7 (64-bit)
Intel Xeon E5-2690 v4
128 GB RAM
bcftools 1.14 and htslib 1.14. Problem was also reproduced on 1.10.2.