-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix multi-threaded on-the-fly indexing problems #1672
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
When indexing on-the-fly, the virtual offset of the end of each read gets passed to hts_idx_push(). As HTSlib starts a new BGZF block if the next read doesn't fit, the stored offset needs to be adjusted to point to the start of the next block instead of the end of the old one. (While the two locations effectively map to the same uncompressed offset, pointing to the end of a block causes problems for HTSlib before 1.11, and HTSJDK.) When multi-threading, this adjustment was done by calling bgzf_idx_amend_last() from the main thread. However, if the data was written out too promptly, the function could be called too late preventing the adjustment from being applied. To fix this, the adjustment is now moved to bgzf_idx_flush() which is called by the writer thread, ensuring that it is always applied on time. The internal function bgzf_idx_amend_last() is removed as it's no longer needed. Fixes samtools/samtools#1861 (samtools merge --write-index results in Invalid file pointer [from HTSJDK] in rare cases whereas samtools index fixes the problem)
* Switch from hts_idx_push() to bgzf_idx_push() for on-the-fly indexing of BCF and VCF.bgz files. The latter function is needed to record the correct offsets when using multi-threaded BGZF compression. Fixes samtools/bcftools#1985 * Only allow indexing of BGZF-compressed files. It's necessary to enforce this as on-the-fly indexing assumes that the file pointer is in htsFile::fp.bgzf, but uncompressed VCF uses htsFile::fp.hfile.
jmarshall
added a commit
to jmarshall/htslib
that referenced
this pull request
Jul 16, 2024
As noted in samtools#1722, this function was removed in PR samtools#1672.
daviesrob
pushed a commit
that referenced
this pull request
Jul 17, 2024
* Ignore and clean test/*/FAIL* for six subdirectories These files can appear in base_mods, fastq, mpileup, and sam_filter as well as faidx and tabix. * Fix comment header to use `@CO\t` as per other comment headers * Remove extraneous inclusion and add missing dependency * Remove last traces of previously deleted bgzf_idx_amend_last() As noted in #1722, this function was removed in PR #1672. * Use isspace_c() et al in annot-tsv.c * Minor corrections to system headers Plain getopt() is declared in <unistd.h>; strcasecmp() et al are only portably declared in <strings.h>.
jkbonfield
added a commit
to jkbonfield/htslib
that referenced
this pull request
Sep 12, 2024
When using bcftools view --write-index -o out.vcf.gz the virtual file offsets can differ depending on whether we do a bgzf_tell before or after a flush. Specifically it could point to the last byte in the current BGZF block or the first byte in the next BGZF block. Ultimately both of these resolve to the same physical location, but in some situations the former may mean attempting to read zero bytes (the remainder of the bgzf block). This has been known in the past to be misinterpreted as an EOF. (See samtools/samtools#1861) It also means the contents of the index produced by --write-index and a separate bcftools index command can yield different results, albeit both representing the same data. The fix for the samtools / bcftools issue above (samtools#1672) when multi-threading inadvertently recreated the bug when not multi-threading.
jkbonfield
added a commit
to jkbonfield/htslib
that referenced
this pull request
Sep 12, 2024
When using bcftools view --write-index -o out.vcf.gz the virtual file offsets can differ depending on whether we do a bgzf_tell before or after a flush. Specifically it could point to the last byte in the current BGZF block or the first byte in the next BGZF block. Ultimately both of these resolve to the same physical location, but in some situations the former may mean attempting to read zero bytes (the remainder of the bgzf block). This has been known in the past to be misinterpreted as an EOF. (See samtools/samtools#1861) It also means the contents of the index produced by --write-index and a separate bcftools index command can yield different results, albeit both representing the same data. The fix for the samtools / bcftools issue above (samtools#1672) when multi-threading inadvertently recreated the bug when not multi-threading. Fixes samtools/bcftools#2267
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
samtools merge --write-index
(.csi
) results inInvalid file pointer
in rare cases whereassamtools index
(.bai
) fixes the problem samtools#1861htsFile::fp.bgzf
, but uncompressed VCF useshtsFile::fp.hfile
.