probaln_glocal returns suboptimal alignment when bandwidth is too large #1605
Comments
I've started looking at this, so it's on our radar. One problem is this code (and the forward direction code similarly above it) https://github.com/samtools/htslib/blob/develop/probaln.c#L271-L277
As we're going from k up to and including l_ref, our dimension for the array needs to be l_ref+1. We've allocated at most to l_ref (or l_query, but same thing here). Hence the protection against overstepping the memory. Unfortunately that means we don't fill out those […]. The code is clear as mud though, so it's not clear if this is meant to be protected by starting at index 1 instead of 0, or if we're just limiting i_dim to be 1 too few. Changing i_dim does fix it, but I don't know if it's the correct solution yet. Continuing my descent into madness. :-)
Further investigation shows there are three such […]. Checking the allocations of […]. So given we've allocated +1, the […]. What I have discovered by trial and error is […]. What I'm currently unsure about though is the […].
In three places when filling out the forwards and backwards arrays, the "u" array index has bounds checks of "u < 3 || u >= i_dim-3". Understanding this code is tricky, however.

My hypothesis is that the upper bounds check exists because we use u, u+1 and u+2 as array indices and iterate with "k <= l_ref", so we could access one beyond the end of the array. However, the arrays are allocated with dimension (l_query+1)*i_dim, so (assuming l_ref vs l_query is handled correctly in the bw/i_dim calculation) we have already compensated for this over-step. This has been validated with address sanitiser.

The effect of the i_dim-3 limit is that a band width equal to the query length causes the final state element to be incorrectly labelled as an insertion.

This hypothesis may be incorrect, however: the lower bound "u < 3" also seems redundant, yet changing it to "u < 0" gives different quality scores in about 1 in 4000 sequences (tested on 10 million Illumina short-read BAQ calculations). Hence for now the lower bound is left unchanged.

In normal behaviour using a band, tested using "samtools calmd -r -E" to generate BQ tags, this commit does not change output.

Fixes samtools#1605
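To make the effect of the "u >= i_dim-3" check concrete, here is a small standalone sketch of the band indexing. It is not the htslib code: the helper names (`band_setup`, `band_index`, `cell_skipped`) are hypothetical, and the formulas for the clamped band width, `i_dim` and `u` are assumptions modelled on my reading of probaln.c, so check them against the real source before relying on them. Only the bounds check itself and the "k <= l_ref" iteration come from the commit message above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of the band geometry. The formulas below are
 * assumptions about how probaln.c derives bw, i_dim and u; they are
 * not copied from htslib. */
typedef struct {
    int bw;     /* effective band width after clamping */
    int i_dim;  /* slots per query row (3 states per cell, plus padding) */
} band_t;

static band_t band_setup(int l_ref, int l_query, int requested_bw) {
    assert(l_ref > 0 && l_query > 0 && requested_bw >= 0);
    band_t b;
    int bw = l_ref > l_query ? l_ref : l_query;
    if (bw > requested_bw) bw = requested_bw;      /* honour the caller's limit */
    if (bw < abs(l_ref - l_query)) bw = abs(l_ref - l_query);
    int bw2 = bw * 2 + 1;                          /* full width of the band */
    b.bw = bw;
    b.i_dim = bw2 < l_ref ? bw2 * 3 + 6 : l_ref * 3 + 6;
    return b;
}

/* Index of the first of the three state slots for cell (i, k). */
static int band_index(int bw, int i, int k) {
    int x = i - bw;
    if (x < 0) x = 0;
    return (k - x + 1) * 3;
}

/* Returns 1 if the bounds check quoted in the commit message would
 * cause cell (i, k) to be skipped, 0 otherwise. */
static int cell_skipped(int l_ref, int l_query, int requested_bw, int i, int k) {
    band_t b = band_setup(l_ref, l_query, requested_bw);
    int u = band_index(b.bw, i, k);
    return u < 3 || u >= b.i_dim - 3;
}
```

Under these assumptions, with l_ref = l_query = 13 and a band width of 13, the final cell (i = 13, k = 13) gets u = 42 while i_dim - 3 is also 42, so `cell_skipped(13, 13, 13, 13, 13)` returns 1. With a band width of 12 the same cell gets u = 39 and is filled in, which is consistent with the behaviour reported in the issue.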
When the bandwidth parameter provided in `probaln_par_t` to `probaln_glocal` is the length of the input seq or greater, a suboptimal alignment is returned.

Here's a minimal test case (replacing `main` in probaln.c):

Here I have passed the same 13bp sequence as both query and reference, so would expect a trivial alignment to be found. Instead, the alignment has the last base inserted rather than matched and the returned likelihood is quite low:
If I change the bandwidth to be l_seq - 1 the expected alignment is found: