Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

probaln_glocal returns suboptimal alignment when bandwidth is too large #1605

Closed
jts opened this issue Apr 20, 2023 · 2 comments · Fixed by #1616
Closed

probaln_glocal returns suboptimal alignment when bandwidth is too large #1605

jts opened this issue Apr 20, 2023 · 2 comments · Fixed by #1616
Assignees

Comments

@jts
Copy link

jts commented Apr 20, 2023

When the bandwidth parameter provided in probaln_par_t to probaln_glocal is the length of the input seq or greater a suboptimal alignment is returned.

Here's a minimal test case (replacing main in probaln.c):

#include <unistd.h>
int main(int argc, char *argv[])
{

    uint8_t seq[] = { 0, 3, 2, 0, 3, 2, 1, 1, 3, 2, 3, 2, 2 };
    int l_seq = sizeof(seq);
    int* state = malloc(l_seq * sizeof(int));
    uint8_t* q = malloc(l_seq);
    probaln_par_t par = { 0.01, 0.1, l_seq /* - 1*/};
    int P = probaln_glocal(seq, l_seq, seq, l_seq, NULL, &par, state, q);
    printf("bw: %d P: %d\nstate: ", par.bw, P);
    for(int i = 0; i < l_seq; ++i) {
        printf("%02d ", state[i] >> 2);
    }
    printf("\ninsrt: ");
    for(int i = 0; i < l_seq; ++i) {
        printf("%02d ", state[i] & 3);
    }
    printf("\n");
    free(state);
    free(q);

    return 0;
}

Here I have passed the same 13bp sequence as both query and reference, so would expect a trivial alignment to be found. Instead, the alignment has the last base inserted rather than matched and the returned likelihood is quite low:

bw: 13 P: 29
state: 00 01 02 03 04 05 06 07 08 09 10 11 11 
insrt: 00 00 00 00 00 00 00 00 00 00 00 00 01 

If I change the bandwidth to be l_seq - 1 the expected alignment is found:

bw: 12 P: 6
state: 00 01 02 03 04 05 06 07 08 09 10 11 12 
insrt: 00 00 00 00 00 00 00 00 00 00 00 00 00 
@jkbonfield
Copy link
Contributor

jkbonfield commented May 3, 2023

I've started looking at this, so it's on our radar.

One problem is this code (and the forward direction code similarly above it)

https://github.com/samtools/htslib/blob/develop/probaln.c#L271-L277

    for (k = 1; k <= l_ref; ++k) {
        int u;
        double *bi = &b[l_query*i_dim];
        set_u(u, bw, l_query, k);
        if (u < 3 || u >= i_dim - 3) continue;
        bi[u+0] = sM / s[l_query] / s[l_query+1]; bi[u+1] = sI / s[l_query] / s[l_query+1];
    }

As we're going from k up to and including l_ref, our dimension for the array needs to be l_ref+1. We've allocated at most to l_ref (or l_query, but same thing here). Hence the protection against overstepping the memory. Unfortunately that means we don't fill out those bi cases either.

The code is clear as mud though, so it's not clear if this is meant to be protected by starting at index 1 instead of 0, or if we're just limiting idim to be 1 too few. Changing idim does fix it, but I don't know if it's the correct solution yet. Continuing my descent into madness. :-)

@jkbonfield
Copy link
Contributor

Futher investigation shows there are three such if (u < 3 || u >= i_dim - 3) continue statements.

Checking the allocations of f and b arrays I see it uses f = calloc((l_query+1)*i_dim, sizeof(double)), so it's already allocated +1 on the query length, meaning k <= l_ref isn't an issue... I think. I'm unclear of ref vs query length here still, but it looks like this is compensated for by if (bw < abs(l_ref - l_query)) bw = abs(l_ref - l_query);.

So given we've allocated +1, the u >= i_dim - 3 is clearly not needed for memory protection reasons. Equally so u < 3 isn't too. I don't understand what the purpose of these instructions are.

What I have discovered by trial and error is u < 0 does change the BAQ output in about 1 in 3700 cases, but u >= i_dim doesn't for the tests I've run so far (10 million reads). Using >= i_dim instead of >= i_dim-3 does cure this issue too, so it's what I'm going to go with.

What I'm currently unsure about though is the u < 3 vs u < 0 differences. It's different for sure, but is it worse or better? That's challenging! I'll probably leave it as-is for sake of reproducability, but it'd be interesting to figure out why it exists at all given we only ever access u, u+1 and u+2. I wonder if it's just an accident brought about by thinking of things like set_u(v11, bw, i-1, k-1), but if so it's wrong as those negatives get applied before computing u and not after.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue May 4, 2023
In 3 places when filling out forwards and backwards arrays, the "u"
array index has bounds checks of "u < 3 || u >= i_dim-3".

Understanding this code is tricky however!  My hypothesis that the
upper bounds check here is because we use u, u+1 and u+2 in array
indices, and we iterate with "k <= l_ref" so we can access one beyond
the end of the array.

However the arrays are allocated to be dimension (l_query+1)*i_dim, so
(assuming correctness of l_ref vs l_query in bw/i_dim calculation) we
have compensated for this over-step already.  This has been validated
with address sanitiser.

The effect of the i_dim-3 limit is that having band width equal to
query length causes the final state element to be incorrectly labelled
as an insertion.

This hypothesis may however be incorrect, as the lower bound "u < 3"
also seems redundant, yet changing this to "u < 0" does give different
quality scores in about 1 in 4000 sequences (tested on 10 million
illumina short read BAQ calculations).  Hence for now this is left
unchanged.  In normal behaviour using a band, tested using "samtools
calmd -r -E" to generate BQ tags, this commit does not change output.

Fixes samtools#1605
daviesrob pushed a commit that referenced this issue Jun 27, 2023
In 3 places when filling out forwards and backwards arrays, the "u"
array index has bounds checks of "u < 3 || u >= i_dim-3".

Understanding this code is tricky however!  My hypothesis that the
upper bounds check here is because we use u, u+1 and u+2 in array
indices, and we iterate with "k <= l_ref" so we can access one beyond
the end of the array.

However the arrays are allocated to be dimension (l_query+1)*i_dim, so
(assuming correctness of l_ref vs l_query in bw/i_dim calculation) we
have compensated for this over-step already.  This has been validated
with address sanitiser.

The effect of the i_dim-3 limit is that having band width equal to
query length causes the final state element to be incorrectly labelled
as an insertion.

This hypothesis may however be incorrect, as the lower bound "u < 3"
also seems redundant, yet changing this to "u < 0" does give different
quality scores in about 1 in 4000 sequences (tested on 10 million
illumina short read BAQ calculations).  Hence for now this is left
unchanged.  In normal behaviour using a band, tested using "samtools
calmd -r -E" to generate BQ tags, this commit does not change output.

Fixes #1605
vasudeva8 pushed a commit to vasudeva8/htslib that referenced this issue Aug 17, 2023
In 3 places when filling out forwards and backwards arrays, the "u"
array index has bounds checks of "u < 3 || u >= i_dim-3".

Understanding this code is tricky however!  My hypothesis that the
upper bounds check here is because we use u, u+1 and u+2 in array
indices, and we iterate with "k <= l_ref" so we can access one beyond
the end of the array.

However the arrays are allocated to be dimension (l_query+1)*i_dim, so
(assuming correctness of l_ref vs l_query in bw/i_dim calculation) we
have compensated for this over-step already.  This has been validated
with address sanitiser.

The effect of the i_dim-3 limit is that having band width equal to
query length causes the final state element to be incorrectly labelled
as an insertion.

This hypothesis may however be incorrect, as the lower bound "u < 3"
also seems redundant, yet changing this to "u < 0" does give different
quality scores in about 1 in 4000 sequences (tested on 10 million
illumina short read BAQ calculations).  Hence for now this is left
unchanged.  In normal behaviour using a band, tested using "samtools
calmd -r -E" to generate BQ tags, this commit does not change output.

Fixes samtools#1605
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants