probaln_glocal returns suboptimal alignment when bandwidth is too large #1605

jts · 2023-04-20T17:38:58Z

When the bandwidth parameter provided in probaln_par_t to probaln_glocal is the length of the input seq or greater a suboptimal alignment is returned.

Here's a minimal test case (replacing main in probaln.c):

#include <unistd.h>
int main(int argc, char *argv[])
{

    uint8_t seq[] = { 0, 3, 2, 0, 3, 2, 1, 1, 3, 2, 3, 2, 2 };
    int l_seq = sizeof(seq);
    int* state = malloc(l_seq * sizeof(int));
    uint8_t* q = malloc(l_seq);
    probaln_par_t par = { 0.01, 0.1, l_seq /* - 1*/};
    int P = probaln_glocal(seq, l_seq, seq, l_seq, NULL, &par, state, q);
    printf("bw: %d P: %d\nstate: ", par.bw, P);
    for(int i = 0; i < l_seq; ++i) {
        printf("%02d ", state[i] >> 2);
    }
    printf("\ninsrt: ");
    for(int i = 0; i < l_seq; ++i) {
        printf("%02d ", state[i] & 3);
    }
    printf("\n");
    free(state);
    free(q);

    return 0;
}

Here I have passed the same 13bp sequence as both query and reference, so would expect a trivial alignment to be found. Instead, the alignment has the last base inserted rather than matched and the returned likelihood is quite low:

bw: 13 P: 29
state: 00 01 02 03 04 05 06 07 08 09 10 11 11 
insrt: 00 00 00 00 00 00 00 00 00 00 00 00 01

If I change the bandwidth to be l_seq - 1 the expected alignment is found:

bw: 12 P: 6
state: 00 01 02 03 04 05 06 07 08 09 10 11 12 
insrt: 00 00 00 00 00 00 00 00 00 00 00 00 00

The text was updated successfully, but these errors were encountered:

jkbonfield · 2023-05-03T09:19:47Z

I've started looking at this, so it's on our radar.

One problem is this code (and the forward direction code similarly above it)

https://github.com/samtools/htslib/blob/develop/probaln.c#L271-L277

    for (k = 1; k <= l_ref; ++k) {
        int u;
        double *bi = &b[l_query*i_dim];
        set_u(u, bw, l_query, k);
        if (u < 3 || u >= i_dim - 3) continue;
        bi[u+0] = sM / s[l_query] / s[l_query+1]; bi[u+1] = sI / s[l_query] / s[l_query+1];
    }

As we're going from k up to and including l_ref, our dimension for the array needs to be l_ref+1. We've allocated at most to l_ref (or l_query, but same thing here). Hence the protection against overstepping the memory. Unfortunately that means we don't fill out those bi cases either.

The code is clear as mud though, so it's not clear if this is meant to be protected by starting at index 1 instead of 0, or if we're just limiting idim to be 1 too few. Changing idim does fix it, but I don't know if it's the correct solution yet. Continuing my descent into madness. :-)

jkbonfield · 2023-05-04T09:25:13Z

Futher investigation shows there are three such if (u < 3 || u >= i_dim - 3) continue statements.

Checking the allocations of f and b arrays I see it uses f = calloc((l_query+1)*i_dim, sizeof(double)), so it's already allocated +1 on the query length, meaning k <= l_ref isn't an issue... I think. I'm unclear of ref vs query length here still, but it looks like this is compensated for by if (bw < abs(l_ref - l_query)) bw = abs(l_ref - l_query);.

So given we've allocated +1, the u >= i_dim - 3 is clearly not needed for memory protection reasons. Equally so u < 3 isn't too. I don't understand what the purpose of these instructions are.

What I have discovered by trial and error is u < 0 does change the BAQ output in about 1 in 3700 cases, but u >= i_dim doesn't for the tests I've run so far (10 million reads). Using >= i_dim instead of >= i_dim-3 does cure this issue too, so it's what I'm going to go with.

What I'm currently unsure about though is the u < 3 vs u < 0 differences. It's different for sure, but is it worse or better? That's challenging! I'll probably leave it as-is for sake of reproducability, but it'd be interesting to figure out why it exists at all given we only ever access u, u+1 and u+2. I wonder if it's just an accident brought about by thinking of things like set_u(v11, bw, i-1, k-1), but if so it's wrong as those negatives get applied before computing u and not after.

In 3 places when filling out forwards and backwards arrays, the "u" array index has bounds checks of "u < 3 || u >= i_dim-3". Understanding this code is tricky however! My hypothesis that the upper bounds check here is because we use u, u+1 and u+2 in array indices, and we iterate with "k <= l_ref" so we can access one beyond the end of the array. However the arrays are allocated to be dimension (l_query+1)*i_dim, so (assuming correctness of l_ref vs l_query in bw/i_dim calculation) we have compensated for this over-step already. This has been validated with address sanitiser. The effect of the i_dim-3 limit is that having band width equal to query length causes the final state element to be incorrectly labelled as an insertion. This hypothesis may however be incorrect, as the lower bound "u < 3" also seems redundant, yet changing this to "u < 0" does give different quality scores in about 1 in 4000 sequences (tested on 10 million illumina short read BAQ calculations). Hence for now this is left unchanged. In normal behaviour using a band, tested using "samtools calmd -r -E" to generate BQ tags, this commit does not change output. Fixes samtools#1605

In 3 places when filling out forwards and backwards arrays, the "u" array index has bounds checks of "u < 3 || u >= i_dim-3". Understanding this code is tricky however! My hypothesis that the upper bounds check here is because we use u, u+1 and u+2 in array indices, and we iterate with "k <= l_ref" so we can access one beyond the end of the array. However the arrays are allocated to be dimension (l_query+1)*i_dim, so (assuming correctness of l_ref vs l_query in bw/i_dim calculation) we have compensated for this over-step already. This has been validated with address sanitiser. The effect of the i_dim-3 limit is that having band width equal to query length causes the final state element to be incorrectly labelled as an insertion. This hypothesis may however be incorrect, as the lower bound "u < 3" also seems redundant, yet changing this to "u < 0" does give different quality scores in about 1 in 4000 sequences (tested on 10 million illumina short read BAQ calculations). Hence for now this is left unchanged. In normal behaviour using a band, tested using "samtools calmd -r -E" to generate BQ tags, this commit does not change output. Fixes #1605

In 3 places when filling out forwards and backwards arrays, the "u" array index has bounds checks of "u < 3 || u >= i_dim-3". Understanding this code is tricky however! My hypothesis that the upper bounds check here is because we use u, u+1 and u+2 in array indices, and we iterate with "k <= l_ref" so we can access one beyond the end of the array. However the arrays are allocated to be dimension (l_query+1)*i_dim, so (assuming correctness of l_ref vs l_query in bw/i_dim calculation) we have compensated for this over-step already. This has been validated with address sanitiser. The effect of the i_dim-3 limit is that having band width equal to query length causes the final state element to be incorrectly labelled as an insertion. This hypothesis may however be incorrect, as the lower bound "u < 3" also seems redundant, yet changing this to "u < 0" does give different quality scores in about 1 in 4000 sequences (tested on 10 million illumina short read BAQ calculations). Hence for now this is left unchanged. In normal behaviour using a band, tested using "samtools calmd -r -E" to generate BQ tags, this commit does not change output. Fixes samtools#1605

daviesrob assigned jkbonfield Apr 25, 2023

jkbonfield mentioned this issue May 4, 2023

Change bounds checking in probaln_glocal #1616

Merged

daviesrob closed this as completed in #1616 Jun 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

probaln_glocal returns suboptimal alignment when bandwidth is too large #1605

probaln_glocal returns suboptimal alignment when bandwidth is too large #1605

jts commented Apr 20, 2023

jkbonfield commented May 3, 2023 •

edited

Loading

jkbonfield commented May 4, 2023

probaln_glocal returns suboptimal alignment when bandwidth is too large #1605

probaln_glocal returns suboptimal alignment when bandwidth is too large #1605

Comments

jts commented Apr 20, 2023

jkbonfield commented May 3, 2023 • edited Loading

jkbonfield commented May 4, 2023

jkbonfield commented May 3, 2023 •

edited

Loading