-
Notifications
You must be signed in to change notification settings - Fork 444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Behaviour of bam_mods_query_type()
/ interface to hts_base_mod_state
#1550
Comments
bam_mods_query_type()
/ interface to hts_base_mod_state
bam_mods_query_type()
/ interface to hts_base_mod_state
The state is deliberately opaque as it gives us flexibility to change it without breaking the ABI. Things like the definition of Do you have a set of suggested new APIs? We already have test/test_mod.c which has an extended ( Using this trivial example:
I see that Unless we somehow have a non-paired double-stranded sequence (ie not C≡G but C~C forming a non-tightly bound bubble) then C-m shouldn't ever exist. I'm not sure if it's possible to obtain and sequence double-stranded DNA with modified bases that break the usual base-pairing rules. My knowledge of the microbiology isn't sufficient here. So some example data demonstrating your case would help. I'll need a bit more time to investigate what's going on in the C+m;C-m case though and see whether it's legitimate for it to simply fail. Naively I would have assumed it'd just regurgitate the data, irrespective of whether it's an impossible case. |
Part of the problem here is the specification was changed post-implementation by the addition of the implicit vs explicit counting. This meant that the one structure we did make public, is no longer up to the job:
Hence given a modified base, I did wonder whether we should really be iterating over sequence bases and recording yes, no, not-recorded type results, but we know one day someone is going to do something nightmarish such as checking for Suggestions welcomed from those who are actually consuming this data. Your insights are almost certainly better than me second guessing things. Edit: an alternative would be It doesn't really "feel right" though. It's basically an interface where the user would have to go through all combinations doing a search "is this here?", "is that here?" rather than just asking what is present. So I do feel like the correct solution is a better iterator. |
😄 I agree its probably best not to make it public. You've hit one of my basic problems: the implicit/explicit causes some issues. The rest of your answer is also very prescient --- I'll record my thoughts based on my recent work, but its all basically what you've already predicted! When using the pileup API to traverse by reference position, and then getting the list of mods on a read the lack of an entry cannot be taken as having a canonical base; it now implies "canonical" or "no call" dependent on the type of subtags present. One idea I arrived at was that I would be useful if when doing:
That
To help myself in the short term, I wrote a function similar to your
You're correct, it's a bit clunky for a client to be calling that for every subtag they care about (for each read, for each position in the pileup). On |
I've been looking at implementing the base modification API in rust_htslib and agree that it would be very nice for the details of implicit/explicit mode to be hidden from the caller. |
I'm looking at this again now.
What are you actually asking for here when we have a mix of implicit and explicit calls together. Do you want to treat uncalled implicit calls as if they were explicit calls with a minimal quality value? That could work I guess. It would then make both implicit and explicit calls iterated by the same code and filtering out low quality calls is done by the calling code. I'm unsure what this buys us though. If your code is doing that already (to cope with the explicit calls where a mod is explicitly recorded as having e.g. confidence 0) then it'd give the same results anyway for most scenarios I think.
We could have a separate query function for any given mod type to report whether or not it's explicitly or implicitly encoded. I think this is @jts's Note if you have no mods returned, you can still use
An alternative for coordinate based queries (either specific coordinates or seq iterators like
So basically in implicit mode we fill out the uncalled modifications to turn them into explicit "not found" calls, with a predefined standard quality score (eg -1, but controlled via a constant). That means we'd get many more calls to iterate over, but most of them would fall below are call threshold, and we can either filter them by checking equality to We could also have an API call that sets the implicit quality to be used for a specific modification type (or all types), so the user can control what quality should be used for implicit uncalled mods. That could be stored in the opaque state.
With minimum quality 5 and an implicit non-called quality of e.g. 1 specified, then we just skip over the implicit absent calls. However if we set the implicit non-called quality as 10 (approx P(exists)=0.04) and ask to iterate over all observed modified bases with qual >= 5 then we'd be expected to return all those implicit non-called bases, as if they were written in explicit mode. |
Argh sorry I'm being dense. It's the reverse... You want explicit calls that don't cover a base to have a specific unknown likelihood. Maybe it amounts to the same logic though. Instead of a constant So again we'd want a single iterator that expands up the explicit mod types and fills out that field so calling software can do whatever it needs to. Likely simply not add it to the "absent" column in the stats array, so we're correctly sampling found / not-found / unknown instead of just found / not-found. |
I have to admit I've lost the thead here a bit --- it will take me a while to get back in the game. Meanwhile I've passed the baton of handling this information to @ArtRand. |
You're not the only one, as is evident! Ignore my first message :) For what it's worth, I have a modified An example from the test/test_mod tool:
|
The code producing the above output is:
This is just a test. I think we need either some hidden state where to modify the behaviour of the functions by setting a flag in The former could also work with a parameterised
|
Just to clarify this point, the problem I have is that when there are multiple modifications with the same code (like in the MM-double test case) (I haven't looked over the other proposals, but will make time to do so) Jared |
Ok I got it. Yes I hadn't spotted that. |
This allows querying by the i^{th} modificaton type rather than by code. This is useful when we have multiple mods with differing meta-data, such as "C+m.,4;G-m?,1;". The previous bam_mods_query_type isn't sufficient as it's the same code "m" being used. Fixes samtools#1550
This allows querying by the i^{th} modificaton type rather than by code. This is useful when we have multiple mods with differing meta-data, such as "C+m.,4;G-m?,1;". The previous bam_mods_query_type isn't sufficient as it's the same code "m" being used. Fixes samtools#1550
This allows querying by the i^{th} modificaton type rather than by code. This is useful when we have multiple mods with differing meta-data, such as "C+m.,4;G-m?,1;". The previous bam_mods_query_type isn't sufficient as it's the same code "m" being used. Fixes samtools#1550
This allows querying by the i^{th} modificaton type rather than by code. This is useful when we have multiple mods with differing meta-data, such as "C+m.,4;G-m?,1;". The previous bam_mods_query_type isn't sufficient as it's the same code "m" being used. Fixes #1550
This allows querying by the i^{th} modificaton type rather than by code. This is useful when we have multiple mods with differing meta-data, such as "C+m.,4;G-m?,1;". The previous bam_mods_query_type isn't sufficient as it's the same code "m" being used. Fixes samtools#1550
The function
bam_mods_query_type
can be used to ask for the presence of a MM subtag based on the mod base code. It returns the additional information of such a tag if found. However, in the case that say bothC+m
andC-m
are present, the caller will only ever receive knowledge of the first listed tag. So further, queryingm
and receiving back the details forC+m
should not be taken to imply the non-existence of theC-m
tag.I've found it useful to be able to know whether a specific tag exists querying by the additional fields of the
hts_base_mod_state
type. The type is currently documented as being opaque (and indeed it isn't listed in sam.h) -- I'm wondering if it should be more public (there is already the functionbam_mods_recorded
which serves as a public interface to->type
). Alternatively the interface tobam_mods_query_type
could be amended to takestrand
(and potentiallyimplicit
andcanonical
).The text was updated successfully, but these errors were encountered: