Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base mod API improvements #1636

Merged
merged 6 commits into from
Jul 10, 2023
Merged

Conversation

jkbonfield
Copy link
Contributor

Several improvements.

  • Create base_mod.c from sam.c. Sam.c was the largest file in the main htslib dir, and splitting out the base modification code means sam_mod.c is 21st of 36 files, so it's plenty big enough to justify it. Headers haven't changed, with API in htslib/sam.h as before.
  • Bug fix the MM parsing code. It wasn't reverting back implicit, so gave incorrect results on explicit vs implicit on mixed data types within a single alignment.
  • Added a bam_mods_queryi API to go along side the existing bam_mods_query_type one. It returns data on the i^th modification type instead of looking up by 'code'.
  • Added bam_parse_basemod2 API with an additional flag field. The only flag at present is HTS_MOD_REPORT_UNCHECKED
  • Modified bam_next_basemod and bam_mods_at_next_pos (and implicitly bam_mods_at_qpos) to honour the new flag.
  • Added some internal documentation to sam_mods.c. Long term this may be best going somewhere else, but it could do with more tidying up still.

The purpose of the flag is to allow us to distinguish the three states of modified, unmodified, and unchecked.

Imagine a scenario of a processing a pileup where we're trying to gather statistics on what percentage of data in the pileup is modified. If all sequences are using implicit mode where every base is checked and we record the likely modifications in MM and (implicitly) the unrecorded bases have been deemed to be unmodified, then the statistics are a simple binary yes/no.

If we only have explicit-mode data present, where the mod-caller checked some places only and the QUAL field is used to distinguish between likely modified and unlikely modified, and uncalled positions have not been checked, then we need to count differently. We gather statistics only on the called bases and make a binary split on QUAL instead. Uncalled bases (those not in the MM tag) have no information on modification status so are ignored.

However if we have a mix of the two, we get in a pickle. We have to start using the bam_mods_queryi / bam_mods_query_type API for each call list to determine whether the absence of a mod at a particular site means no-mod-detected or did-not-look.

With the HTS_MOD_REPORT_UNCHECKED flag set, the explicit-mode will produce a modification call even on the sites that aren't listed, but with a specific qual of HTS_MOD_UNCHECKED. This means code can handle implicit, explicit and mixed data with a single loop without needing to do additional queries. No mod or low qual = not found. Mod with high qual=found. Mod witth unchecked qual => ignore for stats.

Fixes #1550 hopefully!

@jkbonfield jkbonfield force-pushed the base_mod_revamp branch 2 times, most recently from 5600bae to 7bef70d Compare June 30, 2023 16:30
@daviesrob daviesrob assigned daviesrob and whitwham and unassigned daviesrob Jul 4, 2023
Sam.c is already a behemoth over 6,000 lines long.  It's hard to find
things in it, and splitting by domain (with a sam prefix) makes life
easier as a developer.

Note: when sorted by file size, the split off code is already close to
the median, so this isn't creating pointlessly small files.  The
functions are still declared in sam.h there is no API change.

This commit changes no functionality as it's simply code migration.
…her.

We didn't reset back to implicit after an explicit mod, so
"C+m?,4;G+o,2;" would set "m" to be explicit (?) and also leave "o" as
explicit.
This allows querying by the i^{th} modificaton type rather than by
code. This is useful when we have multiple mods with differing
meta-data, such as "C+m.,4;G-m?,1;".

The previous bam_mods_query_type isn't sufficient as it's the same
code "m" being used.

Fixes samtools#1550
The only flag at the moment is HTS_MOD_REPORT_UNCHECKED.

This changes bam_mods_at_next_pos to report modified bases with
qual=HTS_MOD_UNCHECKED for explicitly modification types that do not
have coverage for this specific position.

For consistency, also set the unknown qual from -1 to HTS_MOD_UNKNOWN
(#defined to -1).  This is used when ML is absent.  Arguably this
could be 255 to match things like unknown MAPQ, but this ship has
sailed.

The test/test_mod tool has a -f INT argument to specify the basemod2
flags.

TODO: modify other functions too.
TODO: add tests, based on MM-explicit.sam
@whitwham whitwham merged commit 7822d8d into samtools:develop Jul 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Behaviour of bam_mods_query_type() / interface to hts_base_mod_state
3 participants