Base mod API improvements #1636

jkbonfield · 2023-06-30T15:36:00Z

Several improvements.

Create base_mod.c from sam.c. Sam.c was the largest file in the main htslib dir, and splitting out the base modification code means sam_mod.c is 21st of 36 files, so it's plenty big enough to justify it. Headers haven't changed, with API in htslib/sam.h as before.
Bug fix the MM parsing code. It wasn't reverting back implicit, so gave incorrect results on explicit vs implicit on mixed data types within a single alignment.
Added a bam_mods_queryi API to go along side the existing bam_mods_query_type one. It returns data on the i^th modification type instead of looking up by 'code'.
Added bam_parse_basemod2 API with an additional flag field. The only flag at present is HTS_MOD_REPORT_UNCHECKED
Modified bam_next_basemod and bam_mods_at_next_pos (and implicitly bam_mods_at_qpos) to honour the new flag.
Added some internal documentation to sam_mods.c. Long term this may be best going somewhere else, but it could do with more tidying up still.

The purpose of the flag is to allow us to distinguish the three states of modified, unmodified, and unchecked.

Imagine a scenario of a processing a pileup where we're trying to gather statistics on what percentage of data in the pileup is modified. If all sequences are using implicit mode where every base is checked and we record the likely modifications in MM and (implicitly) the unrecorded bases have been deemed to be unmodified, then the statistics are a simple binary yes/no.

If we only have explicit-mode data present, where the mod-caller checked some places only and the QUAL field is used to distinguish between likely modified and unlikely modified, and uncalled positions have not been checked, then we need to count differently. We gather statistics only on the called bases and make a binary split on QUAL instead. Uncalled bases (those not in the MM tag) have no information on modification status so are ignored.

However if we have a mix of the two, we get in a pickle. We have to start using the bam_mods_queryi / bam_mods_query_type API for each call list to determine whether the absence of a mod at a particular site means no-mod-detected or did-not-look.

With the HTS_MOD_REPORT_UNCHECKED flag set, the explicit-mode will produce a modification call even on the sites that aren't listed, but with a specific qual of HTS_MOD_UNCHECKED. This means code can handle implicit, explicit and mixed data with a single loop without needing to do additional queries. No mod or low qual = not found. Mod with high qual=found. Mod witth unchecked qual => ignore for stats.

Fixes #1550 hopefully!

Sam.c is already a behemoth over 6,000 lines long. It's hard to find things in it, and splitting by domain (with a sam prefix) makes life easier as a developer. Note: when sorted by file size, the split off code is already close to the median, so this isn't creating pointlessly small files. The functions are still declared in sam.h there is no API change. This commit changes no functionality as it's simply code migration.

…her. We didn't reset back to implicit after an explicit mod, so "C+m?,4;G+o,2;" would set "m" to be explicit (?) and also leave "o" as explicit.

This allows querying by the i^{th} modificaton type rather than by code. This is useful when we have multiple mods with differing meta-data, such as "C+m.,4;G-m?,1;". The previous bam_mods_query_type isn't sufficient as it's the same code "m" being used. Fixes samtools#1550

The only flag at the moment is HTS_MOD_REPORT_UNCHECKED. This changes bam_mods_at_next_pos to report modified bases with qual=HTS_MOD_UNCHECKED for explicitly modification types that do not have coverage for this specific position. For consistency, also set the unknown qual from -1 to HTS_MOD_UNKNOWN (#defined to -1). This is used when ML is absent. Arguably this could be 255 to match things like unknown MAPQ, but this ship has sailed. The test/test_mod tool has a -f INT argument to specify the basemod2 flags. TODO: modify other functions too. TODO: add tests, based on MM-explicit.sam

Also improve tests

jkbonfield force-pushed the base_mod_revamp branch 2 times, most recently from 5600bae to 7bef70d Compare June 30, 2023 16:30

daviesrob assigned daviesrob and whitwham and unassigned daviesrob Jul 4, 2023

jkbonfield added 6 commits July 7, 2023 11:38

Correct base modification implicit / explicit status when mixed toget…

baeb685

…her. We didn't reset back to implicit after an explicit mod, so "C+m?,4;G+o,2;" would set "m" to be explicit (?) and also leave "o" as explicit.

Add more internal sam_mods.c documentation

9293970

Update bam_next_basemod too to cope with HTS_MOD_REPORT_UNCHECKED.

fddc680

Also improve tests

jkbonfield force-pushed the base_mod_revamp branch from 7bef70d to fddc680 Compare July 7, 2023 10:38

whitwham merged commit 7822d8d into samtools:develop Jul 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Base mod API improvements #1636

Base mod API improvements #1636

jkbonfield commented Jun 30, 2023

Base mod API improvements #1636

Base mod API improvements #1636

Conversation

jkbonfield commented Jun 30, 2023