Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct mutation rates for observer bias while relaxing assumption that reads span the full section #10

Closed
matthewfallan opened this issue Mar 6, 2024 · 0 comments · Fixed by #12
Assignees
Labels
enhancement New feature or request

Comments

@matthewfallan
Copy link
Contributor

Background

Reads with mutations separated by fewer than 3 non-mutated nucleotides are underrepresented. Presumably, the reverse transcriptase cannot read through two closely spaced chemical modifications, causing such reads to drop out, so that it is rare for an observed read to have two close mutations. Consequently, the chemical reactivities of the observed reads are inherently biased from the true, underlying chemical reactivities among all reads, even those with two or more close chemical modifications.

SEISMIC-RNA can already infer the true chemical reactivities from the observed reads. However, the correction assumes that all reads span the full section. While this assumption is generally valid for amplicon-based samples, applying it to randomly fragmented samples requires discarding any reads that do not fully overlap the section, which reduces the sequencing depth. This requirement makes it impossible to correct the mutation rates over sections that are longer than the read length, including sections produced by the join command.

Proposal

The algorithm that corrects mutation rates could be modified so that the results will still be valid even if reads do not span the full section. First, while the current algorithm calculates only the likelihood that a full-length read would no contain two mutations too close, it could be retooled to calculate this likelihood for reads with every possible pair of start and end coordinates, using a very similar dynamic programming method. Second, it could introduce a new parameter for the proportion of reads aligned to each pair of start and end coordinates, which would also be solved for along with the mutation rates, based on the observed distribution of start and end coordinates.

The main disadvantage of the updated algorithm is its increased computational cost. The current algorithm has time and space complexities of roughly O(N) with respect to the length of the sequence, while the proposed algorithm would likely scale as O(N^3) and also have a somewhat larger fixed cost. However, the time cost may not be prohibitive in practice: for short sections (< 300 nt), an O(N^3) cost would not be as noticeable; while for sequences longer than the read lengths, the algorithm would change an impossible problem into a slowly solved problem. Further work could also enhance the performance, such as careful optimization of the code, sparse matrices for the start/end coordinates, and/or integration with just-in-time Python compilers (e.g. Numba).

@matthewfallan matthewfallan added the enhancement New feature or request label Mar 6, 2024
@matthewfallan matthewfallan self-assigned this Mar 6, 2024
@matthewfallan matthewfallan linked a pull request Mar 27, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant