Counting lower complexity barcodes yields many false positives #12
Comments
Hi @mschubert - yes, I think that should be relatively easy to do. In your case, do you use: i) a single fixed offset into all the reads
In my case that would be (i), the same offset for all samples, but if the offset were determined per sample that would also work.
Upon reading the However, the value is given as a fraction of the total number of matches, not as a fraction of the 100,000 reads sampled. This means that the desired value of this number changes whenever the off-target number changes. In my case, with 100,000 reads I have 75,000 matches at my desired offset. But when I run with Wouldn't it make more sense to specify this parameter per reads sampled?
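To make the two possible denominators concrete, here is a toy sketch (not guide-counter's actual code; the offsets 23, 5, and 61 and the off-target counts are hypothetical, only the 75,000-of-100,000 figure comes from the numbers above):

```python
# Toy illustration: how the meaning of a fractional threshold shifts
# depending on whether its denominator is the total number of matches
# found, or the number of reads sampled.

def offset_fractions(matches_per_offset, reads_sampled):
    """Return each offset's fraction of total matches and of reads sampled."""
    total_matches = sum(matches_per_offset.values())
    return {
        off: {
            "of_matches": n / total_matches,
            "of_reads": n / reads_sampled,
        }
        for off, n in matches_per_offset.items()
    }

# Hypothetical numbers loosely modeled on the report above: 100,000 reads
# sampled, 75,000 matches at the true offset, plus off-target matches
# at two other (made-up) offsets.
counts = {23: 75_000, 5: 40_000, 61: 35_000}
fracs = offset_fractions(counts, reads_sampled=100_000)

# The true offset holds 75% of the reads but only 50% of the matches: a
# threshold expressed as a fraction of matches moves whenever the
# off-target count changes, while a fraction of reads stays put.
print(fracs[23])
```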
@mschubert My thought in having the denominator be the number of matches (instead of number of reads) is that that fraction should be consistent/predictable even with sequencing error or other problems causing the actual fraction of matching reads overall to drop. Looking at the code, one issue I see is that it currently counts all matches from a read, so to your point, with short kmers you can end up with multiple matches per read. This is definitely not something I had anticipated. Something that is perhaps exacerbating this is that when looking for the prefixes, it is also tolerating mismatches, which obviously expands the possibility of finding multiple. I think there are probably a few ways to fix this and I'd appreciate your input:
(3) would make the prefix auto-detection a bit slower but also more robust, whereas (1) is obviously the simplest. Thoughts?
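One way to avoid counting multiple hits per read during offset detection is to count, per candidate offset, the number of *reads* whose window at that offset matches a known prefix, rather than every position at which some prefix happens to occur. A minimal sketch of that idea (illustrative only, not guide-counter's implementation; `detect_offset` and its parameters are made up for this example):

```python
# Sketch: credit each read at most once per candidate offset, so short or
# low-complexity prefixes that occur repeatedly within a read cannot
# inflate the counts.

def detect_offset(reads, prefixes, max_offset=10):
    """For each candidate offset, count reads whose window at that offset
    exactly matches a known guide prefix; return the best-supported offset
    and the full per-offset counts."""
    k = len(next(iter(prefixes)))  # assume all prefixes share one length
    counts = {off: 0 for off in range(max_offset + 1)}
    for read in reads:
        for off in counts:
            if read[off:off + k] in prefixes:
                counts[off] += 1
    # Pick the offset supported by the most reads.
    return max(counts, key=counts.get), counts
```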
@mschubert I'm not sure if you're up for checking out a branch and building, but I took a shot at implementing (3) in #14. Alternatively, if you're able to share the first 25-100k of your fastq I can give it a shot too.
Thanks for your quick answer @tfenne! I should be able to build this locally. Be aware that I'm already using
Your proposed changes in #14 would still not align with my expectation because the counting is still competitive. On the contrary, it makes the fraction that I expect harder to estimate because I don't know where a match will be counted as
(This is your tool of course; so feel free to ignore my opinion here. Maybe there is a reason to use competitive counts that I don't yet see.)
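To spell out the distinction being discussed, here is a toy contrast of the two counting schemes (hypothetical data, not guide-counter code): "competitive" credits each read to a single offset (here, simply the first match), while "independent" credits a read at every offset where it matches, so the count at the true offset does not depend on spurious matches elsewhere.

```python
# Each element of read_matches lists the offsets at which one read matched.

def count_offsets(read_matches, competitive):
    counts = {}
    for offsets in read_matches:
        # Competitive: only the first matching offset gets credit.
        credited = offsets[:1] if competitive else offsets
        for off in credited:
            counts[off] = counts.get(off, 0) + 1
    return counts

# Three reads all match at the true offset 2; the second read also has a
# spurious match at offset 7 that happens to come first.
reads = [[2], [7, 2], [2]]
print(count_offsets(reads, competitive=True))   # true offset loses a read
print(count_offsets(reads, competitive=False))  # true offset keeps all three
```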
I found this tool after trying to use the function `vmatchPDict` from the Bioconductor package `Biostrings` for barcode matching (which was horribly slow, and took 100 GB of memory if matching with mismatches). `guide-counter` is amazingly fast and easy to use! 👍

I have one issue, however: the barcodes I'm matching are lower complexity than CRISPR sgRNAs would be, i.e. only 12 nucleotides instead of the 20 mentioned in #8.
As such, the automated offset detection for the library sequences yields many false positives, stemming from the read randomly containing this sequence at another position. As a result, I get many more barcode matches than I have reads (usually on average 1.5-2 per read).
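A back-of-envelope estimate shows why 12-nt barcodes scanned at arbitrary offsets produce spurious hits where 20-nt guides would not (the library size of 10,000, read length of 100 bp, and 1 allowed mismatch are assumed for illustration, not taken from the thread):

```python
from math import comb

def expected_random_hits(k, library_size, read_len, mismatches=0):
    """Expected spurious matches per read if every offset is scanned,
    assuming uniform random bases and ignoring overlap between barcodes."""
    # Number of k-mers within `mismatches` substitutions of one barcode.
    near = sum(comb(k, m) * 3**m for m in range(mismatches + 1))
    p_per_position = library_size * near / 4**k
    positions = read_len - k + 1
    return positions * p_per_position

# Hypothetical 10,000-barcode library, 100 bp reads, 1 mismatch allowed:
# ~2 random hits per read for k=12, but vanishingly few for k=20.
print(f"k=12: {expected_random_hits(12, 10_000, 100, 1):.3f}")
print(f"k=20: {expected_random_hits(20, 10_000, 100, 1):.2e}")
```

Under these assumed numbers the k=12 estimate lands in the same ballpark as the 1.5-2 matches per read reported above.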
Would it be possible to restrict the offset matching to a common value for all reads, or provide a custom offset via a command-line option?
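The requested behavior amounts to something like the following sketch (illustrative only; `count_barcodes` and its parameters are made up here and are not an existing guide-counter option): match every read at one user-supplied fixed offset instead of auto-detecting an offset per read.

```python
# Count exact barcode matches at a single fixed offset shared by all reads.

def count_barcodes(reads, barcodes, offset, length=12):
    counts = {bc: 0 for bc in barcodes}
    for read in reads:
        window = read[offset:offset + length]
        if window in counts:
            counts[window] += 1
    return counts

# Two reads carry the (made-up) barcode at offset 3; the third does not.
reads = ["TTAACGTACGTACGTAA", "TTAACGTACGTACGTAA", "TTGGGGGGGGGGGGGAA"]
counts = count_barcodes(reads, {"ACGTACGTACGT"}, offset=3)
```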