Counting lower complexity barcodes yields many false positives #12
Comments
Hi @mschubert - yes, I think that should be relatively easy to do. In your case, do you use: i) a single fixed offset into all the reads
In my case that would be (i), the same offset for all samples, but if the offset were determined per sample that would also work.
Upon reading the However, the value is given as a fraction of the total number of matches, not as a fraction of the 100,000 reads sampled. This means that the desired value of this number changes whenever the off-target number changes. In my case, with 100,000 reads I have 75,000 matches at my desired offset. But when I run with Wouldn't it make more sense to specify this parameter per reads sampled?
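To make the two possible denominators concrete, here is a toy sketch (not guide-counter's actual code; the offsets 23, 5, and 61 and the off-target counts are hypothetical, only the 75,000-of-100,000 figure comes from the numbers above):

```python
# Toy illustration: how the meaning of a fractional threshold shifts
# depending on whether its denominator is the total number of matches
# found, or the number of reads sampled.

def offset_fractions(matches_per_offset, reads_sampled):
    """Return each offset's fraction of total matches and of reads sampled."""
    total_matches = sum(matches_per_offset.values())
    return {
        off: {
            "of_matches": n / total_matches,
            "of_reads": n / reads_sampled,
        }
        for off, n in matches_per_offset.items()
    }

# Hypothetical numbers loosely modeled on the report above: 100,000 reads
# sampled, 75,000 matches at the true offset, plus off-target matches
# at two other (made-up) offsets.
counts = {23: 75_000, 5: 40_000, 61: 35_000}
fracs = offset_fractions(counts, reads_sampled=100_000)

# The true offset holds 75% of the reads but only 50% of the matches: a
# threshold expressed as a fraction of matches moves whenever the
# off-target count changes, while a fraction of reads stays put.
print(fracs[23])
```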
@mschubert My thought in having the denominator be the number of matches (instead of number of reads) is that that fraction should be consistent/predictable even with sequencing error or other problems causing the actual fraction of matching reads overall to drop. Looking at the code, one issue I see is that it currently counts all matches from a read, so to your point, with short kmers you can end up with multiple matches per read. This is definitely not something I had anticipated. Something that is perhaps exacerbating this is that when looking for the prefixes, it is also tolerating mismatches, which obviously expands the possibility of finding multiple. I think there are probably a few ways to fix this and I'd appreciate your input:
(3) would make the prefix auto-detection a bit slower but also more robust, whereas (1) is obviously the simplest. Thoughts?
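One way to avoid counting multiple hits per read during offset detection is to count, per candidate offset, the number of *reads* whose window at that offset matches a known prefix, rather than every position at which some prefix happens to occur. A minimal sketch of that idea (illustrative only, not guide-counter's implementation; `detect_offset` and its parameters are made up for this example):

```python
# Sketch: credit each read at most once per candidate offset, so short or
# low-complexity prefixes that occur repeatedly within a read cannot
# inflate the counts.

def detect_offset(reads, prefixes, max_offset=10):
    """For each candidate offset, count reads whose window at that offset
    exactly matches a known guide prefix; return the best-supported offset
    and the full per-offset counts."""
    k = len(next(iter(prefixes)))  # assume all prefixes share one length
    counts = {off: 0 for off in range(max_offset + 1)}
    for read in reads:
        for off in counts:
            if read[off:off + k] in prefixes:
                counts[off] += 1
    # Pick the offset supported by the most reads.
    return max(counts, key=counts.get), counts
```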
@mschubert I'm not sure if you're up for checking out a branch and building, but I took a shot at implementing (3) in #14. Alternatively, if you're able to share the first 25-100k of your fastq I can give it a shot too.
Thanks for your quick answer @tfenne! I should be able to build this locally. Be aware that I'm already using
Your proposed changes in #14 would still not align with my expectation because the counting is still competitive. On the contrary, it makes the fraction that I expect harder to estimate because I don't know where a match will be counted as
(This is your tool of course; so feel free to ignore my opinion here. Maybe there is a reason to use competitive counts that I don't yet see.)
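To spell out the distinction being discussed, here is a toy contrast of the two counting schemes (hypothetical data, not guide-counter code): "competitive" credits each read to a single offset (here, simply the first match), while "independent" credits a read at every offset where it matches, so the count at the true offset does not depend on spurious matches elsewhere.

```python
# Each element of read_matches lists the offsets at which one read matched.

def count_offsets(read_matches, competitive):
    counts = {}
    for offsets in read_matches:
        # Competitive: only the first matching offset gets credit.
        credited = offsets[:1] if competitive else offsets
        for off in credited:
            counts[off] = counts.get(off, 0) + 1
    return counts

# Three reads all match at the true offset 2; the second read also has a
# spurious match at offset 7 that happens to come first.
reads = [[2], [7, 2], [2]]
print(count_offsets(reads, competitive=True))   # true offset loses a read
print(count_offsets(reads, competitive=False))  # true offset keeps all three
```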
I found this tool after trying to use the function `vmatchPDict` from the Bioconductor package `Biostrings` for barcode matching (which was horribly slow, and took 100 GB of memory if matching with mismatches). `guide-counter` is amazingly fast and easy to use! 👍

I have one issue, however: the barcodes I'm matching are lower complexity than CRISPR sgRNAs would be, i.e. only 12 nucleotides instead of the 20 mentioned in #8.
As such, the automated offset detection for the library sequences yields many false positives, stemming from the read randomly containing this sequence at another position. As a result, I get many more barcode matches than I have reads (usually on average 1.5-2 per read).
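A back-of-envelope estimate shows why 12-nt barcodes scanned at arbitrary offsets produce spurious hits where 20-nt guides would not (the library size of 10,000, read length of 100 bp, and 1 allowed mismatch are assumed for illustration, not taken from the thread):

```python
from math import comb

def expected_random_hits(k, library_size, read_len, mismatches=0):
    """Expected spurious matches per read if every offset is scanned,
    assuming uniform random bases and ignoring overlap between barcodes."""
    # Number of k-mers within `mismatches` substitutions of one barcode.
    near = sum(comb(k, m) * 3**m for m in range(mismatches + 1))
    p_per_position = library_size * near / 4**k
    positions = read_len - k + 1
    return positions * p_per_position

# Hypothetical 10,000-barcode library, 100 bp reads, 1 mismatch allowed:
# ~2 random hits per read for k=12, but vanishingly few for k=20.
print(f"k=12: {expected_random_hits(12, 10_000, 100, 1):.3f}")
print(f"k=20: {expected_random_hits(20, 10_000, 100, 1):.2e}")
```

Under these assumed numbers the k=12 estimate lands in the same ballpark as the 1.5-2 matches per read reported above.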
Would it be possible to restrict the offset matching to a common value for all reads, or provide a custom offset via a command-line option?
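The requested behavior amounts to something like the following sketch (illustrative only; `count_barcodes` and its parameters are made up here and are not an existing guide-counter option): match every read at one user-supplied fixed offset instead of auto-detecting an offset per read.

```python
# Count exact barcode matches at a single fixed offset shared by all reads.

def count_barcodes(reads, barcodes, offset, length=12):
    counts = {bc: 0 for bc in barcodes}
    for read in reads:
        window = read[offset:offset + length]
        if window in counts:
            counts[window] += 1
    return counts

# Two reads carry the (made-up) barcode at offset 3; the third does not.
reads = ["TTAACGTACGTACGTAA", "TTAACGTACGTACGTAA", "TTGGGGGGGGGGGGGAA"]
counts = count_barcodes(reads, {"ACGTACGTACGT"}, offset=3)
```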