"[error] Can't find right Boundary." with --expectCells and drop-seq data #362
Heya, we had the same problem with unclear knee plots like this. We made an alternative plot that looks like these. The first is on high-quality data, Allon's K562 data from the original inDrop paper; a knee plot works well on this dataset. The second is from blood in the zebrafish, where the data quality is lower. The knee plot for this data wasn't clear enough to draw a reasonable cutoff, but this alternative plot makes it easier to pick one. These plots are made like this:
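(The snippet itself didn't survive in this copy of the thread; below is a minimal Python sketch of the kind of plot described, assuming a two-column barcode/count file such as alevin's raw_cb_frequency.txt. The file name and layout are assumptions, and the original may well have been in another language.)

```python
import numpy as np
import matplotlib.pyplot as plt

# Load per-barcode read counts; assumes whitespace-separated "barcode count" rows.
counts = np.loadtxt("raw_cb_frequency.txt", dtype=str)[:, 1].astype(float)

# Histogram of log10 counts: real cells and empty droplets tend to show up
# as separate modes, which is often easier to read than a knee plot.
plt.hist(np.log10(counts[counts > 0]), bins=100)
plt.xlabel("log10(reads per barcode)")
plt.ylabel("number of barcodes")
plt.show()
```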
Hi @pinin4fjords, that's a really nice suggestion. We had a similar idea; basically the plan was to ask the user to run the algorithm with
Thanks @roryk and @k3yavi. The issue we have is that we're trying to run a pipeline in a fairly high-throughput manner to get a sensible-enough matrix without too much manual intervention. So I'm trying to avoid anything that requires an eyeballing step, accepting that the matrix we get will be less optimal than one you'd get from manual optimisation. Where possible, our curators are extracting the expected cell numbers from publications, so sometimes I have at least a general idea of where to look for an elbow/knee feature. @roryk - have you used your alternate view on the data to automatically derive cutoffs? Does it work well?

As I say, the first point is that this is for cases where I have a rough idea of the target cell number - we're generally working with pre-published data (though cell numbers per run are not always available). From #340 I'd inferred that --expectCells gives Alevin a ballpark to look for a knee within, while --forceCells is a strict cutoff. Is that correct? That being the case, my thought was to try --expectCells first, and failing that --forceCells. The problem is that I need to parse the STDOUT/ERR to detect the boundary error from --expectCells, which is not a very robust way of doing things. If you returned informative error codes (anything but 1) on this and other errors, I could detect the error and implement the logic I describe.
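(For illustration, the fallback logic being asked for might look like the sketch below: run alevin with --expectCells, scan stderr for the boundary message, and retry with --forceCells. The invocation details - index, read files, thread count - are placeholders, and matching on log text is exactly the fragility being complained about.)

```python
import subprocess

def run_alevin(extra_args):
    # Illustrative invocation: index, reads, and output paths are placeholders.
    cmd = ["salmon", "alevin", "-l", "ISR", "--dropseq",
           "-i", "txome_index", "-1", "reads_1.fq.gz", "-2", "reads_2.fq.gz",
           "-p", "8", "-o", "alevin_out"] + extra_args
    return subprocess.run(cmd, capture_output=True, text=True)

expected = 278
result = run_alevin(["--expectCells", str(expected)])

# Fragile by design: we have to recognise failure from log text, which is
# why an informative exit code would be preferable.
if result.returncode != 0 or "Can't find right Boundary" in result.stderr:
    result = run_alevin(["--forceCells", str(expected)])
```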
Hi @pinin4fjords, we have the same use case, trying to automate as much as possible. For some datasets there really isn't anything you can do; if the data is super bad, both methods are bad. This function does a pretty reasonable job of picking a cutoff based on that histogram:
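(The function itself was also lost from the comment; below is a rough Python sketch of the idea - smooth the histogram of log10 per-barcode counts and take the first valley to the right of the empty-droplet peak. The smoothing parameters are guesses, not the original's.)

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def pick_cutoff(counts, bins=100, sigma=3):
    """Pick a count cutoff at the valley between the two modes of the
    log10 per-barcode count histogram."""
    logc = np.log10(counts[counts > 0])
    hist, edges = np.histogram(logc, bins=bins)
    smooth = gaussian_filter1d(hist.astype(float), sigma=sigma)
    # The tallest peak is the empty-droplet mode at the low end;
    # walk right from it to the first local minimum.
    peak = int(np.argmax(smooth))
    for i in range(peak + 1, len(smooth) - 1):
        if smooth[i] <= smooth[i - 1] and smooth[i] < smooth[i + 1]:
            return 10 ** edges[i]
    return None
```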
Thanks @roryk for the code and @pinin4fjords for the suggestion.
I think if you want to automate these steps, though, the easiest and most robust thing you could do is require everyone to tell you how many cells they captured and sequenced, relax that number a little, and do whatever filtering you need downstream to get rid of the junk at the low end. Usually other quality-control metrics, like mitochondrial content or genes detected, will filter out the garbage that leaks into the count matrix from being permissive in the initial cell demultiplexing + quantification steps.
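(As a concrete example of that downstream cleanup, a typical pass with scanpy might look like the following; the input file, the thresholds, and the human "MT-" gene prefix are all assumptions.)

```python
import scanpy as sc

adata = sc.read("alevin_counts.h5ad")  # a permissively quantified matrix

# Flag mitochondrial genes and compute standard QC metrics.
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Drop barcodes that look like debris: too few genes, or mito-heavy.
sc.pp.filter_cells(adata, min_genes=200)
adata = adata[adata.obs["pct_counts_mt"] < 20].copy()
```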
I agree with @roryk's suggestion, and that was indeed the motivation to have the whitelisting step downstream of deduplication in Alevin.
I made myself a plot to illustrate @roryk's approach (hopefully got it right) - just leaving it here in case others are interested. Code here: https://github.com/ebi-gene-expression-group/jon-sandbox/tree/master/droplet_cutoffs.
Thanks, Jonathan. Yikes, that bad-quality one looks particularly bad; I have an example that looks like that among my failed examples. Were you able to recover usable data from it?
I think your repository is set to private. :)
@roryk - I was just trying to decide if this dataset is a lost cause - I think it probably is. Sorry about the private repo - here's the code:
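(The inlined code was lost here as well; the linked droplet_cutoffs repository holds the real version. Below is a Python sketch of the same illustration, putting the knee view and the histogram view side by side, with a cutoff drawn at the expected cell number - 278 in this thread.)

```python
import numpy as np
import matplotlib.pyplot as plt

# Per-barcode read counts, e.g. column 2 of alevin's raw_cb_frequency.txt.
counts = np.loadtxt("raw_cb_frequency.txt", dtype=str)[:, 1].astype(float)
counts = np.sort(counts)[::-1]

# Mark the count at the expected cell number (assumes > 278 barcodes).
cutoff = counts[277]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: classic barcode-rank ("knee") view.
ax1.loglog(np.arange(1, len(counts) + 1), counts)
ax1.axhline(cutoff, linestyle="--")
ax1.set(xlabel="barcode rank", ylabel="reads per barcode")

# Right: the histogram view, with the same cutoff marked.
ax2.hist(np.log10(counts[counts > 0]), bins=100)
ax2.axvline(np.log10(cutoff), linestyle="--")
ax2.set(xlabel="log10(reads per barcode)", ylabel="barcodes")
plt.show()
```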
@k3yavi - is it possible to skip the thresholding entirely, so as to use downstream tools to remove empty barcodes instead?
It is, if you provide a list of CB to use through the command line flag --whitelist. But again, I think it's a circular problem: if you know the list of CB to use, you might have already figured out the frequency distribution of each CB by parsing the fastq, either by using --dumpFeatures or externally, maybe through awk. One other option is to use --keepCBfraction; it takes a number in (0, 1], which basically tells Alevin to use X fraction of CB from the total observed. The caveat there is figuring out a decent value of X, as the CB frequency distribution is long-tailed; if, say, you provide 1, then Alevin will quantify each and every observed CB and slow down the full pipeline.
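(For the "externally, maybe through awk" route, a Python equivalent that tallies raw barcode frequencies straight from read 1 might look like this; the 12 bp drop-seq barcode length and the file name are assumptions.)

```python
import gzip
from collections import Counter

freq = Counter()
with gzip.open("reads_1.fq.gz", "rt") as fh:
    for i, line in enumerate(fh):
        if i % 4 == 1:              # sequence line of each FASTQ record
            freq[line[:12]] += 1    # drop-seq: first 12 bp = cell barcode

# Reverse-sorted frequency table, analogous to raw_cb_frequency.txt.
for cb, n in freq.most_common():
    print(cb, n)
```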
Thanks @k3yavi - --keepCBfraction isn't in the docs, so I missed it. Unless it leads to completely unfeasible run times, --keepCBfraction 1 combined with downstream filtering may be the most robust way to handle things in my high-throughput situation (as alluded to by @roryk). Is there a way of combining this with a minimum UMI count per CB, to remove just the most obvious junk and hopefully somewhat limit the impact on runtimes?
Unfortunately it's getting into a little underexplored territory. My guess is keeping the

Thanks for this very useful discussion - we will definitely improve/add these options to alevin with the next release.
Thanks @k3yavi - I think those options would really help us use Alevin in production - looking forward to the next release. I'll do some more testing in the meantime.
Sounds good, I will report back as soon as we have the next release.
Sorry for the continual questions - one more thing, @k3yavi. As a way of tackling this, I could do a pre-run of Alevin with --dumpFeatures --noQuant to derive a very relaxed whitelist with the obvious bad stuff removed, right? So e.g.:
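(The command itself was lost from the comment - a later reply refers to it as an awk script. A Python equivalent of the filtering described, assuming raw_cb_frequency.txt from --dumpFeatures contains whitespace-separated barcode/count rows:)

```python
# Keep barcodes with more than 100 reads and write them out; the resulting
# count is what would be passed to --forceCells in the full run.
n = 0
with open("raw_cb_frequency.txt") as fh, open("relaxed_whitelist.txt", "w") as out:
    for line in fh:
        cb, count = line.split()
        if float(count) > 100:
            out.write(cb + "\n")
            n += 1
print(n)
```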
... gets me a whitelist of all starting cell barcodes with > 100 reads (before deduplication)? I don't think I want to actually supply these as a whitelist (they need correction), but it seems like the count would be a good thing to pass to --forceCells in a full Alevin run to generate a 'permissive' matrix I can filter later. Is that a sensible approach?
Yep, that is what I was reflecting on earlier. One thing to note here would be the downstream whitelisting: if the number of CB becomes too large, it can potentially blow up the memory. However, since you are already parsing the names of the CB (in your awk script), you can pass those CB in as the --whitelist.
Okay, thanks @k3yavi. Just to be clear - you're saying I should derive the whitelist from the filtered_cb_frequency rather than the raw? This is a much smaller file in the case of the bad data above (more so than I'd expect from the CB correction, 984), so I was afraid it had already been subjected to knee detection. I also note that it's not in fact sorted by default.
Ah, you are right - raw is the right file, since the filtered file might have been wrongly thresholded.
Yep, raw is reverse-sorted, don't worry - thanks.
Hi @pinin4fjords, we have released a new version.
As we have discussed earlier, you can control the expected behavior by tweaking the following two flags.
Just a heads up: alevin with the current release will by default dump the raw CB frequencies. Closing this issue for now, but feel free to reopen if you face any issues or have questions.
Sorry @k3yavi - was away on leave. Seems to be lots of helpful titbits in this release - thank you.
Hi @k3yavi - apologies, just coming back to this with an eye to updating our pipelines, and wanted to clarify. Just to recap: right now I'm running the previous Alevin version with the pre-run/--forceCells approach described above. Am I right in thinking that with the new version I can now just have a single run, using the new flags?
Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
Alevin
Describe the bug
Maybe more of a support request than a bug.
I've got what I suspect is some lower-quality drop-seq data. Running Alevin with default parameters yields very low mapping rates, presumably because elbow-finding is failing. Here's the barcode rank plot - you can see why it's having trouble; you might see an elbow if you squint a bit.
I know from the source publication that we expect 278 cells in this case.
Supplying --expectCells yields the boundary error above. For this to work I need to break out the big guns and use --forceCells, yes? What I would really like is to try --expectCells first to allow Alevin to be a little bit intelligent, and if that fails to use --forceCells. Is that a sensible approach?
If so, could we a) have an informative error code on the boundary error above such that I can easily detect that error and re-submit with --forceCells, or b) if this is generically useful have a flag in Alevin to do it directly?
Desktop:
Linux ebi6-054.ebi.ac.uk 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 7.3 (Maipo)
Release: 7.3
Codename: Maipo