
"[error] Can't find right Boundary." with --expectCells and drop-seq data #362

Closed · pinin4fjords opened this issue May 7, 2019 · 26 comments
Labels: alevin (issue is primarily related to alevin), fixed in develop (this bug has been fixed in develop and the issue will be closed when merged into master)

@pinin4fjords

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?

Alevin

Describe the bug

Maybe more of a support request than a bug.

I've got some data that I suspect is lower-quality drop-seq. Running Alevin with default parameters yields very low mapping rates, presumably because elbow-finding is failing. Here's the barcode rank plot; you can see why it's having trouble, though you might see an elbow if you squint a bit.

[Image: Drop-seq barcode rank plot]

I know from the source publication that we expect 278 cells in this case.

Supplying --expectCells yields the boundary error above. For this to work I need to break out the big guns and use --forceCells, yes? What I would really like is to try --expectCells first, to allow Alevin to be a little bit intelligent, and if that fails to fall back to --forceCells. Is that a sensible approach?

If so, could we a) have an informative error code for the boundary error above, so that I can easily detect it and re-submit with --forceCells, or b) if this is generically useful, have a flag in Alevin to do it directly?

To Reproduce
Steps and data to reproduce the behavior:

  • cDNA reads in ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-7142/Control_6_R2.fastq.gz
  • barcode reads in ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTAB/E-MTAB-7142/Control_6_R1.fastq.gz
  • transcript/gene mapping as: transcript_to_gene.txt
  • Run Alevin in drop-seq mode, salmon 0.13.1


Screenshots

### alevin (dscRNA-seq quantification) v0.13.1
### [ program ] => salmon 
### [ command ] => alevin 
### [ libType ] => { ISR }
### [ mates1 ] => { barcodes.fastq.gz }
### [ mates2 ] => { cdna.fastq.gz }
### [ dropseq ] => { }
### [ index ] => { salmon_index }
### [ threads ] => { 12 }
### [ output ] => { ERR2744256_tmp }
### [ tgMap ] => { transcript_to_gene.txt }
### [ dumpFeatures ] => { }
### [ expectCells ] => { 278 }


[2019-05-07 14:40:36.500] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2019-05-07 14:40:36.500] [jointLog] [info] Usage of --validateMappings, without --hardFilter implies use of range factorization. rangeFactorizationBins is being set to 4
[2019-05-07 14:40:36.500] [jointLog] [info] Usage of --validateMappings implies a default consensus slack of 0.2. Setting consensusSlack to 0.2.
[2019-05-07 14:40:36.500] [jointLog] [info] Using default value of 0.87 for minScoreFraction in Alevin
Using default value of 0.6 for consensusSlack in Alevin
[2019-05-07 14:40:36.511] [alevinLog] [info] Processing barcodes files (if Present) 

 
processed 8 Million barcodes
processed 17 Million barcodes
processed 25 Million barcodes
processed 34 Million barcodes
processed 42 Million barcodes
processed 51 Million barcodes
processed 59 Million barcodes
processed 67 Million barcodes
processed 76 Million barcodes
processed 84 Million barcodes
processed 93 Million barcodes
processed 101 Million barcodes
processed 109 Million barcodes
processed 118 Million barcodes
processed 126 Million barcodes
processed 134 Million barcodes
processed 143 Million barcodes
processed 148 Million barcodes

[2019-05-07 14:42:18.567] [alevinLog] [info] Done barcode density calculation.
[2019-05-07 14:42:18.567] [alevinLog] [info] # Barcodes Used: 147999216 / 148515763.
[2019-05-07 14:42:19.496] [alevinLog] [error] Can't find right Boundary.
Please Report this issue on github.

Desktop:

  • OS: Linux
  • Version

Linux ebi6-054.ebi.ac.uk 3.10.0-514.16.1.el7.x86_64 #1 SMP Fri Mar 10 13:12:32 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 7.3 (Maipo)
Release: 7.3
Codename: Maipo


k3yavi self-assigned this May 7, 2019
k3yavi added the alevin label May 7, 2019
@roryk (Contributor) commented May 7, 2019

Heya, we had the same problem with unclear knee plots like this, so we make an alternative plot that looks like these. The first is on Allon's high-quality K562 data from the original inDrop paper; a knee plot works well on this dataset.

[Image: weighted density plot, high-quality inDrop K562 data]

and this is from zebrafish blood, where the data is of poorer quality. The knee plot for this data wasn't clear enough to draw a reasonable cutoff, but this alternative plot makes it easier to pick one:

[Image: weighted density plot, zebrafish blood data]

These plots are made like this:

library(ggplot2)

barcode_plot = function(bcs, sample) {
  # Histogram of log10 per-barcode read counts
  bcs_hist = hist(log10(bcs$count), plot=FALSE, n=50)
  fLog = bcs_hist$count
  xLog = bcs_hist$mids
  # Weight each bin by the number of reads it represents, so y is the
  # fraction of all reads accounted for by barcodes at that count level
  y = fLog * (10^xLog) / sum(fLog * (10^xLog))
  print(qplot(xLog, y) + geom_point() + theme_bw() + ggtitle(sample))
  return(data.frame(x=xLog, y=y, sample=sample))
}

@k3yavi (Member) commented May 7, 2019

Hi @pinin4fjords,

That's a really nice suggestion. We had a similar idea: the plan was to ask the user to run the algorithm with --dumpFeatures, which dumps the filter_read_counts.txt file (the per-CB read counts after sequence correction), visualize the distribution with a small script, and choose a user-defined threshold via the --forceCells command line flag. I do see value in giving a ballpark boundary in the error message, but the problem is a little circular, i.e. if we had some idea about the boundary we would use it anyway and correct it later through the intelligent whitelisting.
We are open to any ideas you may have.

@pinin4fjords (Author)

Thanks @roryk and @k3yavi. The issue we have is that we're trying to run a pipeline in a fairly high-throughput manner to get a 'sensible enough' matrix without too much manual intervention. So I'm trying to avoid anything that requires an eyeballing step, accepting that the matrix we get will be less optimal than one you'd get from manual optimisation. Where possible, our curators are extracting the expected cell numbers from publications, so sometimes I have at least a general idea of where to look for an elbow/feature.

@roryk - have you used your alternate view on the data to automatically derive cutoffs? Does it work well?

@k3yavi:

As I say, the first point is that this is for cases where I have a rough idea of the target cell number; we're generally working with pre-published data (though cell numbers per run are not always available).

From #340 I'd inferred that --expectCells gives Alevin a ballpark to look for a knee within, while --forceCells is a strict cutoff. Is that correct?

That being the case, my thought was to try --expectCells first, and failing that, --forceCells. The problem is that I need to parse STDOUT/STDERR to detect the boundary error from --expectCells, which is not a very robust way of doing things. If you returned informative error codes (anything but 1) for this and other errors, I could detect the error and implement the logic I describe.

@roryk (Contributor) commented May 7, 2019

Hi @pinin4fjords,

We have the same use case, trying to automate as much as possible. For some datasets there really isn't anything you can do; if the data is super bad, both methods are bad. This function does a pretty reasonable job of picking a cutoff based on that histogram:

import logging

import numpy as np

logger = logging.getLogger(__name__)


def guess_depth_cutoff(cb_histogram):
    ''' Guesses at an appropriate barcode cutoff
    '''
    # read_cbhistogram is an external helper that opens a
    # "barcode<TAB>count" histogram file (possibly gzipped)
    with read_cbhistogram(cb_histogram) as fh:
        cb_vals = [int(p.strip().split()[1]) for p in fh]
    histo = np.histogram(np.log10(cb_vals), bins=50)
    vals = histo[0]
    edges = histo[1]
    mids = np.array([(edges[i] + edges[i+1]) / 2 for i in range(edges.size - 1)])
    # weight each bin by the reads it represents, as in the plots above
    wdensity = vals * (10 ** mids) / sum(vals * (10 ** mids))
    baseline = np.median(wdensity)
    wdensity = list(wdensity)
    # find highest density in upper half of barcode distribution
    # (integer division so the slice index is valid under Python 3)
    peak = wdensity.index(max(wdensity[len(wdensity) // 2:]))
    cutoff = None
    # walk left from the peak until the density drops below 2x baseline
    for index, dens in reversed(list(enumerate(wdensity[1:peak]))):
        if dens < 2 * baseline:
            cutoff = index
            break
    if cutoff is None:
        return None
    cutoff = 10 ** mids[cutoff]
    logger.info('Setting barcode cutoff to %d' % cutoff)
    return cutoff

@k3yavi (Member) commented May 7, 2019

Thanks @roryk for the code and @pinin4fjords for the suggestion.
That's correct, the --expectCells and --forceCells flags are designed to be used the way you described above.
Re: error codes, that's a good suggestion. Depending on the timeline, we can either make this part of the next release of alevin, or we can add it on a separate branch that you can compile yourself, or I can give you a Linux binary. Let us know what works for you.

@roryk (Contributor) commented May 7, 2019

I think if you want to automate these steps, the easiest and most robust thing you could do is require everyone to tell you how many cells they captured and sequenced, relax that number a little bit, and do whatever filtering you need downstream to get rid of the crap on the low end. Usually other quality control metrics, like mitochondrial content or genes detected, will filter out the garbage that leaks into the count matrix from being permissive in the initial cell demultiplexing + quantification steps.

@pinin4fjords (Author)

Great, thanks @roryk, I'll give your code a try; and yes, we are applying some filtering downstream. Unfortunately we don't always have control of the metadata associated with the experiments we're handling.

@k3yavi - thank you, the next release will be fine :-)

@k3yavi (Member) commented May 7, 2019

I agree with @roryk's suggestion, and that was indeed the motivation for having the whitelisting step downstream of deduplication in Alevin.
@pinin4fjords sounds good, I will add this to the todo list for the next release.

@pinin4fjords (Author) commented May 8, 2019

I made myself a plot to illustrate @roryk's approach (hopefully I got it right); just leaving it here in case others are interested.

[Image: compare_droplet_threshold.png]

Code here: https://github.com/ebi-gene-expression-group/jon-sandbox/tree/master/droplet_cutoffs.

@roryk (Contributor) commented May 8, 2019

Thanks, Jonathan. Yikes, that bad-quality one does look particularly bad; I have an example that looks like that among my failed examples. Were you able to recover usable data from it?

@roryk (Contributor) commented May 8, 2019

I think your repository is set to private. :)

@pinin4fjords (Author) commented May 8, 2019

@roryk - I was just trying to decide if this dataset is a lost cause; I think it probably is.

Sorry about the private repo; here's the code:

library(ggplot2)
library(DropletUtils)
library(gridExtra)

cl <- commandArgs(trailingOnly = TRUE)

bad <- "raw_cb_frequency_bad.txt"
good <- "raw_cb_frequency_good.txt"

# Pick a cutoff on count as per https://github.com/COMBINE-lab/salmon/issues/362#issuecomment-490160480

pick_roryk_cutoff = function(bcs){
  bcs_hist = hist(log10(bcs), plot=FALSE, n=50)
  mids = bcs_hist$mids
  vals = bcs_hist$count
  wdensity = vals * (10^mids) / sum(vals * (10^mids))
  baseline <- median(wdensity)
  
  # Find highest density in upper half of barcode distribution
  
  peak <- which(wdensity == max(wdensity[((length(wdensity)+1)/2):length(wdensity)]))
  
  # Cutoff is the point before the peak at which density falls below 2X baseline
  
  10^mids[max(which(wdensity[1:peak] < (2*baseline)))]
}

# Plot densities 

barcode_density_plot = function(bcs, roryk_cutoff, knee, inflection, name = 'no name') {
  bcs_hist = hist(log10(bcs), plot=FALSE, n=100)
  counts = bcs_hist$count
  mids = bcs_hist$mids
  y = counts * (10^mids) / sum(counts * (10^mids))
  qplot(y, 10^mids) + geom_point() + theme_bw() + ggtitle(name) + ylab('Count') + xlab ('Density') +
  geom_hline(aes(yintercept = roryk_cutoff, color = paste('roryk_cutoff =', length(which(bcs > roryk_cutoff)), 'cells'))) + 
  geom_hline(aes(yintercept = inflection, color = paste('dropletutils_inflection =', length(which(bcs > inflection)), 'cells'))) +
  geom_hline(aes(yintercept = knee, color = paste('dropletutils_knee =', length(which(bcs > knee)), 'cells'))) +
  scale_y_continuous(trans='log10') + theme(axis.title.y=element_blank()) + labs(color='Thresholds')
}  

# Plot a more standard barcode rank plot

barcode_rank_plot <- function(br.out, roryk_total_cutoff, knee, inflection, name='no name'){
  ggplot(data.frame(br.out), aes(x=rank, y=total)) + geom_line() + scale_x_continuous(trans='log10') + scale_y_continuous(trans='log10') + theme_bw() + 
  geom_hline(aes(yintercept = knee, color = 'dropletutils_knee')) + 
  geom_hline(aes(yintercept = inflection, color = 'dropletutils_inflection')) +
  geom_hline(aes(yintercept = roryk_total_cutoff, color = 'roryk_cutoff')) +
  ggtitle(name) + ylab('Count') + xlab('Rank') + theme(legend.position = "none")
}

# Plot the different plots and threshold statistics alongside one another
  
barcode_files <- list(
  'good quality' = good,
  'bad quality' = bad
)

plots <- lapply(names(barcode_files), function(name){
  
  bcf <- barcode_files[[name]]
  barcodes <- read.delim(bcf, header = FALSE)
  
  # Get the roryk cutoff
  roryk_count_cutoff <- pick_roryk_cutoff(barcodes$V2)
  
  # Run dropletUtils' barcodeRanks to get knee etc
  br.out <- barcodeRanks(t(barcodes[,2,drop=FALSE]))
  
  dropletutils_knee <- br.out$knee
  dropletutils_inflection <- br.out$inflection
  
  list(
    dropletutils = barcode_rank_plot(br.out, roryk_count_cutoff, dropletutils_knee, dropletutils_inflection, name = name),
    roryk = barcode_density_plot(barcodes$V2, roryk_count_cutoff, dropletutils_knee, dropletutils_inflection, name = name)
  )
})
names(plots) <- names(barcode_files)

png(width = 800, height = 600, file='compare_droplet_threshold.png')
grid.arrange(plots$bad$dropletutils, plots$bad$roryk, plots$good$dropletutils, plots$good$roryk)
dev.off()

@pinin4fjords (Author) commented May 9, 2019

@k3yavi - is it possible to skip the thresholding entirely, so as to use downstream tools to remove empty barcodes instead?

@k3yavi (Member) commented May 9, 2019

It is, if you provide a list of CBs to use through the command line flag --whitelist. But again, I think it's a circular problem: if you know the list of CBs to use, you might have already figured out the frequency distribution of each CB by parsing the fastq, either by using --dumpFeatures or externally, maybe through awk.

One other option is to use --keepCBfraction. It takes a number in (0, 1], which basically tells Alevin to use that fraction of the total observed CBs. The caveat there is figuring out a decent value, as the CB frequency distribution is long-tailed; if, say, you provide 1, then Alevin will quantify each and every observed CB and slow down the full pipeline.

@pinin4fjords (Author) commented May 9, 2019

Thanks @k3yavi, --keepCBfraction isn't in the docs, so I missed it. Unless it leads to completely unfeasible run times, --keepCBfraction 1 combined with downstream filtering may be the most robust way to handle things in my high-throughput situation (as alluded to by @roryk).

Is there a way of combining this with a minimum UMI count per CB to remove just the most obvious junk and hopefully somewhat limit the impact on runtimes?

@k3yavi (Member) commented May 9, 2019

Unfortunately this is getting into somewhat under-explored territory.
Retrospectively it does make sense to have such a threshold, but sadly in the current version we don't have that option. If it helps, I just got some stats for 10x data, and generally it's around the top <1% of the data that is useful; this should be even lower in drop-seq. Although it's possible that the sample size of the 5 datasets below is too small.

useful -> total
4k   -> 1,239,476
8k   -> 1,877,718
900 -> 657,180
2k   -> 1,653,795
9k   -> 2,812,291

My guess is that keeping keepCBFraction even at 0.1 (i.e. 10%) would get you a decent number of empty/low-confidence CBs to correct for downstream, but you might have to explore a little. Another caveat is that having too many CBs blows up the memory for alevin's downstream whitelisting, and currently there is no option to disable that whitelisting; again, a must-have feature which is missing.

Thanks for this very useful discussion; we will definitely improve/add these options to alevin in the next release.

@pinin4fjords (Author)

Thanks @k3yavi - I think those options would really help us use Alevin in production; I look forward to the next release.

I'll do some more testing in the meantime.

@k3yavi (Member) commented May 9, 2019

Sounds good, I will report back as soon as we have the next release.
Just a quick correction: the useful percentage is less than 1% (not 1-4%, it's 0.1-0.4%); I was off by an order of magnitude in the percentages reported above.

@pinin4fjords (Author)

Sorry for the continual questions; one more thing, @k3yavi.

As a way of tackling this, I could do a pre-run of Alevin with --dumpFeatures --noQuant to derive a very relaxed whitelist with the obvious bad stuff removed, right? So e.g.:

awk '{ if ($2 > 100) { print $1} }' raw_cb_frequency.txt

... gets me a whitelist of all starting cell barcodes with > 100 reads (before deduplication)? I don't think I want to actually supply these as a whitelist (they need correction), but it seems like the count would be a good thing to pass to --forceCells in a full Alevin run to generate a 'permissive' matrix I can filter later.

Is that a sensible approach?

@k3yavi (Member) commented May 9, 2019

Yep, that is what I was getting at earlier. Using --dumpFeatures you can get the file filtered_cb_frequency.txt (this is after CB sequence correction; raw_cb_frequency.txt is before correction). As the file is reverse-sorted by frequency, you can count the number of CBs with frequency >100 and use that count for --forceCells. It does make alevin a multi-pass algorithm, but that is the current recourse for complex datasets where it fails to find the knee.

One thing to note here is the downstream whitelisting. If the number of CBs becomes too large, it can potentially blow up the memory. However, since you are already parsing the names of the CBs (in your awk script), you can pass those CBs via --whitelist; alevin will still sequence-correct the remaining CBs within 1 edit distance of them. The benefit of this is that alevin will not run its downstream whitelisting, as a list of whitelisted CBs has been provided externally.

@pinin4fjords (Author) commented May 9, 2019

Okay, thanks @k3yavi. Just to be clear: you're saying I should derive the whitelist from filtered_cb_frequency.txt rather than the raw file? The filtered file is much smaller in the case of the bad data above (more so than I'd expect from the CB correction: 984 entries), so I was afraid it had already been subjected to knee detection. I also note that it's not in fact sorted by default.

@k3yavi (Member) commented May 9, 2019

Ah, you are right; raw is the right file, since the filtered file might have been wrongly thresholded.
I think raw was the one that was reverse-sorted too; I'll check that in a bit.

@pinin4fjords (Author) commented May 9, 2019

Yep, raw is reverse-sorted; don't worry, thanks.
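
For reference, the full two-pass flow we've converged on would look something like this (a sketch only: output directory names and the >100 threshold are illustrative, and I'm assuming raw_cb_frequency.txt lands under the alevin subdirectory of the output):

# Pass 1: dump barcode frequencies without quantifying
salmon alevin -l ISR -i salmon_index -1 barcodes.fastq.gz -2 cdna.fastq.gz \
    --dropseq --tgMap transcript_to_gene.txt -o pass1 --dumpFeatures --noQuant

# Keep barcodes seen more than 100 times as a relaxed whitelist
awk '{ if ($2 > 100) { print $1 } }' pass1/alevin/raw_cb_frequency.txt > whitelist.txt

# Pass 2: quantify against the external whitelist; supplying it also
# skips alevin's downstream intelligent whitelisting
salmon alevin -l ISR -i salmon_index -1 barcodes.fastq.gz -2 cdna.fastq.gz \
    --dropseq --tgMap transcript_to_gene.txt -o pass2 --whitelist whitelist.txt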

k3yavi added the fixed in develop label May 28, 2019
@k3yavi (Member) commented Jun 2, 2019

Hi @pinin4fjords ,

We have released a new version, v0.14.0. In the latest version we have added different error codes based on the stage at which the pipeline fails. The error codes are as follows:

1: Error while mapping reads and/or generic errors.
64: Error in knee estimation / Cellular Barcode sequence correction.
74: Error while deduplicating UMI and/or EM optimization.
84: Error while intelligent whitelisting.
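
With these codes, the --expectCells-first fallback discussed above can now be scripted; a minimal bash sketch (file and index names are from the example at the top of this issue, and 278 is that dataset's expected cell count; otherwise placeholders):

salmon alevin -l ISR -i salmon_index -1 barcodes.fastq.gz -2 cdna.fastq.gz \
    --dropseq --tgMap transcript_to_gene.txt -o quant --expectCells 278
if [ $? -eq 64 ]; then
    # Knee estimation failed; retry with a hard cutoff instead
    salmon alevin -l ISR -i salmon_index -1 barcodes.fastq.gz -2 cdna.fastq.gz \
        --dropseq --tgMap transcript_to_gene.txt -o quant --forceCells 278
fi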

As we discussed earlier, you can control the expected behavior by tweaking the following two flags:

--keepCBfraction: a value in (0, 1], i.e. the fraction of CBs to keep for quantification.
--freqThreshold: default 10, the minimum frequency required to quantify a CB.

Just a heads up: alevin with the current release will by default dump dumpFeatures.txt, which contains the per-CB-level features. Please check the release notes for more details.

Closing this issue for now, but feel free to reopen if you face any issues or have questions.

k3yavi closed this as completed Jun 2, 2019
@pinin4fjords (Author)

Sorry @k3yavi - I was away on leave. There seem to be lots of helpful titbits in this release - thank you.

@pinin4fjords (Author)

Hi @k3yavi - apologies, just coming back to this with an eye to updating our pipelines, and I wanted to clarify something.

Just to recap: right now I'm running the previous Alevin version with --dumpFeatures --noQuant, filtering raw_cb_frequency.txt by barcode frequency (e.g. 10), and then using the whitelist in a further Alevin run, like --whitelist whitelist.txt --forceCells (whitelist size). I then get a relaxed matrix I can filter in downstream analysis.

Am I right in thinking that with the new version I can now just do a single run and say --keepCBfraction 1 --freqThreshold 10? Is the freqThreshold applied before the keepCBfraction?
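
i.e., something like this single invocation (a sketch only; file and index names are per my earlier runs, and I'm assuming the two new flags combine this way):

salmon alevin -l ISR -i salmon_index -1 barcodes.fastq.gz -2 cdna.fastq.gz \
    --dropseq --tgMap transcript_to_gene.txt -o quant \
    --keepCBfraction 1 --freqThreshold 10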
