clusters_tmp.tsv is empty which caused subprocess.CalledProcessError #128
I am having a similar problem when running with K = 10, 11, or 12 on quite a large dataset (35G). I get the same subprocess error and doublets.err error as in the OP. This seems to be due to the clustering steps failing to identify any clusters, yet failing without an error. This is the output of clusters.err:
I have tried increasing --restarts and changing --min_alt and --min_ref, with no effect (although rerunning the whole pipeline might be required for these changes to take effect; right now I'm deleting the parts related to clusters and doublets and running it again on the same output folder). Are any of you aware of this behaviour? I can't begin to diagnose the problem, as I don't understand why clustering fails. Thanks a lot, Tito
It looks like there was some problem prior to clustering that led to 0 counts? That can't really be it, because it says "total loci used 82k", but it may be something similar. Maybe you could send me your alt.mtx, ref.mtx, and barcodes.tsv files and then I can debug it.
Hey @wheaton5, I have sent you a link to download ref, alt, and barcodes for the data that gives me these issues. I've looked at them briefly, but they seem OK. Thanks for looking into it!
Hello again, I'm not sure what that means in detail, but it might create problems. I'm going to try running cellranger count on only the gene expression portion of the data and then rerunning souporcell to see if that does anything.
Small update: after testing on the output of cellranger count instead of cellranger multi, souporcell ran to completion. I think the problem was that I had provided ALL possible barcodes to SoC, instead of only the ones found to be associated with cells. Could this create problems for the clustering algorithm? Maybe @ZebinWen can comment on whether this applies to them.
@Slacanch thanks for the update. This has been a problem in the past. I should try to fix it. Theoretically it is just adding some noise to the clustering, but something in the code makes it break.
I think I am having the same issue. I get this error when clustering: I am only running k = 2 on one subset file for testing, but my data is also multiplexed. However, the input is only from cellranger count (I didn't run cellranger multi on this subset, as I am testing the difference between demultiplexing via tags and via genotype). Perhaps my barcodes file still contains all the barcodes? @Slacanch, did you alter the barcodes file at all from the cellranger count output? Thanks!
You should use the barcodes file that contains just the cell barcodes.
Here is my clusters.err file. Is it a problem that it says "total loci used 1"? total loci used 1
Yeah, that's definitely a problem lol.
What do your alt.mtx and ref.mtx look like (top few lines)?
So it isn't finding any cells? Could that be because of the barcodes file?
alt.mtx:
%%MatrixMarket matrix coordinate real general
That looks relatively normal, but I'm going to look at a normal run real quick to make sure. That's 12890 cells, 4832 loci, 8825 entries. Maybe that number of entries is low for that number of cells and loci.
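For anyone who wants to make the same sanity check on their own files, here is a minimal sketch in Python. It assumes the usual MatrixMarket coordinate layout (lines starting with % first, then a size line of rows, columns, entries) and it assumes that rows are loci and columns are cells. The function names are made up for illustration and are not part of souporcell.

```python
import io


def read_mtx_dimensions(handle):
    """Return (rows, cols, entries) from the size line of a MatrixMarket file."""
    for line in handle:
        line = line.strip()
        # Skip the %%MatrixMarket banner, % comment lines, and blank lines.
        if not line or line.startswith("%"):
            continue
        rows, cols, entries = (int(x) for x in line.split()[:3])
        return rows, cols, entries
    raise ValueError("no size line found")


def entries_per_cell(handle):
    """Average number of stored entries per column (assumed: one column per cell)."""
    _rows, cols, entries = read_mtx_dimensions(handle)
    return entries / cols


if __name__ == "__main__":
    # Size line rebuilt from the numbers quoted in the thread; the row/column
    # order is an assumption, not copied from the real file.
    example = io.StringIO(
        "%%MatrixMarket matrix coordinate real general\n"
        "4832 12890 8825\n"
    )
    print(f"{entries_per_cell(example):.2f} entries per cell")
```

With the numbers quoted above, that works out to fewer than one stored entry per cell on average, which matches the suspicion that there is very little data to cluster on.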
What is the median UMI per cell?
These are postmortem intestine nuclei, so there are low counts.
You can try rerunning with --min_alt 4 and --min_ref 4 (the default is 10 for each). You should get some more loci. But it's going to be tough to cluster with that few UMI.
Here is an example of a run with a lot of UMI. To be fair, it had a median of about 25k UMI, which is abnormally large. A good number is around 4k for cells, maybe 1k for nuclei.
I see. Would it help if the range was large? The last time I ran a snRNA-seq test on these samples, the range was quite large. Because this is human postmortem tissue, the RNA in the mucosal cells is very degraded. However, the RNA in the neurons, glial cells, and certain immune cells is usually well preserved, so I was expecting these nuclei to have higher UMIs. Would it help to restrict to cells with a higher UMI count?
Here is a more normal run, with about 4k UMI per cell:
%%MatrixMarket matrix coordinate real general
So there is basically almost no data to go off of in your case.
Okay, that makes sense. I also have data from postmortem brain tissue, which has ~1600 UMI per cell. So I will run that as well and make sure that is the issue.
I don't think the low UMI cells should hurt the clustering. The cells with more UMI will contribute that much more to the clustering results, and the low UMI cells should get sort of placed according to how the high UMI cells have shaped the cluster centers. But you may run into errors in the code if there are any cells that have no variants at all. I'm not sure.
Because we are working with postmortem nuclei, we never get the levels that people working with cells get.
Sure. I think it should work fine with 1600 UMI. But again, there might be a bug if any cells contain no variants. Fingers crossed.
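To find out whether any cells really have no variants at all, something like the following could be run on alt.mtx and ref.mtx before clustering. This is only a sketch: it assumes columns are cells in both files and that entries are "row col value" triples, and the helper names are hypothetical, not souporcell functions.

```python
def cells_with_counts(handle):
    """Return (n_cells, set of 1-based column indices with a nonzero entry)."""
    n_cells = None
    seen = set()
    for line in handle:
        line = line.strip()
        # Skip the banner, % comment lines, and blank lines.
        if not line or line.startswith("%"):
            continue
        fields = line.split()
        if n_cells is None:
            # First data line is the size line: rows cols entries.
            n_cells = int(fields[1])
            continue
        if float(fields[2]) != 0:
            seen.add(int(fields[1]))
    if n_cells is None:
        raise ValueError("no size line found")
    return n_cells, seen


def empty_cells(alt_handle, ref_handle):
    """List 1-based cell indices with no alt and no ref counts at any locus."""
    n_alt, alt_seen = cells_with_counts(alt_handle)
    n_ref, ref_seen = cells_with_counts(ref_handle)
    if n_alt != n_ref:
        raise ValueError("alt and ref matrices disagree on the number of cells")
    covered = alt_seen | ref_seen
    return [cell for cell in range(1, n_alt + 1) if cell not in covered]
```

The returned indices are positions in the barcodes file (1-based), so they can be used to drop those barcodes and rerun, if that turns out to be what triggers the failure.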
Okay, so to rerun with --min_alt 4 and --min_ref 4, should I just run the clustering or the whole thing again with those options?
You can rerun just the clustering. The pipeline is set up to resume from a partial previous run: there are .done files for each step. So just delete clustering.done (I think that's right, maybe souporcell.done) and troublet.done, and then rerun with the new options.
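Deleting the markers can be scripted so that it is safe whichever file name turns out to be right. A small sketch, assuming the markers sit directly in the output directory; the three file names are the ones mentioned above (not checked against the pipeline source), and the function name is made up:

```python
from pathlib import Path

# Names taken from the comment above; which of the first two actually exists
# is uncertain, so every candidate is tried and missing ones are ignored.
CANDIDATE_MARKERS = ("clustering.done", "souporcell.done", "troublet.done")


def clear_step_markers(out_dir, names=CANDIDATE_MARKERS):
    """Delete the listed .done markers if present; return the names removed."""
    removed = []
    for name in names:
        marker = Path(out_dir) / name
        if marker.is_file():
            marker.unlink()
            removed.append(name)
    return removed
```

After that, rerunning the pipeline on the same output directory with the new --min_alt and --min_ref values should redo only the steps whose markers were removed.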
Okay, so I ran it on my brain data with ~1600 UMI/cell and it worked well, although there were still high levels of unassigned droplets (46%). I also re-ran it on the cecum data with --min_alt 4 etc., and it doesn't crash, but all the droplets are unassigned. I am using raw donor GT right now, but I plan to impute the genotype data in the future to fill in some gaps; perhaps this would help? I am also going to try to demultiplex using the 1000 Genomes genotype reference and then assign any clusters after the fact; perhaps that will give me more variants per nucleus? Otherwise I guess I have to go back to the drawing board for the cecum samples. Thanks!
Sorry for my late reply, and thanks for your updates @Slacanch @wheaton5. There are 2 barcodes.tsv files in cellranger's output: one is in the raw data directory, the other in the filtered data directory. Following your advice, I changed my barcodes file from the one in the raw data directory to the one in the filtered data directory, and it works! 🤙👍🍻
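A quick way to tell which barcodes file you are holding is to count the lines in each. The sketch below takes the paths as arguments instead of guessing the directory layout, and handles both plain and gzipped files; the function name is hypothetical.

```python
import gzip


def count_barcodes(path):
    """Count non-empty lines in a barcodes file (plain text or .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.strip())


def compare_barcode_files(raw_path, filtered_path):
    """Return (raw_count, filtered_count) for a side-by-side look."""
    return count_barcodes(raw_path), count_barcodes(filtered_path)
```

If the file passed to souporcell has the raw count and not the filtered one, it contains every possible barcode and not just the cell-associated ones, which is the situation described above.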
Hello,
When I ran souporcell_pipeline.py with the parameter -k 15, which could be a potential problem, I got a subprocess.CalledProcessError when it reached the function "doublets(args, ref_mtx, alt_mtx, cluster_file)". From the traceback, I found that after the function "souporcell(args, ref_mtx, alt_mtx, final_vcf)" had run, the "cluster_file", which is named "clusters_tmp.tsv" in the output directory, was empty. I think this caused the function "doublets(args, ref_mtx, alt_mtx, cluster_file)", which needs the "cluster_file" as input, to fail. But I don't know how to deal with it. By the way, when I checked the doublets.err file, I got something that I don't know how to interpret.
Thanks a lot!
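One way to make this failure easier to read would be to check the cluster file before the doublets step is started, so the error points at clustering and not at the later subprocess call. A sketch only; the helper is hypothetical and is not something the pipeline currently does:

```python
import os


def require_nonempty(path, step):
    """Raise a readable error if `path` is missing or empty; else return it."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        raise RuntimeError(
            f"{step} produced no output in {path}; "
            "check clusters.err before running the next step"
        )
    return path
```

Called as require_nonempty("clusters_tmp.tsv", "clustering") right after the clustering step, it would stop the run there with a message pointing at clusters.err.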