Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clusters_tmp.tsv is empty which caused subprocess.CalledProcessError #128

Closed
ZebinWen opened this issue Dec 25, 2021 · 29 comments
Closed

clusters_tmp.tsv is empty which caused subprocess.CalledProcessError #128

ZebinWen opened this issue Dec 25, 2021 · 29 comments

Comments

@ZebinWen
Copy link

ZebinWen commented Dec 25, 2021

Hello,
When I was running souporcell_pipeline.py by setting the parameter -k 15 which could be a potential problem, I got a subporcess.CalledProcessError when it went to the function "doublets(args, ref_mtx, alt_mtx, cluster_file)". By the information of Traceback, I found that after the function "souporcell(args, ref_mtx, alt_mtx, final_vcf)" was run, the "cluster_file" which was named "clusters_tmp.tsv" in the output directory is empty. I think this might caused the function "doublets(args, ref_mtx, alt_mtx, cluster_file)" went wrong which need the "cluster_file" as input. But now I don't know how to deal with it.

running souporcell doublet detection
Traceback (most recent call last) :
File "/opt/souporcell/souporcell_pipeline.py", line 596, in
doublets(args, ref_mtx, alt_mtx, cluster_file)
File "/opt/souporcell/souporcell_pipeline.py", line 541, in doublets
subprocess.check_call([directory+"/troublet/target/release/troublet", "--alts", alt_mtx, "--refs", ref_mtx, "--clusters", cluster_file], stdout = dub, stderr = err)
File "/usr/local/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/souporcell/troublet/target/release/troublet', '--alts', '/home/pro5/results/alt.mtx', '--refs', '/home/pro5/results/ref.mtx', '--clusters', '/home/pro5/results/clusters_tmp.tsv']' returned non-zero exit status 101.

By the way, when I checked the doublets.err file, I got something that I don't know what it means.

thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 111406', src/main.rs:328:22
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Thanks a lot!

@Slacanch
Copy link

Slacanch commented Jan 7, 2022

I am having a similar problem when running with K = 10, 11 or 12 on quite a large dataset (35G). I get the same subprocess error and doublets.err error as in the OP.

This seems to be due to the clustering steps failing to identify any clusters, yet failing without an error. this is the output of clusters.err:

total loci used 82114
thread 2 iteration 0 done with -inf, best so far -inf
thread 3 iteration 0 done with -inf, best so far -inf
thread 4 iteration 0 done with -inf, best so far -inf
thread 5 iteration 0 done with -inf, best so far -inf
thread 6 iteration 0 done with -inf, best so far -inf
thread 0 iteration 0 done with -inf, best so far -inf
thread 1 iteration 0 done with -inf, best so far -inf
thread 7 iteration 0 done with -inf, best so far -inf
thread 2 iteration 1 done with -inf, best so far -inf
thread 4 iteration 1 done with -inf, best so far -inf
thread 3 iteration 1 done with -inf, best so far -inf
thread 6 iteration 1 done with -inf, best so far -inf
thread 0 iteration 1 done with -inf, best so far -inf
thread 1 iteration 1 done with -inf, best so far -inf
thread 5 iteration 1 done with -inf, best so far -inf
thread 2 iteration 2 done with -inf, best so far -inf
thread 7 iteration 1 done with -inf, best so far -inf
thread 3 iteration 2 done with -inf, best so far -inf
thread 4 iteration 2 done with -inf, best so far -inf
thread 6 iteration 2 done with -inf, best so far -inf
thread 1 iteration 2 done with -inf, best so far -inf
thread 0 iteration 2 done with -inf, best so far -inf
thread 2 iteration 3 done with -inf, best so far -inf
thread 5 iteration 2 done with -inf, best so far -inf
thread 3 iteration 3 done with -inf, best so far -inf
thread 7 iteration 2 done with -inf, best so far -inf
thread 6 iteration 3 done with -inf, best so far -inf
thread 4 iteration 3 done with -inf, best so far -inf
thread 1 iteration 3 done with -inf, best so far -inf
thread 0 iteration 3 done with -inf, best so far -inf
thread 5 iteration 3 done with -inf, best so far -inf
thread 7 iteration 3 done with -inf, best so far -inf
best total log probability = -inf

I have tried increasing --restarts and changing --min_alt and --min_ref with no effect (although rerunning the whole pipeline might be required for these changes to take effect, right now i'm deleting the parts related to clusters and doublets and running it again on the same output folder).

is any of you aware of this behaviour? i can't begin to diagnose the problem as i don't understand why clustering fails.

thanks a lot

Tito

@wheaton5
Copy link
Owner

wheaton5 commented Jan 7, 2022

It looks like there was some problem prior to clustering that led to 0 counts? That cant really be it because it says "total loci used 82k" but something similar. Maybe you could send me your alt.mtx and ref.mtx and barcodes.tsv file and then i can debug it.

@Slacanch
Copy link

Hey @wheaton5,
as an update, i've tried rerunning the pipeline completely with different min_alt, min_ref, and --restarts parameters to no effect.

I have sent you a link to download ref, alt, and barcodes for the data that gives me these issues, i've looked at them briefly but they seem ok.

Thanks for looking into it!

@Slacanch
Copy link

Hello again,
Something i forgot to mention, the dataset that is creating these clustering problems is not a straight up gene expression run, but rather a 10X genomics cellplexed sample run with cellranger multi, not cellranger count.

what that means in detail i'm not sure, but it might create problems. i'm going to try running cellranger count on only the gene expression portion of the data and then rerunning souporcell, see if that does anything.

@Slacanch
Copy link

Small update. after testing on the output of cellranger count instead of cellranger multi, souporcell run to completion. I think the problem was that i had provided ALL possible barcodes to SoC, instead of the ones found to be associated with cells. could this create problems for the clustering algorithm?

maybe @ZebinWen can comment on wether this applies to them.

@wheaton5
Copy link
Owner

@Slacanch thanks for the update. This has been a problem in the past. I should try to fix it. Theoretically it is just adding some noise to the clustering. But something in the code makes it break.

@changostraw
Copy link

I think I am having the same issue. I get this error when clustering:
running souporcell clustering
/opt/souporcell/souporcell/target/release/souporcell -k 2 -a /project/rrg-gturecki/Sarah/Demuxafy/Soupercell/cecum_poolC/alt.mtx -r /project/rrg-gturecki/Sarah/Demuxafy/Soupercell/cecum_poolC/ref.mtx --restarts 100 -b /project/rrg-gturecki/Sarah/multiplex_test_demuxify_run2_cecum_poolC/outs/filtered_feature_bc_matrix/barcodes.tsv.gz --min_ref 10 --min_alt 10 --threads 2 --known_genotypes /project/rrg-gturecki/Sarah/Demuxafy/Soupercell/cecum_poolC/common_variants_covered.vcf --known_genotypes_sample_names 309_GT12139_SingleCellGenotyping_GT-P01_309 400_GT14352_SingleCellGenotyping_GT-P01_400
running souporcell doublet detection
Traceback (most recent call last):
File "/opt/souporcell/souporcell_pipeline.py", line 598, in
doublets(args, ref_mtx, alt_mtx, cluster_file)
File "/opt/souporcell/souporcell_pipeline.py", line 543, in doublets
subprocess.check_call([directory+"/troublet/target/release/troublet", "--alts", alt_mtx, "--refs", ref_mtx, "--clusters", cluster_file], stdout = dub, stderr = err)
File "/opt/conda/envs/py36/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/souporcell/troublet/target/release/troublet', '--alts', '/project/rrg-gturecki/Sarah/Demuxafy/Soupercell/cecum_poolC/alt.mtx', '--refs', '/project/rrg-gturecki/Sarah/Demuxafy/Soupercell/cecum_poolC/ref.mtx', '--clusters', '/project/rrg-gturecki/Sarah/Demuxafy/Soupercell/cecum_poolC/clusters_tmp.tsv']' returned non-zero exit status 101.

I am only running k = 2 on one subset file for testing, but my data is also multiplexed. However, the input is only from cellranger count (I didn't run cellranger multi on this subset as I am testing the difference between demultiplexing via tags and via genotype). However, perhaps my barcodes file still contains all the barcodes? @Slacanch did you alter the barcodes file at all from the cellranger count output?

Thanks!

@wheaton5
Copy link
Owner

wheaton5 commented Jun 8, 2022

you should use the barcodes file that is just the cell barcodes.

@changostraw
Copy link

Here is my clusters.err file. Is it a problem that it says total loci used 1?

total loci used 1
binomial 0 0 1 0 -80.59912 inf
binomial 1 0 1 0 -80.59912 inf
binomial 0 0 2 0 -23.148155 57.450966
binomial 1 0 2 0 -23.148155 57.450966
binomial 0 0 3 1 -23.14816 -0.000005722046
binomial 1 0 3 1 -23.14816 -0.000005722046
binomial 0 0 4 2 -23.14816 0
binomial 1 0 4 2 -23.14816 0
binomial 0 0 5 3 -23.14816 0
binomial 0 0 6 4 -23.14816 0
binomial 1 0 5 3 -23.14816 0
binomial 0 0 7 5 -23.14816 0
binomial 1 0 6 4 -23.14816 0
binomial 0 0 8 6 -23.14816 0
binomial 1 0 7 5 -23.14816 0
binomial 0 0 9 7 -23.14816 0
binomial 1 0 8 6 -23.14816 0
binomial 0 0 10 8 -23.14816 0
thread 0 iteration 0 done with -23.14816, best so far -23.14816
binomial 1 0 9 7 -23.14816 0
binomial 1 0 10 8 -23.14816 0
thread 1 iteration 0 done with -23.14816, best so far -23.14816
binomial 0 1 1 0 -80.59912 inf
binomial 0 1 2 0 -23.148155 57.450966
binomial 0 1 3 1 -23.14816 -0.000005722046
binomial 1 1 1 0 -80.59912 inf
binomial 1 1 2 0 -23.148155 57.450966
binomial 0 1 4 2 -23.14816 0

@wheaton5
Copy link
Owner

wheaton5 commented Jun 8, 2022

yeah, thats definitely a problem lol.

@wheaton5
Copy link
Owner

wheaton5 commented Jun 8, 2022

what do your alt.mtx and ref.mtx look like (top few lines)

@changostraw
Copy link

So it isn't finding any cells? Could that be because of the barcodes file?

@changostraw
Copy link

alt.mtx:
%%MatrixMarket matrix coordinate real general
% written by sprs
4832 12890 8825
1 51 0
1 9958 0
1 10740 0
2 782 1
3 1316 1
4 1356 0
7 1778 0
7 9498 0
8 3356 1
8 11915 1
8 12111 0
9 468 1
9 3360 1
9 5943 1
9 7077 1
9 10577 1
10 1366 0
10 5081 0
10 6506 0
11 1366 0
11 5081 0
11 6506 0
12 12366 0
13 2410 0
13 3837 0
14 566 0

@changostraw
Copy link

%%MatrixMarket matrix coordinate real general
% written by sprs
4832 12890 8825
1 51 1
1 9958 1
1 10740 1
2 782 0
3 1316 0
4 1356 1
7 1778 1
7 9498 1
8 3356 0
8 11915 0
8 12111 1
9 468 0
9 3360 0
9 5943 0
9 7077 0
9 10577 0
10 1366 1
10 5081 1
10 6506 1
11 1366 1
11 5081 1
11 6506 1
12 12366 1
13 2410 1
13 3837 1
14 566 1

@wheaton5
Copy link
Owner

wheaton5 commented Jun 8, 2022

that looks relatively normal, but im going to look at a normal run real quick to make sure. Thats 12890 cells, 4832 loci, 8825 entries. Maybe that number of entries is low for that number of cells and loci

@wheaton5
Copy link
Owner

wheaton5 commented Jun 8, 2022

What is the umi/cell median?

@changostraw
Copy link

These are postmortem intestine nuclei so there are low counts
Median UMI counts per cell = 230
Median genes per cell = 215
Mean reads per cell = 18,344

@wheaton5
Copy link
Owner

wheaton5 commented Jun 8, 2022

You can try rerunning with --min_alt 4 and --min_ref 4. (default is 10 for each). You should get some more loci. But its gonna be tough to cluster with that few UMI

@wheaton5
Copy link
Owner

wheaton5 commented Jun 8, 2022

here is an example of a run with a lot of umi
%%MatrixMarket matrix coordinate real general
% written by sprs
194079 4925 13543647

to be fair that had like 25k umi median... which is abnormally large. A good number is around 4k for cells, maybe 1k for nuclei.

@changostraw
Copy link

I see. Would it help if the range was large? The last time I ran a snRNA-seq test on these samples the range was quite large. Because this is human postmortem tissue the RNA in the mucosal cells is very degraded. However, the RNA in the neurons and glial cells and certain immune cells are usually well preserved, so I was expecting these nuclei to have higher UMIs. Would it help to restrict to cells with a higher UMI?

@wheaton5
Copy link
Owner

wheaton5 commented Jun 8, 2022

here is a more normal run, with about 4k umi per cell

%%MatrixMarket matrix coordinate real general
% written by sprs
186395 14030 4201477

So there is basically almost no data to go off of in your case.

@changostraw
Copy link

Okay that makes sense. I also have data from postmortem brain tissue which has ~1600 UMI per cell. So I will run that as well and make sure that is the issue.

@wheaton5
Copy link
Owner

wheaton5 commented Jun 8, 2022

I don't think the low UMI cells should hurt the clustering. The cells with more UMI will contribute that much more to the clustering results and the low UMI cells should get sort of placed according to how the high UMI cells have shaped the cluster centers. But you may run into errors in the code if there are any cells that have no variants at all. I'm not sure

@changostraw
Copy link

Because we are working with postmortem nuclei we never get the levels that people working with cells get

@wheaton5
Copy link
Owner

wheaton5 commented Jun 8, 2022

Sure. I think it should work fine with 1600 UMI. But again there might be a bug if any cells contain no variants. fingers crossed

@changostraw
Copy link

Okay, so to rerun with --min_alt 4 and --min_ref 4. Should I just run the clustering or the whole thing again with those options?

@wheaton5
Copy link
Owner

wheaton5 commented Jun 8, 2022

You can rerun just the clustering. The pipeline is set up to rerun from a partial previous run. There are .done files for each step. So just delete clustering.done (i think thats right, maybe souporcell.done) and troublet.done and then rerun but with those new options

@changostraw
Copy link

Okay so I ran it on my brain data with ~1600 UMI/cell and it worked well, although there were still high levels of unassigned ( 46%). I also re-ran it on the cecum data with the min_alt 4 etc and it doesn't crash but all the droplets are unassigned. I am using raw donor GT right now but I plan to impute the genotype data in the future to fill in some gaps, perhaps this would help? I am also going to try and demultiplex using the 10000genomes genotype ref and then assign any clusters after the fact - perhaps that will give me more variants per nuclei? Otherwise I guess I have to go back to the drawing board for the cecum samples. Thanks!

@ZebinWen
Copy link
Author

ZebinWen commented Aug 2, 2022

Sorry for my late reply and thanks for your update @Slacanch @wheaton5 . There are 2 barcodes.tsv in cellranger's output files, one is in raw data directory, another is in filtered data directory. So according to your addvices, I change my barcodes file from the one in raw data directory into the one in filtered data directory, and it works! 🤙👍🍻
At the same time, I keep the pipeline running uninterrupted. If it breaks, I can resume the pipeline by rerun the code, and it can continue your previous work thanks to the .done files which is so brilliant!

@ZebinWen ZebinWen closed this as completed Aug 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants