Check expression specificity annotation names (ES header) #45
Comments
Hi Daiane, Yes, if you could post the actual error CELLECT returns in its log, that would be useful. Best,
There are 96 errors like this one: Error in rule format_and_map_genes: The file log.format_and_map_snake.dropviz-hippocampus.22.txt, where the error messages should be, is empty.
Hi Daiane, Thanks - that's helpful. Would you be able to send over all of your cell-type/cell-cluster names too? I suspect they may be the issue. Just to confirm - several of the rules complete successfully, and it is just 96 that do not? Best,
Hi Tobi, Correct, several rules complete, but 96 don't (with an error but no descriptive error message). I think I managed to successfully attach the full .log file; let me know if not. 2020-03-05T110455.599906.snakemake.log Cell cluster names are: 11-1 Thanks,
Hi Daiane, I don't think any of the rules are completing - your log file indicates you're using 96 cores, so only 96 jobs can run in parallel at any one time, and all of them crash. I believe it may be a result of the naming scheme of the cell-types. I have a solution which you should try: This should be fairly straightforward to do in a text editor (without typing 'CT' 96 times) or with a bit of bash code, but if you need a hand or if it doesn't work, let me know. Best,
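For anyone who would rather do the renaming in Python than in a text editor, here is a minimal sketch. It assumes the suggestion amounts to prepending a short letter prefix such as 'CT' to every cell-type name, and that the specificity matrix has genes as its index and cell types as its columns. The helper name, the toy matrix, the gene IDs, and the second cluster name are hypothetical.

```python
import pandas as pd


def add_prefix_to_cell_types(esmu, prefix="CT"):
    """Return a copy whose column (cell-type) names all start with `prefix`."""
    return esmu.rename(columns=lambda name: f"{prefix}{name}")


# Toy specificity matrix: genes as the index, cell types as columns.
# "11-1" is a cluster name from this thread; everything else is made up.
toy_esmu = pd.DataFrame(
    {"11-1": [0.2, 0.8], "11-2": [0.5, 0.1]},
    index=pd.Index(["GENE_A", "GENE_B"], name="gene"),
)

renamed = add_prefix_to_cell_types(toy_esmu)
print(list(renamed.columns))
```

The original frame is left untouched, so the renamed copy can be written to a new file and compared against the input before re-running CELLECT.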
Hi Tobi, I tried adding an H in front of every cell-type since all cells come from the Hippocampus. It did not solve the issue. Log attached.
Hi Daiane, Hmm, I am stumped and really can't infer anything else from what you've sent. If you can upload the CELLEX-processed dataset I'll try giving it a whirl myself - hopefully that will make it easier to diagnose the problem. Best,
Hi Tobi, CELLEX-processed dataset attached.
@Tobi1kenobi: I agree that this could be an issue with snakemake regex.
(@Tobi1kenobi: the problem reminds me of #25)
Hi Daiane, So I no longer believe it's a problem with your cell-type names. I tried running your CELLEX specificity matrix and I get an error message:
I'm not sure why this doesn't come up when you run CELLECT, or why it isn't sent to the log file; we'll have to look into this. But a quick check for duplicate genes reveals that there are many in this dataset, and this is almost certainly your problem. I think it's because you did the mapping step yourself. CELLEX has its own built-in mapping functionality that you can use, or, if you do want to use your own mapping, make sure to drop duplicates. Hopefully everything should work now. Best,
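A minimal sketch of such a duplicate-gene check with pandas, assuming the specificity matrix has gene identifiers as its index. The helper names, the toy matrix, and the gene IDs are hypothetical. Keeping the first occurrence of each duplicated gene is an arbitrary choice for illustration, not necessarily what CELLEX's own mapping does.

```python
import pandas as pd


def find_duplicate_genes(esmu):
    """Return the gene identifiers that occur more than once in the index."""
    return esmu.index[esmu.index.duplicated(keep=False)].unique()


def drop_duplicate_genes(esmu):
    """Keep only the first row for each gene identifier (arbitrary choice)."""
    return esmu[~esmu.index.duplicated(keep="first")]


# Toy matrix in which GENE_A appears twice, as can happen after a
# many-to-one gene mapping done outside CELLEX.
toy_esmu = pd.DataFrame(
    {"CT1": [0.1, 0.9, 0.4, 0.0], "CT2": [0.5, 0.2, 0.3, 0.7]},
    index=pd.Index(["GENE_A", "GENE_B", "GENE_A", "GENE_C"], name="gene"),
)

duplicates = find_duplicate_genes(toy_esmu)
deduplicated = drop_duplicate_genes(toy_esmu)
print(f"{len(duplicates)} duplicated gene(s): {list(duplicates)}")
```

Which row to keep for a duplicated gene is a modelling decision. Dropping all copies or aggregating them are equally plausible alternatives.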
Thanks Tobi, it makes sense.
A similar function exists for mapping MGI gene names to Ensembl mouse gene IDs. @tstannius, can you confirm this is correct? Best,
Yep. The docstring, however, erroneously describes the function as "Maps mouse ensembl gene id's to human ensembl gene id's".
I'm not sure if the .txt.gz file is being loaded, though:
Hi Tobi, Would it perhaps be possible for you to try to run the cellex.utils.mapping.mgi_mouse_to_ens_mouse function on the attached file? Just to check if you get the same error I reported above. Many thanks,
Thanks for pointing out this error! I also experience the same issue with the txt.gz files not being loaded, so definitely a CELLEX bug here. Just to check, within your
So I can't really see anything after /hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/ because it says it's not a directory. I'm not experienced in Python, but from what I read an .egg is essentially a zip archive, so when I run unzip /hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg I get all the directories and files under CELLEX, but there's no 'map' directory. The closest thing to it is cellex/utils/mapping/__init__.py
CELLEX has now been updated by @tstannius with some hotfixes for the mapping functions. I managed to convert all the MGI gene names to Ensembl gene names with the following:
import cellex
import pandas as pd
df = pd.read_csv('brain_dropviz_hippocampus.esmu.csv.gz', compression='gzip')
df.rename(columns={'Unnamed: 0': 'gene'}, inplace=True)
df = df.set_index('gene')  # important step before mapping!
cellex.utils.mapping.mgi_mouse_to_ens_mouse(df, drop_unmapped=True, verbose=True)
And then when you call the
Hi, thanks for the update! I did exactly what you typed above and got the error:
Apologies for the super basic question, but does CELLEX need to be installed again or something?
No worries - no questions are too basic!
would still occur. Did you happen to install CELLEX using pip? If yes, you can update it with a
OK, so now CELLEX is updated and I can generate hippocampus_cellect.esmu.csv.gz with Ensembl IDs. However, when I run CELLECT on it, I now get another error... well, in fact I don't get an error, I just get: ... And I can't tell which job execution failed: there are no errors in the log file, only a few warnings. Could you please have a look at the log file and help me with this? Thanks!
Hi again, I've looked into the
So the point where things start to go wrong is at the
It would perhaps seem like one of your celltypes in the Sorry if this is inconvenient, but I might have to ask you to find the cell types in the CELLEX'ed dataset with the
I'd agree, this sounds like #3. It should've been fixed a long time ago, but I think it has been forgotten about, apologies. But as @bengnielsen said, I'd expect renaming 'nan' as 'unknown' (for example) would resolve the issue.
Hi there, Changing the name of the cluster from 'nan' to 'Unknown' worked. So just to close this thread: 1 - CELLECT throws an error when the name of a cluster is "nan". Many thanks for your help!
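A minimal sketch of the workaround described above: scanning the cell-type (column) names of a specificity matrix for names that look like missing values, and renaming them before running CELLECT. The helper name, the set of "missing-like" strings, and the toy matrix are hypothetical and not part of CELLEX or CELLECT.

```python
import pandas as pd

# Hypothetical, non-exhaustive set of names that tabular parsers commonly
# interpret as missing values (compared case-insensitively).
MISSING_LIKE = {"", "nan", "na", "n/a", "null", "none"}


def rename_missing_like_columns(esmu, replacement="Unknown"):
    """Return a copy in which missing-like column names become `replacement`.

    If several columns match, they all receive the same replacement name,
    so check the result for duplicate column names afterwards.
    """
    renamed = esmu.copy()
    renamed.columns = [
        replacement if str(name).strip().lower() in MISSING_LIKE else name
        for name in esmu.columns
    ]
    return renamed


# Toy matrix with one cluster literally named "nan".
toy_esmu = pd.DataFrame(
    {"11-1": [0.2, 0.8], "nan": [0.5, 0.1]},
    index=pd.Index(["GENE_A", "GENE_B"], name="gene"),
)

cleaned = rename_missing_like_columns(toy_esmu)
print(list(cleaned.columns))
```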
Thanks @DaianeH! To fix this, the code should either:
or
Hi there,
I ran CELLEX and am now running CELLECT on the hippocampus dataset from http://dropviz.org.
The genes were originally mouse gene names; I mapped them to human Ensembl IDs using Ensembl BioMart.
My config.yml looks like:
BASE_OUTPUT_DIR: ./CELLECT-LDSC-EXAMPLE
SPECIFICITY_INPUT:
  path: /hpc/users/hemerd01/daiane/projects/scRNAseq/data/brain_mouse_dropviz/hippocampus/brain_dropviz_hippocampus.esmu_human_cellect.csv
GWAS_SUMSTATS:
  path: example/BMI_Yengo2018.sumstats.gz
ANALYSIS_TYPE:
  prioritization: True
  conditional: False
  heritability: False
  heritability_intervals: False
WINDOW_DEFINITION:
  WINDOW_SIZE_KB:
    100
  WINDOW_LD_BASED:
    False
LDSC_CONST:
  DATA_DIR:
    data/ldsc
  LDSC_DIR:
    ldsc
  NUMPY_CORES:
    1
When I run:
snakemake --use-conda -j -s cellect-ldsc.snakefile --configfile example/config-ldsc_hippocampus_dropviz.yml
Several "rule format_and_map_genes:" jobs work until, at some point, I start getting: "Error in rule format_and_map_genes:"
The blocks where this happens list the jobid, input, output, log and conda-env, but the log where the error was supposed to be reported is empty.
I have successfully run CELLECT on datasets prepared with CELLEX before, but I can't trace what's going on with this one. Would it be helpful if I sent you the entire .log file of the CELLECT run and perhaps the esmu dataset as well?
Many thanks in advance,