Check expression specificity annotation names (ES header) #45

DaianeH · 2020-03-04T17:47:20Z

Hi there,

I run CELLEX and now am running CELLECT on Hippocampus dataset of http://dropviz.org.
Genes are originally mouse gene names, I mapped them to human Ensembl IDs using Ensembl Biomart.

My config.yml looks like:

BASE_OUTPUT_DIR: ./CELLECT-LDSC-EXAMPLE

SPECIFICITY_INPUT:

id: dropviz-hippocampus
path: /hpc/users/hemerd01/daiane/projects/scRNAseq/data/brain_mouse_dropviz/hippocampus/brain_dropviz_hippocampus.esmu_human_cellect.csv

GWAS_SUMSTATS:

id: BMI_Yengo2018
path: example/BMI_Yengo2018.sumstats.gz

ANALYSIS_TYPE:
prioritization: True
conditional: False
heritability: False
heritability_intervals: False

WINDOW_DEFINITION:
WINDOW_SIZE_KB:
100
WINDOW_LD_BASED:
False

LDSC_CONST:
DATA_DIR:
data/ldsc
LDSC_DIR:
ldsc
NUMPY_CORES:
1

When I run:

snakemake --use-conda -j -s cellect-ldsc.snakefile --configfile example/config-ldsc_hippocampus_dropviz.yml

Several "rule format_and_map_genes:" work, until at some point, I start getting: "Error in rule format_and_map_genes:"
The blocks where this happens have a description of jobid, input, output, log and conda-env, but the log where the error was supposed to be reported is empty.

I successfully run CELLECT on datasets prepared on CELLEX before, but can't trace what's going on with this one. Would it be helpful if I send you the entire .log file of the CELLECT run and perhaps the esmu dataset as well?

Many thanks in advance,

Tobi1kenobi · 2020-03-05T09:05:39Z

Hi Daiane,

Yes if you could post the actual error CELLECT returns in its log that would be useful.

Best,
Tobi

DaianeH · 2020-03-05T16:09:40Z

There are 96 errors like this one:

Error in rule format_and_map_genes:
jobid: 182
output: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/dropviz-hippocampus/bed/dropviz-hippocampus.22.bed
log: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/logs/log.format_and_map_snake.dropviz-hippocampus.22.txt (check log file(s) for error message)
conda-env: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/.snakemake/conda/216cc198

File log.format_and_map_snake.dropviz-hippocampus.22.txt where the error messages should be is empty.

Tobi1kenobi · 2020-03-05T16:29:53Z

Hi Daian,

Thanks - that's helpful. Would you be able to send over all of your cell-type/cell-cluster names too? I suspect they may be the issue.

Just to confirm - several of the rules complete successfully it is just 96 that do not?

Best,
Tobi

DaianeH · 2020-03-05T16:39:27Z

Hi Tobi,

Correct, several rules complete, but 96 don't (with an error but no descriptive error message). I think I managed to successfully attach the full .log file, let me know if not.

2020-03-05T110455.599906.snakemake.log

Cell cluster names are:

11-1
6-4
6-1
5-11
5-3
6-3
6-6
5-7
5-6
5-14
1-23
1-14
5-1
6
5-4
1-24
6-5
1-19
6-8
1
15
5-8
6-7
4
1-21
4-1
5-9
5-2
8-1
1-7
1-20
1-6
5
1-15
3-7
5-12
3-3
2-6
15-1
11
1-12
3-6
3-5
1-26
7
15-2
13-2
1-8
3-8
1-22
3-10
2-1
2-5
1-2
1-18
7-1
2-7
8-5
13-3
8-4
3-2
1-9
4-2
1-25
9
3-1
1-13
8-3
2-2
1-11
3-11
5-5
9-1
1-27
6-2
2-4
1-17
1-16
16
8
5-15
7-2
1-5
16-1
2
8-2
1-1
5-10
7-3
17-2
17
16-4
17-3
5-13
2-3
1-10
10
13-1
1-3
17-1
9-4
3-9
9-2
13-5
3
1-4
14
13-4
18
9-3
17-4
16-2
12
NA
10-1
13-6
10-2
16-3
13
21
20
22
3-4
19

Thanks,
Daiane.

Tobi1kenobi · 2020-03-05T17:30:42Z

Hi Daiane,

I don't think any of the rules are completing - your log file indicates you're using 96 cores so only 96 jobs can run in parallel at any one time, and all of them crash.

I believe it may be a result of the naming scheme of the cell-types. I have a solution which you should try:
Rename your cell-types using letters e.g. '19' -> 'CT19', '3-4' -> 'CT3-4', etc. I think it may be an issue with CELLECT's use of regex and adding non-numerical characters could resolve it.

The above should be fairly straightforward to do the above in a text editor (without typing 'CT' 96 times) or with a bit of bash code but if you need a hand or if it doesn't work, let me know.

Best,
Tobi

DaianeH · 2020-03-05T21:23:00Z

Hi Tobi, I tried adding an H in front of every cell-type since all cells come from the Hippocampus. It did not solve the issue. Log attached.
Thanks,
Daiane.

2020-03-05T161448.352052.snakemake.log

Tobi1kenobi · 2020-03-06T00:47:57Z

Hi Daiane,

Hmm, I am stumped and really can't infer anything else from what you've sent.

If you can upload the CELLEX-processed dataset I'll try giving it a whirl myself - hopefully that will make it easier to diagnose the problem.

Best,
Tobi

DaianeH · 2020-03-06T14:42:33Z

Hi Tobi,

CELLEX-processed dataset attached.
Thanks!

brain_dropviz_hippocampus.esmu_human_cellect_test.csv.zip

pascaltimshel · 2020-03-07T09:35:58Z

@Tobi1kenobi : I agree that this could be an issue with snakemake regex.
@DaianeH : It could be a problem to have cell-types named e.g. H1, H1-1 and H1-11. For a quick fix of this, I suggest removing the hyphens. Can you check if that works?
We will, of course, look into this for a better long term solution.

H1
H1-1
H1-10
H1-11
H1-12
H1-13
H1-14
H1-15
H1-16
H1-17
H1-18
H1-19
H1-2
H1-20
H1-21
H1-22
H1-23
H1-24
H1-25
H1-26
H1-27
H1-3
H1-4
H1-5
H1-6
H1-7
H1-8
H1-9
H10
H10-1
H10-2
H11
H11-1
H12
H13
H13-1
H13-2
H13-3
H13-4
H13-5
H13-6
H14
H15
H15-1
H15-2
H16
H16-1
H16-2
H16-3
H16-4
H17
H17-1
H17-2
H17-3
H17-4
H18
H19
H2
H2-1
H2-2
H2-3
H2-4
H2-5
H2-6
H2-7
H20
H21
H22
H3
H3-1
H3-10
H3-11
H3-2
H3-3
H3-4
H3-5
H3-6
H3-7
H3-8
H3-9
H4
H4-1
H4-2
H5
H5-1
H5-10
H5-11
H5-12
H5-13
H5-14
H5-15
H5-2
H5-3
H5-4
H5-5
H5-6
H5-7
H5-8
H5-9
H6
H6-1
H6-2
H6-3
H6-4
H6-5
H6-6
H6-7
H6-8
H7
H7-1
H7-2
H7-3
H8
H8-1
H8-2
H8-3
H8-4
H8-5
H9
H9-1
H9-2
H9-3
H9-4
OTHER

pascaltimshel · 2020-03-07T09:36:42Z

(@Tobi1kenobi : the problem reminds me of #25 )

Tobi1kenobi · 2020-03-08T18:11:50Z

Hi Daiane,

So I no longer believe it's a problem with your cell-type names. I tried running your CELLEX specificity matrix and I get an error message:

ValueError: file_multi_gene_set contains duplicated genes within an annotation (i.e. non-unique combinations of 'annotation' and 'gene_input' columns). Fix and rerun.

Not sure why this doesn't come up when you run CELLECT or get sent to the log file, we'll have to look into this.

But a quick check for duplicate genes:
cat brain_dropviz_hippocampus.esmu_human_cellect_test.csv | cut -d, -f 1 | sort | uniq -c | sort

Reveals that there are many in this dataset and this is almost certainly your problem.

I think it's because you did the mapping step yourself. CELLEX has its own built-in mapping functionality that you can use or if you do want to use your own mapping functionality make sure to drop duplicates.

Hopefully everything should work now.

Best,
Tobi

DaianeH · 2020-03-09T14:05:59Z

Thanks Tobi, it makes sense.
So if I want to use CELLEX mapping functionality, first I need to map mouse gene names to mouse Ensembl ID, then mouse Ensembl ID to human Ensembl ID. The latter is done by "cellex.utils.mapping.ens_mouse_to_ens_human", but how can I map mouse gene names to mouse Ensembl ID?

Tobi1kenobi · 2020-03-09T14:14:37Z

A similar function exists for mapping MGI gene names to Ensembl mouse it should be cellex.utils.mapping.mgi_mouse_to_ens_mouse.

@tstannius can you confirm this is correct?

Best,
Tobi

bengnielsen · 2020-03-09T14:28:35Z

Yep cellex.utils.mapping.mgi_mouse_to_ens_mouse loads a file maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz within the CELLEX repo , and defines a dictionary of {mgi gene name: ensembl_gene_id}.

The doc string however erroneously states the function as "Maps mouse ensembl gene id's to human ensembl gene id's" though.

DaianeH · 2020-03-09T14:32:07Z

I'm not sure if the .txt.gz file is being loaded, though:

eso_mapped = cellex.utils.mapping.mgi_mouse_to_ens_mouse(eso.results["esmu"])
Traceback (most recent call last):
File "", line 1, in
File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/mgi_mouse_to_ens_mouse.py", line 30, in mgi_mouse_to_ens_mouse
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1151, in resource_stream
self, resource_name
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1398, in get_resource_stream
return io.BytesIO(self.get_resource_string(manager, resource_name))
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1401, in get_resource_string
return self._get(self._fn(self.module_path, resource_name))
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1540, in _get
return self.loader.get_data(path)
OSError: [Errno 0] Error: 'cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'

DaianeH · 2020-03-09T16:20:37Z

Hi Tobi,

Would it perhaps be possible for you to try to run the cellex.utils.mapping.mgi_mouse_to_ens_mouse function on the attached file? Just to check if you get the same error I reported above.

Many thanks,
Daiane.

brain_dropviz_hippocampus.esmu.csv.gz

bengnielsen · 2020-03-09T17:35:02Z

Thanks for pointing out this error! I also experience the same issue with the txt.gz files not being loaded, so definitely a CELLEX bug here.

Just to check, within your /hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/ -directory, is there a map-directory?

DaianeH · 2020-03-09T17:40:50Z

So I can't really see anything after:

/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/

because it says it's not a directory.

Not experienced in python but from what I read an .egg is the same as a zipped file, so when I do:

unzip /hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg

I get all the directories and files under CELLEX... but no, there's no 'map' directory. The closest thing to it is:

cellex/utils/mapping/init.py
cellex/utils/mapping/ens_human_to_symbol.py
cellex/utils/mapping/ens_mouse_to_ens_human.py
cellex/utils/mapping/mgi_mouse_to_ens_mouse.py
cellex/utils/mapping/pycache/init.cpython-37.pyc
cellex/utils/mapping/pycache/ens_human_to_symbol.cpython-37.pyc
cellex/utils/mapping/pycache/ens_mouse_to_ens_human.cpython-37.pyc
cellex/utils/mapping/pycache/mgi_mouse_to_ens_mouse.cpython-37.pyc

bengnielsen · 2020-03-09T22:06:27Z

CELLEX has now been updated by @tstannius with some hotfixes for the mapping functions.

I managed to convert all the MGI gene names to Ensembl gene names from the brain_dropviz_hippocampus.esmu.csv.gz by:

import cellex as cellex
import pandas as pd

df = pd.read_csv('brain_dropviz_hippocampus.esmu.csv.gz', compression='gzip')
df.rename(columns={'Unnamed: 0': 'gene'}, inplace=True)
df = df.set_index('gene')   #important step before mapping! 

cellex.utils.mapping.mgi_mouse_to_ens_mouse(df, drop_unmapped=True, verbose=True)

And then when you call the df it should now contain Mouse ensembl id gene names. Let us know if it works!

DaianeH · 2020-03-10T15:45:21Z

Hi, thanks for the update!

I did exactly what you typed above and got the error:

import numpy as np
import cellex as cellex
import pandas as pd
df = pd.read_csv('/brain_dropviz_hippocampus.esmu.csv.gz', compression='gzip')
df.rename(columns={'Unnamed: 0': 'gene'}, inplace=True)
df = df.set_index('gene')
cellex.utils.mapping.mgi_mouse_to_ens_mouse(df, drop_unmapped=True, verbose=True)
Mapping: mouse mgi gene id's --> mouse ensembl gene id's ...
Traceback (most recent call last):
File "", line 1, in
File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/mgi_mouse_to_ens_mouse.py", line 30, in mgi_mouse_to_ens_mouse
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1151, in resource_stream
self, resource_name
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1398, in get_resource_stream
return io.BytesIO(self.get_resource_string(manager, resource_name))
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1401, in get_resource_string
return self._get(self._fn(self.module_path, resource_name))
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1540, in _get
return self.loader.get_data(path)
OSError: [Errno 0] Error: 'cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'��

Apologies for the super basic question, but does CELLEX need to be installed again or something?

bengnielsen · 2020-03-10T16:48:46Z

No worries - no questions are too basic!
But yeah, it seems like your server still has CELLEX v.1.0.1, which means that the bug from yesterday

OSError: [Errno 0] Error: 'cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'

would still occur.

Did you happen to install CELLEX using pip? If yes, you can update it with a pip install -U cellex and get the latest CELLEX v. 1.1.1

DaianeH · 2020-03-16T02:42:08Z

Ok, so now CELLEX is updated and I can generate the hippocampus_cellect.esmu.csv.gz with ENSEMBL id's. However, when I run CELLECT on it, now I get another error... well in fact I don't get an error, I just get:

...
[Sun Mar 15 22:33:49 2020]
Finished job 215.
140 of 240 steps (58%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/.snakemake/log/2020-03-15T223106.425023.snakemake.log

And I cannot exactly know which job execution failed - there's no errors in the log file, only a few warnings. Could you please be so kind to have a look at the log file and help me on this? Thanks!

2020-03-15T223106.425023.snakemake.log

bengnielsen · 2020-03-21T01:22:03Z

Hi again, I've looked into the 2020-03-15T223106.425023.snakemake.log file and spotted the following error near the bottom:

Missing files after 5 seconds:
/sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/dropviz-hippocampus/bed/dropviz-hippocampus.nan.bed

So the point where things start to go wrong is at the format_and_map_genes step, which outputs a .bed for each cell type present in your CELLEX'ed dataset:

[Sun Mar 15 22:31:34 2020]
rule format_and_map_genes:
    input: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/multi_genesets/multi_geneset.dropviz-hippocampus.txt
    output: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/dropviz-hippocampus/bed/dropviz-hippocampus.nan.bed
    log: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/logs/log.format_and_map_snake.dropviz-hippocampus.nan.txt
    jobid: 238
    wildcards: BASE_OUTPUT_DIR=/sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE, run_prefix=dropviz-hippocampus, annotation=nan

It would perhaps seem like one of your celltypes in the hippocampus_cellect.esmu.csv.gz have a nan as their celltype label and thus might messes up the script (perhaps because nan doesn't translate very well when converted to string? What's your take on this, @Tobi1kenobi?) and thus required files .bed files to proceed are not generated, hence snakemake cancels further job execution.

Sorry if this is inconvenient, but I might have to ask you to find the cell types in the CELLEX'ed dataset with the nan labels and rename them.

Tobi1kenobi · 2020-03-21T14:51:40Z

I'd agree, this sounds like #3 . It should've been fixed a long time ago but I think has been forgotten about, apologies.

But as @bengnielsen said, I'd expect renaming 'nan' as 'unknown' (for example) would resolve the issue.

DaianeH · 2020-03-30T16:33:51Z

Hi there,

Changing the name of the cluster from 'nan' to 'Unknown' worked.

So just to close this thread:

1 - CELLECT throws an error when the name of a cluster is "nan"
2 - You suggested I put a letter in front of the cluster numbers, but that doesn't make a difference when there's an hyphen in the cluster name (when there's numbers and hyphens, the cluster name is read as a string). However, I run CELLECT in another dataset with cluster names represented only as numbers, and it threw an error. When adding a letter in front of the cluster numbers, the error was gone. So CELLECT will throw an error when cluster names are numbers only (it must be a string).

Many thanks for your help!

pascaltimshel · 2020-04-12T08:54:17Z

Thanks @DaianeH !

To fix this the code should either :

fail explicitly (before staring Snakemake workflow) when number-only or nan encountered

or

improve snakemake regex(?) to allow for number-only or nan annotation names

DaianeH closed this as completed Mar 9, 2020

DaianeH reopened this Mar 9, 2020

pascaltimshel added check&catch error checks and catching of errors. If very serious issue, report as bug priority_high High priority labels Apr 12, 2020

pascaltimshel changed the title ~~Error in format_and_map_genes~~ Check expression specificity annotation names (ES header) Apr 12, 2020

pascaltimshel pinned this issue May 13, 2020

Tobi1kenobi unpinned this issue Jun 18, 2020

Tobi1kenobi pinned this issue Jun 21, 2020

Tobi1kenobi unpinned this issue Sep 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Check expression specificity annotation names (ES header) #45

Check expression specificity annotation names (ES header) #45

DaianeH commented Mar 4, 2020

Tobi1kenobi commented Mar 5, 2020

DaianeH commented Mar 5, 2020 •

edited

Loading

Tobi1kenobi commented Mar 5, 2020

DaianeH commented Mar 5, 2020

Tobi1kenobi commented Mar 5, 2020 •

edited

Loading

DaianeH commented Mar 5, 2020 •

edited

Loading

Tobi1kenobi commented Mar 6, 2020

DaianeH commented Mar 6, 2020

pascaltimshel commented Mar 7, 2020

pascaltimshel commented Mar 7, 2020

Tobi1kenobi commented Mar 8, 2020 •

edited

Loading

DaianeH commented Mar 9, 2020

Tobi1kenobi commented Mar 9, 2020

bengnielsen commented Mar 9, 2020

DaianeH commented Mar 9, 2020

DaianeH commented Mar 9, 2020

bengnielsen commented Mar 9, 2020

DaianeH commented Mar 9, 2020

bengnielsen commented Mar 9, 2020 •

edited

Loading

DaianeH commented Mar 10, 2020

bengnielsen commented Mar 10, 2020

DaianeH commented Mar 16, 2020

bengnielsen commented Mar 21, 2020

Tobi1kenobi commented Mar 21, 2020

DaianeH commented Mar 30, 2020

pascaltimshel commented Apr 12, 2020 •

edited

Loading

Check expression specificity annotation names (ES header) #45

Check expression specificity annotation names (ES header) #45

Comments

DaianeH commented Mar 4, 2020

Tobi1kenobi commented Mar 5, 2020

DaianeH commented Mar 5, 2020 • edited Loading

Tobi1kenobi commented Mar 5, 2020

DaianeH commented Mar 5, 2020

Tobi1kenobi commented Mar 5, 2020 • edited Loading

DaianeH commented Mar 5, 2020 • edited Loading

Tobi1kenobi commented Mar 6, 2020

DaianeH commented Mar 6, 2020

pascaltimshel commented Mar 7, 2020

pascaltimshel commented Mar 7, 2020

Tobi1kenobi commented Mar 8, 2020 • edited Loading

DaianeH commented Mar 9, 2020

Tobi1kenobi commented Mar 9, 2020

bengnielsen commented Mar 9, 2020

DaianeH commented Mar 9, 2020

DaianeH commented Mar 9, 2020

bengnielsen commented Mar 9, 2020

DaianeH commented Mar 9, 2020

bengnielsen commented Mar 9, 2020 • edited Loading

DaianeH commented Mar 10, 2020

bengnielsen commented Mar 10, 2020

DaianeH commented Mar 16, 2020

bengnielsen commented Mar 21, 2020

Tobi1kenobi commented Mar 21, 2020

DaianeH commented Mar 30, 2020

pascaltimshel commented Apr 12, 2020 • edited Loading

DaianeH commented Mar 5, 2020 •

edited

Loading

Tobi1kenobi commented Mar 5, 2020 •

edited

Loading

DaianeH commented Mar 5, 2020 •

edited

Loading

Tobi1kenobi commented Mar 8, 2020 •

edited

Loading

bengnielsen commented Mar 9, 2020 •

edited

Loading

pascaltimshel commented Apr 12, 2020 •

edited

Loading