Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check expression specificity annotation names (ES header) #45

Open
DaianeH opened this issue Mar 4, 2020 · 26 comments
Open

Check expression specificity annotation names (ES header) #45

DaianeH opened this issue Mar 4, 2020 · 26 comments
Labels
check&catch error checks and catching of errors. If very serious issue, report as bug priority_high High priority

Comments

@DaianeH
Copy link

DaianeH commented Mar 4, 2020

Hi there,

I run CELLEX and now am running CELLECT on Hippocampus dataset of http://dropviz.org.
Genes are originally mouse gene names, I mapped them to human Ensembl IDs using Ensembl Biomart.

My config.yml looks like:


BASE_OUTPUT_DIR: ./CELLECT-LDSC-EXAMPLE

SPECIFICITY_INPUT:

  • id: dropviz-hippocampus
    path: /hpc/users/hemerd01/daiane/projects/scRNAseq/data/brain_mouse_dropviz/hippocampus/brain_dropviz_hippocampus.esmu_human_cellect.csv

GWAS_SUMSTATS:

  • id: BMI_Yengo2018
    path: example/BMI_Yengo2018.sumstats.gz

ANALYSIS_TYPE:
prioritization: True
conditional: False
heritability: False
heritability_intervals: False

WINDOW_DEFINITION:
WINDOW_SIZE_KB:
100
WINDOW_LD_BASED:
False

LDSC_CONST:
DATA_DIR:
data/ldsc
LDSC_DIR:
ldsc
NUMPY_CORES:
1


When I run:

snakemake --use-conda -j -s cellect-ldsc.snakefile --configfile example/config-ldsc_hippocampus_dropviz.yml

Several "rule format_and_map_genes:" work, until at some point, I start getting: "Error in rule format_and_map_genes:"
The blocks where this happens have a description of jobid, input, output, log and conda-env, but the log where the error was supposed to be reported is empty.

I successfully run CELLECT on datasets prepared on CELLEX before, but can't trace what's going on with this one. Would it be helpful if I send you the entire .log file of the CELLECT run and perhaps the esmu dataset as well?

Many thanks in advance,

@Tobi1kenobi
Copy link
Contributor

Hi Daiane,

Yes if you could post the actual error CELLECT returns in its log that would be useful.

Best,
Tobi

@DaianeH
Copy link
Author

DaianeH commented Mar 5, 2020

There are 96 errors like this one:

Error in rule format_and_map_genes:
jobid: 182
output: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/dropviz-hippocampus/bed/dropviz-hippocampus.22.bed
log: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/logs/log.format_and_map_snake.dropviz-hippocampus.22.txt (check log file(s) for error message)
conda-env: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/.snakemake/conda/216cc198

File log.format_and_map_snake.dropviz-hippocampus.22.txt where the error messages should be is empty.

@Tobi1kenobi
Copy link
Contributor

Hi Daian,

Thanks - that's helpful. Would you be able to send over all of your cell-type/cell-cluster names too? I suspect they may be the issue.

Just to confirm - several of the rules complete successfully it is just 96 that do not?

Best,
Tobi

@DaianeH
Copy link
Author

DaianeH commented Mar 5, 2020

Hi Tobi,

Correct, several rules complete, but 96 don't (with an error but no descriptive error message). I think I managed to successfully attach the full .log file, let me know if not.

2020-03-05T110455.599906.snakemake.log

Cell cluster names are:

11-1
6-4
6-1
5-11
5-3
6-3
6-6
5-7
5-6
5-14
1-23
1-14
5-1
6
5-4
1-24
6-5
1-19
6-8
1
15
5-8
6-7
4
1-21
4-1
5-9
5-2
8-1
1-7
1-20
1-6
5
1-15
3-7
5-12
3-3
2-6
15-1
11
1-12
3-6
3-5
1-26
7
15-2
13-2
1-8
3-8
1-22
3-10
2-1
2-5
1-2
1-18
7-1
2-7
8-5
13-3
8-4
3-2
1-9
4-2
1-25
9
3-1
1-13
8-3
2-2
1-11
3-11
5-5
9-1
1-27
6-2
2-4
1-17
1-16
16
8
5-15
7-2
1-5
16-1
2
8-2
1-1
5-10
7-3
17-2
17
16-4
17-3
5-13
2-3
1-10
10
13-1
1-3
17-1
9-4
3-9
9-2
13-5
3
1-4
14
13-4
18
9-3
17-4
16-2
12
NA
10-1
13-6
10-2
16-3
13
21
20
22
3-4
19

Thanks,
Daiane.

@Tobi1kenobi
Copy link
Contributor

Tobi1kenobi commented Mar 5, 2020

Hi Daiane,

I don't think any of the rules are completing - your log file indicates you're using 96 cores so only 96 jobs can run in parallel at any one time, and all of them crash.

I believe it may be a result of the naming scheme of the cell-types. I have a solution which you should try:
Rename your cell-types using letters e.g. '19' -> 'CT19', '3-4' -> 'CT3-4', etc. I think it may be an issue with CELLECT's use of regex and adding non-numerical characters could resolve it.

The above should be fairly straightforward to do the above in a text editor (without typing 'CT' 96 times) or with a bit of bash code but if you need a hand or if it doesn't work, let me know.

Best,
Tobi

@DaianeH
Copy link
Author

DaianeH commented Mar 5, 2020

Hi Tobi, I tried adding an H in front of every cell-type since all cells come from the Hippocampus. It did not solve the issue. Log attached.
Thanks,
Daiane.

2020-03-05T161448.352052.snakemake.log

@Tobi1kenobi
Copy link
Contributor

Hi Daiane,

Hmm, I am stumped and really can't infer anything else from what you've sent.

If you can upload the CELLEX-processed dataset I'll try giving it a whirl myself - hopefully that will make it easier to diagnose the problem.

Best,
Tobi

@DaianeH
Copy link
Author

DaianeH commented Mar 6, 2020

Hi Tobi,

CELLEX-processed dataset attached.
Thanks!

brain_dropviz_hippocampus.esmu_human_cellect_test.csv.zip

@pascaltimshel
Copy link
Collaborator

@Tobi1kenobi : I agree that this could be an issue with snakemake regex.
@DaianeH : It could be a problem to have cell-types named e.g. H1, H1-1 and H1-11. For a quick fix of this, I suggest removing the hyphens. Can you check if that works?
We will, of course, look into this for a better long term solution.

H1
H1-1
H1-10
H1-11
H1-12
H1-13
H1-14
H1-15
H1-16
H1-17
H1-18
H1-19
H1-2
H1-20
H1-21
H1-22
H1-23
H1-24
H1-25
H1-26
H1-27
H1-3
H1-4
H1-5
H1-6
H1-7
H1-8
H1-9
H10
H10-1
H10-2
H11
H11-1
H12
H13
H13-1
H13-2
H13-3
H13-4
H13-5
H13-6
H14
H15
H15-1
H15-2
H16
H16-1
H16-2
H16-3
H16-4
H17
H17-1
H17-2
H17-3
H17-4
H18
H19
H2
H2-1
H2-2
H2-3
H2-4
H2-5
H2-6
H2-7
H20
H21
H22
H3
H3-1
H3-10
H3-11
H3-2
H3-3
H3-4
H3-5
H3-6
H3-7
H3-8
H3-9
H4
H4-1
H4-2
H5
H5-1
H5-10
H5-11
H5-12
H5-13
H5-14
H5-15
H5-2
H5-3
H5-4
H5-5
H5-6
H5-7
H5-8
H5-9
H6
H6-1
H6-2
H6-3
H6-4
H6-5
H6-6
H6-7
H6-8
H7
H7-1
H7-2
H7-3
H8
H8-1
H8-2
H8-3
H8-4
H8-5
H9
H9-1
H9-2
H9-3
H9-4
OTHER

@pascaltimshel
Copy link
Collaborator

(@Tobi1kenobi : the problem reminds me of #25 )

@Tobi1kenobi
Copy link
Contributor

Tobi1kenobi commented Mar 8, 2020

Hi Daiane,

So I no longer believe it's a problem with your cell-type names. I tried running your CELLEX specificity matrix and I get an error message:

ValueError: file_multi_gene_set contains duplicated genes within an annotation (i.e. non-unique combinations of 'annotation' and 'gene_input' columns). Fix and rerun.

Not sure why this doesn't come up when you run CELLECT or get sent to the log file, we'll have to look into this.

But a quick check for duplicate genes:
cat brain_dropviz_hippocampus.esmu_human_cellect_test.csv | cut -d, -f 1 | sort | uniq -c | sort

Reveals that there are many in this dataset and this is almost certainly your problem.

I think it's because you did the mapping step yourself. CELLEX has its own built-in mapping functionality that you can use or if you do want to use your own mapping functionality make sure to drop duplicates.

Hopefully everything should work now.

Best,
Tobi

@DaianeH
Copy link
Author

DaianeH commented Mar 9, 2020

Thanks Tobi, it makes sense.
So if I want to use CELLEX mapping functionality, first I need to map mouse gene names to mouse Ensembl ID, then mouse Ensembl ID to human Ensembl ID. The latter is done by "cellex.utils.mapping.ens_mouse_to_ens_human", but how can I map mouse gene names to mouse Ensembl ID?

@Tobi1kenobi
Copy link
Contributor

A similar function exists for mapping MGI gene names to Ensembl mouse it should be cellex.utils.mapping.mgi_mouse_to_ens_mouse.

@tstannius can you confirm this is correct?

Best,
Tobi

@bengnielsen
Copy link
Collaborator

Yep cellex.utils.mapping.mgi_mouse_to_ens_mouse loads a file maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz within the CELLEX repo , and defines a dictionary of {mgi gene name: ensembl_gene_id}.

The doc string however erroneously states the function as "Maps mouse ensembl gene id's to human ensembl gene id's" though.

@DaianeH
Copy link
Author

DaianeH commented Mar 9, 2020

I'm not sure if the .txt.gz file is being loaded, though:

eso_mapped = cellex.utils.mapping.mgi_mouse_to_ens_mouse(eso.results["esmu"])
Traceback (most recent call last):
File "", line 1, in
File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/mgi_mouse_to_ens_mouse.py", line 30, in mgi_mouse_to_ens_mouse
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1151, in resource_stream
self, resource_name
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1398, in get_resource_stream
return io.BytesIO(self.get_resource_string(manager, resource_name))
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1401, in get_resource_string
return self._get(self._fn(self.module_path, resource_name))
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1540, in _get
return self.loader.get_data(path)
OSError: [Errno 0] Error: 'cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'

@DaianeH DaianeH closed this as completed Mar 9, 2020
@DaianeH DaianeH reopened this Mar 9, 2020
@DaianeH
Copy link
Author

DaianeH commented Mar 9, 2020

Hi Tobi,

Would it perhaps be possible for you to try to run the cellex.utils.mapping.mgi_mouse_to_ens_mouse function on the attached file? Just to check if you get the same error I reported above.

Many thanks,
Daiane.

brain_dropviz_hippocampus.esmu.csv.gz

@bengnielsen
Copy link
Collaborator

Thanks for pointing out this error! I also experience the same issue with the txt.gz files not being loaded, so definitely a CELLEX bug here.

Just to check, within your /hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/ -directory, is there a map-directory?

@DaianeH
Copy link
Author

DaianeH commented Mar 9, 2020

So I can't really see anything after:

/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/

because it says it's not a directory.

Not experienced in python but from what I read an .egg is the same as a zipped file, so when I do:

unzip /hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg

I get all the directories and files under CELLEX... but no, there's no 'map' directory. The closest thing to it is:

cellex/utils/mapping/init.py
cellex/utils/mapping/ens_human_to_symbol.py
cellex/utils/mapping/ens_mouse_to_ens_human.py
cellex/utils/mapping/mgi_mouse_to_ens_mouse.py
cellex/utils/mapping/pycache/init.cpython-37.pyc
cellex/utils/mapping/pycache/ens_human_to_symbol.cpython-37.pyc
cellex/utils/mapping/pycache/ens_mouse_to_ens_human.cpython-37.pyc
cellex/utils/mapping/pycache/mgi_mouse_to_ens_mouse.cpython-37.pyc

@bengnielsen
Copy link
Collaborator

bengnielsen commented Mar 9, 2020

CELLEX has now been updated by @tstannius with some hotfixes for the mapping functions.

I managed to convert all the MGI gene names to Ensembl gene names from the brain_dropviz_hippocampus.esmu.csv.gz by:

import cellex as cellex
import pandas as pd

df = pd.read_csv('brain_dropviz_hippocampus.esmu.csv.gz', compression='gzip')
df.rename(columns={'Unnamed: 0': 'gene'}, inplace=True)
df = df.set_index('gene')   #important step before mapping! 

cellex.utils.mapping.mgi_mouse_to_ens_mouse(df, drop_unmapped=True, verbose=True)

And then when you call the df it should now contain Mouse ensembl id gene names. Let us know if it works!

@DaianeH
Copy link
Author

DaianeH commented Mar 10, 2020

Hi, thanks for the update!

I did exactly what you typed above and got the error:

import numpy as np
import cellex as cellex
import pandas as pd
df = pd.read_csv('/brain_dropviz_hippocampus.esmu.csv.gz', compression='gzip')
df.rename(columns={'Unnamed: 0': 'gene'}, inplace=True)
df = df.set_index('gene')
cellex.utils.mapping.mgi_mouse_to_ens_mouse(df, drop_unmapped=True, verbose=True)
Mapping: mouse mgi gene id's --> mouse ensembl gene id's ...
Traceback (most recent call last):
File "", line 1, in
File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/cellex-1.0.1-py3.7.egg/cellex/utils/mapping/mgi_mouse_to_ens_mouse.py", line 30, in mgi_mouse_to_ens_mouse
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1151, in resource_stream
self, resource_name
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1398, in get_resource_stream
return io.BytesIO(self.get_resource_string(manager, resource_name))
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1401, in get_resource_string
return self._get(self._fn(self.module_path, resource_name))
File "/hpc/packages/minerva-centos7/python/3.7.3/lib/python3.7/site-packages/pkg_resources/init.py", line 1540, in _get
return self.loader.get_data(path)
OSError: [Errno 0] Error: 'cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz'��

Apologies for the super basic question, but does CELLEX need to be installed again or something?

@bengnielsen
Copy link
Collaborator

No worries - no questions are too basic!
But yeah, it seems like your server still has CELLEX v.1.0.1, which means that the bug from yesterday

OSError: [Errno 0] Error: 'cellex/utils/mapping/maps/Mus_musculus.GRCm38.90.gene_name_version2ensembl.txt.gz' 

would still occur.

Did you happen to install CELLEX using pip? If yes, you can update it with a pip install -U cellex and get the latest CELLEX v. 1.1.1

@DaianeH
Copy link
Author

DaianeH commented Mar 16, 2020

Ok, so now CELLEX is updated and I can generate the hippocampus_cellect.esmu.csv.gz with ENSEMBL id's. However, when I run CELLECT on it, now I get another error... well in fact I don't get an error, I just get:

...
[Sun Mar 15 22:33:49 2020]
Finished job 215.
140 of 240 steps (58%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/.snakemake/log/2020-03-15T223106.425023.snakemake.log

And I cannot exactly know which job execution failed - there's no errors in the log file, only a few warnings. Could you please be so kind to have a look at the log file and help me on this? Thanks!

2020-03-15T223106.425023.snakemake.log

@bengnielsen
Copy link
Collaborator

Hi again, I've looked into the 2020-03-15T223106.425023.snakemake.log file and spotted the following error near the bottom:

Missing files after 5 seconds:
/sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/dropviz-hippocampus/bed/dropviz-hippocampus.nan.bed

So the point where things start to go wrong is at the format_and_map_genes step, which outputs a .bed for each cell type present in your CELLEX'ed dataset:

[Sun Mar 15 22:31:34 2020]
rule format_and_map_genes:
    input: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/multi_genesets/multi_geneset.dropviz-hippocampus.txt
    output: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/precomputation/dropviz-hippocampus/bed/dropviz-hippocampus.nan.bed
    log: /sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE/logs/log.format_and_map_snake.dropviz-hippocampus.nan.txt
    jobid: 238
    wildcards: BASE_OUTPUT_DIR=/sc/orga/projects/loosr01a/daiane/projects/scRNAseq/CELLECT/CELLECT-LDSC-EXAMPLE, run_prefix=dropviz-hippocampus, annotation=nan

It would perhaps seem like one of your celltypes in the hippocampus_cellect.esmu.csv.gz have a nan as their celltype label and thus might messes up the script (perhaps because nan doesn't translate very well when converted to string? What's your take on this, @Tobi1kenobi?) and thus required files .bed files to proceed are not generated, hence snakemake cancels further job execution.

Sorry if this is inconvenient, but I might have to ask you to find the cell types in the CELLEX'ed dataset with the nan labels and rename them.

@Tobi1kenobi
Copy link
Contributor

I'd agree, this sounds like #3 . It should've been fixed a long time ago but I think has been forgotten about, apologies.

But as @bengnielsen said, I'd expect renaming 'nan' as 'unknown' (for example) would resolve the issue.

@DaianeH
Copy link
Author

DaianeH commented Mar 30, 2020

Hi there,

Changing the name of the cluster from 'nan' to 'Unknown' worked.

So just to close this thread:

1 - CELLECT throws an error when the name of a cluster is "nan"
2 - You suggested I put a letter in front of the cluster numbers, but that doesn't make a difference when there's an hyphen in the cluster name (when there's numbers and hyphens, the cluster name is read as a string). However, I run CELLECT in another dataset with cluster names represented only as numbers, and it threw an error. When adding a letter in front of the cluster numbers, the error was gone. So CELLECT will throw an error when cluster names are numbers only (it must be a string).

Many thanks for your help!

@pascaltimshel
Copy link
Collaborator

pascaltimshel commented Apr 12, 2020

Thanks @DaianeH !

To fix this the code should either :

  1. fail explicitly (before staring Snakemake workflow) when number-only or nan encountered

or

  1. improve snakemake regex(?) to allow for number-only or nan annotation names

@pascaltimshel pascaltimshel added check&catch error checks and catching of errors. If very serious issue, report as bug priority_high High priority labels Apr 12, 2020
@pascaltimshel pascaltimshel changed the title Error in format_and_map_genes Check expression specificity annotation names (ES header) Apr 12, 2020
@pascaltimshel pascaltimshel pinned this issue May 13, 2020
@Tobi1kenobi Tobi1kenobi unpinned this issue Jun 18, 2020
@Tobi1kenobi Tobi1kenobi pinned this issue Jun 21, 2020
@Tobi1kenobi Tobi1kenobi unpinned this issue Sep 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
check&catch error checks and catching of errors. If very serious issue, report as bug priority_high High priority
Projects
None yet
Development

No branches or pull requests

4 participants