Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Feature gene annotation #132

Merged
merged 30 commits into from
Jun 26, 2024
Merged

Conversation

grantn5
Copy link
Contributor

@grantn5 grantn5 commented Jun 12, 2024

I have written some code that will produce a per gene CNV matrix, giving the mean log2 CNV per gene (as 1 gene can appear in many windows)
I have also updated the np.convolve mode to valid so it matches the documentation in the function. Full methods here

I have also updated the code in the instance where the window is larger than the number of genes on the chromosome, to only run one convolution the length of the genes on the given chr see https://github.com/grantn5/infercnvpy/blob/feature_geneAnnotation/src/infercnvpy/tl/_infercnv.py#L198

I have also updated the genomic_position_from_gtf function so it will always read the gtf in as a pandas df (avoids bug where it reads it in as polars df)

gtf = gtfparse.read_gtf(
        gtf_file, usecols=["seqname", "feature", "start", "end", "gene_id", "gene_name"], result_type="pandas"
    )

Closes #128

@grantn5
Copy link
Contributor Author

grantn5 commented Jun 12, 2024

Hi @grst I am not familiar with this building of docs. It is failing due to a warning because the sphinx can't seem to find scanpy in the build environment, any suggestions?

@grst
Copy link
Member

grst commented Jun 13, 2024

It seems the pca function got moved within scanpy. I fixed it in another PR.
Once you resolved the merge conflicts, readthedocs should build again.

I'll then do a proper review.

src/infercnvpy/io/_genepos.py Outdated Show resolved Hide resolved
src/infercnvpy/tl/__init__.py Outdated Show resolved Hide resolved
@grantn5
Copy link
Contributor Author

grantn5 commented Jun 14, 2024

Hi @grst I have pulled your changes and have reverted genomic_position_from_gtf. This is now ready fro review!

@grantn5
Copy link
Contributor Author

grantn5 commented Jun 18, 2024

Hi @grst It looks like all tests that use the oligodendroglioma adata are failing?
Has this object been moved in scanpy?

@grst
Copy link
Member

grst commented Jun 18, 2024

I think this is rather an incompatibility with numpy 2.0 which was very recently released.
I'll look into it, but the issue might be in one of the dependencies.

Copy link
Member

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor comments. Thanks for adding test cases :)

Have you compared the runtime before and after adding your code? It looks like it could make the whole workflow significantly slower and memory intense. In that case it would be great if it could be disabled via a function parameter in tl.infercnvpy.

src/infercnvpy/tl/_infercnv.py Outdated Show resolved Hide resolved
src/infercnvpy/tl/_infercnv.py Outdated Show resolved Hide resolved
src/infercnvpy/tl/_infercnv.py Outdated Show resolved Hide resolved
src/infercnvpy/tl/_infercnv.py Outdated Show resolved Hide resolved
@grantn5
Copy link
Contributor Author

grantn5 commented Jun 19, 2024

Just some minor comments. Thanks for adding test cases :)

Have you compared the runtime before and after adding your code? It looks like it could make the whole workflow significantly slower and memory intense. In that case it would be great if it could be disabled via a function parameter in tl.infercnvpy.

Hi thanks for the review!

I have tested the time on our in-house data ~50,000 cells, ~20,000 genes and it runs within about 10 minutes parallelized over 32 cpu's (before it took ~2 mins). However due to the nested for loop you have to massively reduce the chunksize ~100 as each sample needs to be looped through the flattened index of genes for each window https://github.com/grantn5/infercnvpy/blob/feature_geneAnnotation/src/infercnvpy/tl/_infercnv.py#L230 which will be very long.

I also found that there is a bug in n_jobs where it isn't automatically parallelized over all available cpus's so I have always been setting it manually.

This is definitely a CPU bound task rather than memory so if you are using a machine with lots of CPUs it will speed up immensely.

If required for the PR to be accepted I can do further refactoring to add a parameter to turn off per gene CNV calling by default. However, this may take some time due to the way the function is handling outputs from intermediate functions.

@grst
Copy link
Member

grst commented Jun 20, 2024

Thanks for testing this. Given that it's a 5x runtime increase and scalability is one of the main selling points of this library, I would like to be able to switch this off. For 50k cells it might not be much of an issue, but when working with atlases >1M cells, this makes quite a difference.

@grantn5
Copy link
Contributor Author

grantn5 commented Jun 24, 2024

Hi @grst I have made the calculation of per gene CNV optional, default set to False, I have also added some documentation to the function.

Additionally, I have added some benchmarking to the pytests to show the performance of the added computation.
Hopefully this is now good merge!

Copy link
Member

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@grst grst merged commit 0073690 into icbi-lab:main Jun 26, 2024
6 checks passed
@knadia07
Copy link

knadia07 commented Aug 1, 2024

hi there, i am confused how do we actually get the output for feature gene annotation mapping to back chromosome position, say for cells with very high cnv score?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Per gene copy number signal.
4 participants