-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Feature gene annotation #132
Conversation
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
Hi @grst I am not familiar with this building of docs. It is failing due to a warning because the sphinx can't seem to find scanpy in the build environment, any suggestions? |
It seems the I'll then do a proper review. |
for more information, see https://pre-commit.ci
Hi @grst I have pulled your changes and have reverted genomic_position_from_gtf. This is now ready fro review! |
Hi @grst It looks like all tests that use the oligodendroglioma adata are failing? |
I think this is rather an incompatibility with numpy 2.0 which was very recently released. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor comments. Thanks for adding test cases :)
Have you compared the runtime before and after adding your code? It looks like it could make the whole workflow significantly slower and memory intense. In that case it would be great if it could be disabled via a function parameter in tl.infercnvpy
.
Hi thanks for the review! I have tested the time on our in-house data ~50,000 cells, ~20,000 genes and it runs within about 10 minutes parallelized over 32 cpu's (before it took ~2 mins). However due to the nested for loop you have to massively reduce the chunksize ~100 as each sample needs to be looped through the flattened index of genes for each window https://github.com/grantn5/infercnvpy/blob/feature_geneAnnotation/src/infercnvpy/tl/_infercnv.py#L230 which will be very long. I also found that there is a bug in n_jobs where it isn't automatically parallelized over all available cpus's so I have always been setting it manually. This is definitely a CPU bound task rather than memory so if you are using a machine with lots of CPUs it will speed up immensely. If required for the PR to be accepted I can do further refactoring to add a parameter to turn off per gene CNV calling by default. However, this may take some time due to the way the function is handling outputs from intermediate functions. |
Thanks for testing this. Given that it's a 5x runtime increase and scalability is one of the main selling points of this library, I would like to be able to switch this off. For 50k cells it might not be much of an issue, but when working with atlases >1M cells, this makes quite a difference. |
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
Hi @grst I have made the calculation of per gene CNV optional, default set to Additionally, I have added some benchmarking to the pytests to show the performance of the added computation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
hi there, i am confused how do we actually get the output for feature gene annotation mapping to back chromosome position, say for cells with very high cnv score? |
I have written some code that will produce a per gene CNV matrix, giving the mean log2 CNV per gene (as 1 gene can appear in many windows)
I have also updated the
np.convolve
mode tovalid
so it matches the documentation in the function. Full methods hereI have also updated the code in the instance where the window is larger than the number of genes on the chromosome, to only run one convolution the length of the genes on the given chr see https://github.com/grantn5/infercnvpy/blob/feature_geneAnnotation/src/infercnvpy/tl/_infercnv.py#L198
I have also updated the
genomic_position_from_gtf
function so it will always read the gtf in as a pandas df (avoids bug where it reads it in as polars df)Closes #128