Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Remove count DataFrame from calculate_cooks #292

Merged
merged 27 commits into from
Jul 2, 2024
Merged

ENH Remove count DataFrame from calculate_cooks #292

merged 27 commits into from
Jul 2, 2024

Conversation

asistradition
Copy link
Contributor

What does your PR implement? Be specific.

calculate_cooks casts normed_counts into a pandas DataFrame for robust_method_of_moments_disp. This is memory inefficient for large data.

robust_method_of_moments_disp has been refactored to accept an ndarray directly and the DataFrame has been removed. There is no numerical change as a result.

Copy link
Collaborator

@umarteauowkin umarteauowkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @asistradition , thanks a lot for this PR. I was wondering how important it was to you that we put _mu_LFC and _hat_diagonals in the obsm and not in the layers. I agree it makes sense not to store useless nan values. However, fundamentally, these matrices are more layers than simply obsm (we always have this issue between objects restricted to non zero genes and objects defined on all genes, ideallly we would like to have a layers field restricted to non zero genes but we don't). However, if you experience significant memory differences by keeping it in the layers, I will accept !

@asistradition
Copy link
Contributor Author

The main advantage to using .obsm is that column slicing a row-major array requires a copy, and there is a considerable amount of overhead when calling .layers[key][:, filtered_genes] repeatedly.

As those keys are only used in cooks_distances it is a considerable optimization (for large data, e.g. 50k x 30k) to move them to .obsm remove those copies.

@asistradition
Copy link
Contributor Author

Also includes an optional control_genes argument to fit_size_factors which has the same behavior as the controlGenes argument to estimateSizeFactors and the associated unit test

pydeseq2/dds.py Outdated Show resolved Hide resolved
@BorisMuzellec
Copy link
Collaborator

Thanks @asistradition for this PR!

I agree with @umarteauowkin that on principle storing _mu_LFC and _hat_diagonals in the obsm and not in the layers is a bit awkward, but if this leads to memory gains I'm fine with it, given that (as you pointed out) they are only used in cooks_distances.

@BorisMuzellec BorisMuzellec merged commit 3505d78 into owkin:main Jul 2, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants