Benchmark relationship matrices #42
I ran into dask/dask#10645 when benchmarking, which is probably slowing us down in a few locations. Hopefully there will be a quick fix.
I guess the factor of 2 here is basically the difference between us and R implementations, since the time is dominated by the matmul, and we'd expect linear algebra implementations to be broadly equivalent (if not identical, maybe actually calling into the same Fortran code, ultimately). A good angle might be to explore parallelisation - can we parallelise this over multiple threads with Dask (no point in going fully distributed across a cluster here, I think)? I expect that's quite hard with R implementations.
Yes, but it's not worth it unless distributing across multiple nodes. The matmul in NumPy is already multi-threaded. Likewise, the equivalent functions in R are multi-threaded (crossprod/tcrossprod etc.). Based on a preliminary look, there is also quite a bit of difference in timings due to additional checks on the data, which can be quite inefficient - things like checking that dosages are integers, performing mean imputation, filtering by MAF, etc. Our implementation leaves these tasks to the user (they may not be wanted or appropriate!), but this makes it harder to perform a like-for-like comparison.
I'm also looking for implementations of pedigree kinship/relatedness to compare against. I think we will compare quite favorably, but I'm currently being slowed down by the sheer number of implementations (mostly R packages).
I guess we can just say that the dominant operation is a matmul then, and quote some times based on that? We could time doing just the matmul in R, showing that we're basically calling into equivalently optimised code? So, report our time end-to-end, and then quote just the time taken to do the matmul in R? Or would that be super fiddly?
I think that would work. Report end-to-end time for implementations and have R and NumPy timings for just the matmul. Maybe a sentence or two covering some of the preprocessing/validation that some of the other implementations do.
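For the matmul-only timing discussed above, a minimal NumPy sketch might look like the following. The dimensions, variable names, and VanRaden-style centering are illustrative assumptions, not the benchmark code itself:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dimensions standing in for a genotype matrix (variants x samples)
n_variants, n_samples = 2000, 500
calls = rng.integers(0, 3, size=(n_variants, n_samples)).astype("float64")

# VanRaden-style centering: subtract twice the allele frequency per variant
p = calls.mean(axis=1, keepdims=True) / 2
z = calls - 2 * p

start = time.perf_counter()
# The matmul is expected to dominate end-to-end runtime
grm = (z.T @ z) / (2 * (p * (1 - p)).sum())
elapsed = time.perf_counter() - start
print(grm.shape)  # (500, 500)
```

Since `@` dispatches to the installed BLAS, timing this one line should reflect the same multi-threaded kernel that R's `tcrossprod` ultimately calls.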
dask/dask#10645 appears to be fixed in their latest release (2023.12.1). |
Benchmarking against commonly used libraries using public data. The details listed here are a starting point and may change.
Data:
Start with Cleveland, Hickey and Forni (2012). A widely used pig dataset with 6473 pedigreed individuals of which 3534 have been genotyped. The genotype data includes 52843 biallelic SNVs with missing values already imputed.
Scope:
Commonly used estimators applied to diploid data.
genomic_relationship(estimator='VanRaden')
pedigree_kinship(method='diploid')
hybrid_relationship()
pc_relate()
(?) pedigree_inbreeding() (not a matrix)
Comparisons:
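For comparing pedigree kinship implementations, the quantity being computed is the standard recursive diploid kinship. A minimal sketch, assuming individuals are topologically sorted (parents before offspring) - this is illustrative only, not the sgkit implementation:

```python
import numpy as np

def pedigree_kinship(parents):
    """Diploid kinship from an (n, 2) array of parent indices (-1 = unknown)."""
    n = len(parents)
    k = np.zeros((n, n))
    for i in range(n):
        f, m = parents[i]
        # Self-kinship: 1/2 * (1 + kinship between the two parents)
        k[i, i] = 0.5 * (1.0 + (k[f, m] if f >= 0 and m >= 0 else 0.0))
        for j in range(i):
            # Kinship with an earlier individual: average over i's parents
            kf = k[f, j] if f >= 0 else 0.0
            km = k[m, j] if m >= 0 else 0.0
            k[i, j] = k[j, i] = 0.5 * (kf + km)
    return k

# Founders 0 and 1; individuals 2 and 3 are their full siblings
ped = np.array([[-1, -1], [-1, -1], [0, 1], [0, 1]])
K = pedigree_kinship(ped)
print(K[2, 3])  # full sibs of unrelated founders: 0.25
```

The O(n^2) recursion above is what most of the R packages implement in some form, so it gives a like-for-like baseline independent of each package's extra validation.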
Notes/Concerns: