Benchmark relationship matrices #42

Open
timothymillar opened this issue Nov 20, 2023 · 7 comments

@timothymillar
Contributor

Benchmarking against commonly used libraries using public data. The details listed here are a starting point and may change.

Data:
Start with Cleveland, Hickey and Forni (2012): a widely used pig dataset with 6473 pedigreed individuals, of which 3534 have been genotyped. The genotype data includes 52843 biallelic SNVs with missing values already imputed.

Scope:
Commonly used estimators applied to diploid data; a rough sketch of timing these calls follows the list.

  • genomic_relationship(estimator='VanRaden')
  • pedigree_kinship(method='diploid')
  • hybrid_relationship()
  • pc_relate() (?)
  • pedigree_inbreeding() (? not a matrix)
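
As a starting point, here is a minimal (untested) sketch of how these might be timed from Python. The dataset variables and keyword arguments each method needs (dosages, parent IDs, ancestral allele frequencies, etc.) are assumed to already be present in `ds` and should be checked against the sgkit docs for the installed version; the remaining estimators in the list would be timed the same way.

```python
import time

import sgkit as sg

def timed(label, fn):
    # Force the lazy dask graph to run so we measure actual compute time.
    start = time.perf_counter()
    result = fn().compute()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# ds: sgkit dataset built from the Cleveland et al. (2012) pig data.
timed("genomic_relationship", lambda: sg.genomic_relationship(ds, estimator="VanRaden"))
timed("pedigree_kinship", lambda: sg.pedigree_kinship(ds, method="diploid"))
```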

Comparisons:

  • R:
    • AGHmatrix
    • sommer
    • ASRGenomics (calls AGHmatrix for some functions)

Notes/Concerns:

  • VanRaden estimator:
  • Performance will largely be measuring a single matrix multiplication, and hence this will primarily be a test of the underlying linear-algebra libraries (see the sketch after this list). These are highly parallelized, so dask won't improve performance on a single machine.
    • Some of the commonly used implementations (especially AGHmatrix) do a lot of additional validation and computation such as mean imputation and filtering by minor allele frequency. This requires some thought about what constitutes a reasonable comparison. One option would be to time the equivalent operations in sgkit. However, these operations being optional in sgkit is a strength that we should emphasize (they are often unnecessary).
    • Potential to emphasize the advantage of dask here if we have a suitably large dataset that benefits from being spread across multiple nodes.
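
To make that concrete, here is a bare NumPy sketch of the VanRaden (2008) estimator (not sgkit's actual implementation). Essentially all of the work is in the single `z @ z.T` call, so the benchmark largely measures the BLAS behind it.

```python
import numpy as np

def vanraden_grm(dosage: np.ndarray) -> np.ndarray:
    """VanRaden (2008) GRM from a (samples, variants) matrix of 0/1/2 dosages."""
    p = dosage.mean(axis=0) / 2        # per-variant allele frequency
    z = dosage - 2 * p                 # centre by expected dosage
    denom = 2 * np.sum(p * (1 - p))    # VanRaden scaling factor
    return (z @ z.T) / denom           # dominant cost: one matmul

# For the pig data, dosage.shape would be (3534, 52843).
```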
@timothymillar
Copy link
Contributor Author

I ran into dask/dask#10645 when benchmarking, which is probably slowing us down in a few locations. Hopefully there will be a quick fix.

@jeromekelleher
Collaborator

jeromekelleher commented Nov 24, 2023

I guess the factor of 2 here is basically the difference between us and the R implementations, since the time is dominated by the matmul, and we'd expect linear algebra implementations to be broadly equivalent (if not identical, maybe ultimately calling into the same Fortran code).

A good angle might be to explore parallelisation: can we parallelise this over multiple threads with Dask (no point in going fully distributed across a cluster here, I think)? I expect that's quite hard with the R implementations.
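
For illustration, a hypothetical dask.array version of the core matmul, chunked over the variants axis so the blocks can run on multiple threads (random data standing in for the centred dosage matrix):

```python
import dask.array as da
import numpy as np

# Random stand-in for the centred dosage matrix (samples x variants).
z = da.from_array(np.random.rand(3534, 52843), chunks=(3534, 8192))

# Each block product can run on a separate thread.
grm = (z @ z.T).compute(scheduler="threads")
```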

@timothymillar
Contributor Author

can we parallelise this over multiple threads with Dask

Yes, but it's not worth it unless distributing across multiple nodes. The matmul in numpy is already multi-threaded. Likewise, the equivalent functions in R are multi-threaded (crossprod/tcrossprod, etc.).

Based on a preliminary look, there is also quite a bit of difference in timings due to additional checks on the data, which can be quite inefficient: things like checking that dosages are integers, performing mean imputation, filtering by MAF, and so on. Our implementation leaves these tasks to the user (they may not be wanted or appropriate!), but this makes it harder to perform a like-for-like comparison.
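
Roughly the kind of preprocessing being bundled in elsewhere, as a plain NumPy sketch (not any package's actual code):

```python
import numpy as np

def preprocess(dosage: np.ndarray, maf_threshold: float = 0.05) -> np.ndarray:
    """Integer check, mean imputation and MAF filtering on a (samples, variants)
    dosage matrix with np.nan marking missing calls."""
    observed = ~np.isnan(dosage)
    if not np.all(dosage[observed] == np.round(dosage[observed])):
        raise ValueError("dosages must be integer allele counts")
    col_mean = np.nanmean(dosage, axis=0)
    dosage = np.where(np.isnan(dosage), col_mean, dosage)   # mean imputation
    freq = col_mean / 2
    maf = np.minimum(freq, 1 - freq)
    return dosage[:, maf >= maf_threshold]                  # drop rare variants
```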

@timothymillar
Contributor Author

I'm also looking for implementations of pedigree kinship/relatedness to compare against. I think we will compare quite favorably, but I'm currently being slowed down by the sheer number of implementations (mostly R packages).

@jeromekelleher
Collaborator

I guess we can just say that the dominant operation is a matmul then, and quote some times based on that? We could time doing just the matmul in R, showing that we're basically calling into equivalently optimised code?

So, report our time end-to-end, and then quote just the time taken to do the matmul in R? Or would that be super fiddly?

@timothymillar
Contributor Author

I think that would work. Report end-to-end times for the implementations and have R and NumPy timings for just the matmul. Maybe a sentence or two covering some of the preprocessing/validation that some of the other implementations do.
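
For the NumPy side, the bare-matmul timing could be as simple as the following (random data as a stand-in for the centred dosage matrix); the R side would time crossprod/tcrossprod the same way:

```python
import time

import numpy as np

z = np.random.rand(3534, 52843)   # stand-in for the centred dosage matrix
start = time.perf_counter()
grm = z @ z.T                     # matmul only, no validation/preprocessing
print(f"matmul only: {time.perf_counter() - start:.2f}s")
```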

@timothymillar
Contributor Author

timothymillar commented Dec 18, 2023

dask/dask#10645 appears to be fixed in their latest release (2023.12.1).
