Benchmark relationship matrices #42

Open
timothymillar opened this issue Nov 20, 2023 · 7 comments

@timothymillar
Contributor

Benchmarking against commonly used libraries using public data. The details listed here are a starting point and may change.

Data:
Start with Cleveland, Hickey and Forni (2012): a widely used pig dataset with 6473 pedigreed individuals, of which 3534 have been genotyped. The genotype data includes 52843 biallelic SNVs with missing values already imputed.

Scope:
Commonly used estimators applied to diploid data; a rough sketch of timing these calls follows the list.

  • genomic_relationship(estimator='VanRaden')
  • pedigree_kinship(method='diploid')
  • hybrid_relationship()
  • pc_relate() (?)
  • pedigree_inbreeding() (? not a matrix)
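
As a starting point, here is a minimal (untested) sketch of how these might be timed from Python. The dataset variables and keyword arguments each method needs (dosages, parent IDs, ancestral allele frequencies, etc.) are assumed to already be present in `ds` and should be checked against the sgkit docs for the installed version; the remaining estimators in the list would be timed the same way.

```python
import time

import sgkit as sg

def timed(label, fn):
    # Force the lazy dask graph to run so we measure actual compute time.
    start = time.perf_counter()
    result = fn().compute()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# ds: sgkit dataset built from the Cleveland et al. (2012) pig data.
timed("genomic_relationship", lambda: sg.genomic_relationship(ds, estimator="VanRaden"))
timed("pedigree_kinship", lambda: sg.pedigree_kinship(ds, method="diploid"))
```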

Comparisons:

  • R:
    • AGHmatrix
    • sommer
    • ASRGenomics (calls AGHmatrix for some functions)

Notes/Concerns:

  • VanRaden estimator:
  • Performance will largely be measuring a single matrix multiplication, and hence this will primarily be a test of the underlying linear-algebra libraries (see the sketch after this list). These are highly parallelized, so dask won't improve performance on a single machine.
    • Some of the commonly used implementations (especially AGHmatrix) do a lot of additional validation and computation such as mean imputation and filtering by minor allele frequency. This requires some thought about what constitutes a reasonable comparison. One option would be to time the equivalent operations in sgkit. However, these operations being optional in sgkit is a strength that we should emphasize (they are often unnecessary).
    • Potential to emphasize the advantage of dask here if we have a suitably large dataset that benefits from being spread across multiple nodes.
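
To make that concrete, here is a bare NumPy sketch of the VanRaden (2008) estimator (not sgkit's actual implementation). Essentially all of the work is in the single `z @ z.T` call, so the benchmark largely measures the BLAS behind it.

```python
import numpy as np

def vanraden_grm(dosage: np.ndarray) -> np.ndarray:
    """VanRaden (2008) GRM from a (samples, variants) matrix of 0/1/2 dosages."""
    p = dosage.mean(axis=0) / 2        # per-variant allele frequency
    z = dosage - 2 * p                 # centre by expected dosage
    denom = 2 * np.sum(p * (1 - p))    # VanRaden scaling factor
    return (z @ z.T) / denom           # dominant cost: one matmul

# For the pig data, dosage.shape would be (3534, 52843).
```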
@timothymillar
Copy link
Contributor Author

I ran into dask/dask#10645 when benchmarking, which is probably slowing us down in a few locations. Hopefully there will be a quick fix.

@jeromekelleher
Collaborator

jeromekelleher commented Nov 24, 2023

I guess the factor of 2 here is basically the difference between us and the R implementations, since the time is dominated by the matmul, and we'd expect linear algebra implementations to be broadly equivalent (if not identical, maybe ultimately calling into the same Fortran code).

A good angle might be to explore parallelisation: can we parallelise this over multiple threads with Dask (no point in going fully distributed across a cluster here, I think)? I expect that's quite hard with the R implementations.
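
For illustration, a hypothetical dask.array version of the core matmul, chunked over the variants axis so the blocks can run on multiple threads (random data standing in for the centred dosage matrix):

```python
import dask.array as da
import numpy as np

# Random stand-in for the centred dosage matrix (samples x variants).
z = da.from_array(np.random.rand(3534, 52843), chunks=(3534, 8192))

# Each block product can run on a separate thread.
grm = (z @ z.T).compute(scheduler="threads")
```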

@timothymillar
Contributor Author

can we parallelise this over multiple threads with Dask

Yes, but it's not worth it unless distributing across multiple nodes. The matmul in numpy is already multi-threaded. Likewise, the equivalent functions in R are multi-threaded (crossprod/tcrossprod, etc.).

Based on a preliminary look, there is also quite a bit of difference in timings due to additional checks on the data, which can be quite inefficient: things like checking that dosages are integers, performing mean imputation, filtering by MAF, and so on. Our implementation leaves these tasks to the user (they may not be wanted or appropriate!), but this makes it harder to perform a like-for-like comparison.
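
Roughly the kind of preprocessing being bundled in elsewhere, as a plain NumPy sketch (not any package's actual code):

```python
import numpy as np

def preprocess(dosage: np.ndarray, maf_threshold: float = 0.05) -> np.ndarray:
    """Integer check, mean imputation and MAF filtering on a (samples, variants)
    dosage matrix with np.nan marking missing calls."""
    observed = ~np.isnan(dosage)
    if not np.all(dosage[observed] == np.round(dosage[observed])):
        raise ValueError("dosages must be integer allele counts")
    col_mean = np.nanmean(dosage, axis=0)
    dosage = np.where(np.isnan(dosage), col_mean, dosage)   # mean imputation
    freq = col_mean / 2
    maf = np.minimum(freq, 1 - freq)
    return dosage[:, maf >= maf_threshold]                  # drop rare variants
```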

@timothymillar
Contributor Author

I'm also looking for implementations of pedigree kinship/relatedness to compare against. I think we will compare quite favorably, but I'm currently being slowed down by the sheer number of implementations (mostly R packages).

@jeromekelleher
Collaborator

I guess we can just say that the dominant operation is a matmul then, and quote some times based on that? We could time doing just the matmul in R, showing that we're basically calling into equivalently optimised code?

So, report our time end-to-end, and then quote just the time taken to do the matmul in R? Or would that be super fiddly?

@timothymillar
Contributor Author

I think that would work. Report end-to-end times for the implementations and have R and NumPy timings for just the matmul. Maybe a sentence or two covering some of the preprocessing/validation that some of the other implementations do.
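
For the NumPy side, the bare-matmul timing could be as simple as the following (random data as a stand-in for the centred dosage matrix); the R side would time crossprod/tcrossprod the same way:

```python
import time

import numpy as np

z = np.random.rand(3534, 52843)   # stand-in for the centred dosage matrix
start = time.perf_counter()
grm = z @ z.T                     # matmul only, no validation/preprocessing
print(f"matmul only: {time.perf_counter() - start:.2f}s")
```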

@timothymillar
Contributor Author

timothymillar commented Dec 18, 2023

dask/dask#10645 appears to be fixed in their latest release (2023.12.1).
