Expensive operations in `updateL!` on large models #649

dmbates · 2022-10-05T16:30:45Z

dmbates
Oct 5, 2022
Maintainer

I created timedupdateL! to report the time spent in each step of updateL!, and applied it to a model of the movielens.org data (ml-latest.zip), restricted so that every user and every movie has at least 20 ratings.
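For readers who want to produce a similar per-step report, here is a minimal sketch of the idea. It is not the timedupdateL! code, which is not shown in this thread; the helper name `timedsteps` and the two example steps are hypothetical stand-ins.

```julia
# Hypothetical helper: run labelled steps in order, print and collect elapsed seconds.
function timedsteps(steps)
    timings = Pair{String,Float64}[]
    for (label, f) in steps
        t = @elapsed f()                       # wall-clock seconds for this step
        push!(timings, label => t)
        println(rpad(label * ":", 40), t)      # aligned "label: seconds" line
    end
    return timings
end

# Stand-in steps; in practice each closure would wrap one stage of the update.
x = rand(1_000)
report = timedsteps([
    "sort a copy" => () -> sort(x),
    "sum of squares" => () -> sum(abs2, x),
])
```

The first call of each closure includes compilation time, so a warm-up run is advisable before reading anything into the numbers.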

The model is

julia> restoreoptsum!(m, "./optsums/mvm20u20.json")
Linear mixed model fit by maximum likelihood
 rating ~ 1 + (1 | userId) + (1 | movieId)
     logLik       -2 logLik         AIC           AICc            BIC      
 -33644689.0320  67289378.0639  67289386.0639  67289386.0639  67289446.4165

Variance components:
            Column   Variance Std.Dev. 
userId   (Intercept)  0.184529 0.429568
movieId  (Intercept)  0.242959 0.492909
Residual              0.732964 0.856133
 Number of obs: 26380766; levels of grouping factors: 174605, 18366

  Fixed-effects parameters:
──────────────────────────────────────────────────
               Coef.  Std. Error       z  Pr(>|z|)
──────────────────────────────────────────────────
(Intercept)  3.43345   0.0038669  887.91    <1e-99
──────────────────────────────────────────────────

providing timings of

julia> @time updateL!(m);
 54.998591 seconds (16 allocations: 624 bytes)

julia> MixedModels.timedupdateL!(m);
Copy last(m.A) to last(m.L):          3.604e-6
copyscaleinflate on diagonal block 1: 0.000294542
Postmultiply of block[2,1]:           0.08106269
Postmultiply of block[3,1]:           0.000848256
copyscaleinflate on diagonal block 2: 0.235504656
Postmultiply of block[3,2]:           0.000154175
Premultiply of block [2, 1]:          0.037202548
cholUnblocked of L[1,1]:              0.000320741
Postmultiply L[2,1] by L[1,1]:        0.04062198
Postmultiply L[3,1] by L[1,1]:        0.001781148
rankUpdate! of L[2,2] from L[2,1]:    53.670266357
cholUnblocked of L[2,2]:              13.16504857
update L[3,2] from L[3,1] and L[2,1]: 0.08640797
Postmultiply L[3,2] by L[2,2]:        0.29992936
rankUpdate! of L[3,3] from L[3,1]:    0.000519499
rankUpdate! of L[3,3] from L[3,2]:    4.7886e-5
cholUnblocked of L[3,3]:              4.902e-6

That is, the rankUpdate! of L[2,2] from L[2,1] is the elephant in the room. The timings of that operation are much worse when L[2,2] is stored in rectangular full packed (RFP) format. (The Cholesky factorization of the diagonal block after the rankUpdate! is comparable in speed under RFP.)

@palday What are the limitations on locking storage in a multi-threaded situation? This update of C, the symmetric block on the diagonal, from A, a sparse block to the left of it, iterates over the columns of A, then over each non-zero in that column, say a at row r, then over the non-zeros at and below a. The updates to C from the non-zeros at and below a all occur in the rth column of C. So it is feasible to use multiple threads as long as clashes caused by two threads both hitting the rth column are avoided. I'm not sure how to put a lock on an index, though. Do you think it is feasible?
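The loop structure described above can be sketched as follows. This is an illustration under simplifying assumptions, not the package's rankUpdate! method: C is a plain dense matrix rather than RFP storage, only the lower triangle is updated, and the function name `lowerrankupdate!` is hypothetical.

```julia
using LinearAlgebra, SparseArrays

# Hypothetical name: add A * A' into the lower triangle of C, one scalar muladd at a time.
function lowerrankupdate!(C::Matrix{T}, A::SparseMatrixCSC{T}) where {T}
    size(C, 1) == size(C, 2) == size(A, 1) ||
        throw(DimensionMismatch("C must be square with size(A, 1) rows"))
    rv = rowvals(A)
    nz = nonzeros(A)
    for j in axes(A, 2)               # iterate over the columns of A
        rng = nzrange(A, j)
        for k in rng                  # a non-zero a at row r of column j
            r = rv[k]
            a = nz[k]
            for kk in k:last(rng)     # the non-zeros at and below a
                i = rv[kk]
                C[i, r] = muladd(nz[kk], a, C[i, r])   # always lands in column r of C
            end
        end
    end
    return C
end

A = sprand(8, 30, 0.2)
C = zeros(8, 8)
lowerrankupdate!(C, A)
```

Because different columns of A can contain non-zeros in the same row r, two threads that split the outer loop over j can write to the same column of C, which is the clash the question is about.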

palday · 2022-10-05T20:49:58Z

palday
Oct 5, 2022
Maintainer

Until Julia's threading scheduler is fully aware of BLAS threads, I'm not sure there's much to gain from threading code that's calling a lot of BLAS things.

That warning aside, I probably wouldn't do locking here, but rather spawn tasks every time we get to a "parallel" point and then synchronize on them before moving on to the next one. It's not the richest way to handle the data dependencies, but it avoids locks and shouldn't be perceptibly slower than the current single-threaded case, no matter how bad scheduling contention is.
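One way the spawn-then-synchronize pattern could look for this update is sketched below. It is an assumption-laden illustration, not code from MixedModels: C is a plain dense matrix, the name `threadedrankupdate!` and the `ntasks` keyword are hypothetical, and the loop is reorganized so that each task owns a disjoint block of columns of C. That reorganization needs a transposed copy of A to look up the non-zeros in a given row, which the original loop does not require.

```julia
using LinearAlgebra, SparseArrays

# Hypothetical name: lower-triangle C += A * A' with one task per block of columns of C.
function threadedrankupdate!(C::Matrix{T}, A::SparseMatrixCSC{T};
                             ntasks::Int=Threads.nthreads()) where {T}
    At = copy(transpose(A))           # column r of At lists the non-zeros in row r of A
    rvA, nzA = rowvals(A), nonzeros(A)
    rvAt, nzAt = rowvals(At), nonzeros(At)
    n = size(C, 2)
    chunks = Iterators.partition(1:n, max(1, cld(n, ntasks)))
    @sync for chunk in chunks         # the synchronization point
        Threads.@spawn for r in chunk # this task alone writes columns r in chunk
            for k in nzrange(At, r)
                j = rvAt[k]           # column of A with a non-zero a at row r
                a = nzAt[k]
                rng = nzrange(A, j)
                # position of row r within column j of A (row indices are sorted)
                start = first(rng) + searchsortedfirst(view(rvA, rng), r) - 1
                for kk in start:last(rng)
                    i = rvA[kk]
                    C[i, r] = muladd(nzA[kk], a, C[i, r])
                end
            end
        end
    end
    return C
end

A = sprand(12, 40, 0.2)
C = zeros(12, 12)
threadedrankupdate!(C, A; ntasks=4)
```

No locks are needed because no two tasks write to the same column of C. The costs are the extra transposed copy, a binary search per non-zero, and possibly uneven work per chunk, so whether this beats the serial loop would have to be measured.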


dmbates · 2022-10-06T22:02:23Z

dmbates
Oct 6, 2022
Maintainer Author

The rankUpdate! method I would be considering doesn't call, directly or indirectly, any BLAS functions. All of the operations are muladd scalar operations - it is just that there are a lot of them and the memory positions hop around a lot.

I had an idea today while out on a walk that it may be possible to use Cartesian indexing for the majority of the operations. We are updating the lower triangle of the symmetric matrix stored in column-major rectangular full packed format. That means that 3/4 of the elements are stored in their original positions in the parent array when the size is odd, or shifted down by one row when the size is even. If the levels of the second grouping factor are sorted by decreasing number of observations then the majority of the update operations will be in this fast update group.

I'll experiment with some code.
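The index mapping behind this idea can be sketched as follows, assuming the untransposed, lower-triangle, column-major RFP layout documented for LAPACK; the thread does not state which variant is used. The helper names `rfpsize`, `rfpindex`, and `fastfraction` are hypothetical.

```julia
# Size of the RFP parent array for an n×n triangle (assumed untransposed layout).
rfpsize(n::Int) = isodd(n) ? (n, n ÷ 2 + 1) : (n + 1, n ÷ 2)

# Position in the parent array of lower-triangle element (i, j), i >= j.
function rfpindex(n::Int, i::Int, j::Int)
    1 <= j <= i <= n || throw(ArgumentError("need 1 <= j <= i <= n"))
    k = n ÷ 2
    if isodd(n)
        # the first k + 1 columns keep their original positions;
        # the trailing triangle is stored transposed in the upper part
        return j <= k + 1 ? CartesianIndex(i, j) : CartesianIndex(j - k - 1, i - k)
    else
        # the first k columns are shifted down by one row;
        # the trailing triangle is stored transposed in the upper part
        return j <= k ? CartesianIndex(i + 1, j) : CartesianIndex(j - k, i - k)
    end
end

# Fraction of lower-triangle elements that fall in the untransposed ("fast") part.
function fastfraction(n::Int)
    ncolfast = isodd(n) ? n ÷ 2 + 1 : n ÷ 2
    return sum(n - j + 1 for j in 1:ncolfast) / (n * (n + 1) ÷ 2)
end
```

For large n, `fastfraction(n)` is close to 3/4, in line with the estimate above. An update that lands in one of the leading columns needs only a constant row offset, which is what would make plain Cartesian indexing into the parent array possible for most of the work.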


palday Oct 7, 2022
Maintainer

Ah, I understand. In that case, I'm not sure how much threading would help. It could be very CPU dependent because of the way the hopping around interacts with caching and, in turn, with how the CPU cache is structured on different multicore processors (e.g. whether each core has its own L1, L2, and L3 caches or shares them). That said, you have definitely piqued my interest in this because trying to multithread the objective has been on my dream list for a while. I'll give it some more thought on my next walk. 😄

Your approach sounds good.
