Optionally assemble using multiple threads #625

kinnala · 2021-04-15T09:28:56Z

As discussed in #450. Here is a snippet that I used to verify the performance gain:

import numpy as np
from skfem import *
from numba import jit

m = MeshTet().refined(5)
basis = InteriorBasis(m, ElementTetP2())

@jit(nogil=True, nopython=True)
def nlaplace(out, du, dv):
    for i in range(du.shape[1]):
        for j in range(du.shape[2]):
            for k in range(du.shape[0]):
                out[i, j] += du[k, i, j] * dv[k, i, j]

@BilinearForm(nthreads=4)
def bilinf(u, v, w):
    out = np.empty_like(u.grad[0])
    nlaplace(out, u.grad, v.grad)
    return out

A = bilinf.assemble(basis)

adtzlr · 2021-04-15T19:27:41Z

I extended your above snippet to

import numpy as np
from skfem import *
from numba import jit

from skfem.helpers import dot, grad

m = MeshTet().refined(5)
basis = InteriorBasis(m, ElementTetP2())

@BilinearForm
def bilinf2(u, v, w):
    return dot(grad(u), grad(v))

@jit(nogil=True, nopython=True)
def nlaplace(out, du, dv):
    for i in range(du.shape[1]):
        for j in range(du.shape[2]):
            for k in range(du.shape[0]):
                out[i, j] += du[k, i, j] * dv[k, i, j]

@BilinearForm(nthreads=4)
def bilinf1(u, v, w):
    out = np.empty_like(u.grad[0])
    nlaplace(out, u.grad, v.grad)
    return out

Then I measured execution times with %timeit in IPython, first for the parallel assembly

In [2] : %timeit bilinf1.assemble(basis)
3.13 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

and another execution with the default assembly

In [3] : %timeit bilinf2.assemble(basis)
4.18 s ± 185 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(I'm using my Surface Pro 4 now so execution times could be a bit slow...). Therefore the execution time for this example is about 25% faster for the parallel version with 4 threads. If I use 8 or 16 threads it takes only 2.75s --> 34% faster assemble compared to the non-parallel version.

If I try to assemble the scikit-fem-"default" bilinf1 with 8 threads in parallel I also get impressive 2.91s without all the numba hassle. That's really impressive! Hopefully %timeit measures execution times the right way... 😄

I'll check my hyperelasticity problem from #616 ...

adtzlr · 2021-04-15T19:54:10Z

For the hyperelasticity example in #616 the total execution time is 50s (nthreads=8) compared to 76s (no parallel assembly) for one increment with 4 newton-iterations. If I compare the execution times of the K11 stiffness assemble I get 8.9s in parallel vs. 13.7s in default mode. I'll post the script code in an additional comment. I'm too lazy for today ro rewrite the whole form in "explicit" numba code 😪

adtzlr · 2021-04-15T19:56:15Z

import numpy as np
from scipy.sparse import bmat

import skfem
from skfem.helpers import grad, det, inv


mu = 1.0
bulk = 5000.0


# Newton-Rhapson parameters
tol = 1e-8
maxiter = 8

# array with values of "External Displacement Z" on boundary "move"
move_uz = -np.arange(0,0.2,0.1)

print("Init Stage")
print("==========")
print("* Init mesh.")
m = skfem.mesh.MeshHex().refined(3)
eu = skfem.element.ElementVectorH1(skfem.element.ElementHex1())
ep = skfem.element.ElementHex0()
eJ = skfem.element.ElementHex0()

e = {
    "u": eu,
    "p": ep,
    "J": eJ
    }

print("* Init basis for (Hu-Washizu) Three-Field-Variation (u-p-J̅)")
print("  for nearly-incompressible hyperelastic materials.\n")
print("  Π_int = ∫ ̂ψ(̂F) dV + ∫ U(J̅) dV + ∫ p ((det(F) - J̅) dV\n")
basis = {"u": skfem.InteriorBasis(m,eu, intorder=1),
         "p": skfem.InteriorBasis(m,ep, intorder=0),
         "J": skfem.InteriorBasis(m,ep, intorder=0)
        }

def transpose(A):
    "Transpose of 2d and Major-transpose of 4d Arrays with trailing axes."
    if len(np.shape(A))-2 == 0:
        return A.T
    elif len(np.shape(A))-2 == 2:
        return np.transpose(A,(1,0,2,3))
    elif len(np.shape(A))-2 == 4:
        return np.transpose(A,(2,3,0,1,4,5))
    elif len(np.shape(A))-2 > 4:
        return np.transpose(A,(2,3,0,1,*range(len(A.shape))[4:]))
    else:
        raise NotImplementedError("Unknown shape of array.")

def identity(A):
    "Identity Matrix of same shape as A with trailing axes."
    if isinstance(A,np.ndarray):
        ndim = A.shape[0]
        a,b = A.shape[-2:]
    else:
        # scikit-fem
        ndim = A.value.shape[0]
        a,b = A.value.shape[-2:]
    return np.tile(np.identity(ndim), (b,a,1,1)).T

def dya(A,B=None):
    "Dyadic (outer tensor) product of two Tensors."
    if B is None:
        B = A
    return np.einsum("ij...,kl...->ijkl...",A,B)

def cdyau(A,B=None):
    "ik-crossed dyadic product of two Tensors."
    if B is None:
        B = A
    return np.einsum("ij...,kl...->ikjl...",A,B)

def cdyal(A,B=None):
    "il-crossed dyadic product of two Arrays."
    if B is None:
        B = A
    return np.einsum("ij...,kl...->ilkj...",A,B)

def cdya(A,B=None):
    "Symmetric crossed dyadic product of two Arrays."
    if B is None:
        B = A
    return (cdyau(A,B) + cdyal(A,B)) / 2

def __dot22(A,B):
    "Dot product of a 2d and a 2d Array."
    return np.einsum("ij...,jk...->ik...",A,B)

def __dot42(A,B):
    "Dot product of a 4d and a 2d Array."
    return np.einsum("ijkl...,lm...->ijkm...",A,B)

def __dot24(A,B):
    "Dot product of a 2d and a 4d Array."
    return np.einsum("ij...,jklm...->iklm...",A,B)

def __dot44(A,B):
    "Dot product of a 4d and a 4d Array."
    return np.einsum("ijkl...,lmnp...->ijkmnp...",A,B)

def _singledot(A,B):
    "Dot product of two Arrays."
    if len(np.shape(A))-2 == 2 and len(np.shape(B))-2 == 4:
        return __dot24(A,B)
    elif len(np.shape(A))-2 == 4 and len(np.shape(B))-2 == 2:
        return __dot42(A,B)
    else:
        return __dot22(A,B)
    
def dot(A,B,*args):
    "Multi-argument dot product of two Arrays."
    C = _singledot(A,B)
    for i in range(len(args)):
        C = _singledot(C,args[i])
    return C

def __ddot22(A,B):
    "Double dot product of a 2d and a 2d Array."
    return np.einsum("ij...,ij...->...",A,B)

def __ddot24(A,B):
    "Double dot product of a 2d and a 4d Array."
    return np.einsum("ij...,ijkl...->kl...",A,B)

def __ddot42(A,B):
    "Double dot product of a 4d and a 2d Array."
    return np.einsum("ijkl...,kl...->ij...",A,B)

def __ddot44(A,B):
    "Double dot product of a 4d and a 4d Array."
    return np.einsum("ijkl...,klmn...->ijmn...",A,B)

def _singledoubledot(A,B):
    "Double-Dot product of two Arrays."
    if len(np.shape(A))-2 == 2 and len(np.shape(B))-2 == 4:
        return __ddot24(A,B)
    elif len(np.shape(A))-2 == 4 and len(np.shape(B))-2 == 2:
        return __ddot42(A,B)
    else:
        return __ddot22(A,B)
    
def ddot(A,B,*args):
    "Multi-argument double-dot product of two Arrays."
    C = _singledoubledot(A,B)
    for i in range(len(args)):
        C = _singledoubledot(C,args[i])
    return C

def _det(A):
    "Determinant with optional trailing axes."
    return np.linalg.det(A.T).T

def _inv(A):
    "Inverse  with optional trailing axes."
    return np.linalg.inv(A.T).T

def trace(A):
    "Trace with optional trailing axes."
    return np.trace(A)

def vol(A):
    "Spherical (volumetric) part of A (with optional trailing axes)."
    return trace(A)/3 * identity(A)

def dev(A):
    "Deviatoric part of A (with optional trailing axes)."
    return A - vol(A)

def fields(w):
    return w["displacements"], w["pressure"].value, w["volumeratio"].value

def F1(w):
    """1st Piola-Kirchhoff stress tensor obtained from 
    variation of the potential w.r.t. the displacements."""
    u,p,theta = fields(w)
    
    F = grad(u) + identity(u)
    J = det(F)

    Pdev = mu * (F - ddot(F,F)/3*transpose(inv(F))) * J**(-2/3)
    Pvol = p * J * transpose(inv(F))
    
    return Pdev + Pvol

def F2(w):
    """Equilibrium equation obtained from variation 
    of the potential w.r.t. the pressure."""
    u,p,theta = fields(w)
    
    F = grad(u) + identity(u)
    J = det(F)
    
    return J-theta

def F3(w):
    """Equilibrium equation obtained from variation 
    of the potential w.r.t. the volume ratio 'theta'."""
    u,p,theta = fields(w)
    
    return bulk * (theta-1) - p

def A11(w):
    """Elasticity tensor obtained from the linearization 
    w.r.t. the displacements of the variation of the 
    potential w.r.t. the displacements."""
    u,p,theta = fields(w)
    
    eye = identity(u)
    F = grad(u) + eye
    J = det(F)
    iFT = transpose(inv(F))
    
    A4_dev = mu * ( cdyau(eye)
              -2/3 * dya(F,iFT)
              -2/3 * dya(iFT,F)
              +2/9 * ddot(F,F) * dya(iFT)
              +1/3 * ddot(F,F) * cdyal(iFT)
             ) * J**(-2/3)
    
    A4_vol = p * J * (dya(iFT) - cdyal(iFT))
    
    return A4_dev + A4_vol

def A12(w):
    """Linearization w.r.t. the displacements of the variation of the 
    potential w.r.t. the pressure."""
    u,p,theta = fields(w)
    
    F = grad(u) + identity(u)
    J = det(F)

    return J*transpose(inv(F))

def A13(w):
    """Linearization w.r.t. the displacements of the variation of the 
    potential w.r.t. the volume ratio."""
    u,p,theta = fields(w)

    return np.zeros_like(grad(u))

def A22(w):
    """Linearization w.r.t. the pressure of the variation of the 
    potential w.r.t. the pressure."""
    u,p,theta = fields(w)
    
    return np.zeros_like(p)

def A23(w):
    """Linearization w.r.t. the volume ratio of the variation of the 
    potential w.r.t. the pressure."""
    u,p,theta = fields(w)
    
    return - np.ones_like(p)

def A33(w):
    """Linearization w.r.t. the volume ratio of the variation of the 
    potential w.r.t. the volume ratio."""
    u,p,theta = fields(w)
    
    return np.ones_like(theta) * bulk

print("* Init forms.")

@skfem.LinearForm(nthreads=8)
def a1(v, w):
    return ddot(F1(w), grad(v))

@skfem.LinearForm
def a2(v, w):
    return F2(w) * v

@skfem.LinearForm
def a3(v, w):
    return F3(w) * v

@skfem.BilinearForm(nthreads=8)
def b11(u, v, w):
    return ddot(grad(u), A11(w), grad(v))

@skfem.BilinearForm
def b22(u, v, w):
    return A22(w) * u * v

@skfem.BilinearForm
def b33(u, v, w):
    return A33(w) * u * v

@skfem.BilinearForm
def b12(u, v, w):
    return ddot(A12(w), grad(v)) * u

@skfem.BilinearForm
def b13(u, v, w):
    return ddot(A13(w), grad(v)) * u

@skfem.BilinearForm
def b23(u, v, w):
    return A23(w) * u * v

print("* Init Boundary conditions.")
dofs_left = basis["u"].find_dofs(
        {"left": m.facets_satisfying(lambda x: x[0] <= 0.0)},
        skip=["u^2", "u^3"]
    )
dofs_bottom = basis["u"].find_dofs(
        {"bottom": m.facets_satisfying(lambda x: x[1] <= 0.0)},
        skip=["u^1", "u^3"]
    )
dofs_back = basis["u"].find_dofs(
        {"back": m.facets_satisfying(lambda x: x[2] <= 0.0)},
        skip=["u^1", "u^2"]
    )
dofs_front = basis["u"].find_dofs(
        {"front": m.facets_satisfying(lambda x: np.abs(x[2] - 1.0) <= 0.0)},
        skip=["u^3"]
    )
dofs_move = basis["u"].find_dofs(
            {"move": m.facets_satisfying(lambda x: np.abs(x[2] - 1.0) <= 0.0)},
            skip=["u^1", "u^2"]
            )

print("* Assemble boundaries conditions.")
dofs = {}
for dof in [dofs_left, 
            dofs_bottom, 
            dofs_back, 
            dofs_front, 
            dofs_move,
           ]:
    dofs.update(dof)

print("* Obtain degrees of freedom to keep.")
I = np.hstack((
    basis["u"].complement_dofs(dofs),
    basis["u"].N + np.arange(basis["p"].N),
    basis["u"].N + basis["p"].N + np.arange(basis["J"].N)
))

print("* Init total and incremental unknowns.")
u = basis["u"].zeros()
p = basis["p"].zeros()
J = basis["J"].zeros()

δu = basis["u"].zeros()
δp = basis["p"].zeros()
δJ = basis["J"].zeros()

print("* Initial assembly of stiffness matrix.")
asmprops = {
    "displacements": basis["u"].interpolate(u), 
    "pressure":      basis["p"].interpolate(p), 
    "volumeratio":   basis["J"].interpolate(J)
    }
K11 = skfem.asm(b11, basis["u"], basis["u"], **asmprops)
K22 = skfem.asm(b22, basis["p"], basis["p"], **asmprops)
K33 = skfem.asm(b33, basis["J"], basis["J"], **asmprops)
K12 = skfem.asm(b12, basis["p"], basis["u"], **asmprops)
K13 = skfem.asm(b13, basis["J"], basis["u"], **asmprops)
K23 = skfem.asm(b23, basis["J"], basis["p"], **asmprops)

K = bmat(
    [[K11,   K12,   K13],
     [K12.T, K22,   K23],
     [K13.T, K23.T, K33]], "csr"
)

print("* Initial assembly of equilibrium equations.\n")
f1 = skfem.asm(a1, basis["u"], **asmprops)
f2 = skfem.asm(a2, basis["p"], **asmprops)
f3 = skfem.asm(a3, basis["J"], **asmprops)

f = np.concatenate((f1,f2,f3))

# init array for "Reaction Force Z" on boundary "move" for all increments
move_fz =  np.zeros_like(move_uz)

# increment loop for ramping up "External Displacement Z" on "move"
for increment, move in enumerate(move_uz[1:]):
    print(f"Increment {(increment+1):2d}")
    print("============")
    print(f'* prescribed displacement Z on boundary "move" U_Z={move:1.3e}''')

    # newton-rhapson iteration loop
    print("\nNewton-Rhapson iterations")
    print("-------------------------")
    
    # init converged flag
    converged = False
    
    # Newton-Rhapson iteration loop
    for iteration in range(maxiter):
        
        # enforce displacements on boundary "move"
        δupJ_prescribed = np.hstack((u,p,J))
        δupJ_prescribed[dofs["move"].all()] = move - δupJ_prescribed[dofs["move"].all()]
    
        #if not np.allclose(δupJ_prescribed[dofs["move"].all()],0):
        #    print(f'* increase incremental displacements on boundary "move" δU_Z={δupJ_prescribed[dofs["move"].all()][0]:1.3e}\n')

        # solve system
        system = skfem.condense(K, -f, x=δupJ_prescribed, I=I)
        uvpJ = skfem.solve(*system, use_umfpack=True)

        δu, pJ = np.split(uvpJ, [u.shape[0]])
        δp, δJ = np.split(pJ,   [p.shape[0]])
        
        # calculate norms of unknowns
        norm_δu = np.linalg.norm(δu)
        norm_δp = np.linalg.norm(δp)
        norm_δJ = np.linalg.norm(δJ)
        
        # check if solution is valid
        if np.isnan(norm_δu) or np.isnan(norm_δp) or np.isnan(norm_δJ):
            break
        
        # update unknowns
        u += δu
        p += δp
        J += δJ
        
        # keyword arguments for assembly
        asmprops = {
            "displacements": basis["u"].interpolate(u), 
            "pressure":      basis["p"].interpolate(p), 
            "volumeratio":   basis["J"].interpolate(J)
            }
        
        # update equilibrium equations
        f1 = skfem.asm(a1, basis["u"], **asmprops)
        f2 = skfem.asm(a2, basis["p"], **asmprops)
        f3 = skfem.asm(a3, basis["J"], **asmprops)
        
        f = np.concatenate((f1,f2,f3))

        # get active displacement dofs
        dof1  = basis["u"].complement_dofs(dofs)
        
        # reference force as norm of nodals reaction forces at prescribed dofs
        f1_ref  = np.linalg.norm(np.delete(f1,dof1))
        
        # check nodal equilibrium for nodal force residuals at active dofs
        norm_f1 = np.linalg.norm(f1[dof1]) / f1_ref
        
        # update stiffness
        K11 = skfem.asm(b11, basis["u"], basis["u"], **asmprops)
        K22 = skfem.asm(b22, basis["p"], basis["p"], **asmprops)
        K33 = skfem.asm(b33, basis["J"], basis["J"], **asmprops)
        K12 = skfem.asm(b12, basis["p"], basis["u"], **asmprops)
        K13 = skfem.asm(b13, basis["J"], basis["u"], **asmprops)
        K23 = skfem.asm(b23, basis["J"], basis["p"], **asmprops)
        
        # build sparse stiffness matrix from sparse sub-blocks
        K = bmat(
            [[K11,   K12,   K13],
             [K12.T, K22,   K23],
             [K13.T, K23.T, K33]], "csr"
        )
        
        print(f"#{iteration+1:2d}: |f|={norm_f1:1.3e} (|δu|={norm_δu:1.3e} |δp|={norm_δp:1.3e} |δJ|={norm_δJ:1.3e})")
        
        if norm_f1 < tol:
            converged = True
            move_fz[increment+1] = np.sum(f1[dofs["move"].all()])
            print(f'\n* final "Reaction Force Z" on boundary "move" F_Z={move_fz[increment+1]:1.3e}\n')
            break
        
    if np.isnan(norm_δu) or not converged:
        # reset counter for last converged increment and break
        increment = increment - 1
        break
    else:
        # export results to xdmf-file and go to next increment
        savedata = {"u": u[basis["u"].nodal_dofs].T, 
                    #"p": p[basis["p"].element_dofs[0]],
                    #"J": J[basis["J"].element_dofs[0]]
                   }
```

adtzlr · 2021-04-15T20:00:26Z

Beside the improvements in execution time the commits look good. Thank you for your efforts!

adtzlr

I'm not really sure about the test-case but the code changes / extensions are fine. Thanks!

kinnala · 2021-04-16T05:30:36Z

I'm not really sure about the test-case but the code changes / extensions are fine. Thanks!

The tests are basically simple proofs of correctness of the implementation, by comparing nthreads=0 and nthreads>0. They could be extended but I don't see any reason at this point. The main thing is that we get the same result for symmetric and nonsymmetric matrices.

kinnala · 2021-04-16T05:31:57Z

For the hyperelasticity example in #616 the total execution time is 50s (nthreads=8) compared to 76s (no parallel assembly) for one increment with 4 newton-iterations. If I compare the execution times of the K11 stiffness assemble I get 8.9s in parallel vs. 13.7s in default mode. I'll post the script code in an additional comment. I'm too lazy for today ro rewrite the whole form in "explicit" numba code

Are you saying that the improvement in execution time is not as good as when using joblib? Maybe there is some room for improvement here.

Edit: I mean, "8.9s in parallel vs. 13.7s in default mode." is about 1.5x improvement vs. in #450 you said "about 2x" improvement.

kinnala · 2021-04-16T06:01:35Z

Could the issue be that my variant passes all the input data to all the threads whereas your joblib variant passes only the data that the specific thread needs?

kinnala · 2021-04-16T06:04:23Z

Also, I think the joblib version always creates as many threads as there are entries in the local stiffness matrix?

kinnala · 2021-04-16T06:11:29Z

I think here the best performance can be obtained by verifying that the input data gets split evenly to the threads, so in this Laplace example the local stiffness matrix has 100 entries so dividing 100 / nthreads should be an integer.

Optionally assemble using multiple threads

c284d7b

kinnala mentioned this pull request Apr 15, 2021

Parallelizing the assembly over the elements #450

Closed

Include threaded assembly in tests

ae0dc55

adtzlr approved these changes Apr 15, 2021

View reviewed changes

kinnala closed this Apr 16, 2021

kinnala reopened this Apr 16, 2021

Some simplifications

122c298

kinnala merged commit d3da97a into master Apr 19, 2021

kinnala deleted the add-optional-threading branch April 19, 2021 09:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optionally assemble using multiple threads #625

Optionally assemble using multiple threads #625

kinnala commented Apr 15, 2021

adtzlr commented Apr 15, 2021 •

edited

Loading

adtzlr commented Apr 15, 2021 •

edited

Loading

adtzlr commented Apr 15, 2021

adtzlr commented Apr 15, 2021

adtzlr left a comment

kinnala commented Apr 16, 2021

kinnala commented Apr 16, 2021 •

edited

Loading

kinnala commented Apr 16, 2021

kinnala commented Apr 16, 2021

kinnala commented Apr 16, 2021

Optionally assemble using multiple threads #625

Optionally assemble using multiple threads #625

Conversation

kinnala commented Apr 15, 2021

adtzlr commented Apr 15, 2021 • edited Loading

adtzlr commented Apr 15, 2021 • edited Loading

adtzlr commented Apr 15, 2021

adtzlr commented Apr 15, 2021

adtzlr left a comment

Choose a reason for hiding this comment

kinnala commented Apr 16, 2021

kinnala commented Apr 16, 2021 • edited Loading

kinnala commented Apr 16, 2021

kinnala commented Apr 16, 2021

kinnala commented Apr 16, 2021

adtzlr commented Apr 15, 2021 •

edited

Loading

adtzlr commented Apr 15, 2021 •

edited

Loading

kinnala commented Apr 16, 2021 •

edited

Loading