
Compute real matrix logarithm and matrix square root using real arithmetic #39973

Merged (70 commits) on Mar 20, 2021

Conversation

sethaxen (Contributor)

As discussed on Slack (cc @andreasnoack and @baggepinnen), this PR implements the real matrix square root and the real matrix logarithm, following these papers, respectively:

  • Higham, Nicholas J. (1987). Computing real square roots of a real matrix. Linear Algebra and its Applications, 88–89, 405–430. eprint
  • Al-Mohy, Awad H., Higham, Nicholas J., and Relton, Samuel D. (2013). Computing the Fréchet Derivative of the Matrix Logarithm and Estimating the Condition Number. SIAM J. Sci. Comput., 35(4), C394–C410. ISSN 1095-7197. eprint

These algorithms are both recommended in:

  • Higham, Nicholas J. and Hopkins, Edvin (2020). A Catalogue of Software for Matrix Functions, Version 3.0. eprint

This was proposed in #5840 and discussed for sqrt in #4006.

The result is that whenever a real logarithm or square root exists, it is returned. This both speeds up the algorithms (see benchmarks below) and prevents error from accumulating in the imaginary parts of the outputs. To avoid heavy code duplication, the PR refactors the log and sqrt of upper triangular matrices to more generally compute log and sqrt of upper quasi-triangular matrices.
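
For reference, the quasi-triangular form here is the real Schur form: upper triangular except for 2×2 diagonal blocks corresponding to complex-conjugate eigenvalue pairs. A quick illustration (not part of the PR; the matrix is an arbitrary example):

using LinearAlgebra

A = [1.0 -2.0; 1.0 1.0]  # real matrix with complex-conjugate eigenvalues 1 ± √2*im
S = schur(A)             # real Schur decomposition A = Z*T*Z'
S.T                      # quasi-upper-triangular factor: a single 2×2 diagonal block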

The changes:

  • Previously a complex result was always returned for sqrt(A::Matrix{<:Real}) and log(A::Matrix{<:Real}) unless A was symmetric with positive eigenvalues, in which case a real result was returned. Now a real result is returned whenever the eigenvalues are positive.
  • For a real matrix A (with Real or Complex eltype), if a real matrix square root or real matrix logarithm exists, then it is computed using entirely real arithmetic. If the eltype of A is Complex, then a complex result is returned, where the imaginary part is exactly zero. Previously the function would be computed using complex arithmetic, resulting in imaginary parts that were not exactly zero.
  • log(::UnitUpperTriangular) and log(::UnitLowerTriangular) are now defined (see the sketch after this list).
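
A minimal sketch of the new unit-triangular method (the matrix is an arbitrary example; the wrapper type of the result is not shown here):

using LinearAlgebra

U = UnitUpperTriangular([1.0 0.5; 0.0 1.0])
log(U)  # previously undefined; the result has an exactly zero diagonal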

Issues fixed along the way:

  • For Float32 and ComplexF32 inputs, log always promoted to the Float64 analogs. It no longer does.
  • The inferred return type of log(::Matrix{ComplexF64}) was Any. It is now Matrix{ComplexF64}.
  • For A::UpperTriangular{<:Real} with negative eigenvalues, sqrt(A) returned a complex result, while log(A) would error. log(A) now returns a complex result (see the sketch after this list).
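
A couple of these fixes are easy to check directly (a sketch; the matrices are arbitrary examples):

using LinearAlgebra, Test

@inferred log(randn(ComplexF64, 3, 3))     # now inferred as Matrix{ComplexF64}

A = UpperTriangular([-1.0 1.0; 0.0 -2.0])  # real, with negative eigenvalues
sqrt(A)                                    # complex result, as before
log(A)                                     # previously an error; now a complex result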

Example

Before:

julia> log(exp([1 -2; 1 1]))
2×2 Matrix{ComplexF64}:
 1.0+3.33067e-16im  -2.0+1.66533e-16im
 1.0-3.33067e-16im   1.0+1.66533e-16im

This PR:

julia> log(exp([1 -2; 1 1]))
2×2 Matrix{Float64}:
 1.0  -2.0
 1.0   1.0

Basic Benchmarks

Here are some quick and dirty benchmarks on my (very old) machine.

Matrix square root

using LinearAlgebra, BenchmarkTools

n = 20
Artor = randn(n, n)^2;
Actor = complex(Artor);
Artoc = randn(n, n);
Actoc = randn(ComplexF64, n, n);

println("Dense matrices:")
for (name, mat) in (("ℝ → ℝ", Artor), ("ℂ → ℝ", Actor), ("ℝ → ℂ", Artoc), ("ℂ → ℂ", Actoc))
    @assert sqrt(mat)^2 ≈ mat
    println(name)
    @btime sqrt($mat);
end

Urtor = UpperTriangular(randn(n, n))^2;
Uctor = complex(Urtor);
Urtoc = UpperTriangular(randn(n, n));
Uctoc = UpperTriangular(randn(ComplexF64, n, n));

println("\nUpper triangular matrices:")
for (name, mat) in (("ℝ → ℝ", Urtor), ("ℂ → ℝ", Uctor), ("ℝ → ℂ", Urtoc), ("ℂ → ℂ", Uctoc))
    @assert sqrt(mat)^2 ≈ mat
    println(name)
    @btime sqrt($mat);
end

Before:

Dense matrices:
ℝ → ℝ  266.140 μs (10 allocations: 105.59 KiB)
ℂ → ℝ  265.059 μs (9 allocations: 99.22 KiB)
ℝ → ℂ  310.661 μs (10 allocations: 105.59 KiB)
ℂ → ℂ  320.669 μs (9 allocations: 99.22 KiB)

Upper triangular matrices:
ℝ → ℝ  3.123 μs (2 allocations: 3.27 KiB)
ℂ → ℝ  7.821 μs (1 allocation: 6.38 KiB)
ℝ → ℂ  6.819 μs (2 allocations: 6.39 KiB)
ℂ → ℂ  6.669 μs (1 allocation: 6.38 KiB)

This PR:

Dense matrices:
ℝ → ℝ  110.562 μs (120 allocations: 58.70 KiB)
ℂ → ℝ  111.417 μs (122 allocations: 68.33 KiB)
ℝ → ℂ  154.114 μs (22 allocations: 159.44 KiB)
ℂ → ℂ  321.852 μs (10 allocations: 99.23 KiB)

Upper triangular matrices:
ℝ → ℝ  3.072 μs (2 allocations: 3.27 KiB)
ℂ → ℝ  5.062 μs (4 allocations: 12.89 KiB)
ℝ → ℂ  5.771 μs (2 allocations: 6.39 KiB)
ℂ → ℂ  5.580 μs (2 allocations: 6.39 KiB)

Matrix logarithm

using LinearAlgebra, BenchmarkTools
n = 20
Artor = exp(randn(n, n));
Actor = complex(Artor);
Artoc = randn(n, n);
Actoc = randn(ComplexF64, n, n);

Urtor = UpperTriangular(exp(triu(randn(n, n))));
Uctor = complex(Urtor);
Urtoc = UpperTriangular(randn(n, n));
Uctoc = UpperTriangular(randn(ComplexF64, n, n));

println("Dense matrices:")
for (name, mat) in (("ℝ → ℝ", Artor), ("ℂ → ℝ", Actor), ("ℝ → ℂ", Artoc), ("ℂ → ℂ", Actoc))
    @assert exp(log(mat)) ≈ mat
    println(name)
    @btime log($mat);
end

println("\nUpper triangular matrices:")
for (name, mat) in (("ℝ → ℝ", Urtor), ("ℂ → ℝ", Uctor), ("ℝ → ℂ", Urtoc), ("ℂ → ℂ", Uctoc))
    println(name)
    try
        @assert exp(Matrix(log(mat))) ≈ mat
        @btime log($mat);
    catch
        println("  Error")
    end
end

Before:

Dense matrices:
ℝ → ℝ  335.860 μs (92 allocations: 490.84 KiB)
ℂ → ℝ  342.864 μs (102 allocations: 494.31 KiB)
ℝ → ℂ  417.632 μs (97 allocations: 495.61 KiB)
ℂ → ℂ  560.571 μs (85 allocations: 423.34 KiB)

Upper triangular matrices:
ℝ → ℝ  103.731 μs (123 allocations: 169.84 KiB)
ℂ → ℝ  190.273 μs (70 allocations: 325.05 KiB)
ℝ → ℂ  Error
ℂ → ℂ  352.777 μs (101 allocations: 464.41 KiB)

This PR:

Dense matrices:
ℝ → ℝ  254.740 μs (506 allocations: 273.17 KiB)
ℂ → ℝ  254.052 μs (508 allocations: 282.80 KiB)
ℝ → ℂ  402.231 μs (103 allocations: 489.34 KiB)
ℂ → ℂ  553.679 μs (87 allocations: 423.36 KiB)

Upper triangular matrices:
ℝ → ℝ  99.269 μs (124 allocations: 170.08 KiB)
ℂ → ℝ  101.995 μs (124 allocations: 176.59 KiB)
ℝ → ℂ  407.174 μs (231 allocations: 556.72 KiB)
ℂ → ℂ  346.416 μs (192 allocations: 459.47 KiB)

sethaxen (Contributor, Author)

I added tests for all branches and substantially reduced the allocations. Notably, the matrix square root can now be computed in-place, and the logarithm now does so internally. Note that even where there are more allocations, the actual memory used is lower.

I also tracked down all of the excess allocations:

  • ℝ → ℝ: All excess allocations are explained by LAPACK.trsyl!, which internally makes 3 allocations totaling 240 bytes each time it is called; I can only account for one of those allocations from the source. The real logarithm consequently has the same excess allocations (a measurement sketch follows this list).
  • ℂ → ℝ: Two additional allocations beyond the above: one for realifying the input and one for complexifying the output.
  • ℝ → ℂ: Previously a single complex Schur decomposition was performed. Now a real Schur decomposition followed by a complex Schur decomposition is performed, which adds ~10 allocations but is still faster.
  • ℂ → ℂ: The increase in allocations in the triangular log seemed to be due to type instability. I separated the type-stable implementation from the type-unstable allocation and constraining of the output, which halved these allocations.
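
For the curious, the trsyl! overhead noted in the first bullet can be observed directly (a sketch; the matrices are arbitrary and the counts vary by machine and Julia version):

using LinearAlgebra

A = schur(randn(4, 4)).T  # trsyl! expects (quasi-)triangular inputs, i.e. Schur factors
B = schur(randn(4, 4)).T
C = randn(4, 4)

LAPACK.trsyl!('N', 'N', A, B, C)             # warm up so compilation isn't counted
@allocated LAPACK.trsyl!('N', 'N', A, B, C)  # the wrapper's internal allocations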

Updated Benchmarks

Benchmark code
using LinearAlgebra, BenchmarkTools, Random
n = 20;
# matrix square root

Random.seed!(1);

Artor = randn(n, n)^2;
Actor = complex(Artor);
Artoc = randn(n, n);
Actoc = randn(ComplexF64, n, n);

Urtor = UpperTriangular(randn(n, n))^2;
Uctor = complex(Urtor);
Urtoc = UpperTriangular(randn(n, n));
Uctoc = UpperTriangular(randn(ComplexF64, n, n));

println("Dense matrices:");
for (name, mat) in (("ℝ → ℝ", Artor), ("ℂ → ℝ", Actor), ("ℝ → ℂ", Artoc), ("ℂ → ℂ", Actoc))
    @assert sqrt(mat)^2 ≈ mat
    println(name)
    @btime sqrt($mat);
end

println("\nUpper triangular matrices:");
for (name, mat) in (("ℝ → ℝ", Urtor), ("ℂ → ℝ", Uctor), ("ℝ → ℂ", Urtoc), ("ℂ → ℂ", Uctoc))
    @assert sqrt(mat)^2 ≈ mat
    println(name)
    @btime sqrt($mat);
end

# matrix logarithm

Random.seed!(1);

Artor = exp(randn(n, n));
Actor = complex(Artor);
Artoc = randn(n, n);
Actoc = randn(ComplexF64, n, n);

Urtor = UpperTriangular(exp(triu(randn(n, n))));
Uctor = complex(Urtor);
Urtoc = UpperTriangular(randn(n, n));
Uctoc = UpperTriangular(randn(ComplexF64, n, n));

println("Dense matrices:");
for (name, mat) in (("ℝ → ℝ", Artor), ("ℂ → ℝ", Actor), ("ℝ → ℂ", Artoc), ("ℂ → ℂ", Actoc))
    @assert exp(log(mat)) ≈ mat
    println(name)
    @btime log($mat);
end

println("\nUpper triangular matrices:");
for (name, mat) in (("ℝ → ℝ", Urtor), ("ℂ → ℝ", Uctor), ("ℝ → ℂ", Urtoc), ("ℂ → ℂ", Uctoc))
    println(name)
    try
        @assert exp(Matrix(log(mat))) ≈ mat
        @btime log($mat);
    catch
        println("  Error")
    end
end

Matrix square root

Before:

Dense matrices:
ℝ → ℝ  285.076 μs (10 allocations: 105.59 KiB)
ℂ → ℝ  283.468 μs (9 allocations: 99.22 KiB)
ℝ → ℂ  322.385 μs (10 allocations: 105.59 KiB)
ℂ → ℂ  318.138 μs (9 allocations: 99.22 KiB)

Upper triangular matrices:
ℝ → ℝ  3.127 μs (2 allocations: 3.27 KiB)
ℂ → ℝ  7.844 μs (1 allocation: 6.38 KiB)
ℝ → ℂ  7.076 μs (2 allocations: 6.39 KiB)
ℂ → ℂ  6.704 μs (1 allocation: 6.38 KiB)

This PR:

Dense matrices:
ℝ → ℝ  125.525 μs (120 allocations: 58.70 KiB)
ℂ → ℝ  129.777 μs (122 allocations: 68.33 KiB)
ℝ → ℂ  146.215 μs (21 allocations: 153.06 KiB)
ℂ → ℂ  317.354 μs (10 allocations: 99.23 KiB)

Upper triangular matrices:
ℝ → ℝ  3.100 μs (2 allocations: 3.27 KiB)
ℂ → ℝ  5.295 μs (4 allocations: 12.89 KiB)
ℝ → ℂ  5.694 μs (2 allocations: 6.39 KiB)
ℂ → ℂ  5.535 μs (2 allocations: 6.39 KiB)

Matrix logarithm

Before:

Dense matrices:
ℝ → ℝ  386.569 μs (103 allocations: 541.89 KiB)
ℂ → ℝ  391.631 μs (113 allocations: 545.36 KiB)
ℝ → ℂ  384.807 μs (94 allocations: 489.84 KiB)
ℂ → ℂ  554.130 μs (85 allocations: 423.34 KiB)

Upper triangular matrices:
ℝ → ℝ  103.988 μs (123 allocations: 169.84 KiB)
ℂ → ℝ  191.811 μs (70 allocations: 325.05 KiB)
ℝ → ℂ  Error
ℂ → ℂ  412.665 μs (113 allocations: 521.83 KiB)

This PR:

Dense matrices:
ℝ → ℝ  266.868 μs (587 allocations: 155.17 KiB)
ℂ → ℝ  267.419 μs (589 allocations: 164.80 KiB)
ℝ → ℂ  356.544 μs (64 allocations: 292.23 KiB)
ℂ → ℂ  528.548 μs (53 allocations: 238.41 KiB)

Upper triangular matrices:
ℝ → ℝ  84.371 μs (43 allocations: 68.55 KiB)
ℂ → ℝ  86.358 μs (43 allocations: 75.06 KiB)
ℝ → ℂ  320.204 μs (71 allocations: 254.30 KiB)
ℂ → ℂ  382.941 μs (79 allocations: 298.72 KiB)

sethaxen (Contributor, Author) commented on the diff, at the lines

return UpperTriangular(R)
return rdiv!(AG, UpperTriangular(B))

A consequence of removing all of these allocations is that the function is less general. E.g. log(UpperTriangular(exp(@MMatrix(randn(2,2))))) worked before this PR; now it no longer does, because rdiv! is not defined for MMatrix inputs (see the sketch below).
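
A minimal reproduction of that regression (assumes StaticArrays is installed; the exact error depends on the StaticArrays version):

using LinearAlgebra, StaticArrays

A = UpperTriangular(exp(@MMatrix randn(2, 2)))
log(A)  # worked before this PR; now throws because rdiv! has no method for MMatrix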

sethaxen (Contributor, Author)

@andreasnoack is there anything else this PR needs?

andreasnoack (Member) commented Mar 19, 2021

The current version is definitely already an improvement, so if you want to stop here, then I think it's okay. However, I think it's a shame that there are so many allocations. Ideally, the LAPACK wrappers would allow reuse of memory as proposed in #16263, and it should be fairly straightforward to modify the trsyl! method to take

scale = Vector{$relty}(undef, 1)
info = Ref{BlasInt}()

as arguments while giving them default values similar to their current values. It should probably be

scale = Ref{$relty}()

though.
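
For context, Ref{T}() creates a single-value container that is read and written with r[] and converts to a pointer at the ccall boundary, which is how the LAPACK wrappers pass scalar output arguments to Fortran. A minimal sketch of the two variants discussed above:

scale_vec = Vector{Float64}(undef, 1)  # current approach: 1-element array, scale_vec[1]
scale_ref = Ref{Float64}()             # suggested: scalar container, scale_ref[]
scale_ref[] = 1.0
scale_ref[]                            # 1.0

Both convert to Ptr{Float64} when passed to LAPACK; as noted below, the Ref variant removed about half of the allocations.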

sethaxen (Contributor, Author)

> It should probably be
>
> scale = Ref{$relty}()
>
> though

It looks like this change alone eliminated ~half of the allocations. The remaining ones come from checksquare(A, B), which oddly allocates a vector containing the sizes of A and B instead of returning a 2-tuple (this is the documented behavior; see the sketch below). Since it does nothing fancier than essentially call checksquare on each argument, I've changed trsyl! to just do that. I did not need to modify the signature of trsyl!. With these changes, _sqrt_quasitriu! is completely non-allocating.
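
The documented behavior in question, for reference (a sketch):

using LinearAlgebra: checksquare

A = randn(3, 3); B = randn(4, 4);
checksquare(A)     # returns 3 (an Int); no allocation
checksquare(A, B)  # returns [3, 4] (a Vector{Int}); allocates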

I also eliminated excess allocations when computing the degree of the Padé approximant; the same 3 allocations are now reused across all steps. This also cut ~1/3 of _log_quasitriu!'s allocations.

Updated benchmarks

Matrix square root

All remaining allocations come from copies or the Schur decomposition.

Dense matrices:
ℝ → ℝ  115.852 μs (12 allocations: 50.83 KiB)
ℂ → ℝ  117.443 μs (14 allocations: 60.45 KiB)
ℝ → ℂ  145.638 μs (21 allocations: 153.06 KiB)
ℂ → ℂ  316.831 μs (10 allocations: 99.23 KiB)

Upper triangular matrices:
ℝ → ℝ  3.050 μs (2 allocations: 3.27 KiB)
ℂ → ℝ  5.179 μs (4 allocations: 12.89 KiB)
ℝ → ℂ  5.641 μs (2 allocations: 6.39 KiB)
ℂ → ℂ  5.529 μs (2 allocations: 6.39 KiB)

Matrix logarithm

Dense matrices:
ℝ → ℝ  220.330 μs (42 allocations: 96.14 KiB)
ℂ → ℝ  221.420 μs (44 allocations: 105.77 KiB)
ℝ → ℂ  330.091 μs (48 allocations: 228.39 KiB)
ℂ → ℂ  501.972 μs (37 allocations: 174.56 KiB)

Upper triangular matrices:
ℝ → ℝ  76.920 μs (32 allocations: 48.75 KiB)
ℂ → ℝ  78.498 μs (32 allocations: 55.27 KiB)
ℝ → ℂ  249.985 μs (31 allocations: 81.97 KiB)
ℂ → ℂ  295.925 μs (29 allocations: 81.72 KiB)

sethaxen (Contributor, Author) commented on Mar 19, 2021, on the diff lines

p = 0
m = 0

# Compute repeated roots
d = complex(diag(A))
# Find s0, the smallest s such that the ρ(triu(A)^(1/2^s) - I) ≤ theta[tmax], where ρ(X)

This maybe should be ρ(A^(1/2^s) - I), but the algorithm in the paper says to use the approach in Algorithm 4.1, which is for upper triangular A and uses only the diagonal (see the sketch below).
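
For reference, a sketch of the diagonal-only estimate that Algorithm 4.1 uses (s and theta follow the surrounding code; the matrix is an arbitrary example):

using LinearAlgebra

A = UpperTriangular(exp(triu(randn(4, 4))))
s = 2
d = complex(diag(A))               # the eigenvalues of a triangular matrix
maximum(abs, d .^ (1 / 2^s) .- 1)  # for truly triangular A this equals ρ(A^(1/2^s) - I)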

andreasnoack (Member)

> It looks like this change alone eliminated ~half of the allocations.

Great. This is of course a much cleaner solution. I need to update my mental model of the compiler to something post Julia 0.5.

> The remaining ones come from checksquare(A, B), which oddly allocates a vector

Yes, that's unfortunate. See #28897.

ElOceanografo pushed a commit to ElOceanografo/julia that referenced this pull request May 4, 2021
Compute real matrix logarithm and matrix square root using real arithmetic (JuliaLang#39973)

* Add failing test

* Add sylvester methods for small matrices

* Add 2x2 real matrix square root

* Add real square root of quasitriangular matrix

* Simplify 2x2 real square root

* Rename functions to use quasitriu

* Avoid NaNs when eigenvalues are all zero

* Reuse ranges

* Add clarifying comments

* Unify real and complex matrix square root

* Add reference for real sqrt

* Move quasitriu auxiliary functions to triangular.jl

* Ensure loops are type-stable and use simd

* Remove duplicate computation

* Correctly promote for dimensionful A

* Use simd directive

* Test that UpperTriangular is returned by sqrt

* Test sqrt for UnitUpperTriangular

* Test that return type is complex when input type is

* Test that output is complex when input is

* Add failing test

* Separate type-stable from type-unstable part

* Use generic sqrt_quasitriu for sqrt triu

* Avoid redundant matmul

* Clarify comment

* Return complex output for complex input

* Call log_quasitriu

* Add failing test for log type-inferrability

* Realify or complexify as necessary

* Call sqrt_quasitriu directly

* Refactor sqrt_diag!

* Simplify utility function

* Add comment

* Compute accurate block-diagonal

* Compute superdiagonal for quasi triu A0

* Compute accurate block superdiagonal

* Avoid full LU decomposition in inner loop

* Avoid promotion to improve type-stability

* Modify return type if necessary

* Clarify comment

* Add comments

* Call log_quasitriu on quasitriu matrices

* Document quasi-triangular algorithm

* Remove test

This matrix has eigenvalues so close to zero that its eltype is not stable

* Rearrange definition

* Add compatibility for unit triangular matrices

* Release constraints on tests

* Separate copying of A from log computation

* Revert "Separate copying of A from log computation"

This reverts commit 23becc5.

* Use Givens rotations

* Compute Schur in-place when possible

* Always allocate a copy

* Fix block indexing

* Compute sqrt in-place

* Overwrite AmI

* Reduce allocations in Pade approximation

* Use T

* Don't unnecessarily unwrap

* Test remaining log branches

* Add additional matrix square root tests

* Separate type-unstable from type-stable part

This substantially reduces allocations for some reason

* Use Ref instead of a Vector

* Eliminate allocation in checksquare

* Refactor param choosing code to own function

* Comment section

* Use more descriptive variable name

* Reuse temporaries

* Add reference

* More accurately describe condition
antoine-levitt pushed a commit to antoine-levitt/julia that referenced this pull request May 9, 2021
Compute real matrix logarithm and matrix square root using real arithmetic (JuliaLang#39973)
Labels: linear algebra