
Compute real matrix logarithm and matrix square root using real arithmetic #39973

Merged (70 commits) on Mar 20, 2021

Conversation

sethaxen (Contributor)

As discussed on Slack (cc @andreasnoack and @baggepinnen), this PR implements the real matrix square root and the real matrix logarithm, following these papers, respectively:

  • Higham, Nicholas J. (1987). Computing real square roots of a real matrix. Linear Algebra and its Applications, 88–89, 405–430. eprint
  • Al-Mohy, Awad H., Higham, Nicholas J., and Relton, Samuel D. (2013). Computing the Fréchet Derivative of the Matrix Logarithm and Estimating the Condition Number. SIAM J. Sci. Comput., 35(4), C394–C410. ISSN 1095-7197. eprint

These algorithms are both recommended in:

  • Higham, Nicholas J. and Hopkins, Edvin (2020). A Catalogue of Software for Matrix Functions, Version 3.0. eprint

This was proposed in #5840 and discussed for sqrt in #4006.

The result is that whenever a real logarithm or square root exists, it is returned. This both speeds up the algorithms (see benchmarks below) and prevents error from accumulating in the imaginary parts of the outputs. To avoid heavy code duplication, the PR refactors the log and sqrt of upper triangular matrices to more generally compute log and sqrt of upper quasi-triangular matrices.
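
For reference, the quasi-triangular form here is the real Schur form: upper triangular except for 2×2 diagonal blocks corresponding to complex-conjugate eigenvalue pairs. A quick illustration (not part of the PR; the matrix is an arbitrary example):

using LinearAlgebra

A = [1.0 -2.0; 1.0 1.0]  # real matrix with complex-conjugate eigenvalues 1 ± √2*im
S = schur(A)             # real Schur decomposition A = Z*T*Z'
S.T                      # quasi-upper-triangular factor: a single 2×2 diagonal block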

The changes:

  • Previously a complex result was always returned for sqrt(A::Matrix{<:Real}) and log(A::Matrix{<:Real}) unless A was symmetric with positive eigenvalues, in which case a real result was returned. Now a real result is returned whenever the eigenvalues are positive.
  • For a real matrix A (with Real or Complex eltype), if a real matrix square root or real matrix logarithm exists, then it is computed using entirely real arithmetic. If the eltype of A is Complex, then a complex result is returned, where the imaginary part is exactly zero. Previously the function would be computed using complex arithmetic, resulting in imaginary parts that were not exactly zero.
  • log(::UnitUpperTriangular) and log(::UnitLowerTriangular) are now defined (see the sketch after this list).
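
A minimal sketch of the new unit-triangular method (the matrix is an arbitrary example; the wrapper type of the result is not shown here):

using LinearAlgebra

U = UnitUpperTriangular([1.0 0.5; 0.0 1.0])
log(U)  # previously undefined; the result has an exactly zero diagonal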

Issues fixed along the way:

  • For Float32 and ComplexF32 inputs, log always promoted to the Float64 analogs. It no longer does.
  • The inferred return type of log(::Matrix{ComplexF64}) was Any. It is now Matrix{ComplexF64}.
  • For A::UpperTriangular{<:Real} with negative eigenvalues, sqrt(A) returned a complex result, while log(A) would error. log(A) now returns a complex result (see the sketch after this list).
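
A couple of these fixes are easy to check directly (a sketch; the matrices are arbitrary examples):

using LinearAlgebra, Test

@inferred log(randn(ComplexF64, 3, 3))     # now inferred as Matrix{ComplexF64}

A = UpperTriangular([-1.0 1.0; 0.0 -2.0])  # real, with negative eigenvalues
sqrt(A)                                    # complex result, as before
log(A)                                     # previously an error; now a complex result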

Example

Before:

julia> log(exp([1 -2; 1 1]))
2×2 Matrix{ComplexF64}:
 1.0+3.33067e-16im  -2.0+1.66533e-16im
 1.0-3.33067e-16im   1.0+1.66533e-16im

This PR:

julia> log(exp([1 -2; 1 1]))
2×2 Matrix{Float64}:
 1.0  -2.0
 1.0   1.0

Basic Benchmarks

Here are some quick and dirty benchmarks on my (very old) machine.

Matrix square root

using LinearAlgebra, BenchmarkTools

n = 20
Artor = randn(n, n)^2;
Actor = complex(Artor);
Artoc = randn(n, n);
Actoc = randn(ComplexF64, n, n);

println("Dense matrices:")
for (name, mat) in (("ℝ → ℝ", Artor), ("ℂ → ℝ", Actor), ("ℝ → ℂ", Artoc), ("ℂ → ℂ", Actoc))
    @assert sqrt(mat)^2 ≈ mat
    println(name)
    @btime sqrt($mat);
end

Urtor = UpperTriangular(randn(n, n))^2;
Uctor = complex(Urtor);
Urtoc = UpperTriangular(randn(n, n));
Uctoc = UpperTriangular(randn(ComplexF64, n, n));

println("\nUpper triangular matrices:")
for (name, mat) in (("ℝ → ℝ", Urtor), ("ℂ → ℝ", Uctor), ("ℝ → ℂ", Urtoc), ("ℂ → ℂ", Uctoc))
    @assert sqrt(mat)^2 ≈ mat
    println(name)
    @btime sqrt($mat);
end

Before:

Dense matrices:
ℝ → ℝ  266.140 μs (10 allocations: 105.59 KiB)
ℂ → ℝ  265.059 μs (9 allocations: 99.22 KiB)
ℝ → ℂ  310.661 μs (10 allocations: 105.59 KiB)
ℂ → ℂ  320.669 μs (9 allocations: 99.22 KiB)

Upper triangular matrices:
ℝ → ℝ  3.123 μs (2 allocations: 3.27 KiB)
ℂ → ℝ  7.821 μs (1 allocation: 6.38 KiB)
ℝ → ℂ  6.819 μs (2 allocations: 6.39 KiB)
ℂ → ℂ  6.669 μs (1 allocation: 6.38 KiB)

This PR:

Dense matrices:
ℝ → ℝ  110.562 μs (120 allocations: 58.70 KiB)
ℂ → ℝ  111.417 μs (122 allocations: 68.33 KiB)
ℝ → ℂ  154.114 μs (22 allocations: 159.44 KiB)
ℂ → ℂ  321.852 μs (10 allocations: 99.23 KiB)

Upper triangular matrices:
ℝ → ℝ  3.072 μs (2 allocations: 3.27 KiB)
ℂ → ℝ  5.062 μs (4 allocations: 12.89 KiB)
ℝ → ℂ  5.771 μs (2 allocations: 6.39 KiB)
ℂ → ℂ  5.580 μs (2 allocations: 6.39 KiB)

Matrix logarithm

using LinearAlgebra, BenchmarkTools
n = 20
Artor = exp(randn(n, n));
Actor = complex(Artor);
Artoc = randn(n, n);
Actoc = randn(ComplexF64, n, n);

Urtor = UpperTriangular(exp(triu(randn(n, n))));
Uctor = complex(Urtor);
Urtoc = UpperTriangular(randn(n, n));
Uctoc = UpperTriangular(randn(ComplexF64, n, n));

println("Dense matrices:")
for (name, mat) in (("ℝ → ℝ", Artor), ("ℂ → ℝ", Actor), ("ℝ → ℂ", Artoc), ("ℂ → ℂ", Actoc))
    @assert exp(log(mat)) ≈ mat
    println(name)
    @btime log($mat);
end

println("\nUpper triangular matrices:")
for (name, mat) in (("ℝ → ℝ", Urtor), ("ℂ → ℝ", Uctor), ("ℝ → ℂ", Urtoc), ("ℂ → ℂ", Uctoc))
    println(name)
    try
        @assert exp(Matrix(log(mat))) ≈ mat
        @btime log($mat);
    catch
        println("  Error")
    end
end

Before:

Dense matrices:
ℝ → ℝ  335.860 μs (92 allocations: 490.84 KiB)
ℂ → ℝ  342.864 μs (102 allocations: 494.31 KiB)
ℝ → ℂ  417.632 μs (97 allocations: 495.61 KiB)
ℂ → ℂ  560.571 μs (85 allocations: 423.34 KiB)

Upper triangular matrices:
ℝ → ℝ  103.731 μs (123 allocations: 169.84 KiB)
ℂ → ℝ  190.273 μs (70 allocations: 325.05 KiB)
ℝ → ℂ  Error
ℂ → ℂ  352.777 μs (101 allocations: 464.41 KiB)

This PR:

Dense matrices:
ℝ → ℝ  254.740 μs (506 allocations: 273.17 KiB)
ℂ → ℝ  254.052 μs (508 allocations: 282.80 KiB)
ℝ → ℂ  402.231 μs (103 allocations: 489.34 KiB)
ℂ → ℂ  553.679 μs (87 allocations: 423.36 KiB)

Upper triangular matrices:
ℝ → ℝ  99.269 μs (124 allocations: 170.08 KiB)
ℂ → ℝ  101.995 μs (124 allocations: 176.59 KiB)
ℝ → ℂ  407.174 μs (231 allocations: 556.72 KiB)
ℂ → ℂ  346.416 μs (192 allocations: 459.47 KiB)

sethaxen (Contributor, Author)

I added tests for all branches and substantially reduced the allocations. Notably, the matrix square root can now be computed in-place, and the logarithm now does so internally. Note that even where there are more allocations, the actual memory used is lower.

I also tracked down all of the excess allocations:

  • ℝ → ℝ: All excess allocations are explained by LAPACK.trsyl!, which internally makes 3 allocations totaling 240 bytes each time it is called; I can only account for one of those allocations from the source. The real logarithm consequently has the same excess allocations (a measurement sketch follows this list).
  • ℂ → ℝ: Two additional allocations beyond the above: one for realifying the input and one for complexifying the output.
  • ℝ → ℂ: Previously a single complex Schur decomposition was performed. Now a real Schur decomposition followed by a complex Schur decomposition is performed, which adds ~10 allocations but is still faster.
  • ℂ → ℂ: The increase in allocations in the triangular log seemed to be due to type instability. I separated the type-stable implementation from the type-unstable allocation and constraining of the output, which halved these allocations.
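
For the curious, the trsyl! overhead noted in the first bullet can be observed directly (a sketch; the matrices are arbitrary and the counts vary by machine and Julia version):

using LinearAlgebra

A = schur(randn(4, 4)).T  # trsyl! expects (quasi-)triangular inputs, i.e. Schur factors
B = schur(randn(4, 4)).T
C = randn(4, 4)

LAPACK.trsyl!('N', 'N', A, B, C)             # warm up so compilation isn't counted
@allocated LAPACK.trsyl!('N', 'N', A, B, C)  # the wrapper's internal allocations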

Updated Benchmarks

Benchmark code
using LinearAlgebra, BenchmarkTools, Random
n = 20;
# matrix square root

Random.seed!(1);

Artor = randn(n, n)^2;
Actor = complex(Artor);
Artoc = randn(n, n);
Actoc = randn(ComplexF64, n, n);

Urtor = UpperTriangular(randn(n, n))^2;
Uctor = complex(Urtor);
Urtoc = UpperTriangular(randn(n, n));
Uctoc = UpperTriangular(randn(ComplexF64, n, n));

println("Dense matrices:");
for (name, mat) in (("ℝ → ℝ", Artor), ("ℂ → ℝ", Actor), ("ℝ → ℂ", Artoc), ("ℂ → ℂ", Actoc))
    @assert sqrt(mat)^2 ≈ mat
    println(name)
    @btime sqrt($mat);
end

println("\nUpper triangular matrices:");
for (name, mat) in (("ℝ → ℝ", Urtor), ("ℂ → ℝ", Uctor), ("ℝ → ℂ", Urtoc), ("ℂ → ℂ", Uctoc))
    @assert sqrt(mat)^2 ≈ mat
    println(name)
    @btime sqrt($mat);
end

# matrix logarithm

Random.seed!(1);

Artor = exp(randn(n, n));
Actor = complex(Artor);
Artoc = randn(n, n);
Actoc = randn(ComplexF64, n, n);

Urtor = UpperTriangular(exp(triu(randn(n, n))));
Uctor = complex(Urtor);
Urtoc = UpperTriangular(randn(n, n));
Uctoc = UpperTriangular(randn(ComplexF64, n, n));

println("Dense matrices:");
for (name, mat) in (("ℝ → ℝ", Artor), ("ℂ → ℝ", Actor), ("ℝ → ℂ", Artoc), ("ℂ → ℂ", Actoc))
    @assert exp(log(mat)) ≈ mat
    println(name)
    @btime log($mat);
end

println("\nUpper triangular matrices:");
for (name, mat) in (("ℝ → ℝ", Urtor), ("ℂ → ℝ", Uctor), ("ℝ → ℂ", Urtoc), ("ℂ → ℂ", Uctoc))
    println(name)
    try
        @assert exp(Matrix(log(mat))) ≈ mat
        @btime log($mat);
    catch
        println("  Error")
    end
end

Matrix square root

Before:

Dense matrices:
ℝ → ℝ  285.076 μs (10 allocations: 105.59 KiB)
ℂ → ℝ  283.468 μs (9 allocations: 99.22 KiB)
ℝ → ℂ  322.385 μs (10 allocations: 105.59 KiB)
ℂ → ℂ  318.138 μs (9 allocations: 99.22 KiB)

Upper triangular matrices:
ℝ → ℝ  3.127 μs (2 allocations: 3.27 KiB)
ℂ → ℝ  7.844 μs (1 allocation: 6.38 KiB)
ℝ → ℂ  7.076 μs (2 allocations: 6.39 KiB)
ℂ → ℂ  6.704 μs (1 allocation: 6.38 KiB)

This PR:

Dense matrices:
ℝ → ℝ  125.525 μs (120 allocations: 58.70 KiB)
ℂ → ℝ  129.777 μs (122 allocations: 68.33 KiB)
ℝ → ℂ  146.215 μs (21 allocations: 153.06 KiB)
ℂ → ℂ  317.354 μs (10 allocations: 99.23 KiB)

Upper triangular matrices:
ℝ → ℝ  3.100 μs (2 allocations: 3.27 KiB)
ℂ → ℝ  5.295 μs (4 allocations: 12.89 KiB)
ℝ → ℂ  5.694 μs (2 allocations: 6.39 KiB)
ℂ → ℂ  5.535 μs (2 allocations: 6.39 KiB)

Matrix logarithm

Before:

Dense matrices:
ℝ → ℝ  386.569 μs (103 allocations: 541.89 KiB)
ℂ → ℝ  391.631 μs (113 allocations: 545.36 KiB)
ℝ → ℂ  384.807 μs (94 allocations: 489.84 KiB)
ℂ → ℂ  554.130 μs (85 allocations: 423.34 KiB)

Upper triangular matrices:
ℝ → ℝ  103.988 μs (123 allocations: 169.84 KiB)
ℂ → ℝ  191.811 μs (70 allocations: 325.05 KiB)
ℝ → ℂ  Error
ℂ → ℂ  412.665 μs (113 allocations: 521.83 KiB)

This PR:

Dense matrices:
ℝ → ℝ  266.868 μs (587 allocations: 155.17 KiB)
ℂ → ℝ  267.419 μs (589 allocations: 164.80 KiB)
ℝ → ℂ  356.544 μs (64 allocations: 292.23 KiB)
ℂ → ℂ  528.548 μs (53 allocations: 238.41 KiB)

Upper triangular matrices:
ℝ → ℝ  84.371 μs (43 allocations: 68.55 KiB)
ℂ → ℝ  86.358 μs (43 allocations: 75.06 KiB)
ℝ → ℂ  320.204 μs (71 allocations: 254.30 KiB)
ℂ → ℂ  382.941 μs (79 allocations: 298.72 KiB)

sethaxen (Contributor, Author) commented on the diff, at the lines

return UpperTriangular(R)
return rdiv!(AG, UpperTriangular(B))

A consequence of removing all of these allocations is that the function is less general. E.g. log(UpperTriangular(exp(@MMatrix(randn(2,2))))) worked before this PR; now it no longer does, because rdiv! is not defined for MMatrix inputs (see the sketch below).
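
A minimal reproduction of that regression (assumes StaticArrays is installed; the exact error depends on the StaticArrays version):

using LinearAlgebra, StaticArrays

A = UpperTriangular(exp(@MMatrix randn(2, 2)))
log(A)  # worked before this PR; now throws because rdiv! has no method for MMatrix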

sethaxen (Contributor, Author)

@andreasnoack is there anything else this PR needs?

andreasnoack (Member) commented Mar 19, 2021

The current version is definitely already an improvement, so if you want to stop here, then I think it's okay. However, I think it's a shame that there are so many allocations. Ideally, the LAPACK wrappers would allow reuse of memory as proposed in #16263, and it should be fairly straightforward to modify the trsyl! method to take

scale = Vector{$relty}(undef, 1)
info = Ref{BlasInt}()

as arguments while giving them default values similar to their current values. It should probably be

scale = Ref{$relty}()

though.
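
For context, Ref{T}() creates a single-value container that is read and written with r[] and converts to a pointer at the ccall boundary, which is how the LAPACK wrappers pass scalar output arguments to Fortran. A minimal sketch of the two variants discussed above:

scale_vec = Vector{Float64}(undef, 1)  # current approach: 1-element array, scale_vec[1]
scale_ref = Ref{Float64}()             # suggested: scalar container, scale_ref[]
scale_ref[] = 1.0
scale_ref[]                            # 1.0

Both convert to Ptr{Float64} when passed to LAPACK; as noted below, the Ref variant removed about half of the allocations.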

sethaxen (Contributor, Author)

> It should probably be
>
> scale = Ref{$relty}()
>
> though

It looks like this change alone eliminated ~half of the allocations. The remaining ones come from checksquare(A, B), which oddly allocates a vector containing the sizes of A and B instead of returning a 2-tuple (this is the documented behavior; see the sketch below). Since it does nothing fancier than essentially call checksquare on each argument, I've changed trsyl! to just do that. I did not need to modify the signature of trsyl!. With these changes, _sqrt_quasitriu! is completely non-allocating.
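
The documented behavior in question, for reference (a sketch):

using LinearAlgebra: checksquare

A = randn(3, 3); B = randn(4, 4);
checksquare(A)     # returns 3 (an Int); no allocation
checksquare(A, B)  # returns [3, 4] (a Vector{Int}); allocates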

I also eliminated excess allocations when computing the degree of the Padé approximant; the same 3 allocations are now reused across all steps. This also cut ~1/3 of _log_quasitriu!'s allocations.

Updated benchmarks

Matrix square root

All remaining allocations come from copies or the Schur decomposition.

Dense matrices:
ℝ → ℝ  115.852 μs (12 allocations: 50.83 KiB)
ℂ → ℝ  117.443 μs (14 allocations: 60.45 KiB)
ℝ → ℂ  145.638 μs (21 allocations: 153.06 KiB)
ℂ → ℂ  316.831 μs (10 allocations: 99.23 KiB)

Upper triangular matrices:
ℝ → ℝ  3.050 μs (2 allocations: 3.27 KiB)
ℂ → ℝ  5.179 μs (4 allocations: 12.89 KiB)
ℝ → ℂ  5.641 μs (2 allocations: 6.39 KiB)
ℂ → ℂ  5.529 μs (2 allocations: 6.39 KiB)

Matrix logarithm

Dense matrices:
ℝ → ℝ  220.330 μs (42 allocations: 96.14 KiB)
ℂ → ℝ  221.420 μs (44 allocations: 105.77 KiB)
ℝ → ℂ  330.091 μs (48 allocations: 228.39 KiB)
ℂ → ℂ  501.972 μs (37 allocations: 174.56 KiB)

Upper triangular matrices:
ℝ → ℝ  76.920 μs (32 allocations: 48.75 KiB)
ℂ → ℝ  78.498 μs (32 allocations: 55.27 KiB)
ℝ → ℂ  249.985 μs (31 allocations: 81.97 KiB)
ℂ → ℂ  295.925 μs (29 allocations: 81.72 KiB)

sethaxen (Contributor, Author) commented on Mar 19, 2021, on the diff lines

p = 0
m = 0

# Compute repeated roots
d = complex(diag(A))
# Find s0, the smallest s such that the ρ(triu(A)^(1/2^s) - I) ≤ theta[tmax], where ρ(X)

This maybe should be ρ(A^(1/2^s) - I), but the algorithm in the paper says to use the approach in Algorithm 4.1, which is for upper triangular A and uses only the diagonal (see the sketch below).
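
For reference, a sketch of the diagonal-only estimate that Algorithm 4.1 uses (s and theta follow the surrounding code; the matrix is an arbitrary example):

using LinearAlgebra

A = UpperTriangular(exp(triu(randn(4, 4))))
s = 2
d = complex(diag(A))               # the eigenvalues of a triangular matrix
maximum(abs, d .^ (1 / 2^s) .- 1)  # for truly triangular A this equals ρ(A^(1/2^s) - I)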

andreasnoack (Member)

> It looks like this change alone eliminated ~half of the allocations.

Great. This is of course a much cleaner solution. I need to update my mental model of the compiler to something post Julia 0.5.

> The remaining ones come from checksquare(A, B), which oddly allocates a vector

Yes, that's unfortunate. See #28897.

ElOceanografo pushed a commit to ElOceanografo/julia that referenced this pull request May 4, 2021
Compute real matrix logarithm and matrix square root using real arithmetic (JuliaLang#39973)

* Add failing test

* Add sylvester methods for small matrices

* Add 2x2 real matrix square root

* Add real square root of quasitriangular matrix

* Simplify 2x2 real square root

* Rename functions to use quasitriu

* Avoid NaNs when eigenvalues are all zero

* Reuse ranges

* Add clarifying comments

* Unify real and complex matrix square root

* Add reference for real sqrt

* Move quasitriu auxiliary functions to triangular.jl

* Ensure loops are type-stable and use simd

* Remove duplicate computation

* Correctly promote for dimensionful A

* Use simd directive

* Test that UpperTriangular is returned by sqrt

* Test sqrt for UnitUpperTriangular

* Test that return type is complex when input type is

* Test that output is complex when input is

* Add failing test

* Separate type-stable from type-unstable part

* Use generic sqrt_quasitriu for sqrt triu

* Avoid redundant matmul

* Clarify comment

* Return complex output for complex input

* Call log_quasitriu

* Add failing test for log type-inferrability

* Realify or complexify as necessary

* Call sqrt_quasitriu directly

* Refactor sqrt_diag!

* Simplify utility function

* Add comment

* Compute accurate block-diagonal

* Compute superdiagonal for quasi triu A0

* Compute accurate block superdiagonal

* Avoid full LU decomposition in inner loop

* Avoid promotion to improve type-stability

* Modify return type if necessary

* Clarify comment

* Add comments

* Call log_quasitriu on quasitriu matrices

* Document quasi-triangular algorithm

* Remove test

This matrix has eigenvalues so close to zero that its eltype is not stable

* Rearrange definition

* Add compatibility for unit triangular matrices

* Release constraints on tests

* Separate copying of A from log computation

* Revert "Separate copying of A from log computation"

This reverts commit 23becc5.

* Use Givens rotations

* Compute Schur in-place when possible

* Always allocate a copy

* Fix block indexing

* Compute sqrt in-place

* Overwrite AmI

* Reduce allocations in Pade approximation

* Use T

* Don't unnecessarily unwrap

* Test remaining log branches

* Add additional matrix square root tests

* Separate type-unstable from type-stable part

This substantially reduces allocations for some reason

* Use Ref instead of a Vector

* Eliminate allocation in checksquare

* Refactor param choosing code to own function

* Comment section

* Use more descriptive variable name

* Reuse temporaries

* Add reference

* More accurately describe condition
antoine-levitt pushed a commit to antoine-levitt/julia that referenced this pull request May 9, 2021
Compute real matrix logarithm and matrix square root using real arithmetic (JuliaLang#39973)
Labels: linear algebra