Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Inv for matrice bigger than 1000 #92

Open
mfauvel opened this issue Jun 21, 2024 · 3 comments
Open

[Bug] Inv for matrice bigger than 1000 #92

mfauvel opened this issue Jun 21, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@mfauvel
Copy link

mfauvel commented Jun 21, 2024

🐛 Bug

There is an issue with inv function for matrices bigger than 1000. For size larger or equal to 1001, the function hangs for ever and memory issues happends for size larger than 1300 (on my laptop).

To reproduce

** Code snippet to reproduce **

for n_samples in [999, 1000, 1300, 2000]:
    # Generate dummy data
    A = torch.rand(n_samples, n_samples)
    A = (A.T + A) / 2
    Ac = Dense(A)

    # Solve with torch
    ts = time.time()
    inv_A = torch.linalg.inv(A)
    proc_time_torch = time.time() - ts

    # Solve with cola
    ts = time.time()
    inv_Ac = inv(Ac).to_dense()
    proc_time_cola = time.time() - ts

    if torch.allclose(inv_A, inv_Ac):
        print(
            f"Processing time for n_samples {n_samples} with "
            f"torch and cola: {proc_time_torch} _ {proc_time_cola}"
        )
    else:
        print(f"inv does not match for n_samples: {n_samples}")

** Stack trace/error message **

Processing time for n_samples 999 with torch and cola: 0.03584551811218262 _ 0.03255629539489746
Processing time for n_samples 1000 with torch and cola: 0.022265911102294922 _ 0.02802872657775879
Process stopped

Expected Behavior

I would say the process should not hang.

System information

Please complete the following information:

  • Cola version: 0.0.5
  • Torch version: 2.0.1+cu117
  • OS: Debian GNU/Linux 12 (bookworm)

Additional context

It seems related to #41

I encounter the problem first with Sparse matrix, but the problem is not related to Sparse since I can reproduce the error with Dense matrix.

@mfauvel mfauvel added the bug Something isn't working label Jun 21, 2024
@mfauvel
Copy link
Author

mfauvel commented Jun 21, 2024

I try another check, using solverather than inv. I don't have the memory issue and the crash, but the results are not accurate for dim higher than 1000 and the processing time is really more important with cola than torch.

After checking how the inverse algorithm is selected (

def inv(A: LinearOperator, alg: Auto):
) I force the alg to be LU and now everything is fine.

import time
import torch
import cola
from cola.ops import Sparse, Dense, Identity
from cola.linalg import solve
from cola.linalg.inverse.cg import CG
from cola.linalg.decompositions.decompositions import LU
print(cola.__version__)
print(torch.__version__)

for n_samples in [999, 1000, 1001, 1300, 2000]:
    # Generate dummy data
    A = torch.rand(n_samples, n_samples, dtype=torch.float64)
    b = torch.rand(n_samples, 1, dtype=torch.float64)
    A = (A.T + A) / 2
    Ac = Dense(A)

    # Solve with torch
    ts = time.time()
    x = torch.linalg.solve(A, b)
    proc_time_torch = time.time() - ts

    # Solve with cola
    ts = time.time()
    xc = solve(Ac, b, alg=LU()).to_dense()
    proc_time_cola = time.time() - ts

    print(
        f"Processing time for n_samples {n_samples} with "
        f"torch and cola: {proc_time_torch} _ {proc_time_cola}"
    )
    print(f"inv does not match for n_samples: {n_samples}")
    print(f"Residual for torch: {torch.mean((A @ x -b)**2)}")
    print(f"Residual for cola: {torch.mean((A @ xc-b)**2)}")

which leads the following outputs:

0.0.5
2.0.1+cu117
Processing time for n_samples 999 with torch and cola: 0.009793996810913086 _ 0.02040863037109375
inv does not match for n_samples: 999
Residual for torch: 7.200492875557123e-25
Residual for cola: 2.4516683688616998e-25
Processing time for n_samples 1000 with torch and cola: 0.008498907089233398 _ 0.016102075576782227
inv does not match for n_samples: 1000
Residual for torch: 1.4491477357523913e-26
Residual for cola: 3.9325588238346365e-27
Processing time for n_samples 1001 with torch and cola: 0.009015798568725586 _ 0.01718282699584961
inv does not match for n_samples: 1001
Residual for torch: 4.434132455473258e-26
Residual for cola: 8.911297474077102e-27
Processing time for n_samples 1300 with torch and cola: 0.015895366668701172 _ 0.026092052459716797
inv does not match for n_samples: 1300
Residual for torch: 7.919092478759107e-25
Residual for cola: 1.8887265316154846e-25
Processing time for n_samples 2000 with torch and cola: 0.04905390739440918 _ 0.06647753715515137
inv does not match for n_samples: 2000
Residual for torch: 1.8726508482806117e-25
Residual for cola: 4.7504923991884954e-26

So i would say something is wrong with GMRES ?

@mfinzi
Copy link
Collaborator

mfinzi commented Jul 9, 2024

Hi @mfauvel,

Sorry for the late reply. I think I can see what is going on in your test here.

I believe the key problem is from calling to_dense() on inv(A), which implicitly is a wrapper on performing linear solves iteratively with GMRES (or CG in the symmetric case). When you call to_dense the iterative algorithm is run on the massive RHS being the (n x n) identity matrix, defeating the purpose of using an iterative algorithm in the first place, requiring a massive memory allocation and a very long runtime. The function hanging is likely due more to memory than compute, and we can look at adding an additional batching loop to to_dense() to make sure that even in unlikely cases such as these, the computation will eventually terminate. However, one should be very careful with using to_dense() (often to be used just for debugging purposes), as it will require forming the entire matrices in memory.

Instead, the intended use case of inv is to multiply it with a vector (n, ) or skinny matrix (n,k) with k<<n: inv(A)@b (a sometimes more convenient way writing cola.linalg.solve(A,b)). When to_dense() is called instead you get a multiply with a (n,n) matrix. We will try to make this more evident in the documentation.

The reason why this only happens at n>1000 is that the cola.Auto algorithm switches from dense methods (such as LU) to iterative methods (such as GMRES) at n=1000.

Hope this answers your question.

@mfauvel
Copy link
Author

mfauvel commented Jul 10, 2024

Thanks @mfinzi for your answer. I think it explained my initial comment. However, in my second comment (#92 (comment)), I effectively used cola.linalg.solve and as you said i did not face memory issues anymore. But the numerical precision is not accurate if I did not impose LU algorithm:

for n_samples in [999, 1000, 1001, 1300, 2000]:
    # Generate dummy data
    A = torch.rand(n_samples, n_samples, dtype=torch.float64)
    b = torch.rand(n_samples, 1, dtype=torch.float64)
    # A = (A.T + A) / 2
    Ac = Dense(A)

    # Solve with torch
    ts = time.time()
    x = torch.linalg.solve(A, b)
    proc_time_torch = time.time() - ts

    # Solve with cola
    ts = time.time()
    # xc = solve(Ac, b, alg=LU()).to_dense()
    xc = solve(Ac, b).to_dense()
    proc_time_cola = time.time() - ts

    print(
        f"Processing time for n_samples {n_samples} with "
        f"torch and cola: {proc_time_torch} _ {proc_time_cola}"
    )
    print(f"inv does not match for n_samples: {n_samples}")
    print(f"Residual for torch: {torch.mean((A @ x -b)**2)}")
    print(f"Residual for cola: {torch.mean((A @ xc-b)**2)}")
Processing time for n_samples 999 with torch and cola: 0.015241384506225586 _ 0.025005340576171875
inv does not match for n_samples: 999
Residual for torch: 3.769979443344865e-26
Residual for cola: 4.5632820650815264e-26
Processing time for n_samples 1000 with torch and cola: 0.01306605339050293 _ 0.023195981979370117
inv does not match for n_samples: 1000
Residual for torch: 6.740551482619981e-27
Residual for cola: 6.062314891002271e-27
Processing time for n_samples 1001 with torch and cola: 0.011551856994628906 _ 13.982511281967163
inv does not match for n_samples: 1001
Residual for torch: 1.3138882670132605e-26
Residual for cola: 4.7555078392359486e-05
Processing time for n_samples 1300 with torch and cola: 0.02622056007385254 _ 24.619579076766968
inv does not match for n_samples: 1300
Residual for torch: 1.0700557343056313e-26
Residual for cola: 3.9548693482973776
Processing time for n_samples 2000 with torch and cola: 0.04886674880981445 _ 24.41248393058777
inv does not match for n_samples: 2000
Residual for torch: 1.1910256580485765e-25
Residual for cola: 78.85581323368416

I may misused the algorithm GMRES ?

Anyway thanks a lot for the Cola package, it simplifies a lot my work !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants