
magma: unbreak for cudaPackages_12 #283640

Merged 1 commit into NixOS:master on Jan 25, 2024

Conversation

SomeoneSerge
Contributor

#269639 seems to have broken magma (and torch, and what not); #281656 (comment) suggests a hot fix and it seems to work. CC @samuela @dmayle


Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.


Hotfix based on the suggestion from NixOS#281656 (comment)
@SomeoneSerge SomeoneSerge added the 6.topic: cuda (Parallel computing platform and API) label Jan 25, 2024
Comment on lines +145 to +146

```nix
] ++ lists.optionals (cudaPackages.cudaAtLeast "12.0.0") [
  (lib.cmakeBool "USE_FORTRAN" false)
```
Member

any reason not to do

```nix
(lib.cmakeBool "USE_FORTRAN" !(cudaPackages.cudaAtLeast "12.0.0"))
```

or something similar?
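As written, that snippet doesn't quite parse as Nix: a unary `!` expression can't appear bare in function-argument position, so the unconditional variant needs an extra pair of parentheses. A minimal sketch:

```nix
cmakeFlags = [
  # Unconditional variant: Fortran stays enabled for CUDA < 12 and is
  # disabled for CUDA >= 12. The parentheses around the `!` expression
  # are required for it to parse as a function argument.
  (lib.cmakeBool "USE_FORTRAN" (!cudaPackages.cudaAtLeast "12.0.0"))
];
```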

SomeoneSerge (Contributor, Author)

Didn't test how the option affects CUDA 11 (it's somewhat late in the night...); this way we only specify it for CUDA 12.

@ofborg ofborg bot requested a review from ConnorBaker January 25, 2024 02:15
@ofborg ofborg bot added the 10.rebuild-darwin: 0 (This PR does not cause any packages to rebuild on Darwin) and 10.rebuild-linux: 11-100 labels Jan 25, 2024
@ConnorBaker
Contributor

Is there a downside to taking the approach of specifying the Fortran name-mangling convention?

Also, is there a downside to disabling Fortran for Magma?
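For illustration, a minimal sketch of what specifying the convention might look like, assuming magma's CMake consults a `FORTRAN_CONVENTION` cache variable when Fortran is disabled (the variable name is an assumption here and should be verified against magma's CMakeLists.txt):

```nix
cmakeFlags = [
  # Keep Fortran off, but pin the name-mangling scheme the BLAS/LAPACK
  # interfaces expect; "-DADD_" (append an underscore) is the usual
  # gfortran convention. FORTRAN_CONVENTION is assumed, not confirmed.
  (lib.cmakeBool "USE_FORTRAN" false)
  (lib.cmakeFeature "FORTRAN_CONVENTION" "-DADD_")
];
```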

@ConnorBaker
Contributor

Well, nix-cuda-test built for both CUDA 11 and 12; I'm running the small ViT model to see whether there's any sort of horrible regression caused by disabling Fortran.

@ConnorBaker
Contributor

No crazy differences, so I think it's good to go as a hot fix. Let's get it merged!

$ ./pr-283640-cuda-11/bin/nix-cuda-test
Seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /home/connorbaker/nix-cuda-test/lightning_logs
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to data/cifar-10-python.tar.gz
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 170498071/170498071 [00:02<00:00, 78881307.89it/s]
Extracting data/cifar-10-python.tar.gz to data
Files already downloaded and verified
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type             | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0     
1 | model     | ViT              | 86.3 M
-----------------------------------------------
86.3 M    Trainable params
0         Non-trainable params
86.3 M    Total params
345.317   Total estimated model params size (MB)
Epoch 9: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 781/781 [01:28<00:00,  8.80it/s, v_num=0, train_loss=2.350, val_loss=2.330]
`Trainer.fit` stopped: `max_epochs=10` reached.
Epoch 9: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 781/781 [01:30<00:00,  8.62it/s, v_num=0, train_loss=2.350, val_loss=2.330]
$ ./pr-283640-cuda-12/bin/nix-cuda-test
Seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Files already downloaded and verified
Files already downloaded and verified
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type             | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0     
1 | model     | ViT              | 86.3 M
-----------------------------------------------
86.3 M    Trainable params
0         Non-trainable params
86.3 M    Total params
345.317   Total estimated model params size (MB)
Epoch 9: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 781/781 [01:26<00:00,  9.04it/s, v_num=1, train_loss=2.350, val_loss=2.330]
`Trainer.fit` stopped: `max_epochs=10` reached.
Epoch 9: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 781/781 [01:28<00:00,  8.86it/s, v_num=1, train_loss=2.350, val_loss=2.330]

@ConnorBaker ConnorBaker merged commit eea875a into NixOS:master Jan 25, 2024
26 checks passed
@delroth delroth added the 12.approvals: 2 (This PR was reviewed and approved by two reputable people) label Jan 25, 2024
@samuela (Member) commented Jan 25, 2024

@ConnorBaker AFAIK the effective linear algebra backend is selected via the torch.backends.cuda.preferred_linalg_library function in PyTorch. Options are "cusolver" and "magma", with the default being a heuristic choice made by PyTorch. So if cuSOLVER is available in your environment, it's possible that magma is not used at all.

Also, IIUC, magma/cuSOLVER are only utilized for PyTorch functions related to matrix inversion, QR/Cholesky/LU decomposition, and SVD. A ViT training loop will not include any of these AFAIK.

tl;dr: I'm a big fan of running benchmarks, but I'm concerned that we may not have benchmarked quite the right thing in this case.

@ConnorBaker
Contributor

Ah, that's a good point. I also like benchmarks, especially when they measure the thing they're supposed to be benchmarking :/

I need to look into adding Magma's test cases as a passthru or something and exposing them separately.

In the meantime, I'm going to take the approach of specifying name-mangling as part of a general update to the magma package.
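Picking up samuela's point above, a hedged sketch of what such a passthru test could look like: it forces the MAGMA backend via torch.backends.cuda.preferred_linalg_library before exercising a decomposition. The runCommand wiring and attribute names are illustrative only, and it assumes a builder that actually exposes a CUDA GPU, which the default Nix sandbox does not:

```nix
passthru.tests.torch-linalg = runCommand "magma-torch-linalg"
  {
    nativeBuildInputs = [ (python3.withPackages (ps: [ ps.torch ])) ];
    # Hypothetical: restrict to builders that expose a CUDA GPU.
    requiredSystemFeatures = [ "cuda" ];
  }
  ''
  python3 -c '
  import torch
  # Force MAGMA so the test exercises it rather than cuSOLVER.
  torch.backends.cuda.preferred_linalg_library("magma")
  a = torch.randn(64, 64, device="cuda")
  torch.linalg.qr(a)  # QR is dispatched to the selected linalg backend
  '
  touch $out
  '';
```

Running the operations samuela listed (matrix inversion, QR/Cholesky/LU decomposition, SVD) against both backends would make a regression in the magma path directly visible.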

@samuela (Member) commented Jan 25, 2024

Yeah, sorry to be the parade-rainer here! But it's not a big deal ultimately... working magma > broken magma
