M1/M2: Large matrix multiplications can contain NaNs #381

chengchingwen · 2024-07-04T05:15:39Z

MWE:

julia> a = Metal.randn(10000, 10000);

julia> b = Metal.randn(10000, 10000);

julia> c = a * b';

julia> for i in 1:10
           C = Metal.zeros(Float32, size(a))
           mul!(C, a, b')
           @assert C ≈ c "$i"
       end
ERROR: AssertionError: 1
Stacktrace:
 [1] top-level scope
   @ ./REPL[58]:4

julia> for i in 1:10
           C = Metal.zeros(Float32, size(a))
           mul!(C, a, b')
           @assert C ≈ c "$i"
       end
ERROR: AssertionError: 8
Stacktrace:
 [1] top-level scope
   @ ./REPL[58]:4

julia> for i in 1:10
           @assert a * b' ≈ c "$i"
       end
ERROR: AssertionError: 3
Stacktrace:
 [1] top-level scope
   @ ./REPL[59]:2

julia> for i in 1:10
           @assert a * b' ≈ c "$i"
       end
ERROR: AssertionError: 8
Stacktrace:
 [1] top-level scope
   @ ./REPL[59]:2

chengchingwen · 2024-07-04T06:23:13Z

Adding wait_completed on matmul!'s command buffer does not help.

christiangnrd · 2024-07-04T12:41:46Z

Adding Metal.@sync to the mul! also does not help. ~~However, I cannot reproduce when calling MPS.matmul! directly.~~

maleadt · 2024-07-05T13:49:33Z

I cannot reproduce at all on Metal.jl#master using an M3 Pro, but it does seem reproducible on an M1 Pro.

I wonder if this is a problem with mapreduce, since you're calling isapprox on GPU arrays. Can you test if calling @assert Array(C) ≈ Array(c) makes things pass? It does here, at least.

tgymnich · 2024-07-05T13:52:16Z

I can reproduce the issue on M1 master. It also looks like all the tasks run on the same queue.

chengchingwen · 2024-07-05T14:10:03Z

The issue was found on an M2 Max. The MWE only fails if the arrays are large enough. It seems to launch the subsequent kernel before the matmul has finished. Is it possible that mapreduce is not checking the availability of the input arrays?

p.s. I'm about to board the plane to JuliaCon so I won't be able to test it soon.

maleadt · 2024-07-05T14:21:05Z

> I wonder if this is a problem with mapreduce, since you're calling isapprox on GPU arrays. Can you test if calling @assert Array(C) ≈ Array(c) makes things pass? It does here, at least.

It also reproduces when comparing on the CPU, just much less likely, so this isn't a mapreduce issue.

maleadt · 2024-07-05T14:24:25Z

Looks like a bunch of NaNs in the second matrix.

christiangnrd · 2024-07-05T14:24:43Z

My current MWE is:

using Metal, LinearAlgebra; begin
    n = 10000
    a = mtl(randn(Float32,n,n))
    b = mtl(randn(Float32,n,n))
    C = Metal.zeros(Float32, size(a))
    for i in 1:10
        C = Metal.zeros(Float32, size(a))
        mul!(C,a,b)
        @assert !any(isnan.(C)) "$i"
    end
end

I define C outside the loop to access it afterwards. When I had C .= ... in the loop instead of C = ..., it only ever happened at iteration 1. I suspect it has to do with the location of the array in memory.

maleadt · 2024-07-05T14:44:41Z

> I cannot reproduce when calling MPS.matmul! directly

I can:

using Metal, LinearAlgebra

function main(T=Float32, N=10000)
    a = Metal.rand(T, N, N)
    b = Metal.rand(T, N, N)
    c = a * b'
    synchronize()

    for i in 1:100
        println("Iteration $i")
        d = Metal.zeros(T, size(a))
        MPS.matmul!(d, a, b, #=alpha=#true, #=beta=#false,
                    #=transpose_a=#false, #=transpose_b=#true)
        @assert !any(isnan.(Array(d))) "NaN in iteration $i"

        # XXX: this redundant check is needed, or the failure never occurs
        @assert !any(isnan.(d))
    end
end

isinteractive() || main()

The need for a secondary kernel is very weird.

tgymnich · 2024-07-05T14:58:43Z

It is ~~not~~ MPS related:

for i in 1:10
    C = Metal.zeros(Float32, size(a))
    GPUArrays.generic_matmatmul!(C, a, b, MulAddMul())
    @assert C ≈ c "$i"
end

maleadt · 2024-07-05T15:13:53Z

> GPUArrays.generic_matmatmul!(C, a, b, MulAddMul())

I don't see how that's related; it's an entirely different kernel. Does it contain NaNs in similar places?
The generic matmatmul kernel, while being extraordinarily slow, doesn't introduce NaNs here.
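
One way to answer the "similar places" question is to summarize where the NaNs fall in each result. The following is only an illustrative Python sketch, not code from this thread: `summarize_nans` is a hypothetical helper, and it assumes the result has already been copied to the CPU as a flat, row-major list of floats.

```python
import math

def summarize_nans(values, n_cols):
    """Summarize NaN positions in a flat, row-major buffer (assumed layout).

    Returns (count, rows, cols): the number of NaNs and the sorted row and
    column indices that contain at least one NaN.
    """
    count = 0
    rows, cols = set(), set()
    for idx, value in enumerate(values):
        if math.isnan(value):
            count += 1
            rows.add(idx // n_cols)
            cols.add(idx % n_cols)
    return count, sorted(rows), sorted(cols)

# Toy 3x4 result with two NaNs in row 1 (hypothetical data)
nan = float("nan")
result = [1.0, 1.0, 1.0, 1.0,
          1.0, nan, nan, 1.0,
          1.0, 1.0, 1.0, 1.0]
count, rows, cols = summarize_nans(result, n_cols=4)
print(count, rows, cols)  # prints: 2 [1] [1, 2]
```

Comparing the returned row and column sets between two kernels (or two runs) would show whether the NaNs cluster in the same region. Julia arrays are column-major, so the index arithmetic would need to be swapped there.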

tgymnich · 2024-07-05T15:17:20Z

Just wanted to confirm that it's MPS rather than the synchronisation between kernel launches.

tgymnich · 2024-07-05T15:18:54Z

I've been seeing the NaN issues with large arrays for a long time in #145

MLX seems fine:

import mlx.core as mx

a = mx.random.normal((10000, 10000))
b = mx.random.normal((10000, 10000))
c = a @ b.T  # reference result

# Recompute the product and compare against the reference each time
for _ in range(10):
    C = a @ b.T
    assert mx.allclose(C, c)

christiangnrd · 2024-07-12T21:07:04Z

I would love for someone to review my code because I'm not a Swift expert by any means, but I was able to reproduce this in the Swift REPL.

Swift MWE


import Metal 
import MetalPerformanceShaders
 
func main(T: Float.Type = Float32.self, N: Int = 10000) { 
    guard let device = MTLCreateSystemDefaultDevice(), 
          let commandQueue = device.makeCommandQueue() else { 
        fatalError("Metal device or command queue could not be created") 
          } 
     
    print("Initializing a & b") 
    // Fill the NxN matrices with ones
    var a = [Float](repeating: 1, count: N * N) 
    var b = [Float](repeating: 1, count: N * N) 
 
    print("a and b created\n") 
    // Metal buffers for matrices 
    let aBuffer = device.makeBuffer(bytes: &a, length: MemoryLayout<Float>.size * N * N, options: []) 
    let bBuffer = device.makeBuffer(bytes: &b, length: MemoryLayout<Float>.size * N * N, options: []) 
 
    print("Starting matmul\n") 
    for i in 1...10 { 
        print(i) 
        print("\n") 
        // Create MPSMatrices 
        let aMatrixDescriptor = MPSMatrixDescriptor(rows: N, columns: N, rowBytes: MemoryLayout<Float>.size * N, dataType: .float32) 
        let bMatrixDescriptor = MPSMatrixDescriptor(rows: N, columns: N, rowBytes: MemoryLayout<Float>.size * N, dataType: .float32) 
 
 
        let aMatrix = MPSMatrix(buffer: aBuffer!, descriptor: aMatrixDescriptor) 
        let bMatrix = MPSMatrix(buffer: bBuffer!, descriptor: bMatrixDescriptor) 
 
        // Matrix multiplication using MPSMatrixMultiplication 
        let matrixMultiplication = MPSMatrixMultiplication(device: device, 
        transposeLeft: false, 
        transposeRight: false, 
        resultRows: N, 
        resultColumns: N, 
        interiorColumns: N, 
        alpha: 1.0, 
        beta: 0.0) 
        let cBuffer = device.makeBuffer(length: MemoryLayout<Float>.size * N * N, options: []) 
        let cMatrixDescriptor = MPSMatrixDescriptor(rows: N, columns: N, rowBytes: MemoryLayout<Float>.size * N, dataType: .float32) 
        let cMatrix = MPSMatrix(buffer: cBuffer!, descriptor: cMatrixDescriptor) 
         
        let commandBuffer = commandQueue.makeCommandBuffer()! 
        matrixMultiplication.encode(commandBuffer: commandBuffer, 
        leftMatrix: aMatrix, 
        rightMatrix: bMatrix, 
        resultMatrix: cMatrix) 
        commandBuffer.commit() 
        commandBuffer.waitUntilCompleted() 

        // Check for NaNs in the result matrix 
        let cPointer = cBuffer!.contents().bindMemory(to: Float.self, capacity: N * N) 
        var j = 0
        while j < N*N {
            if cPointer[j].isNaN {
                fatalError("NaN in iteration \(i)")
            }
            j += 1
        }
    } 
}
 
Output:
Initializing a & b
a and b created

Starting matmul

1


2


3


4


__lldb_expr_3/repl.swift:56: Fatal error: NaN in iteration 4
2024-07-12 17:58:38.583349-0300 repl_swift[1500:21665] __lldb_expr_3/repl.swift:56: Fatal error: NaN in iteration 4
Execution interrupted. Enter code to recover and continue.
Enter LLDB commands to investigate (type :help for assistance.)

tgymnich · 2024-07-12T21:17:42Z

@christiangnrd Your Swift code looks good to me. It turns out MLX doesn’t even use MPS.

maleadt · 2024-07-13T10:24:36Z

Haven't been able to look into this, but here's the ObjC version:

#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#import <MetalPerformanceShaders/MetalPerformanceShaders.h>

void performMatrixMultiplication(NSInteger N) {
    if (N == 0) {
        N = 10000;
    }
    
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    id<MTLCommandQueue> commandQueue = [device newCommandQueue];
    
    if (!device || !commandQueue) {
        NSLog(@"Metal device or command queue could not be created");
        return;
    }
    
    NSLog(@"Initializing a & b");
    // Allocate the NxN matrices and fill them with ones
    float *a = calloc(N * N, sizeof(float));
    float *b = calloc(N * N, sizeof(float));
    
    for (NSInteger i = 0; i < N * N; i++) {
        a[i] = 1.0f;
        b[i] = 1.0f;
    }
    
    NSLog(@"a and b created\n");
    // Metal buffers for matrices
    id<MTLBuffer> aBuffer = [device newBufferWithBytes:a length:sizeof(float) * N * N options:MTLResourceStorageModeShared];
    id<MTLBuffer> bBuffer = [device newBufferWithBytes:b length:sizeof(float) * N * N options:MTLResourceStorageModeShared];
    
    NSLog(@"Starting matmul\n");
    for (NSInteger i = 1; i <= 10; i++) {
        NSLog(@"%ld\n", (long)i);
        
        // Create MPSMatrices
        MPSMatrixDescriptor *aMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                       columns:N
                                                                                      rowBytes:sizeof(float) * N
                                                                                      dataType:MPSDataTypeFloat32];
        MPSMatrixDescriptor *bMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                       columns:N
                                                                                      rowBytes:sizeof(float) * N
                                                                                      dataType:MPSDataTypeFloat32];
        
        MPSMatrix *aMatrix = [[MPSMatrix alloc] initWithBuffer:aBuffer descriptor:aMatrixDescriptor];
        MPSMatrix *bMatrix = [[MPSMatrix alloc] initWithBuffer:bBuffer descriptor:bMatrixDescriptor];
        
        // Matrix multiplication using MPSMatrixMultiplication
        MPSMatrixMultiplication *matrixMultiplication = [[MPSMatrixMultiplication alloc] initWithDevice:device
                                                                                          transposeLeft:NO
                                                                                         transposeRight:NO
                                                                                             resultRows:N
                                                                                          resultColumns:N
                                                                                       interiorColumns:N
                                                                                                 alpha:1.0
                                                                                                  beta:0.0];
        
        id<MTLBuffer> cBuffer = [device newBufferWithLength:sizeof(float) * N * N options:MTLResourceStorageModeShared];
        MPSMatrixDescriptor *cMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                       columns:N
                                                                                      rowBytes:sizeof(float) * N
                                                                                      dataType:MPSDataTypeFloat32];
        MPSMatrix *cMatrix = [[MPSMatrix alloc] initWithBuffer:cBuffer descriptor:cMatrixDescriptor];
        
        id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
        [matrixMultiplication encodeToCommandBuffer:commandBuffer
                                         leftMatrix:aMatrix
                                        rightMatrix:bMatrix
                                       resultMatrix:cMatrix];
        [commandBuffer commit];
        [commandBuffer waitUntilCompleted];
        
        // Check for NaNs in the result matrix
        float *cPointer = cBuffer.contents;
        for (NSInteger j = 0; j < N * N; j++) {
            if (isnan(cPointer[j])) {
                NSLog(@"NaN in iteration %ld", (long)i);
                free(a);
                free(b);
                return;
            }
        }
    }
    
    free(a);
    free(b);
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        NSInteger N = 10000;
        if (argc > 1) {
            N = atoi(argv[1]);
        }
        performMatrixMultiplication(N);
    }
    return 0;
}

❯ clang mps.m -o mps -framework Foundation -framework Metal -framework MetalPerformanceShaders -fobjc-arc -mmacosx-version-min=10.13

❯ ./mps
2024-07-13 12:23:11.771 mps[54256:2493528] Initializing a & b
2024-07-13 12:23:11.931 mps[54256:2493528] a and b created
2024-07-13 12:23:12.001 mps[54256:2493528] Starting matmul
2024-07-13 12:23:12.001 mps[54256:2493528] 1
2024-07-13 12:23:13.933 mps[54256:2493528] 2
2024-07-13 12:23:15.477 mps[54256:2493528] 3
2024-07-13 12:23:16.997 mps[54256:2493528] 4
2024-07-13 12:23:18.440 mps[54256:2493528] NaN in iteration 4

tgymnich · 2024-07-13T12:46:51Z

Should we just file a radar / feedback?

maleadt · 2024-07-13T12:59:37Z

I'll have a better look first and forward it to our Apple contact.

maleadt · 2024-08-28T08:22:06Z

Apparently this looks like an ARC bug. Curiously, the ObjC reproducer is "fixed" by adding an @autoreleasepool around the for loop body, but the same doesn't hold in Julia (in fact, the original issue was calling into mul! which is already marked @autoreleasepool).

Of course, the Julia MWE is more complex, as the @assert !any(isnan.(d)) involves two additional kernels...

Still broken Julia MWE

using Metal, LinearAlgebra
using ObjectiveC, .Foundation

function main(T=Float32, N=10000)
    a = Metal.rand(T, N, N)
    b = Metal.rand(T, N, N)
    synchronize()

    for i in 1:100
        @autoreleasepool begin
            println("Iteration $i")
            d = Metal.zeros(T, size(a))
            MPS.matmul!(d, a, b, #=alpha=#true, #=beta=#false,
                        #=transpose_a=#false, #=transpose_b=#false)
            @assert !any(isnan.(Array(d))) "NaN in iteration $i"

            # XXX: this redundant check is needed, or the failure never occurs
            @assert !any(isnan.(d))
        end
    end
end

isinteractive() || main()

"Fixed" ObjC MWE

#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#import <MetalPerformanceShaders/MetalPerformanceShaders.h>

void performMatrixMultiplication(NSInteger N) {
    if (N == 0) {
        N = 10000;
    }

    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    id<MTLCommandQueue> commandQueue = [device newCommandQueue];

    if (!device || !commandQueue) {
        NSLog(@"Metal device or command queue could not be created");
        return;
    }

    NSLog(@"Initializing a & b");
    // Allocate the NxN matrices and fill them with ones
    float *a = calloc(N * N, sizeof(float));
    float *b = calloc(N * N, sizeof(float));

    for (NSInteger i = 0; i < N * N; i++) {
        a[i] = 1.0f;
        b[i] = 1.0f;
    }

    NSLog(@"a and b created\n");
    // Metal buffers for matrices
    id<MTLBuffer> aBuffer = [device newBufferWithBytes:a length:sizeof(float) * N * N options:MTLResourceStorageModeShared];
    id<MTLBuffer> bBuffer = [device newBufferWithBytes:b length:sizeof(float) * N * N options:MTLResourceStorageModeShared];

    NSLog(@"Starting matmul\n");
    for (NSInteger i = 1; i <= 100; i++) {
        @autoreleasepool {
            NSLog(@"Iteration %ld\n", (long)i);

            // Create MPSMatrices
            MPSMatrixDescriptor *aMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                           columns:N
                                                                                          rowBytes:sizeof(float) * N
                                                                                          dataType:MPSDataTypeFloat32];
            MPSMatrixDescriptor *bMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                           columns:N
                                                                                          rowBytes:sizeof(float) * N
                                                                                          dataType:MPSDataTypeFloat32];

            MPSMatrix *aMatrix = [[MPSMatrix alloc] initWithBuffer:aBuffer descriptor:aMatrixDescriptor];
            MPSMatrix *bMatrix = [[MPSMatrix alloc] initWithBuffer:bBuffer descriptor:bMatrixDescriptor];

            // Matrix multiplication using MPSMatrixMultiplication
            MPSMatrixMultiplication *matrixMultiplication = [[MPSMatrixMultiplication alloc] initWithDevice:device
                                                                                              transposeLeft:NO
                                                                                             transposeRight:NO
                                                                                                 resultRows:N
                                                                                              resultColumns:N
                                                                                           interiorColumns:N
                                                                                                     alpha:1.0
                                                                                                      beta:0.0];

            id<MTLBuffer> cBuffer = [device newBufferWithLength:sizeof(float) * N * N options:MTLResourceStorageModeShared];
            MPSMatrixDescriptor *cMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                           columns:N
                                                                                          rowBytes:sizeof(float) * N
                                                                                          dataType:MPSDataTypeFloat32];
            MPSMatrix *cMatrix = [[MPSMatrix alloc] initWithBuffer:cBuffer descriptor:cMatrixDescriptor];

            id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
            [matrixMultiplication encodeToCommandBuffer:commandBuffer
                                             leftMatrix:aMatrix
                                            rightMatrix:bMatrix
                                           resultMatrix:cMatrix];
            [commandBuffer commit];
            [commandBuffer waitUntilCompleted];

            // Check for NaNs in the result matrix
            float *cPointer = cBuffer.contents;
            for (NSInteger j = 0; j < N * N; j++) {
                if (isnan(cPointer[j])) {
                    NSLog(@"NaN in iteration %ld", (long)i);
                    free(a);
                    free(b);
                    return;
                }
            }
        }
    }

    free(a);
    free(b);
}

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        NSInteger N = 10000;
        if (argc > 1) {
            N = atoi(argv[1]);
        }
        performMatrixMultiplication(N);
    }
    return 0;
}

tgymnich · 2024-08-28T15:42:35Z

Couldn't reproduce the Objective-C case today, with or without autoreleasepool.
Swift and Julia were still reproducible.

christiangnrd · 2024-08-28T18:11:25Z

I can reproduce the error in both Swift and ObjectiveC and it goes away when surrounded by an autoreleasepool block in both languages.

tgymnich · 2024-08-28T18:17:20Z

Oops. I just overlooked the second autoreleasepool. The first one is actually not necessary (at least to hide our bug.)

christiangnrd · 2024-08-28T18:21:43Z

By "the first one" do you mean the autoreleasepool in main?

christiangnrd · 2024-08-28T18:51:46Z

I'm able to reproduce this without the second redundant check.

Still broken simpler Julia MWE

using Metal, LinearAlgebra
using ObjectiveC, .Foundation

function main(T=Float32, N=10000)
    a = Metal.rand(T, N, N)
    b = Metal.rand(T, N, N)
    synchronize()

    for i in 1:100
        # @autoreleasepool begin
        begin
            println("Iteration $i")
            d = Metal.zeros(T, size(a))
            MPS.matmul!(d, a, b, #=alpha=#true, #=beta=#false,
                        #=transpose_a=#false, #=transpose_b=#false)
            @assert !any(isnan.(Array(d))) "NaN in iteration $i"
        end
    end
end

isinteractive() || main()

tgymnich · 2024-09-25T18:39:14Z

Our NSAutoreleasePool seems to contain roughly the same objects before the NaN check as the ObjC version above. The most obvious difference is that the correct ObjC version has a CaptureMTLDevice and an AGXG13XFamilyComputeContext, whereas we have an AGXG13XFamilyCommandBuffer (could be debug/Xcode related).

iteration 1
objc[6905]: ##############
objc[6905]: AUTORELEASE POOLS for thread 0x203b9b240
objc[6905]: 77 releases pending.
objc[6905]: [0x14300d000]  ................  PAGE  (hot) (cold)
objc[6905]: [0x14300d038]  ################  POOL 0x14300d038
objc[6905]: [0x14300d040]    0x6000004c4860  __NSSingleEntryDictionaryI
objc[6905]: [0x14300d048]    0x6000027ccd20  NSBundle  autorelease count 2
objc[6905]: [0x14300d050]    0x6000004cfd40  __NSDictionaryM  autorelease count 2
objc[6905]: [0x14300d058]    0x600002fc8690  MTLCommandQueueDescriptorInternal
objc[6905]: [0x14300d060]    0x600000ac0090  NSUserDefaults  autorelease count 4
objc[6905]: [0x14300d068]    0x6000004c4b20  __NSSingleEntryDictionaryI
objc[6905]: [0x14300d070]    0x6000004d4660  __NSSingleEntryDictionaryI
objc[6905]: [0x14300d078]    0x6000004d4220  __NSSingleEntryDictionaryI
objc[6905]: [0x14300d080]    0x6000004d46a0  __NSSingleEntryDictionaryI
objc[6905]: [0x14300d088]  ################  POOL 0x14300d088
objc[6905]: [0x14300d090]    0x6000011ce180  MPSMatrixDescriptor
objc[6905]: [0x14300d098]    0x6000011cde00  MPSMatrixDescriptor
objc[6905]: [0x14300d0a0]       0x145809000  AGXG13XDevice  autorelease count 15
objc[6905]: [0x14300d0a8]       0x144105550  CaptureMTLDevice  autorelease count 4
objc[6905]: [0x14300d0b0]    0x6000011cc540  __NSCFString
objc[6905]: [0x14300d0b8]    0x600002ac8d80  __NSCFString
objc[6905]: [0x14300d0c0]    0x600003dcc540  NSPathStore2
objc[6905]: [0x14300d0c8]    0x6000011cc600  __NSBundleTables  autorelease count 3
objc[6905]: [0x14300d0d0]    0x6000027dc140  NSBundle  autorelease count 2
objc[6905]: [0x14300d0d8]    0x6000027cd0e0  NSBundle
objc[6905]: [0x14300d0e0]    0x6000020cc480  NSURL
objc[6905]: [0x14300d0e8]    0x6000035cc500  __NSCFString
objc[6905]: [0x14300d0f0]    0x6000004dd4e0  NSFileManager
objc[6905]: [0x14300d0f8]    0x6000020cc5a0  NSURL
objc[6905]: [0x14300d100]    0x6000035cc280  __NSCFString
objc[6905]: [0x14300d108]    0x6000035cc6e0  __NSCFString  autorelease count 2
objc[6905]: [0x14300d110]    0x6000004df3e0  NSConcreteData
objc[6905]: [0x14300d118]    0x6000027cd810  Swift.__StringStorage
objc[6905]: [0x14300d120]    0x6000027cd860  Swift.__StringStorage
objc[6905]: [0x14300d128]    0x6000027cd8b0  Swift.__StringStorage
objc[6905]: [0x14300d130]    0x6000027cd900  Swift.__StringStorage
objc[6905]: [0x14300d138]    0x6000027cd950  Swift.__StringStorage
objc[6905]: [0x14300d140]    0x6000027cd9a0  Swift.__StringStorage
objc[6905]: [0x14300d148]    0x6000035cc6e0  __NSCFString  autorelease count 6
objc[6905]: [0x14300d150]    0x6000011d4980  MPSMatrixDescriptor
objc[6905]: [0x14300d158]       0x144105550  CaptureMTLDevice  autorelease count 2
objc[6905]: [0x14300d160]    0x6000036ceeb0  AGXG13XFamilyComputeContext
objc[6905]: [0x14300d168]    0x6000011d4b80  __NSCFString
objc[6905]: [0x14300d170]    0x600002acaf80  __NSCFString
objc[6905]: [0x14300d178]    0x600003dcc2a0  NSPathStore2
objc[6905]: [0x14300d180]    0x6000011cc600  __NSBundleTables  autorelease count 3
objc[6905]: [0x14300d188]    0x6000027cd0e0  NSBundle
objc[6905]: [0x14300d190]    0x6000027dc140  NSBundle
objc[6905]: [0x14300d198]    0x6000027cda90  NSBundle  autorelease count 2
objc[6905]: [0x14300d1a0]    0x6000020cc600  NSURL
objc[6905]: [0x14300d1a8]    0x6000035cc8c0  __NSCFString
objc[6905]: [0x14300d1b0]    0x6000004df980  NSFileManager
objc[6905]: [0x14300d1b8]    0x6000020cc6c0  NSURL
objc[6905]: [0x14300d1c0]    0x6000035ccb40  __NSCFString
objc[6905]: [0x14300d1c8]    0x6000035cc960  __NSCFString  autorelease count 2
objc[6905]: [0x14300d1d0]    0x6000004d1e40  NSConcreteData
objc[6905]: [0x14300d1d8]    0x6000027cdbd0  Swift.__StringStorage
objc[6905]: [0x14300d1e0]    0x6000027cdc20  Swift.__StringStorage
objc[6905]: [0x14300d1e8]    0x6000027cdc70  Swift.__StringStorage
objc[6905]: [0x14300d1f0]    0x6000027cdcc0  Swift.__StringStorage
objc[6905]: [0x14300d1f8]    0x6000027cdd10  Swift.__StringStorage
objc[6905]: [0x14300d200]    0x6000027cdd60  Swift.__StringStorage
objc[6905]: [0x14300d208]    0x6000035cc960  __NSCFString  autorelease count 6
objc[6905]: [0x14300d210]       0x144105550  CaptureMTLDevice  autorelease count 2
objc[6905]: [0x14300d218]    0x600000a80330  __NSArrayM
objc[6905]: [0x14300d220]    0x600000a80360  __NSArrayM
objc[6905]: [0x14300d228]    0x6000004d2f40  __NSCFString
objc[6905]: [0x14300d230]    0x6000004d2ec0  __NSCFString
objc[6905]: [0x14300d238]    0x6000004d2ee0  __NSCFString
objc[6905]: [0x14300d240]    0x6000004d2f00  __NSCFString
objc[6905]: [0x14300d248]    0x6000008e7240  __NSCFString
objc[6905]: [0x14300d250]    0x6000008e7090  __NSCFString
objc[6905]: [0x14300d258]    0x6000008e70c0  __NSCFString
objc[6905]: [0x14300d260]    0x6000008e70f0  __NSCFString
objc[6905]: [0x14300d268]    0x6000004d2f20  __NSCFString
objc[6905]: [0x14300d270]    0x6000004d2ea0  __NSCFString
objc[6905]: [0x14300d278]    0x6000004d2fc0  __NSCFString
objc[6905]: [0x14300d280]    0x6000008e6e20  __NSArrayM
objc[6905]: [0x14300d288]    0x6000004d3140  __NSCFNumber
objc[6905]: [0x14300d290]       0x14304b800  __NSCFString
objc[6905]: [0x14300d298]    0x6000020cc780  MTLComputePipelineReflectionInternal
objc[6905]: ##############
iteration 2
objc[36563]: ##############
objc[36563]: AUTORELEASE POOLS for thread 0x203b9b240
objc[36563]: 16 releases pending.
objc[36563]: [0x14080a000]  ................  PAGE  (hot) (cold)
objc[36563]: [0x14080a038]  ################  POOL 0x14080a038
objc[36563]: [0x14080a040]    0x600001f3c5a0  __NSSingleEntryDictionaryI
objc[36563]: [0x14080a048]    0x600003c202d0  NSBundle  autorelease count 2
objc[36563]: [0x14080a050]    0x600001f2e7a0  __NSDictionaryM  autorelease count 2
objc[36563]: [0x14080a058]    0x60000342c0e0  MTLCommandQueueDescriptorInternal
objc[36563]: [0x14080a060]    0x60000112c2a0  NSUserDefaults  autorelease count 4
objc[36563]: [0x14080a068]    0x600001f3cb00  __NSSingleEntryDictionaryI
objc[36563]: [0x14080a070]    0x600001f3c4e0  __NSSingleEntryDictionaryI
objc[36563]: [0x14080a078]    0x600001f3cac0  __NSSingleEntryDictionaryI
objc[36563]: [0x14080a080]    0x600001f3cae0  __NSSingleEntryDictionaryI
objc[36563]: [0x14080a088]  ################  POOL 0x14080a088
objc[36563]: [0x14080a090]    0x600000aa2740  MPSMatrixDescriptor
objc[36563]: [0x14080a098]    0x600000aa2780  MPSMatrixDescriptor
objc[36563]: [0x14080a0a0]    0x600000a21040  MPSMatrixDescriptor
objc[36563]: [0x14080a0a8]       0x141005410  CaptureMTLDevice  autorelease count 6
objc[36563]: [0x14080a0b0]    0x600002d24510  AGXG13XFamilyComputeContext
objc[36563]: ##############

Iteration 1
objc[6186]: ##############
objc[6186]: AUTORELEASE POOLS for thread 0x203b9b240
objc[6186]: 20 releases pending.
objc[6186]: [0x12e00b000]  ................  PAGE  (hot) (cold)
objc[6186]: [0x12e00b038]       0x12d20c0f0  _NSSwiftProcessInfo
objc[6186]: [0x12e00b040]       0x12d304d20  Swift.__SwiftDeferredNSArray
objc[6186]: [0x12e00b048]       0x12d304f30  __NSCFCharacterSet
objc[6186]: [0x12e00b050]       0x12d3061e0  __NSCFString
objc[6186]: [0x12e00b058]       0x12c64cdf0  __NSCFString
objc[6186]: [0x12e00b060]       0x12c79d6b0  __NSCFString
objc[6186]: [0x12e00b068]  ################  POOL 0x12e00b068
objc[6186]: [0x12e00b070]       0x11c635370  __NSCFString
objc[6186]: [0x12e00b078]       0x141619730  MPSMatrixDescriptor
objc[6186]: [0x12e00b080]       0x1491eabe0  MPSMatrixDescriptor
objc[6186]: [0x12e00b088]       0x14911b550  MPSMatrixDescriptor
objc[6186]: [0x12e00b090]       0x1496f2b40  __NSCFString
objc[6186]: [0x12e00b098]       0x1491a0d80  __NSCFString
objc[6186]: [0x12e00b0a0]       0x13b718b50  __NSBundleTables
objc[6186]: [0x12e00b0a8]       0x12d33d8e0  NSBundle  autorelease count 3
objc[6186]: [0x12e00b0b0]       0x149152250  NSURL
objc[6186]: [0x12e00b0b8]       0x149111be0  __NSCFString
objc[6186]: [0x12e00b0c0]       0x14913f620  AGXG13XFamilyCommandBuffer
objc[6186]: [0x12e00b0c8]       0x14977a970  __NSArrayM
objc[6186]: [0x12e00b0d0]       0x14978b090  __NSArrayM
objc[6186]: ##############
Iteration 2
objc[6186]: ##############
objc[6186]: AUTORELEASE POOLS for thread 0x203b9b240
objc[6186]: 12 releases pending.
objc[6186]: [0x12e00b000]  ................  PAGE  (hot) (cold)
objc[6186]: [0x12e00b038]       0x12d20c0f0  _NSSwiftProcessInfo
objc[6186]: [0x12e00b040]       0x12d304d20  Swift.__SwiftDeferredNSArray
objc[6186]: [0x12e00b048]       0x12d304f30  __NSCFCharacterSet
objc[6186]: [0x12e00b050]       0x12d3061e0  __NSCFString
objc[6186]: [0x12e00b058]       0x12c64cdf0  __NSCFString
objc[6186]: [0x12e00b060]       0x12c79d6b0  __NSCFString
objc[6186]: [0x12e00b068]  ################  POOL 0x12e00b068
objc[6186]: [0x12e00b070]       0x12c7ff7d0  __NSCFString
objc[6186]: [0x12e00b078]       0x13b7c3da0  MPSMatrixDescriptor
objc[6186]: [0x12e00b080]       0x13b7fc3b0  MPSMatrixDescriptor
objc[6186]: [0x12e00b088]       0x13b714930  MPSMatrixDescriptor
objc[6186]: [0x12e00b090]       0x148cb99a0  AGXG13XFamilyCommandBuffer
objc[6186]: ##############

[NSAutoreleasePool showPools]

christiangnrd · 2024-09-26T01:21:34Z

> Apparently this looks like an ARC bug.

Are we using ARC in Julia?

tgymnich · 2024-09-26T07:27:17Z

We don’t use ARC, but the libraries we are using might have been compiled with ARC enabled.

christiangnrd · 2024-09-26T12:03:04Z

When I turned off ARC in XCode for the objc version, even with the @autoreleasepool blocks the NaNs show up.

maleadt · 2024-09-26T13:08:38Z

> When I turned off ARC in XCode for the objc version, even with the @autoreleasepool blocks the NaNs show up.

AFAIU -fobjc-arc makes the compiler automatically insert release/retain/autorelease calls, and doesn't affect how precompiled libraries like MPS may behave.

christiangnrd · 2024-09-28T03:24:35Z

> When I turned off ARC in XCode for the objc version, even with the @autoreleasepool blocks the NaNs show up.
>
> AFAIU -fobjc-arc makes the compiler automatically insert release/retain/autorelease calls, and doesn't affect how precompiled libraries like MPS may behave.

That's my understanding too. However, from what I understand about our implementation of the @autoreleasepool macro, we're using an NSAutoreleasePool object and a [pool release]; statement at the end, which according to the documentation, isn't possible with ARC on. By turning ARC off for the objc version, I was trying to reproduce the conditions of the failing Julia code.

The only thing is that I don't know if this information is actually helpful.

christiangnrd · 2024-12-06T02:02:31Z

I could no longer reproduce after calling GC.gc() after every loop, which led to #490 which was a bit of a disaster, but maybe it'll inspire an actual solution.

christiangnrd · 2024-12-12T21:48:39Z

Autorelease pools only hide the issue. It seems that the NaNs sometimes happen when a buffer overlaps an address that's a multiple of 2 GiB and the buffer length itself is not a power of 2. I haven't figured out the exact conditions (and I probably won't), but I can somewhat reliably predict them.

Not exactly an MWE, but you should be able to copy-paste it into a new Xcode Swift project:

Swift working example:

import Metal
import Foundation
import MetalPerformanceShaders
import QuartzCore

//Borrowed from https://stackoverflow.com/questions/48380937/how-to-check-if-a-number-is-a-power-of-2-in-swift#48381524
func isPowerOfTwo(_ n: Int) -> Bool {
    return (n > 0) && (n & (n - 1) == 0)
}

func willfail(addr:UInt, len:Int, verbose:Bool = false) -> Bool {
    if isPowerOfTwo(len) {
        return false
    }
    let TWOGIB = 2 * UInt(pow(2.0,30))
    let lower = addr - (addr % TWOGIB)
    let upper = lower + TWOGIB
    if verbose{
        print("lower: \(lower), upper: \(upper)")
        
        print("start: \(addr), end  : \(addr+UInt(len))")
    }
    return addr + UInt(len) > upper
}


func performMatrixMultiplication(N: Int = 16384, withARP: Bool = false, n_iter:Int = 20, verbose:Bool=false) {
    guard let device = MTLCreateSystemDefaultDevice(),
          let commandQueue = device.makeCommandQueue() else {
        fatalError("Metal device or command queue could not be created")
          }
    let bytesize = MemoryLayout<Float32>.size * N * N
    
    print("Initializing a & b. N = \(N), autoreleasepools: \(withARP)")
    // Generate random NxN matrices
    var a = [Float](repeating: 1, count: N * N)
    var b = [Float](repeating: 1, count: N * N)
 
    let ashouldfail = willfail(addr:UInt(bitPattern: UnsafeRawPointer(a)), len:bytesize, verbose:verbose)
    let bshouldfail = willfail(addr:UInt(bitPattern: UnsafeRawPointer(b)), len:bytesize, verbose:verbose)
    
    print("a at \(UnsafeRawPointer(a)) should fail: \(ashouldfail)")
    print("b at \(UnsafeRawPointer(b)) should fail: \(bshouldfail)")
    // Metal buffers for matrices
    let aBuffer = device.makeBuffer(bytes: &a, length: bytesize, options: [])
    let bBuffer = device.makeBuffer(bytes: &b, length: bytesize, options: [])
 
    var totalBufferAlloc = 2 * bytesize
    
    print("Starting matmul\n")
    if withARP {
        for i in 1...n_iter {
            autoreleasepool {
                // Create MPSMatrices
                let aMatrixDescriptor = MPSMatrixDescriptor(rows: N, columns: N, rowBytes: MemoryLayout<Float>.size * N, dataType: .float32)
                let bMatrixDescriptor = MPSMatrixDescriptor(rows: N, columns: N, rowBytes: MemoryLayout<Float>.size * N, dataType: .float32)
                
                
                let aMatrix = MPSMatrix(buffer: aBuffer!, descriptor: aMatrixDescriptor)
                let bMatrix = MPSMatrix(buffer: bBuffer!, descriptor: bMatrixDescriptor)
                
                // Matrix multiplication using MPSMatrixMultiplication
                let matrixMultiplication = MPSMatrixMultiplication(device: device,
                                                                   transposeLeft: false,
                                                                   transposeRight: false,
                                                                   resultRows: N,
                                                                   resultColumns: N,
                                                                   interiorColumns: N,
                                                                   alpha: 1.0,
                                                                   beta: 0.0)
                
                
                let cBuffer = device.makeBuffer(length: bytesize, options: [])
                totalBufferAlloc += bytesize
                
                let cMatrixDescriptor = MPSMatrixDescriptor(rows: N, columns: N, rowBytes: MemoryLayout<Float>.size * N, dataType: .float32)
                let cMatrix = MPSMatrix(buffer: cBuffer!, descriptor: cMatrixDescriptor)
                
                let commandBuffer = commandQueue.makeCommandBuffer()!
                matrixMultiplication.encode(commandBuffer: commandBuffer,
                                            leftMatrix: aMatrix,
                                            rightMatrix: bMatrix,
                                            resultMatrix: cMatrix)
                commandBuffer.commit()
                commandBuffer.waitUntilCompleted()
                
                // Check for NaNs in the result matrix
                let cPointer = cBuffer!.contents().bindMemory(to: Float.self, capacity: N * N)
                
                let shouldfail = willfail(addr:UInt(bitPattern: cPointer), len:bytesize, verbose:verbose)
                let preamble = shouldfail ? "As expected," : "Suprisingly,"
                
                let allocingb = (Float64(totalBufferAlloc)/Float64(pow(2.0,30)))

                var j = 0
                while j < N*N {
                    if cPointer[j].isNaN {
                        print("\(preamble) NaN in iteration \(i) with a total buffer size allocated of \(allocingb)GiB. Address: \(cPointer + j)")
                        break
                    }
                    j += 1
                }
                if shouldfail && j == N*N{
                    print("Predicted that iteration \(i) would fail but it did not...")
                }
            }
        }
    } else {
        for i in 1...n_iter {
            // Create MPSMatrices
            let aMatrixDescriptor = MPSMatrixDescriptor(rows: N, columns: N, rowBytes: MemoryLayout<Float>.size * N, dataType: .float32)
            let bMatrixDescriptor = MPSMatrixDescriptor(rows: N, columns: N, rowBytes: MemoryLayout<Float>.size * N, dataType: .float32)
            
            
            let aMatrix = MPSMatrix(buffer: aBuffer!, descriptor: aMatrixDescriptor)
            let bMatrix = MPSMatrix(buffer: bBuffer!, descriptor: bMatrixDescriptor)
            
            // Matrix multiplication using MPSMatrixMultiplication
            let matrixMultiplication = MPSMatrixMultiplication(device: device,
                                                               transposeLeft: false,
                                                               transposeRight: false,
                                                               resultRows: N,
                                                               resultColumns: N,
                                                               interiorColumns: N,
                                                               alpha: 1.0,
                                                               beta: 0.0)
            
            
            let cBuffer = device.makeBuffer(length: bytesize, options: [])
            totalBufferAlloc += bytesize
            
            let cMatrixDescriptor = MPSMatrixDescriptor(rows: N, columns: N, rowBytes: MemoryLayout<Float>.size * N, dataType: .float32)
            let cMatrix = MPSMatrix(buffer: cBuffer!, descriptor: cMatrixDescriptor)
            
            let commandBuffer = commandQueue.makeCommandBuffer()!
            matrixMultiplication.encode(commandBuffer: commandBuffer,
                                        leftMatrix: aMatrix,
                                        rightMatrix: bMatrix,
                                        resultMatrix: cMatrix)
            commandBuffer.commit()
            commandBuffer.waitUntilCompleted()
            
            // Check for NaNs in the result matrix
            let cPointer = cBuffer!.contents().bindMemory(to: Float.self, capacity: N * N)
            
            let shouldfail = willfail(addr:UInt(bitPattern: cPointer), len:bytesize, verbose:verbose)
            let preamble = shouldfail ? "As expected," : "Suprisingly,"
            
            let allocingb = (Float64(totalBufferAlloc)/Float64(pow(2.0,30)))

            var j = 0
            while j < N*N {
                if cPointer[j].isNaN {
                    print("\(preamble) NaN in iteration \(i) with a total buffer size allocated of \(allocingb)GiB. Address: \(cPointer + j)")
                    break
                }
                j += 1
            }
            if shouldfail && j == N*N{
                print("Predicted that iteration \(i) would fail but it did not...")
            }
        }
    }
    print("")
}

performMatrixMultiplication(N:8192, withARP: false) // 256.000 MiB Matrices, multiple of 16384, doesn't fail
performMatrixMultiplication(N:8192, withARP: true)  // 256.000 MiB Matrices, multiple of 16384, doesn't fail
performMatrixMultiplication(N:10000,withARP: false) // 381.470 MiB Matrices, not a multiple of 16384, fails
performMatrixMultiplication(N:10000,withARP: true)    // 381.470 MiB Matrices, not a multiple of 16384, doesn't fail
performMatrixMultiplication(N:16384,withARP: false) // 1024.000 MiB Matrices, multiple of 16384, doesn't fail
performMatrixMultiplication(N:16384,withARP: true)  // 1024.000 MiB Matrices, multiple of 16384, doesn't fail
performMatrixMultiplication(N:16385,withARP: false) // 1024.125 MiB Matrices, not a multiple of 16384, fails
performMatrixMultiplication(N:16385,withARP: true)  // 1024.125 MiB Matrices, not a multiple of 16384, sometimes fails?
performMatrixMultiplication(N:20000,withARP: false) // 1.490 GiB Matrices, not a multiple of 16384, fails
performMatrixMultiplication(N:20000,withARP: true)  // 1.490 GiB Matrices, not a multiple of 16384, fails

// First square Float32 matrix of size > 2 GiB
performMatrixMultiplication(N:23171,withARP: false) // 2.000 GiB Matrices, not a multiple of 16384, fails
performMatrixMultiplication(N:23171,withARP: true)  // 2.000 GiB Matrices, not a multiple of 16384, fails
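
The sizes in the comments above can be double-checked with a few lines of arithmetic. This is only a sketch: `matrixByteSize` is a hypothetical helper that is not part of the reproducer, and it assumes that "multiple of 16384" refers to the buffer's byte size rather than to N.

```swift
// Hypothetical helper (not in the reproducer): bytes in an N x N Float32 matrix.
func matrixByteSize(_ n: Int) -> Int {
    return MemoryLayout<Float32>.size * n * n
}

let mib = 1024.0 * 1024.0

for n in [8192, 10000, 16384, 16385, 20000, 23171] {
    let bytes = matrixByteSize(n)
    // Same power-of-two test as isPowerOfTwo in the program above.
    let isPow2 = bytes > 0 && (bytes & (bytes - 1)) == 0
    // Assumption: "multiple of 16384" is about the byte size.
    let isMultiple = bytes % 16384 == 0
    print("N = \(n): \(Double(bytes) / mib) MiB, power of 2: \(isPow2), multiple of 16384: \(isMultiple)")
}
```

N = 23171 is the smallest N for which the byte size exceeds 2 GiB (2^31 bytes); N = 23170 still fits below it.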

Swift working example output:

Initializing a & b. N = 8192, autoreleasepools: false
a at 0x0000000138000020 should fail: false
b at 0x0000000148004020 should fail: false
Starting matmul


Initializing a & b. N = 8192, autoreleasepools: true
a at 0x0000000138000020 should fail: false
b at 0x0000000148004020 should fail: false
Starting matmul


Initializing a & b. N = 10000, autoreleasepools: false
a at 0x0000000440000020 should fail: false
b at 0x0000000457d7c020 should fail: false
Starting matmul

Suprisingly, NaN in iteration 1 with a total buffer size allocated of 1.1175870895385742GiB. Address: 0x00000004a5045000
Predicted that iteration 5 would fail but it did not...
Suprisingly, NaN in iteration 6 with a total buffer size allocated of 2.9802322387695312GiB. Address: 0x000000052514b000
Predicted that iteration 10 would fail but it did not...
Suprisingly, NaN in iteration 11 with a total buffer size allocated of 4.842877388000488GiB. Address: 0x00000005a4fe0000
Predicted that iteration 15 would fail but it did not...
Suprisingly, NaN in iteration 17 with a total buffer size allocated of 7.078051567077637GiB. Address: 0x0000000625186000

Initializing a & b. N = 10000, autoreleasepools: true
a at 0x0000000440000020 should fail: false
b at 0x0000000457d7c020 should fail: false
Starting matmul

Predicted that iteration 1 would fail but it did not...
Predicted that iteration 3 would fail but it did not...
Predicted that iteration 5 would fail but it did not...
Predicted that iteration 7 would fail but it did not...
Predicted that iteration 9 would fail but it did not...
Predicted that iteration 11 would fail but it did not...
Predicted that iteration 13 would fail but it did not...
Predicted that iteration 15 would fail but it did not...
Predicted that iteration 17 would fail but it did not...
Predicted that iteration 19 would fail but it did not...

Initializing a & b. N = 16384, autoreleasepools: false
a at 0x000000067c3a0020 should fail: false
b at 0x00000006bc3a4020 should fail: false
Starting matmul


Initializing a & b. N = 16384, autoreleasepools: true
a at 0x000000067c3a0020 should fail: false
b at 0x00000006bc3a4020 should fail: false
Starting matmul


Initializing a & b. N = 16385, autoreleasepools: false
a at 0x000000067c3a0020 should fail: true
b at 0x00000006bc3c4020 should fail: true
Starting matmul

As expected, NaN in iteration 2 with a total buffer size allocated of 4.000488296151161GiB. Address: 0x0000000d24ffa300
As expected, NaN in iteration 4 with a total buffer size allocated of 6.000732444226742GiB. Address: 0x0000000da5042300
As expected, NaN in iteration 6 with a total buffer size allocated of 8.000976592302322GiB. Address: 0x0000000e2508a300
As expected, NaN in iteration 8 with a total buffer size allocated of 10.001220740377903GiB. Address: 0x0000000ea50d2300
As expected, NaN in iteration 10 with a total buffer size allocated of 12.001464888453484GiB. Address: 0x0000000f2511a300
As expected, NaN in iteration 12 with a total buffer size allocated of 14.001709036529064GiB. Address: 0x0000000fa5162300
As expected, NaN in iteration 14 with a total buffer size allocated of 16.001953184604645GiB. Address: 0x0000007068c2e300
As expected, NaN in iteration 16 with a total buffer size allocated of 18.002197332680225GiB. Address: 0x00000070e8c76300
As expected, NaN in iteration 18 with a total buffer size allocated of 20.002441480755806GiB. Address: 0x0000007168cbe300
As expected, NaN in iteration 20 with a total buffer size allocated of 22.002685628831387GiB. Address: 0x00000071e8d06300

Initializing a & b. N = 16385, autoreleasepools: true
a at 0x000000067c3a0020 should fail: true
b at 0x00000006bc3c4020 should fail: true
Starting matmul

Predicted that iteration 2 would fail but it did not...
Predicted that iteration 4 would fail but it did not...
Predicted that iteration 6 would fail but it did not...
Predicted that iteration 8 would fail but it did not...
Predicted that iteration 10 would fail but it did not...
Predicted that iteration 12 would fail but it did not...
Predicted that iteration 14 would fail but it did not...
Predicted that iteration 16 would fail but it did not...
Predicted that iteration 18 would fail but it did not...
Predicted that iteration 20 would fail but it did not...

Initializing a & b. N = 20000, autoreleasepools: false
a at 0x000000067c3a0020 should fail: true
b at 0x00000006db984020 should fail: true
Starting matmul

As expected, NaN in iteration 1 with a total buffer size allocated of 4.470348358154297GiB. Address: 0x00000072c7f64000
Suprisingly, NaN in iteration 2 with a total buffer size allocated of 5.9604644775390625GiB. Address: 0x0000007347fbe000
As expected, NaN in iteration 3 with a total buffer size allocated of 7.450580596923828GiB. Address: 0x00000073c8018000
Predicted that iteration 4 would fail but it did not...
As expected, NaN in iteration 5 with a total buffer size allocated of 10.43081283569336GiB. Address: 0x00000074482e6000
Suprisingly, NaN in iteration 6 with a total buffer size allocated of 11.920928955078125GiB. Address: 0x00000074c8340000
As expected, NaN in iteration 7 with a total buffer size allocated of 13.41104507446289GiB. Address: 0x000000754839a000
Predicted that iteration 8 would fail but it did not...
As expected, NaN in iteration 9 with a total buffer size allocated of 16.391277313232422GiB. Address: 0x00000075c8186000
Suprisingly, NaN in iteration 10 with a total buffer size allocated of 17.881393432617188GiB. Address: 0x00000076481e0000
As expected, NaN in iteration 11 with a total buffer size allocated of 19.371509552001953GiB. Address: 0x00000076c823a000
Predicted that iteration 12 would fail but it did not...
As expected, NaN in iteration 13 with a total buffer size allocated of 22.351741790771484GiB. Address: 0x0000007748026000
Suprisingly, NaN in iteration 14 with a total buffer size allocated of 23.84185791015625GiB. Address: 0x00000077c8080000
As expected, NaN in iteration 15 with a total buffer size allocated of 25.331974029541016GiB. Address: 0x00000078480da000
Predicted that iteration 16 would fail but it did not...
As expected, NaN in iteration 17 with a total buffer size allocated of 28.312206268310547GiB. Address: 0x00000078c7ec6000
Suprisingly, NaN in iteration 18 with a total buffer size allocated of 29.802322387695312GiB. Address: 0x0000007947f20000
As expected, NaN in iteration 19 with a total buffer size allocated of 31.292438507080078GiB. Address: 0x00000079c7f7a000
Predicted that iteration 20 would fail but it did not...

Initializing a & b. N = 20000, autoreleasepools: true
a at 0x000000067c3a0020 should fail: true
b at 0x00000006db984020 should fail: true
Starting matmul

As expected, NaN in iteration 1 with a total buffer size allocated of 4.470348358154297GiB. Address: 0x0000007a48248000
Suprisingly, NaN in iteration 2 with a total buffer size allocated of 5.9604644775390625GiB. Address: 0x0000007aa782c000
As expected, NaN in iteration 3 with a total buffer size allocated of 7.450580596923828GiB. Address: 0x0000007a48248000
Suprisingly, NaN in iteration 4 with a total buffer size allocated of 8.940696716308594GiB. Address: 0x0000007aa782c000
As expected, NaN in iteration 5 with a total buffer size allocated of 10.43081283569336GiB. Address: 0x0000007a48248000
Suprisingly, NaN in iteration 6 with a total buffer size allocated of 11.920928955078125GiB. Address: 0x0000007aa782c000
As expected, NaN in iteration 7 with a total buffer size allocated of 13.41104507446289GiB. Address: 0x0000007a48248000
Suprisingly, NaN in iteration 8 with a total buffer size allocated of 14.901161193847656GiB. Address: 0x0000007aa782c000
As expected, NaN in iteration 9 with a total buffer size allocated of 16.391277313232422GiB. Address: 0x0000007a48248000
Suprisingly, NaN in iteration 10 with a total buffer size allocated of 17.881393432617188GiB. Address: 0x0000007aa782c000
As expected, NaN in iteration 11 with a total buffer size allocated of 19.371509552001953GiB. Address: 0x0000007a48248000
Suprisingly, NaN in iteration 12 with a total buffer size allocated of 20.86162567138672GiB. Address: 0x0000007aa782c000
As expected, NaN in iteration 13 with a total buffer size allocated of 22.351741790771484GiB. Address: 0x0000007a48248000
Suprisingly, NaN in iteration 14 with a total buffer size allocated of 23.84185791015625GiB. Address: 0x0000007aa782c000
As expected, NaN in iteration 15 with a total buffer size allocated of 25.331974029541016GiB. Address: 0x0000007a48248000
Suprisingly, NaN in iteration 16 with a total buffer size allocated of 26.82209014892578GiB. Address: 0x0000007aa782c000
As expected, NaN in iteration 17 with a total buffer size allocated of 28.312206268310547GiB. Address: 0x0000007a48248000
Suprisingly, NaN in iteration 18 with a total buffer size allocated of 29.802322387695312GiB. Address: 0x0000007aa782c000
As expected, NaN in iteration 19 with a total buffer size allocated of 31.292438507080078GiB. Address: 0x0000007a48248000
Suprisingly, NaN in iteration 20 with a total buffer size allocated of 32.782554626464844GiB. Address: 0x0000007aa782c000

Initializing a & b. N = 23171, autoreleasepools: false
a at 0x000000067c3a0020 should fail: true
b at 0x0000007200120020 should fail: true
Starting matmul

As expected, NaN in iteration 1 with a total buffer size allocated of 6.000271897763014GiB. Address: 0x0000007b481a3a00
As expected, NaN in iteration 2 with a total buffer size allocated of 8.000362530350685GiB. Address: 0x0000007bc81bba00
As expected, NaN in iteration 3 with a total buffer size allocated of 10.000453162938356GiB. Address: 0x0000007c481d3a00
As expected, NaN in iteration 4 with a total buffer size allocated of 12.000543795526028GiB. Address: 0x0000007cc81eba00
As expected, NaN in iteration 5 with a total buffer size allocated of 14.000634428113699GiB. Address: 0x0000007d48203a00
As expected, NaN in iteration 6 with a total buffer size allocated of 16.00072506070137GiB. Address: 0x0000007dc821ba00
As expected, NaN in iteration 7 with a total buffer size allocated of 18.00081569328904GiB. Address: 0x0000007e48233a00
As expected, NaN in iteration 8 with a total buffer size allocated of 20.000906325876713GiB. Address: 0x0000007ec824ba00
As expected, NaN in iteration 9 with a total buffer size allocated of 22.000996958464384GiB. Address: 0x0000007f48263a00
As expected, NaN in iteration 10 with a total buffer size allocated of 24.001087591052055GiB. Address: 0x0000007fc827ba00
As expected, NaN in iteration 11 with a total buffer size allocated of 26.001178223639727GiB. Address: 0x0000008048293a00
As expected, NaN in iteration 12 with a total buffer size allocated of 28.001268856227398GiB. Address: 0x00000080c82aba00
As expected, NaN in iteration 13 with a total buffer size allocated of 30.00135948881507GiB. Address: 0x00000081482c3a00
As expected, NaN in iteration 14 with a total buffer size allocated of 32.00145012140274GiB. Address: 0x00000081c82dba00
As expected, NaN in iteration 15 with a total buffer size allocated of 34.00154075399041GiB. Address: 0x00000082482f3a00
As expected, NaN in iteration 16 with a total buffer size allocated of 36.00163138657808GiB. Address: 0x00000082c830ba00
As expected, NaN in iteration 17 with a total buffer size allocated of 38.001722019165754GiB. Address: 0x0000008348323a00
As expected, NaN in iteration 18 with a total buffer size allocated of 40.001812651753426GiB. Address: 0x00000083c833ba00
As expected, NaN in iteration 19 with a total buffer size allocated of 42.0019032843411GiB. Address: 0x0000008448353a00
As expected, NaN in iteration 20 with a total buffer size allocated of 44.00199391692877GiB. Address: 0x00000084c836ba00

Initializing a & b. N = 23171, autoreleasepools: true
a at 0x000000067c3a0020 should fail: true
b at 0x0000007200120020 should fail: true
Starting matmul

As expected, NaN in iteration 1 with a total buffer size allocated of 6.000271897763014GiB. Address: 0x0000008548383a00
As expected, NaN in iteration 2 with a total buffer size allocated of 8.000362530350685GiB. Address: 0x00000085c839ba00
As expected, NaN in iteration 3 with a total buffer size allocated of 10.000453162938356GiB. Address: 0x0000008548383a00
As expected, NaN in iteration 4 with a total buffer size allocated of 12.000543795526028GiB. Address: 0x00000085c839ba00
As expected, NaN in iteration 5 with a total buffer size allocated of 14.000634428113699GiB. Address: 0x0000008548383a00
As expected, NaN in iteration 6 with a total buffer size allocated of 16.00072506070137GiB. Address: 0x00000085c839ba00
As expected, NaN in iteration 7 with a total buffer size allocated of 18.00081569328904GiB. Address: 0x0000008548383a00
As expected, NaN in iteration 8 with a total buffer size allocated of 20.000906325876713GiB. Address: 0x00000085c839ba00
As expected, NaN in iteration 9 with a total buffer size allocated of 22.000996958464384GiB. Address: 0x0000008548383a00
As expected, NaN in iteration 10 with a total buffer size allocated of 24.001087591052055GiB. Address: 0x00000085c839ba00
As expected, NaN in iteration 11 with a total buffer size allocated of 26.001178223639727GiB. Address: 0x0000008548383a00
As expected, NaN in iteration 12 with a total buffer size allocated of 28.001268856227398GiB. Address: 0x00000085c839ba00
As expected, NaN in iteration 13 with a total buffer size allocated of 30.00135948881507GiB. Address: 0x0000008548383a00
As expected, NaN in iteration 14 with a total buffer size allocated of 32.00145012140274GiB. Address: 0x00000085c839ba00
As expected, NaN in iteration 15 with a total buffer size allocated of 34.00154075399041GiB. Address: 0x0000008548383a00
As expected, NaN in iteration 16 with a total buffer size allocated of 36.00163138657808GiB. Address: 0x00000085c839ba00
As expected, NaN in iteration 17 with a total buffer size allocated of 38.001722019165754GiB. Address: 0x0000008548383a00
As expected, NaN in iteration 18 with a total buffer size allocated of 40.001812651753426GiB. Address: 0x00000085c839ba00
As expected, NaN in iteration 19 with a total buffer size allocated of 42.0019032843411GiB. Address: 0x0000008548383a00
As expected, NaN in iteration 20 with a total buffer size allocated of 44.00199391692877GiB. Address: 0x00000085c839ba00
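
To see why the heuristic flags some buffers and not others, it can be applied by hand to two of the `a` addresses logged above. This is a minimal sketch: `crossesTwoGiBBoundary` is a hypothetical name that mirrors the `willfail` rule from the program, which is itself only a guess at the failure condition and not a documented MPS limitation.

```swift
let twoGiB: UInt = 1 << 31

// Hypothetical helper mirroring the willfail heuristic from the reproducer.
func crossesTwoGiBBoundary(start: UInt, length: UInt) -> Bool {
    // Power-of-two lengths did not fail in the runs above.
    if length != 0 && (length & (length - 1)) == 0 { return false }
    // Next multiple of 2 GiB strictly above the start address.
    let upper = start - (start % twoGiB) + twoGiB
    return start + length > upper
}

// Byte lengths of N x N Float32 matrices.
let len10000: UInt = 4 * 10000 * 10000
let len16385: UInt = 4 * 16385 * 16385

// `a` addresses copied from the logged output for N = 10000 and N = 16385.
print(crossesTwoGiBBoundary(start: 0x440000020, length: len10000)) // false
print(crossesTwoGiBBoundary(start: 0x67c3a0020, length: len16385)) // true
```

In the first case the buffer ends at 0x457d78420, below the next boundary at 0x480000000. In the second it ends at 0x6bc3c0024, past the boundary at 0x680000000. Both results match the "should fail" lines printed for `a`.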

@maleadt Is this worth forwarding to your Apple rep again? If you do, I updated my initial report (FB14293696).
It seems quite problematic that the MPS matrix multiplication is either bugged or has a limitation that is not mentioned anywhere.

christiangnrd · 2024-12-12T23:06:39Z

Similar issue? pytorch/pytorch#117549. It was fixed by implementing their own matmul kernel: pytorch/pytorch#117549

christiangnrd added the bug label Jul 4, 2024

maleadt changed the title ~~matrix multiplication not always synchronized~~ M1/M1: Large matrix multiplications can contains NaNs Jul 5, 2024

christiangnrd changed the title ~~M1/M1: Large matrix multiplications can contains NaNs~~ M1/M2: Large matrix multiplications can contains NaNs Jul 5, 2024

christiangnrd added the upstream Out of our hands label Jul 12, 2024

maleadt mentioned this issue Sep 21, 2024

Can't use gemm! methods with Metal #423

Closed

tgymnich removed the bug label Oct 18, 2024

chengchingwen mentioned this issue Dec 1, 2024

add Metal extension for batched_mul FluxML/NNlib.jl#614

Draft

2 tasks

This was referenced Dec 5, 2024

[dont merge] Check to see effect on performance christiangnrd/Metal.jl#6

Closed

[dont merge] Check to see effect on performance #490

Closed
