Multi-device 2D convolution numerical imprecision #18283
If I use just a single device I get the same erroneous result. |
The comment says that you can repro for a single device... your program doesn't compile with a single device for me. Can you post the repro for a single device? |
@MaheshRavishankar, what type of compiler error did you get? Did you pick up this recent fix, #18217? Here are some of the program variants I tried. I compiled the program for a single device up to, but excluding, ConvertToStreamPass (iree-stream-conversion). Then I removed the |
Here is another variant.

// Good
flow.executable private @main$async_dispatch_2 {
flow.executable.export public @main$async_dispatch_2_conv_2d_nchw_fchw_2x4x7x9x6x5x5_f32 workgroups() -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice
flow.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @main$async_dispatch_2_conv_2d_nchw_fchw_2x4x7x9x6x5x5_f32(%arg0: !flow.dispatch.tensor<readonly:tensor<2x6x11x13xf32>>, %arg1: !flow.dispatch.tensor<readonly:tensor<4x6x5x5xf32>>, %arg2: !flow.dispatch.tensor<writeonly:tensor<2x4x7x9xf32>>) {
%cst = arith.constant 0.000000e+00 : f32
%0 = flow.dispatch.tensor.load %arg0, offsets = [0, 0, 0, 0], sizes = [2, 6, 11, 13], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<2x6x11x13xf32>> -> tensor<2x6x11x13xf32>
%1 = flow.dispatch.tensor.load %arg1, offsets = [0, 0, 0, 0], sizes = [4, 6, 5, 5], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x6x5x5xf32>> -> tensor<4x6x5x5xf32>
%2 = tensor.empty() : tensor<2x4x7x9xf32>
%3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<2x4x7x9xf32>) -> tensor<2x4x7x9xf32>
%4 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%0, %1 : tensor<2x6x11x13xf32>, tensor<4x6x5x5xf32>) outs(%3 : tensor<2x4x7x9xf32>) -> tensor<2x4x7x9xf32>
flow.dispatch.tensor.store %4, %arg2, offsets = [0, 0, 0, 0], sizes = [2, 4, 7, 9], strides = [1, 1, 1, 1] : tensor<2x4x7x9xf32> -> !flow.dispatch.tensor<writeonly:tensor<2x4x7x9xf32>>
return
}
}
}

// Bad
flow.executable private @main$async_dispatch_2 {
flow.executable.export public @main$async_dispatch_2_conv_2d_nchw_fchw_2x4x7x9x6x5x5_f32 workgroups(%arg0: index, %arg1: index, %arg2: index, %arg3: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1, %arg2, %arg3
flow.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @main$async_dispatch_2_conv_2d_nchw_fchw_2x4x7x9x6x5x5_f32(%arg0: !flow.dispatch.tensor<readonly:tensor<?x?x?x?xf32>>, %arg1: !flow.dispatch.tensor<readonly:tensor<4x6x5x5xf32>>, %arg2: index, %arg3: index, %arg4: index, %arg5: index, %arg6: !flow.dispatch.tensor<writeonly:tensor<2x4x7x9xf32>>) {
%cst = arith.constant 0.000000e+00 : f32
%0 = flow.dispatch.workload.ordinal %arg2, 0 : index
%1 = flow.dispatch.workload.ordinal %arg3, 1 : index
%2 = flow.dispatch.workload.ordinal %arg4, 2 : index
%3 = flow.dispatch.workload.ordinal %arg5, 3 : index
%4 = flow.dispatch.tie_shape %arg0 : !flow.dispatch.tensor<readonly:tensor<?x?x?x?xf32>>{%0, %1, %2, %3}
%5 = flow.dispatch.tensor.load %4, offsets = [0, 0, 0, 0], sizes = [%0, %1, %2, %3], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?x?x?xf32>>{%0, %1, %2, %3} -> tensor<?x?x?x?xf32>
%6 = flow.dispatch.tensor.load %arg1, offsets = [0, 0, 0, 0], sizes = [4, 6, 5, 5], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x6x5x5xf32>> -> tensor<4x6x5x5xf32>
%7 = tensor.empty() : tensor<2x4x7x9xf32>
%cast = tensor.cast %5 : tensor<?x?x?x?xf32> to tensor<2x6x11x13xf32>
%8 = linalg.fill ins(%cst : f32) outs(%7 : tensor<2x4x7x9xf32>) -> tensor<2x4x7x9xf32>
%9 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%cast, %6 : tensor<2x6x11x13xf32>, tensor<4x6x5x5xf32>) outs(%8 : tensor<2x4x7x9xf32>) -> tensor<2x4x7x9xf32>
flow.dispatch.tensor.store %9, %arg6, offsets = [0, 0, 0, 0], sizes = [2, 4, 7, 9], strides = [1, 1, 1, 1] : tensor<2x4x7x9xf32> -> !flow.dispatch.tensor<writeonly:tensor<2x4x7x9xf32>>
return
}
}
}

This leads me to believe that these dynamic shapes, and the conversion to static shapes inside the dispatch, are the culprit. Note that there are other dispatches that copy slices and also use dynamic shapes, but they are fine. The conv dispatch is more involved, and the problem is something more specific to it. |
Here is a comparison of just the dispatches, in complete isolation. |
I tried the comparison I posted in my previous post on the HIP backend and got correct results. Maybe the issue is with the dispatch's push constants on the CPU backend. |
I made a modification of the faulty dispatch where the constants that are passed in are also returned (return-dispatch-consts.zip). They are correct. This means that the bug is squarely in the |
To further verify that the dynamic tensor argument to the dispatch is the culprit I edited the output of |
@lialan could you try to take a look? I can repro locally and will try to take a look as well, but please do take a look. |
Ok, I see the issue here
We need to make this more robust, but I think adding a

What is actually happening is that, because the number-of-workgroups region is left empty, the default workgroup region that gets added is just wrong.... I need to look deeper into how to manage the guards for this, but the current expectation is that if you are adding a

The other alternative is that you just add a |
I added a count region:

count(%arg2: index, %arg3: index, %arg4: index, %arg5: index) -> (index, index, index) {
%c7 = arith.constant 7 : index
%c4 = arith.constant 4 : index
%c1 = arith.constant 1 : index
flow.return %c7, %c4, %c1 : index, index, index
}

This produces the same wrong result. In the original dispatch comparison, here is what the good and bad variants look like when they get to after

Good:
hal.executable.export public @main_dispatch_0_conv_2d_nchw_fchw_2x4x7x9x6x5x5_f32 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, "ReadOnly|Indirect">, <2, storage_buffer, Indirect>], flags = Indirect>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>, #hal.interface.binding<0, 2>]} {
^bb0(%arg0: !hal.device):
%c7 = arith.constant 7 : index
%c4 = arith.constant 4 : index
%c1 = arith.constant 1 : index
hal.return %c7, %c4, %c1 : index, index, index
}

Bad:
hal.executable.export public @main_dispatch_0_conv_2d_nchw_fchw_2x4x7x9x6x5x5_f32 ordinal(0) layout(#hal.pipeline.layout<push_constants = 8, sets = [<0, bindings = [<0, storage_buffer, "ReadOnly|Indirect">, <1, storage_buffer, "ReadOnly|Indirect">, <2, storage_buffer, Indirect>], flags = Indirect>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>, #hal.interface.binding<0, 2>]} {
^bb0(%arg0: !hal.device, %arg1: index, %arg2: index, %arg3: index, %arg4: index):
%c7 = arith.constant 7 : index
%c4 = arith.constant 4 : index
%c1 = arith.constant 1 : index
hal.return %c7, %c4, %c1 : index, index, index
} |
@lialan I think I have ruled out any "structural" issues... this does seem to be a straight-up correctness bug in the CPU backend. Just to post some notes, I tried this input
(just a better version of what I'd expect to work while I was triaging). This still repros the error. |
Ok, I think I found the issue. I think there is something wrong with

This is the IR for the PASSING cases
and this is the IR for the failing case
The indexing seems off... the passing case increments the index of the 1-th dimension by 1 every time, but the failing case does not seem to do that and instead adds to the 2-th dimension. @lialan this should be enough triage for you to fix? |
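To make the effect of that mis-indexed dimension concrete, here is a small NumPy illustration of my own (not code from the thread), using the element strides of the 2x6x11x13 input from the dispatches above: stepping the 1-th (channel) dimension and stepping the 2-th (row) dimension advance the flat offset by very different amounts, so accumulating along the wrong one reads entirely different input elements.

```python
import numpy as np

# Row-major element strides of the 2x6x11x13 f32 conv input used above.
x = np.empty((2, 6, 11, 13), dtype=np.float32)
elem_strides = [s // x.itemsize for s in x.strides]
print(elem_strides)                      # [858, 143, 13, 1]

# Incrementing dim 1 moves 143 elements through the buffer; incrementing
# dim 2 moves only 13. An address computation that bumps dim 2 where it
# should bump dim 1 therefore walks (and reduces over) the wrong data.
print(elem_strides[1], elem_strides[2])  # 143 13
```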
In general we should avoid having many end-to-end tests, but I decided to add this one since it is the isolation of a real problem, iree-org/iree#18283, and I already had the test. We would like to have at least several E2E tests of tiny models that run on every PR.
Tracking this issue: it seems that the discrepancy in the dimension happens when trying to fold
when called with
and |
In
It does feel like, by looking solely at the strides, we reach a reasonable result for the dynamically shaped input, but in reality we need more information to decide which dim is dropped, as both dim 0 and dim 2 could possibly be dropped in:
@MaheshRavishankar in such a case, where a single unambiguous result cannot be reached, should we just bail out? Or maybe you have better ideas.. |
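For intuition, here is an illustrative NumPy sketch of my own with made-up unit-dim shapes (the exact memref type from the failing case is not shown above): the stride attached to a unit dimension never influences which element is addressed, so rank-reducing views that drop different dims can cover exactly the same memory, and strides alone cannot recover which dim was dropped; with a fully dynamic source the sizes and strides are not even known at compile time.

```python
import numpy as np

# Hypothetical 4-D buffer with two unit dims, standing in for the
# dynamically shaped source of the rank-reducing subview.
src = np.arange(6 * 13, dtype=np.float32).reshape(1, 6, 1, 13)

v_drop0 = src[0]        # view that drops dim 0 -> shape (6, 1, 13)
v_drop2 = src[:, :, 0]  # view that drops dim 2 -> shape (1, 6, 13)

# Both views cover exactly the same 78 elements of the same buffer in the
# same order; they differ only in where the remaining unit dim sits, and a
# unit dim's extent/stride contributes nothing to addressing.
print(np.array_equal(v_drop0.ravel(), v_drop2.ravel()))  # True
print(v_drop0.strides, v_drop2.strides)                  # (52, 52, 4) vs (312, 52, 4)
```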
Edit: I was wrong in my previous comment. The dynamic dims are lowered correctly after materialization. The issue is with the memref.subview
Aside from the issue of
%cast = tensor.cast %0 : tensor<2x3x11x13xf32> to tensor<?x?x?x?xf32>
%2 = flow.tensor.transfer %cast : tensor<?x?x?x?xf32>{%c2, %c3, %c11, %c13} to #hal.device.affinity<@__device_0>
%cast_2 = tensor.cast %2 : tensor<?x?x?x?xf32> to tensor<2x3x11x13xf32>

@rsuderman suggested that we should have a pattern that moves the first cast after the transfer:

%2 = flow.tensor.transfer %0 : tensor<2x3x11x13xf32> to #hal.device.affinity<@__device_0>
%cast = tensor.cast %2 : tensor<2x3x11x13xf32> to tensor<?x?x?x?xf32>
%cast_2 = tensor.cast %cast : tensor<?x?x?x?xf32> to tensor<2x3x11x13xf32>

Then there is probably already a pattern to remove the 2 consecutive casts.
Before CloneProducersIntoDispatchRegionsPass:

%cast_7 = tensor.cast %5 : tensor<?x?x?x?xf32> to tensor<2x6x11x13xf32>
%7 = tensor.empty() : tensor<2x4x7x9xf32>
%8 = linalg.fill ins(%cst_1 : f32) outs(%7 : tensor<2x4x7x9xf32>) -> tensor<2x4x7x9xf32>
%9 = flow.dispatch.region -> (tensor<2x4x7x9xf32>) {
%18 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%cast_7, %cst : tensor<2x6x11x13xf32>, tensor<4x6x5x5xf32>) outs(%8 : tensor<2x4x7x9xf32>) -> tensor<2x4x7x9xf32>
flow.return %18 : tensor<2x4x7x9xf32>
}

After CloneProducersIntoDispatchRegionsPass:

%5 = flow.tensor.transfer %cast_6 : tensor<?x?x?x?xf32>{%c2, %c6, %c11, %c13} to #hal.device.affinity<@__device_0>
%9 = flow.dispatch.region -> (tensor<2x4x7x9xf32>) {
%18 = tensor.empty() : tensor<2x4x7x9xf32>
%cst_17 = arith.constant 0.000000e+00 : f32
%cast_18 = tensor.cast %5 : tensor<?x?x?x?xf32> to tensor<2x6x11x13xf32>
%19 = linalg.fill ins(%cst_17 : f32) outs(%18 : tensor<2x4x7x9xf32>) -> tensor<2x4x7x9xf32>
%20 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%cast_18, %cst : tensor<2x6x11x13xf32>, tensor<4x6x5x5xf32>) outs(%19 : tensor<2x4x7x9xf32>) -> tensor<2x4x7x9xf32>
flow.return %20 : tensor<2x4x7x9xf32>
}

Maybe not move a |
@benvanik as described in point 2 of my previous message, do you think we should push |
yep! any ops that don't impact data (reshapes, casts, bitcasts, etc) should be able to propagate through or be replicated on either side |
I will implement this rewrite pattern or parts of it. |
Not moving casts kind of makes sense to me. But I don't know what the fallout will be. We should try though |
This solves the problem in iree-org/iree#18283. The issue is that we generate casts to/from dynamic tensors that later lowering in IREE chokes on. My assumption is that IREE should be able to digest this IR, since it is of the form:

```mlir
%2 = torch_c.to_builtin_tensor %arg0 : !torch.vtensor<[2,3,11,13],f32> -> tensor<2x3x11x13xf32>
%cast = tensor.cast %2 : tensor<2x3x11x13xf32> to tensor<?x?x?x?xf32>
%c0 = arith.constant 0 : index
%dim = tensor.dim %cast, %c0 : tensor<?x?x?x?xf32>
%c1 = arith.constant 1 : index
%dim_0 = tensor.dim %cast, %c1 : tensor<?x?x?x?xf32>
%c2 = arith.constant 2 : index
%dim_1 = tensor.dim %cast, %c2 : tensor<?x?x?x?xf32>
%c3 = arith.constant 3 : index
%dim_2 = tensor.dim %cast, %c3 : tensor<?x?x?x?xf32>
%3 = flow.tensor.transfer %cast : tensor<?x?x?x?xf32>{%dim, %dim_0, %dim_1, %dim_2} to #hal.device.promise<@__device_0>
%cast_3 = tensor.cast %3 : tensor<?x?x?x?xf32> to tensor<2x3x11x13xf32>
%4 = torch_c.from_builtin_tensor %cast_3 : tensor<2x3x11x13xf32> -> !torch.vtensor<[2,3,11,13],f32>
```

It essentially casts to a dynamic `tensor<...>` for the purpose of performing `flow.tensor.transfer` and then casts back to a static `torch.vtensor`, so it should be fine. With this change we get:

```mlir
%2 = torch_c.to_builtin_tensor %arg0 : !torch.vtensor<[2,3,11,13],f32> -> tensor<2x3x11x13xf32>
%3 = flow.tensor.transfer %2 : tensor<2x3x11x13xf32> to #hal.device.promise<@__device_0>
%4 = torch_c.from_builtin_tensor %3 : tensor<2x3x11x13xf32> -> !torch.vtensor<[2,3,11,13],f32>
```

Signed-off-by: Boian Petkantchin <[email protected]>
I encountered some numerical imprecision when executing on 2 local-task devices (most likely it is actually a compiler lowering bug).
I have narrowed it down to a 2D convolution without padding or bias.
Interestingly, with padding the result is correct.
conv2d-multi-device-numerical-imprecision.zip (updated) contains a reproducer.
My assumption is that the lowering of the main conv dispatch, which is prepared differently in the two cases, is wrong. When using padding, we first zero-fill a tensor and then insert the input according to the padding size. Without padding, we use the tensor directly.
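For anyone else triaging, here is a minimal NumPy reference of my own (not part of the reproducer) for what the conv dispatch computes, using the 2x6x11x13 input and 4x6x5x5 filter shapes from the dispatches earlier in the thread; it also shows the zero-fill-and-insert formulation described above for the padded path, whose interior must agree with the direct, no-padding result.

```python
import numpy as np

def conv2d_nchw_fchw(x, w):
    """Reference NCHW/FCHW conv: stride 1, dilation 1, no padding, no bias."""
    n, c, h, wd = x.shape              # input:  N x C x H x W
    f, _, kh, kw = w.shape             # filter: F x C x KH x KW
    oh, ow = h - kh + 1, wd - kw + 1
    out = np.zeros((n, f, oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            patch = x[:, :, i:i + kh, j:j + kw]                  # N x C x KH x KW
            out[:, :, i, j] = np.einsum("nchw,fchw->nf", patch, w)
    return out

x = np.random.rand(2, 6, 11, 13).astype(np.float32)
w = np.random.rand(4, 6, 5, 5).astype(np.float32)

direct = conv2d_nchw_fchw(x, w)        # the no-padding path
print(direct.shape)                    # (2, 4, 7, 9)

# Padded path as described above: zero-fill a larger tensor, insert the
# input according to the padding size (here padding 1 on H and W), convolve,
# then compare the interior against the direct result.
xp = np.zeros((2, 6, 13, 15), dtype=np.float32)
xp[:, :, 1:12, 1:14] = x
padded = conv2d_nchw_fchw(xp, w)       # (2, 4, 9, 11)
print(np.allclose(padded[:, :, 1:8, 1:10], direct))  # True
```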