[frontend] allow var_mean to be implemented in one pass #1285
peterbell10 changed the title from "Add tl.welford" to "Add tl.welford to allow var_mean to be implemented in one pass" on Mar 6, 2023
I feel like such a function would not belong in the […]
ptillet changed the title from "Add tl.welford to allow var_mean to be implemented in one pass" to "[frontend] allow var_mean to be implemented in one pass" on Mar 6, 2023
ptillet pushed a commit that referenced this issue on Apr 13, 2023:
…1305)

Fixes #1285

This changes `tt.reduce` to replace `redOp` with a region containing arbitrary code. For example, `tl.sum` is now lowered as:

```mlir
%res = "tt.reduce"(%arg0) ({
^bb0(%arg1: f32, %arg2: f32):
  %add = arith.addf %arg1, %arg2 : f32
  tt.reduce.return %add : f32
}) {axis = 1 : i32} : (tensor<128x128xf32>) -> tensor<128xf32>
```

Support for index reductions at the MLIR level is also dropped in favor of simultaneous reductions over multiple tensors, which generalizes the code without loss of performance. So, for example, `argmin` is lowered as:

```mlir
%7 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
%8 = tt.view %7 : (tensor<256xi32>) -> tensor<1x256xi32>
%9:2 = "tt.reduce"(%6, %8) ({
^bb0(%arg4: f32, %arg5: i32, %arg6: f32, %arg7: i32):
  %14 = arith.cmpf olt, %arg4, %arg6 : f32
  %15 = arith.cmpf ogt, %arg4, %arg6 : f32
  %16 = arith.cmpi slt, %arg5, %arg7 : i32
  %17 = arith.select %16, %arg5, %arg7 : i32
  %18 = arith.select %15, %arg7, %17 : i32
  %19 = arith.select %14, %arg5, %18 : i32
  %20 = arith.cmpf olt, %arg4, %arg6 : f32
  %21 = arith.select %20, %arg4, %arg6 : f32
  tt.reduce.return %21, %19 : f32, i32
}) {axis = 1 : i32} : (tensor<1x256xf32>, tensor<1x256xi32>) -> (tensor<1xf32>, tensor<1xi32>)
```
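At the Python level, this generic reduction surfaces as `tl.reduce` taking a JIT-compiled scalar combine function. A minimal sketch of a sum written that way, assuming the `tl.reduce(input, axis, combine_fn)` signature that Triton releases including this change expose (the kernel and helper names here are illustrative):

```python
import triton
import triton.language as tl


@triton.jit
def add_combine(a, b):
    # Scalar combine step; this is what gets inlined into the tt.reduce region above.
    return a + b


@triton.jit
def row_sum_kernel(x_ptr, out_ptr, N, BLOCK_N: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_N)
    mask = cols < N
    x = tl.load(x_ptr + row * N + cols, mask=mask, other=0.0)
    # Equivalent to tl.sum(x, axis=0), but with a user-supplied combiner.
    total = tl.reduce(x, 0, add_combine)
    tl.store(out_ptr + row, total)
```

Because the combiner is ordinary JIT code, the same mechanism covers multi-tensor reductions such as `argmin` without dedicated ops.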
pingzhuu pushed a commit to siliconflow/triton that referenced this issue on Apr 2, 2024 (the same change, fixing triton-lang#1285; commit message identical to the one above).
Currently PyTorch Inductor is forced to implement `torch.var_mean` as two passes over the input data, which causes a slowdown for batch norm. To allow single-pass computation we need a new reduction operator, `tl.welford(mean, m2, count)`, which implements the combination step of the parallel Welford algorithm.

A more general solution might be to instead add a `tl.reduce` which takes a function acting on scalars, so users can write their own reductions without needing to change the Triton language.
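With such a `tl.reduce` in hand, the single-pass `var_mean` this issue asks for can be written entirely in user code via the parallel Welford combination step (Chan et al.). A hedged sketch, assuming `tl.reduce` accepts a tuple of tensors plus a combine function returning a tuple, as later Triton versions allow; `welford_combine` and `var_mean_kernel` are illustrative names, not an existing API:

```python
import triton
import triton.language as tl


@triton.jit
def welford_combine(mean_a, m2_a, w_a, mean_b, m2_b, w_b):
    # Parallel Welford combination of two partial aggregates (mean, M2, weight).
    w = w_a + w_b
    delta = mean_b - mean_a
    # Guard against merging two empty (weight-0) aggregates.
    frac = tl.where(w == 0.0, 0.0, w_b / w)
    mean = mean_a + delta * frac
    m2 = m2_a + m2_b + delta * delta * w_a * frac
    return mean, m2, w


@triton.jit
def var_mean_kernel(x_ptr, mean_ptr, var_ptr, N, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    # Each lane starts as a trivial aggregate: mean = x, M2 = 0, weight = 1 (0 if masked).
    w = tl.where(mask, 1.0, 0.0)
    m2 = tl.zeros([BLOCK], dtype=tl.float32)
    mean, m2, w = tl.reduce((x, m2, w), 0, welford_combine)
    tl.store(mean_ptr, mean)
    # Population variance; use m2 / (w - 1) for the unbiased estimator.
    tl.store(var_ptr, m2 / w)
```

This computes the mean and variance in a single pass over the data, which is exactly what Inductor needs to emit for batch norm.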