-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable serial grid reduction for split-K in matmul scheduler #1510
Commits on Dec 8, 2023
-
Configuration menu - View commit details
-
Copy full SHA for 96225e8 - Browse repository at this point
Copy the full SHA 96225e8View commit details -
Configuration menu - View commit details
-
Copy full SHA for f9d2d01 - Browse repository at this point
Copy the full SHA f9d2d01View commit details -
Configuration menu - View commit details
-
Copy full SHA for 6454b06 - Browse repository at this point
Copy the full SHA 6454b06View commit details -
Disable previous changes to indexing pass.
Will revisit once sync pass is done, when we have a TensorIndex
Configuration menu - View commit details
-
Copy full SHA for 1142e44 - Browse repository at this point
Copy the full SHA 1142e44View commit details -
Configuration menu - View commit details
-
Copy full SHA for 1368ba8 - Browse repository at this point
Copy the full SHA 1368ba8View commit details -
Configuration menu - View commit details
-
Copy full SHA for b97db45 - Browse repository at this point
Copy the full SHA b97db45View commit details -
Configuration menu - View commit details
-
Copy full SHA for ac2da9d - Browse repository at this point
Copy the full SHA ac2da9dView commit details -
Configuration menu - View commit details
-
Copy full SHA for ebef797 - Browse repository at this point
Copy the full SHA ebef797View commit details -
Configuration menu - View commit details
-
Copy full SHA for dcd8606 - Browse repository at this point
Copy the full SHA dcd8606View commit details
Commits on Dec 11, 2023
-
Fix sync insertion in lowering pass.
Still missing allocation/indexing of work buffer
Configuration menu - View commit details
-
Copy full SHA for e196fa7 - Browse repository at this point
Copy the full SHA e196fa7View commit details -
Configuration menu - View commit details
-
Copy full SHA for 8be320a - Browse repository at this point
Copy the full SHA 8be320aView commit details
Commits on Dec 12, 2023
-
Allocate global work buffer. Index is zero for now
I need to replay leaf transforms, then get index.
Configuration menu - View commit details
-
Copy full SHA for 41b125f - Browse repository at this point
Copy the full SHA 41b125fView commit details -
Configuration menu - View commit details
-
Copy full SHA for e465d96 - Browse repository at this point
Copy the full SHA e465d96View commit details
Commits on Dec 13, 2023
-
Configuration menu - View commit details
-
Copy full SHA for 34d623c - Browse repository at this point
Copy the full SHA 34d623cView commit details -
Use fullSelfReplay and getGlobalConsumerStridedIndices
Codegen is now like ```c++ // Allocate global tensor T5 reduction::serialReductionStep( T3[0LL], T2[(i14 + i18)], 0.000000000e+00f, T5[((((((((((((nvfuser_index_t)blockIdx.x) * 8LL) + ((nvfuser_index_t)blockIdx.y)) * 4LL) + i13) * 8LL) + (i18 + nvfuser_zero)) * 4LL) + ((nvfuser_index_t)threadIdx.y)) * 32LL) + ((nvfuser_index_t)threadIdx.x))], [](float &a, float b) { a = a + b; }, index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == 0, index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == index_utils::maskedSize<false, false, true>(gridDim) - 1, true, true); ``` This looks OK, although it will get a little better with hoisting. This compiles, but I get an error in `runFusion`: ``` C++ exception with description "Expected T5_g[ iblockIdx.x59{( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(( ceilDiv(262144, 32) ), 4) ), 8) ), 4) ), 8) )}, iblockIdx.y60{8}, ithreadIdx.y54{4}, ithreadIdx.x52{32}, iS58{4}, iS56{8}, rblockIdx.z49{5} ] to be bound to a tensor of rank 1, but got a tensor of rank 6 Exception raised from validateValWithConcreteValue at /opt/pytorch/nvfuser/csrc/expr_evaluator.cpp:38 (most recent call first): ``` This is happening when binding inputs I believe.
Configuration menu - View commit details
-
Copy full SHA for 910ff09 - Browse repository at this point
Copy the full SHA 910ff09View commit details -
Infer shape using allocation domain instead of root
Fixes execution error. Test passes!
Configuration menu - View commit details
-
Copy full SHA for 507cf47 - Browse repository at this point
Copy the full SHA 507cf47View commit details -
Configuration menu - View commit details
-
Copy full SHA for e44ef7e - Browse repository at this point
Copy the full SHA e44ef7eView commit details -
Generated kernel now looks like ```c++ // Allocate global tensor T4 grid_sync::blockSerializeWait<false, false, true>(&T4[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]); #pragma unroll for(nvfuser_index_t i13 = 0; i13 < 4LL; ++i13) { nvfuser_index_t i14; i14 = 8LL * i13; nvfuser_index_t i15; i15 = 2048LL * i13; nvfuser_index_t i16; i16 = i4 + i15; nvfuser_index_t i17; i17 = -i15; #pragma unroll for(nvfuser_index_t i18 = 0; i18 < 8LL; ++i18) { nvfuser_index_t i19; i19 = 256LL * (i18 + nvfuser_zero); nvfuser_index_t i20; i20 = i16 + i19; float T3[1LL]; T3[0LL] = 0.000000000e+00f; // Allocate global tensor T5 reduction::serialReductionStep( T3[0LL], T2[(i14 + i18)], 0.000000000e+00f, T5[i20], [](float &a, float b) { a = a + b; }, index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == 0, index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == index_utils::maskedSize<false, false, true>(gridDim) - 1, true, true); if ((b6 && (i5 < (i17 - i19)))) { T1[i20] = T3[0LL]; } } } NVFUSER_UPDATE_MAGIC_ZERO; grid_sync::blockSerializeRelease<false, false, true>(&T4[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]); ``` Note that the index `i20` matches the output `T1`. This is what we need to reclaim `T1` in a later PR; it will still be a challenge in that work to exact map between `T5` and `T3` in order to get `T1` and `T5` exact mapped...
Configuration menu - View commit details
-
Copy full SHA for 8a3134e - Browse repository at this point
Copy the full SHA 8a3134eView commit details -
Configuration menu - View commit details
-
Copy full SHA for 327573c - Browse repository at this point
Copy the full SHA 327573cView commit details -
Configuration menu - View commit details
-
Copy full SHA for d27a675 - Browse repository at this point
Copy the full SHA d27a675View commit details -
Update NVFuserTest.Pipeline_CUDA
Also sort expected output by line to give clearer error messages.
Configuration menu - View commit details
-
Copy full SHA for 46c70b6 - Browse repository at this point
Copy the full SHA 46c70b6View commit details -
Configuration menu - View commit details
-
Copy full SHA for 6d2089d - Browse repository at this point
Copy the full SHA 6d2089dView commit details
Commits on Dec 20, 2023
-
Configuration menu - View commit details
-
Copy full SHA for 751f326 - Browse repository at this point
Copy the full SHA 751f326View commit details
Commits on Jan 4, 2024
-
Configuration menu - View commit details
-
Copy full SHA for f4ad5ff - Browse repository at this point
Copy the full SHA f4ad5ffView commit details -
Configuration menu - View commit details
-
Copy full SHA for 1149b43 - Browse repository at this point
Copy the full SHA 1149b43View commit details
Commits on Jan 8, 2024
-
Configuration menu - View commit details
-
Copy full SHA for 94e55b8 - Browse repository at this point
Copy the full SHA 94e55b8View commit details
Commits on Jan 10, 2024
-
Configuration menu - View commit details
-
Copy full SHA for f2d7461 - Browse repository at this point
Copy the full SHA f2d7461View commit details
Commits on Jan 11, 2024
-
Configuration menu - View commit details
-
Copy full SHA for a864184 - Browse repository at this point
Copy the full SHA a864184View commit details
Commits on Jan 12, 2024
-
Configuration menu - View commit details
-
Copy full SHA for c236810 - Browse repository at this point
Copy the full SHA c236810View commit details
Commits on Jan 16, 2024
-
Configuration menu - View commit details
-
Copy full SHA for 819a6e0 - Browse repository at this point
Copy the full SHA 819a6e0View commit details
Commits on Jan 18, 2024
-
Configuration menu - View commit details
-
Copy full SHA for cf42527 - Browse repository at this point
Copy the full SHA cf42527View commit details -
Configuration menu - View commit details
-
Copy full SHA for 449a296 - Browse repository at this point
Copy the full SHA 449a296View commit details
Commits on Jan 19, 2024
-
Configuration menu - View commit details
-
Copy full SHA for 27e449a - Browse repository at this point
Copy the full SHA 27e449aView commit details -
Configuration menu - View commit details
-
Copy full SHA for c6ddebf - Browse repository at this point
Copy the full SHA c6ddebfView commit details -
Configuration menu - View commit details
-
Copy full SHA for 2c3278b - Browse repository at this point
Copy the full SHA 2c3278bView commit details -
Configuration menu - View commit details
-
Copy full SHA for 4e40d68 - Browse repository at this point
Copy the full SHA 4e40d68View commit details -
These were disabled in #1545 because of slow compilation with gridReduce
Configuration menu - View commit details
-
Copy full SHA for 7bfa709 - Browse repository at this point
Copy the full SHA 7bfa709View commit details -
Configuration menu - View commit details
-
Copy full SHA for fc07a9a - Browse repository at this point
Copy the full SHA fc07a9aView commit details
Commits on Jan 23, 2024
-
Configuration menu - View commit details
-
Copy full SHA for e9111e6 - Browse repository at this point
Copy the full SHA e9111e6View commit details -
Configuration menu - View commit details
-
Copy full SHA for 4f93290 - Browse repository at this point
Copy the full SHA 4f93290View commit details -
Configuration menu - View commit details
-
Copy full SHA for 915cd97 - Browse repository at this point
Copy the full SHA 915cd97View commit details