Extra FillFunctor kernels #3267

Open

cowanmeg (Collaborator) opened this issue Oct 24, 2024 · 0 comments
In the multidevice transformer tests, there are extra `FillFunctor` kernels.

Previously we generated these kernels:

```
ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_nn
<unnamed>::nvfuser_pointwise_f0_c1_r0_g12(<unnamed>::Tensor<<unnamed>::__bfloat, (int)2, (int)2>, <unnamed>::Tensor<<unnamed>::__bfloat, (int)3, (int)3>, <unnamed>::Tensor<<unnamed>::__bfloat, (int)3, (int)3>)
ampere_bf16_s16816gemm_bf16_256x128_ldg8_f2f_stages_64x3_nn
ncclDevKernel_AllReduce_Sum_bf16_RING_LL(ncclDevKernelArgsStorage<(unsigned long)4096>)
<unnamed>::nvfuser_pointwise_f0_c1_r0_g15(<unnamed>::Tensor<<unnamed>::__bfloat, (int)1, (int)1>, <unnamed>::Tensor<<unnamed>::__bfloat, (int)2, (int)2>, <unnamed>::Tensor<float, (int)2, (int)2>, long long *, long long, long long *, long long, <unnamed>::Tensor<float, (int)2, (int)2>, <unnamed>::Tensor<float, (int)2, (int)2>)
```

Currently, we generate these kernels (note: the different GEMM is intentional; it resulted from switching from a matmul + bias add to `linear`):

```
ampere_bf16_s16816gemm_bf16_128x64_ldg8_relu_f2f_stages_64x4_tn
void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<c10::BFloat16>, at::detail::Array<char *, (int)1>>(int, T2, T3)
<unnamed>::nvfuser_pointwise_f0_c1_r0_g4(<unnamed>::Tensor<<unnamed>::__bfloat, (int)3, (int)3>, <unnamed>::Tensor<<unnamed>::__bfloat, (int)3, (int)3>)
ampere_bf16_s16816gemm_bf16_64x64_sliced1x2_ldg8_f2f_stages_64x5_tn
void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<c10::BFloat16>, at::detail::Array<char *, (int)1>>(int, T2, T3)
ncclDevKernel_AllReduce_Sum_bf16_RING_LL(ncclDevKernelArgsStorage<(unsigned long)4096>)
void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<float>, at::detail::Array<char *, (int)1>>(int, T2, T3)
void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<float>, at::detail::Array<char *, (int)1>>(int, T2, T3)
<unnamed>::nvfuser_pointwise_f0_c1_r0_g5(<unnamed>::Tensor<<unnamed>::__bfloat, (int)1, (int)1>, <unnamed>::Tensor<<unnamed>::__bfloat, (int)2, (int)2>, long long *, long long, long long *, long long, <unnamed>::Tensor<float, (int)3, (int)3>, <unnamed>::Tensor<float, (int)3, (int)3>)
```
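For context: `at::native::FillFunctor` is the kernel ATen dispatches for `fill_`-style ops (`torch.full`, `Tensor.zero_`, zero-initializing a buffer before a reduction, etc.), so the extra launches likely come from some op in the new path filling a tensor eagerly. A minimal sketch of how one might attribute them with `torch.profiler` (the `run_suspect_op` body is purely illustrative, not taken from the actual test; on CUDA the same `aten::fill_` events correspond to the `FillFunctor` launches above):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Illustrative stand-in for the op under investigation: any fill-style
# allocation (torch.full, Tensor.zero_, ...) dispatches to a fill op,
# which on CUDA is the vectorized_elementwise_kernel<FillFunctor<...>>
# launch seen in the trace.
def run_suspect_op():
    return torch.full((1024, 1024), 0.0, dtype=torch.bfloat16)

# with_stack=True records Python stacks, so each fill event can be
# mapped back to the call site via prof.export_stacks(...).
with profile(activities=[ProfilerActivity.CPU], with_stack=True) as prof:
    run_suspect_op()

# Report every profiled op whose name mentions fill_ and how often it ran.
fills = [e for e in prof.key_averages() if "fill_" in e.key]
for e in fills:
    print(e.key, e.count)
```

Running this under Nsight Systems (or with `ProfilerActivity.CUDA` on a GPU build) would show whether the two bf16 and two fp32 `FillFunctor` launches line up with a specific fill call site in the linear path.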