multiple ttnn operations hang when run many times in loop on WH cards #4702

Closed
Tracked by #13795
nemanjagrujic opened this issue Jan 12, 2024 · 14 comments
Labels: bug (Something isn't working), op_cat: eltwise, P1, WH

Comments

@nemanjagrujic
Contributor

nemanjagrujic commented Jan 12, 2024

The ttnn.mul, ttnn.tanh, ttnn.add, ttnn.sub, ttnn.exp, and ttnn.gelu operations hang when run many times in a loop. So far the hang has been observed only on Wormhole cards.

To Reproduce
Steps to reproduce the behavior:
Check out the syrmia/ttnn-sweeps branch (soon to be merged into main). Run one of the following unit tests:

  • tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_mul_hang.py
  • tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_tanh_hang.py
  • tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_add_hang.py
  • tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_sub_hang.py
  • tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_exp_hang.py
  • tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_gelu_hang.py

using this command (for instance):

tests/ttnn/python_api_testing/non_working_unit_tests/wormhole/test_mul_hang.py

Expected behavior
The unit test runs 3 setups of the operation in a loop.
After a non-deterministic amount of running time (15-20 minutes), the program hangs and stops making progress.
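For anyone trying to pin down the iteration at which a run stalls, here is a minimal sketch of a hang-detecting loop. It is not the code in the test_*_hang.py files: `find_hanging_iteration` and `fake_op` are hypothetical names, and the stand-in op only sleeps to simulate a stall. It assumes the op can be called from a worker thread, which may not hold for real device calls.

```python
import threading
import time
from typing import Callable, Optional


def find_hanging_iteration(
    op: Callable[[int], object],
    iterations: int,
    timeout_s: float,
) -> Optional[int]:
    """Call op(i) for each iteration and return the first i that does not
    finish within timeout_s seconds, or None if every call finishes."""
    for i in range(iterations):
        # Daemon thread: a stuck call does not keep the process alive.
        worker = threading.Thread(target=op, args=(i,), daemon=True)
        worker.start()
        worker.join(timeout_s)
        if worker.is_alive():
            # The call is still running after the timeout: report a hang.
            return i
    return None


def fake_op(i: int) -> None:
    # Hypothetical stand-in for a ttnn call; it stalls on iteration 3.
    if i == 3:
        time.sleep(1.0)


if __name__ == "__main__":
    print(find_hanging_iteration(fake_op, iterations=10, timeout_s=0.2))
```

A thread-based timeout only detects the stall; it cannot kill a call that is stuck inside the driver, so a process-level timeout would still be needed to recover the machine.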

@nemanjagrujic nemanjagrujic added bug Something isn't working GS WH labels Jan 12, 2024
@nemanjagrujic nemanjagrujic changed the title ttnn.mul operation hangs when run many times in loop multiple ttnn operations hang when run many times in loop Jan 15, 2024
@nemanjagrujic nemanjagrujic removed the GS label Jan 17, 2024
@nemanjagrujic nemanjagrujic changed the title multiple ttnn operations hang when run many times in loop multiple ttnn operations hang when run many times in loop on WH cards Jan 17, 2024
@jliangTT jliangTT assigned umadevimcw and unassigned arakhmati Apr 24, 2024
@jliangTT jliangTT added the P1 label Apr 24, 2024
@VirdhatchaniKN
Contributor

Hi @jliangTT , @eyonland , @nemanjagrujic

@Aswinmcw and I worked on the issue and found the following:

  • We tried running the test for the same number of iterations in different combinations, but almost all of the cases hang at different iterations.
  • We also see similar hangs on GS cards.

We have documented this with screenshots here: TTNN hang Issue Analysis.pdf. We need your input to proceed further.

@jliangTT

jliangTT commented May 7, 2024

Next step:

@VirdhatchaniKN
Contributor

VirdhatchaniKN commented May 8, 2024

Hi @jliangTT

  • We rebased and re-ran all the tests. We did not see any hang today. Instead, after the test file passed, we got a segmentation fault. Screenshots are here: TTNN hang Issue Analysis-8May.pdf
  • As suggested, we ran the watcher for the add op on both GS and WH_B0. The log results are in this folder: Link

We would like your input to proceed further on this issue.

umadevimcw added a commit that referenced this issue Sep 13, 2024
@ruthreshx
Contributor

On the latest main, all the tests fail with the error below. Debugging is in progress.

self = FastOperation(python_fully_qualified_name='ttnn.add', function=<ttnn._ttnn.operations.binary.add_t object at 0x7fbfd58...<function default_postprocess_golden_function_outputs at 0x7fbfcedf5790>, is_cpp_operation=True, is_experimental=False)
function_args = (ttnn.Tensor([[-85.50000, 99.00000,  ..., 15.00000, -11.18750],
             [-56.00000, 51.25000,  ..., 11.25000, 91....000, 17.37500,  ..., -3.73438,  3.01562]], shape=Shape([150, 72]), dtype=DataType::BFLOAT16, layout=Layout::ROW_MAJOR))
function_kwargs = {}

    def __call__(self, *function_args, **function_kwargs):
>       return self.function(*function_args, **function_kwargs)
E       RuntimeError: TT_ASSERT @ ../ttnn/cpp/ttnn/tensor/tensor_impl.cpp:81: total_size_bytes % page_size == 0
E       backtrace:
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x4b3f0b) [0x7fbfd085df0b]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x4b3e9c) [0x7fbfd085de9c]
E        --- tt::tt_metal::tensor_impl::get_page_size(tt::tt_metal::DataType, tt::tt_metal::Layout, unsigned int, tt::tt_metal::LegacyShape const&)
E        --- tt::tt_metal::tensor_impl::detail::allocate_interleaved_buffer_on_device(unsigned long, tt::tt_metal::Device*, tt::tt_metal::LegacyShape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&)
E        --- tt::tt_metal::tensor_impl::allocate_buffer_on_device(unsigned long, tt::tt_metal::Device*, tt::tt_metal::LegacyShape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::MemoryConfig const&, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&)
E        --- tt::tt_metal::create_device_tensor(tt::tt_metal::LegacyShape const&, tt::tt_metal::DataType, tt::tt_metal::Layout, tt::tt_metal::Device*, tt::tt_metal::MemoryConfig const&)
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0xb1940c) [0x7fbfd0ec340c]
E        --- ttnn::operations::binary::BinaryDeviceOperation::create_output_tensors(ttnn::operations::binary::BinaryDeviceOperation::operation_attributes_t const&, ttnn::operations::binary::BinaryDeviceOperation::tensor_args_t const&)
E        --- ttnn::operations::binary::BinaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::launch_on_single_device<ttnn::operations::binary::BinaryDeviceOperation>(unsigned char, ttnn::operations::binary::BinaryDeviceOperation::operation_attributes_t const&, ttnn::operations::binary::BinaryDeviceOperation::tensor_args_t const&)
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x8e8ce4) [0x7fbfd0c92ce4]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x8e8c98) [0x7fbfd0c92c98]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x8e8c50) [0x7fbfd0c92c50]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x8e8bf8) [0x7fbfd0c92bf8]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x8e8970) [0x7fbfd0c92970]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x8e88d2) [0x7fbfd0c928d2]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x8e8834) [0x7fbfd0c92834]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x8e87dd) [0x7fbfd0c927dd]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x8e7be9) [0x7fbfd0c91be9]
E        --- ttnn::operations::binary::BinaryDeviceOperation::tensor_return_value_t ttnn::device_operation::detail::invoke<ttnn::operations::binary::BinaryDeviceOperation>(unsigned char, ttnn::operations::binary::BinaryDeviceOperation::operation_attributes_t const&, ttnn::operations::binary::BinaryDeviceOperation::tensor_args_t const&)
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x8e5ca0) [0x7fbfd0c8fca0]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x8d2827) [0x7fbfd0c7c827]
E        --- ttnn::operations::binary::BinaryOperation<(ttnn::operations::binary::BinaryOpType)0>::invoke(unsigned char, tt::tt_metal::Tensor const&, tt::tt_metal::Tensor const&, std::__1::optional<tt::tt_metal::DataType const> const&, std::__1::optional<tt::tt_metal::MemoryConfig> const&, std::__1::optional<tt::tt_metal::Tensor>, std::__1::optional<std::__1::vector<ttnn::operations::unary::UnaryWithParam, std::__1::allocator<ttnn::operations::unary::UnaryWithParam>>>, std::__1::optional<ttnn::operations::unary::UnaryWithParam>)
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1cec8f7) [0x7fbfd20968f7]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1cec694) [0x7fbfd2096694]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1cec606) [0x7fbfd2096606]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1cec1a8) [0x7fbfd20961a8]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1ceb983) [0x7fbfd2095983]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1cee8fc) [0x7fbfd20988fc]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1cee3e0) [0x7fbfd20983e0]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1cee38d) [0x7fbfd209838d]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1cee365) [0x7fbfd2098365]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1ced651) [0x7fbfd2097651]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x53e92a) [0x7fbfd08e892a]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x53e8dd) [0x7fbfd08e88dd]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1cf045c) [0x7fbfd209a45c]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1cf0425) [0x7fbfd209a425]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1cf03e5) [0x7fbfd209a3e5]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1cf03bd) [0x7fbfd209a3bd]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1cef6f9) [0x7fbfd20996f9]
E        --- /home/ubuntu/Anasuya/tt-metal/build_Debug/lib/libtt_metal.so(+0x13c842) [0x7fbfcfee2842]
E        --- /home/ubuntu/Anasuya/tt-metal/build_Debug/lib/libtt_metal.so(+0x13c805) [0x7fbfcfee2805]
E        --- /home/ubuntu/Anasuya/tt-metal/build_Debug/lib/libtt_metal.so(+0x22babb) [0x7fbfcffd1abb]
E        --- tt::tt_metal::Device::push_work(std::__1::shared_ptr<std::__1::function<void ()>>, bool)
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1ceb0d1) [0x7fbfd20950d1]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1ce9d5e) [0x7fbfd2093d5e]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1ce9b04) [0x7fbfd2093b04]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1ce95be) [0x7fbfd20935be]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1ce9457) [0x7fbfd2093457]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1ce93c6) [0x7fbfd20933c6]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1ce8b4c) [0x7fbfd2092b4c]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1ce89f6) [0x7fbfd20929f6]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x1ce88f9) [0x7fbfd20928f9]
E        --- /home/ubuntu/Anasuya/tt-metal/ttnn/ttnn/_ttnn.so(+0x4695ff) [0x7fbfd08135ff]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(PyCFunction_Call+0x59) [0x5e66b9]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyObject_MakeTpCall+0x29e) [0x5e728e]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3() [0x4f9588]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(PyObject_Call+0x62) [0x5e5e32]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3() [0x58db4c]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(PyObject_Call+0x25e) [0x5e602e]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x1f34) [0x55e124]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3() [0x58d7be]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyObject_MakeTpCall+0x29e) [0x5e728e]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x5dac) [0x561f9c]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x1b6) [0x5e6a66]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x72d) [0x55c91d]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(PyObject_Call+0x62) [0x5e5e32]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x1f34) [0x55e124]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(PyObject_Call+0x62) [0x5e5e32]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x1f34) [0x55e124]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x57f2) [0x5619e2]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3() [0x4f8d5e]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x57f2) [0x5619e2]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3() [0x58d83f]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyObject_MakeTpCall+0x29e) [0x5e728e]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x628d) [0x56247d]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x1b6) [0x5e6a66]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x859) [0x55ca49]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x1b6) [0x5e6a66]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(PyObject_Call+0x62) [0x5e5e32]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x1f34) [0x55e124]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x57f2) [0x5619e2]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3() [0x4f8d5e]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x57f2) [0x5619e2]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/ubuntu/Anasuya/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]

ttnn/ttnn/decorators.py:326: RuntimeError

@umadevimcw umadevimcw assigned ruthreshx and unassigned umadevimcw Sep 20, 2024
@ruthreshx
Contributor

ruthreshx commented Sep 23, 2024

Hi @nemanjagrujic , @eyonland , @KalaivaniMCW ,

During migration we added only TILE layout support for the unary ops below, so the tests fail with a runtime error.

  • Gelu
  • Exp
  • Tanh

Likewise, during migration we added only TILE layout support for the binary ops below, so the tests fail with a runtime error.

  • Add
  • Sub
  • Mul

I need to start a discussion with @KalaivaniMCW about ROW_MAJOR support before proceeding with the failures above.
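One plausible reading of the `total_size_bytes % page_size == 0` assertion in the traceback, sketched as plain arithmetic. This assumes the output buffer is allocated with 32x32 tile pages for the [150, 72] BFLOAT16 tensor; the actual logic in tensor_impl.cpp is not shown in this thread, and the helper names here are hypothetical.

```python
TILE_HEIGHT = 32      # assumed tile height
TILE_WIDTH = 32       # assumed tile width
BFLOAT16_BYTES = 2    # bytes per bfloat16 element


def total_bytes(rows: int, cols: int, element_bytes: int = BFLOAT16_BYTES) -> int:
    """Size of an unpadded rows x cols tensor in bytes."""
    return rows * cols * element_bytes


def tile_page_bytes(element_bytes: int = BFLOAT16_BYTES) -> int:
    """Size of one tile-sized page in bytes."""
    return TILE_HEIGHT * TILE_WIDTH * element_bytes


def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple


def padded_shape(rows: int, cols: int) -> tuple:
    """Shape after padding both dimensions up to whole tiles."""
    return round_up(rows, TILE_HEIGHT), round_up(cols, TILE_WIDTH)


if __name__ == "__main__":
    rows, cols = 150, 72  # shape from the traceback
    print(total_bytes(rows, cols) % tile_page_bytes())             # 1120
    print(padded_shape(rows, cols))                                # (160, 96)
    print(total_bytes(*padded_shape(rows, cols)) % tile_page_bytes())  # 0
```

Under that assumption, the unpadded 150x72 tensor does not divide into whole tile pages, while the tile-padded 160x96 shape does.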

@nemanjagrujic
Contributor Author

@ruthreshx Yes, at the time row major worked, and this hang was only observed for row major.

@eyonland
Contributor

Given that we no longer support auto-formatting and that we expect the user to call to_layout explicitly, can we verify that these operations do not hang when run many times with tile layout? If they do not, please close this issue.
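A minimal sketch of the explicit conversion described above, for reference. It needs a Tenstorrent device plus the torch and ttnn packages, so the imports sit inside the function. `add_in_tile_layout` is a hypothetical helper, not code from the test files, and the shape is only borrowed from the traceback.

```python
def add_in_tile_layout(shape=(150, 72), device_id=0):
    """Add two random bfloat16 tensors on device after converting them to
    TILE layout explicitly. Returns the result as a torch tensor."""
    # Imported lazily so this file can be loaded without the packages.
    import torch
    import ttnn

    device = ttnn.open_device(device_id=device_id)
    try:
        a = torch.rand(shape, dtype=torch.bfloat16)
        b = torch.rand(shape, dtype=torch.bfloat16)

        # Start in ROW_MAJOR, as the original hang tests did.
        tt_a = ttnn.from_torch(
            a, dtype=ttnn.bfloat16, layout=ttnn.ROW_MAJOR_LAYOUT, device=device
        )
        tt_b = ttnn.from_torch(
            b, dtype=ttnn.bfloat16, layout=ttnn.ROW_MAJOR_LAYOUT, device=device
        )

        # Explicit conversion instead of relying on auto-formatting.
        tt_a = ttnn.to_layout(tt_a, ttnn.TILE_LAYOUT)
        tt_b = ttnn.to_layout(tt_b, ttnn.TILE_LAYOUT)

        out = ttnn.add(tt_a, tt_b)
        return ttnn.to_torch(out)
    finally:
        # Always release the device, even if the op raises.
        ttnn.close_device(device)
```

Calling this in a long loop with tile layout is roughly what the verification request amounts to; exact behavior may differ between ttnn versions.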

@nemanjagrujic
Contributor Author

@eyonland I can test, but I don't expect it to hang.

@ruthreshx
Contributor

Hi @nemanjagrujic ,
Were you able to run the test without the hang issue in tile layout?

@nemanjagrujic
Contributor Author

@ruthreshx Mul did not hang. I need to test others as well.

@nemanjagrujic
Contributor Author

@ruthreshx Sub did not hang overnight. I'll try another op tomorrow night.

@umadevimcw
Contributor

@nemanjagrujic I have updated the layouts of all the tests in PR #14318. Could you please test and confirm whether the tests still hang?

umadevimcw added a commit that referenced this issue Oct 29, 2024
@umadevimcw
Contributor

umadevimcw commented Oct 29, 2024

@nemanjagrujic I have tested tanh, exp, and add; they work fine. No hang observed.

@nemanjagrujic
Contributor Author

@umadevimcw Yes. No hangs with TILE. This was surely a ROW_MAJOR issue.

@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in External Requests and Reports Oct 30, 2024
umadevimcw added a commit that referenced this issue Nov 12, 2024

9 participants