Heap buffer overflow when running simulation #6

Open
HeRaNO opened this issue Nov 5, 2024 · 0 comments · May be fixed by #14

Reproduce

  1. Turn on the NS3_SANITIZE option: https://github.com/aliyun/ns-3-alibabacloud/blob/master/simulation/CMakeLists.txt#L61
  2. Run the simulation as normal.

Logs

maxRtt=4720 maxBdp=236000
Running Simulation.
The final active chunks per dimension 1 after allocating to queues is: 1
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
total nodes: 144
Success in opening workload file
model_parallel_NPU_group: is: 8
checkpoints layers are: 
layers initiating fwd_in_bckwd are: 
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
id: embedding_layer , depen: -1 , wg_comp_time: 1
type: HYBRID_TRANSFORMER_FWD_IN_BCKWD ,num passes: 1 ,lines: 1 compute scale: 1 ,comm scale: 1
stat path: ./ncclFlowModel_ ,total rows: 1 ,stat row: 0
CSV path and filename: ./ncclFlowModel_detailed_144.csv
CSV path and filename: ./ncclFlowModel_EndToEnd_144.csv
=================================================================
==9941==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000fd2f74 at pc 0x7f475725362f bp 0x7fff94b9a270 sp 0x7fff94b9a260
READ of size 4 at 0x602000fd2f74 thread T0
    #0 0x7f475725362e in MockNccl::MockNcclGroup::InterDouBinTreeShift(MockNccl::MockNcclGroup::DoubleBinaryTreeNode*, std::vector<int, std::allocator<int> >) /root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/src/applications/astra-sim/system/MockNcclGroup.cc:2038
    #1 0x7f475725200b in MockNccl::MockNcclGroup::genInterDouBinTree(MockNccl::GroupInfo) /root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/src/applications/astra-sim/system/MockNcclGroup.cc:2000
    #2 0x7f475724e5e3 in MockNccl::MockNcclGroup::gettreechannels(int, MockNccl::GroupType) /root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/src/applications/astra-sim/system/MockNcclGroup.cc:1893
    #3 0x7f47571cf384 in MockNccl::MockNcclComm::MockNcclComm(int, MockNccl::GroupType, MockNccl::MockNcclGroup*) /root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/src/applications/astra-sim/system/MockNcclChannel.cc:22
    #4 0x7f475738f260 in AstraSim::Sys::mock_nccl_comms_init() /root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/src/applications/astra-sim/system/Sys.cc:1411
    #5 0x7f4757363d59 in AstraSim::Sys::Sys(AstraSim::AstraNetworkAPI*, AstraSim::AstraMemoryAPI*, int, int, int, std::vector<int, std::allocator<int> >, std::vector<int, std::allocator<int> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, float, float, int, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool, GPUType, std::vector<int, std::allocator<int> >, std::vector<int, std::allocator<int> >, int) /root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/src/applications/astra-sim/system/Sys.cc:297
    #6 0x5562830980ce in main /root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/scratch/AstraSimNetwork.cc:311
    #7 0x7f473bda2d8f  (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f)
    #8 0x7f473bda2e3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f)
    #9 0x556283050384 in _start (/root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/build/scratch/ns3.36.1-AstraSimNetwork-debug+0x1d3384)

0x602000fd2f74 is located 0 bytes to the right of 4-byte region [0x602000fd2f70,0x602000fd2f74)
allocated by thread T0 here:
    #0 0x7f47694b51e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x55628316e51c in __gnu_cxx::new_allocator<int>::allocate(unsigned long, void const*) /usr/include/c++/11/ext/new_allocator.h:127
    #2 0x556283156623 in std::allocator_traits<std::allocator<int> >::allocate(std::allocator<int>&, unsigned long) /usr/include/c++/11/bits/alloc_traits.h:464
    #3 0x556283125b33 in std::_Vector_base<int, std::allocator<int> >::_M_allocate(unsigned long) /usr/include/c++/11/bits/stl_vector.h:346
    #4 0x5562830fc49b in std::_Vector_base<int, std::allocator<int> >::_M_create_storage(unsigned long) /usr/include/c++/11/bits/stl_vector.h:361
    #5 0x5562830d302a in std::_Vector_base<int, std::allocator<int> >::_Vector_base(unsigned long, std::allocator<int> const&) /usr/include/c++/11/bits/stl_vector.h:305
    #6 0x5562830affda in std::vector<int, std::allocator<int> >::vector(std::vector<int, std::allocator<int> > const&) /usr/include/c++/11/bits/stl_vector.h:555
    #7 0x7f4757251f96 in MockNccl::MockNcclGroup::genInterDouBinTree(MockNccl::GroupInfo) /root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/src/applications/astra-sim/system/MockNcclGroup.cc:2000
    #8 0x7f475724e5e3 in MockNccl::MockNcclGroup::gettreechannels(int, MockNccl::GroupType) /root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/src/applications/astra-sim/system/MockNcclGroup.cc:1893
    #9 0x7f47571cf384 in MockNccl::MockNcclComm::MockNcclComm(int, MockNccl::GroupType, MockNccl::MockNcclGroup*) /root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/src/applications/astra-sim/system/MockNcclChannel.cc:22
    #10 0x7f475738f260 in AstraSim::Sys::mock_nccl_comms_init() /root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/src/applications/astra-sim/system/Sys.cc:1411
    #11 0x7f4757363d59 in AstraSim::Sys::Sys(AstraSim::AstraNetworkAPI*, AstraSim::AstraMemoryAPI*, int, int, int, std::vector<int, std::allocator<int> >, std::vector<int, std::allocator<int> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, float, float, int, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, bool, GPUType, std::vector<int, std::allocator<int> >, std::vector<int, std::allocator<int> >, int) /root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/src/applications/astra-sim/system/Sys.cc:297
    #12 0x5562830980ce in main /root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/scratch/AstraSimNetwork.cc:311
    #13 0x7f473bda2d8f  (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f)

SUMMARY: AddressSanitizer: heap-buffer-overflow /root/SimAI/astra-sim-alibabacloud/extern/network_backend/ns3-interface/simulation/src/applications/astra-sim/system/MockNcclGroup.cc:2038 in MockNccl::MockNcclGroup::InterDouBinTreeShift(MockNccl::MockNcclGroup::DoubleBinaryTreeNode*, std::vector<int, std::allocator<int> >)
Shadow bytes around the buggy address:
  0x0c04801f2590: fa fa fd fa fa fa fd fd fa fa fd fa fa fa fd fa
  0x0c04801f25a0: fa fa fd fd fa fa fd fa fa fa fd fa fa fa fd fd
  0x0c04801f25b0: fa fa fd fa fa fa fd fa fa fa fd fd fa fa fd fa
  0x0c04801f25c0: fa fa fd fa fa fa fd fd fa fa fd fa fa fa fd fa
  0x0c04801f25d0: fa fa fd fd fa fa fd fa fa fa fd fa fa fa fd fd
=>0x0c04801f25e0: fa fa 04 fa fa fa 04 fa fa fa 00 fa fa fa[04]fa
  0x0c04801f25f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c04801f2600: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c04801f2610: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c04801f2620: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c04801f2630: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==9941==ABORTING
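
The report above says the 4-byte READ landed "0 bytes to the right of 4-byte region", and that region was allocated by a `std::vector<int>` copy in `genInterDouBinTree`. That suggests a one-element vector being read at index 1. The following is a minimal sketch of that arithmetic, not code from the project; `elements_in_region` and `read_is_out_of_bounds` are hypothetical helper names, and the bounds-checked `at()` is used here only so the bad access is caught instead of being undefined behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: how many int elements fit in a heap region of
// `region_bytes` bytes (the "4-byte region" in the ASan report).
std::size_t elements_in_region(std::size_t region_bytes) {
    assert(region_bytes % sizeof(int) == 0);
    return region_bytes / sizeof(int);
}

// Hypothetical helper: reports whether reading nodes[index] would fall
// outside the vector. at() throws std::out_of_range instead of reading
// past the end of the allocation as operator[] would.
bool read_is_out_of_bounds(const std::vector<int>& nodes, std::size_t index) {
    try {
        (void)nodes.at(index);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

With a single-element vector, index 0 is valid and index 1 is the first byte past the region, which matches the address ASan flags.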

Potential Fix

https://github.com/aliyun/SimAI/blob/master/astra-sim-alibabacloud/astra-sim/system/MockNcclGroup.cc#L2038

Change that line to:

    return node2treenode[nodes[(rank2index[root->node]+1) % nodes.size()]];
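
The effect of the `% nodes.size()` in the proposed line can be sketched in isolation. This is an illustration under assumptions, not the project's code: `next_node_wrapped` is a hypothetical function, and `rank2index` is assumed to be a map from node id to its position in `nodes`. The lookup into `node2treenode` is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for the index computation in the proposed fix:
// returns the node that follows `current` in `nodes`, wrapping from the
// last entry back to the first so the index never reaches nodes.size().
int next_node_wrapped(const std::vector<int>& nodes,
                      const std::map<int, int>& rank2index,
                      int current) {
    assert(!nodes.empty());  // the modulo below needs a non-zero size
    const std::size_t index =
        static_cast<std::size_t>(rank2index.at(current));
    return nodes[(index + 1) % nodes.size()];
}
```

For a one-element `nodes`, the successor of the only node is itself, so the read stays inside the allocation. Whether wrapping around is the intended tree-shift behavior is for the maintainers to confirm.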