[FEA] Multi-GPU C++ testing #1136

seunghwak · 2020-09-08T19:15:49Z

Is your feature request related to a problem? Please describe.
We use Dask + UCX + NCCL to run multi-GPU analytics (Dask for launching processes, UCX for P2P, and NCCL for collectives). UCX endpoints are set-up during Dask initialization.

We are currently relying on Python frontend to test multi-GPU analytics.

In C++ only testing, Dask (launching processes) and UCX (P2P) are not available. We can use MPI for this purpose in C++ testing.

Describe the solution you'd like
cuML folks have solved this problem, and we can generally follow their solution.

RAFT comms can be configured to work with different backends.

MPI based RAFT comms implementation is now available (https://github.com/rapidsai/raft/pull/63/files); RAFT comms (in python) uses UCX + NCCL in backend. This MPI based RAFT comms uses MPI + NCCL in backend.

For C++ only testing, we can create a raft::comms::mpi_comms object (instead of a raft::comms::std_comms object)
raft::comms::std_comms
https://github.com/rapidsai/raft/blob/branch-0.16/cpp/include/raft/comms/std_comms.hpp
raft::comms::mpi_comms
https://github.com/rapidsai/raft/blob/branch-0.16/cpp/include/raft/comms/mpi_comms.hpp

test/CMakeLists.txt need to be properly updated, and we need an example for multi-GPU C++ testing.

We can reference the cuML repo to implement ths.

e.g.
https://github.com/rapidsai/cuml/blob/branch-0.16/cpp/test/CMakeLists.txt#L78

##############################################################################
# - build test_ml_mg executable ----------------------------------------------

if(BUILD_CUML_MG_TESTS)
  find_package(MPI)

  if(MPI_CXX_FOUND)
    # (please keep the filenames in alphabetical order)
    add_executable(ml_mg
      mg/knn.cu
      mg/knn_classify.cu
      mg/knn_regress.cu
      mg/main.cu
      mg/pca.cu
      mg/test_ml_mg_utils.cu)

    add_dependencies(ml_mg cumlcommsmpi)

    set(CUML_TEST_INCLUDE_DIRS
        ${CUML_TEST_INCLUDE_DIRS}
        ${MPI_CXX_INCLUDE_PATH}
        ${CMAKE_CURRENT_SOURCE_DIR}/../comms/mpi/include)

    target_include_directories(ml_mg PUBLIC ${CUML_TEST_INCLUDE_DIRS})

    set(CUML_TEST_LINK_LIBRARIES 
    	${CUML_TEST_LINK_LIBRARIES} 
    	${MPI_CXX_LIBRARIES}
    	cumlcommsmpi
    	cumlprims)

    target_link_libraries(ml_mg ${CUML_TEST_LINK_LIBRARIES})

  else(MPI_CXX_FOUND)
   message("OpenMPI not found. Skipping 'ml_mg'")
  endif(MPI_CXX_FOUND)
endif(BUILD_CUML_MG_TESTS)

The text was updated successfully, but these errors were encountered:

@rlratzel

…t using it (#1361) Added initial infrastructure for MG C++ testing and a Pagerank MG test using it. <s>Still a WIP, need to:</s> * <s>Shuffle step is currently failing</s> * <s>`graph_t` ctor expensive check is failing</s> * <s>Finish comparison code to reference SG Pagerank results</s> * <s>Fix the `#include` guard hack in `test_utilities.hpp`</s> * <s>Lots of cleanup</s> * <s>Refactor common steps into proper `SetUp()` and `TearDown()` functions</s> closes #1136 Authors: - Rick Ratzel (@rlratzel) - Seunghwa Kang (@seunghwak) Approvers: - Brad Rees (@BradReesWork) - Andrei Schaffer (@aschaffer) - Chuck Hastings (@ChuckHastings) URL: #1361

@seunghwak

- [x] Add tests using graphs with isolated vertices - [x] Add personalized PageRank tests - [x] Test code refactoring - [x] Create libcugraphtestutil.a This PR fixes FIXMEs added in #1361 to address #1136 Authors: - Seunghwa Kang (@seunghwak) Approvers: - Rick Ratzel (@rlratzel) - Andrei Schaffer (@aschaffer) - Brad Rees (@BradReesWork) URL: #1419

seunghwak added the ? - Needs Triage Need team to review and classify label Sep 8, 2020

BradReesWork assigned seunghwak Sep 17, 2020

BradReesWork added feature request New feature or request and removed ? - Needs Triage Need team to review and classify labels Sep 17, 2020

BradReesWork added this to the 0.16 milestone Sep 17, 2020

BradReesWork unassigned seunghwak Sep 17, 2020

BradReesWork modified the milestones: 0.16, 0.17 Sep 28, 2020

BradReesWork modified the milestones: 0.17, 0.18 Nov 23, 2020

BradReesWork assigned rlratzel and seunghwak Nov 23, 2020

afender added the tests label Nov 30, 2020

rlratzel added the 2 - In Progress label Feb 1, 2021

rlratzel mentioned this issue Feb 8, 2021

Added initial infrastructure for MG C++ testing and a Pagerank MG test using it #1361

Merged

rapids-bot bot closed this as completed in #1361 Feb 10, 2021

seunghwak mentioned this issue Feb 26, 2021

Update C++ MG PageRank test #1419

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Multi-GPU C++ testing #1136

[FEA] Multi-GPU C++ testing #1136

seunghwak commented Sep 8, 2020

[FEA] Multi-GPU C++ testing #1136

[FEA] Multi-GPU C++ testing #1136

Comments

seunghwak commented Sep 8, 2020