Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Multi-GPU C++ testing #1136

Closed
seunghwak opened this issue Sep 8, 2020 · 0 comments · Fixed by #1361
Closed

[FEA] Multi-GPU C++ testing #1136

seunghwak opened this issue Sep 8, 2020 · 0 comments · Fixed by #1361
Assignees
Labels
feature request New feature or request
Milestone

Comments

@seunghwak
Copy link
Contributor

Is your feature request related to a problem? Please describe.
We use Dask + UCX + NCCL to run multi-GPU analytics (Dask for launching processes, UCX for P2P, and NCCL for collectives). UCX endpoints are set-up during Dask initialization.

We are currently relying on Python frontend to test multi-GPU analytics.

In C++ only testing, Dask (launching processes) and UCX (P2P) are not available. We can use MPI for this purpose in C++ testing.

Describe the solution you'd like
cuML folks have solved this problem, and we can generally follow their solution.

RAFT comms can be configured to work with different backends.

MPI based RAFT comms implementation is now available (https://github.com/rapidsai/raft/pull/63/files); RAFT comms (in python) uses UCX + NCCL in backend. This MPI based RAFT comms uses MPI + NCCL in backend.

For C++ only testing, we can create a raft::comms::mpi_comms object (instead of a raft::comms::std_comms object)
raft::comms::std_comms
https://github.com/rapidsai/raft/blob/branch-0.16/cpp/include/raft/comms/std_comms.hpp
raft::comms::mpi_comms
https://github.com/rapidsai/raft/blob/branch-0.16/cpp/include/raft/comms/mpi_comms.hpp

test/CMakeLists.txt need to be properly updated, and we need an example for multi-GPU C++ testing.

We can reference the cuML repo to implement ths.

e.g.
https://github.com/rapidsai/cuml/blob/branch-0.16/cpp/test/CMakeLists.txt#L78

##############################################################################
# - build test_ml_mg executable ----------------------------------------------

if(BUILD_CUML_MG_TESTS)
  find_package(MPI)

  if(MPI_CXX_FOUND)
    # (please keep the filenames in alphabetical order)
    add_executable(ml_mg
      mg/knn.cu
      mg/knn_classify.cu
      mg/knn_regress.cu
      mg/main.cu
      mg/pca.cu
      mg/test_ml_mg_utils.cu)

    add_dependencies(ml_mg cumlcommsmpi)

    set(CUML_TEST_INCLUDE_DIRS
        ${CUML_TEST_INCLUDE_DIRS}
        ${MPI_CXX_INCLUDE_PATH}
        ${CMAKE_CURRENT_SOURCE_DIR}/../comms/mpi/include)

    target_include_directories(ml_mg PUBLIC ${CUML_TEST_INCLUDE_DIRS})

    set(CUML_TEST_LINK_LIBRARIES 
    	${CUML_TEST_LINK_LIBRARIES} 
    	${MPI_CXX_LIBRARIES}
    	cumlcommsmpi
    	cumlprims)

    target_link_libraries(ml_mg ${CUML_TEST_LINK_LIBRARIES})

  else(MPI_CXX_FOUND)
   message("OpenMPI not found. Skipping 'ml_mg'")
  endif(MPI_CXX_FOUND)
endif(BUILD_CUML_MG_TESTS)
@seunghwak seunghwak added the ? - Needs Triage Need team to review and classify label Sep 8, 2020
@BradReesWork BradReesWork added feature request New feature or request and removed ? - Needs Triage Need team to review and classify labels Sep 17, 2020
@BradReesWork BradReesWork added this to the 0.16 milestone Sep 17, 2020
@BradReesWork BradReesWork modified the milestones: 0.16, 0.17 Sep 28, 2020
@BradReesWork BradReesWork modified the milestones: 0.17, 0.18 Nov 23, 2020
@afender afender added the tests label Nov 30, 2020
rapids-bot bot pushed a commit that referenced this issue Feb 10, 2021
…t using it (#1361)

Added initial infrastructure for MG C++ testing and a Pagerank MG test using it.

<s>Still a WIP, need to:</s>
* <s>Shuffle step is currently failing</s>
* <s>`graph_t` ctor expensive check is failing</s>
* <s>Finish comparison code to reference SG Pagerank results</s>
* <s>Fix the `#include` guard hack in `test_utilities.hpp`</s>
* <s>Lots of cleanup</s>
* <s>Refactor common steps into proper `SetUp()` and `TearDown()` functions</s>

closes #1136

Authors:
  - Rick Ratzel (@rlratzel)
  - Seunghwa Kang (@seunghwak)

Approvers:
  - Brad Rees (@BradReesWork)
  - Andrei Schaffer (@aschaffer)
  - Chuck Hastings (@ChuckHastings)

URL: #1361
rapids-bot bot pushed a commit that referenced this issue Mar 1, 2021
- [x] Add tests using graphs with isolated vertices
- [x] Add personalized PageRank tests
- [x] Test code refactoring
- [x] Create libcugraphtestutil.a

This PR fixes FIXMEs added in #1361 to address #1136

Authors:
  - Seunghwa Kang (@seunghwak)

Approvers:
  - Rick Ratzel (@rlratzel)
  - Andrei Schaffer (@aschaffer)
  - Brad Rees (@BradReesWork)

URL: #1419
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants