Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SG TSP #1360

Merged
merged 112 commits into from
Feb 11, 2021
Merged

Add SG TSP #1360

merged 112 commits into from
Feb 11, 2021

Conversation

hlinsen
Copy link
Contributor

@hlinsen hlinsen commented Jan 27, 2021

This PR implements an approximated solution to the Traveling Salesperson Problem (TSP).
The algorithm is exposed under traversal through a Python API taking 2D pos as input and returning a route.
This PR relies on RAFT KNN: rapidsai/raft#126
Solves: #1185

@hlinsen hlinsen requested review from a team as code owners January 27, 2021 18:50
@BradReesWork BradReesWork added this to the 0.18 milestone Jan 29, 2021
Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't review the TSP details as much, but did notice a few things that'll need to be addressed.

cpp/CMakeLists.txt Outdated Show resolved Hide resolved
cpp/tests/CMakeLists.txt Outdated Show resolved Hide resolved
datasets/get_test_data.sh Show resolved Hide resolved
python/cugraph/layout/force_atlas2_wrapper.pyx Outdated Show resolved Hide resolved
python/cugraph/tests/test_traveling_salesman.py Outdated Show resolved Hide resolved
python/cugraph/tests/utils.py Outdated Show resolved Hide resolved
python/cugraph/traversal/traveling_salesman.py Outdated Show resolved Hide resolved
@hlinsen hlinsen changed the title [WIP][skip-ci] Add SG TSP Add SG TSP Feb 3, 2021
@BradReesWork BradReesWork added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 3, 2021
cpp/include/algorithms.hpp Show resolved Hide resolved
cpp/include/algorithms.hpp Outdated Show resolved Hide resolved
* @param[in] verbose Logs configuration and iterative improvement.
*
*/
float traveling_salesman(raft::handle_t &handle,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why a float return type versus void?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently to get my outputs I return the minimum cost and the path is passed as a pointer.

Copy link
Member

@afender afender Feb 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This isn't documented
  2. How do you pick what gets returned as an argument and what doesn't? It seems we are mixing both at the moment. Consider using a consistent output mechanism.
  3. Another possible option for routes: New APIs are returning rmm::uvector which is often easier to deal with than raw pointers.

@BradReesWork
Copy link
Member

rerun tests

Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few more items I noticed.

python/cugraph/tests/utils.py Outdated Show resolved Hide resolved
cpp/CMakeLists.txt Show resolved Hide resolved
cpp/src/traversal/tsp.cu Show resolved Hide resolved
cpp/src/traversal/tsp.cu Outdated Show resolved Hide resolved
cpp/src/traversal/tsp.cu Outdated Show resolved Hide resolved
cpp/src/traversal/tsp.cu Show resolved Hide resolved
cpp/src/traversal/tsp.cu Outdated Show resolved Hide resolved
best_route = nullptr;
}

__global__ __launch_bounds__(2048, 2) void two_opt_search(int *mylock,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a quite large hand-written kernel. Two issues:
(1.) Couldn't be this designed using Thrust/CUB primitives?
(2.) Is there a possibility to split this kernel into smaller __device__ functions, so that reading and maintaining this code later would become possible / easier ?

cpp/src/traversal/tsp_utils.hpp Outdated Show resolved Hide resolved
cpp/src/traversal/tsp_utils.hpp Outdated Show resolved Hide resolved
@codecov-io
Copy link

codecov-io commented Feb 7, 2021

Codecov Report

Merging #1360 (81caaf8) into branch-0.18 (2fb0725) will increase coverage by 0.37%.
The diff coverage is 62.28%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18    #1360      +/-   ##
===============================================
+ Coverage        60.38%   60.75%   +0.37%     
===============================================
  Files               67       70       +3     
  Lines             3029     3134     +105     
===============================================
+ Hits              1829     1904      +75     
- Misses            1200     1230      +30     
Impacted Files Coverage Δ
python/cugraph/centrality/__init__.py 100.00% <ø> (ø)
python/cugraph/dask/structure/renumber.py 0.00% <0.00%> (ø)
python/cugraph/link_analysis/pagerank.py 100.00% <ø> (ø)
python/cugraph/traversal/__init__.py 100.00% <ø> (ø)
python/cugraph/comms/comms.py 34.52% <25.00%> (ø)
python/cugraph/dask/common/input_utils.py 23.07% <28.57%> (+1.14%) ⬆️
python/cugraph/dask/common/mg_utils.py 37.50% <38.09%> (-2.50%) ⬇️
python/cugraph/community/spectral_clustering.py 72.54% <38.46%> (-11.67%) ⬇️
python/cugraph/structure/number_map.py 59.20% <50.00%> (+3.24%) ⬆️
python/cugraph/structure/graph.py 66.99% <76.47%> (+0.19%) ⬆️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 466e29a...81caaf8. Read the comment docs.

@hlinsen
Copy link
Contributor Author

hlinsen commented Feb 8, 2021

rerun tests

Copy link
Member

@afender afender left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool code. Just reviewed C++ part for now.

In addition to the comments, I'd like to see some performance data and a basic profile (ie. what's the perf bottleneck).

Kernels are nice but I would have preferred if there was more of Graph Prims, Thrust, and RAFT in this PR. API/Doc needs polishing. I also added some comments regarding integration and testing, for instance, I see an API that doesn't take graphs but graphs are passed in tests in a custom format that really looks like regular mtx/csv, so I think this should be reconciled.

* handle, the multi GPU version will be selected.
* @param[in] vtx_ptr Device array containing the vertex identifiers used
* to initialize the route.
* @param[out] route Device array containing the returned route.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No graph_t API?

* @param[in] verbose Logs configuration and iterative improvement.
*
*/
float traveling_salesman(raft::handle_t &handle,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider const raft::handle_t &handle instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created a PR to make the brute_force_knn handle const for 0.19

* @param[in] verbose Logs configuration and iterative improvement.
*
*/
float traveling_salesman(raft::handle_t &handle,
Copy link
Member

@afender afender Feb 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This isn't documented
  2. How do you pick what gets returned as an argument and what doesn't? It seems we are mixing both at the moment. Consider using a consistent output mechanism.
  3. Another possible option for routes: New APIs are returning rmm::uvector which is often easier to deal with than raw pointers.

cpp/src/traversal/tsp.cu Outdated Show resolved Hide resolved
h_y_pos.data(), soln + nodes_ + 1, sizeof(float) * (nodes_ + 1), cudaMemcpyDeviceToHost));
cudaDeviceSynchronize();

for (int i = 0; i < nodes_; ++i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So nodes_ is expected to always be smaller than 2B in the future or is this something we should template like in the experimental APIs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now the problem size we target is up to 2k nodes

cpp/src/traversal/tsp_utils.hpp Outdated Show resolved Hide resolved
cpp/tests/traversal/tsp_test.cu Show resolved Hide resolved
return tokens;
}

int load_tsp(const char* fname, Route* input)
Copy link
Member

@afender afender Feb 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use csv files and I/O?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to compare against TSPLIB which solves TSP as a city problem, so the format corresponds to 2D positions as input.

Copy link
Member

@afender afender Feb 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see how it would be relevant in the context of a demo/benchmark. Not sure if that's a strong enough motivation in the context of cugraph testing though. It seems to me it is going in the opposite direction than what @rlratzel is trying to do for making testing more generic. I am ok not blocking the PR for this but we should probably have a FiXme/Todo for test refactoring work (if that's ok with Joseph/Rick).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A FIXME as Alex suggested isn't a bad idea, at the very least to capture that TSP deviates from the standard testing I/O pattern and we may want to revisit adopting a more universal approach (instead of or in addition to comparing using TSPLIB's input format).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the fixme in this commit: f85cb0d

DIMENSION: 280
EDGE_WEIGHT_TYPE : EUC_2D
NODE_COORD_SECTION
1 288 149
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like all these files are really graph/sparse matrices rather than general euclidian space positions.
Are we supporting only positive integers within the same range?
If yes this should be mentioned in the API doc.
If not other test data should be added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're supporting float data, the tsp225 file contains purely floats. There are no constraints on the range between the cities

Copy link
Member

@afender afender Feb 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Should we test negative coordinates? parallel edges?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added file with negative weights

* @param[in] vtx_ptr Device array containing the vertex identifiers used
* to initialize the route.
* @param[out] route Device array containing the returned route.
* @param[in] x_pos Device array containing starting x-axis positions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the argument ordering logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was following a priority ordering on the parameters for the Python API. The cpp layer follows the Python API except for the route that is a returned parameter. I will work on updating the return and use a float and rmm::uvector tuple instead

Copy link
Contributor Author

@hlinsen hlinsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool code. Just reviewed C++ part for now.

In addition to the comments, I'd like to see some performance data and a basic profile (ie. what's the perf bottleneck).

Kernels are nice but I would have preferred if there was more of Graph Prims, Thrust, and RAFT in this PR. API/Doc needs polishing. I also added some comments regarding integration and testing, for instance, I see an API that doesn't take graphs but graphs are passed in tests in a custom format that really looks like regular mtx/csv, so I think this should be reconciled.

Thanks for the review. I'll work on adding timers. The current TSP implementation considers the graph is fully connected when looking for the best path. If we used a graph representation this means we would be O(V^2) in memory. We also consider than recomputing the distances is faster than storing them in shared memory and allows us to run on problems with larger size (more than 2k cities). One of the next step is indeed to reconcile with the rest of the API and take a graph as input but this would mean we have to refactor the kernel logic to look only at specific edges and have an efficient look up table (Cuco should have that) to get the weights. These optimizations are on the roadmap for next releases.

cpp/CMakeLists.txt Show resolved Hide resolved
python/cugraph/tests/test_traveling_salesman.py Outdated Show resolved Hide resolved
python/cugraph/traversal/traveling_salesman.py Outdated Show resolved Hide resolved
@hlinsen
Copy link
Contributor Author

hlinsen commented Feb 10, 2021

rerun tests

Copy link
Contributor

@rlratzel rlratzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few minor change requests.

build.sh Outdated Show resolved Hide resolved
return tokens;
}

int load_tsp(const char* fname, Route* input)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A FIXME as Alex suggested isn't a bad idea, at the very least to capture that TSP deviates from the standard testing I/O pattern and we may want to revisit adopting a more universal approach (instead of or in addition to comparing using TSPLIB's input format).

python/cugraph/tests/test_traveling_salesperson.py Outdated Show resolved Hide resolved
python/cugraph/tests/test_traveling_salesperson.py Outdated Show resolved Hide resolved
python/cugraph/traversal/traveling_salesperson.py Outdated Show resolved Hide resolved
python/cugraph/traversal/traveling_salesperson.py Outdated Show resolved Hide resolved
conda/recipes/libcugraph/meta.yaml Show resolved Hide resolved
@BradReesWork
Copy link
Member

rerun test

@BradReesWork
Copy link
Member

rerun tests

@BradReesWork
Copy link
Member

@gpucibot merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants