Implement occupancy for self-collision #815
Conversation
[Benchmark results on Summit using Kokkos (numbers not recoverable). Configurations shown: DBSCAN (GeoLife, minPts = 2, eps = 1e-4); DBSCAN (HACC 37M, minPts = 5, eps = 0.042); DBSCAN (HACC 37M, minPts = 2, eps = 0.042); DBSCAN (uniform100M3, minPts = 5, eps = 0.002); molecular dynamics (100^3), Count and Fill kernels.]
To emphasize the previous results for molecular dynamics: default occupancy gives a 2x speedup, while 30% occupancy gives a 5x speedup.

It seems that for the fill kernels it really is desirable to keep the occupancy low. For the lighter kernels (counts, union-find) it is not as important. And, to be fair, we can probably accelerate the original kernel too.

I still struggle to figure out how we should go about it. Should we make occupancy a value that can be provided when calling the half traversal? Or maybe a better way is to make it tunable, so that a user can run a kernel a few times and the tuning would spit out the best value.
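One way to provide the value at the call site would be through the occupancy hint that Kokkos execution policies already support. Below is a minimal sketch with a placeholder kernel and an arbitrary 30% value; it is not the actual ArborX traversal code:

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    int const n = 1 << 20;
    Kokkos::View<int*> out("out", n);

    // Attach an occupancy hint to the policy: the backend will launch the
    // kernel at roughly 30% of maximum occupancy instead of the default.
    auto policy = Kokkos::Experimental::prefer(
        Kokkos::RangePolicy<>(0, n),
        Kokkos::Experimental::DesiredOccupancy{30});

    // Placeholder for a heavy fill-style kernel.
    Kokkos::parallel_for(
        "fill_sketch", policy, KOKKOS_LAMBDA(int i) { out(i) = i; });
    Kokkos::fence();
  }
  Kokkos::finalize();
}
```

Exposing it as a user-facing knob (or as a tuned value) would then amount to constructing the policy with a different `DesiredOccupancy`.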
Just recording that I tried a similar trick with a regular spatial query for FDBSCAN-DenseBox, and have not observed any meaningful improvement (~4%). Run: HACC 497M (first 150M points), eps = 0.014, minPts = 2.
Force-pushed from b0b4493 to 64478e3.
Initial results by @khuck using automated tuning with APEX are promising. For the standard HACC 37M problem, the tool converges to an occupancy value in the 70-90 range, which is in line with our experience.
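For comparison, the "run the kernel a few times and pick the best value" idea from the earlier comment can be sketched as a brute-force sweep over candidate occupancies (here 5 to 100 in steps of 5), timing each launch. This is only an illustration with a stand-in kernel, not the actual HalfTraversal functor:

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    int const n = 1 << 22;
    Kokkos::View<float*> data("data", n);

    int best_occupancy = 100;
    double best_time = 1e30;

    // Sweep candidate occupancies 5, 10, ..., 100 and keep the fastest one.
    for (int occ = 5; occ <= 100; occ += 5) {
      auto policy = Kokkos::Experimental::prefer(
          Kokkos::RangePolicy<>(0, n),
          Kokkos::Experimental::DesiredOccupancy{occ});

      Kokkos::Timer timer;
      Kokkos::parallel_for(
          "occupancy_sweep", policy,
          KOKKOS_LAMBDA(int i) { data(i) = 0.5f * i; });  // stand-in workload
      Kokkos::fence();

      double const t = timer.seconds();
      if (t < best_time) {
        best_time = t;
        best_occupancy = occ;
      }
    }
    std::printf("best occupancy: %d%% (%.6f s)\n", best_occupancy, best_time);
  }
  Kokkos::finalize();
}
```

A tool like APEX automates this kind of search (using simulated annealing rather than an exhaustive sweep) and caches the converged value between runs.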
* Merging occupancy tuning changes from David Poliakoff.

  Note: this is a re-commit of a branch that somehow got polluted when I rebased on develop; I started over with the 5 changed files.

  The old Kokkos fork/branch from the `davidp` remote ([email protected]:DavidPoliakoff/kokkos.git) was merged with current Kokkos develop and tested with ArborX to confirm that autotuning occupancy for the DBSCAN benchmark works. In tests on a system with a V100, the original benchmark, iterated 600 times, took 119.064 seconds to run. During the tuning process (using simulated annealing), the runtime was 108.014 seconds. When using cached results, the runtime was 109.058 seconds. The converged occupancy value was 70.

  Cached results from APEX autotuning:

  Input_1:
    name: kokkos.kernel_name
    id: 1
    info.type: string
    info.category: categorical
    info.valueQuantity: unbounded
    info.candidates: unbounded
    num_bins: 0
  Input_2:
    name: kokkos.kernel_type
    id: 2
    info.type: string
    info.category: categorical
    info.valueQuantity: set
    info.candidates: [parallel_for,parallel_reduce,parallel_scan,parallel_copy]
  Output_3:
    name: ArborX::Experimental::HalfTraversal
    id: 3
    info.type: int64
    info.category: ratio
    info.valueQuantity: range
    info.candidates: lower: 5 upper: 100 step: 5 open upper: 0 open lower: 0
  Context_0:
    Name: "[2:parallel_for,1:ArborX::Experimental::HalfTraversal,tree_node:default]"
    Converged: true
    Results: NumVars: 1 id: 3 value: 70

  In manual experiments, the ArborX team determined that the optimal occupancy for this example was between 40 and 90, which gave about a 10% improvement over the baseline default of 100. See arborx/ArborX#815 for details. One deviation from the branch David had written: the occupancy range is [5-100] with a step size of 5, whereas the original implementation in Kokkos used [1-100] with a step size of 1.

* Fixing formatting check, not sure how those reverted
* Fixing problems with recursive Impl namespace, MDRange Reduce tuning and OpenMP Reduce tuning. Now trying to fix Team tuning...
* Removing comments that failed format check
* Removing commented code
* Final code fixes, likely to be some formatting fixes needed
* Expected formatting changes
* Yet another formatting fix...
* Removing default operators and copy constructors that aren't needed
* Update core/src/impl/Kokkos_Profiling.hpp (Co-authored-by: Daniel Arndt <[email protected]>)
* Fixing formatting check
* Clang-format complained about a newline
* Update Kokkos_Profiling.hpp: minor fix to prevent incrementing the context id index when not calling `context_begin()`. In actuality, this should be refactored so that `begin_context()` increments the id and returns it; `end_context()` is the only location that decrements the context id index.
* Unify [begin|end]_parallel_* APIs
* Merge more functionality
* Update TestViewMapping_a test
* Remove Reducers_d from MSVC tests

Co-authored-by: Daniel Arndt <[email protected]>