Implement occupancy for self-collision #815

Draft
wants to merge 2 commits into master

Conversation

@aprokop (Contributor) commented on Jan 6, 2023

No description provided.

@aprokop added the "performance" label (Something is slower than it should be) on Jan 6, 2023
@aprokop (Contributor, Author) commented on Jan 6, 2023

Summit

Using Kokkos develop (f2da62d0e).

summit_results.zip

DBSCAN (GeoLife, minPts = 2, eps = 1e-4)

Occupancy Time (s)
default   0.120
10        0.157
20        0.105
30        0.105
40        0.095
50        0.096
60        0.094
70        0.096
80        0.098
90        0.097
100       0.121

DBSCAN (HACC 37M, minPts = 5, eps = 0.042)

Occupancy Time (s)
default   0.219
10        0.228
20        0.146
30        0.146
40        0.125
50        0.125
60        0.125
70        0.131
80        0.131
90        0.131
100       0.220

DBSCAN (HACC 37M, minPts = 2, eps = 0.042)

Occupancy Time (s)
default   0.170
10        0.220
20        0.141
30        0.142
40        0.122
50        0.117
60        0.117
70        0.121
80        0.119
90        0.119
100       0.165

DBSCAN (uniform100M3, minPts = 5, eps = 0.002)

Occupancy Time (s)
default   0.262
10        0.310
20        0.209
30        0.209
40        0.190
50        0.192
60        0.192
70        0.203
80        0.203
90        0.203
100       0.262

Molecular dynamics (100^3)

Count (ArborX::Experimental::HalfNeighborList::Count)

Occupancy Time (s)
default   4.37e-02
10        3.81e-02
20        3.26e-02
30        3.26e-02
40        3.22e-02
50        3.24e-02
60        3.25e-02
70        3.28e-02
80        3.27e-02
90        3.27e-02
100       3.91e-02

Fill (ArborX::Experimental::HalfNeighborList::Fill)

Occupancy Time (s)
default   1.27e-01
10        3.75e-02
20        3.09e-02
30        3.09e-02
40        4.33e-02
50        6.30e-02
60        6.28e-02
70        8.14e-02
80        8.19e-02
90        8.10e-02
100       1.24e-01

@aprokop (Contributor, Author) commented on Jan 6, 2023

To emphasize the previous results for molecular dynamics:

Default occupancy [2x speedup]

|-> 3.78e-01 sec 41.8% 96.3% 0.0% 1.2% 4.76e+01 1 classic [region]
|-> 2.53e-01 sec 28.0% 78.4% 0.0% 1.3% 8.70e+01 1 half+expand [region]
|-> 1.81e-01 sec 20.0% 93.1% 0.0% 0.7% 9.39e+01 1 full [region]

30% occupancy [5x speedup !!!]

|-> 3.79e-01 sec 54.9% 96.2% 0.0% 1.2% 4.75e+01 1 classic [region]
|-> 1.43e-01 sec 20.8% 64.0% 0.0% 2.2% 1.53e+02 1 half+expand [region]
|-> 7.64e-02 sec 11.1% 83.6% 0.0% 1.6% 2.23e+02 1 full [region]

It seems that for the fill kernels it really is desirable to keep the occupancy low. For the lighter kernels (counts, union-find) it is not as important. And, to be fair, we could probably accelerate the original (classic) kernel by tuning its occupancy as well.

I still struggle to figure out how we should go about exposing this. Should we make it a value that the caller provides when invoking the half traversal?

Or maybe a better way is to make it tunable, so that a user can run the kernel a few times and have the tuner report the best value. A sketch of what a call-site interface could look like is below.
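
For illustration only, a minimal sketch of the call-site option, assuming a hypothetical occupancy_percent argument threaded down to the kernel launch. Kokkos::Experimental::prefer and Kokkos::Experimental::DesiredOccupancy are the existing Kokkos policy hints; the surrounding function and names are made up.

#include <Kokkos_Core.hpp>

// Sketch only: launch a kernel with a caller-provided occupancy hint.
// Kokkos::Experimental::prefer and Kokkos::Experimental::DesiredOccupancy
// are existing Kokkos policy hints; launchHalfTraversal, halfTraversalFunctor,
// and occupancy_percent are hypothetical names used for illustration.
template <class ExecutionSpace, class Functor>
void launchHalfTraversal(ExecutionSpace const &space, int n,
                         Functor const &halfTraversalFunctor,
                         int occupancy_percent)
{
  Kokkos::RangePolicy<ExecutionSpace> policy(space, 0, n);
  if (occupancy_percent > 0 && occupancy_percent < 100)
  {
    // Cap the kernel at a fraction of the maximum achievable occupancy.
    auto hinted_policy = Kokkos::Experimental::prefer(
        policy, Kokkos::Experimental::DesiredOccupancy(occupancy_percent));
    Kokkos::parallel_for("ArborX::Experimental::HalfTraversal", hinted_policy,
                         halfTraversalFunctor);
  }
  else
  {
    Kokkos::parallel_for("ArborX::Experimental::HalfTraversal", policy,
                         halfTraversalFunctor);
  }
}

The tunable option could instead obtain the value from a tools callback, which is what the APEX experiments further down in this thread do.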

@aprokop (Contributor, Author) commented on Jan 13, 2023

Just recording that I tried a similar trick with the regular spatial query in FDBSCAN-DenseBox and did not observe any meaningful improvement (4% at best). Run: HACC 497M (first 150M points), eps = 0.014, minPts = 2.

Occupancy Time (s)
default   0.886
10        1.689
20        1.085
30        1.085
40        0.925
50        0.870
60        0.871
70        0.850
80        0.850
90        0.850
100       0.886

@aprokop (Contributor, Author) commented on Feb 2, 2024

Initial results by @khuck using automated tuning with APEX are promising. For the standard HACC 37M problem, the tool converges on an occupancy value in the 70-90 range, which is in line with our experience.

[Figure: APEX-measured kernel time as a function of occupancy during the tuning run]

It’s pretty clear that the kernel times get shorter as the occupancy goes up to ~90 like you said, then the kernel times have a small bump in the range 90-100. After all values were tested, it converged on 70 for this case. In this example, I am using a tuning “window” of 5, so each occupancy value is tested 5 times and the minimum value is recorded as the response to the setting. This way I can account for any system noise that might confuse the search, but it does make the search 5x longer (500 of the 600 total iterations). The simulated annealing algorithm requires many more tests before converging, so that search algorithm probably isn’t relevant for this case without some parameter tweaking.
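
For reference, a minimal sketch of the tuning-window idea described above: each candidate occupancy in the [5, 100] range (step 5, matching the APEX configuration in the commit below) is timed several times and only the minimum is kept, to filter out system noise. The run_kernel callback is hypothetical; APEX performs the equivalent search through the Kokkos tuning hooks rather than with explicit timing like this.

#include <algorithm>
#include <chrono>
#include <functional>
#include <limits>

// Sketch only: exhaustive occupancy search with a tuning "window".
// run_kernel is a hypothetical callback that launches the kernel at the
// given occupancy percentage.
int tuneOccupancy(std::function<void(int)> const &run_kernel, int lower = 5,
                  int upper = 100, int step = 5, int window = 5)
{
  int best_occupancy = upper;
  double best_time = std::numeric_limits<double>::max();
  for (int occupancy = lower; occupancy <= upper; occupancy += step)
  {
    double min_time = std::numeric_limits<double>::max();
    for (int i = 0; i < window; ++i)
    {
      auto start = std::chrono::steady_clock::now();
      run_kernel(occupancy);
      std::chrono::duration<double> elapsed =
          std::chrono::steady_clock::now() - start;
      // Record the minimum over the window to filter out system noise.
      min_time = std::min(min_time, elapsed.count());
    }
    if (min_time < best_time)
    {
      best_time = min_time;
      best_occupancy = occupancy;
    }
  }
  return best_occupancy;
}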

khuck added a commit to khuck/kokkos that referenced this pull request Feb 6, 2024
The old Kokkos fork/branch from:
davidp	[email protected]:DavidPoliakoff/kokkos.git (fetch)
was merged with current Kokkos develop, and tested with ArborX to
confirm that autotuning occupancy for the DBSCAN benchmark worked.
In tests on a system with V100, the original benchmark when iterated
600 times took 119.064 seconds to run. During the tuning process
(using simulated annealing), the runtime was 108.014 seconds.
When using cached results, the runtime was 109.058 seconds. The
converged occupancy value was 70. Here are the cached results
from APEX autotuning:

Input_1:
  name: kokkos.kernel_name
  id: 1
  info.type: string
  info.category: categorical
  info.valueQuantity: unbounded
  info.candidates: unbounded
  num_bins: 0
Input_2:
  name: kokkos.kernel_type
  id: 2
  info.type: string
  info.category: categorical
  info.valueQuantity: set
  info.candidates: [parallel_for,parallel_reduce,parallel_scan,parallel_copy]
Output_3:
  name: ArborX::Experimental::HalfTraversal
  id: 3
  info.type: int64
  info.category: ratio
  info.valueQuantity: range
  info.candidates:
    lower: 5
    upper: 100
    step: 5
    open upper: 0
    open lower: 0
Context_0:
  Name: "[2:parallel_for,1:ArborX::Experimental::HalfTraversal,tree_node:default]"
  Converged: true
  Results:
    NumVars: 1
    id: 3
    value: 70

In manual experiments, the ArborX team determined that the optimal
occupancy for this example was between 40 and 90, which was a 10%
improvement over the baseline default of 100. See arborx/ArborX#815
for details.

One deviation from the branch that David had written: the occupancy
range is [5, 100] with a step size of 5. The original implementation
in Kokkos used [1, 100] with a step size of 1.
khuck added a commit to khuck/kokkos that referenced this pull request Feb 28, 2024
khuck added a commit to khuck/kokkos that referenced this pull request Mar 11, 2024
Note: This is a re-commit of a somehow polluted branch when I rebased on
develop. I started over with the 5 changed files.

Rombur pushed a commit to khuck/kokkos that referenced this pull request Jul 18, 2024
dalg24 pushed a commit to kokkos/kokkos that referenced this pull request Aug 21, 2024
* Merging occupancy tuning changes from David Poliakoff.

* Fixing formatting check, not sure how those reverted

* Fixing problems with recursive Impl namespace, MDRange Reduce tuning and OpenMP Reduce tuning. Now trying to fix Team tuning...

* removing comments that failed format check

* Removing commented code

* Final code fixes, likely to be some formatting fixes needed.

* Expected formatting changes

* Yet another formatting fix...

* Removing default operators and copy constructors that aren't needed

* Update core/src/impl/Kokkos_Profiling.hpp

Co-authored-by: Daniel Arndt <[email protected]>

* Fixing formatting check

* Clang-format complained about a newline

* Update Kokkos_Profiling.hpp

Minor fix to prevent incrementing the context id index when not calling `context_begin()`. In actuality, this should be refactored so that `begin_context()` increments the id, and returns it. `end_context()` is the only location that decrements the context id index.

* Unify [begin|end]_parallel_* APIs

* Merge more functionality

* Update TestViewMapping_a test

* Remove Reducers_d from MSVC tests

---------

Co-authored-by: Daniel Arndt <[email protected]>