Cherry pick documentation content for 6.3 #34

Merged (2 commits) on Dec 5, 2024
19 changes: 15 additions & 4 deletions README.md
@@ -73,13 +73,24 @@ Please report in the Github Issues.
- **Need for Cold Restart**: In the event of a hardware freeze, you may need to perform a cold restart (turning the hardware off and on) to restore normal operations.
Please use this beta feature cautiously. It may affect your system's stability and performance. Proceed at your own risk.

- At this point, we do not recommend stress-testing the beta implementation.

- Correlation IDs provided by the PC sampling service are verified only for HIP API calls.

- Timestamps in PC sampling records might not be 100% accurate.

- Using PC sampling on multi-threaded applications might fail with `HSA_STATUS_ERROR_EXCEPTION`. In particular, if PC sampling is enabled and three or more threads launch operations to the same agent, `HSA_STATUS_ERROR_EXCEPTION` might be raised.

- Navi3x requires a stable power state for counter collection.
Currently, the user must set this state.
To do so, make `power_dpm_force_performance_level` writable for non-root users, then set the performance level to `profile_standard`:

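```bash
# Make the performance-level file writable by non-root users (card0 is the first card)
sudo chmod a+w /sys/class/drm/card0/device/power_dpm_force_performance_level
# Request the stable power state used for counter collection
echo profile_standard > /sys/class/drm/card0/device/power_dpm_force_performance_level
```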

Recommended: `profile_standard` for counter collection and `auto` for all other profiling. Use `rocm-smi` to verify the current power state. On multi-GPU systems (including systems with integrated graphics), replace `card0` with the desired card.
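The set-then-restore step above can be wrapped in a small helper so that the level returns to its previous value (for example `auto`) after counter collection, even when the profiled command fails. This is a minimal sketch, not part of rocprofiler-sdk: `with_perf_level` and `PERF_LEVEL_FILE` are hypothetical names, and the sketch assumes the file has already been made writable as described above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command under a given DPM performance level,
# then restore whatever level was active before.
# Override PERF_LEVEL_FILE for a card other than card0.
PERF_LEVEL_FILE="${PERF_LEVEL_FILE:-/sys/class/drm/card0/device/power_dpm_force_performance_level}"

# Usage: with_perf_level LEVEL COMMAND [ARGS...]
with_perf_level() {
    local level="$1"
    shift
    local previous
    previous="$(cat "$PERF_LEVEL_FILE")" || return 1  # remember the current level
    echo "$level" > "$PERF_LEVEL_FILE" || return 1      # needs write access (see chmod above)
    local status=0
    "$@" || status=$?                                   # run the command, keep its exit status
    echo "$previous" > "$PERF_LEVEL_FILE"               # restore even if the command failed
    return "$status"
}
```

For example, `with_perf_level profile_standard ./collect_counters.sh`, where `collect_counters.sh` stands in for your own counter-collection command. Restoring the saved value instead of hard-coding `auto` preserves any level you had set deliberately.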

> [!WARNING]
> The latest mainline version of AQLprofile is available at [https://repo.radeon.com/rocm/misc/aqlprofile/](https://repo.radeon.com/rocm/misc/aqlprofile/). However, the public AQLprofile package may be updated less frequently than rocprofiler-sdk, which can lead to a mismatch between the AQLprofile binary and the rocprofiler-sdk source.
2 changes: 1 addition & 1 deletion source/docs/api-reference/tool_library.md
@@ -7,7 +7,7 @@ myst:

# ROCprofiler-SDK tool library

- The tool library utilizes APIs from `rocprofiler-sdk` and `rocprofiler-register` libraries for profiling and tracing HIP applications. This document provides information to help you design a tool by utilizing the `rocprofiler-sdk` and `rocprofiler-register` libraries efficiently. The command-line tool `rocprofv3` is also built on `librocprofiler-sdk-tool.so.0.4.0`, which uses these libraries.
+ The tool library uses APIs from the `rocprofiler-sdk` and `rocprofiler-register` libraries to profile and trace HIP applications. This document explains how to design a tool that uses the `rocprofiler-sdk` and `rocprofiler-register` libraries efficiently. The command-line tool `rocprofv3` is also built on `librocprofiler-sdk-tool.so.X.Y.Z`, which uses these libraries.

## ROCm runtimes design

8 changes: 7 additions & 1 deletion source/docs/conceptual/comparing-with-legacy-tools.rst
@@ -383,4 +383,10 @@ ROCprofiler-SDK introduces a new command-line tool, `rocprofv3`, which is a more
Timing Difference Between rocprofv3 and rocprofv1/v2
========================================================

- Rocprofv3 has improved the accuracy of timing information by reducing the tool overhead required to collect data and reducing the interference to the timing of the kernel being measured. The result of this work is a reduction in variance of kernel times received for the same kernel execution and more accurate timing in general. These changes have not been backported (and will not be backported) to rocprofv1/v2, so there can be substantial (20%) differences in execution time reported by v1/v2 vs v3 for a single kernel execution. Over a large number of samples of the same kernel, the difference in average execution time is in the low single digit percentage time with a much tighter variance of results on rocprofv3. We have included testing in the test suite to verify the timing information outputted by rocprofv3 to ensure that the values we are returning are accurate.
+ ``rocprofv3`` improves the accuracy of timing information by reducing the overhead the tool adds while collecting data and the interference with the timing of the kernel being measured. As a result, kernel times vary less across executions of the same kernel and are more accurate in general. These changes have not been (and will not be) backported to rocprofv1/v2, so the execution time reported for a single kernel execution can differ substantially (20%) between v1/v2 and v3. Over a large number of samples of the same kernel, the difference in average execution time is in the low single-digit percentage range, and ``rocprofv3`` results show a much tighter variance. The test suite includes tests that verify the timing information output by ``rocprofv3`` to ensure that the returned values are accurate.
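To check the variance claim on your own data, you can compute the mean and spread of repeated executions of one kernel from each tool's CSV output. This is a minimal sketch, assuming the CSV has a header row and start/end timestamp columns in nanoseconds; the column names are passed as arguments because they differ between tools, and `duration_stats` is a hypothetical helper. It uses Welford's online algorithm for the mean and the population standard deviation, and prints `count mean_ns stddev_ns cv_percent`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: duration statistics for rows of a profiler CSV.
# Usage: duration_stats FILE START_COL END_COL [NAME_COL NAME]
# Prints: count mean_ns stddev_ns cv_percent (population standard deviation)
# Exit status: 2 if a column is missing, 1 if no rows matched.
duration_stats() {
    awk -v sc="$2" -v ec="$3" -v nc="${4:-}" -v want="${5:-}" '
    # Split one CSV line, honoring double-quoted fields that contain commas.
    function split_csv(line, out,    k, i, c, field, inq) {
        k = 0; field = ""; inq = 0
        for (i = 1; i <= length(line); i++) {
            c = substr(line, i, 1)
            if (c == "\"") { inq = !inq }
            else if (c == "," && !inq) { out[++k] = field; field = "" }
            else { field = field c }
        }
        out[++k] = field
        return k
    }
    NR == 1 {
        ncols = split_csv($0, hdr)
        for (i = 1; i <= ncols; i++) col[hdr[i]] = i
        if (!(sc in col) || !(ec in col) || (nc != "" && !(nc in col))) {
            print "duration_stats: column not found in header" > "/dev/stderr"
            bad = 1
            exit 2
        }
        next
    }
    {
        split_csv($0, f)
        if (nc != "" && f[col[nc]] != want) next
        x = f[col[ec]] - f[col[sc]]
        # Welford update: running mean and sum of squared deviations
        n++; d = x - mean; mean += d / n; m2 += d * (x - mean)
    }
    END {
        if (bad) exit 2
        if (n == 0) exit 1
        sd = sqrt(m2 / n)
        printf "%d %.1f %.1f %.2f\n", n, mean, sd, (mean != 0 ? 100 * sd / mean : 0)
    }' "$1"
}
```

For example, `duration_stats kernel_trace.csv Start_Timestamp End_Timestamp Kernel_Name my_kernel`, assuming your kernel trace file uses those column names (check its header). A lower `cv_percent` across runs indicates tighter timing.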

========================================================
Default run of rocprofv3 and rocprofv1/v2
========================================================

``rocprofv3`` behaves differently from rocprofv1/v2 when run without any options. By default, ``rocprofv3`` collects information about all available agents on the system and writes it in CSV format, whereas rocprofv1/v2 wrote kernel traces in CSV format by default. To obtain kernel traces with ``rocprofv3``, use the ``--kernel-trace`` option.
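If your scripts relied on the rocprofv1/v2 default, a thin wrapper can pass `--kernel-trace` explicitly. This is a minimal sketch: `trace_kernels` and the `ROCPROFV3` override variable are hypothetical, and it assumes the usual `--` separator between `rocprofv3` options and the application command line (check `rocprofv3 --help` for your version).

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: collect kernel traces with rocprofv3, the output
# rocprofv1/v2 produced by default. ROCPROFV3 can point at a specific binary.
ROCPROFV3="${ROCPROFV3:-rocprofv3}"

# Usage: trace_kernels APP [APP_ARGS...]
trace_kernels() {
    if [ "$#" -eq 0 ]; then
        echo "usage: trace_kernels APP [APP_ARGS...]" >&2
        return 2
    fi
    # "--" separates rocprofv3 options from the application command line.
    "$ROCPROFV3" --kernel-trace -- "$@"
}
```

For example, `trace_kernels ./my_app --size 8`, where `my_app` is a placeholder for your application.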
2 changes: 2 additions & 0 deletions source/docs/data/hip_domain_stats.csv
@@ -0,0 +1,2 @@
"Name","Calls","TotalDurationNs","AverageNs","Percentage","MinNs","MaxNs","StdDev"
"HIP_API",13,458514859,35270373.769231,100.00,2300,352276613,99315857.546240
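Each `AverageNs` value in this summary should equal `TotalDurationNs / Calls`; for the `HIP_API` row, 458514859 / 13 = 35270373.769231 (rounded to six decimals). The following minimal sketch recomputes the average for every row, assuming a header row like the one above and names without embedded commas. `check_stats_averages` is a hypothetical helper, not a `rocprofv3` command.

```bash
#!/usr/bin/env bash
# Hypothetical helper: recompute AverageNs = TotalDurationNs / Calls for each row
# of a stats CSV and flag rows where the reported average disagrees.
# Usage: check_stats_averages FILE   (exit status 1 if any row mismatches)
check_stats_averages() {
    awk -F',' '
    NR == 1 {
        gsub(/"/, "")                      # strip quotes, then map header names to columns
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }
    {
        gsub(/"/, "")
        if ($(col["Calls"]) == 0) next     # avoid dividing by zero
        avg = $(col["TotalDurationNs"]) / $(col["Calls"])
        diff = avg - $(col["AverageNs"])
        if (diff < 0) diff = -diff
        ok = (diff < 0.001)                # reported values carry six decimals
        if (!ok) bad = 1
        printf "%s %.6f %s\n", $(col["Name"]), avg, (ok ? "OK" : "MISMATCH")
    }
    END { exit (bad ? 1 : 0) }' "$1"
}
```

Running it on the two lines above prints `HIP_API 35270373.769231 OK`.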
22 changes: 22 additions & 0 deletions source/docs/data/rccl_trace.csv
@@ -0,0 +1,22 @@
"Domain","Function","Process_Id","Thread_Id","Correlation_Id","Start_Timestamp","End_Timestamp"
"RCCL_API","ncclGetVersion",1834151,1834151,416,18413845573432,18413845577374
"RCCL_API","ncclGetUniqueId",1834151,1834151,1116,18413961300878,18413963267869
"RCCL_API","ncclGetUniqueId",1834151,1834151,1481,18414166449182,18414166720831
"RCCL_API","ncclGroupStart",1834151,1834151,1482,18414166723772,18414166726834
"RCCL_API","ncclGroupEnd",1834151,1834151,1490,18414166823575,18414380520973
"RCCL_API","ncclCommInitAll",1834151,1834151,1477,18414166402665,18414380522536
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89098,18414380660695,18414380661652
"RCCL_API","ncclAllReduce",1834151,1834151,89097,18414380653860,18414380693574
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89108,18414380694631,18414380694659
"RCCL_API","ncclAllReduce",1834151,1834151,89107,18414380694212,18414380704722
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89117,18414380706650,18414380706677
"RCCL_API","ncclAllReduce",1834151,1834151,89116,18414380705574,18414380715055
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89126,18414380715749,18414380715774
"RCCL_API","ncclAllReduce",1834151,1834151,89125,18414380715463,18414380723944
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89135,18414380724688,18414380724715
"RCCL_API","ncclAllReduce",1834151,1834151,89134,18414380724395,18414380732209
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89154,18414380746383,18414380746411
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89157,18414380749863,18414380749889
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89160,18414380751671,18414380751696
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89163,18414380753326,18414380753353
"RCCL_API","ncclCommGetAsyncError",1834151,1834151,89166,18414380755128,18414380755154
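A per-function summary in the style of `hip_domain_stats.csv` can be derived from this trace by subtracting `Start_Timestamp` from `End_Timestamp` for each call. This is a minimal sketch, assuming nanosecond timestamps, a header row like the one above, and function names without embedded commas; `summarize_api_trace` is hypothetical and omits the `StdDev` column. `Percentage` here is each function's share of the summed call durations in the file, and rows are sorted by total duration, largest first.

```bash
#!/usr/bin/env bash
# Hypothetical helper: per-function call statistics from an API trace CSV
# with Function, Start_Timestamp and End_Timestamp columns (nanoseconds assumed).
# Usage: summarize_api_trace FILE
summarize_api_trace() {
    echo '"Name","Calls","TotalDurationNs","AverageNs","Percentage","MinNs","MaxNs"'
    awk -F',' '
    NR == 1 {
        gsub(/"/, "")                      # map header names to column numbers
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }
    {
        gsub(/"/, "")
        name = $(col["Function"])
        d = $(col["End_Timestamp"]) - $(col["Start_Timestamp"])
        calls[name]++
        total[name] += d
        grand += d
        if (!(name in lo) || d < lo[name]) lo[name] = d
        if (!(name in hi) || d > hi[name]) hi[name] = d
    }
    END {
        for (name in calls)
            printf "\"%s\",%d,%d,%.6f,%.2f,%d,%d\n", name, calls[name], total[name],
                total[name] / calls[name], (grand > 0 ? 100 * total[name] / grand : 0),
                lo[name], hi[name]
    }' "$1" | sort -t',' -k3,3nr    # largest total duration first
}
```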
Binary file added source/docs/data/rocprofv3_summary.png