Omnitrace docs refactoring (ROCm#353)

* Add Sphinx and Read the Docs configs
* Add documentation workflow configurations
* Changed macros verbprintf and verbprintf_bare so they write to stdout… (ROCm#346)
  Flush stdout when listing keys + bump verbose level for GPU count
* Removing static version asserts. (ROCm#347)
  It is causing failures on our internal builds
  Signed-off-by: David Galiffi <[email protected]>
* Check for an empty vector before popping (ROCm#350)
  Protect from possible seg. fault
  Signed-off-by: David Galiffi <[email protected]>
* Add release links to installation.md (ROCm#351)
* Initial infrastructure rework for Omnitrace refactoring and a rewrite of the What is file
* Add files in conceptual section, along with images and infrastructure changes.
* Formatting and style fixes for files in conceptual directory
* Add quick start install guide and fix spelling errors in other files
* Add install document and fix code tags. Infrastructure changes
* Add two how-to guides along with infra changes and spelling fixes
* Add two new how to files and fix errors in the last commit
* Fix spelling mistakes
* Add new how to file on causal profiling and infra changes.
* Add how to file on interpreting Omnitrace output, fixes, and images
* Add remaining how-to guides and reference materials along with fixes and infrastructure
* Add YouTube file and fix spelling and formatting
* Fix a few loose ends and add link to license page
* Add Sphinx and Doxygen infrastructure and some additional corrections
* Update rocm-docs-core
* Fix Doxyfile
* Fix path to API header files
* Run doxysphinx in conf.py
* Add back custom css for doxygen
* Remove doxygenlayout
* Add api to toc
* Update Doxyfile
  Generate from source .in
* Proofreading edits and other changes
* Add .gitignore for Doxygen and remove deprecated words and typos
* Fix one additional typo
* Turn off dot
* Update doxyfile strip from path
* Workflow, submodules, and thread info Updates (ROCm#352)
* Update CI workflows
  - use node20 workflow packages
* Update tests/source/CMakeLists.txt
  - Use OMNITRACE_TRACE and OMNTRACE_PROFILE instead of perfetto/timemory
* Update timemory submodule
  - argparse: requires -> required
  - parse callbacks
* Update thread_info.cpp
  - fix causal::delay::get_local usage
* Update timemory submodule
* Update kokkos submodule
  - release 3.7.02
* Revert opensuse.yml and ubuntu-bionic.yml to use node16 workflows
* Update docs.yml
* ROCm 6.1 Installers (ROCm#349)
* Add ROCm 6.1 to packages
* Bump version to 1.11.3
* Add 6.1 support to the docker build support.
  Simplified this by adding 6.* to case statements, now that repo links have been standardized.
* Update timemory submodule (ROCm#354)
  - fix argparse::argument::required template deduction
* Build omnitrace-rt library (ROCm#355)
* Build omnitrace-rt library
  - Explicitly build dyninstAPI_RT as omnitrace-rt so that the SONAME in the ELF is omnitrace-rt instead of dyninstAPI_RT
  - Create symbolic link lib/omnitrace/libdyninstAPI_RT.so which points to lib/libomnitrace-rt.so
  - Simplify build tree location of libomnitrace-rt.so since it is ../lib from the bin directory even in the build tree
  - Update dyninst submodule with minor tweaks to dyninstAPI_RT/CMakeLists.txt
* Update source/lib/omnitrace-rt/cmake/platform.cmake
* Use ftpmirror.gnu.org instead of ftp.gnu.org
  - in timemory and dyninst submodules
  - minor .clang-tidy tweak
* Executables append omnitrace library directory to LD_LIBRARY_PATH (ROCm#356)
  - omnitrace-run, omnitrace-sample, and omnitrace-causal now automatically append the LD_LIBRARY_PATH with the directory containing the omnitrace libraries
  - this helps ensure that binary rewritten exes can resolve omnitrace-rt library location
* Fix a few typos and formatting issues
* Additional fixes and minor formatting changes.
* More fixes and minor formatting changes.
* Complete second proofreading with fixes and minor formatting changes.
* Make changes to table of contents and disable linting
* Update links in the README doc to reflect the new structure.
* Align intro on the Omnitrace index page with the first paragraph of the what-is page
* Changes and edits based on review comments
* Additional changes and edits based on external review
* Additional updates and changes from the external review of Omnitrace
* Additional changes based on the external review
* New round of edits based on the external review
* Additional edits based on the external review
* Changes to address comments from the internal review
* Correct to the RHEL SELinux note in the troubleshooting guide
* One additional change to the development guide code example
* Move troubleshooting to post-install of install.rst and other minor edits.
* Remove troubleshooting page and modify new post-install troubleshooting section on install.rst
* Refactor the how Omnitrace works page into separate topics and redo infrastructure
* API ToC changes
* Additional API and ToC changes
* Back out API and ToC changes and update requirements.txt
* Additional API and ToC changes
* Add commit for signing purposes
* Add ElfUtils and BinUtils Download URL Overrides (ROCm#358)
* Add CMake CACHE Variable ElfUtils_DOWNLOAD_URL
  Used to override the default URL to download ElfUtils from. Useful for internal builds
  Also, include a mirror to fallback to if the override URL fails.
* Update timemory submodule
  Updating to include the BINUTIL_DOWNLOAD_URL override cmake variable.

---------

Signed-off-by: David Galiffi <[email protected]>

* Remove Ubuntu 18.04 and SUSE 15.2
* Update checkout action to v4
* Add `docs/**` to `paths-ignore`
  Document location is being refactored.
* Modified submodules dyninst and timemory. (ROCm#361)

---------

Signed-off-by: David Galiffi <[email protected]>
Co-authored-by: Peter Jun Park <[email protected]>
Co-authored-by: ajanicijamd <[email protected]>
Co-authored-by: David Galiffi <[email protected]>
Co-authored-by: Jonathan R. Madsen <[email protected]>
Co-authored-by: Sam Wu <[email protected]>
samjwu · Jul 29, 2024 · cb6e6a6 · cb6e6a6
1 parent f0bd912
commit cb6e6a6
Showing 38 changed files with 7,122 additions and 9 deletions.
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
@@ -4,3 +4,4 @@
 docs/* @ROCm/rocm-documentation
 *.md @ROCm/rocm-documentation
 *.rst @ROCm/rocm-documentation
+.readthedocs.yaml @ROCm/rocm-documentation
diff --git a/.github/dependabot.yml b/.github/dependabot.yml
@@ -9,3 +9,14 @@ updates:
     directory: "/" # Location of package manifests
     schedule:
       interval: "weekly"
+
+  - package-ecosystem: "pip" # See documentation for possible values
+    directory: "/docs/sphinx" # Location of package manifests
+    open-pull-requests-limit: 10
+    schedule:
+      interval: "daily"
+    labels:
+      - "documentation"
+      - "dependencies"
+    reviewers:
+      - "samjwu"
diff --git a/.gitignore b/.gitignore
@@ -37,6 +37,10 @@
 # Python cache files
 *.pyc
 
+# Documentation artifacts
+/_build
+_toc.yml
+
 /build*
 /.vscode
 /.cache

diff --git a/.readthedocs.yaml b/.readthedocs.yaml
@@ -0,0 +1,18 @@
+# Read the Docs configuration file
+# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
+
+version: 2
+
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.10"
+
+python:
+  install:
+  - requirements: docs/sphinx/requirements.txt
+
+sphinx:
+  configuration: docs/conf.py
+
+formats: []
diff --git a/README.md b/README.md
@@ -8,8 +8,6 @@
 [![Installer Packaging (CPack)](https://github.com/ROCm/omnitrace/actions/workflows/cpack.yml/badge.svg)](https://github.com/ROCm/omnitrace/actions/workflows/cpack.yml)
 [![Documentation](https://github.com/ROCm/omnitrace/actions/workflows/docs.yml/badge.svg)](https://github.com/ROCm/omnitrace/actions/workflows/docs.yml)
 
-> ***[Omnitrace](https://github.com/ROCm/omnitrace) is an AMD open source research project and is not supported as part of the ROCm software stack.***
-
 ## Overview
 
 AMD Research is seeking to improve observability and performance analysis for software running on AMD heterogeneous systems.
@@ -87,8 +85,8 @@ such as the memory usage, page-faults, and context-switches, and thread-level me
 
 ## Documentation
 
-The full documentation for [omnitrace](https://github.com/ROCm/omnitrace) is available at [rocm.github.io/omnitrace](https://rocm.github.io/omnitrace/).
-See the [Getting Started documentation](https://rocm.github.io/omnitrace/getting_started) for general tips and a detailed discussion about sampling vs. binary instrumentation.
+The full documentation for [omnitrace](https://github.com/ROCm/omnitrace) is available at [the ROCm Omnitrace documentation repository](https://rocm.docs.amd.com/projects/omnitrace/en/latest/index.html).
+See the [Getting Started documentation](https://rocm.docs.amd.com/projects/omnitrace/en/conceptual/how-omnitrace-works.html) for general tips and a detailed discussion about sampling vs. binary instrumentation.
 
 ## Quick Start
 
@@ -109,7 +107,7 @@ wget https://github.com/ROCm/omnitrace/releases/latest/download/omnitrace-instal
 python3 ./omnitrace-install.py --prefix /opt/omnitrace/rocm-5.4 --rocm 5.4
 ```
 
-See the [Installation Documentation](https://rocm.github.io/omnitrace/installation) for detailed information.
+See the [Installation Documentation](https://rocm.docs.amd.com/projects/omnitrace/en/install/install.html) for detailed information.
 
 ### Setup
 
@@ -298,13 +296,13 @@ for `foo` via the direct call within `spam`. There will be no entries for `bar`
 - Select "Open trace file" from panel on the left
 - Locate the omnitrace perfetto output (extension: `.proto`)
 
-![omnitrace-perfetto](source/docs/images/omnitrace-perfetto.png)
+![omnitrace-perfetto](docs/data/omnitrace-perfetto.png)
 
-![omnitrace-rocm](source/docs/images/omnitrace-rocm.png)
+![omnitrace-rocm](docs/data/omnitrace-rocm.png)
 
-![omnitrace-rocm-flow](source/docs/images/omnitrace-rocm-flow.png)
+![omnitrace-rocm-flow](docs/data/omnitrace-rocm-flow.png)
 
-![omnitrace-user-api](source/docs/images/omnitrace-user-api.png)
+![omnitrace-user-api](docs/data/omnitrace-user-api.png)
 
 ## Using Perfetto tracing with System Backend
 

diff --git a/docs/.gitignore b/docs/.gitignore
@@ -0,0 +1,2 @@
+_build/
+_doxygen/
diff --git a/docs/conceptual/data-collection-modes.rst b/docs/conceptual/data-collection-modes.rst
@@ -0,0 +1,146 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+**********************
+Data collection modes
+**********************
+
+Omnitrace supports several modes of recording trace and profiling data for your application.
+
+.. note::
+
+   For an explanation of the terms used in this topic, see 
+   the :doc:`Omnitrace glossary <../reference/omnitrace-glossary>`.
+
++-----------------------------+---------------------------------------------------------+
+| Mode                        | Description                                             |
++=============================+=========================================================+
+| Binary Instrumentation      | Locates functions (and loops, if desired) in the binary |
+|                             | and inserts snippets at the entry and exit              |
++-----------------------------+---------------------------------------------------------+
+| Statistical Sampling        | Periodically pauses application at specified intervals  |
+|                             | and records various metrics for the given call stack    |
++-----------------------------+---------------------------------------------------------+
+| Callback APIs               | Parallelism frameworks such as ROCm, OpenMP, and Kokkos |
+|                             | make callbacks into Omnitrace to provide information    |
+|                             | about the work the API is performing                    |
++-----------------------------+---------------------------------------------------------+
+| Dynamic Symbol Interception | Wrap function symbols defined in a position independent |
+|                             | dynamic library/executable, like ``pthread_mutex_lock`` |
+|                             | in ``libpthread.so`` or ``MPI_Init`` in the MPI library |
++-----------------------------+---------------------------------------------------------+
+| User API                    | User-defined regions and controls for Omnitrace         |
++-----------------------------+---------------------------------------------------------+
+
+The two most general and important modes are binary instrumentation and statistical sampling.
+It is important to understand their advantages and disadvantages.
+Both can be performed with the ``omnitrace-instrument``
+executable. For statistical sampling alone, it's highly recommended to use the
+``omnitrace-sample`` executable instead when binary instrumentation isn't required.
+Callback APIs and dynamic symbol interception can be used with either tool.
+
+Binary instrumentation
+-----------------------------------
+
+Binary instrumentation lets you record deterministic measurements for 
+every single invocation of a given function.
+Binary instrumentation effectively adds instructions to the target application to 
+collect the required information. It therefore has the potential to cause performance 
+changes which might, in some cases, lead to inaccurate results. The effect depends on 
+the information being collected and which features are activated in Omnitrace. 
+For example, collecting only the wall-clock timing data
+has less of an effect than collecting the wall-clock timing, CPU-clock timing, 
+memory usage, cache-misses, and number of instructions that were run. Similarly, 
+collecting a flat profile has less overhead than a hierarchical profile 
+and collecting a trace OR a profile has less overhead than collecting a 
+trace AND a profile.
+
+In Omnitrace, the primary heuristic for controlling the overhead with binary 
+instrumentation is the minimum number of instructions for selecting functions 
+for instrumentation.
+
+Statistical sampling
+-----------------------------------
+
+Statistical call-stack sampling interrupts the application at
+regular intervals using operating system interrupts.
+Sampling is typically less numerically accurate and specific, but the 
+target program runs at nearly full speed.
+In contrast to the data derived from binary instrumentation, the resulting 
+data is not exact but is instead a statistical approximation.
+However, sampling often provides a more accurate picture of the application 
+execution because it is less intrusive to the target application and has fewer
+side effects on memory caches or instruction decoding pipelines. Furthermore, 
+because sampling does not affect the execution speed as much, it is
+relatively immune to over-evaluating the cost of small, frequently called 
+functions or "tight" loops.
+
+In Omnitrace, the overhead for statistical sampling depends on the 
+sampling rate and whether the samples are taken with respect to the CPU time 
+and/or real time.
+
+Binary instrumentation vs. statistical sampling example
+-------------------------------------------------------
+
+Consider the following code:
+
+.. code-block:: c++
+
+   #include <cstdio>
+   #include <cstdlib>
+
+   long fib(long n)
+   {
+        if(n < 2) return n;
+        return fib(n - 1) + fib(n - 2);
+   }
+
+   // i is the iteration index, passed in so that it can be printed
+   void run(long i, long n)
+   {
+        long result = fib(n);
+        printf("[%li] fibonacci(%li) = %li\n", i, n, result);
+   }
+
+   int main(int argc, char** argv)
+   {
+        long nfib = 30;
+        long nitr = 10;
+        if(argc > 1) nfib = atol(argv[1]);
+        if(argc > 2) nitr = atol(argv[2]);
+
+        for(long i = 0; i < nitr; ++i)
+            run(i, nfib);
+
+        return 0;
+   }
+
+Binary instrumentation of the ``fib`` function records **every single invocation**
+of the function. For a very small function
+such as ``fib``, this results in **significant** overhead because this simple function
+takes about 20 instructions, whereas the entry and
+exit snippets are ~1024 instructions. You therefore generally want to avoid
+instrumenting functions that have significantly fewer
+instructions than the entry and exit instrumentation. (Note that many of the
+instructions in the entry and exit functions are either logging functions or
+depend on the runtime settings and thus might never run.)
+Because of the number of potential instructions in the entry and exit snippets,
+the default behavior of ``omnitrace-instrument`` is to only instrument functions
+that contain at least 1024 instructions.
+
+However, recording every single invocation of the function can be extremely 
+useful for detecting anomalies, such as profiles that show minimum or maximum values much smaller or larger
+than the average or a high standard deviation. In this case, the traces help you 
+identify exactly when and where those instances deviated from the norm.
+Compare the level of detail in the following traces. In the top image, 
+every instance of the ``fib`` function is instrumented, while in the bottom image,
+the ``fib`` call-stack is derived via sampling.
+
+Binary instrumentation of the Fibonacci function
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. image:: ../data/fibonacci-instrumented.png
+   :alt: Visualization of the output of a binary instrumentation of the Fibonacci function
+
+Statistical sampling of the Fibonacci function
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. image:: ../data/fibonacci-sampling.png
+   :alt: Visualization of the output of a statistical sample of the Fibonacci function
diff --git a/docs/conceptual/omnitrace-feature-set.rst b/docs/conceptual/omnitrace-feature-set.rst
@@ -0,0 +1,137 @@
+.. meta::
+   :description: Omnitrace documentation and reference
+   :keywords: Omnitrace, ROCm, profiler, tracking, visualization, tool, Instinct, accelerator, AMD
+
+***************************************
+The Omnitrace feature set and use cases
+***************************************
+
+`Omnitrace <https://github.com/ROCm/omnitrace>`_ is designed to be highly extensible. 
+Internally, it leverages the `Timemory performance analysis toolkit <https://github.com/NERSC/timemory>`_ 
+to manage extensions, resources, data, and other items. It supports the following features, 
+modes, metrics, and APIs.
+
+Data collection modes
+========================================
+
+* Dynamic instrumentation
+
+  * Runtime instrumentation: Instrument executables and shared libraries at runtime
+  * Binary rewriting: Generate a new executable and/or library with instrumentation built-in
+
+* Statistical sampling: Periodic software interrupts per-thread
+* Process-level sampling: A background thread records process-, system- and device-level metrics while the application runs
+* Causal profiling: Quantifies the potential impact of optimizations in parallel code
+
+.. note::
+
+   Critical trace support was removed in Omnitrace v1.11.0. 
+   It was replaced by the causal profiling feature.
+
+Data analysis
+========================================
+
+* High-level summary profiles with mean, min, max, and standard deviation statistics
+
+  * Low overhead and memory efficient
+  * Ideal for running at scale
+
+* Comprehensive traces for every individual event and measurement
+* Application speed-up predictions resulting from potential optimizations in functions and lines of code based on causal profiling
+
+Parallelism API support
+========================================
+
+* HIP
+* HSA
+* Pthreads
+* MPI
+* Kokkos-Tools (KokkosP)
+* OpenMP-Tools (OMPT)
+
+GPU metrics
+========================================
+
+* GPU hardware counters
+* HIP API tracing
+* HIP kernel tracing
+* HSA API tracing
+* HSA operation tracing
+* System-level sampling (via rocm-smi)
+
+  * Memory usage
+  * Power usage
+  * Temperature
+  * Utilization
+
+CPU metrics
+========================================
+
+* CPU hardware counters sampling and profiles
+* CPU frequency sampling
+* Various timing metrics
+
+  * Wall time
+  * CPU time (process and thread)
+  * CPU utilization (process and thread)
+  * User CPU time
+  * Kernel CPU time
+
+* Various memory metrics
+
+  * High-water mark (sampling and profiles)
+  * Memory page allocation
+  * Virtual memory usage
+
+* Network statistics
+* I/O metrics
+* Many others
+
+Third-party API support
+========================================
+
+* TAU
+* LIKWID
+* Caliper
+* CrayPAT
+* VTune
+* NVTX
+* ROCTX
+
+Omnitrace use cases
+========================================
+
+When analyzing the performance of an application, do NOT 
+assume you know where the performance bottlenecks are
+and why they are happening. Omnitrace is a tool for analyzing the entire 
+application and its performance. It is
+ideal for characterizing where optimization would have the greatest impact 
+on an end-to-end run of the application and for
+viewing what else is happening on the system during a performance bottleneck.
+
+When GPUs are involved, there is a tendency to assume that
+the quickest path to performance improvement is minimizing
+the runtime of the GPU kernels. This is a highly flawed assumption.
+If you optimize the runtime of a kernel from 1 millisecond
+to 1 microsecond (1000x speed-up) but the original application never
+spent time waiting for kernels to complete,
+there would be no statistically significant reduction in the end-to-end
+runtime of your application. In other words, it does not matter
+how fast or slow the code on the GPU is if the application is not
+bottlenecked waiting on the GPU.
+
+Use Omnitrace to obtain a high-level view of the entire application. Use it 
+to determine where the performance bottlenecks are and
+obtain clues to why these bottlenecks are happening. Rather than worrying about kernel
+performance, start your investigation with Omnitrace, which characterizes the
+broad picture.
+
+.. note::
+
+   For insight into the execution of individual kernels on the GPU, 
+   use `Omniperf <https://github.com/rocm/omniperf>`_.
+
+In terms of CPU analysis, Omnitrace does not target any specific vendor. 
+It works just as well on AMD and non-AMD CPUs.
+With regard to the GPU, Omnitrace is currently restricted to HIP and HSA APIs 
+and kernels running on AMD GPUs.