From 2f54ede25076d38579f7e0571c64a1d35c4f3ecc Mon Sep 17 00:00:00 2001
From: CameronChurchwell
Date: Fri, 16 Feb 2024 16:23:29 -0500
Subject: [PATCH 1/8] proposal

---
 RFC-0035-viterbi-decoding.md | 136 +++++++++++++++++++++++++++++++++++
 1 file changed, 136 insertions(+)
 create mode 100644 RFC-0035-viterbi-decoding.md

diff --git a/RFC-0035-viterbi-decoding.md b/RFC-0035-viterbi-decoding.md
new file mode 100644
index 0000000..0174997
--- /dev/null
+++ b/RFC-0035-viterbi-decoding.md
@@ -0,0 +1,136 @@
+<details>
+<summary>Instructions - click to expand</summary>
+
+- Fork the rfcs repo: https://github.com/pytorch/rfcs
+- Copy `RFC-0000-template.md` to `RFC-00xx-my-feature.md`, or write your own open-ended proposal. Put care into the details.
+- Submit a pull request titled `RFC-00xx-my-feature`.
+  - Assign the `draft` label while composing the RFC. You may find it easier to use a WYSIWYG editor (like Google Docs) when working with a few close collaborators; feel free to use whatever platform you like. Ideally this document is publicly visible and is linked to from the PR.
+  - When opening the RFC for general discussion, copy your document into the `RFC-00xx-my-feature.md` file on the PR and assign the `commenting` label.
+- Build consensus for your proposal, integrate feedback and revise it as needed, and summarize the outcome of the discussion via a [resolution template](https://github.com/pytorch/rfcs/blob/rfc-process/RFC-0000-template.md#resolution).
+  - If the RFC is idle here (no activity for 2 weeks), assign the label `stalled` to the PR.
+- Once the discussion has settled, assign a new label based on the level of support:
+  - `accepted` if a decision has been made in the RFC
+  - `draft` if the author needs to rework the RFC’s proposal
+  - `shelved` if there are no plans to move ahead with the current RFC’s proposal. We want neither to think about evaluating the proposal
+nor about implementing the described feature until some time in the future.
+- A state of `accepted` means that the core team has agreed in principle to the proposal, and it is ready for implementation.
+- The author (or any interested developer) should next open a tracking issue on Github corresponding to the RFC.
+  - This tracking issue should contain the implementation next steps. Link to this tracking issue on the RFC (in the Resolution > Next Steps section)
+- Once all relevant PRs are merged, the RFC’s status label can be finally updated to `closed`.
+</details>
+
+
+
+# [viterbi-decoding]
+
+**Authors:**
+* @CameronChurchwell
+* @maxmorrison
+
+
+## **Summary**
+Add Viterbi Algorithm Decoding operation to torch!
+
+
+## **Motivation**
+Viterbi decoding is a very common algorithm which achieves the maximum likelihood over a sequence of states at the cost of speed. In our research on speech editing, we have observed a massive improvement in pitch reconstruction accuracy when the pitch inputs are decoded using Viterbi as opposed to other methods, but commonly used implementations like that in Librosa are quite slow and are not feasible for use with larger datasets.
+
+
+## **Proposed Implementation**
+We have implemented the Viterbi Decoding Algorithm in five parts:
+* A python wrapper module ([torbi](https://github.com/maxrmorrison/torbi))
+  * A C++, Pybind11 style Torch extension ([viterbi.cpp](https://github.com/maxrmorrison/torbi/blob/main/torbi/viterbi.cpp))
+  * A `viterbi_make_trellis_cpu` CPU function which uses OpenMP (with SIMD) to parallelize some loops. ([viterbi.cpp](https://github.com/maxrmorrison/torbi/blob/main/torbi/viterbi.cpp))
+  * A `viterbi_make_trellis_kernel` CUDA kernel which parallelizes one sequence per thread block ([viterbi_kernel.cu](https://github.com/maxrmorrison/torbi/blob/main/torbi/viterbi_kernel.cu))
+  * A `viterbi_backtrace_trellis_cpu` CPU function which does the final decoding ([viterbi.cpp](https://github.com/maxrmorrison/torbi/blob/main/torbi/viterbi.cpp))
+  * A `viterbi_backtrace_trellis_kernel` CUDA kernel which does the final decoding on the GPU ([viterbi_kernel.cu](https://github.com/maxrmorrison/torbi/blob/main/torbi/viterbi_kernel.cu))
+
+We have also implemented a series of tests and [benchmarks](https://github.com/maxrmorrison/torbi/blob/main/torbi/evaluate/core.py) to evaluate our method against the implementation in Librosa. See [metrics](#metrics)
+
+### CUDA Algorithm
+
+Our CUDA algorithm makes efficient use of warps to cache posterior probabilities in shared memory. The core design is a nested loop, first over timesteps, and then over possible states. One warp is assigned to each state to compute posterior distributions and then perform a parallel argmax (with reduction) to find the best next state from the current state that the warp is assigned to.
+
+The warps iterate over the input states for cases where there are more than 32 (#warps in a block) input states.
+
+Instead of storing the entire posterior distribution as in the Librosa implementation, we only store the current and next timesteps, drastically reducing memory usage. To avoid expensive memory copies, we use pointers to switch which array stores current values and which stores next values. In addition, to support a variable number of input states, these two arrays are just pointers to the two halves of a shared memory array which is sized externally.
+
+Because we use only a single block per input sequence, we can process a batch of input sequences very quickly in parallel, depending on the GPU in use. This also cuts down on the number of kernel-invocation-style syncs that must be performed.
+
+## **Metrics**
+All recorded with batch size 512 on VCTK
+
+| Method | Real Time Factor (higher is better) |
+| ------------- | ------------- |
+| Librosa (1x cpu)| |
+| Librosa (16x cpu)| |
+| Torbi (1x cpu)| |
+| Torbi (16x cpu)| |
+| Torbi (1x RTX 4090)| |
+
+## **Drawbacks**
+* The only real drawback is that it could be considered bloat by some.
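+
+For intuition, the two-stage algorithm described above (trellis construction followed by backtracing) can be sketched in pure PyTorch. The following is an illustrative, unbatched reference that assumes natural-log inputs; it is not the proposed CPU/CUDA implementation, and `viterbi_reference` is a hypothetical name used only for this sketch:
+
+```
+import torch
+
+def viterbi_reference(observation, transition, initial):
+    """Unoptimized O(T * S^2) Viterbi decode of a single sequence.
+
+    observation: (T, S) log-probabilities of each state at each frame
+    transition: (S, S) log-probabilities; transition[i, j] = log P(j at t+1 | i at t)
+    initial: (S,) log-probabilities of the initial state distribution
+    """
+    T, S = observation.shape
+
+    # Stage 1: build the trellis of best-predecessor indices
+    pointers = torch.zeros((T, S), dtype=torch.long)
+    value = initial + observation[0]
+    for t in range(1, T):
+        # scores[i, j]: log-likelihood of reaching state j at time t via state i
+        scores = value[:, None] + transition
+        value, pointers[t] = scores.max(dim=0)
+        value = value + observation[t]
+
+    # Stage 2: backtrace the maximum-likelihood path
+    indices = torch.zeros(T, dtype=torch.long)
+    indices[-1] = value.argmax()
+    for t in range(T - 1, 0, -1):
+        indices[t - 1] = pointers[t, indices[t]]
+    return indices
+```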
+
+
+## **Alternatives**
+* Our design is currently open source so anyone wanting to make use of it need only install it. Unfortunately, due to the [well-known difficulties](https://github.com/pytorch/builder/issues/468#issuecomment-661943587) with packaging torch extensions, it must be built from source, which requires users to have installed versions of the CUDA toolkit and g++ that satisfy version constraints.
+* We tested a variety of other implementations which ultimately were all slower:
+  * Pure Python torch implementation
+  * Cython numpy implementation
+  * Cython implementation (without numpy operations)
+  * C++ implementation without OpenMP
+  * Librosa Numba implementation
+
+
+## **Prior Art**
+[Current librosa implementation](https://librosa.org/doc/main/generated/librosa.sequence.viterbi.html)
+
+
+## **How we teach this**
+* No reorganization of documentation would be necessary to the best of my knowledge.
+* Ideally, this would take no more work to document than any other `torch.nn.functional` function.
+
+
+## **Unresolved questions**
+* Right now our implementation is written as a PyTorch extension. How can it be converted to something like a `TORCH_MODULE_FRAGMENT`?
+* How can our implementation be changed to support float16 and float64 in addition to float32?
+* Currently our kernel only supports recent compute capabilities (7 and later?) and makes assumptions about that capability. Ideally this would be generalized to easily support new compute capabilities as they are announced. The assumptions made are the following:
+  * The number of threads in a block
+  * The number of threads in a warp
+  * The number of warps in a block
+* Does torch allow the use of OpenMP?
+
+
+## Resolution
+We decided to do it. X% of the engineering team actively approved of this change.
+
+### Level of Support
+Choose one of the following:
+* 1: Overwhelming positive feedback.
+* 2: Positive feedback.
+* 3: Majority Acceptance, with conflicting Feedback.
+* 4: Acceptance, with Little Feedback.
+* 5: Unclear Resolution.
+* 6: RFC Rejected.
+* 7: RFC Rejected, with Conflicting Feedback.
+
+
+#### Additional Context
+Some people were in favor of it, but some people didn’t want it for project X.
+
+
+### Next Steps
+Will implement it.
+
+
+#### Tracking issue
+
+
+#### Exceptions
+Not implementing on project X now. Will revisit the decision in 1 year.

From 443e11cb3c656290913971c045d297f6b5d7082b Mon Sep 17 00:00:00 2001
From: Max
Date: Fri, 16 Feb 2024 16:05:25 -0600
Subject: [PATCH 2/8] Updated viterbi rfc

---
 RFC-0035-viterbi-decoding.md | 66 ++++++++++--------------------------
 1 file changed, 18 insertions(+), 48 deletions(-)

diff --git a/RFC-0035-viterbi-decoding.md b/RFC-0035-viterbi-decoding.md
index 0174997..c2e5576 100644
--- a/RFC-0035-viterbi-decoding.md
+++ b/RFC-0035-viterbi-decoding.md
@@ -5,7 +5,7 @@
 
 - Fork the rfcs repo: https://github.com/pytorch/rfcs
 - Copy `RFC-0000-template.md` to `RFC-00xx-my-feature.md`, or write your own open-ended proposal. Put care into the details.
-- Submit a pull request titled `RFC-00xx-my-feature`.
+- Submit a pull request titled `RFC-00xx-my-feature`.
   - Assign the `draft` label while composing the RFC. You may find it easier to use a WYSIWYG editor (like Google Docs) when working with a few close collaborators; feel free to use whatever platform you like. Ideally this document is publicly visible and is linked to from the PR.
- When opening the RFC for general discussion, copy your document into the `RFC-00xx-my-feature.md` file on the PR and assign the `commenting` label. - Build consensus for your proposal, integrate feedback and revise it as needed, and summarize the outcome of the discussion via a [resolution template](https://github.com/pytorch/rfcs/blob/rfc-process/RFC-0000-template.md#resolution). @@ -15,7 +15,7 @@ - `draft` if the author needs to rework the RFC’s proposal - `shelved` if there are no plans to move ahead with the current RFC’s proposal. We want neither to think about evaluating the proposal nor about implementing the described feature until some time in the future. -- A state of `accepted` means that the core team has agreed in principle to the proposal, and it is ready for implementation. +- A state of `accepted` means that the core team has agreed in principle to the proposal, and it is ready for implementation. - The author (or any interested developer) should next open a tracking issue on Github corresponding to the RFC. - This tracking issue should contain the implementation next steps. Link to this tracking issue on the RFC (in the Resolution > Next Steps section) - Once all relevant PRs are merged, the RFC’s status label can be finally updated to `closed`. @@ -23,26 +23,25 @@ nor about implementing the described feature until some time in the future. - - - # [viterbi-decoding] **Authors:** * @CameronChurchwell -* @maxmorrison +* @maxrmorrison ## **Summary** -Add Viterbi Algorithm Decoding operation to torch! +Add Viterbi decoding to PyTorch ## **Motivation** -Viterbi decoding is a very common algorithm which achieves the maximum likelihood over a sequence of states at the cost of speed. In our research on speech editing, we have observed a massive improvement in pitch reconstruction accuracy when the pitch inputs are decoded using Viterbi as opposed to other methods, but commonly used implementations like that in Librosa are quite slow and are not feasible for use with larger datasets. +Viterbi decoding finds the path of maximum likelihood over a time-varying distribution with applications in automatic speech recognition (ASR), pitch estimation, bioinformatics, and more. No implementation of Viterbi decoding exists in PyTorch, and no convenient alternative implementation exists for ML practitioners. A commonly-used implementation in Librosa is used as a reference implementation for correctness, but this reference does not scale well to large datasets due to a relatively inefficient implementation. + +Concretely, Viterbi decoding consists of two stages: (1) construction of a _trellis_ matrix containing path probabilities, and (2) backtracing along the maximal path. We have developed and open-sourced fast CPU and CUDA implementations of both stages. We think our implementations would be a viable starting point for adding Viterbi decoding to PyTorch. ## **Proposed Implementation** -We have implemented the Viterbi Decoding Algorithm in five parts: +We have implemented the Viterbi decoding algorithm in five parts: * A python wrapper module ([torbi](https://github.com/maxrmorrison/torbi)) * A C++, Pybind11 style Torch extension ([viterbi.cpp](https://github.com/maxrmorrison/torbi/blob/main/torbi/viterbi.cpp)) * A `viterbi_make_trellis_cpu` CPU function which uses OpenMP (with SIMD) to parallelize some loops. 
([viterbi.cpp](https://github.com/maxrmorrison/torbi/blob/main/torbi/viterbi.cpp))
 * A `viterbi_make_trellis_kernel` CUDA kernel which parallelizes one sequence per thread block ([viterbi_kernel.cu](https://github.com/maxrmorrison/torbi/blob/main/torbi/viterbi_kernel.cu))
 * A `viterbi_backtrace_trellis_cpu` CPU function which does the final decoding ([viterbi.cpp](https://github.com/maxrmorrison/torbi/blob/main/torbi/viterbi.cpp))
 * A `viterbi_backtrace_trellis_kernel` CUDA kernel which does the final decoding on the GPU ([viterbi_kernel.cu](https://github.com/maxrmorrison/torbi/blob/main/torbi/viterbi_kernel.cu))
 
-We have also implemented a series of tests and [benchmarks](https://github.com/maxrmorrison/torbi/blob/main/torbi/evaluate/core.py) to evaluate our method against the implementation in Librosa. See [metrics](#metrics)
+We have also implemented a series of tests and [benchmarks](https://github.com/maxrmorrison/torbi/blob/main/torbi/evaluate/core.py) to evaluate our method against the implementation in Librosa. See [metrics](#metrics) for results.
+
 
 ### CUDA Algorithm
 
@@ -58,23 +58,23 @@ Our CUDA algorithm makes efficient use of warps to cache posterior probabilities
 
 The warps iterate over the input states for cases where there are more than 32 (#warps in a block) input states.
 
-Instead of storing the entire posterior distribution as in the Librosa implementation, we only store the current and next timesteps, drastically reducing memory usage. To avoid expensive memory copies, we use pointers to switch which array stores current values and which stores next values. In addition, to support a variable number of input states, these two arrays are just pointers to the two halves of a shared memory array which is sized externally.
+Instead of storing the entire posterior distribution as in the Librosa implementation, we only store the current and next timesteps, reducing memory usage. To avoid expensive memory copies, we use pointers to switch which array stores current values and which stores next values. In addition, to support a variable number of input states, these two arrays are just pointers to the two halves of a shared memory array which is sized externally.
 
 Because we use only a single block per input sequence, we can process a batch of input sequences very quickly in parallel, depending on the GPU in use. This also cuts down on the number of kernel-invocation-style syncs that must be performed.
 
+
 ## **Metrics**
-All recorded with batch size 512 on VCTK
+
+We use Viterbi decoding to decode distributions over pitch inferred by a pitch estimating neural network. We compare our proposed implementation to the reference implementation in Librosa that uses just-in-time compilation via numba.
 
 | Method | Real Time Factor (higher is better) |
 | ------------- | ------------- |
 | Librosa (1x cpu)| |
 | Librosa (16x cpu)| |
-| Torbi (1x cpu)| |
-| Torbi (16x cpu)| |
-| Torbi (1x RTX 4090)| |
-
-## **Drawbacks**
-* The only real drawback is that it could be considered bloat by some.
+| Proposed (1x cpu)| |
+| Proposed (16x cpu)| |
+| Proposed (1x RTX 4090; batch size 1)| |
+| Proposed (1x RTX 4090; batch size 512)| |
 
 
 ## **Alternatives**
@@ -104,33 +104,3 @@
 * The number of threads in a warp
 * The number of warps in a block
 * Does torch allow the use of OpenMP?
-
-
-## Resolution
-We decided to do it. X% of the engineering team actively approved of this change.
-
-### Level of Support
-Choose one of the following:
-* 1: Overwhelming positive feedback.
-* 2: Positive feedback.
-* 3: Majority Acceptance, with conflicting Feedback.
-* 4: Acceptance, with Little Feedback. -* 5: Unclear Resolution. -* 6: RFC Rejected. -* 7: RFC Rejected, with Conflicting Feedback. - - -#### Additional Context -Some people were in favor of it, but some people didn’t want it for project X. - - -### Next Steps -Will implement it. - - -#### Tracking issue - - - -#### Exceptions -Not implementing on project X now. Will revisit the decision in 1 year. From b3118b4a19380339a9d6830b7a2a4b804d054208 Mon Sep 17 00:00:00 2001 From: CameronChurchwell Date: Fri, 23 Feb 2024 16:57:42 -0500 Subject: [PATCH 3/8] filled in table --- RFC-0035-viterbi-decoding.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/RFC-0035-viterbi-decoding.md b/RFC-0035-viterbi-decoding.md index 0174997..dd53ef6 100644 --- a/RFC-0035-viterbi-decoding.md +++ b/RFC-0035-viterbi-decoding.md @@ -63,15 +63,17 @@ Instead of storing the entire posterior distribution as in the Librosa implement Because we use only a single block per input sequence, we can process a batch of input sequences very quickly in parallel, depending on the GPU in use. This also cuts down on the number of kernel-invocation-style syncs that must be performed. ## **Metrics** -All recorded with batch size 512 on VCTK +All recorded with batch size 512 on a subset of 8192 files in VCTK | Method | Real Time Factor (higher is better) | | ------------- | ------------- | -| Librosa (1x cpu)| | -| Librosa (16x cpu)| | -| Torbi (1x cpu)| | -| Torbi (16x cpu)| | -| Torbi (1x RTX 4090)| | +| Librosa (1x cpu)| 1.93* | +| Librosa (16x cpu)| 13.82 | +| Torbi (1x cpu)| 1.71 | +| Torbi (16x cpu)| **22.40** | +| Torbi (1x a40 gpu)| **2907471.42** | + +*This was performed with only 1 cpu allocated, but the implementation may have used multiple threads, whereas Torbi uses exactly as many threads as listed in the table. ## **Drawbacks** * The only real drawback is that it could be considered bloat by some. From 6f772a2487f9d332d5695d8cb57af615526e5d5c Mon Sep 17 00:00:00 2001 From: Cameron Date: Tue, 27 Feb 2024 23:48:03 -0600 Subject: [PATCH 4/8] updated metrics --- RFC-0035-viterbi-decoding.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/RFC-0035-viterbi-decoding.md b/RFC-0035-viterbi-decoding.md index e93f07c..e6c16c9 100644 --- a/RFC-0035-viterbi-decoding.md +++ b/RFC-0035-viterbi-decoding.md @@ -66,18 +66,18 @@ Because we use only a single block per input sequence, we can process a batch of ## **Metrics** We use Viterbi decoding to decode distributions over pitch inferred by a pitch estimating neural network. We compare our proposed implementation to the reference implementation in Librosa that uses just-in-time compilation via numba. -All recorded with batch size 512 on a subset of 8192 files in VCTK +Unless otherwise noted, all recorded with batch size 512 on a subset of 8192 files randomly selected from VCTK | Method | Real Time Factor (higher is better) | | ------------- | ------------- | -| Librosa (1x cpu)| 1.93* | -| Librosa (16x cpu)| 13.82 | +| Librosa (1x cpu)| 2.08 | +| Librosa (16x cpu)| 13.82* | | Proposed (1x cpu)| 1.71 | | Proposed (16x cpu)| **22.40** | -| Proposed (1x a40 gpu, batch size 512)| **39444.52** | -| Proposed (1x a40 gpu)| **2907471.42** | +| Proposed (1x a40 gpu, batch size 1)| **39444.52** | +| Proposed (1x a40 gpu)| **6921604.22** | -*This was performed with only 1 cpu allocated, but the implementation may have used multiple threads, whereas Torbi uses exactly as many threads as listed in the table. 
+*We use a `multiprocessing` pool to parallelize the Librosa implementation.
 
 ## **Drawbacks**
 * The only real drawback is that it could be considered bloat by some.

From dacdad08dd91a54fc33ae82a1d35f8165f15cc1e Mon Sep 17 00:00:00 2001
From: Max
Date: Thu, 29 Feb 2024 21:54:25 -0600
Subject: [PATCH 5/8] Summary and motivation

---
 RFC-0035-viterbi-decoding.md | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/RFC-0035-viterbi-decoding.md b/RFC-0035-viterbi-decoding.md
index c2e5576..955466f 100644
--- a/RFC-0035-viterbi-decoding.md
+++ b/RFC-0035-viterbi-decoding.md
@@ -31,11 +31,26 @@ nor about implementing the described feature until some time in the future.
 
 
 ## **Summary**
-Add Viterbi decoding to PyTorch
+
+We want to add Viterbi decoding to PyTorch. Viterbi decoding is a well-known algorithm that finds the path of maximum likelihood over a time-varying distribution. It is used in automatic speech recognition, bioinformatics, digital communications, and other tasks that produce models that infer or generate sequences of probability distributions. No implementation of Viterbi decoding exists in PyTorch, and no convenient alternative implementation exists for ML practitioners that is fast enough to scale to large datasets. We have created batched CPU and GPU implementations of Viterbi decoding that are significantly faster than available implementations. We have found our implementations useful for our own research tasks, and believe the community may find them useful as well.
 
 
 ## **Motivation**
-Viterbi decoding finds the path of maximum likelihood over a time-varying distribution with applications in automatic speech recognition (ASR), pitch estimation, bioinformatics, and more. No implementation of Viterbi decoding exists in PyTorch, and no convenient alternative implementation exists for ML practitioners. A commonly-used implementation in Librosa is used as a reference implementation for correctness, but this reference does not scale well to large datasets due to a relatively inefficient implementation.
+
+Viterbi decoding is a generally useful algorithm that is missing from the PyTorch library, with applications in automatic speech recognition, bioinformatics, digital communications, and more. However, Viterbi decoding is O(C^2T) for C classes and T timesteps, making it challenging to scale to large datasets and real-time applications. A commonly-used implementation of Viterbi decoding exists in Librosa (`librosa.sequence.viterbi`). We use Librosa's implementation as a reference for correctness and a baseline for benchmarking. Our benchmark uses `C = 1,440` states and approximately `T ~= 10 million` time steps across approximately 40k files.
+
+
+We use Viterbi decoding to decode distributions over pitch inferred by a pitch estimating neural network. We compare our proposed implementation to the reference implementation in Librosa that uses just-in-time compilation via numba.
+
+| Method | Real Time Factor (higher is better) |
+| ------------- | ------------- |
+| Librosa (1x cpu)| |
+| Librosa (16x cpu)| |
+| Proposed (1x cpu)| |
+| Proposed (16x cpu)| |
+| Proposed (1x RTX 4090; batch size 1)| |
+| Proposed (1x RTX 4090; batch size 512)| |
+
@@ -63,20 +78,6 @@ Instead of storing the entire posterior distribution as in the Librosa implement Because we use only a single block per input sequence, we can process a batch of input sequences very quickly in parallel, depending on the GPU in use. This also cuts down on the number of kernel-invocation-style syncs that must be performed. -## **Metrics** - -We use Viterbi decoding to decode distributions over pitch inferred by a pitch estimating neural network. We compare our proposed implementation to the reference implementation in Librosa that uses just-in-time compilation via numba. - -| Method | Real Time Factor (higher is better) | -| ------------- | ------------- | -| Librosa (1x cpu)| | -| Librosa (16x cpu)| | -| Proposed (1x cpu)| | -| Proposed (16x cpu)| | -| Proposed (1x RTX 4090; batch size 1)| | -| Proposed (1x RTX 4090; batch size 512)| | - - ## **Alternatives** * Our design is currently open source so anyone wanting to make use of it need only install it. Unfortunately, due to the [well known difficulties](https://github.com/pytorch/builder/issues/468#issuecomment-661943587) with packaging torch extensions, it must be built from source which requires users to have installed the cuda toolkit and g++ which satisfy version constraints. * We tested a variety of other implementations which ultimately were all slower: From 947a8c87f57ba5f3c918a07a8864cf75448bbc34 Mon Sep 17 00:00:00 2001 From: Max Date: Thu, 29 Feb 2024 22:54:05 -0600 Subject: [PATCH 6/8] Corrected dataset length --- RFC-0035-viterbi-decoding.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0035-viterbi-decoding.md b/RFC-0035-viterbi-decoding.md index a46e649..eb538e0 100644 --- a/RFC-0035-viterbi-decoding.md +++ b/RFC-0035-viterbi-decoding.md @@ -37,7 +37,7 @@ We want to add Viterbi decoding to PyTorch. Viterbi decoding is a well-known alg ## **Motivation** -Viterbi decoding is a generally useful algorithm that is missing from the PyTorch library, with applications in automatic speech recognition, bioinformatics, digital communications, and more. However, Viterbi decoding is O(C^2T) for C classes and T timesteps, making it challenging to scale to large datasets and real-time applications. A commonly-used implementation of Viterbi decoding exists in Librosa (`librosa.sequence.viterbi`). We use Librosa's implementation as a reference for correctness and as a baseline for benchmarking. Our benchmark uses `C = 1,440` states and approximately `T ~= 10 million` time steps across approximately 40k files. +Viterbi decoding is a generally useful algorithm that is missing from the PyTorch library, with applications in automatic speech recognition, bioinformatics, digital communications, and more. However, Viterbi decoding is O(C^2T) for C classes and T timesteps, making it challenging to scale to large datasets and real-time applications. A commonly-used implementation of Viterbi decoding exists in Librosa (`librosa.sequence.viterbi`). We use Librosa's implementation as a reference for correctness and as a baseline for benchmarking. Our benchmark uses `C = 1,440` states and approximately `T ~= 200 million` time steps across approximately 40k files. We use Viterbi decoding to decode distributions over pitch inferred by a pitch estimating neural network. We compare our proposed implementation to the reference implementation in Librosa ([`librosa.sequence.viterbi`](https://librosa.org/doc/main/generated/librosa.sequence.viterbi.html)) that uses just-in-time compilation via numba. 
From 0a3e6924b3903a5d3286efce53ce0bf2370950b6 Mon Sep 17 00:00:00 2001 From: Max Date: Fri, 1 Mar 2024 19:16:49 -0600 Subject: [PATCH 7/8] Dataset size --- RFC-0035-viterbi-decoding.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/RFC-0035-viterbi-decoding.md b/RFC-0035-viterbi-decoding.md index eb538e0..178c0af 100644 --- a/RFC-0035-viterbi-decoding.md +++ b/RFC-0035-viterbi-decoding.md @@ -37,7 +37,7 @@ We want to add Viterbi decoding to PyTorch. Viterbi decoding is a well-known alg ## **Motivation** -Viterbi decoding is a generally useful algorithm that is missing from the PyTorch library, with applications in automatic speech recognition, bioinformatics, digital communications, and more. However, Viterbi decoding is O(C^2T) for C classes and T timesteps, making it challenging to scale to large datasets and real-time applications. A commonly-used implementation of Viterbi decoding exists in Librosa (`librosa.sequence.viterbi`). We use Librosa's implementation as a reference for correctness and as a baseline for benchmarking. Our benchmark uses `C = 1,440` states and approximately `T ~= 200 million` time steps across approximately 40k files. +Viterbi decoding is a generally useful algorithm that is missing from the PyTorch library, with applications in automatic speech recognition, bioinformatics, digital communications, and more. However, Viterbi decoding is O(C^2T) for C classes and T timesteps, making it challenging to scale to large datasets and real-time applications. A commonly-used implementation of Viterbi decoding exists in Librosa (`librosa.sequence.viterbi`). We use Librosa's implementation as a reference for correctness and as a baseline for benchmarking. Our benchmark uses `C = 1,440` states and approximately `T ~= 20 million` time steps across approximately 40k files. We use Viterbi decoding to decode distributions over pitch inferred by a pitch estimating neural network. We compare our proposed implementation to the reference implementation in Librosa ([`librosa.sequence.viterbi`](https://librosa.org/doc/main/generated/librosa.sequence.viterbi.html)) that uses just-in-time compilation via numba. From fb66ee215e441ae8ef9c78ad6a012f053eba87fc Mon Sep 17 00:00:00 2001 From: Max Date: Sat, 2 Mar 2024 15:34:32 -0600 Subject: [PATCH 8/8] Torch-style comments --- RFC-0035-viterbi-decoding.md | 232 +++++++++++++++++------------------ 1 file changed, 112 insertions(+), 120 deletions(-) diff --git a/RFC-0035-viterbi-decoding.md b/RFC-0035-viterbi-decoding.md index 178c0af..2e292c5 100644 --- a/RFC-0035-viterbi-decoding.md +++ b/RFC-0035-viterbi-decoding.md @@ -37,9 +37,7 @@ We want to add Viterbi decoding to PyTorch. Viterbi decoding is a well-known alg ## **Motivation** -Viterbi decoding is a generally useful algorithm that is missing from the PyTorch library, with applications in automatic speech recognition, bioinformatics, digital communications, and more. However, Viterbi decoding is O(C^2T) for C classes and T timesteps, making it challenging to scale to large datasets and real-time applications. A commonly-used implementation of Viterbi decoding exists in Librosa (`librosa.sequence.viterbi`). We use Librosa's implementation as a reference for correctness and as a baseline for benchmarking. Our benchmark uses `C = 1,440` states and approximately `T ~= 20 million` time steps across approximately 40k files. - -We use Viterbi decoding to decode distributions over pitch inferred by a pitch estimating neural network. 
We compare our proposed implementation to the reference implementation in Librosa ([`librosa.sequence.viterbi`](https://librosa.org/doc/main/generated/librosa.sequence.viterbi.html)) that uses just-in-time compilation via numba. +Viterbi decoding is a generally useful algorithm that is missing from the PyTorch library, with applications in automatic speech recognition, bioinformatics, digital communications, and more. However, Viterbi decoding is O(C^2T) for C classes and T timesteps, making it challenging to scale to large datasets and real-time applications. A commonly-used implementation of Viterbi decoding exists in Librosa (`librosa.sequence.viterbi`). We use Librosa's implementation as a reference for correctness and as a baseline for benchmarking. Our benchmark uses `C = 1,440` states and approximately `T ~= 20 million` time steps across approximately 40k files. We compare our proposed implementation to the reference implementation in Librosa ([`librosa.sequence.viterbi`](https://librosa.org/doc/main/generated/librosa.sequence.viterbi.html)) that uses just-in-time compilation via numba. | Method | Timesteps decoded per second | | ------------- | ------------- | @@ -77,33 +75,48 @@ We propose a Python API and underlying C++/CUDA extensions for Viterbi decoding ``` def decode( observation: torch.Tensor, - batch_frames: Optional[torch.Tensor] = None, - transition: Optional[torch.Tensor] = None, - initial: Optional[torch.Tensor] = None, - log_probs: bool = False -) -> torch.Tensor: + batch_frames: torch.Tensor, + transition: torch.Tensor, + initial: torch.Tensor +): """Decode a time-varying categorical distribution - Arguments - observation + Args: + observation: :math:`(N, T, S)` or :math:`(T, S)` + where `S = the number of states`, + `T = the length of the sequence`, + and `N = batch size`. 
Time-varying categorical distribution - shape=(batch, frames, states) - batch_frames - Number of frames in each batch item; defaults to all - shape=(batch,) - transition - Categorical transition matrix; defaults to uniform - shape=(states, states) - initial - Categorical initial distribution; defaults to uniform - shape=(states,) - log_probs - Whether inputs are in (natural) log space - - Returns - indices + batch_frames :math:`(N)` + Sequence length of each batch item + transition :math:`(S, S)` + Categorical transition matrix + initial :math:`(S)` + Categorical initial distribution + + Return: + indices: :math:`(N, T)` The decoded bin indices - shape=(batch, frames) + + Example:: + + >>> observation = torch.tensor([[ + >>> [0.25, 0.5, 0.25], + >>> [0.25, 0.25, 0.5], + >>> [0.33, 0.33, 0.33] + >>> ]]) + >>> batch_frames = torch.tensor([3]) + >>> transition = torch.tensor([ + >>> [0.5, 0.25, 0.25], + >>> [0.33, 0.34, 0.33], + >>> [0.25, 0.25, 0.5] + >>> ]) + >>> initial = torch.tensor([0.4, 0.35, 0.25]) + >>> bins = torch.viterbi.decode( + >>> observation, + >>> batch_frames, + >>> transition, + >>> initial) """ ``` @@ -112,35 +125,29 @@ def decode( ``` def make_trellis( - self, observation: torch.Tensor, - batch_frames: Optional[torch.Tensor] = None, - transition: Optional[torch.Tensor] = None, - initial: Optional[torch.Tensor] = None, - log_probs: bool = False + batch_frames: torch.Tensor, + transition: torch.Tensor, + initial: torch.Tensor ) -> torch.Tensor: """Perform first step of Viterbi decoding to construct the path trellis - Arguments - observation + Args: + observation: :math:`(N, T, S)` or :math:`(T, S)` + where `S = the number of states`, + `T = the length of the sequence`, + and `N = batch size`. Time-varying categorical distribution - shape=(batch, frames, states) - batch_frames - Number of frames in each batch item; defaults to all - shape=(batch,) - transition - Categorical transition matrix; defaults to uniform - shape=(states, states) - initial - Categorical initial distribution; defaults to uniform - shape=(states,) - log_probs - Whether inputs are in (natural) log space - - Returns - trellis - The matrix of greedy path pointers used to decode the optimal path - shape=(batch, frames, states) + batch_frames :math:`(N)` + Sequence length of each batch item + transition :math:`(S, S)` + Categorical transition matrix + initial :math:`(S)` + Categorical initial distribution + + Return: + trellis: :math:`(N, T, S)` + Matrix of minimum path indices for backtracing """ ``` @@ -150,30 +157,25 @@ def make_trellis( ``` def backtrace_trellis( trellis: torch.Tensor, - batch_frames: Optional[torch.Tensor] = None, - transition: Optional[torch.Tensor] = None, - initial: Optional[torch.Tensor] = None + batch_frames: torch.Tensor, + transition: torch.Tensor, + initial: torch.Tensor ) -> torch.Tensor: """Perform second step of Viterbi decoding to backtrace optimal path - Arguments - trellis - The matrix of greedy path pointers used to decode the optimal path - shape=(batch, frames, states) - batch_frames - Number of frames in each batch item; defaults to all - shape=(batch,) - transition - Categorical transition matrix; defaults to uniform - shape=(states, states) - initial - Categorical initial distribution; defaults to uniform - shape=(states,) - - Returns - indices + Args: + trellis: :math:`(N, T, S)` + Matrix of minimum path indices for backtracing + batch_frames :math:`(N)` + Sequence length of each batch item + transition :math:`(S, S)` + Categorical transition matrix + initial 
:math:`(S)`
+            Categorical initial distribution
+
+    Return:
+        indices: :math:`(N, T)`
             The decoded bin indices
     """
 ```
 
 #### Object-oriented API
 
 ```
 class Decoder:
 
     def __init__(
         self,
-        transition: Optional[torch.Tensor] = None,
-        initial: Optional[torch.Tensor] = None
+        transition: torch.Tensor,
+        initial: torch.Tensor
     ) -> None:
         """
-        Arguments
-            transition
-                Categorical transition matrix; defaults to uniform
-                shape=(states, states)
-            initial
-                Categorical initial distribution; defaults to uniform
-                shape=(states,)
+        Args:
+            transition :math:`(S, S)`
+                Categorical transition matrix
+            initial :math:`(S)`
+                Categorical initial distribution
         """
 
     def decode(
         self,
         observation: torch.Tensor,
-        batch_frames: Optional[torch.Tensor] = None,
-        log_probs: bool = False
+        batch_frames: torch.Tensor
     ) -> torch.Tensor:
         """Decode a time-varying categorical distribution
 
-        Arguments
-            observation
+        Args:
+            observation: :math:`(N, T, S)` or :math:`(T, S)`
+                where `S = the number of states`,
+                `T = the length of the sequence`,
+                and `N = batch size`.
                 Time-varying categorical distribution
-                shape=(batch, frames, states)
-            batch_frames
-                Number of frames in each batch item; defaults to all
-                shape=(batch,)
-            log_probs
-                Whether inputs are in (natural) log space
-
-        Returns
-            indices
+            batch_frames :math:`(N)`
+                Sequence length of each batch item
+
+        Return:
+            indices: :math:`(N, T)`
                 The decoded bin indices
-                shape=(batch, frames)
         """
 
     def make_trellis(
         self,
         observation: torch.Tensor,
-        batch_frames: Optional[torch.Tensor] = None,
-        log_probs: bool = False
+        batch_frames: torch.Tensor
     ) -> torch.Tensor:
         """Perform first step of Viterbi decoding to construct the path trellis
 
-        Arguments
-            observation
+        Args:
+            observation: :math:`(N, T, S)` or :math:`(T, S)`
+                where `S = the number of states`,
+                `T = the length of the sequence`,
+                and `N = batch size`.
                 Time-varying categorical distribution
-                shape=(batch, frames, states)
-            batch_frames
-                Number of frames in each batch item; defaults to all
-                shape=(batch,)
-            log_probs
-                Whether inputs are in (natural) log space
-
-        Returns
-            trellis
-                The matrix of greedy path pointers used to decode the optimal path
-                shape=(batch, frames, states)
+            batch_frames :math:`(N)`
+                Sequence length of each batch item
+
+        Return:
+            trellis: :math:`(N, T, S)`
+                Matrix of minimum path indices for backtracing
         """
 
     def backtrace_trellis(
         self,
         trellis: torch.Tensor,
-        batch_frames: Optional[torch.Tensor] = None
+        batch_frames: torch.Tensor
     ) -> torch.Tensor:
         """Perform second step of Viterbi decoding to backtrace optimal path
 
-        Arguments
-            trellis
-                The matrix of greedy path pointers used to decode the optimal path
-                shape=(batch, frames, states)
-            batch_frames
-                Number of frames in each batch item; defaults to all
-                shape=(batch,)
+        Args:
+            trellis: :math:`(N, T, S)`
+                Matrix of minimum path indices for backtracing
+            batch_frames :math:`(N)`
+                Sequence length of each batch item
 
-        Returns
-            indices
-                The decoded bin indices
-                shape=(batch, frames)
+        Return:
+            indices: :math:`(N, T)`
+                The decoded bin indices
         """
 ```
 
@@ -291,6 +282,7 @@ Because we use only a single block per input sequence, we can process a batch of
 
 ## **Discussion questions**
+
+* Are there desired changes in the naming conventions?
 * Right now our implementation is written as a PyTorch extension. How can it be converted to something like a `TORCH_MODULE_FRAGMENT`?
 * Are there recommended methods for ensuring compliance over a set of allowed dtypes?
Our implementation currently works for torch.float32, but is not guaranteed to work for all types.
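+
+For concreteness, here is a sketch of how the proposed API is intended to compose. This is hypothetical usage: `torch.viterbi` is the namespace suggested by this RFC rather than an existing PyTorch module, the placement of `Decoder` under it is likewise an assumption, and the uniform distributions are placeholder values:
+
+```
+import torch
+
+S = 3                                         # number of states
+observation = torch.full((2, 4, S), 1.0 / S)  # two sequences, padded to four frames
+batch_frames = torch.tensor([4, 2])           # valid length of each batch item
+transition = torch.full((S, S), 1.0 / S)      # uniform transition matrix
+initial = torch.full((S,), 1.0 / S)           # uniform initial distribution
+
+# One-shot decoding
+indices = torch.viterbi.decode(observation, batch_frames, transition, initial)
+
+# Equivalent two-stage decoding, exposing the trellis between the stages
+trellis = torch.viterbi.make_trellis(observation, batch_frames, transition, initial)
+indices = torch.viterbi.backtrace_trellis(trellis, batch_frames, transition, initial)
+
+# Equivalent object-oriented usage, reusing transition and initial across calls
+decoder = torch.viterbi.Decoder(transition, initial)
+indices = decoder.decode(observation, batch_frames)
+```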