Add SDP documentation (NVIDIA#5274) (NVIDIA#5376)

* Add details to SDP README.md
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Add docstring to WriteManifest processor
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Add docstring to CreateInitialManifestMLS
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Add ModifyManifestTextProcessor docstring
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Add ASRInference docstring
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Add base_processor docstrings
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Add minimal SDP docs page
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Update tools/speech_dataset_processor/README.md
  Co-authored-by: Igor Gitman <[email protected]>
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Write simple README for SDP and move complex explanations to docs
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Remove incorrect type hints
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Make config example less confusing
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Fix typo
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Clarify that YAML file is config file in README
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Remove unused imports
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Remove SDP docs for now
  Signed-off-by: Elena Rastorgueva <[email protected]>
* Remove links to docs in SDP README
  Signed-off-by: Elena Rastorgueva <[email protected]>

Signed-off-by: Elena Rastorgueva <[email protected]>
Co-authored-by: Elena Rastorgueva <[email protected]>
Co-authored-by: Igor Gitman <[email protected]>

JimmyZhang12 · Dec 14, 2022 · b964f24
1 parent e42997e
commit b964f24
Showing 6 changed files with 136 additions and 14 deletions.
diff --git a/tools/speech_dataset_processor/README.md b/tools/speech_dataset_processor/README.md
@@ -1,7 +1,69 @@
 # Speech Dataset Processor
 
-Toolkit to make it easy to write and share the steps for processing a speech dataset.
+Speech Dataset Processor (SDP) is a toolkit to make it easy to:
+1. write code to process a new dataset, minimizing the amount of boilerplate code required.
+2. share the steps for processing a speech dataset. Sharing processing steps can be as easy as sharing a YAML file.
 
-This toolkit contains many of the most common speech dataset processing operations. To process a new dataset, you simply need to write a YAML file containing the parameters needed for dataset processing. It is also easy to add your own code for various speech dataset processing steps if needed.
+SDP's philosophy is to represent processing operations as 'processor' classes. Many common processing operations are provided, and it is easy to add your own. In some cases, all you will need to do to process a new dataset is simply to write a YAML file containing the parameters needed to process your dataset.
 
-TBD
+SDP is specifically intended for the use case when you have an existing dataset with the audio & text pairs already specified in some form, and you wish to create a JSON manifest suitable for use with NeMo. SDP allows for intermediate cleaning and filtering steps which involve amending the 'ground truth' `"text"` or dropping utterances which are deemed to be too inaccurate for training on.
+
+## Quick intro to Speech Dataset Processor
+
+* The steps to process a dataset are specified by a YAML config file.
+* The YAML config file contains a list of processor classes & the args to pass into the constructor.
+* Each processor class takes an existing manifest as input (except for classes which create an 'initial' manifest from some external transcript file) and outputs a modified version of the manifest. It may change other files in the process, e.g. by resampling audio.
+* To process a manifest, you need to list the chain of processors you wish to use.
+* If a processor is not included, you can make your own.
+
+## YAML config file layout
+A simplified SDP config file might look like this:
+
+```yaml
+processors: 
+
+  # use existing classes for popular datasets or make your own class
+  - _target_: sdp.processors.CreateInitialManifestMLS 
+    output_manifest_file: ...
+    download_dir: ...
+    ...
+
+  # use existing classes for common operations or write your own
+  - _target_: sdp.processors.SubSubstringToSubstring 
+
+    substring_pairs: { 
+      # specify the parameters needed for your use case
+      " mr ": " mister ",
+      " misteak ": " mistake ",
+      ...
+    }
+
+  - _target_: sdp.processors.DropNonAlphabet 
+    alphabet: " abcdefghijklmnopqrstuvwxyz"
+    output_manifest_file: ... 
+    ...
+```
+## Existing processor classes
+In addition to those mentioned in the example config file, many more classes are already included in Speech Dataset Processor, for example:
+* `sdp.processors.ASRInference` will run inference on the manifest using a specified `pretrained_model`.
+* `sdp.processors.DropHighWER` will compute WER between `text` and `pred_text` of each utterance and remove the utterance if WER is greater than the specified `wer_threshold`.
+* `sdp.processors.DropHighLowCharrate` will compute the character rate in the utterance using `text` and `duration`, and drop the utterance if it is outside the bounds of the specified `high_charrate_threshold` and `low_charrate_threshold`. Carefully chosen thresholds will allow us to drop utterances with incorrect ground truth `text`.
+
+## Processor test cases
+You can add test cases to verify that you have specified your desired changes correctly and to help document why you are making these changes.
+
+For example:
+```yaml
+processors:
+  ...
+  - _target_: sdp.processors.DropIfRegexInAttribute
+    attribute_to_regex:
+      "text" : ["(\\D ){5,20}"] # looks for between 4 and 19 characters surrounded by spaces
+
+    test_cases:
+      - {input: {text: "some s p a c e d out letters"}, output: null}
+      - {input: {text: "normal words only"}, output: {text: "normal words only"}}
+      - {input: {text: "three a b c spaced out letters"}, output: {text: "three a b c spaced out letters"}}
+      - {input: {text: "four a b c d spaced out letters"}, output: null}
+  ...
+```
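
The regex in the test-case example above can be checked in isolation. The sketch below is not part of SDP: `should_drop` is a hypothetical helper, and it assumes (following the `ModifyManifestTextProcessor` docstring) that a space is added at the start and end of the text before matching.

```python
import re

# Pattern from the README example; the YAML string "(\\D ){5,20}" unescapes to (\D ){5,20}.
SPACED_LETTERS = re.compile(r"(\D ){5,20}")


def should_drop(text: str) -> bool:
    """Return True if the padded text contains the spaced-out-letters pattern.

    Hypothetical helper: the space padding mirrors what the
    ModifyManifestTextProcessor docstring describes, not SDP's actual code.
    """
    padded = f" {text} "
    return SPACED_LETTERS.search(padded) is not None


examples = [
    "some s p a c e d out letters",
    "normal words only",
    "three a b c spaced out letters",
    "four a b c d spaced out letters",
]
for text in examples:
    print(f"{should_drop(text)!s:5} {text}")
```

With a minimum of five repetitions, a match needs the last character of the preceding word plus at least four single characters, each followed by a space. That is why "a b c" is kept and "a b c d" is dropped.
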
diff --git a/tools/speech_dataset_processor/sdp/processors/asr_inference.py b/tools/speech_dataset_processor/sdp/processors/asr_inference.py
@@ -20,7 +20,14 @@
 
 
 class ASRInference(BaseProcessor):
-    """This processor perforce ASR inference.
+    """This processor performs ASR inference on the input manifest.
+
+    Args:
+        output_manifest_file: the path to the output manifest. It will be the same as the input manifest, but will
+            also have "pred_text" entries for every utterance.
+        input_manifest_file: the path to the input manifest which will be transcribed.
+        pretrained_model: the name of the pretrained NeMo ASR model which will be used to do inference.
+        batch_size: the batch size to use for ASR inference.
 
     Note that it does not re-use base parallel implementation, since the ASR
     inference is already run in batches.
@@ -29,7 +36,9 @@ class ASRInference(BaseProcessor):
         parallelization, but that needs to be tested.
     """
 
-    def __init__(self, output_manifest_file, input_manifest_file, pretrained_model, batch_size=32):
+    def __init__(
+        self, output_manifest_file: str, input_manifest_file: str, pretrained_model: str, batch_size: int = 32
+    ):
         self.output_manifest_file = output_manifest_file
         self.input_manifest_file = input_manifest_file
         self.script_path = Path(__file__).parents[4] / "examples" / "asr" / "transcribe_speech.py"

diff --git a/tools/speech_dataset_processor/sdp/processors/base_processor.py b/tools/speech_dataset_processor/sdp/processors/base_processor.py
@@ -34,6 +34,17 @@ class DataEntry:
 
 
 class BaseProcessor(ABC):
+    """
+    Abstract class for SDP processors.
+
+    Args:
+        output_manifest_file: path of where the output manifest file will be located.
+        input_manifest_file: path of where the input manifest file is located. This arg
+            is optional - some processors may not take in an input manifest because they
+            need to create an initial manifest from scratch (i.e. from some transcript file
+            that is in a format different to the NeMo manifest format).
+    """
+
     def __init__(self, output_manifest_file, input_manifest_file=None):
         self.output_manifest_file = output_manifest_file
         self.input_manifest_file = input_manifest_file
@@ -55,13 +66,15 @@ def test(self):
 
 class BaseParallelProcessor(BaseProcessor):
     """
-    TBD
+    Processor class which allows operations on each utterance to be parallelized. Parallelization 
+    is done using tqdm.contrib.concurrent.process_map.
 
-    input_manifest_file should always be specified unless it's the first
-    processor that reads from original dataset representation.
+    Args:
+        max_workers: maximum number of workers that will be spawned during parallel processing.
+        chunksize: the size of the chunks that will be sent to worker processes. 
     """
 
-    def __init__(self, max_workers=-1, chunksize=100, **kwargs):
+    def __init__(self, max_workers: int = -1, chunksize: int = 100, **kwargs):
         super().__init__(**kwargs)
         if max_workers == -1:
             max_workers = multiprocessing.cpu_count()
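
To illustrate how `max_workers` and `chunksize` interact, here is a minimal sketch that is not taken from SDP. The docstring says SDP uses `tqdm.contrib.concurrent.process_map`; this sketch swaps in the standard library's `ProcessPoolExecutor.map` (no progress bar) so that it has no third-party dependency. The names `resolve_max_workers`, `lowercase_text` and `process_in_parallel` are hypothetical.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


def resolve_max_workers(max_workers: int = -1) -> int:
    """Treat -1 as "use every available CPU", as the constructor above does."""
    if max_workers == -1:
        return multiprocessing.cpu_count()
    return max_workers


def lowercase_text(entry: dict) -> dict:
    """Hypothetical per-utterance operation; top-level so it can be pickled."""
    return {**entry, "text": entry["text"].lower()}


def process_in_parallel(entries, max_workers: int = -1, chunksize: int = 100):
    """Apply lowercase_text to every entry, sending `chunksize` entries per task."""
    with ProcessPoolExecutor(max_workers=resolve_max_workers(max_workers)) as pool:
        return list(pool.map(lowercase_text, entries, chunksize=chunksize))


if __name__ == "__main__":
    manifest = [
        {"text": "Hello World", "duration": 1.2},
        {"text": "MISTER Smith", "duration": 0.9},
    ]
    print(process_in_parallel(manifest, max_workers=2, chunksize=1))
```

A larger `chunksize` reduces inter-process communication overhead when each per-utterance operation is cheap, at the cost of coarser load balancing.
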

diff --git a/...h_dataset_processor/sdp/processors/create_initial_manifest/create_initial_manifest_mls.py b/...h_dataset_processor/sdp/processors/create_initial_manifest/create_initial_manifest_mls.py
@@ -25,8 +25,27 @@
 
 
 class CreateInitialManifestMLS(BaseParallelProcessor):
+    """
+    Downloads and unzips raw MLS data for the specified language, and creates an initial manifest using
+    the transcripts provided in the raw data. 
+
+    Args:
+        language: the language of the data you wish to be downloaded. This will be used to format the 
+            URL from which we attempt to download the data.
+        download_dir: the directory where the downloaded data will be saved.
+        data_split: the data split for which the initial manifest will be created.
+        resampled_audio_dir: the directory where the resampled (16kHz) wav files will be stored.
+        use_test_data: if `True`, will use the test data manifest located at `TEST_DATA_PATH` to carry out tests.
+    """
+
     def __init__(
-        self, language, download_dir, resampled_audio_dir, data_split, use_test_data=False, **kwargs,
+        self,
+        language: str,
+        download_dir: str,
+        resampled_audio_dir: str,
+        data_split: str,
+        use_test_data: bool = False,
+        **kwargs,
     ):
         super().__init__(**kwargs)
         self.language = language
@@ -65,7 +84,7 @@ def read_manifest(self):
 
         return dataset_entries
 
-    def process_dataset_entry(self, data_entry):
+    def process_dataset_entry(self, data_entry: str):
         if len(data_entry.split("\t")) != 2:
             raise RuntimeError(f"have more than one tab in line {data_entry}")
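
The tab check above implies that each transcript line holds exactly two tab-separated fields. A minimal sketch of that parsing step, assuming the fields are an utterance ID followed by the transcript (the function name and the sample line are made up for illustration):

```python
from typing import Tuple


def parse_transcript_line(line: str) -> Tuple[str, str]:
    """Split an '<utterance id><TAB><transcript>' line into its two fields.

    The field layout is an assumption; raises if the line does not contain
    exactly one tab.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise RuntimeError(f"expected exactly one tab, found {len(fields) - 1} in line {line!r}")
    utt_id, text = fields
    return utt_id, text


# Made-up sample line, not real MLS data.
print(parse_transcript_line("utt_0001\tthis is a transcript\n"))
```

Note that the `!= 2` condition rejects lines with no tab as well as lines with more than one.
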
 

diff --git a/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py b/tools/speech_dataset_processor/sdp/processors/modify_manifest/modify_manifest.py
@@ -23,12 +23,20 @@
 class ModifyManifestTextProcessor(BaseParallelProcessor):
     """Base class useful for most "text-only" modifications of the manifest.
 
-    Will add the following functionality:
-        - Add space in the beginning and end of sentence for easier regex-based
+    This adds the following functionality on top of BaseParallelProcessor:
+        - Adds a space at the beginning and end of the sentence for easier regex-based
           processing.
         - Automatically handles common test cases by comparing input to output
           values.
 
+    Args:
+        test_cases: an optional list of dicts containing test cases for checking 
+            that the processor makes the changes that we are expecting.
+            The dicts must have a key 'input', the value of which is a dictionary
+            containing data which is our test input manifest line, and a key 
+            'output', the value of which is a dictionary containing data which is
+            the expected output manifest line.
+
     .. note::
         This class only supports one-to-one or one-to-none mappings.
     """

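The `test_cases` format described in the docstring above can be exercised with a small harness. This is a sketch under stated assumptions, not SDP's implementation: `replace_substrings` and `run_test_cases` are hypothetical names, and the space padding follows the behaviour the docstring describes.

```python
from typing import Callable, Dict, List, Optional

Entry = Dict[str, str]


def replace_substrings(entry: Entry, pairs: Dict[str, str]) -> Optional[Entry]:
    """Hypothetical text-only processor: pad with spaces, substitute, strip the padding."""
    text = f" {entry['text']} "
    for old, new in pairs.items():
        text = text.replace(old, new)
    return {**entry, "text": text.strip()}


def run_test_cases(process: Callable[[Entry], Optional[Entry]], test_cases: List[dict]) -> None:
    """Compare the processor's output with the expected output of each test case.

    An expected output of None means the utterance should be dropped
    (the one-to-none mapping mentioned in the docstring).
    """
    for case in test_cases:
        actual = process(case["input"])
        if actual != case["output"]:
            raise RuntimeError(f"test case failed: {case}, got {actual}")


pairs = {" mr ": " mister "}
cases = [
    {"input": {"text": "mr smith"}, "output": {"text": "mister smith"}},
    {"input": {"text": "mrs smith"}, "output": {"text": "mrs smith"}},
]
run_test_cases(lambda entry: replace_substrings(entry, pairs), cases)
print("all test cases passed")
```

Padding the text means a pattern such as `" mr "` also matches at the very start or end of an utterance, while still leaving "mrs" untouched.
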
diff --git a/tools/speech_dataset_processor/sdp/processors/write_manifest.py b/tools/speech_dataset_processor/sdp/processors/write_manifest.py
@@ -13,13 +13,24 @@
 # limitations under the License.
 
 import json
+from typing import List
 
 from sdp.processors.base_processor import BaseProcessor
 from tqdm import tqdm
 
 
 class WriteManifest(BaseProcessor):
-    def __init__(self, output_manifest_file, input_manifest_file, fields_to_save):
+    """
+    Saves a copy of a manifest but only with the fields specified in fields_to_save.
+
+    Args:
+        output_manifest_file: path of where the output file will be saved.
+        input_manifest_file: path of where the input file that we will be copying is saved.
+        fields_to_save: list of the fields in the input manifest that we want to copy over. 
+            The output file will only contain these fields.
+    """
+
+    def __init__(self, output_manifest_file: str, input_manifest_file: str, fields_to_save: List[str]):
         self.output_manifest_file = output_manifest_file
         self.input_manifest_file = input_manifest_file
         self.fields_to_save = fields_to_save
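
The behaviour described in the `WriteManifest` docstring can be sketched as a standalone function. This is an illustration only, assuming the manifest is a JSON-lines file (one JSON object per line); `copy_manifest_fields` is a hypothetical name and the sample entry is made up.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def copy_manifest_fields(input_manifest_file: str, output_manifest_file: str, fields_to_save: List[str]) -> None:
    """Copy a JSON-lines manifest, keeping only the keys listed in fields_to_save."""
    with open(input_manifest_file, encoding="utf8") as fin, open(
        output_manifest_file, "w", encoding="utf8"
    ) as fout:
        for line in fin:
            entry = json.loads(line)
            # Raises KeyError if a requested field is missing from an entry.
            kept = {field: entry[field] for field in fields_to_save}
            fout.write(json.dumps(kept) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "in.json", Path(tmp) / "out.json"
    src.write_text(
        '{"audio_filepath": "a.wav", "text": "hi", "pred_text": "hi", "duration": 1.0}\n',
        encoding="utf8",
    )
    copy_manifest_fields(str(src), str(dst), ["audio_filepath", "text", "duration"])
    print(dst.read_text(encoding="utf8"))
```

A step like this is useful at the end of a processor chain, to drop intermediate fields such as `pred_text` from the final manifest.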