From 0c2b6837872447029200b55ede3cb8a5dd3af001 Mon Sep 17 00:00:00 2001
From: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com>
Date: Tue, 17 Jan 2023 13:36:22 -0800
Subject: [PATCH] NeMo Forced Aligner (#5571)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Merge r1.13.0 main (#5570) * update branch Signed-off-by: ericharper * Rename Speech Dataset Processor to Speech Data Processor (#5378) Signed-off-by: Elena Rastorgueva Signed-off-by: Elena Rastorgueva * Megatron Export Update (#5343) * export update for Megatron + change ORT optimization Signed-off-by: David Mosallanezhad * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updated export_utils to use autocast instead of manually casting >:/ Signed-off-by: David Mosallanezhad * removed dtype from LayerNorm Signed-off-by: David Mosallanezhad * added comment Signed-off-by: David Mosallanezhad * reverting changes on FloatCast Signed-off-by: David Mosallanezhad * Cherry-picked changes from megatron-norm Signed-off-by: Boris Fomitchev * updated asr_model import to cast_utils Signed-off-by: David Mosallanezhad * updated del onnx_model place Signed-off-by: David Mosallanezhad * changed ort optimization to basic -> temp fix Signed-off-by: David Mosallanezhad Signed-off-by: David Mosallanezhad Signed-off-by: Boris Fomitchev Co-authored-by: David Mosallanezhad Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Boris Fomitchev * Disable sync_batch_comm in validation_step for GPT (#5397) * disable sync_batch_comm in validation_step Signed-off-by: ericharper * Read sync_batch_comm from config or default to False Signed-off-by: Markel Sanz Ausin * Update megatron_gpt_config to default sync_batch_comm to False to avoid CUDA error Signed-off-by: Markel Sanz Ausin * Empty Signed-off-by: MaximumEntropy * Comment out test Signed-off-by: MaximumEntropy Signed-off-by: ericharper Signed-off-by: Markel Sanz Ausin Signed-off-by: MaximumEntropy Signed-off-by: Oleksii Kuchaiev Co-authored-by: Oleksii Kuchaiev Co-authored-by: Markel Sanz Ausin Co-authored-by: Sandeep Subramanian Co-authored-by: Oleksii Kuchaiev * Radtts 1.13 (#5451) * [TTS] Fixing RADTTS training - removing view buffer and fixing accuracy issue (#5358) * [TTS] add CI test for RADTTS training recipe.
Signed-off-by: Boris Fomitchev Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Oleksii Kuchaiev * Support for finetuning and finetuning inference with .ckpt files & batch size refactoring (#5339) (#5478) * Initial refactor Signed-off-by: MaximumEntropy * Resolve config before passing to load_from_checkpoint Signed-off-by: MaximumEntropy * Fixes for model parallel and nemo restore Signed-off-by: MaximumEntropy * Fixes for eval Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Revert config changes Signed-off-by: MaximumEntropy * Refactor Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix typo Signed-off-by: MaximumEntropy * Remove comments Signed-off-by: MaximumEntropy * Minor Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix validation reconfiguration Signed-off-by: MaximumEntropy * Remove old comment Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes for test_ds Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: MaximumEntropy Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: MaximumEntropy Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * export_utils bugfix (#5480) * updated export_utils Signed-off-by: David Mosallanezhad * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: David Mosallanezhad Co-authored-by: David Mosallanezhad Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Export fixes for Riva (#5496) * Export fixes for Riva Signed-off-by: Boris Fomitchev * Cleaning up training_utils Signed-off-by: Boris Fomitchev Signed-off-by: Boris Fomitchev * added set_start_method + function param bugfix (#5539) * added set_start_method + function param bugfix Signed-off-by: David Mosallanezhad * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * upper bound torchmetrics Signed-off-by: ericharper Signed-off-by: David Mosallanezhad Signed-off-by: ericharper Co-authored-by: David Mosallanezhad Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: ericharper * remove notebook (#5548) Signed-off-by: ericharper Signed-off-by: ericharper * update readme Signed-off-by: ericharper * update branch Signed-off-by: ericharper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert Signed-off-by: ericharper * revert Signed-off-by: ericharper * revert Signed-off-by: ericharper * revert Signed-off-by: ericharper * revert Signed-off-by: ericharper * revert Signed-off-by: ericharper * revert Signed-off-by: ericharper Signed-off-by: ericharper Signed-off-by: Elena Rastorgueva Signed-off-by: David Mosallanezhad Signed-off-by: Boris Fomitchev Signed-off-by: Markel Sanz Ausin Signed-off-by: MaximumEntropy Signed-off-by: Oleksii Kuchaiev Signed-off-by: Xuesong Yang 
<1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com> Co-authored-by: David Co-authored-by: David Mosallanezhad Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Boris Fomitchev Co-authored-by: Oleksii Kuchaiev Co-authored-by: Markel Sanz Ausin Co-authored-by: Sandeep Subramanian Co-authored-by: Oleksii Kuchaiev Co-authored-by: Boris Fomitchev Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Optimized loop and bugfix in SDE (#5573) - Fixed bug with loading custom data attributes from JSON in Speech Data Explorer Signed-off-by: George Zelenfroynd Signed-off-by: Elena Rastorgueva * Update torchmetrics (#5566) * add task arg Signed-off-by: nithinraok * update state Signed-off-by: nithinraok Signed-off-by: nithinraok Co-authored-by: Taejin Park Signed-off-by: Elena Rastorgueva * remove useless files. (#5580) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * add initial NFA code Signed-off-by: Elena Rastorgueva * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Elena Rastorgueva * Make use of the specified device during viterbi decoding Signed-off-by: Elena Rastorgueva * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Elena Rastorgueva * Fix CodeQL notes Signed-off-by: Elena Rastorgueva * Fix CodeQL warning Signed-off-by: Elena Rastorgueva * Add an option to defer data setup from ``__init__`` to ``setup`` (#5569) * Add an option to defer dataloader setup from __init__ to setup Signed-off-by: Ante Jukić * Updated doc Signed-off-by: Ante Jukić Signed-off-by: Ante Jukić Signed-off-by: Elena Rastorgueva * Make utt_id specified by number of parts of audio_filepath user wishes to use Signed-off-by: Elena Rastorgueva * remove audio_sr TODO - reduce risk of silent bugs Signed-off-by: Elena Rastorgueva * Add check that model is CTC Signed-off-by: Elena Rastorgueva * Remove unused import Signed-off-by: Elena Rastorgueva * Text generation improvement (UI client, data parallel support) (#5437) * Squashed commit of the following: commit a5e124f34be31bd6eafe5e5fdf5bedcd0d50915c Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Thu Oct 13 15:07:42 2022 +0000 [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci commit 35b424044fe80c3081e7756ab21244f701716f7e Author: Yi Dong Date: Thu Oct 13 08:04:49 2022 -0700 get rid of base Signed-off-by: Yi Dong commit 2955210e2311791543538cfbb5ad26b79414c954 Merge: d52edef8c eaf6757ca Author: Yi Dong Date: Thu Oct 13 13:17:02 2022 +0000 Merge branch 'universal_prompt' of github.com:NVIDIA/NeMo into universal_prompt commit d52edef8cd7b36593838fb270047e80f8ccb652e Author: Yi Dong Date: Thu Oct 13 13:16:24 2022 +0000 align with main Signed-off-by: Yi Dong commit eaf6757ca5be8e099492f57c81d984429b0ad49c Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Thu Oct 13 13:12:11 2022 +0000 [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci commit c4b86d97626ea0721bf8fb4c0a45dec5becc94c9 Author: Yi Dong Date: Thu Oct 13 13:10:58 2022 +0000 same as main Signed-off-by: Yi Dong commit e335de51bcc0d681c58b568c3d8c238bc5687c3b Merge: 
c231086e0 4463a9fe9 Author: Yi Dong Date: Thu Oct 13 13:08:09 2022 +0000 Merge branch 'main' into universal_prompt commit c231086e057f1efaa915f691d84664cb3d5aad85 Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Wed Oct 12 19:59:12 2022 +0000 [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci commit 6a821a4b49a23dd3408a706a2a3dd393149b0bb1 Author: Yi Dong Date: Wed Oct 12 19:56:17 2022 +0000 default to pad Signed-off-by: Yi Dong commit 9d908e39fef1beed9ba2da4d1a6806161eb7ef25 Author: Yi Dong Date: Wed Oct 12 19:55:44 2022 +0000 add the option to pad the tokens Signed-off-by: Yi Dong commit 876dc395b43fdeeaa2bcbbe13c76523633764c33 Merge: fbb0f4035 fe3c77ee9 Author: Yi Dong Date: Wed Oct 12 19:20:47 2022 +0000 Merge branch 'fix_global_init' into universal_prompt commit fe3c77ee93ab6cf3ea152db68cb6beefcac2a392 Author: Yi Dong Date: Wed Oct 12 18:59:49 2022 +0000 fix import again Signed-off-by: Yi Dong commit fbb0f4035c6cd6bfefed50a20605503de8c1dccb Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Wed Oct 12 16:00:24 2022 +0000 [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci commit 372ca8c0d7988f2339b15888dc72aa21f4fb6937 Author: Yi Dong Date: Wed Oct 12 15:58:32 2022 +0000 enable server Signed-off-by: Yi Dong commit cbe05d9fbc978f812cfbb671f45f147f300713c4 Author: Yi Dong Date: Wed Oct 12 13:07:28 2022 +0000 fix comment error Signed-off-by: Yi Dong commit 1948048922e726ec6131e44b1a745389f18d4ef2 Merge: 232c2cce3 984f5c09a Author: Yi Dong Date: Wed Oct 12 13:05:30 2022 +0000 Merge branch 'fix_global_init' into universal_prompt commit 232c2cce34d7a8b902da406706f3dd9b39475091 Merge: 34c8a68df 658243fb6 Author: Yi Dong Date: Wed Oct 12 12:50:00 2022 +0000 Merge branch 'fix_global_init' into universal_prompt commit 984f5c09a6dbf1d1fb5aa30ed9b0df188e66a50f Merge: 658243fb6 3fda5de46 Author: Yi Dong <43824965+yidong72@users.noreply.github.com> Date: Wed Oct 12 08:42:11 2022 -0400 Merge branch 'main' into fix_global_init commit 658243fb6580191b5d60edd30cde16dcc23cbb85 Author: Yi Dong Date: Wed Oct 12 12:40:57 2022 +0000 fix import error Signed-off-by: Yi Dong commit 8e0fe1cad05ec288ec122b3cd0e139a96872e08c Author: Yi Dong Date: Tue Oct 11 22:44:12 2022 +0000 update the fused kernel Signed-off-by: Yi Dong commit 536cf6bef9447b75843fad630729c47a2fba35f3 Author: Yi Dong Date: Tue Oct 11 14:44:52 2022 -0700 add the missing file Signed-off-by: Yi Dong commit 1b437ec41dc5e354453ce0a089bca0171cbcb6c2 Author: Yi Dong Date: Tue Oct 11 14:43:14 2022 -0700 fix fused softmax Signed-off-by: Yi Dong commit 7813f60e05f9783af61f8c14ec1cb0c6c4f1f263 Author: Yi Dong Date: Tue Oct 11 14:16:48 2022 -0700 move global step to base Signed-off-by: Yi Dong commit 34c8a68df084b18d377e84415d9f07b2cd6673dd Author: Yi Dong Date: Thu Oct 6 13:50:11 2022 +0000 fix pipeline for eval Signed-off-by: Yi Dong commit eee5d38218f26660c3ffebe9f615c850c80a1f0d Author: Yi Dong Date: Thu Oct 6 13:48:22 2022 +0000 fix for pipleline parallel Signed-off-by: Yi Dong commit 323bca73e7ef6099ee79c0a2fffac7b709ed6c5d Merge: 125e49947 e3b4c4d1f Author: Yi Dong Date: Wed Oct 5 19:29:13 2022 +0000 Merge branch 'universal_prompt' of github.com:NVIDIA/NeMo into universal_prompt commit 125e4994760448ff75dd9328395813eda1c87547 Author: Yi Dong Date: Wed Oct 5 19:29:04 2022 +0000 add share option Signed-off-by: Yi Dong commit e3b4c4d1f7346c9fa596f3cca6d4df0a9e05c368 Author: Yi 
Dong Date: Wed Oct 5 11:43:48 2022 -0700 make sure consolidation works Signed-off-by: Yi Dong commit a5c833964ecf05dc460ca1da69275c4019742150 Merge: 2a07ab52d abcb74be2 Author: Yi Dong Date: Wed Oct 5 18:40:29 2022 +0000 Merge branch 'universal_prompt' of github.com:NVIDIA/NeMo into universal_prompt commit 2a07ab52d95f15ba666823028c69e23825666c05 Author: Yi Dong Date: Wed Oct 5 18:40:23 2022 +0000 added requirement Signed-off-by: Yi Dong commit 3abecd9dd1611993a87c537636abe7f7e6a9b04c Author: Yi Dong Date: Wed Oct 5 18:39:42 2022 +0000 added a simple web server Signed-off-by: Yi Dong commit abcb74be2caf1cdec40eb9ba2be4dde4d45a3b4b Author: Yi Dong Date: Wed Oct 5 06:54:12 2022 -0700 fix empty val loss Signed-off-by: Yi Dong commit b8eb92ac4a0d665570af75e34c9ba3c2e2420c26 Author: Yi Dong Date: Tue Oct 4 19:25:30 2022 -0700 text gen working Signed-off-by: Yi Dong commit d59f3e3f3a6fd19736d1c5706fed65a3dd4049ba Author: Yi Dong Date: Tue Oct 4 16:08:40 2022 -0700 first change Signed-off-by: Yi Dong commit 59d077585e6962a669b824af58f64e8a0bea6547 Author: Yi Dong Date: Tue Oct 4 15:00:40 2022 -0700 revert Signed-off-by: Yi Dong commit 12a0f3902d99e9179403644bd951c045df716ca7 Author: Yi Dong Date: Tue Oct 4 21:26:23 2022 +0000 init imp Signed-off-by: Yi Dong commit 62a15dfd943cc48be495ac61b9f2f00995775c5f Merge: 82c90d2cd e0cc6b767 Author: Yi Dong Date: Tue Oct 4 11:58:26 2022 -0700 Merge branch 'main' into universal_prompt commit 82c90d2cd0fd156f16a4b899f8c741d598f33990 Author: Yi Dong Date: Tue Oct 4 11:17:13 2022 -0700 add sync Signed-off-by: Yi Dong commit 9819b703eef877d90cd1257bf3610c69de9b4d7e Author: Yi Dong Date: Sun Oct 2 17:52:34 2022 -0700 fix save model Signed-off-by: root commit e4937e2fc5fb7d70754c97668416e4a69c3079fe Author: Yi Dong Date: Sat Oct 1 18:56:09 2022 +0000 working Signed-off-by: Yi Dong commit b73b06d1c7cf5417a6d87cb33d8ed83a57e38b7b Author: Yi Dong Date: Sat Oct 1 17:34:03 2022 +0000 calcuate the mask Signed-off-by: Yi Dong commit 9db3bc13eb65a94a475b837603351da68e3745bc Author: Yi Dong Date: Fri Sep 30 23:26:32 2022 +0000 fix bug in datasets Signed-off-by: Yi Dong commit f289900375d4412f53f8110be00fec6587627550 Author: Yi Dong Date: Fri Sep 30 22:29:40 2022 +0000 update the code Signed-off-by: Yi Dong commit 8e28a1f208aabaab72dbe769e72756baada04d99 Author: Yi Dong Date: Fri Sep 30 21:52:52 2022 +0000 added new ds Signed-off-by: Yi Dong commit 8d41315bab7ce90e200a8a7d1023c34f8e046897 Author: Yi Dong Date: Fri Sep 30 18:57:09 2022 +0000 added new files Signed-off-by: Yi Dong commit 984e0e94e15e16323c1ba1ca2efeabd84f69463f Merge: cbe8b7ab1 fa6cd8588 Author: Yi Dong Date: Thu Sep 29 21:43:29 2022 +0000 Merge branch 'llm-prompt-learning-improvements' into universal_prompt commit fa6cd858839277939446afe7275976078d54c512 Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Thu Sep 29 16:47:30 2022 +0000 [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci commit 78ba46e5d6fde1be53c08e1e30a54cce59824be0 Merge: 7d6d46742 8d670bc77 Author: Virginia Adams <78445382+vadam5@users.noreply.github.com> Date: Thu Sep 29 09:43:27 2022 -0700 Merge branch 'main' into llm-prompt-learning-improvements commit 7d6d46742170a66758287a207d67e1b1bfd15613 Author: Virginia Adams Date: Thu Sep 29 16:42:43 2022 +0000 Removed inference step and added sentence peice check to predict step Signed-off-by: Virginia Adams commit 20fd265acd6f7f9912cf52155fe66ccfa6b201a2 Author: Virginia Adams Date: Thu Sep 29 15:26:32 2022 
+0000 fixed first stage check for pipeline parallel T5 pt Signed-off-by: Virginia Adams commit 3637be2b258c8d9028856f9971edb7da4a8121f0 Merge: a3ea722fd 986a76612 Author: Virginia Adams <78445382+vadam5@users.noreply.github.com> Date: Wed Sep 28 10:23:30 2022 -0700 Merge branch 'main' into llm-prompt-learning-improvements commit a3ea722fdc12fbcc5989b76ef5643a574b763bc4 Merge: 770967a52 971485ce7 Author: Virginia Adams <78445382+vadam5@users.noreply.github.com> Date: Mon Sep 26 13:35:52 2022 -0700 Merge branch 'main' into llm-prompt-learning-improvements commit 770967a5251a474b6dcc2d44bf9a2076adbcb604 Merge: d23bf6c30 e3ac280a8 Author: Virginia Adams <78445382+vadam5@users.noreply.github.com> Date: Mon Sep 26 10:17:03 2022 -0700 Merge branch 'main' into llm-prompt-learning-improvements commit d23bf6c30acc0e3f6af9b4e24547669866a34d62 Merge: de6a31651 333d2b749 Author: Virginia Adams Date: Mon Sep 26 10:05:16 2022 -0700 Merge branch 'llm-prompt-learning-improvements' of https://github.com/NVIDIA/NeMo into llm-prompt-learning-improvements commit de6a31651e63d88a42b971794d93f18ff5a3cdff Author: Virginia Adams Date: Mon Sep 26 17:00:53 2022 +0000 Updated PP check to be on first stage pipeline only Signed-off-by: Virginia Adams commit 333d2b7498e6742ce66436f733c980a74616900c Merge: 592c0986a a39fc925a Author: Virginia Adams <78445382+vadam5@users.noreply.github.com> Date: Fri Sep 23 16:11:21 2022 -0700 Merge branch 'main' into llm-prompt-learning-improvements commit 592c0986a476a91b57b8605d7b70830d7acfa021 Author: Virginia Adams Date: Fri Sep 23 23:08:41 2022 +0000 Fixed unused import and CI test bug Signed-off-by: Virginia Adams commit ea9cd82d85638bc60ae4ad7ef105db931c8e3455 Merge: ce4b72c8c b566c2d0e Author: Virginia Adams Date: Fri Sep 23 18:57:25 2022 +0000 Merge branch 'llm-prompt-learning-improvements' of https://github.com/NVIDIA/NeMo into llm-prompt-learning-improvements commit ce4b72c8c52f32be336e323dd78a38089edc3e7c Author: Virginia Adams Date: Fri Sep 23 18:57:16 2022 +0000 Switch to import from base class Signed-off-by: Virginia Adams commit b566c2d0e35a068f758fd1310bc620a47be4590b Merge: 6621f2854 e872061ac Author: Virginia Adams <78445382+vadam5@users.noreply.github.com> Date: Fri Sep 23 10:09:03 2022 -0700 Merge branch 'main' into llm-prompt-learning-improvements commit 6621f28543828a48484a5637f6c9f3ccb23a5b02 Author: Virginia Adams Date: Wed Sep 14 20:47:35 2022 +0000 python format fix Signed-off-by: Virginia Adams commit 8deafc8987b6af5f7b99a250310f57a40198c37f Author: Virginia Adams Date: Wed Sep 14 20:28:02 2022 +0000 Save .nemo on new best val score Signed-off-by: Virginia Adams commit 761bd36969cb465d6a129e9eee6ce1f883d3cf41 Author: Virginia Adams Date: Wed Sep 14 18:03:19 2022 +0000 Added automatic checkpoint to nemo file method Signed-off-by: Virginia Adams commit 3be4ed57b6cd3ddfe4876d78650dfe8fe794598b Author: Virginia Adams Date: Wed Sep 14 02:11:56 2022 +0000 Make GPT use base prompt learning model class: Signed-off-by: Virginia Adams Signed-off-by: Yi Dong * fix LGTM Signed-off-by: Yi Dong * fix validation Signed-off-by: Yi Dong * change for the lm eval Signed-off-by: Yi Dong * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make text generation work in data parallel environment Signed-off-by: Yi Dong * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * implement the service with rest service Signed-off-by: Yi Dong * [pre-commit.ci] auto fixes from pre-commit.com 
hooks for more information, see https://pre-commit.ci * surpress log Signed-off-by: Yi Dong * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: MaximumEntropy * Fix Signed-off-by: MaximumEntropy * Fixes Signed-off-by: MaximumEntropy * Update config Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Restore function needed for NMT Signed-off-by: MaximumEntropy * handles no answer only Signed-off-by: Yi Dong * Fix config Signed-off-by: MaximumEntropy * added knn to web Signed-off-by: Yi Dong * fix lgtm.com comments Signed-off-by: Yi Dong * output the retrieved context Signed-off-by: Yi Dong * allow no neighbor query Signed-off-by: Yi Dong * remove the imports Signed-off-by: Yi Dong * warn only once Signed-off-by: Yi Dong * Change output file format from JSON to JSONL Signed-off-by: MaximumEntropy * new t0 dataset Signed-off-by: Yi Dong * Add T0 data preproc scripts Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Merge and multiprocessing Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix for is_correct Signed-off-by: MaximumEntropy * fix epoch > 2 Signed-off-by: Yi Dong * handles multiple dataloader Signed-off-by: Yi Dong * remove template Signed-off-by: Yi Dong * Refactor T0 dataset Signed-off-by: MaximumEntropy * Add script to merge train folder into individual training files to minimize number of blends Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * added on the fly service Signed-off-by: Yi Dong * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add combo instance Signed-off-by: Yi Dong * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * added combo service Signed-off-by: Yi Dong * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * send weights back to server Signed-off-by: Yi Dong * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix index store Signed-off-by: Yi Dong * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Minor changes Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add reset button Signed-off-by: Yi Dong * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add add eos Signed-off-by: Yi Dong * use a seperate bert service Signed-off-by: Yi Dong * no loss of accuracy Signed-off-by: Yi Dong * pin the gradio version Signed-off-by: Yi Dong * Remove bin compat Signed-off-by: MaximumEntropy * Fix header lines Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * evaluate based on text generation Signed-off-by: Yi Dong * exact match result aggregation Signed-off-by: Yi Dong * working SP and SA Signed-off-by: Yi Dong * sync Signed-off-by: Yi Dong * fix checkpoint Signed-off-by: Yi Dong * fix 
eval Signed-off-by: Yi Dong * backup states Signed-off-by: Yi Dong * backup states reset Signed-off-by: Yi Dong * fix the bug Signed-off-by: Yi Dong * fix evaluation for sentence piece Signed-off-by: Yi Dong * fix a bug Signed-off-by: Yi Dong * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * potential fix in the future Signed-off-by: Yi Dong * remove the universal codes Signed-off-by: Yi Dong * remove universal strategy Signed-off-by: Yi Dong * address reviewer comment Signed-off-by: Yi Dong Signed-off-by: Yi Dong Signed-off-by: MaximumEntropy Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: MaximumEntropy Co-authored-by: Oleksii Kuchaiev Signed-off-by: Elena Rastorgueva * Add align function docstrings and make most args optional Signed-off-by: Elena Rastorgueva * Remove redundant returns of viterbi and log probs matrices Signed-off-by: Elena Rastorgueva * Rename h# to Signed-off-by: Elena Rastorgueva * Update manifest format description in README Signed-off-by: Elena Rastorgueva * always remove any spaces from utt_id Signed-off-by: Elena Rastorgueva * Patch the hanging of threads on very large stderr (#5589) (#5590) Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: smajumdar Co-authored-by: Somshubra Majumdar Signed-off-by: Elena Rastorgueva * O2 style amp for gpt3 ptuning (#5246) * enable amp o2 plugin Signed-off-by: Jimmy Zhang * only create master param if param requires gradient Signed-off-by: Jimmy Zhang * remove pytorch autocast Signed-off-by: Jimmy Zhang * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Jimmy Zhang * Update optimizer_with_main_params.py Signed-off-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> * create master grad only if param group requires grad Signed-off-by: Jimmy Zhang * fix grad scaler for pp > 1 Signed-off-by: Jimmy Zhang Signed-off-by: Jimmy Zhang Signed-off-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> Co-authored-by: Jimmy Zhang Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Oleksii Kuchaiev Signed-off-by: Elena Rastorgueva * Better patch hydra (#5591) (#5592) * Readd buffereing and thread drain to Hydra Launcher Signed-off-by: smajumdar * Readd buffereing and thread drain to Hydra Launcher Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: smajumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: smajumdar Co-authored-by: Somshubra Majumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Yet another fix with hydra multirun (#5594) (#5595) Signed-off-by: smajumdar Signed-off-by: smajumdar Signed-off-by: smajumdar Co-authored-by: Somshubra Majumdar Signed-off-by: Elena Rastorgueva * Add RETRO model documentation (#5578) * added retro doc Signed-off-by: Yi Dong * finish data part Signed-off-by: Yi Dong * added the data format Signed-off-by: Yi Dong * added training script Signed-off-by: Yi Dong * added training and evaluation steps Signed-off-by: Yi Dong * edit the text Signed-off-by: Yi Dong * added the images Signed-off-by: Yi Dong * fix beginning Signed-off-by: Yi Dong * fix the grammar Signed-off-by: Yi Dong * trim it down Signed-off-by: Yi 
Dong * add wandb option Signed-off-by: Yi Dong * add reference Signed-off-by: Yi Dong * fix path Signed-off-by: Yi Dong * added the parameters table Signed-off-by: Yi Dong * fix section Signed-off-by: Yi Dong Signed-off-by: Yi Dong Co-authored-by: Eric Harper Signed-off-by: Elena Rastorgueva * Fix: setup_multiple validation/test data (#5585) Fix: setup_multiple validation/test data (#5585) Signed-off-by: Ante Jukić Signed-off-by: Elena Rastorgueva * Move to optimizer based EMA implementation (#5169) * Move to optimizer Signed-off-by: SeanNaren * Fix replacing weights Signed-off-by: SeanNaren * Allow swapping of weights be optional Signed-off-by: SeanNaren * Save 2 models Signed-off-by: SeanNaren * Use different hook Signed-off-by: SeanNaren * Expose cpu device Signed-off-by: SeanNaren * Add clause to see if this fixes issue with O2 optimizer Signed-off-by: SeanNaren * Try to get O2 working Signed-off-by: SeanNaren * WIP Signed-off-by: SeanNaren * Fixes Signed-off-by: SeanNaren * Fixes to tests Signed-off-by: SeanNaren * Add guard Signed-off-by: SeanNaren * Remove import Signed-off-by: SeanNaren * Add guard Signed-off-by: SeanNaren * Add comment Signed-off-by: SeanNaren * Remove overwrite Signed-off-by: SeanNaren * Add BatchNorm, currently tests fail Signed-off-by: SeanNaren * Fix tests/functionality for batch norm Signed-off-by: SeanNaren * Get rid of NLP changes Signed-off-by: SeanNaren Signed-off-by: SeanNaren Signed-off-by: Elena Rastorgueva * AIStore for ASR datasets (#5462) AIStore for ASR datasets Signed-off-by: Ante Jukić Signed-off-by: Elena Rastorgueva * Add support for MHA adapters to ASR (#5396) * Convert AbstractAdapterModule to AbstractAdapterMixin Signed-off-by: smajumdar * Temporary fixes to new signature of mixin Signed-off-by: smajumdar * Add adapter util for constants, add all mha adapters. 
Signed-off-by: smajumdar * Update name of function Signed-off-by: smajumdar * Roll back changes to convASR Signed-off-by: smajumdar * Convert AbstractAdapterModule to AbstractAdapterMixin Signed-off-by: smajumdar * First draft of Conformer support for MHA attention Signed-off-by: smajumdar * Add some preliminary tests Signed-off-by: smajumdar * Add support for projection of the hidden dimension for attention Signed-off-by: smajumdar * Add support for squeezeformer Signed-off-by: smajumdar * Update train adapter config Signed-off-by: smajumdar * Add tests for squeezeformer and unit tests for new modules Signed-off-by: smajumdar * Update config for hp search,set limits on modules for conformer and squeezeformer, update adapter mixin, add cache to import_from_class_path Signed-off-by: smajumdar * Update location of adapters Signed-off-by: smajumdar * Add pre_norm for proper attention learning, Fix the issue with nan/inf in pos_bias_u and pos_bias_v Signed-off-by: smajumdar * Update expmanager to clean up checkpoints Signed-off-by: smajumdar * Fix style Signed-off-by: smajumdar * Add docstrings and update tests Signed-off-by: smajumdar * Add docstrings and update tests Signed-off-by: smajumdar * Add docstrings and update tests Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update training scripts Signed-off-by: smajumdar * Update config and docs Signed-off-by: smajumdar * Expose nemo delete function Signed-off-by: smajumdar * Correct adapter partial state saving Signed-off-by: smajumdar * Correct a bug with state management of adapter tokens Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Pull down EMA test Signed-off-by: smajumdar * Correct name of adapter module utility class Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: smajumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Remove unused TTS eval functions w/ pesq and pystoi dependencies (#5605) (#5606) Signed-off-by: Jocelyn Huang Signed-off-by: Jocelyn Huang Signed-off-by: Jocelyn Huang Co-authored-by: Jocelyn Signed-off-by: Elena Rastorgueva * Create separator parameter Signed-off-by: Elena Rastorgueva * Call align function with hydra config Signed-off-by: Elena Rastorgueva * update usage example Signed-off-by: Elena Rastorgueva * Update Dockerfile (#5614) (#5616) Pinned to use `numba==0.53.1` to avoid crashing in training with `num_workers > 0`. This is just a temporary workaround, still need to fix it in the future. 
Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Make separate pretrained_name and model_path parameters Signed-off-by: Elena Rastorgueva * make "optional" tags bold in markdown Signed-off-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Move non-main functions to utils dir Signed-off-by: Elena Rastorgueva * Temp workaround: Disable test with cache_audio=True since it is failing in CI (#5607) (#5615) Signed-off-by: Ante Jukić Co-authored-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * [TTS] fix ranges of char set for accented letters. (#5607) * [TTS] fix ranges of char set for accented letters. * remove digits pattern and added unit tests for math operators. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Change success message to reduce confusion (#5621) Signed-off-by: SeanNaren Signed-off-by: SeanNaren Signed-off-by: Elena Rastorgueva * Update documentation and tutorials for Adapters (#5610) * Improve docs for adapter and tests Signed-off-by: smajumdar * Improve docs for adapter and tests Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update test Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Rename test file Signed-off-by: smajumdar Signed-off-by: smajumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * [TTS] add type hints and change variable names for tokenizers and g2p (#5602) * [TTS] add type hints and change variable names for tokenizers and g2p Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * 1. Added missing import for gather_objects. (#5627) Signed-off-by: Micha Livne Signed-off-by: Micha Livne Co-authored-by: Micha Livne Signed-off-by: Elena Rastorgueva * [TTS][ZH] add fastpitch and hifigan model NGC urls and update NeMo docs. (#5596) (#5625) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Fixed RadTTS unit test (#5572) Signed-off-by: Boris Fomitchev Signed-off-by: Boris Fomitchev Signed-off-by: Elena Rastorgueva * remove tests (#5633) Signed-off-by: ericharper Signed-off-by: ericharper Signed-off-by: Elena Rastorgueva * [TTS][DOC] add notes about automatic conversion to target sampling rates.
(#5624) (#5634) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Conformer local attention (#5525) * local attn and merge Signed-off-by: sam1373 * optional Signed-off-by: sam1373 * override Signed-off-by: sam1373 * incorporate comments Signed-off-by: sam1373 * update Signed-off-by: sam1373 * fix Signed-off-by: sam1373 * comment Signed-off-by: sam1373 * changes, test Signed-off-by: sam1373 * changes Signed-off-by: sam1373 * check att context Signed-off-by: sam1373 * readme link Signed-off-by: sam1373 * utils Signed-off-by: sam1373 * update Signed-off-by: sam1373 Signed-off-by: sam1373 Signed-off-by: Samuel Kriman Co-authored-by: Vahid Noroozi Signed-off-by: Elena Rastorgueva * Add core classes and functions for online clustering diarizer part 1 (#5526) * Add core classes and functions for online clustering diarizer Signed-off-by: Taejin Park * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add audio to labels code Signed-off-by: Taejin Park * resolve type errors Signed-off-by: Taejin Park * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * added unit=tests for very short audio Signed-off-by: Taejin Park * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Filled all missing docstrings Signed-off-by: Taejin Park * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolved conflict and added missing docstrings Signed-off-by: Taejin Park * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed unit-test errors Signed-off-by: Taejin Park * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix the wrongly added file - megatron_gpt_model.py Signed-off-by: Taejin Park * Fix wrongly included file - megatron_gpt_model.py Signed-off-by: Taejin Park * resolve code quality issue Signed-off-by: Taejin Park * Fixed unit-test errors and bugs Signed-off-by: Taejin Park * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * changed total_sec for offline_clustering toy_data in unit-tests Signed-off-by: Taejin Park * fixed merging index offset bug Signed-off-by: Taejin Park * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * only including part 1 files Signed-off-by: Taejin Park * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * removed unused function Signed-off-by: Taejin Park * fixed unused imports Signed-off-by: Taejin Park * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * divided nmesc_clustering.py into two and reflected first-pass comments Signed-off-by: Taejin Park * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * adding offline/online_clustering.py Signed-off-by: Taejin Park * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix code QL autocomment Signed-off-by: Taejin Park * Removed unused imports Signed-off-by: Taejin Park * Update nemo/collections/asr/parts/utils/online_clustering.py Co-authored-by: Sean Naren Signed-off-by: Taejin Park * Reflected comments Signed-off-by: Taejin Park * [pre-commit.ci] auto 
fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolved code scanning issue Signed-off-by: Taejin Park * Update nemo/collections/asr/parts/utils/offline_clustering.py Co-authored-by: Sean Naren Signed-off-by: Taejin Park Signed-off-by: Taejin Park Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Nithin Rao Co-authored-by: Sean Naren Signed-off-by: Elena Rastorgueva * [STT] Add Esperanto (Eo) ASR Conformer-CTC and Conformer-Transducer models (#5639) (#5641) * add stt_eo_conformer_ctc_large model * stt_eo_conformer_transducer_large Co-authored-by: Andrei Andrusenko <52885736+andrusenkoau@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Removed unused import Signed-off-by: Elena Rastorgueva * Specify that filepaths need to be absolute Signed-off-by: Elena Rastorgueva * replaces any spaces in utt_id with dashes Signed-off-by: Elena Rastorgueva * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Elena Rastorgueva * Make hydra script callable by another script Signed-off-by: Elena Rastorgueva * do not specify default model or model_downsample_factor Signed-off-by: Elena Rastorgueva * [Dockerfile] Remove AIS archive from docker image (#5629) Signed-off-by: Ante Jukić Signed-off-by: Elena Rastorgueva * Measure audio_sr from audio instead of needing to specify Signed-off-by: Elena Rastorgueva * [TTS][ZH] Disambiguate polyphones with augmented dict and Jieba segmenter for Chinese FastPitch (#5541) * Chinese TTS replaces default pypinyin dict * Add jieba word segmenter as an option Signed-off-by: Yuekai Zhang Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Make separate parameters for device of transcription and viterbi steps Signed-off-by: Elena Rastorgueva * Add mention of gecko Signed-off-by: Elena Rastorgueva * [workflow] add exclude labels option to ignore cherry-picks in release changelog. (#5645) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * [TTS][ZH] bugfix for the tutorial and add NGC CLI installation guide. (#5643) (#5647) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * [Add] ASR+VAD Inference Pipeline (#5575) Added offline ASR+VAD inference pipeline that matches with what's in RIVA, along with some feature-based ASR and classification datasets. 
Signed-off-by: stevehuang52 Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * rename separator to ctm_grouping_separator and refactor Signed-off-by: Elena Rastorgueva * Bert interleaved (#5556) * Adding SP and SAR support Bert * Adding Sequence parallel support to Bert * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Adding Sequence parallel support to Bert * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Adding SP and SAR support Bert * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Adding SP and SAR support Bert * Adding SP and SAR support Bert * Adding Sequence parallel support to Bert * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Adding Sequence parallel support to Bert * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Adding Sequence parallel support to Bert * Update bert_model.py Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> * Adding tests * Adding interleaved pipeline parallelism * Adding interleaved pipeline parallelism * Adding interleaved pipeline parallelism * Adding interleaved pipeline parallelism * Adding interleaved pipeline parallelism * Adding interleaved pipeline parallelism * Adding interleaved pipeline parallelism * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Addressing Eric's comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Addressing Eric's comments * Fix bug fix sequence parallel and Interleaved * Fix bug fix sequence parallel and Interleaved Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Oleksii Kuchaiev Co-authored-by: Eric Harper Signed-off-by: Elena Rastorgueva * Add duration padding support for RADTTS inference (#5650) * Added duration padding support for RADTTS inference * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Co-authored-by: Kevin Shih Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Add remove_blank_tokens_from_ctm parameter Signed-off-by: Elena Rastorgueva * Dont save initial_silence line in CTM Signed-off-by: Elena Rastorgueva * Add DLLogger support to exp_manager (#5658) * Add DLLogger support to exp_manager Signed-off-by: Alexandre Milesi * Move dllogger to separate file and check import Signed-off-by: Alexandre Milesi * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unused import Signed-off-by: Alexandre Milesi Signed-off-by: Alexandre Milesi Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper Signed-off-by: Elena Rastorgueva * add minimum_timestamp_duration parameter Signed-off-by: Elena Rastorgueva * add suggestion about removing blanks to README Signed-off-by: Elena Rastorgueva * reorder args Signed-off-by: Elena Rastorgueva * clarify description of ctm_grouping_separator in README Signed-off-by: Elena Rastorgueva * update docstring Signed-off-by: Elena 
Rastorgueva * [TTS][ZH] bugfix for ngc cli installation. (#5652) (#5664) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Port stateless timer to exp manager (#5584) * Port stateless timer to exp manager Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixes and remove from all megatron code Signed-off-by: MaximumEntropy * Fixes Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Change message Signed-off-by: MaximumEntropy Signed-off-by: MaximumEntropy Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Fix EMA restart by allowing device to be set by the class init (#5668) Signed-off-by: SeanNaren Signed-off-by: SeanNaren Signed-off-by: Elena Rastorgueva * Remove SDP (moved to separate repo) - merge to main (#5630) * Remove sdp files from tools folder Signed-off-by: Elena Rastorgueva * Add page to docs with new SDP location Signed-off-by: Elena Rastorgueva Signed-off-by: Elena Rastorgueva * Add interface for making amax reduction optional for FP8 (#5447) * add TE interface for making amax reduction optional Signed-off-by: Kirthi Shankar Sivamani * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Kirthi Shankar Sivamani Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper Signed-off-by: Elena Rastorgueva * [TTS] add tts dict cust notebook (#5662) * add tts dict cust notebook Signed-off-by: ekmb * review Signed-off-by: ekmb * fixed audio links Signed-off-by: ekmb * remove old notebook Signed-off-by: ekmb * fix typo Signed-off-by: ekmb Signed-off-by: ekmb Signed-off-by: Elena Rastorgueva * [ASR] Audio processing base, multi-channel enhancement models (#5356) * Audio processing base model, enc-mask-dec enhancement, tests and modules Signed-off-by: Ante Jukić * Addressed review comments Signed-off-by: Ante Jukić * Fixed CodeQL warnings Signed-off-by: Ante Jukić * Addressed PR comments Signed-off-by: Ante Jukić * Addressed PR comments: - renamed AudioProcessingModel to AudioToAudioModel - various small modifications - updated unit tests Signed-off-by: Ante Jukić * Addressed comments - Moved spectrogram to audio_preprocessing - Renamed MultichannelFeatures - Updated config and unit tests Signed-off-by: Ante Jukić Signed-off-by: Ante Jukić Signed-off-by: Elena Rastorgueva * Expose ClusteringDiarizer device (#5681) * Expose device for users to set Signed-off-by: SeanNaren * Expose device for users to set Signed-off-by: SeanNaren Signed-off-by: SeanNaren Signed-off-by: Elena Rastorgueva * Add Beam Search support to ASR transcribe() (#5443) * Add support for beam decoding via high level API. 
Signed-off-by: smajumdar * Add ctc decoding section Signed-off-by: smajumdar * Update ctc transcribe API to return results from beam search Signed-off-by: smajumdar * Add argument to preserve arpa file Signed-off-by: smajumdar * Update script to use hydra config, add some support for future compute timesteps, add doc for ctc decoding Signed-off-by: smajumdar * Update eval script and doc to use new API Signed-off-by: smajumdar * Add tests for ctc greedy decoding Signed-off-by: smajumdar * Address reviewer comments and add docstrings Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix changes and address comments Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: smajumdar Co-authored-by: Samuel Kriman Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Propagate attention_dropout flag for GPT-3 (#5669) * Propagate attention_dropout flag for GPT-3 Signed-off-by: Mikołaj Błaż * Add default to megatron_gpt_config Signed-off-by: Mikołaj Błaż Signed-off-by: Mikołaj Błaż Co-authored-by: Oleksii Kuchaiev Co-authored-by: Eric Harper Signed-off-by: Elena Rastorgueva * Enc-Dec model size reporting fixes (#5623) * Update for enc-dec models Signed-off-by: MaximumEntropy * Fix for bert as well Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix for PP Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: MaximumEntropy Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Multiblank Transducer (#5527) * multi-blank transducers Signed-off-by: Hainan Xu * one line bug fix Signed-off-by: Hainan Xu * change interface of RNNTDecoding class to extract num-extra-output from joint instead of constructor Signed-off-by: Hainan Xu * addressed PR comments Signed-off-by: Hainan Xu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Hainan Xu Co-authored-by: Hainan Xu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * [TTS][ZH] fix broken link for the script. (#5680) * change to main branch. 
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * [TN/TTS docs] TN customization, g2p docs moved to tts (#5683) * TN customization, g2p docs moved to tts Signed-off-by: ekmb * link new TTS tutorial Signed-off-by: ekmb * combine 3 and 4 Signed-off-by: ekmb * remove note Signed-off-by: ekmb Signed-off-by: ekmb Signed-off-by: Elena Rastorgueva * Add prompt learning tests (#5649) * patch to allow using tokenizers without additional_special_tokens_ids attribute Signed-off-by: arendu * added gpt prompt learning and t5 prompt learning, made them run one after the other Signed-off-by: arendu * fixed changes Signed-off-by: arendu * gave unique names Signed-off-by: arendu * num workers set to 0 Signed-off-by: arendu * fixes to make num_workers>0 fast by using persistent_workers flag in dataloaders Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updated to num_workers 8 Signed-off-by: arendu * updates to make num_workers arg in gpt/t5 inference/training work Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * style fix Signed-off-by: arendu * add num_workers arg in jenkins Signed-off-by: arendu * bs fix Signed-off-by: arendu * numworkers > 0 added for gpt prompt learning eval Signed-off-by: arendu * added num_workers Signed-off-by: arendu Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper Signed-off-by: Elena Rastorgueva * remove output (#5689) (#5690) Signed-off-by: ericharper Signed-off-by: ericharper Signed-off-by: ericharper Co-authored-by: Eric Harper Signed-off-by: Elena Rastorgueva * Minor fixes (#5691) Signed-off-by: MaximumEntropy Signed-off-by: MaximumEntropy Signed-off-by: Elena Rastorgueva * temp disable speaker reco CI (#5696) Signed-off-by: fayejf Signed-off-by: fayejf Signed-off-by: Elena Rastorgueva * some tokenizers do not have additional_special_tokens_ids attribute (#5642) (#5648) Signed-off-by: arendu Signed-off-by: arendu Signed-off-by: arendu Co-authored-by: Adi Renduchintala <108822655+arendu@users.noreply.github.com> Co-authored-by: Eric Harper Signed-off-by: Elena Rastorgueva * Bump setuptools from 59.5.0 to 65.5.1 in /requirements (#5704) Bumps [setuptools](https://github.com/pypa/setuptools) from 59.5.0 to 65.5.1. - [Release notes](https://github.com/pypa/setuptools/releases) - [Changelog](https://github.com/pypa/setuptools/blob/main/CHANGES.rst) - [Commits](https://github.com/pypa/setuptools/compare/v59.5.0...v65.5.1) --- updated-dependencies: - dependency-name: setuptools dependency-type: direct:production ... Signed-off-by: dependabot[bot] Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Merge 1.14.0 main (#5705) * update branch Signed-off-by: ericharper * [TTS][ZH] fix broken link for the script. 
(#5666) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * update readme Signed-off-by: ericharper * update branch Signed-off-by: ericharper * update package info Signed-off-by: ericharper * unpin lightning Signed-off-by: ericharper Signed-off-by: ericharper Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Don't print exp_manager warning when max_steps == -1 (#5725) Signed-off-by: Alexandre Milesi Signed-off-by: Elena Rastorgueva * pin torchmetrics version (#5720) * fix torchmetrics version Signed-off-by: nithinraok * add lower bound Signed-off-by: nithinraok Signed-off-by: nithinraok Signed-off-by: Elena Rastorgueva * Update to pytorch 22.12 container (#5694) * update to pytorch 22.12 container Signed-off-by: ericharper * please fix waveglow export in 22.12 container Signed-off-by: ericharper * Update torch.stft() calls due to deprecation of return_complex=False (#5729) Signed-off-by: Jocelyn Huang Signed-off-by: Jocelyn Huang * Update ASR torch.stft() call to use return_complex=True (#5730) Signed-off-by: Jocelyn Huang Signed-off-by: Jocelyn Huang Signed-off-by: ericharper Signed-off-by: Jocelyn Huang Co-authored-by: Jocelyn Signed-off-by: Elena Rastorgueva * add keep_initializers_as_inputs to _export method (#5731) Signed-off-by: Patrick Simianer Signed-off-by: Patrick Simianer Signed-off-by: Elena Rastorgueva * added tab former doc to the index page (#5733) Signed-off-by: Yi Dong Signed-off-by: Yi Dong Signed-off-by: Elena Rastorgueva * ALiBi Positional Embeddings (#5467) * 1. Working on alibi positional embeddings. Signed-off-by: Micha Livne * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 1. Debugging. Signed-off-by: Micha Livne * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 1. Debugging. Signed-off-by: Micha Livne * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 1. Debugging. Signed-off-by: Micha Livne * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 1. Added encoder and decoder alibi classes. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 1. Simplified code. 2. Added bidirectional support. Signed-off-by: Micha Livne * 1. Added support in config to alibi. Signed-off-by: Micha Livne * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 1. Added Jenkins tests. Signed-off-by: Micha Livne * 1. Added missing file. Signed-off-by: Micha Livne * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. Signed-off-by: Micha Livne * 1. Debugging. 
Signed-off-by: Micha Livne Signed-off-by: Micha Livne Co-authored-by: Micha Livne Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Oleksii Kuchaiev Co-authored-by: Eric Harper Signed-off-by: Elena Rastorgueva * Ensure EMA checkpoints are also deleted when normal checkpoints are (#5724) * Ensure EMA checkpoints are also deleted when normal checkpoints are Signed-off-by: SeanNaren * Simplify test Signed-off-by: SeanNaren * Remove comment Signed-off-by: SeanNaren * Fix bug where `save_best_model` caused a crash Signed-off-by: SeanNaren * Swap to logging only on rank 0 Signed-off-by: SeanNaren Signed-off-by: SeanNaren Signed-off-by: Elena Rastorgueva * Fix P-Tuning Truncation (#5663) * untokenize truncated field Signed-off-by: Virginia Adams * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Updated truncation method arugments Signed-off-by: Virginia Adams * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Virginia Adams Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Oleksii Kuchaiev Signed-off-by: Elena Rastorgueva * Update 00_NeMo_Primer.ipynb (#5740) Fixed a minor typo in primer tutorial. Signed-off-by: schaltung Signed-off-by: schaltung Signed-off-by: Elena Rastorgueva * Support non-standard padding token id (#5543) * Support non-standard padding token id Read the id of the padding token from the tokenizer when creating the embedding, rather than always defaulting to 0. This allows use of (admittedly bizarre) non-standard tokenizer models that don't give the padding token the id 0. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Co-authored-by: Numeri Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Sandeep Subramanian Signed-off-by: Elena Rastorgueva * typo and link fixed (#5741) (#5744) Signed-off-by: ekmb Signed-off-by: ekmb Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * link fixed (#5745) (#5746) Signed-off-by: ekmb Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * [TTS] Update Spanish TTS model to 1.15 (#5742) Signed-off-by: Ryan Signed-off-by: Elena Rastorgueva * Fix for incorrect computation of batched alignment in transducers (#5692) * Fix rnnt alignment bug and add test Signed-off-by: Igor Gitman * Add tests/fixes for more decoding configurations Signed-off-by: Igor Gitman * Add tests/fixes for frame confidence computation Signed-off-by: Igor Gitman * Rename test file to avoid local execution Signed-off-by: Igor Gitman * Add test to jenkinsfile Signed-off-by: Igor Gitman * Proper fix for alignments + remove code duplication Signed-off-by: Igor Gitman * Return back separate mask processing Signed-off-by: Igor Gitman * Override cleanup fixture Signed-off-by: Igor Gitman * Add a TODO for multiblank RNNT Signed-off-by: Igor Gitman Signed-off-by: Igor Gitman Signed-off-by: Elena Rastorgueva * Move Attention and MLP classes to a separate file in Megatron transformers (#5453) * Move attention and mlp to separate files Signed-off-by: MaximumEntropy * Add new attention and mlp files Signed-off-by: MaximumEntropy * Fix import in tests Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see 
https://pre-commit.ci * Remove unused imports in attention Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix missing import Signed-off-by: MaximumEntropy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by: MaximumEntropy Signed-off-by: MaximumEntropy Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper Co-authored-by: Oleksii Kuchaiev Signed-off-by: Elena Rastorgueva * Adithyare/prompt learning seed (#5749) * patch to allow using tokenizers without additional_special_tokens_ids attribute Signed-off-by: arendu * seeding for param-efficient learning methods Signed-off-by: arendu * seeding the datasampler Signed-off-by: arendu * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * removed seed_everything Signed-off-by: arendu Signed-off-by: arendu Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Set the stream position to 0 for pydub (#5752) Signed-off-by: Jonghwan Hyeon Signed-off-by: Jonghwan Hyeon Signed-off-by: Elena Rastorgueva * Fix: conformer encoder forward when length is None (#5761) Signed-off-by: Ante Jukić Signed-off-by: Elena Rastorgueva * Update Tacotron2 NGC checkpoint load to latest version (#5760) (#5762) Signed-off-by: Jocelyn Huang Signed-off-by: Elena Rastorgueva * [TTS][DE] refine grapheme-based tokenizer and fastpitch training recipe on thorsten's neutral datasets. (#5753) * refine GermanCharsTokenizer to support only graphemes as inputs by removing sentence-level phoneme representation; * refine GermanCharsTokenizer to preserve mixed cases from the original input graphemes; * add a new Thorsten's 22.10 dataset; * revise thorsten voice neutral datasets preparation script to support two versions of thorsten's voice datasets in a single script; Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Refactor so token, word and additonal segment-level alignments are generated in the same run Signed-off-by: Elena Rastorgueva * change CTM rounding to remove unnecessary decimal figures Signed-off-by: Elena Rastorgueva * Move obtaining start and end of batch line IDs to separate util function Signed-off-by: Elena Rastorgueva * Sanitize params before DLLogger log_hyperparams (#5736) * Sanitize params before DLLogger log_hyperparams Signed-off-by: Alexandre Milesi * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Alexandre Milesi Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper Signed-off-by: Elena Rastorgueva * Allow to run alignment on transcribed pred_text Signed-off-by: Elena Rastorgueva * Update README Signed-off-by: Elena Rastorgueva * update README Signed-off-by: Elena Rastorgueva * Rename output_ctm_folder to output_dir Signed-off-by: Elena Rastorgueva * rename n_parts_for_ctm to audio_filepath_parts_in_utt_id Signed-off-by: Elena Rastorgueva * Rename some variables to improve readability Signed-off-by: Elena Rastorgueva * move constants to separate file Signed-off-by: Elena Rastorgueva * Add extra data args to support proper finetuning of HF converted T5 checkpoints (#5719) * Initial addition of extra args Signed-off-by: MaximumEntropy * 
[pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Change defaults Signed-off-by: MaximumEntropy Signed-off-by: MaximumEntropy Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Rename some functions Signed-off-by: Elena Rastorgueva * update year Signed-off-by: Elena Rastorgueva * No-script TS export, prepared for ONNX export (#5653) * Changed unfold to reshape, merged padding chenges * Almost working ONNX export of RadTTS * restored radtts function * Added explicit assume_padded flag * Fixing attn_mask * Fixing unfold * Trying no hx * Back with hx * Made fx only for tracing * Tests annotated * Fully working no-script TS export, prepared for ONNX export * Restored no-autocast block, addressed code review * Fine-tuning autocast option * Protecting InstanceNorm * Forcing eval and param freeze on export Signed-off-by: Boris Fomitchev Signed-off-by: Elena Rastorgueva * ASR evaluator (#5728) * backbone Signed-off-by: fayejf * engineer and analyzer Signed-off-by: fayejf * offline_by_chunked Signed-off-by: fayejf * test_ds wip Signed-off-by: fayejf * temp remove inference Signed-off-by: fayejf * mandarin yaml Signed-off-by: fayejf * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * augmentor and a few updates Signed-off-by: fayejf * address alerts and revert unnecessary changes Signed-off-by: fayejf * Add readme Signed-off-by: fayejf * rename Signed-off-by: fayejf * typo fix Signed-off-by: fayejf * small fix Signed-off-by: fayejf * add missing header Signed-off-by: fayejf * rename augmentor_config to augmentor Signed-off-by: fayejf * raname inference_mode to inference Signed-off-by: fayejf * move utils.py Signed-off-by: fayejf * update temp file Signed-off-by: fayejf * make wer cer clear Signed-off-by: fayejf * seed_everything Signed-off-by: fayejf * fix missing rn augmentor_config in rnnt Signed-off-by: fayejf * fix rnnt transcribe Signed-off-by: fayejf * add more docstring and style fix Signed-off-by: fayejf * address codeQL Signed-off-by: fayejf * reflect comments Signed-off-by: fayejf * update readme Signed-off-by: fayejf * clearer Signed-off-by: fayejf Signed-off-by: fayejf Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * Docs g2p update (#5769) (#5775) * links update, riva docs link Signed-off-by: ekmb Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * adding back tar script for decoder dataset for duplex (#5773) * adding back tar script for decoder dataset for duplex Signed-off-by: Yang Zhang * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Yang Zhang Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * [ASR][Test] Enable test for cache audio with a single worker (#5763) Signed-off-by: Ante Jukić Signed-off-by: Ante Jukić Signed-off-by: Elena Rastorgueva * Fixing masking in RadTTS bottleneck layer (#5771) * Fixing masking in RadTTS bottleneck layer Signed-off-by: Boris Fomitchev Signed-off-by: Elena Rastorgueva * Update torchaudio dependency version for tutorials (#5781) (#5782) Signed-off-by: smajumdar 
Co-authored-by: Somshubra Majumdar Signed-off-by: Elena Rastorgueva * [TTS][ZH] bugfix import jieba errors. (#5776) (#5784) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Elena Rastorgueva * fix typos Signed-off-by: Elena Rastorgueva * update requirements.txt Signed-off-by: Elena Rastorgueva * Make default devices None and set to GPU if it is available Signed-off-by: Elena Rastorgueva * add warning for non-zero minimum_timestamp_duration Signed-off-by: Elena Rastorgueva * Clarify phrasing in README regarding raising error if pred_text exists Signed-off-by: Elena Rastorgueva * Update README section on evaluating alignment accuracy Signed-off-by: Elena Rastorgueva * fix some code in creating segments Signed-off-by: Elena Rastorgueva * Add some unit tests for NFA boundary_info creation Signed-off-by: Elena Rastorgueva * Added test for function adding t_start and t_end Signed-off-by: Elena Rastorgueva * add comments to get_y_and_boundary_info_for_utt and remove redundant variables Signed-off-by: Elena Rastorgueva * add comments to get_batch_tensors_and_boundary_info Signed-off-by: Elena Rastorgueva * Add comments to make_output_files.py Signed-off-by: Elena Rastorgueva * add comments to viterbi decoding code Signed-off-by: Elena Rastorgueva * Add copyright headers Signed-off-by: Elena Rastorgueva * Change req to nemo_toolkit[all] Signed-off-by: Elena Rastorgueva Signed-off-by: ericharper Signed-off-by: Elena Rastorgueva Signed-off-by: David Mosallanezhad Signed-off-by: Boris Fomitchev Signed-off-by: Markel Sanz Ausin Signed-off-by: MaximumEntropy Signed-off-by: Oleksii Kuchaiev Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: George Zelenfroynd Signed-off-by: nithinraok Signed-off-by: Ante Jukić Signed-off-by: Yi Dong Signed-off-by: smajumdar Signed-off-by: Jimmy Zhang Signed-off-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> Signed-off-by: SeanNaren Signed-off-by: Jocelyn Huang Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Signed-off-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com> Signed-off-by: Micha Livne Signed-off-by: sam1373 Signed-off-by: Samuel Kriman Signed-off-by: Taejin Park Signed-off-by: Yuekai Zhang Signed-off-by: stevehuang52 Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Signed-off-by: Alexandre Milesi Signed-off-by: Kirthi Shankar Sivamani Signed-off-by: ekmb Signed-off-by: Mikołaj Błaż Signed-off-by: Hainan Xu Signed-off-by: arendu Signed-off-by: fayejf Signed-off-by: Patrick Simianer Signed-off-by: Virginia Adams Signed-off-by: schaltung Signed-off-by: Ryan Signed-off-by: Igor Gitman Signed-off-by: Jonghwan Hyeon Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> Signed-off-by: Yang Zhang Co-authored-by: Eric Harper Co-authored-by: David Co-authored-by: David Mosallanezhad Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Boris Fomitchev Co-authored-by: Oleksii Kuchaiev Co-authored-by: Markel Sanz Ausin Co-authored-by: Sandeep Subramanian Co-authored-by: Oleksii Kuchaiev Co-authored-by: Boris Fomitchev Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Nithin Rao Co-authored-by: Taejin Park Co-authored-by: anteju 
<108555623+anteju@users.noreply.github.com> Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Somshubra Majumdar Co-authored-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> Co-authored-by: Jimmy Zhang Co-authored-by: Sean Naren Co-authored-by: Jocelyn Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Co-authored-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Co-authored-by: Micha Livne Co-authored-by: Micha Livne Co-authored-by: Samuel Kriman Co-authored-by: Vahid Noroozi Co-authored-by: Andrei Andrusenko <52885736+andrusenkoau@users.noreply.github.com> Co-authored-by: Yuekai Zhang Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com> Co-authored-by: kevjshih Co-authored-by: Kevin Shih Co-authored-by: milesial Co-authored-by: Kirthi Shankar Sivamani Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com> Co-authored-by: mikolajblaz Co-authored-by: Hainan Xu Co-authored-by: Hainan Xu Co-authored-by: Adi Renduchintala <108822655+arendu@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: pks Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com> Co-authored-by: schaltung Co-authored-by: Kaden Uhlig Co-authored-by: Numeri Co-authored-by: Ryan Langman Co-authored-by: Igor Gitman Co-authored-by: Jonghwan Hyeon Co-authored-by: Yang Zhang --- tools/nemo_forced_aligner/README.md | 84 ++++ tools/nemo_forced_aligner/align.py | 287 +++++++++++++ tools/nemo_forced_aligner/requirements.txt | 2 + .../test_add_t_start_end_to_boundary_info.py | 121 ++++++ .../test_get_y_and_boundary_info_for_utt.py | 158 +++++++ tools/nemo_forced_aligner/utils/constants.py | 19 + tools/nemo_forced_aligner/utils/data_prep.py | 385 ++++++++++++++++++ .../utils/make_output_files.py | 210 ++++++++++ .../utils/viterbi_decoding.py | 136 +++++++ 9 files changed, 1402 insertions(+) create mode 100644 tools/nemo_forced_aligner/README.md create mode 100644 tools/nemo_forced_aligner/align.py create mode 100644 tools/nemo_forced_aligner/requirements.txt create mode 100644 tools/nemo_forced_aligner/tests/test_add_t_start_end_to_boundary_info.py create mode 100644 tools/nemo_forced_aligner/tests/test_get_y_and_boundary_info_for_utt.py create mode 100644 tools/nemo_forced_aligner/utils/constants.py create mode 100644 tools/nemo_forced_aligner/utils/data_prep.py create mode 100644 tools/nemo_forced_aligner/utils/make_output_files.py create mode 100644 tools/nemo_forced_aligner/utils/viterbi_decoding.py diff --git a/tools/nemo_forced_aligner/README.md b/tools/nemo_forced_aligner/README.md new file mode 100644 index 0000000000000..1f96eba988871 --- /dev/null +++ b/tools/nemo_forced_aligner/README.md @@ -0,0 +1,84 @@ +# NeMo Forced Aligner (NFA) + +A tool for doing Forced Alignment using Viterbi decoding of NeMo CTC-based models. + +## Usage example + +``` bash +python /tools/nemo_forced_aligner/align.py \ + pretrained_name="stt_en_citrinet_1024_gamma_0_25" \ + model_downsample_factor=8 \ + manifest_filepath= \ + output_dir= +``` + +## How do I use NeMo Forced Aligner? +To use NFA, all you need to provide is a correct NeMo manifest (with `"audio_filepath"` and `"text"` fields). 
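For instance, a manifest with two utterances could be produced with a short Python snippet like the one below (a minimal sketch; the audio paths and transcriptions are placeholders you would replace with your own data):

```python
import json

# Hypothetical utterances: absolute audio paths and their ground-truth transcriptions.
utterances = [
    {"audio_filepath": "/data/audio/utt1.wav", "text": "hi world"},
    {"audio_filepath": "/data/audio/utt2.wav", "text": "hey"},
]

# NFA expects one JSON object per line ("JSON lines" format).
with open("manifest.json", "w", encoding="utf-8") as f:
    for utt in utterances:
        f.write(json.dumps(utt) + "\n")
```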
+ +Call the `align.py` script, specifying the parameters as follows: + +* `pretrained_name`: string specifying the name of a CTC NeMo ASR model which will be automatically downloaded from NGC and used for generating the log-probs which we will use to do alignment. Any QuartzNet, Citrinet, or Conformer CTC model should work, in any language (only English has been tested so far). If `model_path` is specified, `pretrained_name` must not be specified. +>Note: NFA can only use CTC models (not Transducer models) at the moment. If you want to transcribe a long audio file (longer than ~5-10 mins), do not use a Conformer CTC model, as that will likely give Out Of Memory errors. + +* `model_path`: string specifying the local filepath to a CTC NeMo ASR model which will be used to generate the log-probs which we will use to do alignment. If `pretrained_name` is specified, `model_path` must not be specified. +>Note: NFA can only use CTC models (not Transducer models) at the moment. If you want to transcribe a long audio file (longer than ~5-10 mins), do not use a Conformer CTC model, as that will likely give Out Of Memory errors. + +* `model_downsample_factor`: the downsample factor of the ASR model. It should be 2 if your model is QuartzNet, 4 if it is Conformer CTC, 8 if it is Citrinet. + +* `manifest_filepath`: The path to the manifest of the data you want to align, containing `'audio_filepath'` and `'text'` fields. The audio filepaths need to be absolute paths. + +* `output_dir`: The folder in which to save the CTM files containing the generated alignments and the new JSON manifest containing paths to those CTM files. There will be one CTM file per utterance (i.e. one CTM file per line in the manifest). The files will be called `<output_dir>/{tokens,words,additional_segments}/<utt_id>.ctm` and each line in each file will start with `<utt_id>`. By default, `utt_id` will be the stem of the audio_filepath. This can be changed by overriding `audio_filepath_parts_in_utt_id`. The new JSON manifest will be at `<output_dir>/<original manifest file name>_with_ctm_paths.json`. + +* **[OPTIONAL]** `align_using_pred_text`: if True, will transcribe the audio using the ASR model (specified by `pretrained_name` or `model_path`) and then use that transcription as the 'ground truth' for the forced alignment. The `"pred_text"` will be saved in the output JSON manifest at `<output_dir>/{original manifest name}_with_ctm_paths.json`. To avoid overwriting other transcribed texts, if there are already `"pred_text"` entries in the original manifest, the program will exit without attempting to generate alignments. (Default: False). + +* **[OPTIONAL]** `transcribe_device`: The device that will be used for generating log-probs (i.e. transcribing). If None, NFA will set it to 'cuda' if it is available (otherwise will set it to 'cpu'). If specified, `transcribe_device` needs to be a string that can be input to the `torch.device()` method. (Default: `None`). + +* **[OPTIONAL]** `viterbi_device`: The device that will be used for doing Viterbi decoding. If None, NFA will set it to 'cuda' if it is available (otherwise will set it to 'cpu'). If specified, `viterbi_device` needs to be a string that can be input to the `torch.device()` method. (Default: `None`). + +* **[OPTIONAL]** `batch_size`: The batch size that will be used for generating log-probs and doing Viterbi decoding. (Default: 1). + +* **[OPTIONAL]** `additional_ctm_grouping_separator`: the string used to separate CTM segments if you want to obtain CTM files at a level that is not the token level or the word level.
NFA will always produce token-level and word-level CTM files in: `<output_dir>/tokens/<utt_id>.ctm` and `<output_dir>/words/<utt_id>.ctm`. If `additional_ctm_grouping_separator` is specified, an additional set of CTM files will be created at `<output_dir>/additional_segments/<utt_id>.ctm` containing CTMs for `additional_ctm_grouping_separator`-separated segments. (Default: `None`. Cannot be empty string or space (" "), as space-separated word-level CTMs will always be saved in `<output_dir>/words/<utt_id>.ctm`.) +> Note: the `additional_ctm_grouping_separator` will be removed from the ground truth text and all the output CTMs, i.e. it is treated as a marker which is not part of the ground truth. The separator will essentially be treated as a space, and any additional spaces around it will be amalgamated into one, i.e. if `additional_ctm_grouping_separator="|"`, the following texts will be treated equivalently: `“abc|def”`, `“abc |def”`, `“abc| def”`, `“abc | def”`. + +* **[OPTIONAL]** `remove_blank_tokens_from_ctm`: a boolean denoting whether to remove blank tokens from token-level output CTMs. (Default: False). + +* **[OPTIONAL]** `audio_filepath_parts_in_utt_id`: This specifies how many of the 'parts' of the audio_filepath we will use (starting from the final part of the audio_filepath) to determine the utt_id that will be used in the CTM files. (Default: 1, i.e. utt_id will be the stem of the basename of audio_filepath). Note also that any spaces that are present in the audio_filepath will be replaced with dashes, so as not to change the number of space-separated elements in the CTM files. + +* **[OPTIONAL]** `minimum_timestamp_duration`: a float indicating a minimum duration (in seconds) for timestamps in the CTM. If any line in the CTM has a duration lower than the `minimum_timestamp_duration`, it will be enlarged from the middle outwards until it meets the minimum_timestamp_duration, or reaches the beginning or end of the audio file. Note that this may cause timestamps to overlap. (Default: 0, i.e. no modifications to predicted duration). + +# Input manifest file format +By default, NFA needs to be provided with a 'manifest' file where each line specifies the absolute "audio_filepath" and "text" of each utterance that you wish to produce alignments for, like the format below: +```json +{"audio_filepath": "/absolute/path/to/audio.wav", "text": "the transcription of the utterance"} +``` + +You can omit the `"text"` field from the manifest if you specify `align_using_pred_text=true`. In that case, any `"text"` fields in the manifest will be ignored: the ASR model at `pretrained_name` or `model_path` will be used to transcribe the audio and obtain `"pred_text"`, which will be used as the 'ground truth' for the forced alignment process. The `"pred_text"` will also be saved in the output manifest JSON file at `<output_dir>/<original manifest file name>_with_ctm_paths.json`. To remove the possibility of overwriting `"pred_text"`, NFA will raise an error if `align_using_pred_text=true` and there are existing `"pred_text"` fields in the original manifest. + +> Note: NFA does not require `"duration"` fields in the manifest, and can align long audio files without running out of memory. Depending on your machine specs, you can align audios up to 5-10 minutes long with Conformer CTC models, up to around 1.5 hours with QuartzNet models, and up to several hours with Citrinet models. NFA will also produce better alignments the more accurate the ground-truth `"text"` is.
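Before launching an alignment run, it can be worth checking that your manifest obeys the rules above. The sketch below (assumptions: the manifest path is a placeholder, and the checks simply mirror the requirements described in this README) raises an error on the first offending line:

```python
import json
import os

def check_nfa_manifest(manifest_filepath: str, align_using_pred_text: bool = False) -> None:
    """Raise ValueError if a manifest line breaks the rules described above."""
    with open(manifest_filepath, "r", encoding="utf-8") as f:
        for line_i, line in enumerate(f):
            data = json.loads(line)

            # Every line needs an absolute 'audio_filepath'.
            if "audio_filepath" not in data or not os.path.isabs(data["audio_filepath"]):
                raise ValueError(f"line {line_i}: missing or non-absolute 'audio_filepath'")

            if align_using_pred_text:
                # NFA refuses to run if 'pred_text' entries already exist.
                if "pred_text" in data:
                    raise ValueError(f"line {line_i}: 'pred_text' already present")
            else:
                # Ground-truth 'text' is required when align_using_pred_text is False.
                if "text" not in data:
                    raise ValueError(f"line {line_i}: missing 'text'")

check_nfa_manifest("/data/manifest.json", align_using_pred_text=False)
```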
+ + +# Output CTM file format +For each utterance specified in a line of `manifest_filepath`, several CTM files will be generated: +* a CTM file containing token-level alignments at `<output_dir>/tokens/<utt_id>.ctm`, +* a CTM file containing word-level alignments at `<output_dir>/words/<utt_id>.ctm`, +* if `additional_ctm_grouping_separator` is specified, there will also be a CTM file containing those segments at `<output_dir>/additional_segments/<utt_id>.ctm`. +Each CTM file will contain lines of the format: +`<utt_id> 1 <start_time> <duration> <text>`. +Note the second item in the line (the 'channel ID', which is required by the CTM file format) is always 1, as NFA operates on single-channel audio. + +# Output JSON manifest file format +A new manifest file will be saved at `<output_dir>/<original manifest file name>_with_ctm_paths.json`. It will contain the same fields as the original manifest, and additionally: +* `"token_level_ctm_filepath"` +* `"word_level_ctm_filepath"` +* `"additonal_segment_level_ctm_filepath"` (if `additional_ctm_grouping_separator` is specified) +* `"pred_text"` (if `align_using_pred_text=true`) + + +# How do I evaluate the alignment accuracy? +Ideally you would have some 'true' CTM files to compare with your generated CTM files. With these you could obtain metrics such as the mean (absolute) errors between the predicted starts/ends and the 'true' starts/ends of the segments. + +Alternatively (or additionally), you can visualize the quality of alignments using tools such as Gecko, which can play your audio file and display the predicted alignments at the same time. The Gecko tool requires you to upload an audio file and at least one CTM file. The Gecko tool can be accessed here: https://gong-io.github.io/gecko/. More information about the Gecko tool can be found on its GitHub page here: https://github.com/gong-io/gecko. + +**Note**: the following may help improve your experience viewing the CTMs in Gecko: +* setting `minimum_timestamp_duration` to a larger number, as Gecko may not display some tokens/words/segments properly if their timestamps are too short. +* setting `remove_blank_tokens_from_ctm=true` if you are analyzing token-level CTMs, as it will make the Gecko visualization less cluttered. diff --git a/tools/nemo_forced_aligner/README.md b/tools/nemo_forced_aligner/align.py b/tools/nemo_forced_aligner/align.py new file mode 100644 index 0000000000000..5f2a781a381fe --- /dev/null +++ b/tools/nemo_forced_aligner/align.py @@ -0,0 +1,287 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
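As a concrete illustration of the CTM-comparison approach described in the evaluation section above, the sketch below reads a generated word-level CTM and a reference CTM with the same `<utt_id> 1 <start_time> <duration> <text>` layout and reports the mean absolute start-time error (a sketch only: the file paths are placeholders, and it assumes both files list the same words in the same order and use the same time unit):

```python
def read_ctm(ctm_filepath):
    """Return a list of (start, duration, text) tuples from a CTM file."""
    entries = []
    with open(ctm_filepath, "r", encoding="utf-8") as f:
        for line in f:
            utt_id, channel, start, duration, text = line.split()[:5]
            entries.append((float(start), float(duration), text))
    return entries

predicted = read_ctm("/data/nfa_output/words/utt1.ctm")
reference = read_ctm("/data/reference_ctms/utt1.ctm")

# Mean absolute difference between predicted and reference start times.
errors = [abs(p[0] - r[0]) for p, r in zip(predicted, reference)]
print(f"mean absolute start-time error: {sum(errors) / len(errors):.3f}")
```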
+ +import os +from dataclasses import dataclass, is_dataclass +from typing import Optional + +import torch +from omegaconf import OmegaConf +from utils.data_prep import ( + get_audio_sr, + get_batch_starts_ends, + get_batch_tensors_and_boundary_info, + get_manifest_lines_batch, + is_entry_in_all_lines, + is_entry_in_any_lines, +) +from utils.make_output_files import make_ctm, make_new_manifest +from utils.viterbi_decoding import viterbi_decoding + +from nemo.collections.asr.models.ctc_models import EncDecCTCModel +from nemo.collections.asr.parts.utils.transcribe_utils import setup_model +from nemo.core.config import hydra_runner +from nemo.utils import logging + + +""" +Align the utterances in manifest_filepath. +Results are saved in CTM files in output_dir. + +Arguments: + pretrained_name: string specifying the name of a CTC NeMo ASR model which will be automatically downloaded + from NGC and used for generating the log-probs which we will use to do alignment. + Note: NFA can only use CTC models (not Transducer models) at the moment. + model_path: string specifying the local filepath to a CTC NeMo ASR model which will be used to generate the + log-probs which we will use to do alignment. + Note: NFA can only use CTC models (not Transducer models) at the moment. + Note: only one of model_path and pretrained_name may be specified. + model_downsample_factor: an int indicating the downsample factor of the ASR model, i.e. the ratio of input + timesteps to output timesteps. + If the ASR model is a QuartzNet model, its downsample factor is 2. + If the ASR model is a Conformer CTC model, its downsample factor is 4. + If the ASR model is a Citrinet model, its downsample factor is 8. + manifest_filepath: filepath to the manifest of the data you want to align, + containing 'audio_filepath' and 'text' fields. + output_dir: the folder where output CTM files and the new JSON manifest will be saved. + align_using_pred_text: if True, will transcribe the audio using the specified model and then use that transcription + as the 'ground truth' for the forced alignment. + transcribe_device: None, or a string specifying the device that will be used for generating log-probs (i.e. "transcribing"). + The string needs to be in a format recognized by torch.device(). If None, NFA will set it to 'cuda' if it is available + (otherwise will set it to 'cpu'). + viterbi_device: None, or a string specifying the device that will be used for doing Viterbi decoding. + The string needs to be in a format recognized by torch.device(). If None, NFA will set it to 'cuda' if it is available + (otherwise will set it to 'cpu'). + batch_size: int specifying the batch size that will be used for generating log-probs and doing Viterbi decoding. + additional_ctm_grouping_separator: the string used to separate CTM segments if you want to obtain CTM files at a + level that is not the token level or the word level. NFA will always produce token-level and word-level CTM + files in: `<output_dir>/tokens/<utt_id>.ctm` and `<output_dir>/words/<utt_id>.ctm`. + If `additional_ctm_grouping_separator` is specified, an additional set of CTM files will be created at + `<output_dir>/additional_segments/<utt_id>.ctm` containing CTMs + for `additional_ctm_grouping_separator`-separated segments. + remove_blank_tokens_from_ctm: a boolean denoting whether to remove blank tokens from token-level output CTMs.
+ audio_filepath_parts_in_utt_id: int specifying how many of the 'parts' of the audio_filepath + we will use (starting from the final part of the audio_filepath) to determine the + utt_id that will be used in the CTM files. Note also that any spaces that are present in the audio_filepath + will be replaced with dashes, so as not to change the number of space-separated elements in the + CTM files. + e.g. if audio_filepath is "/a/b/c/d/e 1.wav" and audio_filepath_parts_in_utt_id is 1 => utt_id will be "e-1" + e.g. if audio_filepath is "/a/b/c/d/e 1.wav" and audio_filepath_parts_in_utt_id is 2 => utt_id will be "d_e-1" + e.g. if audio_filepath is "/a/b/c/d/e 1.wav" and audio_filepath_parts_in_utt_id is 3 => utt_id will be "c_d_e-1" + minimum_timestamp_duration: a float indicating a minimum duration (in seconds) for timestamps in the CTM. If any + line in the CTM has a duration lower than the `minimum_timestamp_duration`, it will be enlarged from the + middle outwards until it meets the minimum_timestamp_duration, or reaches the beginning or end of the audio + file. Note that this may cause timestamps to overlap. +""" + + +@dataclass +class AlignmentConfig: + # Required configs + pretrained_name: Optional[str] = None + model_path: Optional[str] = None + model_downsample_factor: Optional[int] = None + manifest_filepath: Optional[str] = None + output_dir: Optional[str] = None + + # General configs + align_using_pred_text: bool = False + transcribe_device: Optional[str] = None + viterbi_device: Optional[str] = None + batch_size: int = 1 + additional_ctm_grouping_separator: Optional[str] = None + remove_blank_tokens_from_ctm: bool = False + minimum_timestamp_duration: float = 0 + audio_filepath_parts_in_utt_id: int = 1 + + +@hydra_runner(config_name="AlignmentConfig", schema=AlignmentConfig) +def main(cfg: AlignmentConfig): + + logging.info(f'Hydra config: {OmegaConf.to_yaml(cfg)}') + + if is_dataclass(cfg): + cfg = OmegaConf.structured(cfg) + + # Validate config + if cfg.model_path is None and cfg.pretrained_name is None: + raise ValueError("cfg.model_path and cfg.pretrained_name cannot both be None") + + if cfg.model_path is not None and cfg.pretrained_name is not None: + raise ValueError("One of cfg.model_path and cfg.pretrained_name must be None") + + if cfg.model_downsample_factor is None: + raise ValueError("cfg.model_downsample_factor must be specified") + + if cfg.manifest_filepath is None: + raise ValueError("cfg.manifest_filepath must be specified") + + if cfg.output_dir is None: + raise ValueError("cfg.output_dir must be specified") + + if cfg.batch_size < 1: + raise ValueError("cfg.batch_size cannot be zero or a negative number") + + if cfg.additional_ctm_grouping_separator == "" or cfg.additional_ctm_grouping_separator == " ": + raise ValueError("cfg.additional_ctm_grouping_separator cannot be empty string or space character") + + if cfg.minimum_timestamp_duration < 0: + raise ValueError("cfg.minimum_timestamp_duration cannot be a negative number") + + # Validate manifest contents + if not is_entry_in_all_lines(cfg.manifest_filepath, "audio_filepath"): + raise RuntimeError( + "At least one line in cfg.manifest_filepath does not contain an 'audio_filepath' entry. " + "All lines must contain an 'audio_filepath' entry." + ) + + if cfg.align_using_pred_text: + if is_entry_in_any_lines(cfg.manifest_filepath, "pred_text"): + raise RuntimeError( + "Cannot specify cfg.align_using_pred_text=True when the manifest at cfg.manifest_filepath " + "contains 'pred_text' entries.
This is because the audio will be transcribed and may produce " + "a different 'pred_text'. This may cause confusion." + ) + else: + if not is_entry_in_all_lines(cfg.manifest_filepath, "text"): + raise RuntimeError( + "At least one line in cfg.manifest_filepath does not contain a 'text' entry. " + "NFA requires all lines to contain a 'text' entry when cfg.align_using_pred_text=False." + ) + + # init devices + if cfg.transcribe_device is None: + transcribe_device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + else: + transcribe_device = torch.device(cfg.transcribe_device) + logging.info(f"Device to be used for transcription step (`transcribe_device`) is {transcribe_device}") + + if cfg.viterbi_device is None: + viterbi_device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + else: + viterbi_device = torch.device(cfg.viterbi_device) + logging.info(f"Device to be used for viterbi step (`viterbi_device`) is {viterbi_device}") + + if transcribe_device.type == 'cuda' or viterbi_device.type == 'cuda': + logging.warning( + 'One or both of transcribe_device and viterbi_device are GPUs. If you run into OOM errors ' + 'it may help to change both devices to be the CPU.' + ) + + # load model + model, _ = setup_model(cfg, transcribe_device) + + if not isinstance(model, EncDecCTCModel): + raise NotImplementedError( + f"Model {cfg.model_path if cfg.model_path else cfg.pretrained_name} is not an instance of NeMo EncDecCTCModel." + " Currently only instances of EncDecCTCModels are supported" + ) + + audio_sr = get_audio_sr(cfg.manifest_filepath) + logging.info( + f"Detected audio sampling rate {audio_sr}Hz in first audio in manifest at {cfg.manifest_filepath}. " + "Will assume all audios in manifest have this sampling rate. Sampling rate will be used to determine " + "timestamps in output CTM." + ) + + if cfg.minimum_timestamp_duration > 0: + logging.warning( + f"cfg.minimum_timestamp_duration has been set to {cfg.minimum_timestamp_duration} seconds. " + "This may cause the alignments for some tokens/words/additional segments to be overlapping."
+ ) + + # get start and end line IDs of batches + starts, ends = get_batch_starts_ends(cfg.manifest_filepath, cfg.batch_size) + + if cfg.align_using_pred_text: + # record pred_texts to save them in the new manifest at the end of this script + pred_text_all_lines = [] + else: + pred_text_all_lines = None + + # get alignment and save in CTM batch-by-batch + for start, end in zip(starts, ends): + manifest_lines_batch = get_manifest_lines_batch(cfg.manifest_filepath, start, end) + + ( + log_probs_batch, + y_batch, + T_batch, + U_batch, + token_info_batch, + word_info_batch, + segment_info_batch, + pred_text_batch, + ) = get_batch_tensors_and_boundary_info( + manifest_lines_batch, model, cfg.additional_ctm_grouping_separator, cfg.align_using_pred_text, + ) + + if cfg.align_using_pred_text: + pred_text_all_lines.extend(pred_text_batch) + + alignments_batch = viterbi_decoding(log_probs_batch, y_batch, T_batch, U_batch, viterbi_device) + + make_ctm( + token_info_batch, + alignments_batch, + manifest_lines_batch, + model, + cfg.model_downsample_factor, + os.path.join(cfg.output_dir, "tokens"), + cfg.remove_blank_tokens_from_ctm, + cfg.audio_filepath_parts_in_utt_id, + cfg.minimum_timestamp_duration, + audio_sr, + ) + + make_ctm( + word_info_batch, + alignments_batch, + manifest_lines_batch, + model, + cfg.model_downsample_factor, + os.path.join(cfg.output_dir, "words"), + False, # dont try to remove blank tokens because we dont expect them to be there anyway + cfg.audio_filepath_parts_in_utt_id, + cfg.minimum_timestamp_duration, + audio_sr, + ) + + if cfg.additional_ctm_grouping_separator: + make_ctm( + segment_info_batch, + alignments_batch, + manifest_lines_batch, + model, + cfg.model_downsample_factor, + os.path.join(cfg.output_dir, "additional_segments"), + False, # dont try to remove blank tokens because we dont expect them to be there anyway + cfg.audio_filepath_parts_in_utt_id, + cfg.minimum_timestamp_duration, + audio_sr, + ) + + make_new_manifest( + cfg.output_dir, + cfg.manifest_filepath, + cfg.additional_ctm_grouping_separator, + cfg.audio_filepath_parts_in_utt_id, + pred_text_all_lines, + ) + + return None + + +if __name__ == "__main__": + main() diff --git a/tools/nemo_forced_aligner/requirements.txt b/tools/nemo_forced_aligner/requirements.txt new file mode 100644 index 0000000000000..3af8ebf1b4881 --- /dev/null +++ b/tools/nemo_forced_aligner/requirements.txt @@ -0,0 +1,2 @@ +nemo_toolkit[all] +pytest diff --git a/tools/nemo_forced_aligner/tests/test_add_t_start_end_to_boundary_info.py b/tools/nemo_forced_aligner/tests/test_add_t_start_end_to_boundary_info.py new file mode 100644 index 0000000000000..406c4be1fb702 --- /dev/null +++ b/tools/nemo_forced_aligner/tests/test_add_t_start_end_to_boundary_info.py @@ -0,0 +1,121 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import pytest +from utils.make_output_files import add_t_start_end_to_boundary_info + +ALIGNMENT = [ + 1, + 1, + 3, + 3, + 4, + 5, + 7, + 7, + 9, + 10, + 11, + 12, + 13, + 15, + 17, + 17, + 19, + 21, + 23, + 23, +] + +INPUT_TOKEN_INFO = [ + {'text': '', 's_start': 0, 's_end': 0}, + {'text': 'h', 's_start': 1, 's_end': 1}, + {'text': '', 's_start': 2, 's_end': 2}, + {'text': 'i', 's_start': 3, 's_end': 3}, + {'text': '', 's_start': 4, 's_end': 4}, + {'text': '', 's_start': 5, 's_end': 5}, + {'text': '', 's_start': 6, 's_end': 6}, + {'text': 'w', 's_start': 7, 's_end': 7}, + {'text': '', 's_start': 8, 's_end': 8}, + {'text': 'o', 's_start': 9, 's_end': 9}, + {'text': '', 's_start': 10, 's_end': 10}, + {'text': 'r', 's_start': 11, 's_end': 11}, + {'text': '', 's_start': 12, 's_end': 12}, + {'text': 'l', 's_start': 13, 's_end': 13}, + {'text': '', 's_start': 14, 's_end': 14}, + {'text': 'd', 's_start': 15, 's_end': 15}, + {'text': '', 's_start': 16, 's_end': 16}, + {'text': '', 's_start': 17, 's_end': 17}, + {'text': '', 's_start': 18, 's_end': 18}, + {'text': 'h', 's_start': 19, 's_end': 19}, + {'text': '', 's_start': 20, 's_end': 20}, + {'text': 'e', 's_start': 21, 's_end': 21}, + {'text': '', 's_start': 22, 's_end': 22}, + {'text': 'y', 's_start': 23, 's_end': 23}, + {'text': '', 's_start': 24, 's_end': 24}, +] + +EXPECTED_OUTPUT_TOKEN_INFO = [ + {'text': 'h', 's_start': 1, 's_end': 1, 't_start': 0, 't_end': 1}, + {'text': 'i', 's_start': 3, 's_end': 3, 't_start': 2, 't_end': 3}, + {'text': '', 's_start': 4, 's_end': 4, 't_start': 4, 't_end': 4}, + {'text': '', 's_start': 5, 's_end': 5, 't_start': 5, 't_end': 5}, + {'text': 'w', 's_start': 7, 's_end': 7, 't_start': 6, 't_end': 7}, + {'text': 'o', 's_start': 9, 's_end': 9, 't_start': 8, 't_end': 8}, + {'text': '', 's_start': 10, 's_end': 10, 't_start': 9, 't_end': 9}, + {'text': 'r', 's_start': 11, 's_end': 11, 't_start': 10, 't_end': 10}, + {'text': '', 's_start': 12, 's_end': 12, 't_start': 11, 't_end': 11}, + {'text': 'l', 's_start': 13, 's_end': 13, 't_start': 12, 't_end': 12}, + {'text': 'd', 's_start': 15, 's_end': 15, 't_start': 13, 't_end': 13}, + {'text': '', 's_start': 17, 's_end': 17, 't_start': 14, 't_end': 15}, + {'text': 'h', 's_start': 19, 's_end': 19, 't_start': 16, 't_end': 16}, + {'text': 'e', 's_start': 21, 's_end': 21, 't_start': 17, 't_end': 17}, + {'text': 'y', 's_start': 23, 's_end': 23, 't_start': 18, 't_end': 19}, +] + + +INPUT_WORD_INFO = [ + {'text': 'hi', 's_start': 1, 's_end': 3}, + {'text': 'world', 's_start': 7, 's_end': 15}, + {'text': 'hey', 's_start': 19, 's_end': 23}, +] + +EXPECTED_OUTPUT_WORD_INFO = [ + {'text': 'hi', 's_start': 1, 's_end': 3, 't_start': 0, 't_end': 3}, + {'text': 'world', 's_start': 7, 's_end': 15, 't_start': 6, 't_end': 13}, + {'text': 'hey', 's_start': 19, 's_end': 23, 't_start': 16, 't_end': 19}, +] + +INPUT_SEGMENT_INFO = [ + {'text': 'hi world', 's_start': 1, 's_end': 15}, + {'text': 'hey', 's_start': 19, 's_end': 23}, +] + +EXPECTED_OUTPUT_SEGMENT_INFO = [ + {'text': 'hi world', 's_start': 1, 's_end': 15, 't_start': 0, 't_end': 13}, + {'text': 'hey', 's_start': 19, 's_end': 23, 't_start': 16, 't_end': 19}, +] + + +@pytest.mark.parametrize( + "input_boundary_info_utt,alignment_utt,expected_output_boundary_info_utt", + [ + (INPUT_TOKEN_INFO, ALIGNMENT, EXPECTED_OUTPUT_TOKEN_INFO), + (INPUT_WORD_INFO, ALIGNMENT, EXPECTED_OUTPUT_WORD_INFO), + (INPUT_SEGMENT_INFO, ALIGNMENT, EXPECTED_OUTPUT_SEGMENT_INFO), + ], +) +def test_add_t_start_end_to_boundary_info(input_boundary_info_utt, 
alignment_utt, expected_output_boundary_info_utt): + output_boundary_info_utt = add_t_start_end_to_boundary_info(input_boundary_info_utt, alignment_utt) + assert output_boundary_info_utt == expected_output_boundary_info_utt diff --git a/tools/nemo_forced_aligner/tests/test_get_y_and_boundary_info_for_utt.py b/tools/nemo_forced_aligner/tests/test_get_y_and_boundary_info_for_utt.py new file mode 100644 index 0000000000000..f5bc722d5a1c7 --- /dev/null +++ b/tools/nemo_forced_aligner/tests/test_get_y_and_boundary_info_for_utt.py @@ -0,0 +1,158 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import pytest +from utils.data_prep import get_y_and_boundary_info_for_utt + +from nemo.collections.asr.models import ASRModel + +EN_TEXT = "hi world | hey" + +EN_QN_EXPECTED_TOKEN_INFO = [ + {'text': '', 's_start': 0, 's_end': 0}, + {'text': 'h', 's_start': 1, 's_end': 1}, + {'text': '', 's_start': 2, 's_end': 2}, + {'text': 'i', 's_start': 3, 's_end': 3}, + {'text': '', 's_start': 4, 's_end': 4}, + {'text': '', 's_start': 5, 's_end': 5}, + {'text': '', 's_start': 6, 's_end': 6}, + {'text': 'w', 's_start': 7, 's_end': 7}, + {'text': '', 's_start': 8, 's_end': 8}, + {'text': 'o', 's_start': 9, 's_end': 9}, + {'text': '', 's_start': 10, 's_end': 10}, + {'text': 'r', 's_start': 11, 's_end': 11}, + {'text': '', 's_start': 12, 's_end': 12}, + {'text': 'l', 's_start': 13, 's_end': 13}, + {'text': '', 's_start': 14, 's_end': 14}, + {'text': 'd', 's_start': 15, 's_end': 15}, + {'text': '', 's_start': 16, 's_end': 16}, + {'text': '', 's_start': 17, 's_end': 17}, + {'text': '', 's_start': 18, 's_end': 18}, + {'text': 'h', 's_start': 19, 's_end': 19}, + {'text': '', 's_start': 20, 's_end': 20}, + {'text': 'e', 's_start': 21, 's_end': 21}, + {'text': '', 's_start': 22, 's_end': 22}, + {'text': 'y', 's_start': 23, 's_end': 23}, + {'text': '', 's_start': 24, 's_end': 24}, +] + +EN_QN_EXPECTED_WORD_INFO = [ + {'text': 'hi', 's_start': 1, 's_end': 3}, + {'text': 'world', 's_start': 7, 's_end': 15}, + {'text': 'hey', 's_start': 19, 's_end': 23}, +] + +EN_QN_EXPECTED_SEGMENT_INFO = [ + {'text': 'hi world', 's_start': 1, 's_end': 15}, + {'text': 'hey', 's_start': 19, 's_end': 23}, +] + +EN_CN_EXPECTED_TOKEN_INFO = [ + {'text': '', 's_start': 0, 's_end': 0}, + {'text': '▁hi', 's_start': 1, 's_end': 1}, + {'text': '', 's_start': 2, 's_end': 2}, + {'text': '▁world', 's_start': 3, 's_end': 3}, + {'text': '', 's_start': 4, 's_end': 4}, + {'text': '▁he', 's_start': 5, 's_end': 5}, + {'text': '', 's_start': 6, 's_end': 6}, + {'text': 'y', 's_start': 7, 's_end': 7}, + {'text': '', 's_start': 8, 's_end': 8}, +] + +EN_CN_EXPECTED_WORD_INFO = [ + {'text': 'hi', 's_start': 1, 's_end': 1}, + {'text': 'world', 's_start': 3, 's_end': 3}, + {'text': 'hey', 's_start': 5, 's_end': 7}, +] + +EN_CN_EXPECTED_SEGMENT_INFO = [ + {'text': 'hi world', 's_start': 1, 's_end': 3}, + {'text': 'hey', 's_start': 5, 's_end': 7}, +] + + +ZH_TEXT = "人工 智能|技术" + +ZH_EXPECTED_TOKEN_INFO = [ + 
{'text': '', 's_start': 0, 's_end': 0}, + {'text': '人', 's_start': 1, 's_end': 1}, + {'text': '', 's_start': 2, 's_end': 2}, + {'text': '工', 's_start': 3, 's_end': 3}, + {'text': '', 's_start': 4, 's_end': 4}, + {'text': '', 's_start': 5, 's_end': 5}, + {'text': '', 's_start': 6, 's_end': 6}, + {'text': '智', 's_start': 7, 's_end': 7}, + {'text': '', 's_start': 8, 's_end': 8}, + {'text': '能', 's_start': 9, 's_end': 9}, + {'text': '', 's_start': 10, 's_end': 10}, + {'text': '', 's_start': 11, 's_end': 11}, + {'text': '', 's_start': 12, 's_end': 12}, + {'text': '技', 's_start': 13, 's_end': 13}, + {'text': '', 's_start': 14, 's_end': 14}, + {'text': '术', 's_start': 15, 's_end': 15}, + {'text': '', 's_start': 16, 's_end': 16}, +] + +ZH_EXPECTED_WORD_INFO = [ + {'text': '人工', 's_start': 1, 's_end': 3}, + {'text': '智能', 's_start': 7, 's_end': 9}, + {'text': '技术', 's_start': 13, 's_end': 15}, +] + +ZH_EXPECTED_SEGMENT_INFO = [ + {'text': '人工 智能', 's_start': 1, 's_end': 9}, + {'text': '技术', 's_start': 13, 's_end': 15}, +] + + +@pytest.mark.parametrize( + "text,model_pretrained_name,separator,expected_token_info", + [ + (EN_TEXT, "stt_en_quartznet15x5", "|", EN_QN_EXPECTED_TOKEN_INFO), + (EN_TEXT, "stt_en_citrinet_256_gamma_0_25", "|", EN_CN_EXPECTED_TOKEN_INFO), + (ZH_TEXT, "stt_zh_citrinet_512", "|", ZH_EXPECTED_TOKEN_INFO), + ], +) +def test_token_info(text, model_pretrained_name, separator, expected_token_info): + model = ASRModel.from_pretrained(model_pretrained_name) + _, token_info, *_ = get_y_and_boundary_info_for_utt(text, model, separator) + assert token_info == expected_token_info + + +@pytest.mark.parametrize( + "text,model_pretrained_name,separator,expected_word_info", + [ + (EN_TEXT, "stt_en_quartznet15x5", "|", EN_QN_EXPECTED_WORD_INFO), + (EN_TEXT, "stt_en_citrinet_256_gamma_0_25", "|", EN_CN_EXPECTED_WORD_INFO), + (ZH_TEXT, "stt_zh_citrinet_512", "|", ZH_EXPECTED_WORD_INFO), + ], +) +def test_word_info(text, model_pretrained_name, separator, expected_word_info): + model = ASRModel.from_pretrained(model_pretrained_name) + _, _, word_info, _ = get_y_and_boundary_info_for_utt(text, model, separator) + assert word_info == expected_word_info + + +@pytest.mark.parametrize( + "text,model_pretrained_name,separator,expected_segment_info", + [ + (EN_TEXT, "stt_en_quartznet15x5", "|", EN_QN_EXPECTED_SEGMENT_INFO), + (EN_TEXT, "stt_en_citrinet_256_gamma_0_25", "|", EN_CN_EXPECTED_SEGMENT_INFO), + (ZH_TEXT, "stt_zh_citrinet_512", "|", ZH_EXPECTED_SEGMENT_INFO), + ], +) +def test_segment_info(text, model_pretrained_name, separator, expected_segment_info): + model = ASRModel.from_pretrained(model_pretrained_name) + *_, segment_info = get_y_and_boundary_info_for_utt(text, model, separator) + assert segment_info == expected_segment_info diff --git a/tools/nemo_forced_aligner/utils/constants.py b/tools/nemo_forced_aligner/utils/constants.py new file mode 100644 index 0000000000000..894f880401cbc --- /dev/null +++ b/tools/nemo_forced_aligner/utils/constants.py @@ -0,0 +1,19 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +BLANK_TOKEN = "<b>" + +SPACE_TOKEN = "<space>" + +V_NEGATIVE_NUM = -1e30 diff --git a/tools/nemo_forced_aligner/utils/data_prep.py b/tools/nemo_forced_aligner/utils/data_prep.py new file mode 100644 index 0000000000000..26d8a328b50d4 --- /dev/null +++ b/tools/nemo_forced_aligner/utils/data_prep.py @@ -0,0 +1,385 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os + +import soundfile as sf +import torch +from utils.constants import BLANK_TOKEN, SPACE_TOKEN, V_NEGATIVE_NUM + + +def get_batch_starts_ends(manifest_filepath, batch_size): + """ + Get the start and end ids of the lines we will use for each 'batch'. + """ + + with open(manifest_filepath, 'r') as f: + num_lines_in_manifest = sum(1 for _ in f) + + starts = [x for x in range(0, num_lines_in_manifest, batch_size)] + ends = [x - 1 for x in starts] + ends.pop(0) + ends.append(num_lines_in_manifest) + + return starts, ends + + +def is_entry_in_any_lines(manifest_filepath, entry): + """ + Returns True if entry is a key in any of the JSON lines in manifest_filepath + """ + + entry_in_manifest = False + + with open(manifest_filepath, 'r') as f: + for line in f: + data = json.loads(line) + + if entry in data: + entry_in_manifest = True + + return entry_in_manifest + + +def is_entry_in_all_lines(manifest_filepath, entry): + """ + Returns True if entry is a key in all of the JSON lines in manifest_filepath.
+ """ + with open(manifest_filepath, 'r') as f: + for line in f: + data = json.loads(line) + + if entry not in data: + return False + + return True + + +def get_audio_sr(manifest_filepath): + """ + Measure the sampling rate of the audio file in the first line + of the manifest at manifest_filepath + """ + with open(manifest_filepath, "r") as f_manifest: + first_line = json.loads(f_manifest.readline()) + + audio_file = first_line["audio_filepath"] + if not os.path.exists(audio_file): + raise RuntimeError(f"Did not find filepath {audio_file} which was specified in manifest {manifest_filepath}.") + + with sf.SoundFile(audio_file, "r") as f_audio: + return f_audio.samplerate + + +def get_manifest_lines_batch(manifest_filepath, start, end): + manifest_lines_batch = [] + with open(manifest_filepath, "r") as f: + for line_i, line in enumerate(f): + if line_i == start and line_i == end: + manifest_lines_batch.append(json.loads(line)) + break + + if line_i == end: + break + if line_i >= start: + manifest_lines_batch.append(json.loads(line)) + return manifest_lines_batch + + +def get_char_tokens(text, model): + tokens = [] + for character in text: + if character in model.decoder.vocabulary: + tokens.append(model.decoder.vocabulary.index(character)) + else: + tokens.append(len(model.decoder.vocabulary)) # return unk token (same as blank token) + + return tokens + + +def get_y_and_boundary_info_for_utt(text, model, separator): + """ + Get y_token_ids_with_blanks, token_info, word_info and segment_info for the text provided, tokenized + by the model provided. + y_token_ids_with_blanks is a list of the indices of the text tokens with the blank token id in between every + text token. + token_info, word_info and segment_info are lists of dictionaries containing information about + where the tokens/words/segments start and end. + For example, 'hi world | hey ' with separator = '|' and tokenized by a BPE tokenizer can have token_info like: + token_info = [ + {'text': '', 's_start': 0, 's_end': 0}, + {'text': '▁hi', 's_start': 1, 's_end': 1}, + {'text': '', 's_start': 2, 's_end': 2}, + {'text': '▁world', 's_start': 3, 's_end': 3}, + {'text': '', 's_start': 4, 's_end': 4}, + {'text': '▁he', 's_start': 5, 's_end': 5}, + {'text': '', 's_start': 6, 's_end': 6}, + {'text': 'y', 's_start': 7, 's_end': 7}, + {'text': '', 's_start': 8, 's_end': 8}, + ] + 's_start' and 's_end' indicate where in the sequence of tokens does each token start and end. + + The word_info will be as follows: + word_info = [ + {'text': 'hi', 's_start': 1, 's_end': 1}, + {'text': 'world', 's_start': 3, 's_end': 3}, + {'text': 'hey', 's_start': 5, 's_end': 7}, + ] + 's_start' and 's_end' indicate where in the sequence of tokens does each word start and end. + + segment_info will be as follows: + segment_info = [ + {'text': 'hi world', 's_start': 1, 's_end': 3}, + {'text': 'hey', 's_start': 5, 's_end': 7}, + ] + 's_start' and 's_end' indicate where in the sequence of tokens does each segment start and end. 
+ """ + + if not separator: # if separator is not defined - treat the whole text as one segment + segments = [text] + else: + segments = text.split(separator) + + # remove any spaces at start and end of segments + segments = [seg.strip() for seg in segments] + + if hasattr(model, 'tokenizer'): + + BLANK_ID = len(model.decoder.vocabulary) # TODO: check + + y_token_ids_with_blanks = [BLANK_ID] + token_info = [{"text": BLANK_TOKEN, "s_start": 0, "s_end": 0,}] + word_info = [] + segment_info = [] + + segment_s_pointer = 1 # first segment will start at s=1 because s=0 is a blank + word_s_pointer = 1 # first word will start at s=1 because s=0 is a blank + + for segment in segments: + words = segment.split(" ") # we define words to be space-separated sub-strings + for word in words: + + word_tokens = model.tokenizer.text_to_tokens(word) + word_ids = model.tokenizer.text_to_ids(word) + for token, id_ in zip(word_tokens, word_ids): + # add the text token and the blank that follows it + # to our token-based variables + y_token_ids_with_blanks.extend([id_, BLANK_ID]) + token_info.extend( + [ + { + "text": token, + "s_start": len(y_token_ids_with_blanks) - 2, + "s_end": len(y_token_ids_with_blanks) - 2, + }, + { + "text": BLANK_TOKEN, + "s_start": len(y_token_ids_with_blanks) - 1, + "s_end": len(y_token_ids_with_blanks) - 1, + }, + ] + ) + + # add the word to word_info and increment the word_s_pointer + word_info.append( + { + "text": word, + "s_start": word_s_pointer, + "s_end": word_s_pointer + (len(word_tokens) - 1) * 2, # TODO check this, + } + ) + word_s_pointer += len(word_tokens) * 2 # TODO check this + + # add the segment to segment_info and increment the segment_s_pointer + segment_tokens = model.tokenizer.text_to_tokens(segment) + segment_info.append( + { + "text": segment, + "s_start": segment_s_pointer, + "s_end": segment_s_pointer + (len(segment_tokens) - 1) * 2, + } + ) + segment_s_pointer += len(segment_tokens) * 2 + + return y_token_ids_with_blanks, token_info, word_info, segment_info + + elif hasattr(model.decoder, "vocabulary"): # i.e. 
tokenization is simply character-based + + BLANK_ID = len(model.decoder.vocabulary) # TODO: check this is correct + SPACE_ID = model.decoder.vocabulary.index(" ") + + y_token_ids_with_blanks = [BLANK_ID] + token_info = [{"text": BLANK_TOKEN, "s_start": 0, "s_end": 0,}] + word_info = [] + segment_info = [] + + segment_s_pointer = 1 # first segment will start at s=1 because s=0 is a blank + word_s_pointer = 1 # first word will start at s=1 because s=0 is a blank + + for i_segment, segment in enumerate(segments): + words = segment.split(" ") # we define words to be space-separated characters + for i_word, word in enumerate(words): + + # convert string to list of characters + word_tokens = list(word) + # convert list of characters to list of their ids in the vocabulary + word_ids = get_char_tokens(word, model) + for token, id_ in zip(word_tokens, word_ids): + # add the text token and the blank that follows it + # to our token-based variables + y_token_ids_with_blanks.extend([id_, BLANK_ID]) + token_info.extend( + [ + { + "text": token, + "s_start": len(y_token_ids_with_blanks) - 2, + "s_end": len(y_token_ids_with_blanks) - 2, + }, + { + "text": BLANK_TOKEN, + "s_start": len(y_token_ids_with_blanks) - 1, + "s_end": len(y_token_ids_with_blanks) - 1, + }, + ] + ) + + # add space token (and the blank after it) unless this is the final word in the final segment + if not (i_segment == len(segments) - 1 and i_word == len(words) - 1): + y_token_ids_with_blanks.extend([SPACE_ID, BLANK_ID]) + token_info.extend( + ( + { + "text": SPACE_TOKEN, + "s_start": len(y_token_ids_with_blanks) - 2, + "s_end": len(y_token_ids_with_blanks) - 2, + }, + { + "text": BLANK_TOKEN, + "s_start": len(y_token_ids_with_blanks) - 1, + "s_end": len(y_token_ids_with_blanks) - 1, + }, + ) + ) + # add the word to word_info and increment the word_s_pointer + word_info.append( + { + "text": word, + "s_start": word_s_pointer, + "s_end": word_s_pointer + len(word_tokens) * 2 - 2, # TODO check this, + } + ) + word_s_pointer += len(word_tokens) * 2 + 2 # TODO check this + + # add the segment to segment_info and increment the segment_s_pointer + segment_tokens = get_char_tokens(segment, model) + segment_info.append( + { + "text": segment, + "s_start": segment_s_pointer, + "s_end": segment_s_pointer + (len(segment_tokens) - 1) * 2, + } + ) + segment_s_pointer += len(segment_tokens) * 2 + 2 + + return y_token_ids_with_blanks, token_info, word_info, segment_info + + else: + raise RuntimeError("Cannot get tokens of this model.") + + +def get_batch_tensors_and_boundary_info(manifest_lines_batch, model, separator, align_using_pred_text): + """ + Returns: + log_probs, y, T, U (y and U are s.t. every other token is a blank) - these are the tensors we will need + during Viterbi decoding. + token_info_list, word_info_list, segment_info_list - these are lists of dictionaries which we will need + for writing the CTM files with the human-readable alignments. + pred_text_list - this is a list of the transcriptions from our model which we will save to our output JSON + file if align_using_pred_text is True. 
+ """ + + # get hypotheses by calling 'transcribe' + # we will use the output log_probs, the duration of the log_probs, + # and (optionally) the predicted ASR text from the hypotheses + audio_filepaths_batch = [line["audio_filepath"] for line in manifest_lines_batch] + B = len(audio_filepaths_batch) + with torch.no_grad(): + hypotheses = model.transcribe(audio_filepaths_batch, return_hypotheses=True, batch_size=B) + + log_probs_list_batch = [] + T_list_batch = [] + pred_text_batch = [] + for hypothesis in hypotheses: + log_probs_list_batch.append(hypothesis.y_sequence) + T_list_batch.append(hypothesis.y_sequence.shape[0]) + pred_text_batch.append(hypothesis.text) + + # we loop over every line in the manifest that is in our current batch, + # and record the y (list of tokens, including blanks), U (list of lengths of y) and + # token_info_batch, word_info_batch, segment_info_batch + y_list_batch = [] + U_list_batch = [] + token_info_batch = [] + word_info_batch = [] + segment_info_batch = [] + + for i_line, line in enumerate(manifest_lines_batch): + if align_using_pred_text: + gt_text_for_alignment = pred_text_batch[i_line] + else: + gt_text_for_alignment = line["text"] + y_utt, token_info_utt, word_info_utt, segment_info_utt = get_y_and_boundary_info_for_utt( + gt_text_for_alignment, model, separator + ) + + y_list_batch.append(y_utt) + U_list_batch.append(len(y_utt)) + token_info_batch.append(token_info_utt) + word_info_batch.append(word_info_utt) + segment_info_batch.append(segment_info_utt) + + # turn log_probs, y, T, U into dense tensors for fast computation during Viterbi decoding + T_max = max(T_list_batch) + U_max = max(U_list_batch) + # V = the number of tokens in the vocabulary + 1 for the blank token. + V = len(model.decoder.vocabulary) + 1 + T_batch = torch.tensor(T_list_batch) + U_batch = torch.tensor(U_list_batch) + + # make log_probs_batch tensor of shape (B x T_max x V) + log_probs_batch = V_NEGATIVE_NUM * torch.ones((B, T_max, V)) + for b, log_probs_utt in enumerate(log_probs_list_batch): + t = log_probs_utt.shape[0] + log_probs_batch[b, :t, :] = log_probs_utt + + # make y tensor of shape (B x U_max) + # populate it initially with all 'V' numbers so that the 'V's will remain in the areas that + # are 'padding'. This will be useful for when we make 'log_probs_reorderd' during Viterbi decoding + # in a different function. + y_batch = V * torch.ones((B, U_max), dtype=torch.int64) + for b, y_utt in enumerate(y_list_batch): + U_utt = U_batch[b] + y_batch[b, :U_utt] = torch.tensor(y_utt) + + return ( + log_probs_batch, + y_batch, + T_batch, + U_batch, + token_info_batch, + word_info_batch, + segment_info_batch, + pred_text_batch, + ) diff --git a/tools/nemo_forced_aligner/utils/make_output_files.py b/tools/nemo_forced_aligner/utils/make_output_files.py new file mode 100644 index 0000000000000..830bf476ff2f8 --- /dev/null +++ b/tools/nemo_forced_aligner/utils/make_output_files.py @@ -0,0 +1,210 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import os +from pathlib import Path + +import soundfile as sf +from utils.constants import BLANK_TOKEN, SPACE_TOKEN + + +def _get_utt_id(audio_filepath, audio_filepath_parts_in_utt_id): + fp_parts = Path(audio_filepath).parts[-audio_filepath_parts_in_utt_id:] + utt_id = Path("_".join(fp_parts)).stem + utt_id = utt_id.replace(" ", "-") # replace any spaces in the filepath with dashes + return utt_id + + +def add_t_start_end_to_boundary_info(boundary_info_utt, alignment_utt): + """ + We use the list of alignments to add the timesteps where each token/word/segment is predicted to + start and end. + boundary_info_utt can be any one of the variables referred to as `token_info`, `word_info`, `segment_info` + in other parts of the code. + + e.g. the input boundary info could be + boundary_info_utt = [ + {'text': 'hi', 's_start': 1, 's_end': 3}, + {'text': 'world', 's_start': 7, 's_end': 15}, + {'text': 'hey', 's_start': 19, 's_end': 23}, + ] + + and the alignment could be + alignment_utt = [ 1, 1, 3, 3, 4, 5, 7, 7, 9, 10, 11, 12, 13, 15, 17, 17, 19, 21, 23, 23] + + in which case the output would be: + boundary_info_utt = [ + {'text': 'hi', 's_start': 1, 's_end': 3, 't_start': 0, 't_end': 3}, + {'text': 'world', 's_start': 7, 's_end': 15, 't_start': 6, 't_end': 13}, + {'text': 'hey', 's_start': 19, 's_end': 23, 't_start': 16, 't_end': 19}, + ] + """ + # first remove boundary_info of any items that are not in the alignment + # the only items we expect not to be in the alignment are blanks that the alignment chooses to skip + # we will iterate boundary_info in reverse order for this to make popping the items simple + s_in_alignment = set(alignment_utt) + for boundary_info_pointer in range(len(boundary_info_utt) - 1, -1, -1): + s_in_boundary_info = set( + range( + boundary_info_utt[boundary_info_pointer]["s_start"], + boundary_info_utt[boundary_info_pointer]["s_end"] + 1, + ) + ) + item_not_in_alignment = True + for s_ in s_in_boundary_info: + if s_ in s_in_alignment: + item_not_in_alignment = False + + if item_not_in_alignment: + boundary_info_utt.pop(boundary_info_pointer) + + # now update boundary_info with t_start and t_end + boundary_info_pointer = 0 + for t, s_at_t in enumerate(alignment_utt): + if s_at_t == boundary_info_utt[boundary_info_pointer]["s_start"]: + if "t_start" not in boundary_info_utt[boundary_info_pointer]: + # we have just reached the start of the word/token/segment in the alignment => update t_start + boundary_info_utt[boundary_info_pointer]["t_start"] = t + + if t < len(alignment_utt) - 1: # this if is to avoid accessing an index that is not in the list + if alignment_utt[t + 1] > boundary_info_utt[boundary_info_pointer]["s_end"]: + if "t_end" not in boundary_info_utt[boundary_info_pointer]: + boundary_info_utt[boundary_info_pointer]["t_end"] = t + + boundary_info_pointer += 1 + else: # i.e. t == len(alignment) - 1, i.e. 
we are at the final element in alignment
+                # add final t_end if we haven't already
+                if "t_end" not in boundary_info_utt[boundary_info_pointer]:
+                    boundary_info_utt[boundary_info_pointer]["t_end"] = t
+
+        if boundary_info_pointer == len(boundary_info_utt):
+            # we have finished populating boundary_info with t_start and t_end,
+            # but we might have some final remaining elements (blanks) in the alignment which we don't care about
+            # => break, so as not to cause issues trying to access boundary_info[boundary_info_pointer]
+            break
+
+    return boundary_info_utt
+
+
+def make_ctm(
+    boundary_info_batch,
+    alignments_batch,
+    manifest_lines_batch,
+    model,
+    model_downsample_factor,
+    output_dir,
+    remove_blank_tokens_from_ctm,
+    audio_filepath_parts_in_utt_id,
+    minimum_timestamp_duration,
+    audio_sr,
+):
+    """
+    Function to save CTM files for all the utterances in the incoming batch.
+    """
+
+    assert len(boundary_info_batch) == len(alignments_batch) == len(manifest_lines_batch)
+    # we also assume that utterances are in the same order in boundary_info_batch, alignments_batch
+    # and manifest_lines_batch - this should be the case unless there is a strange bug upstream in the
+    # code
+
+    os.makedirs(output_dir, exist_ok=True)
+
+    # the ratio to convert from timesteps (the units of 't_start' and 't_end' in boundary_info_utt)
+    # to the number of samples ('samples' in the sense of 16000 'samples' per second)
+    timestep_to_sample_ratio = model.preprocessor.featurizer.hop_length * model_downsample_factor
+
+    for boundary_info_utt, alignment_utt, manifest_line in zip(
+        boundary_info_batch, alignments_batch, manifest_lines_batch
+    ):
+
+        boundary_info_utt = add_t_start_end_to_boundary_info(boundary_info_utt, alignment_utt)
+
+        # get utt_id that will be used for saving the CTM file as {utt_id}.ctm
+        utt_id = _get_utt_id(manifest_line['audio_filepath'], audio_filepath_parts_in_utt_id)
+
+        # get audio file duration if we will need it later
+        if minimum_timestamp_duration > 0:
+            with sf.SoundFile(manifest_line["audio_filepath"]) as f:
+                audio_file_duration = f.frames / f.samplerate
+
+        with open(os.path.join(output_dir, f"{utt_id}.ctm"), "w") as f_ctm:
+            for boundary_info_ in boundary_info_utt:  # loop over every token/word/segment
+                text = boundary_info_["text"]
+                start_sample = boundary_info_["t_start"] * timestep_to_sample_ratio
+                end_sample = (boundary_info_["t_end"] + 1) * timestep_to_sample_ratio - 1
+
+                start_time = start_sample / audio_sr
+                end_time = end_sample / audio_sr
+
+                if minimum_timestamp_duration > 0 and minimum_timestamp_duration > end_time - start_time:
+                    # make the predicted duration of the token/word/segment longer, growing it outwards by equal
+                    # amounts from the predicted center of the token/word/segment
+                    token_mid_point = (start_time + end_time) / 2
+                    start_time = max(token_mid_point - minimum_timestamp_duration / 2, 0)
+                    end_time = min(token_mid_point + minimum_timestamp_duration / 2, audio_file_duration)
+
+                if not (text == BLANK_TOKEN and remove_blank_tokens_from_ctm):  # don't save blanks if we don't want to
+                    # replace any spaces with SPACE_TOKEN so we don't introduce extra space characters into our CTM files
+                    text = text.replace(" ", SPACE_TOKEN)
+
+                    f_ctm.write(f"{utt_id} 1 {start_time:.2f} {end_time - start_time:.2f} {text}\n")
+
+    return None
+
+
+def make_new_manifest(
+    output_dir,
+    original_manifest_filepath,
+    additional_ctm_grouping_separator,
+    audio_filepath_parts_in_utt_id,
+    pred_text_all_lines,
+):
+    """
+    Function to save a new manifest with the same info as the original manifest, but also
the paths to the + CTM files for each utterance and the "pred_text" if it was used for the alignment. + """ + if pred_text_all_lines: + with open(original_manifest_filepath, 'r') as f: + num_lines_in_manifest = sum(1 for _ in f) + + if not num_lines_in_manifest == len(pred_text_all_lines): + raise RuntimeError( + f"Number of lines in the original manifest ({num_lines_in_manifest}) does not match " + f"the number of pred_texts we have ({len(pred_text_all_lines)}). Something has gone wrong." + ) + + tgt_manifest_name = str(Path(original_manifest_filepath).stem) + "_with_ctm_paths.json" + tgt_manifest_filepath = str(Path(output_dir) / tgt_manifest_name) + + with open(original_manifest_filepath, 'r') as fin, open(tgt_manifest_filepath, 'w') as fout: + for i_line, line in enumerate(fin): + data = json.loads(line) + + utt_id = _get_utt_id(data["audio_filepath"], audio_filepath_parts_in_utt_id) + + data["token_level_ctm_filepath"] = str(Path(output_dir) / "tokens" / f"{utt_id}.ctm") + data["word_level_ctm_filepath"] = str(Path(output_dir) / "words" / f"{utt_id}.ctm") + + if additional_ctm_grouping_separator: + data["additional_segment_level_ctm_filepath"] = str( + Path(output_dir) / "additional_segments" / f"{utt_id}.ctm" + ) + + if pred_text_all_lines: + data['pred_text'] = pred_text_all_lines[i_line] + + new_line = json.dumps(data) + + fout.write(f"{new_line}\n") diff --git a/tools/nemo_forced_aligner/utils/viterbi_decoding.py b/tools/nemo_forced_aligner/utils/viterbi_decoding.py new file mode 100644 index 0000000000000..bc9a45dda527b --- /dev/null +++ b/tools/nemo_forced_aligner/utils/viterbi_decoding.py @@ -0,0 +1,136 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import torch +from utils.constants import V_NEGATIVE_NUM + + +def viterbi_decoding(log_probs_batch, y_batch, T_batch, U_batch, viterbi_device): + """ + Do Viterbi decoding with an efficient algorithm (the only for-loop in the 'forward pass' is over the time dimension). + Args: + log_probs_batch: tensor of shape (B, T_max, V). The parts of log_probs_batch which are 'padding' are filled + with 'V_NEGATIVE_NUM' - a large negative number which represents a very low probability. + y_batch: tensor of shape (B, U_max) - contains token IDs including blanks in every other position. The parts of + y_batch which are padding are filled with the number 'V'. V = the number of tokens in the vocabulary + 1 for + the blank token. + T_batch: tensor of shape (B, 1) - contains the durations of the log_probs_batch (so we can ignore the + parts of log_probs_batch which are padding) + U_batch: tensor of shape (B, 1) - contains the lengths of y_batch (so we can ignore the parts of y_batch + which are padding). + viterbi_device: the torch device on which Viterbi decoding will be done. + + Returns: + alignments_batch: list of lists containing locations for the tokens we align to at each timestep. + Looks like: [[0, 0, 1, 2, 2, 3, 3, ..., ], ..., [0, 1, 2, 2, 2, 3, 4, ....]]. 
+        Each list inside alignments_batch is of length T_batch[location of utt in batch].
+    """
+    B, T_max, _ = log_probs_batch.shape
+    U_max = y_batch.shape[1]
+
+    # transfer all tensors to viterbi_device
+    log_probs_batch = log_probs_batch.to(viterbi_device)
+    y_batch = y_batch.to(viterbi_device)
+    T_batch = T_batch.to(viterbi_device)
+    U_batch = U_batch.to(viterbi_device)
+
+    # make tensor that we will put at timesteps beyond the duration of the audio
+    padding_for_log_probs = V_NEGATIVE_NUM * torch.ones((B, T_max, 1), device=viterbi_device)
+    # make log_probs_padded tensor of shape (B, T_max, V+1) where all of
+    # log_probs_padded[:,:,-1] is the 'V_NEGATIVE_NUM'
+    log_probs_padded = torch.cat((log_probs_batch, padding_for_log_probs), dim=2)
+    # make log_probs_reordered tensor of shape (B, T_max, U_max)
+    # it contains the log_probs for only the tokens that are in the Ground Truth, and in the order
+    # that they occur
+    log_probs_reordered = torch.gather(input=log_probs_padded, dim=2, index=y_batch.unsqueeze(1).repeat(1, T_max, 1))
+
+    # initialize tensors of viterbi probabilities and backpointers
+    v_matrix = V_NEGATIVE_NUM * torch.ones_like(log_probs_reordered)
+    backpointers = -999 * torch.ones_like(v_matrix)
+    v_matrix[:, 0, :2] = log_probs_reordered[:, 0, :2]
+
+    # Make a letter_repetition_mask the same shape as y_batch.
+    # The letter_repetition_mask will have 'True' where the token (including blanks) is the same
+    # as the token two places before it in the ground truth (and 'False' everywhere else).
+    # We will use letter_repetition_mask to determine whether the Viterbi algorithm needs to look two tokens back or
+    # three tokens back
+    y_shifted_left = torch.roll(y_batch, shifts=2, dims=1)
+    letter_repetition_mask = y_batch - y_shifted_left
+    letter_repetition_mask[:, :2] = 1  # make sure we don't apply the mask to the first 2 tokens
+    letter_repetition_mask = letter_repetition_mask == 0
+
+    # bp_absolute_template is a tensor we will need during the Viterbi decoding to convert our argmaxes from indices between 0 and 2
+    # to indices in the range (0, U_max-1) indicating from which token the most likely path up to that point came from.
+    # It is a tensor of shape (B, U_max) that looks like
+    # bp_absolute_template = [
+    #     [0, 1, 2, ..., U_max - 1],
+    #     [0, 1, 2, ..., U_max - 1],
+    #     [0, 1, 2, ..., U_max - 1],
+    #     ... rows repeated so there are B rows in total
+    # ]
+    bp_absolute_template = torch.arange(U_max, device=viterbi_device).unsqueeze(0).repeat(B, 1)
+
+    for t in range(1, T_max):
+
+        # e_current is a tensor of shape (B, U_max) of the log probs of every possible token at the current timestep
+        e_current = log_probs_reordered[:, t, :]
+
+        # v_prev is a tensor of shape (B, U_max) of the viterbi probabilities 1 timestep back and in the same token position
+        v_prev = v_matrix[:, t - 1, :]
+
+        # v_prev_shifted is a tensor of shape (B, U_max) of the viterbi probabilities 1 timestep back and 1 token position back
+        v_prev_shifted = torch.roll(v_prev, shifts=1, dims=1)
+        # by doing a roll shift of size 1, we have brought the viterbi probability in the final token position to the
+        # first token position - let's overcome this by 'zeroing out' the probabilities in the first token position
+        v_prev_shifted[:, 0] = V_NEGATIVE_NUM
+
+        # v_prev_shifted2 is a tensor of shape (B, U_max) of the viterbi probabilities 1 timestep back and 2 token positions back
+        v_prev_shifted2 = torch.roll(v_prev, shifts=2, dims=1)
+        v_prev_shifted2[:, :2] = V_NEGATIVE_NUM  # zero out as we did for v_prev_shifted
+        # use our letter_repetition_mask to remove the connections between 2 blanks (so we don't skip over a letter)
+        # and to remove the connections between 2 identical consecutive letters (so we don't skip over a blank)
+        v_prev_shifted2.masked_fill_(letter_repetition_mask, V_NEGATIVE_NUM)
+
+        # we need this v_prev_dup tensor so we can calculate the viterbi probability of every possible
+        # token position simultaneously
+        v_prev_dup = torch.cat(
+            (v_prev.unsqueeze(2), v_prev_shifted.unsqueeze(2), v_prev_shifted2.unsqueeze(2),), dim=2,
+        )
+
+        # candidates_v_current are our candidate viterbi probabilities for every token position, from which
+        # we will pick the max and record the argmax
+        candidates_v_current = v_prev_dup + e_current.unsqueeze(2)
+        v_current, bp_relative = torch.max(candidates_v_current, dim=2)
+
+        # convert our argmaxes from indices between 0 and 2, to indices in the range (0, U_max-1) indicating
+        # from which token the most likely path up to that point came from
+        bp_absolute = bp_absolute_template - bp_relative
+
+        # update our tensors containing all the viterbi probabilities and backpointers
+        v_matrix[:, t, :] = v_current
+        backpointers[:, t, :] = bp_absolute
+
+    # trace backpointers TODO: parallelize over batch_size
+    alignments_batch = []
+    for b in range(B):
+        T_b = int(T_batch[b])
+        U_b = int(U_batch[b])
+
+        final_state = int(torch.argmax(v_matrix[b, T_b - 1, U_b - 2 : U_b])) + U_b - 2
+        alignment_b = [final_state]
+        for t in range(T_b - 1, 0, -1):
+            alignment_b.insert(0, int(backpointers[b, t, alignment_b[0]]))
+        alignments_batch.append(alignment_b)
+
+    return alignments_batch
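
[Editor's note] A minimal, self-contained sanity check of the Viterbi decoding above (illustrative only, not part of the patch). It builds a toy batch with a three-character vocabulary plus blank, hand-crafted frame scores, and the ground-truth text "ab", then checks that the recovered alignment walks through blank, 'a', blank, 'b'. The import path assumes the snippet is run from tools/nemo_forced_aligner/.

# Illustrative sanity check only - not part of this PR.
import torch

from utils.viterbi_decoding import viterbi_decoding

# toy setup: vocabulary ['a', 'b', 'c'] -> BLANK_ID = 3, so V = 4 classes in log_probs
BLANK_ID = 3
V = 4

# ground-truth text "ab" interleaved with blanks: [blank, 'a', blank, 'b', blank]
y_batch = torch.tensor([[BLANK_ID, 0, BLANK_ID, 1, BLANK_ID]])  # shape (B=1, U=5)
U_batch = torch.tensor([5])

# hand-made frame scores for 6 frames favouring: blank, 'a', 'a', blank, 'b', 'b'
# (unnormalized scores are fine here - Viterbi only compares relative values)
T_max, favoured = 6, [BLANK_ID, 0, 0, BLANK_ID, 1, 1]
log_probs_batch = torch.full((1, T_max, V), -10.0)
for t, tok in enumerate(favoured):
    log_probs_batch[0, t, tok] = 0.0
T_batch = torch.tensor([T_max])

alignments = viterbi_decoding(log_probs_batch, y_batch, T_batch, U_batch, viterbi_device="cpu")
print(alignments)  # expected: [[0, 1, 1, 2, 3, 3]], i.e. blank, 'a', 'a', blank, 'b', 'b'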
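
[Editor's note] For context, a rough sketch of how the utilities added in this diff compose end to end. The real entry point is the aligner's main script elsewhere in this PR; the `utils.data_prep` module path, the pretrained model name, and model_downsample_factor=8 are assumptions made purely for illustration.

# Rough composition sketch, not the PR's actual entry point.
import json

import torch
from nemo.collections.asr.models import ASRModel

from utils.data_prep import get_batch_tensors_and_boundary_info  # assumed module path
from utils.make_output_files import make_ctm, make_new_manifest
from utils.viterbi_decoding import viterbi_decoding

model = ASRModel.from_pretrained("stt_en_citrinet_1024")  # any NeMo CTC model with .transcribe()
model.eval()

manifest_filepath = "manifest.json"  # placeholder path
with open(manifest_filepath, "r") as f:
    manifest_lines_batch = [json.loads(line) for line in f]

(log_probs, y, T, U, token_info, word_info, segment_info, pred_text) = get_batch_tensors_and_boundary_info(
    manifest_lines_batch, model, separator=None, align_using_pred_text=False
)

alignments = viterbi_decoding(log_probs, y, T, U, viterbi_device=torch.device("cpu"))

# write word-level CTM files; token- and segment-level CTMs would be produced the same way
make_ctm(
    word_info,
    alignments,
    manifest_lines_batch,
    model,
    model_downsample_factor=8,  # assumed: depends on the model architecture
    output_dir="output/words",
    remove_blank_tokens_from_ctm=True,
    audio_filepath_parts_in_utt_id=1,
    minimum_timestamp_duration=0,
    audio_sr=16000,
)

make_new_manifest("output", manifest_filepath, None, 1, pred_text_all_lines=None)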