From 62b044888d2ca4d3409471d005c69b644685fa8a Mon Sep 17 00:00:00 2001 From: Eric Harper Date: Tue, 7 Jun 2022 15:03:40 -0600 Subject: [PATCH] Merge r1.9.0 main (#4331) * update branch Signed-off-by: ericharper * update package info Signed-off-by: ericharper * cleaned up TN/ ITN doc (#4119) * cleaned up TN/ ITN doc Signed-off-by: Yang Zhang * fix typo Signed-off-by: Yang Zhang * fix image Signed-off-by: Yang Zhang * fix image Signed-off-by: Yang Zhang * Draft: Fix restoring from checkpoint for case when `model.common_dataset_parameters.label_vocab_dir` is provided (#4136) * Fix restoring from checkpoint with label vocab dir Signed-off-by: PeganovAnton * Add tests for various ways to pass label ids to model Signed-off-by: PeganovAnton * Fix typo Signed-off-by: PeganovAnton * Fix typo Signed-off-by: PeganovAnton * Do not create tmp directory Signed-off-by: PeganovAnton * Fix parameter name Signed-off-by: PeganovAnton * finish cherry-pick op Signed-off-by: PeganovAnton * Fix labels errors Signed-off-by: PeganovAnton * Remove duplicate stage Signed-off-by: PeganovAnton * Change target branch Signed-off-by: PeganovAnton * fix doc (#4146) Signed-off-by: Yang Zhang * Tacotron2 retrain (#4103) * fix yaml Signed-off-by: treacker * Fix for new TTSDataset class Signed-off-by: treacker * added wandb logging Signed-off-by: treacker * added wandb logging Signed-off-by: treacker * fix numpy version Signed-off-by: treacker * fix numpy version Signed-off-by: treacker * inference fix Signed-off-by: treacker * removed old code Signed-off-by: treacker * updated parser logic Signed-off-by: treacker * reverted version update Signed-off-by: treacker * refactored parser logic Signed-off-by: treacker * Updated Jenkinsfile Signed-off-by: treacker * Refactored tutorial for Tacotron2 Signed-off-by: treacker * Made backward compatibility Signed-off-by: treacker * Made backward compatibility Signed-off-by: treacker * Update Jenkinsfile Signed-off-by: treacker * Update tacotron.yaml 
Signed-off-by: treacker * Refactoring Signed-off-by: treacker * cleaned up TN/ ITN doc (#4119) * cleaned up TN/ ITN doc Signed-off-by: Yang Zhang * fix typo Signed-off-by: Yang Zhang * fix image Signed-off-by: Yang Zhang * fix image Signed-off-by: Yang Zhang Signed-off-by: treacker * Check implicit grad acc in GLUE dataset building (#4123) * Check implicit grad acc in GLUE dataset building Signed-off-by: MaximumEntropy * Fix jenkins test for GLUE/XNLI Signed-off-by: MaximumEntropy Signed-off-by: treacker * Refactoring Signed-off-by: treacker * Refactoring Signed-off-by: treacker * Fixed jenkins Signed-off-by: treacker * Refactoring Signed-off-by: treacker * Refactoring Signed-off-by: treacker * Refactoring Signed-off-by: treacker Co-authored-by: Yang Zhang Co-authored-by: Sandeep Subramanian * Multiprocess improvements (#4127) * initial commit Signed-off-by: nithinraok * start fix Signed-off-by: nithinraok * improve multiprocessing speed while creating speaker dataset Signed-off-by: nithinraok * updated scp to filelist Signed-off-by: nithinraok * notebooks' link, typo and import fix (#4158) * redo missing pr 4007 Signed-off-by: fayejf * remove extremely unreliable links Signed-off-by: fayejf * update speaker docs (#4164) * update speaker docs Signed-off-by: nithinraok * chunks -> segments Signed-off-by: nithinraok * Khz -> kHz Signed-off-by: nithinraok * small fix (#4180) Signed-off-by: fayejf * fix the server key value problem (#4196) Signed-off-by: Yi Dong * Fix/punctuation/trainer required for setting test data (#4199) * Draft of fix Signed-off-by: PeganovAnton * Add warnings and replace globa_step with current_epoch Signed-off-by: PeganovAnton * Small improvements to warnings Signed-off-by: PeganovAnton * Error and warning messages improvements Signed-off-by: PeganovAnton * Replace self.trainer with self._trainer Signed-off-by: PeganovAnton * Update ContextNet version (#4207) Signed-off-by: smajumdar * fix bugs for dialogue tutorial (#4211) Signed-off-by: 
Zhilin Wang * Dialogue tutorial fix (#4214) * fix bugs for dialogue tutorial Signed-off-by: Zhilin Wang * update path for convert_datasets.py due to conflict PR Signed-off-by: Zhilin Wang * Add docs for Thutmose Tagger (#4173) * Add docs for Thutmose Tagger Signed-off-by: Alexandra Antonova * add level in docs Signed-off-by: Alexandra Antonova * delete folder to avoid error with running when folder exists from previous run Signed-off-by: Alexandra Antonova Co-authored-by: Alexandra Antonova Co-authored-by: ekmb * Dialogue tutorial fix (#4218) * fix bugs for dialogue tutorial Signed-off-by: Zhilin Wang * update path for convert_datasets.py due to conflict PR Signed-off-by: Zhilin Wang * restore previously deleted files Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * Dialogue tutorial fix (#4221) * fix bugs for dialogue tutorial Signed-off-by: Zhilin Wang * update path for convert_datasets.py due to conflict PR Signed-off-by: Zhilin Wang * restore previously deleted files Signed-off-by: Zhilin Wang * style fix Signed-off-by: Zhilin Wang * update tutorial Signed-off-by: Zhilin Wang * fix syntax error in ipynb-file (#4228) Signed-off-by: Alexandra Antonova Co-authored-by: Alexandra Antonova * fix json serialize (#4235) Signed-off-by: Yi Dong * Prompt Learning Typo Fixes (#4238) * Prompt tuning notebook typo fixes Signed-off-by: Virginia Adams * Update tutorials.rst * Update prompt_learning.rst * Update prompt_learning.rst * fixing bug 3642622 (#4250) * fixing bug 3642622 Signed-off-by: Ghasem Pasandi * fixing bug 3642622 Signed-off-by: Ghasem Pasandi Co-authored-by: Ghasem Pasandi * fix broken link in the tutorial (#4257) Signed-off-by: Alexandra Antonova Co-authored-by: Alexandra Antonova * Typo fix, branch change, better download messagae (#4262) Signed-off-by: Virginia Adams * Raise error if bicleaner is not installed in NMT Data preprocesing notebook (#4264) * Raise error if bicleaner is not installed Signed-off-by: MaximumEntropy * Clear cells 
Signed-off-by: MaximumEntropy * Fix missing validation dataset, whitelist certain keywords for datasets (#4269) * Fix missing validation dataset, whitelist certain keywords for datasets Signed-off-by: smajumdar * Fix missing validation dataset, whitelist certain keywords for datasets Signed-off-by: smajumdar * Update asr configs with num_workers and pin_memory (#4270) Signed-off-by: smajumdar * Fix epoch end (#4265) Signed-off-by: MaximumEntropy Co-authored-by: Eric Harper * Set Save on train end to false (#4274) * Set Save on train end to false Signed-off-by: Virginia Adams * Update prompt_learning.rst * Update prompt_learning.rst * Update YAML (#4261) Signed-off-by: MaximumEntropy * Updated config to fix CI test OOM error (#4279) * Updated config to fix CI test issue Signed-off-by: Virginia Adams * Increased num workers Signed-off-by: Virginia Adams * verbose k2 install, skip if failed (#4289) Signed-off-by: Aleksandr Laptev Co-authored-by: Aleksandr Laptev * Changed total virtual prompt tokens (#4295) * Changed total virtual prompt tokens Signed-off-by: Virginia Adams * put number of workers back Signed-off-by: Virginia Adams * upper bound lightning Signed-off-by: ericharper * update branch Signed-off-by: ericharper * update config Signed-off-by: ericharper * remove duplicate test Signed-off-by: ericharper * fix tn test cases Signed-off-by: ericharper * add another safe.directory Signed-off-by: ericharper * typo Signed-off-by: ericharper Co-authored-by: Yang Zhang Co-authored-by: PeganovAnton Co-authored-by: treacker <36159472+treacker@users.noreply.github.com> Co-authored-by: Sandeep Subramanian Co-authored-by: Nithin Rao Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com> Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com> Co-authored-by: Somshubra Majumdar Co-authored-by: Zhilin Wang Co-authored-by: bene-ges <61418381+bene-ges@users.noreply.github.com> Co-authored-by: Alexandra Antonova Co-authored-by: ekmb Co-authored-by: 
Virginia Adams <78445382+vadam5@users.noreply.github.com> Co-authored-by: Ghasem <35242805+pasandi20@users.noreply.github.com> Co-authored-by: Ghasem Pasandi Co-authored-by: Aleksandr Laptev Co-authored-by: Aleksandr Laptev --- Dockerfile | 2 +- Jenkinsfile | 54 +- docs/source/nlp/prompt_learning.rst | 31 +- docs/source/nlp/text_normalization/intro.rst | 7 +- .../nlp/text_normalization/neural_models.rst | 23 + .../text_normalization_as_tagging.rst | 165 ++ .../nlp/text_normalization/tn_itn_all.bib | 9 +- docs/source/starthere/tutorials.rst | 3 + .../asr/asr_adapters/train_asr_adapter.py | 19 +- .../asr/conf/carnelinet/carnelinet_384.yaml | 6 + examples/asr/conf/citrinet/citrinet_1024.yaml | 6 + examples/asr/conf/citrinet/citrinet_384.yaml | 6 + examples/asr/conf/citrinet/config_bpe.yaml | 4 + examples/asr/conf/config.yaml | 4 + .../asr/conf/contextnet_rnnt/config_rnnt.yaml | 6 + .../conf/contextnet_rnnt/config_rnnt_bpe.yaml | 6 + examples/asr/conf/jasper/jasper_10x5dr.yaml | 4 + .../asr/conf/marblenet/marblenet_3x2x64.yaml | 6 + .../matchboxnet/matchboxnet_3x1x64_v1.yaml | 6 + .../matchboxnet/matchboxnet_3x1x64_v2.yaml | 6 + .../asr/conf/quartznet/quartznet_15x5.yaml | 6 + .../conf/ssl/citrinet/citrinet_ssl_1024.yaml | 4 + .../conf/ssl/citrinet/citrinet_ssl_ci.yaml | 2 + examples/asr/conf/wav2vec/wav2vecCTC.yaml | 6 + .../asr/conf/wav2vec/wav2vecCTC_large.yaml | 6 + .../asr/conf/wav2vec/wav2vec_pretrain.yaml | 4 + .../conf/wav2vec/wav2vec_pretrain_large.yaml | 4 + .../k2/conf/citrinet/citrinet_mmi_1024.yaml | 6 + .../wav2vec/configs/wav2vecCTC.yaml | 6 + .../wav2vec/configs/wav2vecCTC_large.yaml | 6 + .../wav2vec/configs/wav2vec_pretrain.yaml | 4 + .../configs/wav2vec_pretrain_large.yaml | 4 + .../data/assistant_utils.py | 4 +- .../data/import_datasets.py | 1 - .../conf/megatron_bart_config.yaml | 1 + .../conf/megatron_bert_config.yaml | 1 + .../conf/megatron_gpt_config.yaml | 1 + .../megatron_gpt_prompt_learning_config.yaml | 13 +- .../conf/megatron_ptune_t5.yaml 
| 1 + .../conf/megatron_t5_config.yaml | 1 + ...megatron_t5_config_finetune_glue_eval.yaml | 1 + ...megatron_t5_config_finetune_glue_mnli.yaml | 1 + ...megatron_t5_config_finetune_glue_xnli.yaml | 1 + .../conf/transformer_lm_config.yaml | 1 + .../machine_translation/conf/aayn_base.yaml | 1 + .../conf/aayn_base_megatron.yaml | 1 + .../machine_translation/conf/huggingface.yaml | 1 + examples/tts/conf/tacotron2_44100.yaml | 4 +- .../collections/asr/models/rnnt_bpe_models.py | 2 +- .../punctuation_capitalization_dataset.py | 2 +- .../machine_translation/mt_enc_dec_model.py | 3 + .../punctuation_capitalization_model.py | 38 +- .../modules/common/text_generation_server.py | 3 + scripts/speech_recognition/k2/setup.sh | 2 +- .../test_cases_address.txt | 2 +- .../test_cases_word.txt | 3 +- .../Offline_ASR_with_VAD_for_CTC_models.ipynb | 2 +- tutorials/nlp/02_NLP_Tokenizers.ipynb | 2 +- ...a_Preprocessing_and_Cleaning_for_NMT.ipynb | 1628 +++++++-------- ...on_Synthetic_Tabular_Data_Generation.ipynb | 2 +- .../nlp/Multitask_Prompt_and_PTuning.ipynb | 43 +- .../Non_English_Downstream_Tasks_(NER).ipynb | 1789 +++++++++-------- .../nlp/Punctuation_and_Capitalization.ipynb | 5 +- .../ASR_with_SpeakerDiarization.ipynb | 2 +- .../Speaker_Diarization_Inference.ipynb | 2 +- .../Speaker_Identification_Verification.ipynb | 2 +- .../ITN_with_Thutmose_Tagger.ipynb | 3 +- 67 files changed, 2223 insertions(+), 1777 deletions(-) create mode 100644 docs/source/nlp/text_normalization/neural_models.rst create mode 100644 docs/source/nlp/text_normalization/text_normalization_as_tagging.rst diff --git a/Dockerfile b/Dockerfile index 9384d97bb0e8..4f2b8c50b4a5 100644 --- a/Dockerfile +++ b/Dockerfile @@ -55,7 +55,7 @@ RUN for f in $(ls requirements*.txt); do pip install --disable-pip-version-check # install k2, skip if installation fails COPY scripts /tmp/nemo/scripts/ -RUN /bin/bash /tmp/nemo/scripts/speech_recognition/k2/setup.sh; exit 0 +RUN /bin/bash 
/tmp/nemo/scripts/speech_recognition/k2/setup.sh || exit 0 # copy nemo source into a scratch image FROM scratch as nemo-src diff --git a/Jenkinsfile b/Jenkinsfile index e4f2f47ffea1..05295a70c844 100644 --- a/Jenkinsfile +++ b/Jenkinsfile @@ -15,6 +15,7 @@ pipeline { stage('Add git safe directory'){ steps{ sh 'git config --global --add safe.directory /var/lib/jenkins/workspace/NeMo_$GIT_BRANCH' + sh 'git config --global --add safe.directory /raid/JenkinsWorkDir/workspace/NeMo_$GIT_BRANCH' } } @@ -1590,22 +1591,20 @@ pipeline { } failFast true stages { - stage('Punctuation & Capitalization, Using model.common_dataset_parameters.label_vocab_dir') { + stage('Punctuation & Capitalization, Using model.common_datasest_parameters.label_vocab_dir') { steps { sh 'cd examples/nlp/token_classification && \ - output_dir="$(mktemp -d -p "$(pwd)")" && \ - data_dir="$(mktemp -d -p "$(pwd)")" && \ - cp /home/TestData/nlp/token_classification_punctuation/*.txt "${data_dir}"/ && \ - label_vocab_dir="$(mktemp -d -p "$(pwd)")" && \ + label_vocab_dir=label_vocab_dir && \ + mkdir -p ${label_vocab_dir} && \ punct_label_vocab="${label_vocab_dir}/punct_label_vocab.csv" && \ capit_label_vocab="${label_vocab_dir}/capit_label_vocab.csv" && \ printf "O\n,\n.\n?\n" > "${punct_label_vocab}" && \ printf "O\nU\n" > "${capit_label_vocab}" && \ - python punctuation_capitalization_train_evaluate.py \ + CUDA_LAUNCH_BLOCKING=1 python punctuation_capitalization_train_evaluate.py \ model.train_ds.use_tarred_dataset=false \ - model.train_ds.ds_item="${data_dir}" \ - model.validation_ds.ds_item="${data_dir}" \ - model.test_ds.ds_item="${data_dir}" \ + model.train_ds.ds_item=/home/TestData/nlp/token_classification_punctuation \ + model.validation_ds.ds_item=/home/TestData/nlp/token_classification_punctuation \ + model.test_ds.ds_item=/home/TestData/nlp/token_classification_punctuation \ model.language_model.pretrained_model_name=distilbert-base-uncased \ 
model.common_dataset_parameters.label_vocab_dir="${label_vocab_dir}" \ model.class_labels.punct_labels_file="$(basename "${punct_label_vocab}")" \ @@ -1616,15 +1615,15 @@ pipeline { trainer.devices=[0,1] \ trainer.strategy=ddp \ trainer.max_epochs=1 \ - +exp_manager.explicit_log_dir="${output_dir}" \ + +exp_manager.explicit_log_dir=/home/TestData/nlp/token_classification_punctuation/output \ +do_testing=false && \ - python punctuation_capitalization_train_evaluate.py \ + CUDA_LAUNCH_BLOCKING=1 python punctuation_capitalization_train_evaluate.py \ +do_training=false \ +do_testing=true \ ~model.train_ds \ ~model.validation_ds \ - model.test_ds.ds_item="${data_dir}" \ - pretrained_model="${output_dir}/checkpoints/Punctuation_and_Capitalization.nemo" \ + model.test_ds.ds_item=/home/TestData/nlp/token_classification_punctuation \ + pretrained_model=/home/TestData/nlp/token_classification_punctuation/output/checkpoints/Punctuation_and_Capitalization.nemo \ +model.train_ds.use_cache=false \ +model.validation_ds.use_cache=false \ +model.test_ds.use_cache=false \ @@ -1632,29 +1631,27 @@ pipeline { trainer.strategy=ddp \ trainer.max_epochs=1 \ exp_manager=null && \ - rm -rf "${label_vocab_dir}" "${data_dir}" "${output_dir}"' + rm -r "${label_vocab_dir}" && \ + rm -rf /home/TestData/nlp/token_classification_punctuation/output/*' } } - stage('Punctuation & Capitalization, Using model.common_dataset_parameters.{punct,capit}_label_ids') { + stage('Punctuation & Capitalization, Using model.common_datasest_parameters.{punct,capit}_label_ids') { steps { sh 'cd examples/nlp/token_classification && \ - output_dir="$(mktemp -d -p "$(pwd)")" && \ - data_dir="$(mktemp -d -p "$(pwd)")" && \ - cp /home/TestData/nlp/token_classification_punctuation/*.txt "${data_dir}"/ && \ - conf_path="$(mktemp -d -p "$(pwd)")" && \ + conf_path=/home/TestData/nlp/token_classification_punctuation && \ conf_name=punctuation_capitalization_config_with_ids && \ cp conf/punctuation_capitalization_config.yaml 
"${conf_path}/${conf_name}.yaml" && \ sed -i $\'s/punct_label_ids: null/punct_label_ids: {O: 0, \\\',\\\': 1, .: 2, \\\'?\\\': 3}/\' \ "${conf_path}/${conf_name}.yaml" && \ sed -i $\'s/capit_label_ids: null/capit_label_ids: {O: 0, U: 1}/\' \ "${conf_path}/${conf_name}.yaml" && \ - python punctuation_capitalization_train_evaluate.py \ + CUDA_LAUNCH_BLOCKING=1 python punctuation_capitalization_train_evaluate.py \ --config-path "${conf_path}" \ --config-name "${conf_name}" \ model.train_ds.use_tarred_dataset=false \ - model.train_ds.ds_item="${data_dir}" \ - model.validation_ds.ds_item="${data_dir}" \ - model.test_ds.ds_item="${data_dir}" \ + model.train_ds.ds_item=/home/TestData/nlp/token_classification_punctuation \ + model.validation_ds.ds_item=/home/TestData/nlp/token_classification_punctuation \ + model.test_ds.ds_item=/home/TestData/nlp/token_classification_punctuation \ model.language_model.pretrained_model_name=distilbert-base-uncased \ +model.train_ds.use_cache=false \ +model.validation_ds.use_cache=false \ @@ -1662,15 +1659,15 @@ pipeline { trainer.devices=[0,1] \ trainer.strategy=ddp \ trainer.max_epochs=1 \ - +exp_manager.explicit_log_dir="${output_dir}" \ + +exp_manager.explicit_log_dir=/home/TestData/nlp/token_classification_punctuation/output \ +do_testing=false && \ - python punctuation_capitalization_train_evaluate.py \ + CUDA_LAUNCH_BLOCKING=1 python punctuation_capitalization_train_evaluate.py \ +do_training=false \ +do_testing=true \ ~model.train_ds \ ~model.validation_ds \ - model.test_ds.ds_item="${data_dir}" \ - pretrained_model="${output_dir}/checkpoints/Punctuation_and_Capitalization.nemo" \ + model.test_ds.ds_item=/home/TestData/nlp/token_classification_punctuation \ + pretrained_model=/home/TestData/nlp/token_classification_punctuation/output/checkpoints/Punctuation_and_Capitalization.nemo \ +model.train_ds.use_cache=false \ +model.validation_ds.use_cache=false \ +model.test_ds.use_cache=false \ @@ -1678,7 +1675,8 @@ pipeline { 
trainer.strategy=ddp \ trainer.max_epochs=1 \ exp_manager=null && \ - rm -rf "${output_dir}" "${data_dir}" "${conf_path}"' + rm -rf /home/TestData/nlp/token_classification_punctuation/output/* && \ + rm "${conf_path}/${conf_name}.yaml"' } } } diff --git a/docs/source/nlp/prompt_learning.rst b/docs/source/nlp/prompt_learning.rst index ca3d3d4e8f95..a92923e26e21 100644 --- a/docs/source/nlp/prompt_learning.rst +++ b/docs/source/nlp/prompt_learning.rst @@ -10,6 +10,8 @@ Instead of selecting discrete text prompts in a manual or automated fashion, pro Our continuous learning capability for combined p-tuning and prompt tuning with GPT style models is a NeMo specific extension of the author's original work. +Please also checkout our `prompt learning tutorial notebook. `_ + Terminology ^^^^^^^^^^ @@ -89,14 +91,17 @@ the input will be translated into ``VVV Hypothesis: And he said, Mama, I'm home. "prompt_template": "<|VIRTUAL_PROMPT_0|> {sentence} sentiment: {label}", "total_virtual_tokens": 10, "virtual_token_splits": [10], - "truncate_field": "sentence" + "truncate_field": "sentence", + "answer_only_loss": False, }, { "taskname": "intent_and_slot", "prompt_template": "<|VIRTUAL_PROMPT_0|> Predict intent and slot <|VIRTUAL_PROMPT_1|> :\n{utterance}{label}", "total_virtual_tokens": 10, "virtual_token_splits": [7, 3], - "truncate_field": None + "truncate_field": None, + "answer_only_loss": True, + "answer_field": "label" } ] @@ -198,9 +203,9 @@ Setting New Tasks After you p-tune or prompt-tune your model, you can always go back and p-tune or prompt-tune your model on more tasks without over writing the virtual prompts who've trained already. You can also use a different number of ``total_virtual_tokens`` between each training session as long as tasks ptuned or prompt tuned at the same time have the same number of ``total_virtual_tokens``. 
For this reason, when you ptune on a new task, you need to tell your model which of your tasks are new and which ones already exist (and thus you don't want to tune them). You do this by setting the ``new_tasks`` and ``existing_tasks`` values in the config file. -Example Multi-Task Prompt Tuning Command +Example Multi-Task Prompt Tuning Config and Command ^^^^^^^^^^ -First define a config called ``multitask-prompt-learning.yaml`` that looks like: +First define a config called ``multitask-prompt-learning.yaml`` demonstrated below. **In the** ``exp_manager`` **portion of the config,** ``save_on_train_end`` **should be set to** ``False`` **to avoid unnecessarily saving the incorrect model weights.** This is already done in the example `megatron_gpt_prompt_learning_config.yaml config `_ that you should use as your starting point. The correct prompt learning model will be saved at the ``model.nemo_path`` you set. .. code:: @@ -229,12 +234,15 @@ First define a config called ``multitask-prompt-learning.yaml`` that looks like: total_virtual_tokens: 100 virtual_token_splits: [100] truncate_field: null + answer_only_loss: False - taskname: "intent_and_slot" prompt_template: "<|VIRTUAL_PROMPT_0|> Predict intent and slot <|VIRTUAL_PROMPT_1|> :\n{utterance}{label}" total_virtual_tokens: 100 virtual_token_splits: [80, 20] truncate_field: null + answer_only_loss: True + answer_field: "label" prompt_tuning: new_prompt_init_methods: ["text", "text"] @@ -259,7 +267,7 @@ Then run the command python megatron_gpt_prompt_learning.py --config-name=multitask-prompt-learning.yaml -Example Multi-Task P-Tuning Command After Prompt-Tuning +Example Multi-Task P-Tuning Config and Command After Prompt-Tuning ^^^^^^^^^^ Update ``multitask-prompt-learning.yaml`` from the example above with p-tuning parameters for the new task. 
Be sure to update ``model.existing_tasks`` with the tasknames from previous prompt learning runs and to use the ``.nemo`` file saved at the end of your last prompt learning session. Values different from the config above have stars commented next to them. @@ -284,7 +292,7 @@ In this example, the SQuAD task includes the question context as part of the pro restore_path: multitask_prompt_tuning.nemo # *** language_model_path: models/megatron_125M_gpt.nemo existing_tasks: ["sentiment", "intent_and_slot"] # *** - new_tasks: ["sentiment", "intent_and_slot"] + new_tasks: ["squad"] task_templates: - taskname: "sentiment" @@ -292,20 +300,23 @@ In this example, the SQuAD task includes the question context as part of the pro total_virtual_tokens: 100 virtual_token_splits: [100] truncate_field: null + answer_only_loss: False - taskname: "intent_and_slot" prompt_template: "<|VIRTUAL_PROMPT_0|> Predict intent and slot <|VIRTUAL_PROMPT_1|> :\n{utterance}{label}" total_virtual_tokens: 100 virtual_token_splits: [80, 20] truncate_field: null + answer_only_loss: True + answer_field: "label" - taskname: "squad" # *** - prompt_template: "<|VIRTUAL_PROMPT_0|> Answer the question from the context <|VIRTUAL_PROMPT_1|> {question} <|VIRTUAL_PROMPT_2|> {context} <|VIRTUAL_PROMPT_3|> Answer: {answer}" # *** - total_virtual_tokens: 16 # *** - virtual_token_splits: [4, 4, 4, 4] # *** + prompt_template: "<|VIRTUAL_PROMPT_0|> Answer the question from the context {question} {context} Answer: {answer}" # *** + total_virtual_tokens: 9 # *** + virtual_token_splits: [9] # *** truncate_field: context # *** answer_only_loss: True # *** - answer_field: 'answer # *** + answer_field: "answer" # *** p_tuning: # *** dropout: 0.0 # *** diff --git a/docs/source/nlp/text_normalization/intro.rst b/docs/source/nlp/text_normalization/intro.rst index e560372f8831..1b9365728fcc 100644 --- a/docs/source/nlp/text_normalization/intro.rst +++ b/docs/source/nlp/text_normalization/intro.rst @@ -1,6 +1,8 @@ (Inverse) Text 
Normalization ============================ +NeMo supports Text Normalization (TN) and Inverse Text Normalization (ITN) tasks via the rule-based `nemo_text_processing` Python package and neural TN/ITN models. + Rule-based (WFST) TN/ITN: .. toctree:: @@ -9,11 +11,10 @@ Rule-based (WFST) TN/ITN: wfst/intro -Neural TN/ITN: +Neural-based TN/ITN: .. toctree:: :maxdepth: 1 - nn_text_normalization - + neural_models diff --git a/docs/source/nlp/text_normalization/neural_models.rst b/docs/source/nlp/text_normalization/neural_models.rst new file mode 100644 index 000000000000..10206da067a3 --- /dev/null +++ b/docs/source/nlp/text_normalization/neural_models.rst @@ -0,0 +1,23 @@ +.. _neural_models: + +Neural Models for (Inverse) Text Normalization +============================================== + +NeMo provides two types of neural models: + + +Duplex T5-based TN/ITN: + +.. toctree:: + :maxdepth: 1 + + nn_text_normalization + + +Single-pass Tagger-based ITN: + +.. toctree:: + :maxdepth: 1 + + text_normalization_as_tagging + diff --git a/docs/source/nlp/text_normalization/text_normalization_as_tagging.rst b/docs/source/nlp/text_normalization/text_normalization_as_tagging.rst new file mode 100644 index 000000000000..25926bd45c69 --- /dev/null +++ b/docs/source/nlp/text_normalization/text_normalization_as_tagging.rst @@ -0,0 +1,165 @@ +.. _text_normalization_as_tagging: + +Thutmose Tagger: Single-pass Tagger-based ITN Model +=================================================== +Inverse text normalization (ITN) converts text from the spoken domain (e.g., an ASR output) into its written form: + +Input: ``on may third we paid one hundred and twenty three dollars`` +Output: ``on may 3 we paid $123`` + +`ThutmoseTaggerModel `__ is a single-pass tagger-based model mapping spoken-domain words to written-domain fragments. 
+Additionally, this model predicts "semiotic" classes of the spoken words (e.g., words belonging to the spans that are about times, dates, or monetary amounts). + +The typical workflow is to first prepare the dataset, which requires finding granular alignments between spoken-domain words and written-domain fragments. +An example bash script for the data preparation pipeline is provided: `prepare_dataset_en.sh `__. +After preparing the dataset, you can train the model. An example training script is provided: `normalization_as_tagging_train.py `__. +The script for inference from a raw text file is provided here: `normalization_as_tagging_infer.py `__. +An example bash script that runs inference and evaluation is provided here: `run_infer.sh `__. + + +Quick Start Guide +----------------- + +To run the pretrained models, see :ref:`inference_text_normalization`. + +Available models +^^^^^^^^^^^^^^^^ + +.. list-table:: *Pretrained Models* + :widths: 5 10 + :header-rows: 1 + + * - Model + - Pretrained Checkpoint + * - itn_en_thutmose_bert + - https://ngc.nvidia.com/catalog/models/nvidia:nemo:itn_en_thutmose_bert + + +Initial Data +------------ +The initial data from which the dataset is prepared is the `Google text normalization dataset `__. +It is stored in tab-separated files (``.tsv``) with three columns. +The first column is the "semiotic class" (e.g., numbers, times, dates), the second is the token +in written form, and the third is the spoken form. An example sentence in the dataset is shown below. +In the example, ```` denotes that the spoken form is the same as the written form. + +.. code:: + + PLAIN The + PLAIN company + PLAIN revenues + PLAIN grew + PLAIN four + PLAIN fold + PLAIN between + DATE 2005 two thousand five + PLAIN and + DATE 2008 two thousand eight + PUNCT . + + + +More information about the Google Text Normalization Dataset can be found in the paper `RNN Approaches to Text Normalization: A Challenge `__ :cite:`nlp-textnorm-sproat2016rnn`. 
+ + +Data preprocessing +------------------ + +Our preprocessing is rather complicated, because we need to find granular alignments for semiotic spans that are aligned at phrase-level in Google Text Normalization Dataset. +Right now we only provide data preparation scripts for English and Russian languages, see `prepare_dataset_en.sh `__ and `prepare_dataset_ru.sh `__. +Data preparation includes running the GIZA++ automatic alignment tool, see `install_requirements.sh `__ for installation details. +The purpose of the preprocessing scripts is to build the training dataset for the tagging model. +The final dataset has a simple 3-column tsv format: 1) input sentence, 2) tags for input words, 3) coordinates of "semiotic" spans if any + +.. code:: + + this plan was first enacted in nineteen eighty four and continued to be followed for nineteen years _19 8 4_ _19_ DATE 6 9;CARDINAL 15 16 + + +Model Training +-------------- + +An example training script is provided: `normalization_as_tagging_train.py `__. +The config file used by default is `thutmose_tagger_itn_config.yaml `__. +You can change any of the parameters directly from the config file or update them with the command-line arguments. + +Most arguments in the example config file are quite self-explanatory (e.g., *model.optim.lr* refers to the learning rate for training the decoder). We have set most of the hyper-parameters to +be the values that we found to be effective (for the English and the Russian subsets of the Google TN dataset). +Some arguments that you may want to modify are: + +- *lang*: The language of the dataset. + +- *data.train_ds.data_path*: The path to the training file. + +- *data.validation_ds.data_path*: The path to the validation file. + +- *model.language_model.pretrained_model_name*: The huggingface transformer model used to initialize the model weights + +- *model.label_map*: The path/.../label_map.txt. This is the dictionary of possible output tags that model may produce. 
+ +- *model.semiotic_classes*: The path/to/.../semiotic_classes.txt. This is the list of possible semiotic classes. + + +Example of a training command: + +.. code:: + + python examples/nlp/text_normalization_as_tagging/normalization_as_tagging_train.py \ + lang=en \ + data.validation_ds.data_path=/valid.tsv \ + data.train_ds.data_path=/train.tsv \ + model.language_model.pretrained_model_name=bert-base-uncased \ + model.label_map=/label_map.txt \ + model.semiotic_classes=/semiotic_classes.txt \ + trainer.max_epochs=5 + + + +.. _inference_text_normalization: + +Model Inference +--------------- + +Run the inference: + +.. code:: + + python examples/nlp/text_normalization_as_tagging/normalization_as_tagging_infer.py \ + pretrained_model=itn_en_thutmose_bert \ + inference.from_file=./test_sent.txt \ + inference.out_file=./output.tsv + +The output tsv file consists of 5 columns: + + * Final output text - it is generated from predicted tags after some simple post-processing. + * Input text. + * Sequence of predicted tags - one tag for each input word. + * Sequence of tags after post-processing (some swaps may be applied). + * Sequence of predicted semiotic classes - one class for each input word. + + +Model Architecture +------------------ + +The model first uses a Transformer encoder (e.g., bert-base-uncased) to build a +contextualized representation for each input token. It then uses a classification head +to predict the tag for each token. Another classification head is used to predict a "semiotic" class label for each token. + +Overall, our design is partly inspired by the LaserTagger approach proposed in the paper +`Encode, tag, realize: High-precision text editing `__ :cite:`nlp-textnorm-malmi2019encode`. + +The LaserTagger method is not directly applicable to ITN because it can only regard the whole non-common fragment as a single +replacement tag, whereas spoken-to-written conversion, e.g. a date, needs to be aligned on a more granular level. 
Otherwise, +the tag vocabulary should include all possible numbers, dates etc. which is impossible. For example, given an example pair "over +four hundred thousand fish" - "over 400,000 fish", LaserTagger will need a single replacement "400,000" in the tag vocabulary. +To overcome this problem, we use another method of collecting the vocabulary of replacement tags, based on automatic alignment of spoken-domain words to small fragments of +written-domain text along with and tags. + + +References +---------- + +.. bibliography:: tn_itn_all.bib + :style: plain + :labelprefix: NLP-TEXTNORM + :keyprefix: nlp-textnorm- diff --git a/docs/source/nlp/text_normalization/tn_itn_all.bib b/docs/source/nlp/text_normalization/tn_itn_all.bib index 42f9a090021f..6fc843110e16 100644 --- a/docs/source/nlp/text_normalization/tn_itn_all.bib +++ b/docs/source/nlp/text_normalization/tn_itn_all.bib @@ -87,4 +87,11 @@ @inproceedings{koehn-etal-2007-moses publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/P07-2045", pages = "177--180", -} \ No newline at end of file +} + +@article{malmi2019encode, + title={Encode, tag, realize: High-precision text editing}, + author={Malmi, Eric and Krause, Sebastian and Rothe, Sascha and Mirylenka, Daniil and Severyn, Aliaksei}, + journal={arXiv preprint arXiv:1909.01187}, + year={2019} +} diff --git a/docs/source/starthere/tutorials.rst b/docs/source/starthere/tutorials.rst index e0115937a0c1..5208e4022ea9 100644 --- a/docs/source/starthere/tutorials.rst +++ b/docs/source/starthere/tutorials.rst @@ -136,6 +136,9 @@ To run a tutorial: * - NLP - Relation Extraction - BioMegatron - `Relation Extraction - BioMegatron `_ + * - NLP + - P-Tuning/Prompt-Tuning + - `P-Tuning/Prompt-Tuning `_ * - TTS - Speech Synthesis - `TTS Inference `_ diff --git a/examples/asr/asr_adapters/train_asr_adapter.py b/examples/asr/asr_adapters/train_asr_adapter.py index 684774dbe5df..fb55ac18d24f 100644 --- 
a/examples/asr/asr_adapters/train_asr_adapter.py +++ b/examples/asr/asr_adapters/train_asr_adapter.py @@ -85,8 +85,15 @@ def update_model_config_to_support_adapter(model_cfg, current_cfg): def update_model_cfg(original_cfg, new_cfg): - with open_dict(new_cfg): - # drop keys which dont exist in old config + with open_dict(original_cfg), open_dict(new_cfg): + # force inject some keys into the config + whitelist_keys = ['num_workers', 'pin_memory'] + for wkey in whitelist_keys: + if wkey in new_cfg: + original_cfg[wkey] = new_cfg[wkey] + print(f"Injecting white listed key `{wkey}` into config") + + # drop keys which don't exist in old config and are not whitelisted new_keys = list(new_cfg.keys()) for key in new_keys: if key not in original_cfg: @@ -141,11 +148,11 @@ def main(cfg): # Setup model for finetuning (train and validation only) cfg.model.train_ds = update_model_cfg(model.cfg.train_ds, cfg.model.train_ds) - cfg.model.validation_ds = update_model_cfg(model.cfg.validation_ds, cfg.model.validation_ds) - - # Call the dataloaders and optimizer + scheduler model.setup_training_data(cfg.model.train_ds) - model.setup_multiple_validation_data(cfg.model.validation_ds) + + if 'validation_ds' in cfg.model: + cfg.model.validation_ds = update_model_cfg(model.cfg.validation_ds, cfg.model.validation_ds) + model.setup_multiple_validation_data(cfg.model.validation_ds) # Setup optimizer model.setup_optimization(cfg.model.optim) diff --git a/examples/asr/conf/carnelinet/carnelinet_384.yaml b/examples/asr/conf/carnelinet/carnelinet_384.yaml index a706632cce25..2d3d567be510 100644 --- a/examples/asr/conf/carnelinet/carnelinet_384.yaml +++ b/examples/asr/conf/carnelinet/carnelinet_384.yaml @@ -35,6 +35,8 @@ model: use_start_end_token: false max_duration: 16.7 shuffle: true + num_workers: 8 + pin_memory: true # tarred datasets is_tarred: false tarred_audio_filepaths: null @@ -50,6 +52,8 @@ model: batch_size: 32 shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: 
true test_ds: manifest_filepath: null @@ -57,6 +61,8 @@ model: batch_size: 32 shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true model_defaults: repeat: 5 diff --git a/examples/asr/conf/citrinet/citrinet_1024.yaml b/examples/asr/conf/citrinet/citrinet_1024.yaml index 89bdfff036ea..324623c5fd88 100644 --- a/examples/asr/conf/citrinet/citrinet_1024.yaml +++ b/examples/asr/conf/citrinet/citrinet_1024.yaml @@ -25,6 +25,8 @@ model: max_duration: 20.0 shuffle: true use_start_end_token: false + num_workers: 8 + pin_memory: true # tarred datasets is_tarred: false tarred_audio_filepaths: null @@ -39,6 +41,8 @@ model: batch_size: 32 shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true test_ds: manifest_filepath: null @@ -46,6 +50,8 @@ model: batch_size: 32 shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true model_defaults: repeat: 5 diff --git a/examples/asr/conf/citrinet/citrinet_384.yaml b/examples/asr/conf/citrinet/citrinet_384.yaml index de0e8082f103..b49ab1f5aee5 100644 --- a/examples/asr/conf/citrinet/citrinet_384.yaml +++ b/examples/asr/conf/citrinet/citrinet_384.yaml @@ -24,6 +24,8 @@ model: max_duration: 16.7 shuffle: true use_start_end_token: false + num_workers: 8 + pin_memory: true # tarred datasets is_tarred: false tarred_audio_filepaths: null @@ -39,6 +41,8 @@ model: batch_size: 32 shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true test_ds: manifest_filepath: null @@ -46,6 +50,8 @@ model: batch_size: 32 shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true model_defaults: repeat: 5 diff --git a/examples/asr/conf/citrinet/config_bpe.yaml b/examples/asr/conf/citrinet/config_bpe.yaml index 96b1f58ad835..2cb2768793c0 100644 --- a/examples/asr/conf/citrinet/config_bpe.yaml +++ b/examples/asr/conf/citrinet/config_bpe.yaml @@ -12,6 +12,8 @@ model: trim_silence: True max_duration: 16.7 shuffle: True + num_workers: 8 + pin_memory: true # 
tarred datasets is_tarred: false tarred_audio_filepaths: null @@ -26,6 +28,8 @@ model: sample_rate: 16000 batch_size: 32 shuffle: False + num_workers: 8 + pin_memory: true tokenizer: dir: ??? # path to directory which contains either tokenizer.model (bpe) or vocab.txt (for wpe) diff --git a/examples/asr/conf/config.yaml b/examples/asr/conf/config.yaml index dfd27a03f65a..2b2163b57474 100644 --- a/examples/asr/conf/config.yaml +++ b/examples/asr/conf/config.yaml @@ -15,6 +15,8 @@ model: trim_silence: True max_duration: 16.7 shuffle: True + num_workers: 8 + pin_memory: true # tarred datasets is_tarred: false tarred_audio_filepaths: null @@ -29,6 +31,8 @@ model: labels: *labels batch_size: 32 shuffle: False + num_workers: 8 + pin_memory: true preprocessor: _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor diff --git a/examples/asr/conf/contextnet_rnnt/config_rnnt.yaml b/examples/asr/conf/contextnet_rnnt/config_rnnt.yaml index 0534d38389d9..a58c467b8110 100644 --- a/examples/asr/conf/contextnet_rnnt/config_rnnt.yaml +++ b/examples/asr/conf/contextnet_rnnt/config_rnnt.yaml @@ -15,6 +15,8 @@ model: max_duration: 16.7 labels: ${model.labels} shuffle: true + num_workers: 8 + pin_memory: true # tarred datasets is_tarred: false tarred_audio_filepaths: null @@ -30,6 +32,8 @@ model: batch_size: 32 shuffle: false labels: ${model.labels} + num_workers: 8 + pin_memory: true test_ds: manifest_filepath: null @@ -37,6 +41,8 @@ model: batch_size: 32 shuffle: false labels: ${model.labels} + num_workers: 8 + pin_memory: true model_defaults: repeat: 5 diff --git a/examples/asr/conf/contextnet_rnnt/config_rnnt_bpe.yaml b/examples/asr/conf/contextnet_rnnt/config_rnnt_bpe.yaml index 2660fab0a25e..1f4dd0e954c9 100644 --- a/examples/asr/conf/contextnet_rnnt/config_rnnt_bpe.yaml +++ b/examples/asr/conf/contextnet_rnnt/config_rnnt_bpe.yaml @@ -16,6 +16,8 @@ model: max_duration: 16.7 labels: [] shuffle: true + num_workers: 8 + pin_memory: true # tarred datasets is_tarred: 
false tarred_audio_filepaths: null @@ -30,6 +32,8 @@ model: batch_size: 32 shuffle: false labels: [] + num_workers: 8 + pin_memory: true test_ds: manifest_filepath: null @@ -37,6 +41,8 @@ model: batch_size: 32 shuffle: false labels: [] + num_workers: 8 + pin_memory: true model_defaults: repeat: 5 diff --git a/examples/asr/conf/jasper/jasper_10x5dr.yaml b/examples/asr/conf/jasper/jasper_10x5dr.yaml index 63ee9a41c6db..ad2f0536c133 100644 --- a/examples/asr/conf/jasper/jasper_10x5dr.yaml +++ b/examples/asr/conf/jasper/jasper_10x5dr.yaml @@ -13,6 +13,8 @@ model: trim_silence: True max_duration: 16.7 shuffle: True + num_workers: 8 + pin_memory: true # tarred datasets is_tarred: false tarred_audio_filepaths: null @@ -28,6 +30,8 @@ model: labels: *labels batch_size: 32 shuffle: False + num_workers: 8 + pin_memory: true preprocessor: _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor diff --git a/examples/asr/conf/marblenet/marblenet_3x2x64.yaml b/examples/asr/conf/marblenet/marblenet_3x2x64.yaml index 58ea0b3c65d1..1042408fc4f1 100644 --- a/examples/asr/conf/marblenet/marblenet_3x2x64.yaml +++ b/examples/asr/conf/marblenet/marblenet_3x2x64.yaml @@ -19,6 +19,8 @@ model: tarred_audio_filepaths: null tarred_shard_strategy: "scatter" shuffle_n: 2048 + num_workers: 8 + pin_memory: true # bucketing params bucketing_strategy: "synced_randomized" bucketing_batch_size: null @@ -38,6 +40,8 @@ model: labels: ${model.labels} batch_size: 128 shuffle: False + num_workers: 8 + pin_memory: true val_loss_idx: 0 test_ds: @@ -46,6 +50,8 @@ model: labels: ${model.labels} batch_size: 128 shuffle: False + num_workers: 8 + pin_memory: true test_loss_idx: 0 preprocessor: diff --git a/examples/asr/conf/matchboxnet/matchboxnet_3x1x64_v1.yaml b/examples/asr/conf/matchboxnet/matchboxnet_3x1x64_v1.yaml index 33f25af7c047..f1a3336bb1b2 100644 --- a/examples/asr/conf/matchboxnet/matchboxnet_3x1x64_v1.yaml +++ b/examples/asr/conf/matchboxnet/matchboxnet_3x1x64_v1.yaml @@ -21,6 
+21,8 @@ model: labels: ${model.labels} batch_size: 128 shuffle: True + num_workers: 8 + pin_memory: true # tarred datasets is_tarred: false tarred_audio_filepaths: null @@ -44,6 +46,8 @@ model: labels: ${model.labels} batch_size: 128 shuffle: False + num_workers: 8 + pin_memory: true val_loss_idx: 0 test_ds: @@ -52,6 +56,8 @@ model: labels: ${model.labels} batch_size: 128 shuffle: False + num_workers: 8 + pin_memory: true test_loss_idx: 0 preprocessor: diff --git a/examples/asr/conf/matchboxnet/matchboxnet_3x1x64_v2.yaml b/examples/asr/conf/matchboxnet/matchboxnet_3x1x64_v2.yaml index 168c9e0ab531..929ec7a9afe4 100644 --- a/examples/asr/conf/matchboxnet/matchboxnet_3x1x64_v2.yaml +++ b/examples/asr/conf/matchboxnet/matchboxnet_3x1x64_v2.yaml @@ -21,6 +21,8 @@ model: labels: ${model.labels} batch_size: 128 shuffle: True + num_workers: 8 + pin_memory: true # tarred datasets is_tarred: false tarred_audio_filepaths: null @@ -44,6 +46,8 @@ model: labels: ${model.labels} batch_size: 128 shuffle: False + num_workers: 8 + pin_memory: true val_loss_idx: 0 test_ds: @@ -52,6 +56,8 @@ model: labels: ${model.labels} batch_size: 128 shuffle: False + num_workers: 8 + pin_memory: true test_loss_idx: 0 preprocessor: diff --git a/examples/asr/conf/quartznet/quartznet_15x5.yaml b/examples/asr/conf/quartznet/quartznet_15x5.yaml index a6c1bbda38d5..269be113e7be 100644 --- a/examples/asr/conf/quartznet/quartznet_15x5.yaml +++ b/examples/asr/conf/quartznet/quartznet_15x5.yaml @@ -16,6 +16,8 @@ model: trim_silence: True max_duration: 16.7 shuffle: True + num_workers: 8 + pin_memory: true # tarred datasets is_tarred: false tarred_audio_filepaths: null @@ -30,6 +32,8 @@ model: labels: *labels batch_size: 32 shuffle: False + num_workers: 8 + pin_memory: true test_ds: manifest_filepath: null @@ -37,6 +41,8 @@ model: labels: *labels batch_size: 32 shuffle: False + num_workers: 8 + pin_memory: true preprocessor: _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor diff 
--git a/examples/asr/conf/ssl/citrinet/citrinet_ssl_1024.yaml b/examples/asr/conf/ssl/citrinet/citrinet_ssl_1024.yaml index 3dfaca588441..bc6fc7536972 100644 --- a/examples/asr/conf/ssl/citrinet/citrinet_ssl_1024.yaml +++ b/examples/asr/conf/ssl/citrinet/citrinet_ssl_1024.yaml @@ -28,6 +28,8 @@ model: tarred_audio_filepaths: null shuffle_n: 2048 use_start_end_token: false + num_workers: 8 + pin_memory: true # bucketing params bucketing_strategy: "synced_randomized" bucketing_batch_size: null @@ -38,6 +40,8 @@ model: batch_size: 32 shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true max_duration: 35.0 min_duration: 8.0 diff --git a/examples/asr/conf/ssl/citrinet/citrinet_ssl_ci.yaml b/examples/asr/conf/ssl/citrinet/citrinet_ssl_ci.yaml index 20bf2e0c9fe8..ac3e1bc8dffe 100644 --- a/examples/asr/conf/ssl/citrinet/citrinet_ssl_ci.yaml +++ b/examples/asr/conf/ssl/citrinet/citrinet_ssl_ci.yaml @@ -16,6 +16,8 @@ model: is_tarred: false tarred_audio_filepaths: null use_start_end_token: false + num_workers: 8 + pin_memory: true # bucketing params bucketing_strategy: "synced_randomized" bucketing_batch_size: null diff --git a/examples/asr/conf/wav2vec/wav2vecCTC.yaml b/examples/asr/conf/wav2vec/wav2vecCTC.yaml index 89d97aa2e5e1..11c9576e6f6d 100644 --- a/examples/asr/conf/wav2vec/wav2vecCTC.yaml +++ b/examples/asr/conf/wav2vec/wav2vecCTC.yaml @@ -19,6 +19,8 @@ model: is_tarred: false tarred_audio_filepaths: null use_start_end_token: false + num_workers: 8 + pin_memory: true validation_ds: manifest_filepath: ??? @@ -26,6 +28,8 @@ model: batch_size: ?? shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true test_ds: manifest_filepath: null @@ -33,6 +37,8 @@ model: batch_size: null shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true tokenizer: dir: ??? 
# path to directory which contains either tokenizer.model (bpe) or vocab.txt (for wpe) diff --git a/examples/asr/conf/wav2vec/wav2vecCTC_large.yaml b/examples/asr/conf/wav2vec/wav2vecCTC_large.yaml index 911c466aa137..0ca0914acecc 100644 --- a/examples/asr/conf/wav2vec/wav2vecCTC_large.yaml +++ b/examples/asr/conf/wav2vec/wav2vecCTC_large.yaml @@ -18,6 +18,8 @@ model: is_tarred: false tarred_audio_filepaths: null use_start_end_token: false + num_workers: 8 + pin_memory: true validation_ds: manifest_filepath: ??? @@ -25,6 +27,8 @@ model: batch_size: 4 shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true test_ds: manifest_filepath: null @@ -32,6 +36,8 @@ model: batch_size: null shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true tokenizer: dir: ??? # path to directory which contains either tokenizer.model (bpe) or vocab.txt (for wpe) diff --git a/examples/asr/conf/wav2vec/wav2vec_pretrain.yaml b/examples/asr/conf/wav2vec/wav2vec_pretrain.yaml index 836294fbeef2..0aaad93be4c1 100644 --- a/examples/asr/conf/wav2vec/wav2vec_pretrain.yaml +++ b/examples/asr/conf/wav2vec/wav2vec_pretrain.yaml @@ -24,6 +24,8 @@ model: is_tarred: false tarred_audio_filepaths: null use_start_end_token: false + num_workers: 8 + pin_memory: true validation_ds: manifest_filepath: ??? @@ -31,6 +33,8 @@ model: batch_size: ??? shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true preprocessor: _target_: nemo.collections.asr.modules.wav2vec_modules.ConvFeatureEncoder diff --git a/examples/asr/conf/wav2vec/wav2vec_pretrain_large.yaml b/examples/asr/conf/wav2vec/wav2vec_pretrain_large.yaml index c1d74cf4d29d..b69dade0d98d 100644 --- a/examples/asr/conf/wav2vec/wav2vec_pretrain_large.yaml +++ b/examples/asr/conf/wav2vec/wav2vec_pretrain_large.yaml @@ -23,6 +23,8 @@ model: is_tarred: false tarred_audio_filepaths: null use_start_end_token: false + num_workers: 8 + pin_memory: true validation_ds: manifest_filepath: ??? 
@@ -30,6 +32,8 @@ model: batch_size: ??? shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true preprocessor: _target_: nemo.collections.asr.modules.wav2vec_modules.ConvFeatureEncoder diff --git a/examples/asr/experimental/k2/conf/citrinet/citrinet_mmi_1024.yaml b/examples/asr/experimental/k2/conf/citrinet/citrinet_mmi_1024.yaml index 80200feacad5..60d5c2bfd95d 100644 --- a/examples/asr/experimental/k2/conf/citrinet/citrinet_mmi_1024.yaml +++ b/examples/asr/experimental/k2/conf/citrinet/citrinet_mmi_1024.yaml @@ -25,6 +25,8 @@ model: max_duration: 20.0 shuffle: true use_start_end_token: false + num_workers: 8 + pin_memory: true # tarred datasets is_tarred: false tarred_audio_filepaths: null @@ -39,6 +41,8 @@ model: batch_size: 32 shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true test_ds: manifest_filepath: null @@ -46,6 +50,8 @@ model: batch_size: 32 shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true model_defaults: repeat: 5 diff --git a/examples/asr/experimental/wav2vec/configs/wav2vecCTC.yaml b/examples/asr/experimental/wav2vec/configs/wav2vecCTC.yaml index 4cc7736115b2..d7c365b91bf1 100644 --- a/examples/asr/experimental/wav2vec/configs/wav2vecCTC.yaml +++ b/examples/asr/experimental/wav2vec/configs/wav2vecCTC.yaml @@ -19,6 +19,8 @@ model: is_tarred: false tarred_audio_filepaths: null use_start_end_token: false + num_workers: 8 + pin_memory: true validation_ds: manifest_filepath: ??? @@ -26,6 +28,8 @@ model: batch_size: ?? shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true test_ds: manifest_filepath: null @@ -33,6 +37,8 @@ model: batch_size: null shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true tokenizer: dir: ??? 
# path to directory which contains either tokenizer.model (bpe) or vocab.txt (for wpe) diff --git a/examples/asr/experimental/wav2vec/configs/wav2vecCTC_large.yaml b/examples/asr/experimental/wav2vec/configs/wav2vecCTC_large.yaml index 92d016da4f03..27470cd70fe3 100644 --- a/examples/asr/experimental/wav2vec/configs/wav2vecCTC_large.yaml +++ b/examples/asr/experimental/wav2vec/configs/wav2vecCTC_large.yaml @@ -18,6 +18,8 @@ model: is_tarred: false tarred_audio_filepaths: null use_start_end_token: false + num_workers: 8 + pin_memory: true validation_ds: manifest_filepath: ??? @@ -25,6 +27,8 @@ model: batch_size: 4 shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true test_ds: manifest_filepath: null @@ -32,6 +36,8 @@ model: batch_size: null shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true tokenizer: dir: ??? # path to directory which contains either tokenizer.model (bpe) or vocab.txt (for wpe) diff --git a/examples/asr/experimental/wav2vec/configs/wav2vec_pretrain.yaml b/examples/asr/experimental/wav2vec/configs/wav2vec_pretrain.yaml index e5e761b8e1cf..b792683e34b9 100644 --- a/examples/asr/experimental/wav2vec/configs/wav2vec_pretrain.yaml +++ b/examples/asr/experimental/wav2vec/configs/wav2vec_pretrain.yaml @@ -24,6 +24,8 @@ model: is_tarred: false tarred_audio_filepaths: null use_start_end_token: false + num_workers: 8 + pin_memory: true validation_ds: manifest_filepath: ??? @@ -31,6 +33,8 @@ model: batch_size: ??? 
shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true preprocessor: _target_: nemo.collections.asr.modules.wav2vec_modules.ConvFeatureEncoder diff --git a/examples/asr/experimental/wav2vec/configs/wav2vec_pretrain_large.yaml b/examples/asr/experimental/wav2vec/configs/wav2vec_pretrain_large.yaml index 09b386606bf5..772201d239bb 100644 --- a/examples/asr/experimental/wav2vec/configs/wav2vec_pretrain_large.yaml +++ b/examples/asr/experimental/wav2vec/configs/wav2vec_pretrain_large.yaml @@ -23,6 +23,8 @@ model: is_tarred: false tarred_audio_filepaths: null use_start_end_token: false + num_workers: 8 + pin_memory: true validation_ds: manifest_filepath: ??? @@ -30,6 +32,8 @@ model: batch_size: ??? shuffle: false use_start_end_token: false + num_workers: 8 + pin_memory: true preprocessor: _target_: nemo.collections.asr.modules.wav2vec_modules.ConvFeatureEncoder diff --git a/examples/nlp/intent_slot_classification/data/assistant_utils.py b/examples/nlp/intent_slot_classification/data/assistant_utils.py index 6221c46e660b..8e9b451bfec1 100644 --- a/examples/nlp/intent_slot_classification/data/assistant_utils.py +++ b/examples/nlp/intent_slot_classification/data/assistant_utils.py @@ -51,6 +51,7 @@ def get_intents(infold): intents = [f[:-4] for f in os.listdir(infold)] intents.sort() logging.info(f'Found {len(intents)} intents') + return intents @@ -70,7 +71,7 @@ def get_intent_queries(infold, intent_names, mode): def get_slots(infold, modes): """ - Find a slot of unique slot types in training and testing data. + Find a list of unique slot types in training and testing data. We use a single slot type name both for starting and continuation tokens (not using B-, I- notation). 
""" slots = set() @@ -89,6 +90,7 @@ def get_slots(infold, modes): slots = sorted(slots) slots.append("O") logging.info(f'Found {len(slots)} slot types') + return slots diff --git a/examples/nlp/intent_slot_classification/data/import_datasets.py b/examples/nlp/intent_slot_classification/data/import_datasets.py index d0c72f5e10a5..2468ed7927d2 100644 --- a/examples/nlp/intent_slot_classification/data/import_datasets.py +++ b/examples/nlp/intent_slot_classification/data/import_datasets.py @@ -143,7 +143,6 @@ def process_jarvis_datasets( do_lowercase: whether to lowercase the input utterances ignore_prev_intent: whether to include intent from previous turn in predicting intent of current turn """ - dataset_name = "jarvis" if if_exist(outfold, ['dict.intents.csv', 'dict.slots.csv']): logging.info(DATABASE_EXISTS_TMP.format(dataset_name, outfold)) diff --git a/examples/nlp/language_modeling/conf/megatron_bart_config.yaml b/examples/nlp/language_modeling/conf/megatron_bart_config.yaml index 3e6811ca2602..fb8094842ca3 100644 --- a/examples/nlp/language_modeling/conf/megatron_bart_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_bart_config.yaml @@ -17,6 +17,7 @@ trainer: limit_test_batches: 500 accumulate_grad_batches: 1 gradient_clip_val: 1.0 + benchmark: False exp_manager: explicit_log_dir: null diff --git a/examples/nlp/language_modeling/conf/megatron_bert_config.yaml b/examples/nlp/language_modeling/conf/megatron_bert_config.yaml index 61d98a2e9de4..e93dcbe297e7 100644 --- a/examples/nlp/language_modeling/conf/megatron_bert_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_bert_config.yaml @@ -17,6 +17,7 @@ trainer: limit_test_batches: 500 accumulate_grad_batches: 1 gradient_clip_val: 1.0 + benchmark: False exp_manager: explicit_log_dir: null diff --git a/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml b/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml index 5e1a7fe86789..e1cbc2c7ca6e 100755 --- 
a/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_gpt_config.yaml @@ -17,6 +17,7 @@ trainer: limit_test_batches: 500 accumulate_grad_batches: 1 # do not modify, grad acc is automatic for training megatron models gradient_clip_val: 1.0 + benchmark: False exp_manager: explicit_log_dir: null diff --git a/examples/nlp/language_modeling/conf/megatron_gpt_prompt_learning_config.yaml b/examples/nlp/language_modeling/conf/megatron_gpt_prompt_learning_config.yaml index dcb2ab073c2b..97037e7b18db 100644 --- a/examples/nlp/language_modeling/conf/megatron_gpt_prompt_learning_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_gpt_prompt_learning_config.yaml @@ -15,6 +15,7 @@ trainer: accumulate_grad_batches: 1 gradient_clip_val: 1.0 resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc. + benchmark: False exp_manager: @@ -32,7 +33,7 @@ exp_manager: monitor: val_loss save_top_k: 2 mode: min - save_nemo_on_train_end: True + save_nemo_on_train_end: False filename: 'megatron_gpt_prompt_tune--{val_loss:.3f}-{step}' model_parallel_size: ${model.tensor_model_parallel_size} save_best_model: True @@ -69,8 +70,8 @@ model: - taskname: 'rte' prompt_template: '<|VIRTUAL_PROMPT_0|>{text}{answer}' - total_virtual_tokens: 100 - virtual_token_splits: [100] + total_virtual_tokens: 9 + virtual_token_splits: [9] truncate_field: null answer_only_loss: True answer_field: 'answer' @@ -84,11 +85,11 @@ model: num_layers: 2 data: - train_ds: [data/squad_train.jsonl,] - validation_ds: [data/squad_val.jsonl,] + train_ds: [data/rte_train.jsonl,] + validation_ds: [data/rte_val.jsonl,] add_eos: True shuffle: True - num_workers: 1 + num_workers: 8 pin_memory: True diff --git a/examples/nlp/language_modeling/conf/megatron_ptune_t5.yaml b/examples/nlp/language_modeling/conf/megatron_ptune_t5.yaml index aaa93ba6d352..b02bb4f0ebe7 
100644 --- a/examples/nlp/language_modeling/conf/megatron_ptune_t5.yaml +++ b/examples/nlp/language_modeling/conf/megatron_ptune_t5.yaml @@ -14,6 +14,7 @@ trainer: val_check_interval: 300 accumulate_grad_batches: 2 gradient_clip_val: 1.0 + benchmark: False exp_manager: explicit_log_dir: null diff --git a/examples/nlp/language_modeling/conf/megatron_t5_config.yaml b/examples/nlp/language_modeling/conf/megatron_t5_config.yaml index 647c066408a6..d3f8f402bdb2 100644 --- a/examples/nlp/language_modeling/conf/megatron_t5_config.yaml +++ b/examples/nlp/language_modeling/conf/megatron_t5_config.yaml @@ -17,6 +17,7 @@ trainer: limit_test_batches: 500 accumulate_grad_batches: 1 gradient_clip_val: 1.0 + benchmark: False exp_manager: explicit_log_dir: null diff --git a/examples/nlp/language_modeling/conf/megatron_t5_config_finetune_glue_eval.yaml b/examples/nlp/language_modeling/conf/megatron_t5_config_finetune_glue_eval.yaml index 84c25d436717..11bad4dc639a 100644 --- a/examples/nlp/language_modeling/conf/megatron_t5_config_finetune_glue_eval.yaml +++ b/examples/nlp/language_modeling/conf/megatron_t5_config_finetune_glue_eval.yaml @@ -8,6 +8,7 @@ trainer: logger: False # logger provided by exp_manager enable_checkpointing: False replace_sampler_ddp: False + benchmark: False exp_manager: explicit_log_dir: null diff --git a/examples/nlp/language_modeling/conf/megatron_t5_config_finetune_glue_mnli.yaml b/examples/nlp/language_modeling/conf/megatron_t5_config_finetune_glue_mnli.yaml index 24541f13a01a..bac1bac2ec89 100644 --- a/examples/nlp/language_modeling/conf/megatron_t5_config_finetune_glue_mnli.yaml +++ b/examples/nlp/language_modeling/conf/megatron_t5_config_finetune_glue_mnli.yaml @@ -14,6 +14,7 @@ trainer: val_check_interval: 300 accumulate_grad_batches: 1 gradient_clip_val: 1.0 + benchmark: False exp_manager: explicit_log_dir: null diff --git a/examples/nlp/language_modeling/conf/megatron_t5_config_finetune_glue_xnli.yaml 
b/examples/nlp/language_modeling/conf/megatron_t5_config_finetune_glue_xnli.yaml index bc1c8a4e8c49..10eedd384e79 100644 --- a/examples/nlp/language_modeling/conf/megatron_t5_config_finetune_glue_xnli.yaml +++ b/examples/nlp/language_modeling/conf/megatron_t5_config_finetune_glue_xnli.yaml @@ -14,6 +14,7 @@ trainer: val_check_interval: 300 accumulate_grad_batches: 1 gradient_clip_val: 1.0 + benchmark: False exp_manager: explicit_log_dir: null diff --git a/examples/nlp/language_modeling/conf/transformer_lm_config.yaml b/examples/nlp/language_modeling/conf/transformer_lm_config.yaml index a5d2b0062b44..31040dd7239a 100644 --- a/examples/nlp/language_modeling/conf/transformer_lm_config.yaml +++ b/examples/nlp/language_modeling/conf/transformer_lm_config.yaml @@ -95,6 +95,7 @@ trainer: logger: False log_every_n_steps: 50 # Interval of logging. check_val_every_n_epoch: 1 + benchmark: False exp_manager: name: TransformerLM diff --git a/examples/nlp/machine_translation/conf/aayn_base.yaml b/examples/nlp/machine_translation/conf/aayn_base.yaml index 051a6aaa80d6..39f500da70bd 100644 --- a/examples/nlp/machine_translation/conf/aayn_base.yaml +++ b/examples/nlp/machine_translation/conf/aayn_base.yaml @@ -152,6 +152,7 @@ trainer: logger: False log_every_n_steps: 50 # Interval of logging. 
check_val_every_n_epoch: 1 + benchmark: False exp_manager: name: AAYNBase diff --git a/examples/nlp/machine_translation/conf/aayn_base_megatron.yaml b/examples/nlp/machine_translation/conf/aayn_base_megatron.yaml index 0b847f8af023..d85946287e2d 100644 --- a/examples/nlp/machine_translation/conf/aayn_base_megatron.yaml +++ b/examples/nlp/machine_translation/conf/aayn_base_megatron.yaml @@ -15,6 +15,7 @@ trainer: val_check_interval: 1000 accumulate_grad_batches: 1 gradient_clip_val: 1.0 + benchmark: False exp_manager: explicit_log_dir: null diff --git a/examples/nlp/machine_translation/conf/huggingface.yaml b/examples/nlp/machine_translation/conf/huggingface.yaml index 8d298dddd83d..f1874a93d055 100644 --- a/examples/nlp/machine_translation/conf/huggingface.yaml +++ b/examples/nlp/machine_translation/conf/huggingface.yaml @@ -125,6 +125,7 @@ trainer: logger: False log_every_n_steps: 50 # Interval of logging. check_val_every_n_epoch: 1 + benchmark: False exp_manager: name: HuggingFaceEncoder diff --git a/examples/tts/conf/tacotron2_44100.yaml b/examples/tts/conf/tacotron2_44100.yaml index 9b44c7611311..3965bfd09f10 100644 --- a/examples/tts/conf/tacotron2_44100.yaml +++ b/examples/tts/conf/tacotron2_44100.yaml @@ -169,8 +169,8 @@ trainer: enable_checkpointing: False # Provided by exp_manager logger: False # Provided by exp_manager gradient_clip_val: 1.0 - log_every_n_steps: 200 - check_val_every_n_epoch: 25 + log_every_n_steps: 60 + check_val_every_n_epoch: 2 benchmark: false exp_manager: diff --git a/nemo/collections/asr/models/rnnt_bpe_models.py b/nemo/collections/asr/models/rnnt_bpe_models.py index 93bc6973dedd..dff9ed6a67d2 100644 --- a/nemo/collections/asr/models/rnnt_bpe_models.py +++ b/nemo/collections/asr/models/rnnt_bpe_models.py @@ -60,7 +60,7 @@ def list_available_models(cls) -> List[PretrainedModelInfo]: model = PretrainedModelInfo( pretrained_model_name="stt_en_contextnet_1024", description="For details about this model, please visit 
https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_contextnet_1024", - location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_contextnet_1024/versions/1.6.0/files/stt_en_contextnet_1024.nemo", + location="https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_contextnet_1024/versions/1.9.0/files/stt_en_contextnet_1024.nemo", ) results.append(model) diff --git a/nemo/collections/nlp/data/token_classification/punctuation_capitalization_dataset.py b/nemo/collections/nlp/data/token_classification/punctuation_capitalization_dataset.py index 9cc506cba2e5..4b9ff6d5b27e 100644 --- a/nemo/collections/nlp/data/token_classification/punctuation_capitalization_dataset.py +++ b/nemo/collections/nlp/data/token_classification/punctuation_capitalization_dataset.py @@ -1109,7 +1109,7 @@ def _check_label_ids_loaded_from_pkl( ) -> None: if not isinstance(pkl_punct_label_ids, dict): raise ValueError( - f"Punctuation label ids loaded from features file {self.features_pkl} has wrong type " + f"Punctuation label ids loaded from features file {self.features_pkl} have wrong type " f"{type(pkl_punct_label_ids)}" ) if parameter_punct_label_ids is not None: diff --git a/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py b/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py index 9cad46b9ac08..7c27aeb1bb20 100644 --- a/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py +++ b/nemo/collections/nlp/models/machine_translation/mt_enc_dec_model.py @@ -407,6 +407,9 @@ def validation_step(self, batch, batch_idx, dataloader_idx=0): def eval_epoch_end(self, outputs, mode, global_rank): # if user specifies one validation dataloader, then PTL reverts to giving a list of dictionary instead of a list of list of dictionary + if not outputs: + return + if isinstance(outputs[0], dict): outputs = [outputs] diff --git a/nemo/collections/nlp/models/token_classification/punctuation_capitalization_model.py 
b/nemo/collections/nlp/models/token_classification/punctuation_capitalization_model.py index 5a5f6c025eea..5f6fa7f6164f 100644 --- a/nemo/collections/nlp/models/token_classification/punctuation_capitalization_model.py +++ b/nemo/collections/nlp/models/token_classification/punctuation_capitalization_model.py @@ -13,6 +13,7 @@ # limitations under the License. import copy +import warnings from math import ceil from pathlib import Path from typing import Any, Dict, List, Optional, Tuple, Union @@ -770,6 +771,39 @@ def _setup_dataloader_from_config(self, cfg: DictConfig, train: bool) -> torch.u 'punct_label_vocab_file': punct_label_vocab_file, 'capit_label_vocab_file': capit_label_vocab_file, } + if train: + number_of_batches_is_multiple_of = 1 + if self._trainer is None: + warnings.warn( + 'The model attribute `trainer` is not set before the training dataset is set up. If training is ' + 'resumed from a checkpoint, then current epoch data loading can be distorted: some batches ' + 'may be processed several times and some may not be processed at all. `trainer.current_epoch`' + ' is used as the random seed for shuffling batches. Now 0 will be used. If the ' + 'checkpoint was not created during the initial epoch, the shuffling of the dataset will ' + 'be different. You may try to use the `exp_manager()` function and the ' + '`PunctuationCapitalizationModel.set_trainer()` method before the ' + '`PunctuationCapitalizationModel.setup_training_data()` method.' + ) + batch_shuffling_random_seed = 0 + else: + batch_shuffling_random_seed = self._trainer.current_epoch + else: + batch_shuffling_random_seed = 0 + if self._trainer is None: + warnings.warn( + 'The model attribute `trainer` is not set before the test or validation dataset is set up. If more ' + 'than 1 GPU is used for testing, then some examples may be tested several times because ' + 'the number of batches may not be evenly divisible by the number of processes. This leads to ' + 'distortion of metrics.
See more in description of `number_of_batches_is_multiple_of` ' + 'parameter of class `BertPunctuationCapitalizationDataset` initializer and ' + 'https://pytorch.org/docs/stable/data.html#multi-process-data-loading. You may try to use ' + '`PunctuationCapitalizationModel.set_trainer()` method before ' + '`PunctuationCapitalizationModel.setup_validation_data()` and ' + '`PunctuationCapitalizationModel.setup_test_data()` methods.' + ) + number_of_batches_is_multiple_of = 1 + else: + number_of_batches_is_multiple_of = self._trainer.num_nodes * self._trainer.num_devices dataset = BertPunctuationCapitalizationDataset( tokenizer=self.tokenizer, text_file=text_file, @@ -783,8 +817,8 @@ def _setup_dataloader_from_config(self, cfg: DictConfig, train: bool) -> torch.u num_samples=cfg.num_samples, tokens_in_batch=cfg.tokens_in_batch, n_jobs=cfg.n_jobs, - number_of_batches_is_multiple_of=1 if train else self.trainer.num_nodes * self.trainer.num_devices, - batch_shuffling_random_seed=self.trainer.global_step if train else 42, + number_of_batches_is_multiple_of=number_of_batches_is_multiple_of, + batch_shuffling_random_seed=batch_shuffling_random_seed, verbose=cfg.verbose, get_label_frequencies=cfg.get_label_frequences, cache_dir=cfg.cache_dir, diff --git a/nemo/collections/nlp/modules/common/text_generation_server.py b/nemo/collections/nlp/modules/common/text_generation_server.py index 40c9dc385e5e..3939f82f3e0d 100644 --- a/nemo/collections/nlp/modules/common/text_generation_server.py +++ b/nemo/collections/nlp/modules/common/text_generation_server.py @@ -158,6 +158,9 @@ def put(self): repetition_penalty, min_tokens_to_generate, ) + for k in output: + if isinstance(output[k], torch.Tensor): + output[k] = output[k].tolist() if not all_probs: del output['full_logprob'] return jsonify(output) diff --git a/scripts/speech_recognition/k2/setup.sh b/scripts/speech_recognition/k2/setup.sh index e3d1c475d23e..e110e99c3088 100755 --- a/scripts/speech_recognition/k2/setup.sh +++ 
b/scripts/speech_recognition/k2/setup.sh @@ -20,5 +20,5 @@ LATEST_RELEASE=$(git -c 'versionsort.suffix=-' \ | tail --lines=1 \ | cut --delimiter='/' --fields=3) -K2_MAKE_ARGS="-j" pip install git+${K2_REPO}@${LATEST_RELEASE}#egg=k2 || (echo "k2 could not be installed!"; exit 1) +K2_MAKE_ARGS="-j" pip install -v git+${K2_REPO}@${LATEST_RELEASE}#egg=k2 || (echo "k2 could not be installed!"; exit 1) python3 -m k2.version > /dev/null || (echo "k2 installed with errors! Please check installation manually."; exit 1) && echo "k2 installed successfully!" diff --git a/tests/nemo_text_processing/en/data_text_normalization/test_cases_address.txt b/tests/nemo_text_processing/en/data_text_normalization/test_cases_address.txt index 46167ef468af..bd9394163d4d 100644 --- a/tests/nemo_text_processing/en/data_text_normalization/test_cases_address.txt +++ b/tests/nemo_text_processing/en/data_text_normalization/test_cases_address.txt @@ -6,4 +6,4 @@ 708 N 1st St, San City~seven zero eight North first Street, San City 12 S 1st st~twelve South first Street 1990 for the Ata ST~nineteen ninety for the Ata ST -Main St.~Main St. \ No newline at end of file +Main St.~Main St. diff --git a/tests/nemo_text_processing/en/data_text_normalization/test_cases_word.txt b/tests/nemo_text_processing/en/data_text_normalization/test_cases_word.txt index 5209688b5c2f..3fb24ebcc202 100644 --- a/tests/nemo_text_processing/en/data_text_normalization/test_cases_word.txt +++ b/tests/nemo_text_processing/en/data_text_normalization/test_cases_word.txt @@ -32,6 +32,7 @@ $ and 5% or %~dollar and five percent or percent sign 1~one 1~one !1~! 
one -love him while we may,~love him while we may, mar~mar /$€₩£BB¥#%AA and $€₩£¥#%~slash dollar euro won pound BB yen hash percent sign AA and dollar euro won pound yen hash percent sign +love him while we may,~love him while we may, +mar~mar diff --git a/tutorials/asr/Offline_ASR_with_VAD_for_CTC_models.ipynb b/tutorials/asr/Offline_ASR_with_VAD_for_CTC_models.ipynb index 55f84c94cbce..1f92bdca205d 100644 --- a/tutorials/asr/Offline_ASR_with_VAD_for_CTC_models.ipynb +++ b/tutorials/asr/Offline_ASR_with_VAD_for_CTC_models.ipynb @@ -386,4 +386,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} \ No newline at end of file +} diff --git a/tutorials/nlp/02_NLP_Tokenizers.ipynb b/tutorials/nlp/02_NLP_Tokenizers.ipynb index d0c8017320e6..c63d2a8b1689 100644 --- a/tutorials/nlp/02_NLP_Tokenizers.ipynb +++ b/tutorials/nlp/02_NLP_Tokenizers.ipynb @@ -585,4 +585,4 @@ }, "nbformat": 4, "nbformat_minor": 1 -} \ No newline at end of file +} diff --git a/tutorials/nlp/Data_Preprocessing_and_Cleaning_for_NMT.ipynb b/tutorials/nlp/Data_Preprocessing_and_Cleaning_for_NMT.ipynb index c91a0adc0640..323bfa1c49b8 100644 --- a/tutorials/nlp/Data_Preprocessing_and_Cleaning_for_NMT.ipynb +++ b/tutorials/nlp/Data_Preprocessing_and_Cleaning_for_NMT.ipynb @@ -1,802 +1,828 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "68e3feb7", - "metadata": {}, - "outputs": [], - "source": [ - "\"\"\"\n", - "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", - "\n", - "Instructions for setting up Colab are as follows:\n", - "1. Open a new Python 3 notebook.\n", - "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", - "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", - "4. Run this cell to set up dependencies.\n", - "5. 
Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", - "\"\"\"\n", - "## Install dependencies\n", - "!pip install wget\n", - "!apt-get install libboost-all-dev\n", - "!apt-get install gawk\n", - "\n", - "## Install NeMo\n", - "BRANCH = 'main'\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", - "\n", - "!pip uninstall -y sacrebleu\n", - "!pip install sacrebleu[ja]\n", - "!pip install xxhash\n", - "\n", - "## Install kenlm with 7-gram support\n", - "!mkdir -p data\n", - "!rm -rf data/kenlm\n", - "!git clone https://github.com/kpu/kenlm data/kenlm\n", - "!cd data/kenlm \\\n", - " && pip install . --install-option=\"--max_order 7\" \\\n", - " && mkdir -p build \\\n", - " && cd build \\\n", - " && cmake .. -DKENLM_MAX_ORDER=7 -DCMAKE_INSTALL_PREFIX:PATH=../../kenlm_install \\\n", - " && make -j all install && cd ../../kenlm_install \\\n", - " && export PATH=$PATH:$PWD\n", - "\n", - "# Install bicleaner\n", - "\n", - "!pip install bicleaner" - ] - }, - { - "cell_type": "markdown", - "id": "0075e98c", - "metadata": {}, - "source": [ - "# Data Preprocessing & Cleaning for NMT\n", - "\n", - "This notebook contains a tutorial of data processing and cleaning for NMT (Neural Machine Translation) to train translation models with the [NeMo framework](https://github.com/NVIDIA/NeMo).\n", - "\n", - "A pre-requisite to train supervised neural machine translation systems is the availability of *parallel corpora* of reasonable quality.\n", - "\n", - "A parallel corpus is a collection of sentences or documents that are translations of each other in 2 or more languages.\n", - "\n", - "For example,\n", - "\n", - "| English | Russian |\n", - "| :-: | :-: |\n", - "| To date, a total of 43 participants from 15 countries have completed the training. | К настоящему времени подготовку прошли в общей сложности 43 участника из 15 стран . 
|\n", - "| M-Sport Bentley writes a new piece of Bentley history at Silverstone | M-Sport Bentley открывает новую страницу в истории Bentley в Сильверстоуне |\n", - "| Information in the application was not true. | Информация в заявлении не была достоверна. |\n", - "\n", - "This notebook will cover the following data pre-processing and data cleaning techniques for such corpora.\n", - "\n", - "## The importance of data cleaning\n", - "\n", - "The presence of noise in the training dataset can adversely affect model quality (https://arxiv.org/abs/1805.12282). Webcrawled and automatically aligned data sources in particular, such as [Paracrawl](https://paracrawl.eu/), [WikiMatrix](https://arxiv.org/abs/1907.05791), [CC-Aligned](https://arxiv.org/abs/1911.06154) and [CC-Matrix](https://arxiv.org/abs/1911.04944) can be extremely noisy.\n", - "\n", - "## Cleaning\n", - "1. Downloading and filtering publicly available datasets based on confidence thresholds (if available). For example, [WikiMatrix](https://arxiv.org/abs/1907.05791) filtering based on [LASER](https://arxiv.org/abs/1812.10464) confidence scores.\n", - "2. Language ID filtering using a pre-trained [fastText classifier](https://fasttext.cc/docs/en/language-identification.html). This step will remove all sentences from the parallel corpus that our classifier predicts as not being in the appropriate language (ex: sentences in the English column that aren't in English or sentences in Russian column that aren't in Russian).\n", - "3. Length and Length-ratio filtering. This steps removes all sentences that are 1) too long 2) too short or 3) have a ratio between their lengths greater than a certain factor (this typically removes partial translations).\n", - "4. [Bicleaner](https://github.com/bitextor/bicleaner) classifier-based cleaning. 
Bicleaner identifies noisy parallel sentences using a classifier that leverages multiple features such as n-gram language model likelihood scores, word alignment scores and other heuristics.\n", - "\n", - "## Pre-processing\n", - "5. [Moses Punctuation Normalization](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl). This step standardizes punctuation. For example the less common way to write apostrophes Tiffany`s will be standardized to Tiffany's.\n", - "6. Unicode standardization. There exist some unicode characters that aren't punctuation that need to be standardized for example, this step normalizes the number 4 to 4.\n", - "7. [Moses Tokenization](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) or text segmentation for Chinese/Japanese with [Jieba](https://github.com/fxsjy/jieba) and [mecab](https://github.com/taku910/mecab). For languages like Chinese and Japanese that do not have explicit word segmentation markers (like spaces), we use these tools to introduce spaces into the text that will let us split the string into words. For other languages, we use Moses to separate punctuation markers from words so that they become separate tokens.\n", - "8. Deduplication - This step removes duplicate translation pairs from the corpus.\n", - "9. Shuffling - This step shuffles the order of occurrence of translation pairs.\n", - "\n", - "## Tarred Datasets for Large Corpora\n", - "10. Large datasets with over 50M sentence pairs when batched and pickled can be up to 60GB in size. Loading them entirely into CPU memory when using say 8 or 16 workers with DistributedDataParallel training uses 480-960GB of RAM which is often impractical and inefficient. 
Instead, we use [Webdataset](https://github.com/webdataset/webdataset) to allow training while keeping datasets on disk and let webddataset handle pre-loading and fetching of data into CPU RAM.\n", - "\n", - "\n", - "## Disclaimer\n", - "\n", - "The data cleaning techniques used in this notebook are only meant to be loose guidelines and are not guaranteed to produced clean parallel corpora at the end of it. Not all of these steps are a necessity for every dataset, " - ] - }, - { - "cell_type": "markdown", - "id": "bb0eb698", - "metadata": {}, - "source": [ - "![NMT Data Pipeline](images/nmt_data_pipeline.png)" - ] - }, - { - "cell_type": "markdown", - "id": "4a9fd8d3", - "metadata": {}, - "source": [ - "# Downloading Publicly Available Data\n", - "\n", - "## WikiMatrix (https://arxiv.org/abs/1907.05791)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "78984523", - "metadata": {}, - "outputs": [], - "source": [ - "!mkdir -p data\n", - "print('Downloading data ...')\n", - "!wget https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-ru.tsv.gz -O data/WikiMatrix.en-ru.tsv.gz\n", - "print('---------------------')\n", - "print('Unzipping file ...')\n", - "!gunzip -k -f data/WikiMatrix.en-ru.tsv.gz\n", - "print('---------------------')\n", - "print('Peek into the file')\n", - "!head -10 data/WikiMatrix.en-ru.tsv\n", - "print('---------------------')\n", - "print('File length ...')\n", - "!wc -l data/WikiMatrix.en-ru.tsv\n", - "print('---------------------')" - ] - }, - { - "cell_type": "markdown", - "id": "b9a62f9e", - "metadata": {}, - "source": [ - "## Filter Based on LASER Confidence\n", - "\n", - "LASER (https://arxiv.org/abs/1812.10464) is a multi-lingual neural sentence embedding model that is often used for cross-lingual sentence/document retrieval. Similarities in the embedding space are often used as proxies for cross-lingual similarities." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "21608388", - "metadata": {}, - "outputs": [], - "source": [ - "from tqdm import tqdm\n", - "import numpy as np\n", - "\n", - "def num_lines_in_file(fname):\n", - " \"\"\"\n", - " Returns the number of lines in a file.\n", - " \"\"\"\n", - " with open(fname, 'r') as f:\n", - " for i, _ in enumerate(f):\n", - " pass\n", - " return i + 1\n", - "\n", - "def filter_tsv_with_conf(\n", - " input_file, output_file_lang_1, output_file_lang_2,\n", - " confidence_threshold=None, confidence_column=None\n", - "):\n", - " \"\"\"\n", - " Filters a tsv file that has confidence scores associated with each parallel example.\n", - "\n", - " For example:\n", - "\n", - " 1.23 \\t This is a sentence in lang1 \\t This is a sentence in lang2\n", - " \"\"\"\n", - " print()\n", - " print('====================================')\n", - " print('======= TSV Conf Filtering =========')\n", - " print('====================================')\n", - " print()\n", - " num_lines = num_lines_in_file(input_file)\n", - " scores = []\n", - " num_output_lines = 0\n", - " lang_1_col = 0\n", - " lang_2_col = 1\n", - " with open(input_file, 'r') as f, \\\n", - " open(output_file_lang_1, 'w') as f_out_1, \\\n", - " open(output_file_lang_2, 'w') as f_out_2:\n", - " for line in tqdm(f, total=num_lines, desc=f\"Filtering file by confidence {confidence_threshold}\"):\n", - " if line.strip() == '':\n", - " continue\n", - " line = line.strip().split('\\t')\n", - " if len(line) < 2:\n", - " continue\n", - " if confidence_threshold is not None and float(line[confidence_column]) < confidence_threshold:\n", - " continue\n", - " else:\n", - " if confidence_threshold is not None:\n", - " scores.append(float(line[confidence_column]))\n", - " if confidence_column == 0:\n", - " lang_1_col, lang_2_col = 1, 2\n", - " elif confidence_column == 2:\n", - " lang_1_col, lang_2_col = 0, 1\n", - " elif confidence_column == 1:\n", - " lang_1_col, lang_2_col = 0, 
2\n", - " else:\n", - " raise ValueError(f\"Invalid Column for confidence {confidence_column}\")\n", - " f_out_1.write(line[lang_1_col] + '\\n')\n", - " f_out_2.write(line[lang_2_col] + '\\n')\n", - " num_output_lines += 1\n", - "\n", - " if confidence_threshold is not None:\n", - " print(f'Confidence score average : {np.mean(scores)}')\n", - " print(f'Confidence score variance : {np.var(scores)}')\n", - " print(f'Kept {num_output_lines} out of {num_lines} after conversion ({(num_output_lines / num_lines) * 100}%)')\n", - " print('====================================')\n", - "\n", - "filter_tsv_with_conf(\n", - " 'data/WikiMatrix.en-ru.tsv',\n", - " 'data/WikiMatrix.en-ru.en', \n", - " 'data/WikiMatrix.en-ru.ru',\n", - " confidence_threshold=1.04, confidence_column=0\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "18a171d1", - "metadata": {}, - "source": [ - "## Language ID filtering with fastText\n", - "\n", - "Noisy parallel corpora often contain sentences that are not in the intended language. A classifier that determines the language in which a sentence is written can be used to filter out sentences that aren't in the appropriate language." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d58b7148", - "metadata": {}, - "outputs": [], - "source": [ - "!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -O data/lid.176.bin\n", - "print()\n", - "print('====================================')\n", - "print('====== Language ID Filtering =======')\n", - "print('====================================')\n", - "print()\n", - "\n", - "\n", - "!wget https://raw.github.com/NVIDIA/NeMo/main/scripts/neural_machine_translation/filter_langs_nmt.py \\\n", - " -O filter_langs_nmt.py\n", - "\n", - "!python filter_langs_nmt.py \\\n", - " --input-src data/WikiMatrix.en-ru.en \\\n", - " --input-tgt data/WikiMatrix.en-ru.ru \\\n", - " --output-src data/WikiMatrix.en-ru.langidfilter.en \\\n", - " --output-tgt data/WikiMatrix.en-ru.langidfilter.ru \\\n", - " --source-lang en \\\n", - " --target-lang ru \\\n", - " --removed-src data/WikiMatrix.en-ru.langidfilter.removed.en \\\n", - " --removed-tgt data/WikiMatrix.en-ru.langidfilter.removed.ru \\\n", - " --fasttext-model data/lid.176.bin\n", - "\n", - "print()\n", - "print('-----------------------------------------')\n", - "print('Number of removed sentences:')\n", - "print('-----------------------------------------')\n", - "print()\n", - "!wc -l data/WikiMatrix.en-ru.langidfilter.removed.ru\n", - "\n", - "print()\n", - "print('-----------------------------------------')\n", - "print('Examples of removed sentences')\n", - "print('-----------------------------------------')\n", - "print()\n", - "\n", - "!paste -d \"\\t\" \\\n", - " data/WikiMatrix.en-ru.langidfilter.removed.en \\\n", - " data/WikiMatrix.en-ru.langidfilter.removed.ru \\\n", - " | head -10\n", - "print('-----------------------------------------')" - ] - }, - { - "cell_type": "markdown", - "id": "ffb42e92", - "metadata": {}, - "source": [ - "## Length and Ratio Filtering\n", - "\n", - "This step filters out sentences based on their lengths and the ratio between source and 
target lengths. If (a) src_len / tgt_len or tgt_len / src_len exceed 1.3 or (b) source or target sequence lengths are less than 1 or greater than 250, the sentence pair will be removed." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "52ff172a", - "metadata": {}, - "outputs": [], - "source": [ - "!git clone https://github.com/moses-smt/mosesdecoder data/mosesdecoder\n", - "!cd data/mosesdecoder && git checkout RELEASE-4.0 && cd ../..\n", - "!perl data/mosesdecoder/scripts/training/clean-corpus-n.perl -ratio 1.3 \\\n", - " data/WikiMatrix.en-ru.langidfilter \\\n", - " en ru \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio \\\n", - " 1 250" - ] - }, - { - "cell_type": "markdown", - "id": "01f2b589", - "metadata": {}, - "source": [ - "## Bicleaner Filtering\n", - "\n", - "Bicleaner (https://aclanthology.org/W18-6488/ and https://aclanthology.org/2020.eamt-1.31/) is a tool to identify noisy parallel sentences in translation corpora. It applies 3 different filtering steps:\n", - "\n", - "1. Pre-filtering based on 37 rules.\n", - "2. Language model fluency scores based on n-gram language models trained with kenlm.\n", - "3. Random forest classifier that uses all examples filtered out in steps 1 & 2 as \"negative\" examples." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9be8d4ca", - "metadata": {}, - "outputs": [], - "source": [ - "print('Downloading En-Ru Bicleaner models.')\n", - "!git clone https://github.com/bitextor/bicleaner data/bicleaner\n", - "!cd data/bicleaner && git checkout bicleaner-0.15 && cd ../..\n", - "!data/bicleaner/utils/download-pack.sh en ru\n", - "\n", - "print('Generating Bicleaner scores ...')\n", - "!gawk '{{print \"-\\t-\"}}' \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.en | \\\n", - " paste -d \"\\t\" - data/WikiMatrix.en-ru.langidfilter.lengthratio.en \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.ru | \\\n", - " bicleaner-classify - - en-ru/en-ru.yaml \\\n", - " > data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.scores" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "43059b8a", - "metadata": {}, - "outputs": [], - "source": [ - "print('Score file ...')\n", - "!head -10 data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.scores\n", - "\n", - "print()\n", - "print('-----------------------------------------')\n", - "print('Filtering based on Bicleaner scores > 0.6 ...')\n", - "print('-----------------------------------------')\n", - "print()\n", - "\n", - "print('Filtering out English ...')\n", - "!gawk -F \"\\t\" '{if ($5>0.6) {print $3}}' \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.scores > \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.en\n", - "\n", - "print('Filtering out Russian ...')\n", - "!gawk -F \"\\t\" '{if ($5>0.6) {print $4}}' \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.scores > \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.ru\n", - "\n", - "!paste -d \"\\t\" \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.en \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.ru \\\n", - " | head -10" - ] - }, - { - "cell_type": 
"markdown", - "id": "0726510c", - "metadata": {}, - "source": [ - "## Normalize Punctuation\n", - "\n", - "Punctuation can vary across languages and even between ascii and unicode variants of the same punctuation marker. For example, across languages. For example, in German, quotes are often written as „ and “ while in English we typically just use \". This step normalizes such punctuation differences to use the same character everywhere.\n", - "\n", - "We use [moses](https://github.com/moses-smt/mosesdecoder) or [sacremoses](https://github.com/alvations/sacremoses) to normalize punctuation. The moses implementation is in perl while sacremoses is in python with a CLI interface. The perl implementation is buffered and works better for large corpora that may not fit into CPU memory all at once while sacremoses is unbuffered and multi-processed." - ] - }, - { - "cell_type": "markdown", - "id": "e73670d6", - "metadata": {}, - "source": [ - "### Sacremoses" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "597e041a", - "metadata": {}, - "outputs": [], - "source": [ - "print('Normalizing English ...')\n", - "!sacremoses -j 4 normalize \\\n", - " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.en > \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.en\n", - "\n", - "print('Normalizing Russian ...')\n", - "!sacremoses -j 4 normalize \\\n", - " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.ru > \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.ru\n" - ] - }, - { - "cell_type": "markdown", - "id": "240b0a1f", - "metadata": {}, - "source": [ - "## Moses\n", - "\n", - "Punctuation can vary across languages and even between ascii and unicode variants of the same punctuation marker. For example, across languages. For example, in German, quotes are often written as „ and “ while in English we typically just use \". 
This step normalizes such punctuation differences to use the same character everywhere.\n", - "\n", - "We use [moses](https://github.com/moses-smt/mosesdecoder) or [sacremoses](https://github.com/alvations/sacremoses) to normalize punctuation. The moses implementation is in perl while sacremoses is in python with a CLI interface. The perl implementation is buffered and works better for large corpora that may not fit into CPU memory all at once while sacremoses is unbuffered and multi-processed." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1f5adaa4", - "metadata": {}, - "outputs": [], - "source": [ - "print('Normalizing English ...')\n", - "!perl data/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l en \\\n", - " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.en > \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.en\n", - "\n", - "print('Normalizing Russian ...')\n", - "!perl data/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l ru \\\n", - " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.ru > \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.ru\n" - ] - }, - { - "cell_type": "markdown", - "id": "b8bfad64", - "metadata": {}, - "source": [ - "## Tokenize\n", - "\n", - "Tokenization splits a string into a sequence of tokens. A naive way of doing this would be to simply split the string on spaces (for languages where this is possible). This however, will result in punctuation being \"attached\" to the neighboring word when tokenizing. For example, \n", - "\n", - "\"This is a sentence.\" will be tokenized as [\"This, is, a, sentence.\"].\n", - "\n", - "However, we'd typically like punctuation to be separate tokens for example,\n", - "\n", - "\"This is a sentence.\" will be tokenized my moses or sacremoses as [\", This, is, a, sentence, ., \"]." 
- ] - }, - { - "cell_type": "markdown", - "id": "06c60b90", - "metadata": {}, - "source": [ - "### Sacremoses" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7bb4c631", - "metadata": {}, - "outputs": [], - "source": [ - "print('Tokenizing English ...')\n", - "!sacremoses -j 4 -l en tokenize -x \\\n", - " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.en > \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.en\n", - "\n", - "print('Tokenizing Russian ...')\n", - "!sacremoses -j 4 -l ru tokenize -x \\\n", - " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.ru > \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.ru\n" - ] - }, - { - "cell_type": "markdown", - "id": "444bebd7", - "metadata": {}, - "source": [ - "### Moses" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "21333e27", - "metadata": {}, - "outputs": [], - "source": [ - "print('Tokenizing English ...')\n", - "!perl data/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en -no-escape -threads 4 \\\n", - " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.en > \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.en\n", - "\n", - "print('Tokenizing Russian ...')\n", - "!perl data/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ru -no-escape -threads 4 \\\n", - " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.ru > \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.ru\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b28df2bb", - "metadata": {}, - "outputs": [], - "source": [ - "print()\n", - "print('-----------------------------------------')\n", - "print('Tokenized Russian Sentences ...')\n", - "print('-----------------------------------------')\n", - "print()\n", - "\n", - "!head -10 
data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.ru\n", - "\n", - "print()\n", - "print('-----------------------------------------')\n", - "print('Tokenized English Sentences ...')\n", - "print('-----------------------------------------')\n", - "print()\n", - "\n", - "!head -10 data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.en" - ] - }, - { - "cell_type": "markdown", - "id": "dee5409d", - "metadata": {}, - "source": [ - "## Segmenting Chinese and Japanese\n", - "\n", - "### Jieba segmentation for Chinese" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "41b4cc91", - "metadata": {}, - "outputs": [], - "source": [ - "import jieba\n", - "\n", - "!wget https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-zh.tsv.gz -O data/WikiMatrix.en-zh.tsv.gz\n", - "!gunzip -k -f data/WikiMatrix.en-zh.tsv.gz\n", - "\n", - "print()\n", - "print('-----------------------------------------')\n", - "print('Chinese text before segmentation ...')\n", - "print('-----------------------------------------')\n", - "print()\n", - "\n", - "!awk -F \"\\t\" '{print $3}' data/WikiMatrix.en-zh.tsv | head -10\n", - "print()\n", - "print('-----------------------------------------')\n", - "print('Segmenting Chinese text ...')\n", - "print('-----------------------------------------')\n", - "print()\n", - "\n", - "zh_lines = []\n", - "with open('data/WikiMatrix.en-zh.tsv', 'r') as f:\n", - " for idx, line in enumerate(f):\n", - " line = line.strip().split('\\t')[2]\n", - " zh_lines.append(' '.join(jieba.cut(line)))\n", - " if idx == 100:\n", - " break\n", - "print()\n", - "print('-----------------------------------------')\n", - "print('Chinese text after segmentation ...')\n", - "print('\\n'.join(zh_lines[:10]))\n", - "print('-----------------------------------------')\n", - "print()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "489bd915", - "metadata": {}, - "outputs": [], - "source": [ - 
"import MeCab\n", - "import ipadic\n", - "\n", - "!wget https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-ja.tsv.gz -O data/WikiMatrix.en-ja.tsv.gz\n", - "!gunzip -k -f data/WikiMatrix.en-ja.tsv.gz\n", - "\n", - "print()\n", - "print('-----------------------------------------')\n", - "print('Japanese text before segmentation ...')\n", - "print('-----------------------------------------')\n", - "print()\n", - "\n", - "!awk -F \"\\t\" '{print $3}' data/WikiMatrix.en-ja.tsv | head -10\n", - "\n", - "print()\n", - "print('-----------------------------------------')\n", - "print('Segmenting Japanese text ...')\n", - "print('-----------------------------------------')\n", - "print()\n", - "\n", - "mecab_tokenizer = MeCab.Tagger(ipadic.MECAB_ARGS + \" -Owakati\")\n", - "\n", - "ja_lines = []\n", - "with open('data/WikiMatrix.en-ja.tsv', 'r') as f:\n", - " for idx, line in enumerate(f):\n", - " line = line.strip().split('\\t')[2]\n", - " ja_lines.append(mecab_tokenizer.parse(line))\n", - " if idx == 100:\n", - " break\n", - "print()\n", - "print('-----------------------------------------')\n", - "print('Japanese text after segmentation ...')\n", - "print('\\n'.join(ja_lines[:10]))\n", - "print('-----------------------------------------')\n", - "print()" - ] - }, - { - "cell_type": "markdown", - "id": "4a079efe", - "metadata": {}, - "source": [ - "## Deduplicate" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "55d98bf3", - "metadata": {}, - "outputs": [], - "source": [ - "import xxhash\n", - "\n", - "def dedup_file(input_file_lang_1, input_file_lang_2, output_file_lang_1, output_file_lang_2):\n", - " print()\n", - " print('====================================')\n", - " print('========== De-duplicate ============')\n", - " print('====================================')\n", - " print()\n", - " num_lines = num_lines_in_file(input_file_lang_1)\n", - " hashes = set()\n", - " num_output_lines = 0\n", - " with open(input_file_lang_1, 'r') as 
f_lang1, \\\n", - " open(input_file_lang_2, 'r') as f_lang2, \\\n", - " open(output_file_lang_1, 'w') as f_out_lang1, \\\n", - " open(output_file_lang_2, 'w') as f_out_lang2:\n", - " for line_1, line_2 in tqdm(zip(f_lang1, f_lang2), total=num_lines, desc=f\"Deduplicating files\"):\n", - " parallel_hash = xxhash.xxh64((line_1.strip() + '\\t' + line_2.strip()).encode('utf-8')).hexdigest()\n", - " if parallel_hash not in hashes:\n", - " hashes.add(parallel_hash)\n", - " f_out_lang1.write(line_1.strip() + '\\n')\n", - " f_out_lang2.write(line_2.strip() + '\\n')\n", - " num_output_lines += 1\n", - "\n", - " print(f\"Kept {num_output_lines} out of {num_lines} after deduplication\")\n", - "\n", - "dedup_file(\n", - " 'data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.en',\n", - " 'data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.ru',\n", - " 'data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.en',\n", - " 'data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.ru'\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "da4c181a", - "metadata": {}, - "source": [ - "## Shuffle" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "413734bd", - "metadata": {}, - "outputs": [], - "source": [ - "!shuf --random-source=data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.en \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.en > \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.shuf.en\n", - "\n", - "!shuf --random-source=data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.en \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.ru > \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.shuf.ru\n", - "\n", - "!paste -d \"\\t\" \\\n", - " 
data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.shuf.en \\\n", - " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.shuf.ru \\\n", - " | head -10" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5f3b3640", - "metadata": {}, - "outputs": [], - "source": [ - "!rm -rf data/tarred_dataset_en_ru_8k_tokens" - ] - }, - { - "cell_type": "markdown", - "id": "844a9f26", - "metadata": {}, - "source": [ - "## Tarred Dataset Creation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2b045df5", - "metadata": {}, - "outputs": [], - "source": [ - "!wget https://raw.github.com/NVIDIA/NeMo/main/examples/nlp/machine_translation/create_tarred_parallel_dataset.py \\\n", - " -O create_tarred_parallel_dataset.py\n", - "\n", - "!python create_tarred_parallel_dataset.py \\\n", - " --src_fname data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.shuf.en \\\n", - " --tgt_fname data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.shuf.ru \\\n", - " --out_dir data/tarred_dataset_en_ru_8k_tokens \\\n", - " --clean \\\n", - " --encoder_tokenizer_name yttm \\\n", - " --encoder_tokenizer_vocab_size 32000 \\\n", - " --encoder_tokenizer_coverage 0.999 \\\n", - " --encoder_tokenizer_bpe_dropout 0.1 \\\n", - " --decoder_tokenizer_name yttm \\\n", - " --decoder_tokenizer_vocab_size 32000 \\\n", - " --decoder_tokenizer_coverage 0.999 \\\n", - " --decoder_tokenizer_bpe_dropout 0.1 \\\n", - " --max_seq_length 512 \\\n", - " --min_seq_length 1 \\\n", - " --tokens_in_batch 8000 \\\n", - " --lines_per_dataset_fragment 100000 \\\n", - " --num_batches_per_tarfile 20\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "990265e5", - "metadata": {}, - "outputs": [], - "source": [ - "!ls data/tarred_dataset_en_ru_8k_tokens" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cc5e123b", - "metadata": {}, - 
"outputs": [], - "source": [ - "!cat data/tarred_dataset_en_ru_8k_tokens/metadata.tokens.8000.json" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.8" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} \ No newline at end of file + "cells": [ + { + "cell_type": "markdown", + "id": "bd9c257a", + "metadata": {}, + "source": [ + "Instructions for setting up Colab are as follows:\n", + "1. Open a new Python 3 notebook.\n", + "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", + "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", + "4. Run this cell to set up dependencies.\n", + "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect" + ] + }, + { + "cell_type": "markdown", + "id": "0075e98c", + "metadata": {}, + "source": [ + "# Data Preprocessing & Cleaning for NMT\n", + "\n", + "This notebook contains a tutorial of data processing and cleaning for NMT (Neural Machine Translation) to train translation models with the [NeMo framework](https://github.com/NVIDIA/NeMo).\n", + "\n", + "A pre-requisite to train supervised neural machine translation systems is the availability of *parallel corpora* of reasonable quality.\n", + "\n", + "A parallel corpus is a collection of sentences or documents that are translations of each other in 2 or more languages.\n", + "\n", + "For example,\n", + "\n", + "| English | Russian |\n", + "| :-: | :-: |\n", + "| To date, a total of 43 participants from 15 countries have completed the training. 
| К настоящему времени подготовку прошли в общей сложности 43 участника из 15 стран . |\n", + "| M-Sport Bentley writes a new piece of Bentley history at Silverstone | M-Sport Bentley открывает новую страницу в истории Bentley в Сильверстоуне |\n", + "| Information in the application was not true. | Информация в заявлении не была достоверна. |\n", + "\n", + "This notebook will cover the following data pre-processing and data cleaning techniques for such corpora.\n", + "\n", + "## The importance of data cleaning\n", + "\n", + "The presence of noise in the training dataset can adversely affect model quality (https://arxiv.org/abs/1805.12282). Web-crawled and automatically aligned data sources in particular, such as [Paracrawl](https://paracrawl.eu/), [WikiMatrix](https://arxiv.org/abs/1907.05791), [CC-Aligned](https://arxiv.org/abs/1911.06154) and [CC-Matrix](https://arxiv.org/abs/1911.04944), can be extremely noisy.\n", + "\n", + "## Cleaning\n", + "1. Downloading and filtering publicly available datasets based on confidence thresholds (if available). For example, [WikiMatrix](https://arxiv.org/abs/1907.05791) filtering based on [LASER](https://arxiv.org/abs/1812.10464) confidence scores.\n", + "2. Language ID filtering using a pre-trained [fastText classifier](https://fasttext.cc/docs/en/language-identification.html). This step will remove all sentences from the parallel corpus that our classifier predicts as not being in the appropriate language (e.g., sentences in the English column that aren't in English or sentences in the Russian column that aren't in Russian).\n", + "3. Length and length-ratio filtering. This step removes all sentence pairs that are 1) too long, 2) too short, or 3) have a ratio between their lengths greater than a certain factor (this typically removes partial translations).\n", + "4. [Bicleaner](https://github.com/bitextor/bicleaner) classifier-based cleaning. 
Bicleaner identifies noisy parallel sentences using a classifier that leverages multiple features such as n-gram language model likelihood scores, word alignment scores and other heuristics.\n", + "\n", + "## Pre-processing\n", + "5. [Moses Punctuation Normalization](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/normalize-punctuation.perl). This step standardizes punctuation. For example, the less common way to write an apostrophe, as in Tiffany`s, will be standardized to Tiffany's.\n", + "6. Unicode standardization. Some Unicode characters that aren't punctuation also need to be standardized. For example, this step normalizes the full-width digit ４ to 4.\n", + "7. [Moses Tokenization](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) or text segmentation for Chinese/Japanese with [Jieba](https://github.com/fxsjy/jieba) and [mecab](https://github.com/taku910/mecab). For languages like Chinese and Japanese that do not have explicit word segmentation markers (like spaces), we use these tools to introduce spaces into the text that will let us split the string into words. For other languages, we use Moses to separate punctuation markers from words so that they become separate tokens.\n", + "8. Deduplication - This step removes duplicate translation pairs from the corpus.\n", + "9. Shuffling - This step shuffles the order of occurrence of translation pairs.\n", + "\n", + "## Tarred Datasets for Large Corpora\n", + "10. Large datasets with over 50M sentence pairs, when batched and pickled, can be up to 60GB in size. Loading them entirely into CPU memory when using, say, 8 or 16 workers with DistributedDataParallel training uses 480-960GB of RAM, which is often impractical and inefficient. 
Instead, we use [Webdataset](https://github.com/webdataset/webdataset) to allow training while keeping datasets on disk and let Webdataset handle pre-loading and fetching of data into CPU RAM.\n", + "\n", + "\n", + "## Disclaimer\n", + "\n", + "The data cleaning techniques used in this notebook are only meant to be loose guidelines and are not guaranteed to produce clean parallel corpora at the end of it. Not all of these steps are necessary for every dataset." + ] + }, + { + "cell_type": "markdown", + "id": "bb0eb698", + "metadata": {}, + "source": [ + "![NMT Data Pipeline](images/nmt_data_pipeline.png)" + ] + }, + { + "cell_type": "markdown", + "id": "4a9fd8d3", + "metadata": {}, + "source": [ + "# Downloading Publicly Available Data\n", + "\n", + "## WikiMatrix (https://arxiv.org/abs/1907.05791)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "78984523", + "metadata": {}, + "outputs": [], + "source": [ + "!mkdir -p data\n", + "print('Downloading data ...')\n", + "!wget https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-ru.tsv.gz -O data/WikiMatrix.en-ru.tsv.gz\n", + "print('---------------------')\n", + "print('Unzipping file ...')\n", + "!gunzip -k -f data/WikiMatrix.en-ru.tsv.gz\n", + "print('---------------------')\n", + "print('Peek into the file')\n", + "!head -10 data/WikiMatrix.en-ru.tsv\n", + "print('---------------------')\n", + "print('File length ...')\n", + "!wc -l data/WikiMatrix.en-ru.tsv\n", + "print('---------------------')" + ] + }, + { + "cell_type": "markdown", + "id": "b9a62f9e", + "metadata": {}, + "source": [ + "## Filter Based on LASER Confidence\n", + "\n", + "LASER (https://arxiv.org/abs/1812.10464) is a multi-lingual neural sentence embedding model that is often used for cross-lingual sentence/document retrieval. Similarities in the embedding space are often used as proxies for cross-lingual similarities." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21608388", + "metadata": {}, + "outputs": [], + "source": [ + "from tqdm import tqdm\n", + "import numpy as np\n", + "\n", + "def num_lines_in_file(fname):\n", + " \"\"\"\n", + " Returns the number of lines in a file.\n", + " \"\"\"\n", + " with open(fname, 'r') as f:\n", + " for i, _ in enumerate(f):\n", + " pass\n", + " return i + 1\n", + "\n", + "def filter_tsv_with_conf(\n", + " input_file, output_file_lang_1, output_file_lang_2,\n", + " confidence_threshold=None, confidence_column=None\n", + "):\n", + " \"\"\"\n", + " Filters a tsv file that has confidence scores associated with each parallel example.\n", + "\n", + " For example:\n", + "\n", + " 1.23 \\t This is a sentence in lang1 \\t This is a sentence in lang2\n", + " \"\"\"\n", + " print()\n", + " print('====================================')\n", + " print('======= TSV Conf Filtering =========')\n", + " print('====================================')\n", + " print()\n", + " num_lines = num_lines_in_file(input_file)\n", + " scores = []\n", + " num_output_lines = 0\n", + " lang_1_col = 0\n", + " lang_2_col = 1\n", + " with open(input_file, 'r') as f, \\\n", + " open(output_file_lang_1, 'w') as f_out_1, \\\n", + " open(output_file_lang_2, 'w') as f_out_2:\n", + " for line in tqdm(f, total=num_lines, desc=f\"Filtering file by confidence {confidence_threshold}\"):\n", + " if line.strip() == '':\n", + " continue\n", + " line = line.strip().split('\\t')\n", + " if len(line) < 2:\n", + " continue\n", + " if confidence_threshold is not None and float(line[confidence_column]) < confidence_threshold:\n", + " continue\n", + " else:\n", + " if confidence_threshold is not None:\n", + " scores.append(float(line[confidence_column]))\n", + " if confidence_column == 0:\n", + " lang_1_col, lang_2_col = 1, 2\n", + " elif confidence_column == 2:\n", + " lang_1_col, lang_2_col = 0, 1\n", + " elif confidence_column == 1:\n", + " lang_1_col, lang_2_col = 0, 
2\n", + " else:\n", + " raise ValueError(f\"Invalid Column for confidence {confidence_column}\")\n", + " f_out_1.write(line[lang_1_col] + '\\n')\n", + " f_out_2.write(line[lang_2_col] + '\\n')\n", + " num_output_lines += 1\n", + "\n", + " if confidence_threshold is not None:\n", + " print(f'Confidence score average : {np.mean(scores)}')\n", + " print(f'Confidence score variance : {np.var(scores)}')\n", + " print(f'Kept {num_output_lines} out of {num_lines} after conversion ({(num_output_lines / num_lines) * 100}%)')\n", + " print('====================================')\n", + "\n", + "filter_tsv_with_conf(\n", + " 'data/WikiMatrix.en-ru.tsv',\n", + " 'data/WikiMatrix.en-ru.en', \n", + " 'data/WikiMatrix.en-ru.ru',\n", + " confidence_threshold=1.04, confidence_column=0\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "18a171d1", + "metadata": {}, + "source": [ + "## Language ID filtering with fastText\n", + "\n", + "Noisy parallel corpora often contain sentences that are not in the intended language. A classifier that determines the language in which a sentence is written can be used to filter out sentences that aren't in the appropriate language." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d58b7148", + "metadata": {}, + "outputs": [], + "source": [ + "!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -O data/lid.176.bin\n", + "print()\n", + "print('====================================')\n", + "print('====== Language ID Filtering =======')\n", + "print('====================================')\n", + "print()\n", + "\n", + "\n", + "!wget https://raw.github.com/NVIDIA/NeMo/main/scripts/neural_machine_translation/filter_langs_nmt.py \\\n", + " -O filter_langs_nmt.py\n", + "\n", + "!python filter_langs_nmt.py \\\n", + " --input-src data/WikiMatrix.en-ru.en \\\n", + " --input-tgt data/WikiMatrix.en-ru.ru \\\n", + " --output-src data/WikiMatrix.en-ru.langidfilter.en \\\n", + " --output-tgt data/WikiMatrix.en-ru.langidfilter.ru \\\n", + " --source-lang en \\\n", + " --target-lang ru \\\n", + " --removed-src data/WikiMatrix.en-ru.langidfilter.removed.en \\\n", + " --removed-tgt data/WikiMatrix.en-ru.langidfilter.removed.ru \\\n", + " --fasttext-model data/lid.176.bin\n", + "\n", + "print()\n", + "print('-----------------------------------------')\n", + "print('Number of removed sentences:')\n", + "print('-----------------------------------------')\n", + "print()\n", + "!wc -l data/WikiMatrix.en-ru.langidfilter.removed.ru\n", + "\n", + "print()\n", + "print('-----------------------------------------')\n", + "print('Examples of removed sentences')\n", + "print('-----------------------------------------')\n", + "print()\n", + "\n", + "!paste -d \"\\t\" \\\n", + " data/WikiMatrix.en-ru.langidfilter.removed.en \\\n", + " data/WikiMatrix.en-ru.langidfilter.removed.ru \\\n", + " | head -10\n", + "print('-----------------------------------------')" + ] + }, + { + "cell_type": "markdown", + "id": "ffb42e92", + "metadata": {}, + "source": [ + "## Length and Ratio Filtering\n", + "\n", + "This step filters out sentences based on their lengths and the ratio between source and 
target lengths. If (a) src_len / tgt_len or tgt_len / src_len exceeds 1.3 or (b) the source or target sequence length is less than 1 or greater than 250, the sentence pair will be removed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "52ff172a", + "metadata": {}, + "outputs": [], + "source": [ + "!git clone https://github.com/moses-smt/mosesdecoder data/mosesdecoder\n", + "!cd data/mosesdecoder && git checkout RELEASE-4.0 && cd ../..\n", + "!perl data/mosesdecoder/scripts/training/clean-corpus-n.perl -ratio 1.3 \\\n", + " data/WikiMatrix.en-ru.langidfilter \\\n", + " en ru \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio \\\n", + " 1 250" + ] + }, + { + "cell_type": "markdown", + "id": "28de44eb", + "metadata": {}, + "source": [ + "THE FOLLOWING CELLS REQUIRE THE INSTALLATION OF BICLEANER, WHICH REQUIRES COMPILING PACKAGES FROM SOURCE AND IS TRICKY TO GET WORKING INSIDE THIS CONTAINER. PLEASE INSTALL BICLEANER FROM THE REPOSITORY - https://github.com/bitextor/bicleaner OR FOLLOW INSTRUCTIONS BELOW. CELLS FOLLOWING THIS WILL NOT RUN IF BICLEANER IS NOT INSTALLED." + ] + }, + { + "cell_type": "markdown", + "id": "4ea62d2c", + "metadata": {}, + "source": [ + "You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", + "\n", + "## Install dependencies\n", + "\n", + "!pip install wget\n", + "!apt-get install libboost-all-dev\n", + "!apt-get install gawk\n", + "\n", + "## Install NeMo\n", + "\n", + "BRANCH = 'main'\n", + "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", + "\n", + "!pip uninstall -y sacrebleu\n", + "!pip install sacrebleu[ja]\n", + "!pip install xxhash\n", + "\n", + "## Install kenlm with 7-gram support\n", + "!mkdir -p data\n", + "!rm -rf data/kenlm\n", + "!git clone https://github.com/kpu/kenlm data/kenlm\n", + "!cd data/kenlm \\\n", + " && pip install . 
--install-option=\"--max_order 7\" \\\n", + " && mkdir -p build \\\n", + " && cd build \\\n", + " && cmake .. -DKENLM_MAX_ORDER=7 -DCMAKE_INSTALL_PREFIX:PATH=../../kenlm_install \\\n", + " && make -j all install && cd ../../kenlm_install \\\n", + " && export PATH=$PATH:$PWD\n", + "\n", + "# Install bicleaner\n", + "\n", + "!pip install bicleaner" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f3c0cf69", + "metadata": {}, + "outputs": [], + "source": [ + "try:\n", + " import bicleaner\n", + "except ImportError:\n", + " raise ImportError(f\"You need to install Bicleaner to proceed. Could not import the bicleaner package.\")" + ] + }, + { + "cell_type": "markdown", + "id": "01f2b589", + "metadata": {}, + "source": [ + "## Bicleaner Filtering\n", + "\n", + "Bicleaner (https://aclanthology.org/W18-6488/ and https://aclanthology.org/2020.eamt-1.31/) is a tool to identify noisy parallel sentences in translation corpora. It applies 3 different filtering steps:\n", + "\n", + "1. Pre-filtering based on 37 rules.\n", + "2. Language model fluency scores based on n-gram language models trained with kenlm.\n", + "3. Random forest classifier that uses all examples filtered out in steps 1 & 2 as \"negative\" examples." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9be8d4ca", + "metadata": {}, + "outputs": [], + "source": [ + "print('Downloading En-Ru Bicleaner models.')\n", + "!git clone https://github.com/bitextor/bicleaner data/bicleaner\n", + "!cd data/bicleaner && git checkout bicleaner-0.15 && cd ../..\n", + "!data/bicleaner/utils/download-pack.sh en ru\n", + "\n", + "print('Generating Bicleaner scores ...')\n", + "!gawk '{{print \"-\\t-\"}}' \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.en | \\\n", + " paste -d \"\\t\" - data/WikiMatrix.en-ru.langidfilter.lengthratio.en \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.ru | \\\n", + " bicleaner-classify - - en-ru/en-ru.yaml \\\n", + " > data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.scores" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "43059b8a", + "metadata": {}, + "outputs": [], + "source": [ + "print('Score file ...')\n", + "!head -10 data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.scores\n", + "\n", + "print()\n", + "print('-----------------------------------------')\n", + "print('Filtering based on Bicleaner scores > 0.6 ...')\n", + "print('-----------------------------------------')\n", + "print()\n", + "\n", + "print('Filtering out English ...')\n", + "!gawk -F \"\\t\" '{if ($5>0.6) {print $3}}' \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.scores > \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.en\n", + "\n", + "print('Filtering out Russian ...')\n", + "!gawk -F \"\\t\" '{if ($5>0.6) {print $4}}' \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.scores > \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.ru\n", + "\n", + "!paste -d \"\\t\" \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.en \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.ru \\\n", + " | head -10" + ] + }, + { + "cell_type": 
"markdown", + "id": "0726510c", + "metadata": {}, + "source": [ + "## Normalize Punctuation\n", + "\n", + "Punctuation can vary across languages and even between ASCII and Unicode variants of the same punctuation marker. For example, in German, quotes are often written as „ and “ while in English we typically just use \". This step normalizes such punctuation differences to use the same character everywhere.\n", + "\n", + "We use [moses](https://github.com/moses-smt/mosesdecoder) or [sacremoses](https://github.com/alvations/sacremoses) to normalize punctuation. The moses implementation is in Perl while sacremoses is in Python with a CLI. The Perl implementation is buffered and works better for large corpora that may not fit into CPU memory all at once, while sacremoses is unbuffered and multi-processed." + ] + }, + { + "cell_type": "markdown", + "id": "e73670d6", + "metadata": {}, + "source": [ + "### Sacremoses" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "597e041a", + "metadata": {}, + "outputs": [], + "source": [ + "print('Normalizing English ...')\n", + "!sacremoses -j 4 normalize \\\n", + " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.en > \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.en\n", + "\n", + "print('Normalizing Russian ...')\n", + "!sacremoses -j 4 normalize \\\n", + " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.ru > \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.ru\n" + ] + }, + { + "cell_type": "markdown", + "id": "240b0a1f", + "metadata": {}, + "source": [ + "### Moses\n", + "\n", + "Punctuation can vary across languages and even between ASCII and Unicode variants of the same punctuation marker. For example, in German, quotes are often written as „ and “ while in English we typically just use \". 
This step normalizes such punctuation differences to use the same character everywhere.\n", + "\n", + "We use [moses](https://github.com/moses-smt/mosesdecoder) or [sacremoses](https://github.com/alvations/sacremoses) to normalize punctuation. The moses implementation is in Perl while sacremoses is in Python with a CLI. The Perl implementation is buffered and works better for large corpora that may not fit into CPU memory all at once, while sacremoses is unbuffered and multi-processed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1f5adaa4", + "metadata": {}, + "outputs": [], + "source": [ + "print('Normalizing English ...')\n", + "!perl data/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l en \\\n", + " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.en > \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.en\n", + "\n", + "print('Normalizing Russian ...')\n", + "!perl data/mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l ru \\\n", + " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.ru > \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.ru\n" + ] + }, + { + "cell_type": "markdown", + "id": "b8bfad64", + "metadata": {}, + "source": [ + "## Tokenize\n", + "\n", + "Tokenization splits a string into a sequence of tokens. A naive way of doing this would be to simply split the string on spaces (for languages where this is possible). This, however, will result in punctuation being \"attached\" to the neighboring word when tokenizing. For example, \n", + "\n", + "\"This is a sentence.\" will be tokenized as [\"This, is, a, sentence.\"].\n", + "\n", + "However, we'd typically like punctuation marks to be separate tokens. For example,\n", + "\n", + "\"This is a sentence.\" will be tokenized by moses or sacremoses as [\", This, is, a, sentence, ., \"]." 
+ ] + }, + { + "cell_type": "markdown", + "id": "06c60b90", + "metadata": {}, + "source": [ + "### Sacremoses" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7bb4c631", + "metadata": {}, + "outputs": [], + "source": [ + "print('Tokenizing English ...')\n", + "!sacremoses -j 4 -l en tokenize -x \\\n", + " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.en > \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.en\n", + "\n", + "print('Tokenizing Russian ...')\n", + "!sacremoses -j 4 -l ru tokenize -x \\\n", + " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.ru > \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.sacremoses.norm.tok.ru\n" + ] + }, + { + "cell_type": "markdown", + "id": "444bebd7", + "metadata": {}, + "source": [ + "### Moses" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21333e27", + "metadata": {}, + "outputs": [], + "source": [ + "print('Tokenizing English ...')\n", + "!perl data/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en -no-escape -threads 4 \\\n", + " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.en > \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.en\n", + "\n", + "print('Tokenizing Russian ...')\n", + "!perl data/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ru -no-escape -threads 4 \\\n", + " < data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.ru > \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.ru\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b28df2bb", + "metadata": {}, + "outputs": [], + "source": [ + "print()\n", + "print('-----------------------------------------')\n", + "print('Tokenized Russian Sentences ...')\n", + "print('-----------------------------------------')\n", + "print()\n", + "\n", + "!head -10 
data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.ru\n", + "\n", + "print()\n", + "print('-----------------------------------------')\n", + "print('Tokenized English Sentences ...')\n", + "print('-----------------------------------------')\n", + "print()\n", + "\n", + "!head -10 data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.en" + ] + }, + { + "cell_type": "markdown", + "id": "dee5409d", + "metadata": {}, + "source": [ + "## Segmenting Chinese and Japanese\n", + "\n", + "### Jieba segmentation for Chinese" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41b4cc91", + "metadata": {}, + "outputs": [], + "source": [ + "import jieba\n", + "\n", + "!wget https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-zh.tsv.gz -O data/WikiMatrix.en-zh.tsv.gz\n", + "!gunzip -k -f data/WikiMatrix.en-zh.tsv.gz\n", + "\n", + "print()\n", + "print('-----------------------------------------')\n", + "print('Chinese text before segmentation ...')\n", + "print('-----------------------------------------')\n", + "print()\n", + "\n", + "!awk -F \"\\t\" '{print $3}' data/WikiMatrix.en-zh.tsv | head -10\n", + "print()\n", + "print('-----------------------------------------')\n", + "print('Segmenting Chinese text ...')\n", + "print('-----------------------------------------')\n", + "print()\n", + "\n", + "zh_lines = []\n", + "with open('data/WikiMatrix.en-zh.tsv', 'r') as f:\n", + " for idx, line in enumerate(f):\n", + " line = line.strip().split('\\t')[2]\n", + " zh_lines.append(' '.join(jieba.cut(line)))\n", + " if idx == 100:\n", + " break\n", + "print()\n", + "print('-----------------------------------------')\n", + "print('Chinese text after segmentation ...')\n", + "print('\\n'.join(zh_lines[:10]))\n", + "print('-----------------------------------------')\n", + "print()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "489bd915", + "metadata": {}, + "outputs": [], + "source": [ + 
"import MeCab\n", + "import ipadic\n", + "\n", + "!wget https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-ja.tsv.gz -O data/WikiMatrix.en-ja.tsv.gz\n", + "!gunzip -k -f data/WikiMatrix.en-ja.tsv.gz\n", + "\n", + "print()\n", + "print('-----------------------------------------')\n", + "print('Japanese text before segmentation ...')\n", + "print('-----------------------------------------')\n", + "print()\n", + "\n", + "!awk -F \"\\t\" '{print $3}' data/WikiMatrix.en-ja.tsv | head -10\n", + "\n", + "print()\n", + "print('-----------------------------------------')\n", + "print('Segmenting Japanese text ...')\n", + "print('-----------------------------------------')\n", + "print()\n", + "\n", + "mecab_tokenizer = MeCab.Tagger(ipadic.MECAB_ARGS + \" -Owakati\")\n", + "\n", + "ja_lines = []\n", + "with open('data/WikiMatrix.en-ja.tsv', 'r') as f:\n", + " for idx, line in enumerate(f):\n", + " line = line.strip().split('\\t')[2]\n", + " ja_lines.append(mecab_tokenizer.parse(line))\n", + " if idx == 100:\n", + " break\n", + "print()\n", + "print('-----------------------------------------')\n", + "print('Japanese text after segmentation ...')\n", + "print('\\n'.join(ja_lines[:10]))\n", + "print('-----------------------------------------')\n", + "print()" + ] + }, + { + "cell_type": "markdown", + "id": "4a079efe", + "metadata": {}, + "source": [ + "## Deduplicate" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55d98bf3", + "metadata": {}, + "outputs": [], + "source": [ + "import xxhash\n", + "\n", + "def dedup_file(input_file_lang_1, input_file_lang_2, output_file_lang_1, output_file_lang_2):\n", + " print()\n", + " print('====================================')\n", + " print('========== De-duplicate ============')\n", + " print('====================================')\n", + " print()\n", + " num_lines = num_lines_in_file(input_file_lang_1)\n", + " hashes = set()\n", + " num_output_lines = 0\n", + " with open(input_file_lang_1, 'r') as 
f_lang1, \\\n", + " open(input_file_lang_2, 'r') as f_lang2, \\\n", + " open(output_file_lang_1, 'w') as f_out_lang1, \\\n", + " open(output_file_lang_2, 'w') as f_out_lang2:\n", + " for line_1, line_2 in tqdm(zip(f_lang1, f_lang2), total=num_lines, desc=f\"Deduplicating files\"):\n", + " parallel_hash = xxhash.xxh64((line_1.strip() + '\\t' + line_2.strip()).encode('utf-8')).hexdigest()\n", + " if parallel_hash not in hashes:\n", + " hashes.add(parallel_hash)\n", + " f_out_lang1.write(line_1.strip() + '\\n')\n", + " f_out_lang2.write(line_2.strip() + '\\n')\n", + " num_output_lines += 1\n", + "\n", + " print(f\"Kept {num_output_lines} out of {num_lines} after deduplication\")\n", + "\n", + "dedup_file(\n", + " 'data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.en',\n", + " 'data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.ru',\n", + " 'data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.en',\n", + " 'data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.ru'\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "da4c181a", + "metadata": {}, + "source": [ + "## Shuffle" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "413734bd", + "metadata": {}, + "outputs": [], + "source": [ + "!shuf --random-source=data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.en \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.en > \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.shuf.en\n", + "\n", + "!shuf --random-source=data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.en \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.ru > \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.shuf.ru\n", + "\n", + "!paste -d \"\\t\" \\\n", + " 
data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.shuf.en \\\n", + " data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.shuf.ru \\\n", + " | head -10" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5f3b3640", + "metadata": {}, + "outputs": [], + "source": [ + "!rm -rf data/tarred_dataset_en_ru_8k_tokens" + ] + }, + { + "cell_type": "markdown", + "id": "844a9f26", + "metadata": {}, + "source": [ + "## Tarred Dataset Creation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b045df5", + "metadata": {}, + "outputs": [], + "source": [ + "!wget https://raw.github.com/NVIDIA/NeMo/main/examples/nlp/machine_translation/create_tarred_parallel_dataset.py \\\n", + " -O create_tarred_parallel_dataset.py\n", + "\n", + "!python create_tarred_parallel_dataset.py \\\n", + " --src_fname data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.shuf.en \\\n", + " --tgt_fname data/WikiMatrix.en-ru.langidfilter.lengthratio.bicleaner.60.moses.norm.tok.dedup.shuf.ru \\\n", + " --out_dir data/tarred_dataset_en_ru_8k_tokens \\\n", + " --clean \\\n", + " --encoder_tokenizer_name yttm \\\n", + " --encoder_tokenizer_vocab_size 32000 \\\n", + " --encoder_tokenizer_coverage 0.999 \\\n", + " --encoder_tokenizer_bpe_dropout 0.1 \\\n", + " --decoder_tokenizer_name yttm \\\n", + " --decoder_tokenizer_vocab_size 32000 \\\n", + " --decoder_tokenizer_coverage 0.999 \\\n", + " --decoder_tokenizer_bpe_dropout 0.1 \\\n", + " --max_seq_length 512 \\\n", + " --min_seq_length 1 \\\n", + " --tokens_in_batch 8000 \\\n", + " --lines_per_dataset_fragment 100000 \\\n", + " --num_batches_per_tarfile 20\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "990265e5", + "metadata": {}, + "outputs": [], + "source": [ + "!ls data/tarred_dataset_en_ru_8k_tokens" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cc5e123b", + "metadata": {}, + 
"outputs": [], + "source": [ + "!cat data/tarred_dataset_en_ru_8k_tokens/metadata.tokens.8000.json" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/tutorials/nlp/Megatron_Synthetic_Tabular_Data_Generation.ipynb b/tutorials/nlp/Megatron_Synthetic_Tabular_Data_Generation.ipynb index bb2c951b53a7..3993b046b7d5 100644 --- a/tutorials/nlp/Megatron_Synthetic_Tabular_Data_Generation.ipynb +++ b/tutorials/nlp/Megatron_Synthetic_Tabular_Data_Generation.ipynb @@ -827,4 +827,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/tutorials/nlp/Multitask_Prompt_and_PTuning.ipynb b/tutorials/nlp/Multitask_Prompt_and_PTuning.ipynb index ff85e619ae20..05c1749fde2e 100644 --- a/tutorials/nlp/Multitask_Prompt_and_PTuning.ipynb +++ b/tutorials/nlp/Multitask_Prompt_and_PTuning.ipynb @@ -39,7 +39,7 @@ "source": [ "# Introduction\n", "\n", - "In this notebook we demonstrate how to use p-tunining and prompt tuning within NeMo-Megatron. Both methods are parameter efficient alternatives to fine-tuning pretrained language models. Our NeMo implementation makes it possible to use one pretrained GPT model on many downstream tasks without needing to tune the model’s full set of parameters. It also allows for adding new tasks to your model without overwriting or disrupting previous tasks for which the model has already been p-tuned/prompt-tuned. 
Because the original model parameters are frozen and never altered by either method, p-tuning/prompt-tuning also avoid cartographic forgetting issues often encountered when fine-tuning models.\n", + "In this notebook we demonstrate how to use p-tuning and prompt tuning within NeMo-Megatron. Both methods are parameter efficient alternatives to fine-tuning pretrained language models. Our NeMo implementation makes it possible to use one pretrained GPT model on many downstream tasks without needing to tune the model’s full set of parameters. It also allows for adding new tasks to your model without overwriting or disrupting previous tasks for which the model has already been p-tuned/prompt-tuned. Because the original model parameters are frozen and never altered by either method, p-tuning/prompt-tuning also avoid cartographic forgetting issues often encountered when fine-tuning models.\n", "\n", "- Our prompt tuning implementation is based off Lester et. al’s EMNLP 2021 paper [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/abs/2104.08691)\n", "\n", @@ -61,7 +61,7 @@ "We will first p-tune a GPT model on sentiment analysis, and intent and slot classification tasks. Then we will show how to add the squad question answering task to the same model we already p-tuned once.\n", "\n", "\n", - "# Techincal Overview\n", + "# Technical Overview\n", "Instead of selecting discrete text prompts in a manual or automated fashion, prompt tuning and p-tuning utilize virtual prompt embeddings that can be optimized via gradient decent. 
The only difference between prompt tuning and p-tuning within NeMo-Megatron is the architecture used to tune the soft prompt tokens during training.\n", "\n", "### Terminology\n", @@ -212,8 +212,29 @@ "outputs": [], "source": [ "# Download the financial phrase bank dataset\n", - "!wget https://www.researchgate.net/profile/Pekka_Malo/publication/251231364_FinancialPhraseBank-v10/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v10.zip\n", - "!unzip FinancialPhraseBank-v10.zip -d {DATA_DIR}" + "!wget https://www.researchgate.net/profile/Pekka_Malo/publication/251231364_FinancialPhraseBank-v1.0/data/0c96051eee4fb1d56e000000/FinancialPhraseBank-v1.0.zip\n", + "\n", + "# If you are having issues with the research gate link above, copy and paste it in your browser \n", + "# and the file should download automatically. Then place it in the same directory in which \n", + "# you are running this notebook. " + ] + }, + { + "cell_type": "markdown", + "id": "964a3903", + "metadata": {}, + "source": [ + "If you are having issues with the research gate link above, copy and paste it in your browser and the file should download automatically. Then place it in the same directory in which you are running this notebook. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b934dac", + "metadata": {}, + "outputs": [], + "source": [ + "!unzip FinancialPhraseBank-v1.0.zip -d {DATA_DIR}" ] }, { @@ -255,7 +276,7 @@ "id": "06481f49", "metadata": {}, "source": [ - "Our financial phrase bank preprocessing script converted the raw text file of sentences and labels into three `.jsonl` files for training, validation, and testing. Each line in the files contains a json object with the fields `taskname`, `sentiment`,`sentence`, and `label`. You can inspect the preprocessing script and play with different arguments for the script by looking at and running `prompt_learning_financial_phrase_bank_preprocessing.py` which should currently be downloaded in `NEMO_DIR`. 
It is also located at `scripts/dataset_processing/nlp/financial_phrase_bank/prompt_learning_financial_phrase_bank_preprocessing.py` in the NeMo repo.\n", + "Our financial phrase bank preprocessing script converted the raw text file of sentences and labels into three `.jsonl` files for training, validation, and testing. Each line in the files contains a json object with the fields `taskname`, `sentiment`, `sentence`, and `label`. You can inspect the preprocessing script and play with different arguments for the script by looking at and running `prompt_learning_financial_phrase_bank_preprocessing.py` which should currently be downloaded in `NEMO_DIR`. It is also located at `scripts/dataset_processing/nlp/financial_phrase_bank/prompt_learning_financial_phrase_bank_preprocessing.py` in the NeMo repo.\n", "\n", "By default 80% of the data was randomly selected for the training set, 10% for the validation set, and 10% for the test set. We only used training examples with 100% agreement from labelers on the correct sentiment label. This data is from `Sentences_AllAgree.txt`. This should result in `1811` training examples, `226` validation examples, and `227` examples for testing. The `label` field was removed from test examples. \n", "\n", @@ -385,7 +406,7 @@ "id": "e803eaed", "metadata": {}, "source": [ - "For the virtual assistent dataset, there are a set of 64 possible intents:" + "For the virtual assistant dataset, there are a set of 64 possible intents:" ] }, { @@ -430,7 +451,7 @@ "source": [ "Each slot label consists of the slot type followed by specific text from the utterance corresponding to that slot type in parentheses. For example, the utterance `\"tell my facebook group that i've arrived\"` has the intent label `social_post` and the slot label `media_type(facebook)`. Utterances each have one intent label and zero or more slot labels. In cases where there is no slot label, our GPT model should predict the word `None`. 
\n", "\n", - "Json objects for each training example contain three fields: `taskname`, `utterance`, and `label`. For this dataset, our preprocessing scipt formatted our intent and slot labels to look like `\"\\nIntent: transport_taxi\\nSlots: transport_agency(golden taxi), time(seven pm), date(today)\"`. With newline characters (\\n) separating intent and slot labels. Our train jsonl file has `9960` training examples. Our validation and test jsonl files each have `538` training examples. Test examples do not have the `label` field. \n", + "Json objects for each training example contain three fields: `taskname`, `utterance`, and `label`. For this dataset, our preprocessing script formatted our intent and slot labels to look like `\"\\nIntent: transport_taxi\\nSlots: transport_agency(golden taxi), time(seven pm), date(today)\"`. With newline characters (\\n) separating intent and slot labels. Our train jsonl file has `9960` training examples. Our validation and test jsonl files each have `538` training examples. Test examples do not have the `label` field. \n", "\n", "The preprocessing script can be found at `scripts/dataset_processing/nlp/intent_and_slot/prompt_learning_assistant_preprocessing.py`" ] @@ -442,7 +463,7 @@ "source": [ "# P-Tuning Model Config Setup\n", "\n", - "Now we will begin setting up the conifg file used for prompt/p-tuning our GPT models! GPT Prompt learning within NeMo uses a class called `MegatronGPTPromptLearningModel` which has its own config file. We will start by loading an example prompt learning config file, then make changes to it to fit our tasks and training plans. " + "Now we will begin setting up the config file used for prompt/p-tuning our GPT models! GPT Prompt learning within NeMo uses a class called `MegatronGPTPromptLearningModel` which has its own config file. We will start by loading an example prompt learning config file, then make changes to it to fit our tasks and training plans. 
" ] }, { @@ -593,7 +614,7 @@ "metadata": {}, "source": [ "### Setting New Tasks\n", - "After you p-tune your model this time, you can always go back and p-tune or prompt-tune your model on more tasks without over writting the virtual prompts who've trained this time. You can also use a different number of `total_virtual_tokens` between each training session as long as tasks ptuned or prompt tuned at the same time have the same number of `total_virtual_tokens`. For this reason, when you p-tune on a new task, you need to tell your model which of your tasks are new and which ones already exist (and thus you don't want to tune them). \n", + "After you p-tune your model this time, you can always go back and p-tune or prompt-tune your model on more tasks without over writing the virtual prompts who've trained this time. You can also use a different number of `total_virtual_tokens` between each training session as long as tasks ptuned or prompt tuned at the same time have the same number of `total_virtual_tokens`. For this reason, when you p-tune on a new task, you need to tell your model which of your tasks are new and which ones already exist (and thus you don't want to tune them). \n", "\n", "You do this by setting the `new_tasks` and `existing_tasks` values in the config file. Because we are p-tuning a model with no existing tasks, you should set `existing_tasks=[]` and `new_tasks=[\"sentiment\", \"intent_and_slot\"]` as follows:" ] @@ -614,7 +635,7 @@ "id": "3b77e88c", "metadata": {}, "source": [ - "After p-tuning and/or prompt tuning is complete, you can run inference on all tasks at the same time, regradless of their `total_virtual_tokens` value." + "After p-tuning and/or prompt tuning is complete, you can run inference on all tasks at the same time, regardless of their `total_virtual_tokens` value." 
] }, { @@ -1209,7 +1230,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.12" + "version": "3.8.13" } }, "nbformat": 4, diff --git a/tutorials/nlp/Non_English_Downstream_Tasks_(NER).ipynb b/tutorials/nlp/Non_English_Downstream_Tasks_(NER).ipynb index 809c81558947..0d5826bee4ea 100755 --- a/tutorials/nlp/Non_English_Downstream_Tasks_(NER).ipynb +++ b/tutorials/nlp/Non_English_Downstream_Tasks_(NER).ipynb @@ -1,893 +1,902 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "OETcTQlcguCm" - }, - "outputs": [], - "source": [ - "BRANCH = 'main'" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "o_0K1lsW1dj9" - }, - "outputs": [], - "source": [ - "\"\"\"\n", - "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", - "\n", - "Instructions for setting up Colab are as follows:\n", - "1. Open a new Python 3 notebook.\n", - "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", - "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", - "4. Run this cell to set up dependencies.\n", - "\"\"\"\n", - "# If you're using Google Colab and not running locally, run this cell\n", - "\n", - "# install NeMo\n", - "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@${BRANCH}#egg=nemo_toolkit[nlp]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "pC0slAc0h9zN", - "pycharm": { - "name": "#%%\n" - } - }, - "outputs": [], - "source": [ - "# If you're not using Colab, you might need to upgrade jupyter notebook to avoid the following error:\n", - "# 'ImportError: IProgress not found. Please update jupyter and ipywidgets.'\n", - "\n", - "! pip install ipywidgets\n", - "! 
jupyter nbextension enable --py widgetsnbextension\n", - "\n", - "# Please restart the kernel after running this cell" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "dzqD2WDFOIN-" - }, - "outputs": [], - "source": [ - "from nemo.collections import nlp as nemo_nlp\n", - "from nemo.utils.exp_manager import exp_manager\n", - "\n", - "import os\n", - "import wget \n", - "import torch\n", - "import pytorch_lightning as pl\n", - "from omegaconf import OmegaConf\n", - "\n", - "import zipfile\n", - "import random" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "daYw_Xll2ZR9" - }, - "source": [ - "# Tutorial Overview\n", - "In this tutorial, we will show how to use a pre-trained BERT language model on a non-English downstream task. Here we are going to use Persian language and Named entity recognition (NER) task as an example. Note, most of the rest downstream tasks supported in NeMo should work similarly for other languages. \n", - "\n", - "# Task Description\n", - "NER is the task of detecting and classifying key information (entities) in text.\n", - "For example, in a sentence: `Mary lives in Santa Clara and works at NVIDIA`, we should detect that `Mary` is a person, `Santa Clara` is a location and `NVIDIA` is a company.\n", - "\n", - "In this tutorial we will be using [BERT language model](https://arxiv.org/abs/1810.04805).\n", - "\n", - "To read more about other topics and downstream task that can be done in NeMo, you can see the [NeMo's tutorial page](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZnuziSwJ1yEB" - }, - "source": [ - "# Dataset\n", - "\n", - "In this tutorial we are going to use [Persian Arman dataset for our NER task](https://github.com/HaniehP/PersianNER).\n", - "\n", - "Arman is a hand annotated Persian corpus for NER task with 250,015 tokens and 7,682 sentences. 
Using [IOB encoding](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), tokens are labeled with either one of the following name entities or labeled with O. \n", - "\n", - "* event = event\n", - "* fac = facility\n", - "* loc = location\n", - "* org = organization\n", - "* pers = person\n", - "* pro = product\n", - "\n", - "Each of these has a label staring with **B** that indicates it is the first token of the name entity and with **I** for others. \n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qzcZ3nb_-SVT" - }, - "source": [ - "# NeMo Token Classification Data Format\n", - "\n", - "[TokenClassification Model](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/token_classification/token_classification_model.py) in NeMo supports NER and other token level classification tasks, as long as the data follows the format specified below. \n", - "\n", - "Token Classification Model requires the data to be split into 2 files: \n", - "* text.txt \n", - "* labels.txt. 
\n", - "\n", - "Each line of the **text.txt** file contains text sequences, where words are separated with spaces, i.e.: \n", - "[WORD] [SPACE] [WORD] [SPACE] [WORD].\n", - "\n", - "The **labels.txt** file contains corresponding labels for each word in text.txt, the labels are separated with spaces, i.e.:\n", - "[LABEL] [SPACE] [LABEL] [SPACE] [LABEL].\n", - "\n", - "Example of a text.txt file:\n", - "```\n", - "دبیر شورای عالی انقلاب فرهنگی از گنجانده شدن 5 زبان خارجی جدید در برنامه درسی مدارس خبر داد.\n", - "```\n", - "Corresponding labels.txt file:\n", - "```\n", - "O B_ORG I_ORG I_ORG I_ORG O O O O O O O O O O O O O O \n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "SL58EWkd2ZVb" - }, - "source": [ - "## Download and preprocess the data¶" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_z2tCEIXZa90" - }, - "source": [ - "You can download the Arman dataset by cloning to the following github repository: https://github.com/HaniehP/PersianNER.\n", - "\n", - "After downloading the data, you will see a few files and folders inside a directory named PersianNER. 
Take ArmanPersoNERCorpus.zip and upload it to `DATA_DIR` (if running in a docker or locally) or use **files** from Google colab to upload the files.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "n8HZrDmr12_-" - }, - "outputs": [], - "source": [ - "# path to the folder with ArmanPersoNERCorpus.zip file (if running locally on in a docker)\n", - "DATA_DIR = \"PATH_TO_FOLDER_WITH_ZIP.ZIP_FILE\"\n", - "WORK_DIR = \"WORK_DIR\"\n", - "MODEL_CONFIG = \"token_classification_config.yaml\"\n", - "os.makedirs(WORK_DIR, exist_ok=True)\n", - "os.makedirs(DATA_DIR, exist_ok=True)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "k1TmF5rrdPMj" - }, - "outputs": [], - "source": [ - "if 'google.colab' in str(get_ipython):\n", - " from google.colab import files\n", - " uploaded = files.upload() " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "HTUKJOownkrF" - }, - "outputs": [], - "source": [ - "if 'google.colab' in str(get_ipython):\n", - " ! mv ArmanPersoNERCorpus.zip $DATA_DIR/." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "NhUzIeF0Yg0l" - }, - "source": [ - "Let's extract files from the zip file. It will generate three test and train files which have overlaps and are intended to be used in turn as train and test sets. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Y01BdjPRW-7B" - }, - "outputs": [], - "source": [ - "! cd {DATA_DIR} && unzip \"ArmanPersoNERCorpus.zip\"" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qaDgL-sQaX2e" - }, - "source": [ - "Next, we will be putting all data into a single file and removing any repeated sentences. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "B0T4CzJvbBJ4" - }, - "outputs": [], - "source": [ - "file_all = os.path.join(DATA_DIR, \"all_data.txt\")\n", - "with open(file_all, \"w\") as f1:\n", - " for filename in os.listdir(DATA_DIR):\n", - " if (filename == \"ReadMe.txt\" or filename == \"ArmanPersoNERCorpus.zip\" or filename == \"all_data.txt\"):\n", - " continue\n", - " with open(DATA_DIR + \"/\" + filename, \"r\") as f2:\n", - " for line in f2:\n", - " f1.write(line)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "VzVuET8HESFB" - }, - "source": [ - "Now, you need to convert this data into NeMo compatible format before starting the training process. For this purpose, you can run [examples/nlp/token_classification/data/import_from_iob_format.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/token_classification/data/import_from_iob_format.py) on your train and dev files, as follows:\n", - "\n", - "\n", - "\n", - "\n", - "```\n", - "python examples/nlp/token_classification/data/import_from_iob_format.py --data_file PATH_TO_IOB_FORMAT_DATAFILE, e.g., \"DATA_DIR/all_data.txt\"\n", - "```\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ord_6KlkeNl8" - }, - "outputs": [], - "source": [ - "!wget https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/token_classification/data/import_from_iob_format.py" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "IfSUkxffeSpL" - }, - "outputs": [], - "source": [ - "!python import_from_iob_format.py --data_file $DATA_DIR/all_data.txt" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Aj0rXbYXbivW" - }, - "source": [ - "Now we process the data to remove potentially any repeated sentences and then split them into train and dev sets. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "CgvnTlqzbq5-" - }, - "outputs": [], - "source": [ - "sent_dict = dict()\n", - "line_removed = dict()\n", - "line_counter = 0\n", - "with open(DATA_DIR + \"/text_all_not_repeated.txt\", \"w\") as f1:\n", - " with open(DATA_DIR + \"/text_all_data.txt\", \"r\") as f2:\n", - " for line in f2:\n", - " line_counter += 1\n", - " if (not line in sent_dict):\n", - " sent_dict[line] = 1\n", - " f1.write(line)\n", - " else:\n", - " line_removed[line_counter] = 1\n", - "#labels:\n", - "line_counter = 0\n", - "with open(DATA_DIR + \"/labels_all_not_repeated.txt\", \"w\") as f1:\n", - " with open(DATA_DIR + \"/labels_all_data.txt\", \"r\") as f2:\n", - " for line in f2:\n", - " line_counter += 1\n", - " if(not line_counter in line_removed):\n", - " f1.write(line)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "0cO3crs_gXjt" - }, - "source": [ - "After preprocessing the data and removing repeated sentences, there will be 7668 total valid sentences. We will be using 85% of that as train and 15% as dev. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "7oHQYsMMbugP" - }, - "outputs": [], - "source": [ - "total_data = 7668\n", - "train_share = 0.85\n", - "used_lines_train = dict()\n", - "flag = 1\n", - "count = 0\n", - "while flag:\n", - " idx = random.randint(1, total_data)\n", - " if (not idx in used_lines_train):\n", - " used_lines_train[idx] = 1\n", - " count += 1\n", - " if (count/total_data > train_share):\n", - " flag = 0\n", - "\n", - "line_counter = 0\n", - "with open(DATA_DIR+ \"/text_train.txt\", \"w\") as f1:\n", - " with open(DATA_DIR + \"/text_dev.txt\", \"w\") as f2:\n", - " with open(DATA_DIR + \"/text_all_not_repeated.txt\", \"r\") as f3:\n", - " for line in f3:\n", - " line_counter += 1\n", - " if (line_counter in used_lines_train):\n", - " f1.write(line)\n", - " else:\n", - " f2.write(line)\n", - "\n", - "line_counter = 0\n", - "with open(DATA_DIR + \"/labels_train.txt\", \"w\") as f1:\n", - " with open(DATA_DIR + \"/labels_dev.txt\", \"w\") as f2:\n", - " with open(DATA_DIR + \"/labels_all_not_repeated.txt\", \"r\") as f3:\n", - " for line in f3:\n", - " line_counter += 1\n", - " if (line_counter in used_lines_train):\n", - " f1.write(line)\n", - " else:\n", - " f2.write(line)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1Q-GWNwDbzKl" - }, - "source": [ - "Finally, we remove files that are not needed anymore." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "II20ustub5BF" - }, - "outputs": [], - "source": [ - "print(\"Removed files:\")\n", - "for filename in os.listdir(DATA_DIR):\n", - " if (filename == \"text_dev.txt\" or filename == \"text_train.txt\" or filename == \"labels_dev.txt\" or filename == \"labels_train.txt\"):\n", - " continue\n", - " print(filename)\n", - " os.remove(DATA_DIR + \"/\" + filename)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "U8Ty5_S7Ye8h" - }, - "source": [ - "Now, the data folder should contain these 4 files:" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "L8vsyh3JZH26" - }, - "source": [ - "\n", - "\n", - "* labels_dev.txt\n", - "* labels_train.txt\n", - "* text_dev.txt\n", - "* text_train.txt\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "qB0oLE4R9EhJ" - }, - "outputs": [], - "source": [ - "! ls -l {DATA_DIR}" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "6UDPgadLN6SG" - }, - "outputs": [], - "source": [ - "# let's take a look at the data \n", - "print('Text:')\n", - "! head -n 5 {DATA_DIR}/text_train.txt\n", - "\n", - "print('\\nLabels:')\n", - "! head -n 5 {DATA_DIR}/labels_train.txt" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_whKCxfTMo6Y" - }, - "source": [ - "# Model configuration\n", - "\n", - "Our Named Entity Recognition model is comprised of the pretrained [BERT](https://arxiv.org/pdf/1810.04805.pdf) model followed by a Token Classification layer.\n", - "\n", - "The model is defined in a config file which declares multiple important sections. 
They are:\n", - "- **model**: All arguments that are related to the Model - language model, token classifier, optimizer and schedulers, datasets and any other related information\n", - "\n", - "- **trainer**: Any argument to be passed to PyTorch Lightning" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "T1gA8PsJ13MJ" - }, - "outputs": [], - "source": [ - "# download the model's configuration file \n", - "config_dir = WORK_DIR + '/configs/'\n", - "os.makedirs(config_dir, exist_ok=True)\n", - "if not os.path.exists(config_dir + MODEL_CONFIG):\n", - " print('Downloading config file...')\n", - " wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/token_classification/conf/' + MODEL_CONFIG, config_dir)\n", - "else:\n", - " print ('config file is already exists')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "mX3KmWMvSUQw" - }, - "outputs": [], - "source": [ - "# this line will print the entire config of the model\n", - "config_path = f'{WORK_DIR}/configs/{MODEL_CONFIG}'\n", - "print(config_path)\n", - "config = OmegaConf.load(config_path)\n", - "print(OmegaConf.to_yaml(config))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZCgWzNBkaQLZ" - }, - "source": [ - "# Fine-tuning the model using Arman dataset\n", - "\n", - "Let's select a [`bert-base-multilingual-uncased`](https://huggingface.co/bert-base-multilingual-uncased) BERT model and fine-tune it on the Arman dataset.\n", - "\n", - "## Setting up Data within the config\n", - "\n", - "Among other things, the config file contains dictionaries called dataset, train_ds and validation_ds. These are configurations used to setup the Dataset and DataLoaders of the corresponding config.\n", - "\n", - "We assume that both training and evaluation files are in the same directory and use the default names mentioned during the data download step. 
\n", - "So, to start model training, we simply need to specify `model.dataset.data_dir`, like we are going to do below.\n", - "\n", - "Also notice that some config lines, including `model.dataset.data_dir`, have `???` in place of paths, this means that values for these fields are required to be specified by the user.\n", - "\n", - "Let us now add the data directory path to the config.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "LQHCJN-ZaoLp" - }, - "outputs": [], - "source": [ - "# in this tutorial train and dev datasets are located in the same folder, so it is enought to add the path of the data directory to the config\n", - "config.model.dataset.data_dir = DATA_DIR\n", - "\n", - "# if you want to use the full dataset, set NUM_SAMPLES to -1\n", - "NUM_SAMPLES = 1000\n", - "config.model.train_ds.num_samples = NUM_SAMPLES\n", - "config.model.validation_ds.num_samples = NUM_SAMPLES\n", - "\n", - "# for demonstartion purposes we're running only a single epoch\n", - "config.trainer.max_epochs = 5" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "nB96-3sTc3yk" - }, - "source": [ - "## Building the PyTorch Lightning Trainer\n", - "\n", - "NeMo models are primarily PyTorch Lightning modules - and therefore are entirely compatible with the PyTorch Lightning ecosystem.\n", - "\n", - "Let's first instantiate a Trainer object" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "1tG4FzZ4Ui60" - }, - "outputs": [], - "source": [ - "print(\"Trainer config - \\n\")\n", - "print(OmegaConf.to_yaml(config.trainer))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "knF6QeQQdMrH" - }, - "outputs": [], - "source": [ - "# lets modify some trainer configs\n", - "# checks if we have GPU available and uses it\n", - "accelerator = 'gpu' if torch.cuda.is_available() else 'cpu'\n", - "config.trainer.devices = 1\n", - "config.trainer.accelerator = 
accelerator\n", - "\n", - "config.trainer.precision = 16 if torch.cuda.is_available() else 32\n", - "\n", - "# for mixed precision training, uncomment the line below (precision should be set to 16 and amp_level to O1):\n", - "# config.trainer.amp_level = O1\n", - "\n", - "# remove distributed training flags\n", - "config.trainer.strategy = None\n", - "\n", - "# setup max number of steps to reduce training time for demonstration purposes of this tutorial\n", - "config.trainer.max_steps = 32\n", - "\n", - "trainer = pl.Trainer(**config.trainer)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8IlEMdVxdr6p" - }, - "source": [ - "## Setting up a NeMo Experiment¶\n", - "\n", - "NeMo has an experiment manager that handles logging and checkpointing for us, so let's use it:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "8uztqGAmdrYt" - }, - "outputs": [], - "source": [ - "exp_dir = exp_manager(trainer, config.get(\"exp_manager\", None))\n", - "\n", - "# the exp_dir provides a path to the current experiment for easy access\n", - "exp_dir = str(exp_dir)\n", - "exp_dir" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8tjLhUvL_o7_" - }, - "source": [ - "Before initializing the model, we might want to modify some of the model configs. 
For example, we might want to modify the pretrained BERT model:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Xeuc2i7Y_nP5" - }, - "outputs": [], - "source": [ - "# get the list of supported BERT-like models, for the complete list of HugginFace models, see https://huggingface.co/models\n", - "print(nemo_nlp.modules.get_pretrained_lm_models_list(include_external=True))\n", - "\n", - "# specify BERT-like model, you want to use\n", - "PRETRAINED_BERT_MODEL = \"bert-base-multilingual-uncased\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "RK2xglXyAUOO" - }, - "outputs": [], - "source": [ - "# add the specified above model parameters to the config\n", - "config.model.language_model.pretrained_model_name = PRETRAINED_BERT_MODEL" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fzNZNAVRjDD-" - }, - "source": [ - "Now, we are ready to initialize our model. During the model initialization call, the dataset and data loaders we'll be prepared for training and evaluation.\n", - "Also, the pretrained BERT model will be downloaded, note it can take up to a few minutes depending on the size of the chosen BERT model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "NgsGLydWo-6-" - }, - "outputs": [], - "source": [ - "model = nemo_nlp.models.TokenClassificationModel(cfg=config.model, trainer=trainer)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "kQ592Tx4pzyB" - }, - "source": [ - "## Monitoring training progress\n", - "Optionally, you can create a Tensorboard visualization to monitor training progress." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "mTJr16_pp0aS" - }, - "outputs": [], - "source": [ - "try:\n", - " from google import colab\n", - " COLAB_ENV = True\n", - "except (ImportError, ModuleNotFoundError):\n", - " COLAB_ENV = False\n", - "\n", - "# Load the TensorBoard notebook extension\n", - "if COLAB_ENV:\n", - " %load_ext tensorboard\n", - " %tensorboard --logdir {exp_dir}\n", - "else:\n", - " print(\"To use tensorboard, please use this notebook in a Google Colab environment.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Fj1pdEdD0Vm3" - }, - "source": [ - "See how it performs before fine-tuning" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "wo1oVGIT0aBZ" - }, - "outputs": [], - "source": [ - "# define the list of queries for inference\n", - "queries = [\n", - " 'حمید طاهایی افزود : برای اجرای این طرحها 0 میلیارد و 0 میلیون ریال اعتبار هزینه شده است . ',\n", - " 'دکتر اصغری دبیر چهارمین همایش انجمن زمین‌شناسی ایران در این زمینه گفت : از مجموع چهار صد مقاله رسیده به دبیرخانه همایش ، يك صد و هشتاد مقاله ظرف مدت دو روز در هشت سالن همایش برگزار شد . 
'\n", - "]\n", - "results = model.add_predictions(queries)\n", - "\n", - "for query, result in zip(queries, results):\n", - " print()\n", - " print(f'Query : {query}')\n", - " print(f'Result: {result.strip()}\\n')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "kyElt0Es-aSk" - }, - "outputs": [], - "source": [ - "print(\"Trainer config - \\n\")\n", - "print(OmegaConf.to_yaml(config.trainer))" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "hUvnSpyjp0Dh" - }, - "outputs": [], - "source": [ - "# start model training\n", - "trainer.fit(model)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "MOrR0PeJqa0j" - }, - "source": [ - "After the training is complete, `.nemo` file that contains model's checkpoints and all associated artifacts could be found under `nemo_experiments/token_classification_model/DATE_TIME`" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-lFo27PJ0o3W" - }, - "source": [ - "See how it gets better after:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "9fNcBnz80rLO" - }, - "outputs": [], - "source": [ - "results = model.add_predictions(queries)\n", - "\n", - "for query, result in zip(queries, results):\n", - " print()\n", - " print(f'Query : {query}')\n", - " print(f'Result: {result.strip()}\\n')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "JxBiIKMlH8yv" - }, - "source": [ - "After training for 100 epochs, with the default config and NUM_SAMPLES = -1 (i.e. 
all data is used), your model performance should look similar to this: \n", - "```\n", - " label precision recall f1 support\n", - " O (label_id: 0) 99.09 99.19 99.14 32867\n", - " B-event (label_id: 1) 67.74 70.00 68.85 90\n", - " B-fac (label_id: 2) 70.89 73.68 72.26 76\n", - " B-loc (label_id: 3) 87.45 82.70 85.01 497\n", - " B-org (label_id: 4) 81.88 87.06 84.39 649\n", - " B-pers (label_id: 5) 94.93 93.36 94.14 542\n", - " B-pro (label_id: 6) 79.31 70.41 74.59 98\n", - " I-event (label_id: 7) 87.38 74.72 80.55 352\n", - " I-fac (label_id: 8) 83.08 77.14 80.00 140\n", - " I-loc (label_id: 9) 77.78 73.39 75.52 124\n", - " I-org (label_id: 10) 86.51 89.93 88.18 834\n", - " I-pers (label_id: 11) 95.30 94.35 94.82 301\n", - " I-pro (label_id: 12) 82.86 86.57 84.67 67\n", - " -------------------\n", - " micro avg 97.78 97.78 97.78 36637\n", - " macro avg 84.17 82.50 83.24 36637\n", - " weighted avg 97.78 97.78 97.77 36637\n", - "```\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "VZp9STMHQAp1" - }, - "source": [ - "**References**\n", - "\n", - "1. Devlin, Jacob, et al. \"BERT: Pre-training of deep bidirectional transformers for language understanding.\" arXiv preprint arXiv:1810.04805 (2018).\n", - "\n", - "2. Hanieh Poostchi, Ehsan Zare Borzeshi, Mohammad Abdous, and Massimo Piccardi, \"PersoNER: Persian Named-Entity Recognition,\" The 26th International Conference on Computational Linguistics (COLING 2016), pages 3381–3389, Osaka, Japan, 2016.\n", - "\n", - "3. Hanieh Poostchi, Ehsan Zare Borzeshi, and Massimo Piccardi, \"BiLSTM-CRF for Persian Named-Entity Recognition; ArmanPersoNERCorpus: the First Entity-Annotated Persian Dataset,\" The 11th Edition of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan, 7-12 May 2018, ISLRN 399-379-640-828-6, ISLRN 921-509-141-609-6." 
- ] - } - ], + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "OETcTQlcguCm" + }, + "outputs": [], + "source": [ + "BRANCH = 'main'" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "o_0K1lsW1dj9" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Collecting nemo_toolkit[nlp]\r\n", + "\u001b[31mERROR: The URL 'git+https://github.com/NVIDIA/NeMo.git@#egg=nemo_toolkit[nlp]' has an empty revision (after @) which is not supported. Include a revision after @ or remove @ from the URL.\u001b[0m\r\n" + ] + } + ], + "source": [ + "\"\"\"\n", + "You can run this notebook either locally (if you have all the dependencies and a GPU) or on Google Colab.\n", + "\n", + "Instructions for setting up Colab are as follows:\n", + "1. Open a new Python 3 notebook.\n", + "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", + "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", + "4. Run this cell to set up dependencies.\n", + "\"\"\"\n", + "# If you're using Google Colab and not running locally, run this cell\n", + "\n", + "# install NeMo\n", + "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pC0slAc0h9zN", + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "# If you're not using Colab, you might need to upgrade jupyter notebook to avoid the following error:\n", + "# 'ImportError: IProgress not found. Please update jupyter and ipywidgets.'\n", + "\n", + "! pip install ipywidgets\n", + "! 
jupyter nbextension enable --py widgetsnbextension\n", + "\n", + "# Please restart the kernel after running this cell" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dzqD2WDFOIN-" + }, + "outputs": [], + "source": [ + "from nemo.collections import nlp as nemo_nlp\n", + "from nemo.utils.exp_manager import exp_manager\n", + "\n", + "import os\n", + "import wget \n", + "import torch\n", + "import pytorch_lightning as pl\n", + "from omegaconf import OmegaConf\n", + "\n", + "import zipfile\n", + "import random" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "daYw_Xll2ZR9" + }, + "source": [ + "# Tutorial Overview\n", + "In this tutorial, we will show how to use a pre-trained BERT language model on a non-English downstream task. Here we use the Persian language and the named entity recognition (NER) task as an example. Note that most of the other downstream tasks supported in NeMo should work similarly for other languages. \n", + "\n", + "# Task Description\n", + "NER is the task of detecting and classifying key information (entities) in text.\n", + "For example, in a sentence: `Mary lives in Santa Clara and works at NVIDIA`, we should detect that `Mary` is a person, `Santa Clara` is a location and `NVIDIA` is a company.\n", + "\n", + "In this tutorial we will be using the [BERT language model](https://arxiv.org/abs/1810.04805).\n", + "\n", + "To read more about other topics and downstream tasks supported in NeMo, see the [NeMo tutorial page](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZnuziSwJ1yEB" + }, + "source": [ + "# Dataset\n", + "\n", + "In this tutorial we are going to use the [Persian Arman dataset for our NER task](https://github.com/HaniehP/PersianNER).\n", + "\n", + "Arman is a hand-annotated Persian corpus for the NER task with 250,015 tokens and 7,682 sentences. 
Using [IOB encoding](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)), tokens are labeled with one of the following named entities or with O. \n", + "\n", + "* event = event\n", + "* fac = facility\n", + "* loc = location\n", + "* org = organization\n", + "* pers = person\n", + "* pro = product\n", + "\n", + "Each of these has a label starting with **B**, which marks the first token of the named entity, and a label starting with **I** for the remaining tokens. \n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qzcZ3nb_-SVT" + }, + "source": [ + "# NeMo Token Classification Data Format\n", + "\n", + "[TokenClassification Model](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/token_classification/token_classification_model.py) in NeMo supports NER and other token level classification tasks, as long as the data follows the format specified below. \n", + "\n", + "Token Classification Model requires the data to be split into 2 files: \n", + "* text.txt \n", + "* labels.txt. 
\n", + "\n", + "Each line of the **text.txt** file contains text sequences, where words are separated with spaces, i.e.: \n", + "[WORD] [SPACE] [WORD] [SPACE] [WORD].\n", + "\n", + "The **labels.txt** file contains corresponding labels for each word in text.txt, the labels are separated with spaces, i.e.:\n", + "[LABEL] [SPACE] [LABEL] [SPACE] [LABEL].\n", + "\n", + "Example of a text.txt file:\n", + "```\n", + "دبیر شورای عالی انقلاب فرهنگی از گنجانده شدن 5 زبان خارجی جدید در برنامه درسی مدارس خبر داد.\n", + "```\n", + "Corresponding labels.txt file:\n", + "```\n", + "O B_ORG I_ORG I_ORG I_ORG O O O O O O O O O O O O O O \n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SL58EWkd2ZVb" + }, + "source": [ + "## Download and preprocess the data¶" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_z2tCEIXZa90" + }, + "source": [ + "You can download the Arman dataset by cloning to the following github repository: https://github.com/HaniehP/PersianNER.\n", + "\n", + "After downloading the data, you will see a few files and folders inside a directory named PersianNER. 
Take ArmanPersoNERCorpus.zip and upload it to `DATA_DIR` (if running in a docker or locally) or use **files** from Google Colab to upload the files.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "n8HZrDmr12_-" + }, + "outputs": [], + "source": [ + "# path to the folder with ArmanPersoNERCorpus.zip file (if running locally or in a docker)\n", + "DATA_DIR = \"PATH_TO_FOLDER_WITH_ZIP.ZIP_FILE\"\n", + "WORK_DIR = \"WORK_DIR\"\n", + "MODEL_CONFIG = \"token_classification_config.yaml\"\n", + "os.makedirs(WORK_DIR, exist_ok=True)\n", + "os.makedirs(DATA_DIR, exist_ok=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "k1TmF5rrdPMj" + }, + "outputs": [], + "source": [ + "if 'google.colab' in str(get_ipython):\n", + " from google.colab import files\n", + " uploaded = files.upload() " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "HTUKJOownkrF" + }, + "outputs": [], + "source": [ + "if 'google.colab' in str(get_ipython):\n", + " ! mv ArmanPersoNERCorpus.zip $DATA_DIR/." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NhUzIeF0Yg0l" + }, + "source": [ + "Let's extract files from the zip file. It will generate three test and train files which have overlaps and are intended to be used in turn as train and test sets. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Y01BdjPRW-7B" + }, + "outputs": [], + "source": [ + "! cd {DATA_DIR} && unzip \"ArmanPersoNERCorpus.zip\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qaDgL-sQaX2e" + }, + "source": [ + "Next, we will be putting all data into a single file and removing any repeated sentences. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "B0T4CzJvbBJ4" + }, + "outputs": [], + "source": [ + "file_all = os.path.join(DATA_DIR, \"all_data.txt\")\n", + "with open(file_all, \"w\") as f1:\n", + " for filename in os.listdir(DATA_DIR):\n", + " if (filename == \"ReadMe.txt\" or filename == \"ArmanPersoNERCorpus.zip\" or filename == \"all_data.txt\"):\n", + " continue\n", + " with open(DATA_DIR + \"/\" + filename, \"r\", encoding = \"ISO-8859-1\") as f2:\n", + " for line in f2:\n", + " f1.write(line)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VzVuET8HESFB" + }, + "source": [ + "Now, you need to convert this data into NeMo compatible format before starting the training process. For this purpose, you can run [examples/nlp/token_classification/data/import_from_iob_format.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/token_classification/data/import_from_iob_format.py) on your train and dev files, as follows:\n", + "\n", + "\n", + "\n", + "\n", + "```\n", + "python examples/nlp/token_classification/data/import_from_iob_format.py --data_file PATH_TO_IOB_FORMAT_DATAFILE, e.g., \"DATA_DIR/all_data.txt\"\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ord_6KlkeNl8" + }, + "outputs": [], + "source": [ + "!wget https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/token_classification/data/import_from_iob_format.py" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "IfSUkxffeSpL" + }, + "outputs": [], + "source": [ + "!python import_from_iob_format.py --data_file $DATA_DIR/all_data.txt" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Aj0rXbYXbivW" + }, + "source": [ + "Now we process the data to remove potentially any repeated sentences and then split them into train and dev sets. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CgvnTlqzbq5-" + }, + "outputs": [], + "source": [ + "sent_dict = dict()\n", + "line_removed = dict()\n", + "line_counter = 0\n", + "with open(DATA_DIR + \"/text_all_not_repeated.txt\", \"w\") as f1:\n", + " with open(DATA_DIR + \"/text_all_data.txt\", \"r\") as f2:\n", + " for line in f2:\n", + " line_counter += 1\n", + " if (not line in sent_dict):\n", + " sent_dict[line] = 1\n", + " f1.write(line)\n", + " else:\n", + " line_removed[line_counter] = 1\n", + "#labels:\n", + "line_counter = 0\n", + "with open(DATA_DIR + \"/labels_all_not_repeated.txt\", \"w\") as f1:\n", + " with open(DATA_DIR + \"/labels_all_data.txt\", \"r\") as f2:\n", + " for line in f2:\n", + " line_counter += 1\n", + " if(not line_counter in line_removed):\n", + " f1.write(line)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0cO3crs_gXjt" + }, + "source": [ + "After preprocessing the data and removing repeated sentences, there will be 7668 total valid sentences. We will be using 85% of that as train and 15% as dev. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7oHQYsMMbugP" + }, + "outputs": [], + "source": [ + "total_data = 7668\n", + "train_share = 0.85\n", + "used_lines_train = dict()\n", + "flag = 1\n", + "count = 0\n", + "while flag:\n", + " idx = random.randint(1, total_data)\n", + " if (not idx in used_lines_train):\n", + " used_lines_train[idx] = 1\n", + " count += 1\n", + " if (count/total_data > train_share):\n", + " flag = 0\n", + "\n", + "line_counter = 0\n", + "with open(DATA_DIR+ \"/text_train.txt\", \"w\") as f1:\n", + " with open(DATA_DIR + \"/text_dev.txt\", \"w\") as f2:\n", + " with open(DATA_DIR + \"/text_all_not_repeated.txt\", \"r\") as f3:\n", + " for line in f3:\n", + " line_counter += 1\n", + " if (line_counter in used_lines_train):\n", + " f1.write(line)\n", + " else:\n", + " f2.write(line)\n", + "\n", + "line_counter = 0\n", + "with open(DATA_DIR + \"/labels_train.txt\", \"w\") as f1:\n", + " with open(DATA_DIR + \"/labels_dev.txt\", \"w\") as f2:\n", + " with open(DATA_DIR + \"/labels_all_not_repeated.txt\", \"r\") as f3:\n", + " for line in f3:\n", + " line_counter += 1\n", + " if (line_counter in used_lines_train):\n", + " f1.write(line)\n", + " else:\n", + " f2.write(line)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1Q-GWNwDbzKl" + }, + "source": [ + "Finally, we remove files that are not needed anymore." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "II20ustub5BF" + }, + "outputs": [], + "source": [ + "print(\"Removed files:\")\n", + "for filename in os.listdir(DATA_DIR):\n", + " if (filename == \"text_dev.txt\" or filename == \"text_train.txt\" or filename == \"labels_dev.txt\" or filename == \"labels_train.txt\"):\n", + " continue\n", + " print(filename)\n", + " os.remove(DATA_DIR + \"/\" + filename)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "U8Ty5_S7Ye8h" + }, + "source": [ + "Now, the data folder should contain these 4 files:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L8vsyh3JZH26" + }, + "source": [ + "\n", + "\n", + "* labels_dev.txt\n", + "* labels_train.txt\n", + "* text_dev.txt\n", + "* text_train.txt\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qB0oLE4R9EhJ" + }, + "outputs": [], + "source": [ + "! ls -l {DATA_DIR}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6UDPgadLN6SG" + }, + "outputs": [], + "source": [ + "# let's take a look at the data \n", + "print('Text:')\n", + "! head -n 5 {DATA_DIR}/text_train.txt\n", + "\n", + "print('\\nLabels:')\n", + "! head -n 5 {DATA_DIR}/labels_train.txt" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_whKCxfTMo6Y" + }, + "source": [ + "# Model configuration\n", + "\n", + "Our Named Entity Recognition model is comprised of the pretrained [BERT](https://arxiv.org/pdf/1810.04805.pdf) model followed by a Token Classification layer.\n", + "\n", + "The model is defined in a config file which declares multiple important sections. 
They are:\n", + "- **model**: All arguments that are related to the Model - language model, token classifier, optimizer and schedulers, datasets and any other related information\n", + "\n", + "- **trainer**: Any argument to be passed to PyTorch Lightning" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "T1gA8PsJ13MJ" + }, + "outputs": [], + "source": [ + "# download the model's configuration file \n", + "config_dir = WORK_DIR + '/configs/'\n", + "os.makedirs(config_dir, exist_ok=True)\n", + "if not os.path.exists(config_dir + MODEL_CONFIG):\n", + " print('Downloading config file...')\n", + " wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/token_classification/conf/' + MODEL_CONFIG, config_dir)\n", + "else:\n", + " print('Config file already exists')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mX3KmWMvSUQw" + }, + "outputs": [], + "source": [ + "# this line will print the entire config of the model\n", + "config_path = f'{WORK_DIR}/configs/{MODEL_CONFIG}'\n", + "print(config_path)\n", + "config = OmegaConf.load(config_path)\n", + "print(OmegaConf.to_yaml(config))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZCgWzNBkaQLZ" + }, + "source": [ + "# Fine-tuning the model using the Arman dataset\n", + "\n", + "Let's select a [`bert-base-multilingual-uncased`](https://huggingface.co/bert-base-multilingual-uncased) BERT model and fine-tune it on the Arman dataset.\n", + "\n", + "## Setting up Data within the config\n", + "\n", + "Among other things, the config file contains dictionaries called dataset, train_ds and validation_ds. These are configurations used to set up the Dataset and DataLoaders of the corresponding config.\n", + "\n", + "We assume that both training and evaluation files are in the same directory and use the default names mentioned during the data download step. 
\n", + "So, to start model training, we simply need to specify `model.dataset.data_dir`, as we do below.\n", + "\n", + "Also notice that some config lines, including `model.dataset.data_dir`, have `???` in place of paths; this means that the user must specify values for these fields.\n", + "\n", + "Let us now add the data directory path to the config.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LQHCJN-ZaoLp" + }, + "outputs": [], + "source": [ + "# in this tutorial train and dev datasets are located in the same folder, so it is enough to add the path of the data directory to the config\n", + "config.model.dataset.data_dir = DATA_DIR\n", + "\n", + "# if you want to use the full dataset, set NUM_SAMPLES to -1\n", + "NUM_SAMPLES = 1000\n", + "config.model.train_ds.num_samples = NUM_SAMPLES\n", + "config.model.validation_ds.num_samples = NUM_SAMPLES\n", + "\n", + "# for demonstration purposes we're running only a few epochs\n", + "config.trainer.max_epochs = 5" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nB96-3sTc3yk" + }, + "source": [ + "## Building the PyTorch Lightning Trainer\n", + "\n", + "NeMo models are primarily PyTorch Lightning modules - and therefore are entirely compatible with the PyTorch Lightning ecosystem.\n", + "\n", + "Let's first instantiate a Trainer object" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "1tG4FzZ4Ui60" + }, + "outputs": [], + "source": [ + "print(\"Trainer config - \\n\")\n", + "print(OmegaConf.to_yaml(config.trainer))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "knF6QeQQdMrH" + }, + "outputs": [], + "source": [ + "# let's modify some trainer configs\n", + "# checks if we have GPU available and uses it\n", + "accelerator = 'gpu' if torch.cuda.is_available() else 'cpu'\n", + "config.trainer.devices = 1\n", + "config.trainer.accelerator = 
accelerator\n", + "\n", + "config.trainer.precision = 16 if torch.cuda.is_available() else 32\n", + "\n", + "# for mixed precision training, uncomment the line below (precision should be set to 16 and amp_level to O1):\n", + "# config.trainer.amp_level = 'O1'\n", + "\n", + "# remove distributed training flags\n", + "config.trainer.strategy = None\n", + "\n", + "# set the max number of steps to reduce training time for demonstration purposes of this tutorial\n", + "config.trainer.max_steps = 32\n", + "\n", + "trainer = pl.Trainer(**config.trainer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8IlEMdVxdr6p" + }, + "source": [ + "## Setting up a NeMo Experiment\n", + "\n", + "NeMo has an experiment manager that handles logging and checkpointing for us, so let's use it:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8uztqGAmdrYt" + }, + "outputs": [], + "source": [ + "exp_dir = exp_manager(trainer, config.get(\"exp_manager\", None))\n", + "\n", + "# the exp_dir provides a path to the current experiment for easy access\n", + "exp_dir = str(exp_dir)\n", + "exp_dir" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8tjLhUvL_o7_" + }, + "source": [ + "Before initializing the model, we might want to modify some of the model configs. 
For example, we might want to modify the pretrained BERT model:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Xeuc2i7Y_nP5" + }, + "outputs": [], + "source": [ + "# get the list of supported BERT-like models; for the complete list of Hugging Face models, see https://huggingface.co/models\n", + "print(nemo_nlp.modules.get_pretrained_lm_models_list(include_external=True))\n", + "\n", + "# specify the BERT-like model you want to use\n", + "PRETRAINED_BERT_MODEL = \"bert-base-multilingual-uncased\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RK2xglXyAUOO" + }, + "outputs": [], + "source": [ + "# add the model parameters specified above to the config\n", + "config.model.language_model.pretrained_model_name = PRETRAINED_BERT_MODEL" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fzNZNAVRjDD-" + }, + "source": [ + "Now, we are ready to initialize our model. During the model initialization call, the dataset and data loaders will be prepared for training and evaluation.\n", + "Also, the pretrained BERT model will be downloaded. Note that this can take up to a few minutes depending on the size of the chosen BERT model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NgsGLydWo-6-" + }, + "outputs": [], + "source": [ + "model = nemo_nlp.models.TokenClassificationModel(cfg=config.model, trainer=trainer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kQ592Tx4pzyB" + }, + "source": [ + "## Monitoring training progress\n", + "Optionally, you can create a TensorBoard visualization to monitor training progress." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mTJr16_pp0aS" + }, + "outputs": [], + "source": [ + "try:\n", + " from google import colab\n", + " COLAB_ENV = True\n", + "except (ImportError, ModuleNotFoundError):\n", + " COLAB_ENV = False\n", + "\n", + "# Load the TensorBoard notebook extension\n", + "if COLAB_ENV:\n", + " %load_ext tensorboard\n", + " %tensorboard --logdir {exp_dir}\n", + "else:\n", + " print(\"To use tensorboard, please use this notebook in a Google Colab environment.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fj1pdEdD0Vm3" + }, + "source": [ + "See how it performs before fine-tuning" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wo1oVGIT0aBZ" + }, + "outputs": [], + "source": [ + "# define the list of queries for inference\n", + "queries = [\n", + " 'حمید طاهایی افزود : برای اجرای این طرحها 0 میلیارد و 0 میلیون ریال اعتبار هزینه شده است . ',\n", + " 'دکتر اصغری دبیر چهارمین همایش انجمن زمین‌شناسی ایران در این زمینه گفت : از مجموع چهار صد مقاله رسیده به دبیرخانه همایش ، يك صد و هشتاد مقاله ظرف مدت دو روز در هشت سالن همایش برگزار شد . 
'\n", + "]\n", + "results = model.add_predictions(queries)\n", + "\n", + "for query, result in zip(queries, results):\n", + " print()\n", + " print(f'Query : {query}')\n", + " print(f'Result: {result.strip()}\\n')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kyElt0Es-aSk" + }, + "outputs": [], + "source": [ + "print(\"Trainer config - \\n\")\n", + "print(OmegaConf.to_yaml(config.trainer))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "hUvnSpyjp0Dh" + }, + "outputs": [], + "source": [ + "# start model training\n", + "trainer.fit(model)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MOrR0PeJqa0j" + }, + "source": [ + "After the training is complete, the `.nemo` file that contains the model's checkpoints and all associated artifacts can be found under `nemo_experiments/token_classification_model/DATE_TIME`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-lFo27PJ0o3W" + }, + "source": [ + "See how it improves after fine-tuning:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9fNcBnz80rLO" + }, + "outputs": [], + "source": [ + "results = model.add_predictions(queries)\n", + "\n", + "for query, result in zip(queries, results):\n", + " print()\n", + " print(f'Query : {query}')\n", + " print(f'Result: {result.strip()}\\n')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JxBiIKMlH8yv" + }, + "source": [ + "After training for 100 epochs, with the default config and NUM_SAMPLES = -1 (i.e. 
all data is used), your model performance should look similar to this: \n", + "```\n", + " label precision recall f1 support\n", + " O (label_id: 0) 99.09 99.19 99.14 32867\n", + " B-event (label_id: 1) 67.74 70.00 68.85 90\n", + " B-fac (label_id: 2) 70.89 73.68 72.26 76\n", + " B-loc (label_id: 3) 87.45 82.70 85.01 497\n", + " B-org (label_id: 4) 81.88 87.06 84.39 649\n", + " B-pers (label_id: 5) 94.93 93.36 94.14 542\n", + " B-pro (label_id: 6) 79.31 70.41 74.59 98\n", + " I-event (label_id: 7) 87.38 74.72 80.55 352\n", + " I-fac (label_id: 8) 83.08 77.14 80.00 140\n", + " I-loc (label_id: 9) 77.78 73.39 75.52 124\n", + " I-org (label_id: 10) 86.51 89.93 88.18 834\n", + " I-pers (label_id: 11) 95.30 94.35 94.82 301\n", + " I-pro (label_id: 12) 82.86 86.57 84.67 67\n", + " -------------------\n", + " micro avg 97.78 97.78 97.78 36637\n", + " macro avg 84.17 82.50 83.24 36637\n", + " weighted avg 97.78 97.78 97.77 36637\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VZp9STMHQAp1" + }, + "source": [ + "**References**\n", + "\n", + "1. Devlin, Jacob, et al. \"BERT: Pre-training of deep bidirectional transformers for language understanding.\" arXiv preprint arXiv:1810.04805 (2018).\n", + "\n", + "2. Hanieh Poostchi, Ehsan Zare Borzeshi, Mohammad Abdous, and Massimo Piccardi, \"PersoNER: Persian Named-Entity Recognition,\" The 26th International Conference on Computational Linguistics (COLING 2016), pages 3381–3389, Osaka, Japan, 2016.\n", + "\n", + "3. Hanieh Poostchi, Ehsan Zare Borzeshi, and Massimo Piccardi, \"BiLSTM-CRF for Persian Named-Entity Recognition; ArmanPersoNERCorpus: the First Entity-Annotated Persian Dataset,\" The 11th Edition of the Language Resources and Evaluation Conference (LREC), Miyazaki, Japan, 7-12 May 2018, ISLRN 399-379-640-828-6, ISLRN 921-509-141-609-6." 
+ ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "collapsed_sections": [], + "name": "Non_English_Downstream_Tasks_(NER).ipynb", + "private_outputs": true, + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.10" + }, + "pycharm": { + "stem_cell": { + "cell_type": "raw", "metadata": { - "accelerator": "GPU", - "colab": { - "collapsed_sections": [], - "name": "Non_English_Downstream_Tasks_(NER).ipynb", - "private_outputs": true, - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.9" - }, - "pycharm": { - "stem_cell": { - "cell_type": "raw", - "metadata": { - "collapsed": false - }, - "source": [] - } - } + "collapsed": false }, - "nbformat": 4, - "nbformat_minor": 1 -} \ No newline at end of file + "source": [] + } + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/tutorials/nlp/Punctuation_and_Capitalization.ipynb b/tutorials/nlp/Punctuation_and_Capitalization.ipynb index cadd50840c3d..6d750a5c8dbc 100644 --- a/tutorials/nlp/Punctuation_and_Capitalization.ipynb +++ b/tutorials/nlp/Punctuation_and_Capitalization.ipynb @@ -990,14 +990,15 @@ " 'tokens_in_batch': 1024,\n", " },\n", ")\n", - "pretrained_model.setup_training_data()\n", - "pretrained_model.setup_validation_data()\n", "\n", "# and now we can create a PyTorch Lightning trainer and call `fit` again\n", "# for this tutorial we are setting fast_dev_run to 
True, and the trainer will run 1 training batch and 1 validation batch\n", "# for actual model training, disable the flag\n", "fast_dev_run = True\n", "trainer = pl.Trainer(devices=1, accelerator='gpu', fast_dev_run=fast_dev_run)\n", + "pretrained_model.set_trainer(trainer)\n", + "pretrained_model.setup_training_data()\n", + "pretrained_model.setup_validation_data()\n", "trainer.fit(pretrained_model)" ] } diff --git a/tutorials/speaker_tasks/ASR_with_SpeakerDiarization.ipynb b/tutorials/speaker_tasks/ASR_with_SpeakerDiarization.ipynb index 3cb53b0f9e3b..40076142d391 100644 --- a/tutorials/speaker_tasks/ASR_with_SpeakerDiarization.ipynb +++ b/tutorials/speaker_tasks/ASR_with_SpeakerDiarization.ipynb @@ -663,4 +663,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} \ No newline at end of file +} diff --git a/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb b/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb index ba19884f90ed..d13cb9d6b582 100644 --- a/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb +++ b/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb @@ -593,4 +593,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} \ No newline at end of file +} diff --git a/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb b/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb index 912ce5556041..59fa2c7f46b4 100644 --- a/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb +++ b/tutorials/speaker_tasks/Speaker_Identification_Verification.ipynb @@ -1264,4 +1264,4 @@ }, "nbformat": 4, "nbformat_minor": 1 -} \ No newline at end of file +} diff --git a/tutorials/text_processing/ITN_with_Thutmose_Tagger.ipynb b/tutorials/text_processing/ITN_with_Thutmose_Tagger.ipynb index 4c7c9247d9e0..b72cee51003b 100644 --- a/tutorials/text_processing/ITN_with_Thutmose_Tagger.ipynb +++ b/tutorials/text_processing/ITN_with_Thutmose_Tagger.ipynb @@ -113,6 +113,7 @@ }, "outputs": [], "source": [ + "!rm -r en_data_small\n", 
"!wget \"https://multilangaudiosamples.s3.us-east-2.amazonaws.com/en_data_small.zip\"\n", "!unzip en_data_small" ] @@ -1031,7 +1032,7 @@ "source": [ "# Inference with a pretrained model\n", "\n", - "We can also run inference with a pretrained model [itn_en_thutmose_bert](https://catalog.ngc.nvidia.com/orgs/nvidia/models/itn_en_thutmose_bert).\n", + "We can also run inference with a pretrained model [itn_en_thutmose_bert](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/itn_en_thutmose_bert).\n", "This is how to use it directly from python." ], "metadata": {