Fix CommonVoice for French #1126

AzizCode92 · 2020-12-24T13:17:05Z

In response to this issue #1125

jeremyfix · 2020-12-24T13:52:39Z

Thanks for the quick reply.

Unfortunately, it does not work on Ubuntu, because os.path.join inserts the directory separator '/' before the extension and therefore builds paths like:

/opt/Datasets/CommonVoice/fr_79h_2019-02-25/fr/clips/2760b7263452ca460dd0ad387bbc091a7ef0eb5e5b28d19742047ed438857c123f47834a24b3087f2e9eb25ec3f9b506ff61ad60fb22b00e4aa3125e5920374c/.mp3

see the last '/.mp3'

Instead of

filename = os.path.join(path, folder_audio, fileid, ext_audio)

what about

filename = os.path.join(path, folder_audio, fileid + ext_audio)
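To illustrate the difference, here is a minimal sketch. It uses posixpath, which behaves like os.path on Linux, so the result is the same on any platform. The directory and the shortened file id are placeholder values, not the real dataset layout.

```python
import posixpath  # same behaviour as os.path on Linux, regardless of host OS

# Placeholder values for illustration; the real file id is a long hash.
path = "/opt/Datasets/CommonVoice/fr"
folder_audio = "clips"
fileid = "2760b726"
ext_audio = ".mp3"

# Passing the extension as its own component inserts a separator before it.
broken = posixpath.join(path, folder_audio, fileid, ext_audio)

# Concatenating first keeps the extension attached to the file name.
fixed = posixpath.join(path, folder_audio, fileid + ext_audio)

print(broken)
print(fixed)
```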

AzizCode92 · 2020-12-24T13:55:10Z

Hi @jeremyfix, for the moment I can't reproduce it locally because I get the error 'torchaudio C++ extension is not available.'
I will make a change anyway and see whether it makes the CI/CD pipeline green again.
Thanks.

jeremyfix · 2020-12-24T14:28:09Z

@AzizCode92 I tried to find out why some unit tests failed, and I'm wondering whether the content of the tsv file has changed.

Indeed, these are the first lines of fr_79h_2019-02-25/fr/train.tsv in my recent download of CommonVoice:

client_id       path    sentence        up_votes        down_votes      age     gender  accent
a2e8e1e1cc74d08c92a53d7b9ff84e077eb90410edd85b8882f16fd037cecfcb6a19413c6c63ce6458cfea9579878fa91cef18343441c601cae0597a4b0d3144        89e67e7682b36786a0b4b4022c4d42090c86edd96c78c12d30088e62522b8fe466ea4912e6a1055dfb91b296a0743e0a2bbe16cebac98ee5349e3e8262cb9329        Or sur ce point nous n’avons aucune réponse de votre part.      2       0       twenties        male    france
a2e8e1e1cc74d08c92a53d7b9ff84e077eb90410edd85b8882f16fd037cecfcb6a19413c6c63ce6458cfea9579878fa91cef18343441c601cae0597a4b0d3144        87d71819a26179e93acfee149d0b21b7bf5e926e367d80b2b3792d45f46e04853a514945783ff764c1fc237b4eb0ee2b0a7a7cbd395acbdfcfa9d76a6e199bbd        Monsieur de La Verpillière, laissez parler le ministre  2       0       twenties        male    france

Notice that the second column, which holds the path if I'm not mistaken, does not contain any extension.

If I now look at your unit test, you provide a wav extension in the second column, which contains the path to the clip.

Therefore, a possible fix could be to remove the "wav" extension from the _train_csv_contents list.

I also think audio_path then needs to be adapted to append the COMMONVOICE._ext_audio extension.
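A quick way to check which convention a given train.tsv follows is sketched below. The helper name and the two shortened sample rows are made up for illustration; only the tab-separated layout and the "path" column name come from the files quoted in this thread.

```python
import csv
import io
import os


def path_column_has_extension(tsv_text):
    """Return one boolean per row: does the 'path' value carry a file extension?"""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [bool(os.path.splitext(row["path"])[1]) for row in reader]


# Shortened, hypothetical rows in the two styles discussed here.
fr_style = "client_id\tpath\tsentence\nabc\t89e67e76\tBonjour\n"
en_style = "client_id\tpath\tsentence\nabc\tcommon_voice_en_20009651.mp3\tHello\n"

print(path_column_has_extension(fr_style))
print(path_column_has_extension(en_style))
```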

jeremyfix · 2020-12-24T14:46:10Z

@AzizCode92 If we face the same os.path.join issue as before, I believe the line

audio_path = os.path.join(audio_base_path, content[1], COMMONVOICE._ext_audio)

should be changed to

audio_path = os.path.join(audio_base_path, content[1] + COMMONVOICE._ext_audio)

AzizCode92 · 2020-12-24T14:47:04Z

@jeremyfix yes, you're right. Good catch.
thanks

jeremyfix · 2020-12-24T15:10:13Z

@AzizCode92

This time TestCommonVoice passed, but the CI/CD pipeline fails because of:

torchaudio_unittest/datasets/utils_test.py::TestIterator::test_bg_iterator 

Too long with no output (exceeded 10m0s): context deadline exceeded

AzizCode92 · 2020-12-24T15:11:49Z

Yes, the tests related to CommonVoice passed.
I will try to restart the pipeline.

AzizCode92 · 2020-12-24T15:18:42Z

@jeremyfix
I couldn't restart it manually, so I pushed a new commit to trigger it again. Let's wait and see.

jeremyfix · 2020-12-24T16:04:14Z

@AzizCode92 apparently, still the same pipeline issue.

mthrok · 2020-12-24T16:21:45Z

@AzizCode92 , @jeremyfix

Thanks for working on this. However, the current approach does not seem to be the right one.
The fundamental issue is that different languages use different formats for the path value.

For English ver 5.1, the extension is included, and this is how we tested the implementation.

$ head -3 CommonVoice/cv-corpus-5.1-2020-06-22/en/train.tsv
client_id	path	sentence	up_votes	down_votes	age	gender	accent	locale	segment
f1f6414c04e74453065e1b7fc1639c6f728dc03ed9589034b8531d8c7d8b994f223f2b79d5fcc42a2b7b19f8cbca5af08f31d47a554ddd682df04ba62caaaa56	common_voice_en_20009651.mp3	"It just didn't seem fair."	2	1				en
f1f6414c04e74453065e1b7fc1639c6f728dc03ed9589034b8531d8c7d8b994f223f2b79d5fcc42a2b7b19f8cbca5af08f31d47a554ddd682df04ba62caaaa56	common_voice_en_20009653.mp3	The anticipated synergies of the two modes of transportation were entirely absent.	2	0				en

But it looks like the French version does not include the extension as part of the path.
So we need to do some sort of file existence check, with or without the additional extension.
So instead of modifying the existing test, we need to add new test samples, then make the implementation handle both cases.
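One way such an existence check could look is sketched below. resolve_clip_path is a hypothetical helper written for this illustration, not torchaudio code: it tries the path value as given, then the path value with the extension appended.

```python
import os
import tempfile


def resolve_clip_path(clips_dir, fileid, ext_audio=".mp3"):
    """Return the clip path, trying the bare name first, then name + extension.

    Hypothetical helper for illustration; not part of torchaudio.
    """
    candidate = os.path.join(clips_dir, fileid)
    if os.path.isfile(candidate):
        return candidate
    with_ext = candidate + ext_audio
    if os.path.isfile(with_ext):
        return with_ext
    raise FileNotFoundError(f"No clip found for {fileid!r} in {clips_dir!r}")


# Demo with an empty placeholder file standing in for a real clip.
with tempfile.TemporaryDirectory() as clips_dir:
    open(os.path.join(clips_dir, "clip_0.mp3"), "wb").close()
    # English-style value (extension included) and French-style value (no extension)
    print(resolve_clip_path(clips_dir, "clip_0.mp3"))
    print(resolve_clip_path(clips_dir, "clip_0"))
```

The cost of this variant is one or two filesystem calls per clip, which is why a purely string-based check may be preferable when the extension is known.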

I would extract the construction of the mock test dataset (what is done in setUpClass right now) into a helper function, then add a second helper function that builds the mock test dataset for the French case, and reuse the test cases.

This also makes me wonder: is there any format other than mp3?

mthrok · 2020-12-24T16:32:17Z

And if you would like to run the tests locally, you can build it with

python setup.py develop

mthrok

Thanks @AzizCode92

Please check out my comments. The tricky thing is that we use the "wav" format in the tests, whereas the real CommonVoice dataset is composed of "mp3" files.

mthrok · 2020-12-25T16:36:39Z

torchaudio/datasets/commonvoice.py

+    if fileid.endswith(".wav"):
+        filename = os.path.join(path, folder_audio, fileid)
+    else:
+        filename = os.path.join(path, folder_audio, fileid + ext_audio)


I believe CommonVoice is all in mp3 format, so checking against .wav would not work outside of the unit test, where we intentionally change it to wav because of the limited mp3 support on Windows.

Since you are passing ext_audio from CommonVoice._ext_audio in this implementation, you can do

filename = os.path.join(path, folder_audio, fileid)
if not filename.endswith(ext_audio):
    filename += ext_audio
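As a sanity check, the suggestion can be wrapped in a small function and run against both path styles. build_filename, the "root" directory and the shortened French file id are placeholders for this sketch, and posixpath is used only to keep the output platform-independent.

```python
import posixpath


def build_filename(path, folder_audio, fileid, ext_audio=".mp3"):
    """Join the path components and append the extension only if it is missing."""
    filename = posixpath.join(path, folder_audio, fileid)
    if not filename.endswith(ext_audio):
        filename += ext_audio
    return filename


# English-style value: the extension is already part of the path column.
english = build_filename("root", "clips", "common_voice_en_20009651.mp3")
# French-style value: no extension in the path column.
french = build_filename("root", "clips", "89e67e76")

print(english)
print(french)
```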

mthrok · 2020-12-25T16:37:24Z

test/torchaudio_unittest/datasets/commonvoice_test.py


-    def test_commonvoice_path(self):
+    def test_fr_commonvoice_path(self):


You do not need the path version of the test. One is enough.

mthrok · 2020-12-25T16:49:40Z

test/torchaudio_unittest/datasets/commonvoice_test.py

@@ -11,29 +10,49 @@
    normalize_wav,
 )

+from torchaudio.datasets import COMMONVOICE
+

 class TestCommonVoice(TempDirMixin, TorchaudioTestCase):


Rather than putting the data in a class attribute, which makes it harder to see the test logic, can you extract it into plain functions?

def get_mock_dataset_en(root_dir) -> Tuple[Tensor, int, Dict[str, str]]:
    # create files...
    ...

def get_mock_dataset_fr(root_dir) -> Tuple[Tensor, int, Dict[str, str]]:
    # create files...
    ...

Then split the test classes by language. setUpClass should then be something as simple as the following.

Also, since we use the wav format in the tests, CommonVoice._ext_audio should be changed to ".wav" before the tests run.

original_ext_audio = CommonVoice._ext_audio

class TestCommonVoiceEn(TempDirMixin, TorchaudioTestCase):
    backend = 'default'

    root_dir = None
    data = []

    @classmethod
    def setUpClass(cls):
        cls.root_dir = cls.get_base_temp_dir()
        cls.data = get_mock_dataset_en(cls.root_dir)
        CommonVoice._ext_audio = ".wav"

    @classmethod
    def tearDownClass(cls):
        CommonVoice._ext_audio = original_ext_audio

    def test_commonvoice_str(self):
        # this function should stay untouched
        ...

    def test_commonvoice_path(self):
        # this function should stay untouched
        ...

The same goes for French, but there is no need to add the path version.
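The patch-and-restore part of this suggestion can be tried in isolation with the standard library only. In the sketch below, FakeDataset is a made-up stand-in for the dataset class and run_suite is a small convenience wrapper; neither exists in torchaudio.

```python
import io
import unittest


class FakeDataset:
    """Hypothetical stand-in for a dataset class with a class-level extension."""
    _ext_audio = ".mp3"


_ORIGINAL_EXT_AUDIO = FakeDataset._ext_audio


class TestWithWavExtension(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Switch to wav for the lifetime of this test class only.
        FakeDataset._ext_audio = ".wav"

    @classmethod
    def tearDownClass(cls):
        # Restore the original value so other test classes are unaffected.
        FakeDataset._ext_audio = _ORIGINAL_EXT_AUDIO

    def test_extension_is_patched(self):
        self.assertEqual(FakeDataset._ext_audio, ".wav")


def run_suite():
    """Run the test class and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestWithWavExtension)
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)


result = run_suite()
print(result.wasSuccessful(), FakeDataset._ext_audio)
```

Restoring the value in tearDownClass matters because the attribute lives on the class, so a forgotten restore would leak ".wav" into every test that runs afterwards in the same process.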

AzizCode92

Hi @mthrok, thanks a lot for your feedback.
I took your comments into consideration. I tried the changes locally and the tests were green.
I have pushed my changes now.
I hope I covered all your suggestions.

AzizCode92 · 2020-12-25T18:29:10Z

Update:
Commonvoice dataset tests are green but it seems that test_bg_iterator is having a timeout error.

torchaudio_unittest/datasets/utils_test.py::TestIterator::test_bg_iterator 

Too long with no output (exceeded 10m0s): context deadline exceeded

mthrok

Looks good. Thanks.

I left a few comments on room for further improvement, but this is already very nice.
If you would like, you can check out my comments and make the changes. If you think it's too much, then I will merge it in a few days.

mthrok · 2020-12-27T01:39:27Z

test/torchaudio_unittest/datasets/commonvoice_test.py

-class TestCommonVoice(TempDirMixin, TorchaudioTestCase):
-    backend = 'default'
+original_ext_audio = COMMONVOICE._ext_audio
+sample_rate = 48000


This sample_rate variable is shadowed inside the test function once the test method reaches the loop for i, (waveform, sample_rate, dictionary) in enumerate(dataset):.

This is an anti-pattern, though the current implementation is fine.

Can you change it to something like _SAMPLE_RATE?
The same goes for the other global variables. (Yes, I am aware that I used original_ext_audio in my previous example, and sorry about that.)
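To show why the shadowing is risky, here is a small, self-contained sketch. The function names and the two-item dataset are invented for the illustration. Once the loop variable reuses the global's name, a comparison against "the expected rate" silently compares the loop variable with itself.

```python
sample_rate = 48000   # module-level name that is easy to shadow by accident
_SAMPLE_RATE = 48000  # distinct constant-style name, as suggested above


def count_matches_shadowed(dataset):
    matches = 0
    for _waveform, sample_rate in dataset:
        # Both sides are the loop variable here, so this is always true.
        if sample_rate == sample_rate:
            matches += 1
    return matches


def count_matches_explicit(dataset):
    matches = 0
    for _waveform, rate in dataset:
        # Compares each item against the module-level constant.
        if rate == _SAMPLE_RATE:
            matches += 1
    return matches


# Placeholder dataset: the second item has an unexpected sample rate.
dataset = [("clip-a", 48000), ("clip-b", 16000)]
print(count_matches_shadowed(dataset), count_matches_explicit(dataset))
```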

mthrok · 2020-12-27T01:40:50Z

test/torchaudio_unittest/datasets/commonvoice_test.py

+class TestCommonVoiceFR(TempDirMixin, TorchaudioTestCase):
+    backend = 'default'
+    root_dir = None
+    sample_rate = 48000


Since the sample rate is accessible as a global variable, there is no need to define a duplicate at class level.

mthrok · 2020-12-27T01:45:23Z

test/torchaudio_unittest/datasets/utils_test.py

+
+    @classmethod
+    def tearDownClass(cls):
+        COMMONVOICE._ext_audio = original_ext_audio


Looks good, but this makes me think that we should not be using the CommonVoice dataset implementation in this test.

This is outside the scope of this PR, but we should define a trivial, dedicated Dataset for this utility test and stop using CommonVoice here. Then we can finally remove the mp3 asset CommonVoice/cv-corpus-4-2019-12-10/tt/clips/common_voice_tt_00000000.mp3.

mthrok · 2020-12-27T01:48:42Z

> Update:
> Commonvoice dataset tests are green but it seems that test_bg_iterator is having a timeout error.
> torchaudio_unittest/datasets/utils_test.py::TestIterator::test_bg_iterator
>
> Too long with no output (exceeded 10m0s): context deadline exceeded

So this is a bug in bg_iterator. It seems that it does not propagate errors from the background job.
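For context, the general failure mode can be sketched with a generic background iterator. This is not torchaudio's bg_iterator, and background_iter is a name made up for this sketch. If the worker thread dies without telling the consumer, the consumer waits on the queue forever. Forwarding the exception through the queue lets the consumer re-raise it instead.

```python
import queue
import threading


def background_iter(iterable, maxsize=2):
    """Yield items produced on a worker thread and re-raise worker errors.

    Generic illustration only; not the torchaudio implementation.
    """
    q = queue.Queue(maxsize=maxsize)

    def worker():
        try:
            for item in iterable:
                q.put(("item", item))
        except Exception as exc:
            # Forward the failure instead of letting the thread die silently.
            q.put(("error", exc))
        else:
            q.put(("done", None))

    threading.Thread(target=worker, daemon=True).start()

    while True:
        kind, payload = q.get()
        if kind == "done":
            return
        if kind == "error":
            raise payload
        yield payload


print(list(background_iter(range(3))))
```

Without the ("error", exc) message, the consumer's q.get() would block indefinitely after a worker failure, which matches the "too long with no output" symptom seen in CI.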

AzizCode92 · 2020-12-27T16:46:53Z

@mthrok, thank you for your feedback.
I committed some modifications to improve the code further. There seems to be a problem when collecting the dependencies of macos_conda_py3.7. That's why the build is failing now.

mthrok

Thanks!

mthrok · 2020-12-27T17:43:16Z

> @mthrok, thank you for your feedback.
> I committed some modifications to improve the code further. There seems to be a problem when collecting the dependencies of macos_conda_py3.7. That's why the build is failing now.

Yeah, the build workflow often fails when the upstream PyTorch nightly package has an issue.

jeremyfix · 2020-12-28T08:59:48Z

@AzizCode92 Thank you for fixing this!

AzizCode92 added 6 commits December 20, 2020 01:39

remove walk_files

06f069f

remove deprecated transform from Dataset

ca0c03b

Merge remote-tracking branch 'upstream/master'

495654f

remove target_transform from dataset

6d18687

Merge branch 'master' of https://github.com/pytorch/audio

cf18bcb

add file extension when loading file

292796c

facebook-github-bot added the CLA Signed label Dec 24, 2020

AzizCode92 mentioned this pull request Dec 24, 2020

CommonVoice, missing extension when loading an audio clip #1125

Closed

AzizCode92 marked this pull request as draft December 24, 2020 13:39

fix filename path

20b1e04

fix audio_path inside unittest

792c6ef

update audio_path inside unittest

19d7724

reformat test file

15bb775

mthrok linked an issue Dec 24, 2020 that may be closed by this pull request

CommonVoice, missing extension when loading an audio clip #1125

Closed

AzizCode92 added 3 commits December 24, 2020 21:59

add french case

43cd2ba

fix typo and split tests

e942883

add class method decorator

1da329e

mthrok reviewed Dec 25, 2020

refactor test

6a7d970

remove print statement

a4267c2

AzizCode92 added 5 commits December 25, 2020 21:08

fix timeout

2cf76a6

fix code stye

7d5f077

remove encoding

b119999

add return type to helper functions

5e1f8dd

encode french characters

6eb9400

AzizCode92 marked this pull request as ready for review December 26, 2020 18:01

AzizCode92 requested a review from mthrok December 26, 2020 18:02

mthrok approved these changes Dec 27, 2020

This was referenced Dec 27, 2020

doc for bg_iterator et al #733

Closed

[Refactor][Dataset] YesNo implementation #1127

Merged

AzizCode92 added 2 commits December 27, 2020 13:43

improve the code

7ff28d7

restart pipeline

b8c09e2

mthrok changed the title ~~[Fix bug] add file extension when loading commonvoice audio file~~ Fix CommonVoice for French Dec 27, 2020

mthrok approved these changes Dec 27, 2020

mthrok merged commit aa56d30 into pytorch:master Dec 27, 2020

mthrok mentioned this pull request Dec 28, 2020

Improve Dataset test maintainability/readability #1131

Closed

10 tasks

mpc001 pushed a commit to mpc001/audio that referenced this pull request Aug 4, 2023

Typo (pytorch#1126)

54f4572

* add code for ff * completed the Pull reqs. * Final Commit * sorted * Update CODEOWNERS * Update main.py * Fixed 👨‍🔧Typo in Read me * 00
