-
Notifications
You must be signed in to change notification settings - Fork 666
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix CommonVoice for French #1126
Conversation
Thanks for the quick reply. Unfortunately, it does not work on Ubuntu because os.path.join adds the directory separator '/' and therefore builds path like : /opt/Datasets/CommonVoice/fr_79h_2019-02-25/fr/clips/2760b7263452ca460dd0ad387bbc091a7ef0eb5e5b28d19742047ed438857c123f47834a24b3087f2e9eb25ec3f9b506ff61ad60fb22b00e4aa3125e5920374c/.mp3 see the last '/.mp3' Instead of
what about
|
Hi @jeremyfix , for the moment I can't reproduce it locally |
@AzizCode92 I tried to check why some unittest failed and I'm wondering if the content of the tsv file did not change Indeed, if I look at the first lines of
Notice the second column, which holds the path if I'm not wrong, does not contain any extension. If I now look at your unitest, you provide a wav extension in the second column containing the path to the clip. Therefore, a possible fix could be to remove the "wav" extension in the _train_csv_contents list. I then also think there is a need to adapt the audio_path to add the COMMONVOICE._ext_audio extension |
@AzizCode92 If we face the same os.path.join issue as before, I believe the line
should be changed to
|
@jeremyfix yes, you're right. Good catch. |
This time, the TestCommonVoice passed but the cicd fails because of
|
yes, the previous tests related to |
@jeremyfix |
@AzizCode92 apparently, still the same pipeline issue. |
Thanks for working on this. However, the current approach does not seem to be the right approach. For English ver 5.1, the extension is included, and this is how we tested the implementation.
But looks like French version does not have the extension, as part of the I would extract the construction of mock test dataset (what is done in Also this make me wonder, is there another format than |
And you would like to run test locally, then you can build it with
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @AzizCode92
Please checkout my comments. The tricky thing is that we use "wav"
format in test whereas the real CommonVoice dataset is composed of "mp3"
.
torchaudio/datasets/commonvoice.py
Outdated
if fileid.endswith(".wav"): | ||
filename = os.path.join(path, folder_audio, fileid) | ||
else: | ||
filename = os.path.join(path, folder_audio, fileid + ext_audio) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe CommonVoice is all mp3
format, so checking against .wav
would not work outside of the unit test where we intentionally change it to wav
due to the limitation on mp3
format support on Windows.
Since you are passing ext_audio from CommonVoice._ext_audio
, in this implementation, you can do
filename = os.path.join(path, folder_audio, field)
if not filename.endswith(ext_audio):
filename += ext_audio
|
||
def test_commonvoice_path(self): | ||
def test_fr_commonvoice_path(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You do not need path
version of test. One is enough.
@@ -11,29 +10,49 @@ | |||
normalize_wav, | |||
) | |||
|
|||
from torchaudio.datasets import COMMONVOICE | |||
|
|||
|
|||
class TestCommonVoice(TempDirMixin, TorchaudioTestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than putting data in class attribute, which makes harder to see the test logic, can you extract them in plain function?
def get_mock_dataset_en(root_dir) -> Tuple[Tensor, int, Dict[str, str]]:
# create files...
def get_mock_dataset_fr(root_dir) -> Tuple[Tensor, int, Dict[str, str]]:
# create files...
then, split the test classes for different language. Now setUpClass
should be something as simple as the following.
Also, since we use wav
format in test, so CommonVoice._ext_audio
should be changed to ".wav"
before the test.
original_ext_audio = CommonVoice._ext_audio
class TestCommonVoiceEn(TempDirMixin, TorchaudioTestCase):
backend = 'default'
root_dir = None
data = []
@classmethod
def setUpClass(cls):
cls.root_dir = cls.get_base_temp_dir()
cls.data = get_mock_dataset_en(cls.root_dir)
CommonVoice._ext_audio = ".wav"
@classmethod
def tearDownClass(cls):
CommonVoice._ext_audio = original_ext_audio
def test_commonvoice_str(self) # this function should stay untouched
...
def test_commonvoice_path(self) # this function should stay untouched
...
Same goes to French but no need to add path
version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mthrok, thanks a lot for your feedback.
I took your comments into consideration. I tried them locally and the tests were green.
I pushed my changes now.
I hope I did cover all your suggestions.
Update:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks.
I made a few comments for the space of further improvement, but this is already very nice.
If you would like, you can check out my comments and change. If you think it's too much, then I will merge it in a few days.
class TestCommonVoice(TempDirMixin, TorchaudioTestCase): | ||
backend = 'default' | ||
original_ext_audio = COMMONVOICE._ext_audio | ||
sample_rate = 48000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sample_rate
variable is shadowed in test function when test method reaches for i, (waveform, sample_rate, dictionary) in enumerate(dataset):
.
This is an anti-pattern, though the current implementation is fine.
Can you change it to something like _SAMPLE_RATE
?
The same goes to other global variables. (Yes, I am aware that I used original_ext_audio
in my previous example, and sorry about that.)
class TestCommonVoiceFR(TempDirMixin, TorchaudioTestCase): | ||
backend = 'default' | ||
root_dir = None | ||
sample_rate = 48000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since sample rate is accessible as global variable, there is no need to define another one (duplicated one) on class level.
|
||
@classmethod | ||
def tearDownClass(cls): | ||
COMMONVOICE._ext_audio = original_ext_audio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but this makes me think that we should not be using CommonVoice dataset implementation in this test.
This is not the scope of this PR but we should define an trivial, dedicated Dataset for this utility test and stop using CommonVoice here. Then we can finally remove the mp3
asset CommonVoice/cv-corpus-4-2019-12-10/tt/clips/common_voice_tt_00000000.mp3
.
So this is a bug in |
@mthrok, thank you for your feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Yeah, |
@AzizCode92 Thank you for having fixed this ! |
* add code for ff * completed the Pull reqs. * Final Commit * sorted * Update CODEOWNERS * Update main.py * Fixed 👨🔧Typo in Read me * 00
In response to this issue #1125