Fix CommonVoice for French #1126

Merged
merged 22 commits into from
Dec 27, 2020

Conversation

AzizCode92
Contributor

In response to this issue #1125

@jeremyfix

Thanks for the quick reply.

Unfortunately, it does not work on Ubuntu because os.path.join adds the directory separator '/' before the extension and therefore builds a path like:

/opt/Datasets/CommonVoice/fr_79h_2019-02-25/fr/clips/2760b7263452ca460dd0ad387bbc091a7ef0eb5e5b28d19742047ed438857c123f47834a24b3087f2e9eb25ec3f9b506ff61ad60fb22b00e4aa3125e5920374c/.mp3

Note the trailing '/.mp3'.

Instead of

filename = os.path.join(path, folder_audio, fileid, ext_audio)

what about

filename = os.path.join(path, folder_audio, fileid + ext_audio)
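
The difference between the two calls can be checked in isolation. The following is a minimal sketch, not code from the pull request: it uses posixpath (the POSIX flavour of os.path) so the result is the same on any OS, and the directory and clip id are shortened, hypothetical values.

```python
import posixpath  # POSIX flavour of os.path, so '/' is used on every platform

path = "/opt/Datasets/CommonVoice/fr"  # hypothetical, shortened dataset root
folder_audio = "clips"
fileid = "2760b726"                    # hypothetical, shortened clip id
ext_audio = ".mp3"

# Passing the extension as its own component puts a separator in front of it.
broken = posixpath.join(path, folder_audio, fileid, ext_audio)

# Concatenating first keeps the extension attached to the file name.
fixed = posixpath.join(path, folder_audio, fileid + ext_audio)

print(broken)  # /opt/Datasets/CommonVoice/fr/clips/2760b726/.mp3
print(fixed)   # /opt/Datasets/CommonVoice/fr/clips/2760b726.mp3
```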

@AzizCode92
Contributor Author

Hi @jeremyfix, for the moment I can't reproduce it locally.
I get the error 'torchaudio C++ extension is not available.'
But I will make a change and see if it makes the CI/CD pipeline green again.
Thanks.

@jeremyfix

jeremyfix commented Dec 24, 2020

@AzizCode92 I tried to check why some unit tests failed, and I'm wondering whether the content of the tsv file has changed.

Indeed, if I look at the first lines of fr_79h_2019-02-25/fr/train.tsv on my recent download of CommonVoice

client_id       path    sentence        up_votes        down_votes      age     gender  accent
a2e8e1e1cc74d08c92a53d7b9ff84e077eb90410edd85b8882f16fd037cecfcb6a19413c6c63ce6458cfea9579878fa91cef18343441c601cae0597a4b0d3144        89e67e7682b36786a0b4b4022c4d42090c86edd96c78c12d30088e62522b8fe466ea4912e6a1055dfb91b296a0743e0a2bbe16cebac98ee5349e3e8262cb9329        Or sur ce point nous n’avons aucune réponse de votre part.      2       0       twenties        male    france
a2e8e1e1cc74d08c92a53d7b9ff84e077eb90410edd85b8882f16fd037cecfcb6a19413c6c63ce6458cfea9579878fa91cef18343441c601cae0597a4b0d3144        87d71819a26179e93acfee149d0b21b7bf5e926e367d80b2b3792d45f46e04853a514945783ff764c1fc237b4eb0ee2b0a7a7cbd395acbdfcfa9d76a6e199bbd        Monsieur de La Verpillière, laissez parler le ministre  2       0       twenties        male    france

Notice that the second column, which holds the path if I'm not mistaken, does not contain any extension.

If I now look at your unit test, you provide a wav extension in the second column containing the path to the clip.

Therefore, a possible fix could be to remove the "wav" extension in the _train_csv_contents list.

I also think audio_path then needs to be adapted to add the COMMONVOICE._ext_audio extension.

@jeremyfix

@AzizCode92 If we face the same os.path.join issue as before, I believe the line

audio_path = os.path.join(audio_base_path, content[1], COMMONVOICE._ext_audio)

should be changed to

audio_path = os.path.join(audio_base_path, content[1] + COMMONVOICE._ext_audio)

@AzizCode92
Contributor Author

@jeremyfix yes, you're right. Good catch.
thanks

@jeremyfix

@AzizCode92

This time, TestCommonVoice passed but the CI/CD pipeline fails because of

torchaudio_unittest/datasets/utils_test.py::TestIterator::test_bg_iterator 

Too long with no output (exceeded 10m0s): context deadline exceeded

@AzizCode92
Contributor Author

Yes, the previous CommonVoice tests passed.
I will try to restart the pipeline.

@AzizCode92
Contributor Author

@jeremyfix
I couldn't restart it manually, so I pushed a new commit to trigger it again. Let's wait and see.

@jeremyfix

@AzizCode92 apparently, still the same pipeline issue.

@mthrok
Collaborator

mthrok commented Dec 24, 2020

@AzizCode92 , @jeremyfix

Thanks for working on this. However, the current approach does not seem to be the right one.
The fundamental issue is that different languages use different formats for the path value.

For English ver 5.1, the extension is included, and this is how we tested the implementation.

$ head -3 CommonVoice/cv-corpus-5.1-2020-06-22/en/train.tsv
client_id	path	sentence	up_votes	down_votes	age	gender	accent	locale	segment
f1f6414c04e74453065e1b7fc1639c6f728dc03ed9589034b8531d8c7d8b994f223f2b79d5fcc42a2b7b19f8cbca5af08f31d47a554ddd682df04ba62caaaa56	common_voice_en_20009651.mp3	"It just didn't seem fair."	2	1				en
f1f6414c04e74453065e1b7fc1639c6f728dc03ed9589034b8531d8c7d8b994f223f2b79d5fcc42a2b7b19f8cbca5af08f31d47a554ddd682df04ba62caaaa56	common_voice_en_20009653.mp3	The anticipated synergies of the two modes of transportation were entirely absent.	2	0				en

But it looks like the French version does not include the extension as part of the path.
So we need some sort of file existence check, with or without the additional extension.
So instead of modifying the existing test, we need to add new test samples and then make the implementation handle both cases.

I would extract the construction of the mock test dataset (what is done in setUpClass right now) into a helper function, then add a new helper function that builds the mock test dataset for the French case, then reuse the test cases.

Also, this makes me wonder: is there another format than mp3?
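
One way to read the "file existence check" idea is sketched below. This is only an illustration under stated assumptions, not the implementation that was merged: resolve_clip_path is a hypothetical helper name, and the clip names are made-up placeholders written into a temporary directory.

```python
import os
import tempfile


def resolve_clip_path(clips_dir, fileid, ext_audio=".mp3"):
    """Hypothetical helper: return an existing clip path, with or without the extension."""
    bare = os.path.join(clips_dir, fileid)
    # Try the value from the tsv as-is first, then with the extension appended.
    for candidate in (bare, bare + ext_audio):
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(f"no clip found for {fileid!r} in {clips_dir!r}")


# Demonstration with empty placeholder files.
with tempfile.TemporaryDirectory() as clips_dir:
    open(os.path.join(clips_dir, "common_voice_en_1.mp3"), "w").close()
    open(os.path.join(clips_dir, "89e67e76.mp3"), "w").close()

    # English-style tsv value: the extension is already part of the path.
    en_name = os.path.basename(resolve_clip_path(clips_dir, "common_voice_en_1.mp3"))
    # French-style tsv value: no extension in the path column.
    fr_name = os.path.basename(resolve_clip_path(clips_dir, "89e67e76"))
```

A check like this touches the file system for every sample, which is one reason a purely string-based rule may be preferable when the extension is known in advance.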

@mthrok mthrok linked an issue Dec 24, 2020 that may be closed by this pull request
@mthrok
Collaborator

mthrok commented Dec 24, 2020

And if you would like to run the tests locally, you can build it with

python setup.py develop

Collaborator

@mthrok mthrok left a comment

Thanks @AzizCode92

Please check out my comments. The tricky thing is that we use the "wav" format in the tests whereas the real CommonVoice dataset is composed of "mp3" files.

if fileid.endswith(".wav"):
    filename = os.path.join(path, folder_audio, fileid)
else:
    filename = os.path.join(path, folder_audio, fileid + ext_audio)
Collaborator

I believe CommonVoice is all mp3 format, so checking against .wav would not work outside of the unit test where we intentionally change it to wav due to the limitation on mp3 format support on Windows.

Since you are passing ext_audio from COMMONVOICE._ext_audio in this implementation, you can do

filename = os.path.join(path, folder_audio, fileid)
if not filename.endswith(ext_audio):
    filename += ext_audio
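
Wrapped in a function, the suggestion above handles both tsv styles without touching the file system. This is a minimal sketch rather than the merged code: build_filename is a hypothetical name, "root" is a placeholder directory, the French clip id is shortened, and posixpath is used only to make the output identical on every OS.

```python
import posixpath


def build_filename(path, folder_audio, fileid, ext_audio=".mp3"):
    """Hypothetical wrapper: append ext_audio only when the tsv value lacks it."""
    filename = posixpath.join(path, folder_audio, fileid)
    if not filename.endswith(ext_audio):
        filename += ext_audio
    return filename


# English-style value: extension already present, so nothing is appended.
en = build_filename("root", "clips", "common_voice_en_20009651.mp3")
# French-style value: no extension, so ".mp3" is appended.
fr = build_filename("root", "clips", "89e67e76")

print(en)  # root/clips/common_voice_en_20009651.mp3
print(fr)  # root/clips/89e67e76.mp3
```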


def test_commonvoice_path(self):
def test_fr_commonvoice_path(self):
Collaborator

You do not need a path version of the test. One is enough.

@@ -11,29 +10,49 @@
normalize_wav,
)

from torchaudio.datasets import COMMONVOICE


class TestCommonVoice(TempDirMixin, TorchaudioTestCase):
Collaborator

Rather than putting the data in class attributes, which makes it harder to see the test logic, can you extract them into plain functions?

def get_mock_dataset_en(root_dir) -> Tuple[Tensor, int, Dict[str, str]]:
    # create files...
    ...

def get_mock_dataset_fr(root_dir) -> Tuple[Tensor, int, Dict[str, str]]:
    # create files...
    ...

Then split the test classes by language. setUpClass should then be as simple as the following.

Also, since we use the wav format in the tests, COMMONVOICE._ext_audio should be changed to ".wav" before the tests run and restored afterwards.

original_ext_audio = COMMONVOICE._ext_audio

class TestCommonVoiceEn(TempDirMixin, TorchaudioTestCase):
    backend = 'default'
    root_dir = None
    data = []

    @classmethod
    def setUpClass(cls):
        cls.root_dir = cls.get_base_temp_dir()
        cls.data = get_mock_dataset_en(cls.root_dir)
        COMMONVOICE._ext_audio = ".wav"

    @classmethod
    def tearDownClass(cls):
        COMMONVOICE._ext_audio = original_ext_audio

    def test_commonvoice_str(self):  # this function should stay untouched
        ...

    def test_commonvoice_path(self):  # this function should stay untouched
        ...

The same goes for French, but there is no need to add a path version.
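
The patch-and-restore pattern in setUpClass and tearDownClass can be tried without torchaudio. The sketch below is an assumption-laden stand-in: FakeDataset is a hypothetical class that only mimics the _ext_audio class attribute, and the test class name is made up.

```python
import unittest


class FakeDataset:
    """Hypothetical stand-in for the dataset class; only the class attribute matters."""
    _ext_audio = ".mp3"


_ORIGINAL_EXT_AUDIO = FakeDataset._ext_audio


class TestWithPatchedExtension(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # The mock clips are wav files, so switch the extension for this class only.
        FakeDataset._ext_audio = ".wav"

    @classmethod
    def tearDownClass(cls):
        # Restore the original value so other test classes are unaffected.
        FakeDataset._ext_audio = _ORIGINAL_EXT_AUDIO

    def test_extension_is_patched(self):
        self.assertEqual(FakeDataset._ext_audio, ".wav")


# Run the single test class programmatically and keep the result for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestWithPatchedExtension)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

unittest.mock.patch.object is another standard way to get the same restore-on-exit behavior, at the cost of a slightly different test layout.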

Contributor Author

@AzizCode92 AzizCode92 Dec 25, 2020

Hi @mthrok, thanks a lot for your feedback.
I took your comments into consideration. I tried the changes locally and the tests were green.
I have pushed my changes.
I hope I covered all your suggestions.

@AzizCode92
Contributor Author

AzizCode92 commented Dec 25, 2020

Update:
Commonvoice dataset tests are green but it seems that test_bg_iterator is having a timeout error.

torchaudio_unittest/datasets/utils_test.py::TestIterator::test_bg_iterator 

Too long with no output (exceeded 10m0s): context deadline exceeded

@AzizCode92 AzizCode92 marked this pull request as ready for review December 26, 2020 18:01
@AzizCode92 AzizCode92 requested a review from mthrok December 26, 2020 18:02
Collaborator

@mthrok mthrok left a comment

Looks good. Thanks.

I made a few comments on room for further improvement, but this is already very nice.
If you would like, you can check out my comments and make the changes. If you think it's too much, I will merge it in a few days.

class TestCommonVoice(TempDirMixin, TorchaudioTestCase):
backend = 'default'
original_ext_audio = COMMONVOICE._ext_audio
sample_rate = 48000
Collaborator

This sample_rate variable is shadowed in the test function once the test method reaches the line for i, (waveform, sample_rate, dictionary) in enumerate(dataset):.

This is an anti-pattern, though the current implementation works.

Can you change it to something like _SAMPLE_RATE?
The same goes for the other global variables. (Yes, I am aware that I used original_ext_audio in my previous example, and sorry about that.)
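
To illustrate why the rename helps, here is a minimal sketch with made-up sample data (check_samples and the two-element tuples are hypothetical, not the test code from this PR). If the constant were also called sample_rate, the loop variable would hide it inside the function and the comparison would compare a value with itself.

```python
_SAMPLE_RATE = 48000  # module-level constant; the name cannot collide with a loop variable


def check_samples(samples):
    """Hypothetical check: every sample must report the expected sample rate."""
    # The loop binds a local sample_rate for each item; _SAMPLE_RATE stays visible.
    for i, (waveform, sample_rate) in enumerate(samples):
        assert sample_rate == _SAMPLE_RATE, f"sample {i} has rate {sample_rate}"
    return len(samples)


# Two placeholder samples: (waveform, sample_rate).
checked = check_samples([([0.0, 0.1], 48000), ([0.2, 0.3], 48000)])
```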

class TestCommonVoiceFR(TempDirMixin, TorchaudioTestCase):
    backend = 'default'
    root_dir = None
    sample_rate = 48000
Collaborator

Since the sample rate is accessible as a global variable, there is no need to define a duplicate on the class level.


    @classmethod
    def tearDownClass(cls):
        COMMONVOICE._ext_audio = original_ext_audio
Collaborator

Looks good, but this makes me think that we should not be using CommonVoice dataset implementation in this test.

This is outside the scope of this PR, but we should define a trivial, dedicated Dataset for this utility test and stop using CommonVoice here. Then we can finally remove the mp3 asset CommonVoice/cv-corpus-4-2019-12-10/tt/clips/common_voice_tt_00000000.mp3.

@mthrok
Collaborator

mthrok commented Dec 27, 2020

> Update:
> Commonvoice dataset tests are green but it seems that test_bg_iterator is having a timeout error.
>
> torchaudio_unittest/datasets/utils_test.py::TestIterator::test_bg_iterator
>
> Too long with no output (exceeded 10m0s): context deadline exceeded

So this is a bug in bg_iterator. It seems that it does not propagate errors from the background job.
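
For context, a background iterator that does forward errors could look roughly like the sketch below. This is a generic illustration using only the standard library; it makes no claim about how torchaudio's bg_iterator is written, and background_iter and failing_source are hypothetical names.

```python
import queue
import threading


def background_iter(iterable, maxsize=2):
    """Consume `iterable` in a worker thread and re-raise worker errors in the caller."""
    q = queue.Queue(maxsize=maxsize)

    def worker():
        try:
            for item in iterable:
                q.put(("item", item))
        except Exception as exc:
            # Hand the exception to the consumer instead of dying silently,
            # which would leave the consumer blocked on q.get() forever.
            q.put(("error", exc))
        else:
            q.put(("done", None))

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=worker, daemon=True).start()

    while True:
        kind, payload = q.get()
        if kind == "item":
            yield payload
        elif kind == "error":
            raise payload
        else:
            return


def failing_source():
    """Hypothetical data source that fails after two items."""
    yield 1
    yield 2
    raise ValueError("simulated loader failure")


received = []
error_message = None
try:
    for value in background_iter(failing_source()):
        received.append(value)
except ValueError as exc:
    error_message = str(exc)
```

Without the "error" message on the queue, the consumer would wait indefinitely, which matches the symptom of a test timing out with no output.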

@AzizCode92
Contributor Author

AzizCode92 commented Dec 27, 2020

@mthrok, thank you for your feedback.
I committed some modifications to improve the code further. It seems that there is a problem when collecting the dependencies of macos_conda_py3.7. That's why the build is failing now.

@mthrok changed the title from "[Fix bug] add file extension when loading commonvoice audio file" to "Fix CommonVoice for French" Dec 27, 2020
Collaborator

@mthrok mthrok left a comment

Thanks!

@mthrok
Collaborator

mthrok commented Dec 27, 2020

> @mthrok, thank you for your feedback.
> I committed some modifications to improve the code further. It seems that there is a problem when collecting the dependencies of macos_conda_py3.7. That's why the build is failing now.

Yeah, the build workflow often fails when the upstream PyTorch nightly package has an issue.

@mthrok mthrok merged commit aa56d30 into pytorch:master Dec 27, 2020
@jeremyfix

@AzizCode92 Thank you for fixing this!

mpc001 pushed a commit to mpc001/audio that referenced this pull request Aug 4, 2023
* add code for ff

* completed the Pull reqs.

* Final Commit

* sorted

* Update CODEOWNERS

* Update main.py

* Fixed 👨‍🔧Typo in Read me

* 00
Development

Successfully merging this pull request may close these issues.

CommonVoice, missing extension when loading an audio clip
4 participants