Remove insignificant test assets #764

mthrok · 2020-07-02T14:05:44Z

@astaff had introduced guideline for test assets in #759 and we can get rid of the following existing assets.

General Direction for replacing assets with on-the-fly generation

Create Tensor

common_utils.get_sinusoid
common_utils.get_whitenoise

Get temporary file path

self.get_temp_path('foo.wav')
# suppose this class is composed of `common_utils.TempDirMixin`

Save wav file

common_utils.save_wav(path, data)

Load wav file

common_utils.load_wav(path)

The text was updated successfully, but these errors were encountered:

engineerchuan · 2020-07-10T16:49:14Z

I would like to work on this. Would we prefer doing this in one PR or okay to split across a 2-3 PR?

mthrok · 2020-07-10T17:44:28Z

Hi @engineerchuan

Thanks for signing up.

It is preferable to split PRs into ones simple enough to review and each PR works on one single aspect. For example

deleting all the unused files is simple enough to be in one PR.
replacing mp3 with wav is different nature so it would be easier to review if it's one single PR.
Replacing kaldi_file.wav is under-specified at this moment, so we need to discuss the detail before working on the actual code change, and once the change is ready I will need to run locally to verify it, so this one should be a separate PR than the others. (also for this one we will need to replace kaldi_file.wav with longer audio data, and we might find an actual bug on Kaldi compatible functions, in that case we need to follow up and fix the functions, so this one could potentially more involved than just replacing some data. Probably you do not want to start from this one.)
Replacing sinewave files with synthetic data will be easy to review if it's isolated from the others.
Same goes to whitenoise files too.

as such.

so I think it's even better to split PRs into more than 3.

engineerchuan · 2020-07-11T16:54:01Z

Couple of comments:

From this test run, I learned:

torchaudio/datasets/gtzan.py probably uses genres/noise/noise.0000.wav so we shouldn't remove that. I can't find a reference in the code though but the other files are put under genres. How does this work?
torchaudio/datasets/yesno.py still uses waves_yesno/0_1_0_1_0_1_1_0.wav

engineerchuan · 2020-07-11T16:55:02Z

Are we sure we can remove steam-train-whistle-daniel_simon.mp3 ?

In file torchaudio/test/test_io.py, it seems to use this file to test MP3 reading IO.

mthrok · 2020-07-11T20:47:17Z

Couple of comments:

From this test run, I learned:

torchaudio/datasets/gtzan.py probably uses genres/noise/noise.0000.wav so we shouldn't remove that. I can't find a reference in the code though but the other files are put under genres. How does this work?

Let me get back on this one.

torchaudio/datasets/yesno.py still uses waves_yesno/0_1_0_1_0_1_1_0.wav

This is my overlook. I only greped each asset name so files I did not see the files indirectly required by Dataset implementations. Thanks for pointing this out.

Are we sure we can remove steam-train-whistle-daniel_simon.mp3 ?

In file torchaudio/test/test_io.py, it seems to use this file to test MP3 reading IO.

You are right, we cannot remove the file because it's used by test_io.py, however, the other tests test_librosa_compatibility should not be using these files and they should use wav version instead.

Update: The following applies to tests other than test_io.py
We want to replace this with wav format. The reason is loading mp3 file as Tensor is a bit complicated with 3rd party libraries (the tests for non-I/O functionalities should not be using torchauiod.load), and as far as I checked, the tests that use steam-train-whistle-daniel_simon.mp3 do not have requirement to use mp3 file. So by changing it to wav, we can simplify the test logic and that allows us to run the same test on Windows.

engineerchuan · 2020-07-11T22:38:27Z

I did a grep for the mp3 file

$ grep -R steam-train-whistle-daniel_simon.mp3 test
test/common_utils/backend_utils.py:    test_filepath = data_utils.get_asset_path('steam-train-whistle-daniel_simon.mp3')
test/test_dataloader.py:        sound_files = ["sinewave.wav", "steam-train-whistle-daniel_simon.mp3"]
test/test_io.py:                                 "steam-train-whistle-daniel_simon.mp3")
test/test_sox_effects.py:    test_filepath = common_utils.get_asset_path("steam-train-whistle-daniel_simon.mp3")

I found the mp3 file in 4 places, but not in test_librosa_compatibility. For the 4 places I found the test, backend_utils, test_io and test_dataloader both seem to be "IO" so I will not touch them. I will remove it from test_sox_effects.

Second issue, I looked for test.wav accidentally checked in and did not find it.

mthrok · 2020-07-12T02:24:57Z

Hi @engineerchuan

I am sorry I replied you in very rushed manner and I gave you a wrong description which ended up confusing you.

I found the mp3 file in 4 places, but not in test_librosa_compatibility. For the 4 places I found the test, backend_utils, test_io and test_dataloader both seem to be "IO" so I will not touch them. I will remove it from test_sox_effects.

Yes, that makes sense.

Second issue, I looked for test.wav accidentally checked in and did not find it.

Yes, you are right. I updated the list.

Thanks for checking the details I missed.

Part of #764 - Replace `whitenoise.wav` with on-the-fly data generation - Replace `torchaudio.load` with `common_utils.load_wav` - Replace `steam-train-whistle-daniel_simon.mp3` with `.wav`

mthrok · 2020-07-14T22:34:32Z

@engineerchuan

torchaudio/datasets/gtzan.py probably uses genres/noise/noise.0000.wav so we shouldn't remove that. I can't find a reference in the code though but the other files are put under genres. How does this work?

I looked at this one and found out that GTZAN dataset just traverse all the wav file under the test/assets. I see two issues here.

GTZAN dataset should not be traversing other files in test/assets
GTZAN dataset does not have the list of files that corresponds to the entire dataset. Since GTZAN has a fixed number of samples, it should not be traversing all the audio files under the given directory. This could introduce extra files or does not react to missing files.
@mmxgn Do you have a thought on 2.?

Since we have data generation utilities and none of the dataset tests require a real data, I think we can remove these assets and move to on-the-fly generation. That way we can delete 0_1_0_1_0_1_1_0.wav and noise.0000.wav from test assets.

mmxgn · 2020-07-15T10:48:30Z

2. `GTZAN` dataset does not have the list of files that corresponds to the entire dataset. Since `GTZAN` has a fixed number of samples, it should not be traversing all the audio files under the given directory. This could introduce extra files or does not react to missing files.
   @mmxgn Do you have a thought on 2.?

Hello,

GTZAN indeed has a fixed number of samples but there are cases that paper change that. Some times people use e.g. different training-testing splits. Other times, papers remove some of the songs or move them to different genres to compensate to some of GTZAN's shortcomings. Of course now that you mention it there shouldn't be different genres, else it's not GTZAN anymore.

What I could do is have it still traverse the files , but limit to the <genre>/<genre>.NUM.wav pattern. I can also look also into generating on the fly data using the data generation utilities for testing. Tell me what you think.

mthrok · 2020-07-15T17:03:15Z

HI @mmxgn

Thanks for response.

What I could do is have it still traverse the files , but limit to the <genre>/<genre>.NUM.wav pattern.

This sounds a reasonable approach. Let me know if you have time and would like to work on it, if you are busy I will file an issue and try to get help.

One followup question: I downloaded genres.tar.gz and checked the contents and I see rock, reggae, pop, metal, jazz, hiphop, disco, country, classical and blues, but I do not see noise like it's present in test assets. Can GTZAN have genres other than those listed above?

I can also look also into generating on the fly data using the data generation utilities for testing.

Let me work on this for another dataset, first. It will be easier for contributors if there is an example for doing that.

mmxgn · 2020-07-15T17:15:37Z

Hello,

HI @mmxgn

Thanks for response.

What I could do is have it still traverse the files , but limit to the <genre>/<genre>.NUM.wav pattern.

This sounds a reasonable approach. Let me know if you have time and would like to work on it, if you are busy I will file an issue and try to get help.

No problem at all, there is a little timezone difference but I think I can do it tommorow.

One followup question: I downloaded genres.tar.gz and checked the contents and I see rock, reggae, pop, metal, jazz, hiphop, disco, country, classical and blues, but I do not see noise like it's present in test assets. Can GTZAN have genres other than those listed above?

No, in reality GTZAN doesn't have such things, I just used it for testing. I took a noise sample already in the test assets and put it in an imaginary genre `noise', and since the dataset traverses the directory there was no problem in that.

I can also look also into generating on the fly data using the data generation utilities for testing.

Let me work on this for another dataset, first. It will be easier for contributors if there is an example for doing that.

Great, shall I keep then the noise sample, just with a different name? (e.g. blues.0000.wav) I didn't want to put an original GTZAN file for testing.

mthrok · 2020-07-16T21:25:46Z

I made an example PR to generate dataset on-the-fly for YESNO dataset. #792

Part of #764

mthrok · 2020-11-04T15:02:59Z

Replacing the kaldi_file.wav, which has only 20 samples while sampling rate is 16k Hz, to kaldi_file_8000.wav, which has 8000 samples (16kHz), makes most of kaldi compatibility tests fail.

406 failed, 218 passed, 2 warnings in 37.35s

https://gist.github.com/mthrok/b44af36d7724def6a27811f48f01d42f

krishnakalyan3 · 2020-11-13T11:15:56Z

 steam-train-whistle-daniel_simon.mp3 should be replaced by steam-train-whistle-daniel_simon.wav

Do we just rename all occurrences of this file to above? (sox_compatibility_test.py, transforms_test.py, batch_consistency_test.py, io_test.py, README.md)

Fixes error: `error: You need C++14 to compile PyTorch`.

mthrok added module: tests help wanted labels Jul 2, 2020

mthrok mentioned this issue Jul 2, 2020

Add Waveforms for Testing Purposes section to test/README.md #759

Merged

mthrok mentioned this issue Jul 16, 2020

Add LibriTTS dataset #790

Merged

mmxgn mentioned this issue Jul 16, 2020

Changed GTZAN so that it only traverses filenames belonging to the dataset #791

Merged

This was referenced Jul 16, 2020

Generate YESNO dataset on-the-fly for test #792

Merged

Replace whitenoise_1min.mp3 with on-the-fly generation. #813

Merged

mthrok added a commit that referenced this issue Jul 22, 2020

Get rid of whitenoise_1min.mp3 (#813)

0812f22

Part of #764

mthrok mentioned this issue Jul 23, 2020

Make GTZAN dataset sorted and use on-the-fly data in GTZAN test #819

Merged

mmxgn mentioned this issue Aug 5, 2020

Re-align GTZAN with other datasets #852

Closed

mpc001 pushed a commit to mpc001/audio that referenced this issue Aug 4, 2023

Fix cpp/mnist CMakeList build. (pytorch#764)

22ee056

Fixes error: `error: You need C++14 to compile PyTorch`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove insignificant test assets #764

Remove insignificant test assets #764

mthrok commented Jul 2, 2020 •

edited

Loading

engineerchuan commented Jul 10, 2020

mthrok commented Jul 10, 2020 •

edited

Loading

engineerchuan commented Jul 11, 2020

engineerchuan commented Jul 11, 2020

mthrok commented Jul 11, 2020 •

edited

Loading

engineerchuan commented Jul 11, 2020 •

edited

Loading

mthrok commented Jul 12, 2020

mthrok commented Jul 14, 2020 •

edited

Loading

mmxgn commented Jul 15, 2020

mthrok commented Jul 15, 2020

mmxgn commented Jul 15, 2020 •

edited

Loading

mthrok commented Jul 16, 2020

mthrok commented Nov 4, 2020

krishnakalyan3 commented Nov 13, 2020

Remove insignificant test assets #764

Remove insignificant test assets #764

Comments

mthrok commented Jul 2, 2020 • edited Loading

General Direction for replacing assets with on-the-fly generation

engineerchuan commented Jul 10, 2020

mthrok commented Jul 10, 2020 • edited Loading

engineerchuan commented Jul 11, 2020

engineerchuan commented Jul 11, 2020

mthrok commented Jul 11, 2020 • edited Loading

engineerchuan commented Jul 11, 2020 • edited Loading

mthrok commented Jul 12, 2020

mthrok commented Jul 14, 2020 • edited Loading

mmxgn commented Jul 15, 2020

mthrok commented Jul 15, 2020

mmxgn commented Jul 15, 2020 • edited Loading

mthrok commented Jul 16, 2020

mthrok commented Nov 4, 2020

krishnakalyan3 commented Nov 13, 2020

mthrok commented Jul 2, 2020 •

edited

Loading

mthrok commented Jul 10, 2020 •

edited

Loading

mthrok commented Jul 11, 2020 •

edited

Loading

engineerchuan commented Jul 11, 2020 •

edited

Loading

mthrok commented Jul 14, 2020 •

edited

Loading

mmxgn commented Jul 15, 2020 •

edited

Loading