Changed GTZAN so that it only traverses filenames belonging to the dataset #791

mmxgn · 2020-07-16T19:20:16Z

After recommendation by @mthrok in #764

Now, instead of walking the whole directory of the dataset path,
GTZAN only looks for files matching the pattern `genre`/`genre`.`5 digit number`.wav, where `genre` is an allowed GTZAN genre label.
This allows moving or removing files from the dataset (e.g. for fixing duplication or mislabeling issues) while not listing irrelevant files.
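A minimal sketch of the lookup described above, not the code merged in this PR. The function name `list_gtzan_files` and the `ext_audio` parameter are hypothetical; the genre list assumes the ten labels of the standard GTZAN distribution.

```python
import os
import re

# Assumed: the ten genre labels of the standard GTZAN distribution.
GTZAN_GENRES = [
    "blues", "classical", "country", "disco", "hiphop",
    "jazz", "metal", "pop", "reggae", "rock",
]


def list_gtzan_files(root, ext_audio=".wav"):
    """Return paths (relative to root) of the form genre/genre.NNNNN<ext_audio>.

    Hypothetical helper: only the genre folders are visited, so unrelated
    files and folders under root are never listed.
    """
    found = []
    for genre in GTZAN_GENRES:
        genre_dir = os.path.join(root, genre)
        if not os.path.isdir(genre_dir):
            # A genre folder may have been moved or removed by the user.
            continue
        pattern = re.compile(
            r"^" + re.escape(genre) + r"\.[0-9]{5}" + re.escape(ext_audio) + r"$"
        )
        # os.listdir gives no ordering guarantee, so sort explicitly.
        for name in sorted(os.listdir(genre_dir)):
            if pattern.match(name):
                found.append(os.path.join(genre, name))
    return found
```

Files whose name does not match their folder's genre, or whose index is not exactly five digits, are skipped rather than raising an error.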

* Added the GTZAN class in torchaudio.datasets using the same format as the rest of the datasets.
* Added the appropriate test function in test_datasets.py.
* Added the GTZAN class in the datasets.rst documentation file.

* Added dummy noise .wav in `test/assets/`
* Removed transforms of input and output from the dataset `__init__` function, as well as the corresponding methods.
* Replaced redundant `filtered` and `subset` methods from class initialization and also changed the corresponding assertion message.


codecov · 2020-07-16T21:04:43Z

Codecov Report

Merging #791 into master will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #791      +/-   ##
==========================================
+ Coverage   89.66%   89.71%   +0.04%     
==========================================
  Files          34       34              
  Lines        2652     2664      +12     
==========================================
+ Hits         2378     2390      +12     
  Misses        274      274

Impacted Files	Coverage Δ
torchaudio/datasets/gtzan.py	`80.30% <100.00%> (+4.37%)`	⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 60a8e23...7fb3b63.

mthrok

Looks good.

mthrok · 2020-07-17T20:22:25Z

thanks!

vincentqb · 2020-07-17T21:00:33Z

Thanks for the pull request :)

Before updating other datasets, I'd want us to make these abstractions more general so that the particular implementation of each dataset remains very simple to reproduce. Indeed, one of the strengths of the dataset implementations that we currently have is how simple they are to replicate and extend to other cases.

mmxgn · 2020-07-17T21:36:29Z

Thanks for accepting it :)

Are you referring to utility functions, such as those needed to address #794, or to something like an AudioDataset parent class (or a mixture of both)?

vincentqb · 2020-07-17T21:39:19Z

I meant for #794. I see advantages to staying close to the standard PyTorch dataset. :)

mthrok · 2020-07-17T21:42:21Z

The simplest solution is to make walk_files return files in lexicographical order and have it accept glob pattern options. To do that we need to get rid of os.walk and replace os.listdir with sorted(os.listdir(...)).

mthrok · 2020-07-17T21:43:03Z

Keeping the implementation simple is good but making the code work correctly is important here.

mthrok · 2020-07-17T21:49:02Z

Running a simple grep shows that all the calls to walk_files are followed by list(walker), so walk_files does not have to be a generator. So replacing walk_files with an implementation based on glob.glob and then sorting the returned file list at the end should do.


torchaudio/datasets/vctk.py:        walker = walk_files(
torchaudio/datasets/vctk.py-            self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True
torchaudio/datasets/vctk.py-        )
torchaudio/datasets/vctk.py-        walker = filter(lambda w: self._except_folder not in w, walker)
--
torchaudio/datasets/speechcommands.py:        walker = walk_files(self._path, suffix=".wav", prefix=True)
torchaudio/datasets/speechcommands.py-        walker = filter(lambda w: HASH_DIVIDER in w and EXCEPT_FOLDER not in w, walker)
torchaudio/datasets/speechcommands.py-        self._walker = list(walker)
torchaudio/datasets/speechcommands.py-
--
torchaudio/datasets/librispeech.py:        walker = walk_files(
torchaudio/datasets/librispeech.py-            self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True
torchaudio/datasets/librispeech.py-        )
torchaudio/datasets/librispeech.py-        self._walker = list(walker)
--
torchaudio/datasets/yesno.py:        walker = walk_files(
torchaudio/datasets/yesno.py-            self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True
torchaudio/datasets/yesno.py-        )
torchaudio/datasets/yesno.py-        self._walker = list(walker)
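The glob-then-sort idea above could look roughly like the following. This is a sketch under assumptions, not the torchaudio implementation: `list_files_sorted` and its parameters are hypothetical names, and it returns a list rather than a generator.

```python
import glob
import os


def list_files_sorted(root, pattern="**/*.wav", remove_suffix=False):
    """Return paths under root matching a glob pattern, in sorted order.

    Hypothetical replacement for a walk_files-style helper: glob.glob does
    the pattern matching, and one sort at the end fixes the ordering.
    Paths are returned relative to root.
    """
    # glob.escape protects special characters that may appear in root itself.
    full_pattern = os.path.join(glob.escape(root), pattern)
    matches = glob.glob(full_pattern, recursive=True)
    relative = [os.path.relpath(p, root) for p in matches if os.path.isfile(p)]
    if remove_suffix:
        relative = [os.path.splitext(p)[0] for p in relative]
    return sorted(relative)
```

Because the whole listing is materialized before sorting, this trades laziness for a simple ordering guarantee, which matches the call sites above that wrap the result in `list(...)` anyway.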

vincentqb · 2020-07-17T21:54:28Z

> The simplest solution is to make walk_files return files in lexicographical order and have it accept glob pattern options. To do that we need to get rid of os.walk and replace os.listdir with sorted(os.listdir(...)).

Yes, if we could sort as os.walk goes down, that would be great.

> Keeping the implementation simple is good but making the code work correctly is important here.

Yes, but one does not prevent the other :)

> Running a simple grep shows that all the calls to walk_files are followed by list(walker), so walk_files does not have to be a generator. So replacing walk_files with an implementation based on glob.glob and then sorting the returned file list at the end should do.

The reason list(walker) is used is that we were planning to eventually replace the concept of datasets with generator constructions. This was a simple way of preparing for that.

mthrok · 2020-07-17T22:17:27Z

> > Keeping the implementation simple is good but making the code work correctly is important here.
>
> Yes, but one does not prevent the other :)

My point here is that, together with the previous comment, this sounds like you prefer to keep the current (wrong) implementation for the sake of simplicity. I do not think that's your intention, but I would like to stress that the most important thing is that torchaudio provides correct and easy-to-use dataset implementations. The simplicity of the implementation should come after that.

> > Running a simple grep shows that all the calls to walk_files are followed by list(walker), so walk_files does not have to be a generator. So replacing walk_files with an implementation based on glob.glob and then sorting the returned file list at the end should do.
>
> The reason list(walker) is used is that we were planning to eventually replace the concept of datasets with generator constructions. This was a simple way of preparing for that.

In that case, neither glob.glob nor glob.iglob can replace the implementation. We would need to reinvent glob.iglob with guaranteed lexicographical order. I think traversing a directory of thousands of files is not as expensive as NN training, so giving up on the generator here and using glob.glob (for pattern matching) and sorting at the end (for lexicographical ordering) will be simpler.

vincentqb · 2020-07-18T00:33:44Z

> My point here is that, together with the previous comment, this sounds like you prefer to keep the current (wrong) implementation for the sake of simplicity. I do not think that's your intention, but I would like to stress that the most important thing is that torchaudio provides correct and easy-to-use dataset implementations. The simplicity of the implementation should come after that.

Given that the code is open source, we should aim for "easy to use" to also mean "easy to modify". In any case, this is not about correctness versus simplicity. It's about raising the bar and having both. :)

> In that case, neither glob.glob nor glob.iglob can replace the implementation. We would need to reinvent glob.iglob with guaranteed lexicographical order. I think traversing a directory of thousands of files is not as expensive as NN training, so giving up on the generator here and using glob.glob (for pattern matching) and sorting at the end (for lexicographical ordering) will be simpler.

The idea of using a generator was to investigate what we could do in the event that the list is streamed (say, as for an IterableDataset with a large list of remote files that are expensive to query). We do not have examples of this in this repository, though this would be a great example to nail down eventually.

I'm sure you have seen this, but there's a fun discussion here that would be relevant :) Apparently, one can simply sort the items returned by os.walk to get a deterministic order. This may or may not be a performance gain, depending on the directory structure of course. I'm ok with not using generators/os.walk by the way, though if the Stack Overflow solution actually works, that would seem like a pretty neat solution to me.
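For illustration, sorting inside os.walk might be sketched as follows. This is an assumption about what such a solution could look like, not the linked answer or the torchaudio code; `walk_files_sorted` is a hypothetical name.

```python
import os


def walk_files_sorted(root, suffix):
    """Lazily yield paths under root ending in suffix, in a deterministic order.

    Hypothetical sketch: it stays a generator, so nothing beyond the current
    directory listing is held in memory at once.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk is top-down by default, so sorting dirnames in place
        # fixes the order in which subdirectories are visited next.
        dirnames.sort()
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield os.path.join(dirpath, name)
```

The resulting order is deterministic but is not a plain lexicographical sort of full paths: files in a directory are yielded before anything in its subdirectories.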

There have been some changes recently in torchtext, though they were also using os.walk in the past. Do you know what they do? How about torchvision?

mmxgn · 2020-07-18T10:21:52Z

Hi,

I noticed that os.listdir does not guarantee lexicographic order either, so I updated my fork to sort files before adding them to the dataset list. I also made a small change to look for GTZAN._ext_audio in the pattern (is _ext_audio used elsewhere?) instead of the literal .wav. Should I open a new PR, or is there a faster way to change them from here?

vincentqb · 2020-07-22T22:27:58Z

Thanks for helping! :)

Just to make sure we are on the same page, there are two issues.

* deciding how to handle datasets that have been modified by a user (the issue discussed here in Changed GTZAN so that it only traverses filenames belonging to the dataset #791), since we currently do not provide any guarantees.
* making the order deterministic for any user of torchaudio.

Both issues should be addressed for all datasets, likely by modifying the common function walk_files, and so new pull requests would be needed. We should also re-align GTZAN with the other datasets to make it easier to maintain.

mthrok · 2020-07-22T22:52:54Z

Hi,

> I noticed that os.listdir does not guarantee lexicographic order either, so I updated my fork to sort files before adding them to the dataset list. I also made a small change to look for GTZAN._ext_audio in the pattern (is _ext_audio used elsewhere?) instead of the literal .wav. Should I open a new PR, or is there a faster way to change them from here?

I started working on this in #814, so you can relax. Thanks.


mmxgn and others added 11 commits May 29, 2020 16:59

Added the popular GTZAN dataset:

f18d102

* Added the GTZAN class in torchaudio.datasets using the same format as the rest of the datasets.
* Added the appropriate test function in test_datasets.py.
* Added the GTZAN class in the datasets.rst documentation file.

Fixed E303: too many blank lines error

0cb6fa9

Added GTZAN to __init__.__all__

1586dda

Fixed incorrectly not importing GTZAN

79b6aa3

removed duplicate warning

405cae2

lint

93f8397

Merge remote-tracking branch 'upstream/master'

6b8316a

Merge remote-tracking branch 'upstream/master'

82d0aa5

Fixed typo when checking for extension

7fb3b63

mthrok mentioned this pull request Jul 17, 2020

Add LibriTTS dataset #790

Merged

mthrok approved these changes Jul 17, 2020


mthrok merged commit 47eb1e6 into pytorch:master Jul 17, 2020

mthrok mentioned this pull request Jul 23, 2020

Make GTZAN dataset sorted and use on-the-fly data in GTZAN test #819

Merged

vincentqb mentioned this pull request Aug 4, 2020

Re-align GTZAN with other datasets #852

Closed

vincentqb mentioned this pull request Sep 16, 2020

Better structure dataset implementations #910

Closed

4 tasks

mpc001 pushed a commit to mpc001/audio that referenced this pull request Aug 4, 2023

Merge pull request pytorch#791 from seemethere/fix_transformer

59caa16

word_language_model: Fix Transformer init_weights
