Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed GTZAN so that it only traverses filenames belonging to the dataset #791

Merged
merged 11 commits into from
Jul 17, 2020

Conversation

mmxgn
Copy link
Contributor

@mmxgn mmxgn commented Jul 16, 2020

After recommendation by @mthrok in #764

Now, instead of walking the whole directory of the dataset path,
GTZAN only looks for files under a genre/genre.5 digit number.wav format, where genre is an allowed GTZAN genre label.
This allows moving or removing files from the dataset (e.g. for fixing duplication or mislabeling issues) while not listing irrelevant files.

mmxgn and others added 11 commits May 29, 2020 16:59
* Added the GTZAN class in torchaudio.datasets using the same format as the rest of the datasets.
* Added the appropriate test function in test_datasets.py.
* Added the GTZAN class in the datasets.rst documentation file.
* Added dummy noise .wav in `test/assets/`
* Removed transforms of input and output from the dataset
  `__init__` function, as well as the corresponding methods.
* Replaced rendundant `filtered` and `subset` methods from
  class initialization and also changed the corresponding
  assertion message.
…taset

Now, instead of walking the whole directory and subdirectories of the dataset
GTZAN only looks for files under a `genre`/`genre`.`5 digit number`.wav format, where `genre` is an allowed GTZAN genre label.
This allows moving or removing files from the dataset (e.g. for fixing duplication or mislabeling issues).
@codecov
Copy link

codecov bot commented Jul 16, 2020

Codecov Report

Merging #791 into master will increase coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #791      +/-   ##
==========================================
+ Coverage   89.66%   89.71%   +0.04%     
==========================================
  Files          34       34              
  Lines        2652     2664      +12     
==========================================
+ Hits         2378     2390      +12     
  Misses        274      274              
Impacted Files Coverage Δ
torchaudio/datasets/gtzan.py 80.30% <100.00%> (+4.37%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 60a8e23...7fb3b63. Read the comment docs.

@mthrok mthrok mentioned this pull request Jul 17, 2020
Copy link
Collaborator

@mthrok mthrok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@mthrok mthrok merged commit 47eb1e6 into pytorch:master Jul 17, 2020
@mthrok
Copy link
Collaborator

mthrok commented Jul 17, 2020

thanks!

@vincentqb
Copy link
Contributor

vincentqb commented Jul 17, 2020

Thanks for the pull request :)

Before updating other datasets, I'd want us to make these abstractions more general so that the particular implementation of each dataset remains very simple to reproduce. Indeed, one of the strength of the dataset implementations that we currently have is how simple it is to replicate and extend to other cases.

@mmxgn
Copy link
Contributor Author

mmxgn commented Jul 17, 2020

Thanks for accepting it :)

Are you referring to utility functions such as to address #794 or something like an AudioDataset parent class (or a mixture of both)?

@vincentqb
Copy link
Contributor

I meant for #794. I see advantages to staying closes to standard pytorch dataset. :)

@mthrok
Copy link
Collaborator

mthrok commented Jul 17, 2020

The simplest solution is to make walk_files to return lexicographical order and have it accept glob pattern options. To do that we need to get rid of os.walk and replace os.listdir with sorted(os.listdir

@mthrok
Copy link
Collaborator

mthrok commented Jul 17, 2020

Keeping the implementation simple is good but making the code work correctly is important here.

@mthrok
Copy link
Collaborator

mthrok commented Jul 17, 2020

Running simple grep, all the function calls to walk_files are followed by list(walker) so walk_files does not have to be generator. So replacing walk_files with implementation with glob.glob then sorting the returned file lists at the end should do.


torchaudio/datasets/vctk.py:        walker = walk_files(
torchaudio/datasets/vctk.py-            self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True
torchaudio/datasets/vctk.py-        )
torchaudio/datasets/vctk.py-        walker = filter(lambda w: self._except_folder not in w, walker)
--
torchaudio/datasets/speechcommands.py:        walker = walk_files(self._path, suffix=".wav", prefix=True)
torchaudio/datasets/speechcommands.py-        walker = filter(lambda w: HASH_DIVIDER in w and EXCEPT_FOLDER not in w, walker)
torchaudio/datasets/speechcommands.py-        self._walker = list(walker)
torchaudio/datasets/speechcommands.py-
--
torchaudio/datasets/librispeech.py:        walker = walk_files(
torchaudio/datasets/librispeech.py-            self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True
torchaudio/datasets/librispeech.py-        )
torchaudio/datasets/librispeech.py-        self._walker = list(walker)
--
torchaudio/datasets/yesno.py:        walker = walk_files(
torchaudio/datasets/yesno.py-            self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True
torchaudio/datasets/yesno.py-        )
torchaudio/datasets/yesno.py-        self._walker = list(walker)

@vincentqb
Copy link
Contributor

The simplest solution is to make walk_files to return lexicographical order and have it accept glob pattern options. To do that we need to get rid of os.walk and replace os.listdir with sorted(os.listdir

Yes, if we could sort as os.walk goes down, that would be great.

Keeping the implementation simple is good but making the code work correctly is important here.

Yes, but one does not prevent the other :)

Running simple grep, all the function calls to walk_files are followed by list(walker) so walk_files does not have to be generator. So replacing walk_files with implementation with glob.glob then sorting the returned file lists at the end should do.

The reason list(walker) is being used was that we were planning to eventually replace the concept of datasets by generator constructions. This was a simple way of preparing for this.

@mthrok
Copy link
Collaborator

mthrok commented Jul 17, 2020

Keeping the implementation simple is good but making the code work correctly is important here.

Yes, but one does not prevent the other :)

My point here is that together with the precious comment this sounds like you prefer to have the current (wrong) implementation for the sake of simple implementation. I do not think that's your intention but I would like to stress that the most important things is the torchaudio provides correct and easy-to-use dataset implementation. The simplicity of the implementation should come after that.

Running simple grep, all the function calls to walk_files are followed by list(walker) so walk_files does not have to be generator. So replacing walk_files with implementation with glob.glob then sorting the returned file lists at the end should do.

The reason list(walker) is being used was that we were planning to eventually replace the concept of datasets by generator constructions. This was a simple way of preparing for this.

In that case, neither glob.glob or glob.iglob cannot replace the implementation. We need to reinvent glob.iglob with guaranteed lexicographical order. I think traversing directory for thousands files are not as expensive as NN training, so giving up on generator here and using glob.glob (for pattern matching) and sorting at the end (for lexicographical ordering) will be simpler.

@vincentqb
Copy link
Contributor

My point here is that together with the precious comment this sounds like you prefer to have the current (wrong) implementation for the sake of simple implementation. I do not think that's your intention but I would like to stress that the most important things is the torchaudio provides correct and easy-to-use dataset implementation. The simplicity of the implementation should come after that.

Given that the code is open source, we should aim for "easy to use" to also mean "easy to modify". In any case, this is not about correctness versus simplicity. It's about raising the bar and having both. :)

In that case, neither glob.glob or glob.iglob cannot replace the implementation. We need to reinvent glob.iglob with guaranteed lexicographical order. I think traversing directory for thousands files are not as expensive as NN training, so giving up on generator here and using glob.glob (for pattern matching) and sorting at the end (for lexicographical ordering) will be simpler.

The idea of using generator was to investigate what we could do in the event that the list is streamed (say as for IterableDataset with a large list of remote files that are expensive to query). We do not have examples of this in this repository, though this would be great example to nail down eventually.

I'm sure you have seen this, but there's a fun discussion here that would be relevant :) Apparently, one can simply sort the items returned by os.walk to get a deterministic order. This may or may not be a performance gain depending on the directory structure of course. I'm ok with not using generators/os.walk by the way, though if the stackoverflow solution actually works that would seem like a pretty neat solution to me.

There's been some changes recently in torchtext, though they were also using os.walk in the past. Do you know what they do? How about torchvision?

@mmxgn
Copy link
Contributor Author

mmxgn commented Jul 18, 2020

Hi,

I noticed that os.listdir does not guarantee lexicographic order either so I updated my fork to sort files before adding them to the dataset list. I also did a small change to look for GTZAN._ext_audio in the pattern (is _ext_audio used elsewhere?) instead of the literal .wav. Should I do a new PR or is there a way to change them faster from here?

@vincentqb
Copy link
Contributor

Thanks for helping! :)

Just to make sure we are on the same page, there are two issues.

  1. deciding how to handle datasets that have been modified by a user (the issue discussed here in Changed GTZAN so that it only traverses filenames belonging to the dataset #791), since we currently do not provide any guarantees.
  2. making the order deterministic for any user of torchaudio.

Both issues should be addressed for all datasets, likely by modifying the common function walk_files, and so new pull requests would be needed. We should also re-align GTZAN with the other datasets to make it easier to maintain.

@mthrok
Copy link
Collaborator

mthrok commented Jul 22, 2020

Hi,

I noticed that os.listdir does not guarantee lexicographic order either so I updated my fork to sort files before adding them to the dataset list. I also did a small change to look for GTZAN._ext_audio in the pattern (is _ext_audio used elsewhere?) instead of the literal .wav. Should I do a new PR or is there a way to change them faster from here?

I started working on this in #814 so you can stay relax. Thanks.

mpc001 pushed a commit to mpc001/audio that referenced this pull request Aug 4, 2023
word_language_model: Fix Transformer init_weights
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants