-
Notifications
You must be signed in to change notification settings - Fork 664
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using Path and glob instead of walk_files #1069
Conversation
To test
|
|
torchaudio/datasets/librispeech.py
Outdated
walker = walk_files( | ||
self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True | ||
) | ||
walker = sorted([str(p.stem) for p in Path(self._path).glob('*/*/*'+self._ext_audio)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: missing whitespace around arithmetic operator :)
torchaudio/datasets/yesno.py
Outdated
walker = walk_files( | ||
self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True | ||
) | ||
walker = sorted([str(p.stem) for p in Path(self._path).glob('*.wav')]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: + self._ext_audio
instead of .wav
ljspeech also doesn't use |
Thanks for looking into this! gave minor comments, but LGTM so far :) |
@vincentqb looks like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good. Could you also remove the definition of walk_files
?
torchaudio/datasets/librispeech.py
Outdated
walker = walk_files( | ||
self._path, suffix=self._ext_audio, prefix=False, remove_suffix=True | ||
) | ||
walker = sorted([str(p.stem) for p in Path(self._path).glob('*/*/*' + self._ext_audio)]) | ||
self._walker = list(walker) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you get rid of list(walker)
as it is redundant?
I would do
self._walker = [str(p.stem) for p in Path(self._path).glob(f'*/*/*{self._ext_audio}')]
self._walker.sort()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, there's no need to make a copy of the list. It seems like using iterator+sorted is similar? Is this a fair comparison?
In [1]: %timeit sorted(range(1000, 0, -1))
14.2 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [2]: %timeit sorted(list(range(1000, 0, -1)))
16.4 µs ± 370 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: def func():
...: a = list(range(1000, 0, -1))
...: a.sort()
...:
In [4]: %timeit func()
15.4 µs ± 172 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the PR with sorted(generator), since it turns out that sorted(generator)
creates the list and then updates it in place anyway. Once the tests pass, I'll merge this PR as it is. Thanks @krishnakalyan3 for the quick follow-up :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List comprehension is not a generator, please revert it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, okay, you changed the list comprehension to a generator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The next time, please consult it with me before doing so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot find a reference that says sorted
is in-place. can you give the reference for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See implementation of sorted. Regardless, the test above points to that implementation to be faster.
VCTK does have |
68938b6
to
ce06026
Compare
Codecov Report
@@ Coverage Diff @@
## master #1069 +/- ##
==============================
==============================
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for the work! We can do VCTK and removing walk_files
as follow-ups. I'll go ahead and merge this once the tests are passed.
LGTM! Merging. Thanks again @krishnakalyan3! |
Thanks for all the awesome feedback gentlemen. The |
It was not explained in the original issue, but the fundamental reason of these changes you contributed is removing Unfortunately, the original VCTK dataset is no longer publicly available and I do not have a copy so I do not know the expected structure of directory, but we can still move the glob logic to |
@mthrok thanks for the explanation. I will work on the PR. |
* Adds files for minGPT training with DDP * filtered-clone, update script path, update readme * add refs to karpathy's repo * add training data * add AMP training * delete raw data file, update index.rst * Update gpt2_train_cfg.yaml
Hello @vincentqb,
I wanted to take an initial stab at this. I have made a minor modification for
Looks like
tedlium
,ljspeech
andvctk
(VCTK_092) don't havewalk_files
cc #1051