Add LibriTTS dataset #790
Conversation
torchaudio/datasets/libritts.py
Outdated
```python
def __getitem__(self, n: int) -> Tuple[Tensor, int, str, str, int, int, str]:
    fileid = self._walker[n]
    return load_libritts_item(fileid, self._path, self._ext_audio,
                              self._ext_original_txt, self._ext_normalized_txt)
```
nit: have you run `black` on this file?
I ran `black` to reformat it now.
test/test_datasets.py
Outdated
```python
@@ -94,6 +95,7 @@ def setUpClass(cls):
        for label in cls.labels:
            filename = f'{"_".join(str(l) for l in label)}.wav'
            path = os.path.join(base_dir, filename)
```
nit: why is this changing?
I have changed it back.
torchaudio/datasets/libritts.py
Outdated
```python
class LIBRITTS(Dataset):
    """
    Create a Dataset for LibriTTS. Each item is a tuple of the form:
    waveform, sample_rate, original_utterance, normalized_utterance, speaker_id, chapter_id, utterance_id
```
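For readers of this thread, indexing the dataset then looks roughly like the sketch below. The `'./data'` root is a placeholder, and the field names follow the docstring above (they are renamed to `original_text`/`normalized_text` later in this review):

```python
# Hypothetical usage sketch; './data' is a placeholder root path.
dataset = LIBRITTS('./data')
(waveform, sample_rate, original_utterance, normalized_utterance,
 speaker_id, chapter_id, utterance_id) = dataset[0]
```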
Since the waveform is normalized, there is no bit-depth information to include in the data point. This looks good to me.
Except for the typos mentioned, LGTM. Please add the link to your notebook to the description of the pull request.
Oops: the pull request is still a draft, and one test is failing :) Let me know when the pull request is ready for review!
```python
download_url(url, root, hash_value=checksum)
extract_archive(archive)

walker = walk_files(
```
Please do not use `walk_files`; it returns files in unpredictable order. See #794.
Good point. Since we are using `walk_files` in all other datasets, I'm OK with moving forward as-is for this pull request and leaving the migration for a later time.
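For context, a deterministic replacement could be as small as the sketch below; `walk_files_sorted` is a hypothetical helper, not part of torchaudio, and it assumes the dataset only needs file stems:

```python
from pathlib import Path

def walk_files_sorted(root, suffix='.wav'):
    # Yield file stems under `root` in lexicographical order, so a Dataset
    # built on top of it returns the same item for the same index every run.
    for path in sorted(Path(root).rglob(f'*{suffix}')):
        yield path.stem
```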
Codecov Report
```
@@            Coverage Diff             @@
##           master     #790      +/-   ##
==========================================
+ Coverage   89.87%   89.89%   +0.02%
==========================================
  Files          34       35       +1
  Lines        2666     2712      +46
==========================================
+ Hits         2396     2438      +42
- Misses        270      274       +4
```
Continue to review full report at Codecov.
```python
def test_libritts(self):
    dataset = LIBRITTS(self.root_dir)
    samples = list(dataset)
    samples.sort(key=lambda s: s[4])
```
If you change `walk_files` to something else deterministic that returns items in lexicographical order, then you do not need to perform the sort here. In #792, I found out that `walk_files` returns files in an unpredictable order, so I had to first iterate through the whole dataset and sort before performing the comparison, which is very counterintuitive: I would not expect a Dataset class to return different samples for the same index.
If the dataset implementation returns items in the order you put them in, you can do the assertion part directly, like:

```python
for i, sample in enumerate(dataset):
    expected_ids = ...
    expected_data = ...
    ...
```
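Fleshed out a little, that direct-assertion style might look like the following sketch; `expected` is a hypothetical list of `(utterance_id, waveform)` fixtures built in `setUpClass`, and deterministic ordering is assumed:

```python
import torch

for i, (waveform, sample_rate, original_text, normalized_text,
        speaker_id, chapter_id, utterance_id) in enumerate(dataset):
    expected_id, expected_waveform = expected[i]  # hypothetical fixtures
    assert utterance_id == expected_id
    assert torch.equal(waveform, expected_waveform)
```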
Thank you for the information. I will think about it. Can you suggest a function to use instead of `walk_files`?
> Thank you for the information. I will think about it. Can you suggest a function to use instead of `walk_files`?
FYI: in #791 the GTZAN dataset got rid of `walk_files` with a very simple implementation that adds matching for file-name patterns. If LibriTTS also has patterns in the dataset, it should check those patterns, too. (I believe such a pattern-matching option should be available on the utility-function side, but that's a separate topic.)
> FYI: in #791 the GTZAN dataset got rid of `walk_files` with a very simple implementation that adds matching for file-name patterns. If LibriTTS also has patterns in the dataset, it should check those patterns, too. (I believe such a pattern-matching option should be available on the utility-function side, but that's a separate topic.)
Thanks for pointing this out :) However, I wouldn't replicate this particular solution here until we have more general abstractions to do so. One of the strengths of the dataset implementations we currently have is how simple they are to replicate and extend to other cases.
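For reference only, a GTZAN-style pattern check for LibriTTS could look like the sketch below; the regex is an assumption based on ids like '84_121123_000007_000001', and `find_utterances` is hypothetical, not part of torchaudio:

```python
import os
import re

_UTTERANCE_PATTERN = re.compile(r'\d+_\d+_\d+_\d+\.wav$')

def find_utterances(root):
    # Walk the tree in a deterministic order and keep only files whose
    # names match the assumed LibriTTS utterance-id pattern.
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            if _UTTERANCE_PATTERN.match(name):
                yield os.path.join(dirpath, name)
```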
test/test_datasets.py
Outdated
```python
os.makedirs(file_dir, exist_ok=True)
path = os.path.join(file_dir, filename)

data = get_whitenoise(sample_rate=8000, duration=6, n_channels=1, dtype='int16')
```
I just realized that I forgot to add `seed` to the `get_whitenoise` function call in #792, so all the data generated are identical. Can you add `seed`?

```python
for i, utterance_id in enumerate(cls.utterance_ids):
    ...
    data = get_whitenoise(..., seed=i)
```
The seed has been added.
test/test_datasets.py
Outdated
```python
for i, (waveform,
        sample_rate,
        original_utterance,
```
`original_utterance` -> `original_text`
This name has been changed.
test/test_datasets.py
Outdated
```python
for i, (waveform,
        sample_rate,
        original_utterance,
        normalized_utterance,
```
`normalized_utterance` -> `normalized_text`
This name has been changed.
LGTM.
Let's keep in mind that current dataset implementations (for all of torchaudio's datasets) have two issues that are not addressed in this particular pull request: the unpredictable file ordering that comes from `walk_files` (#794), and the lack of file-name pattern checking discussed above.
```python
path = os.path.join(file_dir, filename)

data = get_whitenoise(sample_rate=8000, duration=6, n_channels=1, dtype='int16', seed=i)
save_wav(path, data, 8000)
```
@jimchen90 I am checking LibriSpeech `train-clean-100`, but I do not see a file at 8 kHz. Most of them seem to be 16000 Hz. Can you confirm?
@mthrok You are right. I checked the published papers of LibriSpeech and LibriTTS linked from their websites; they are 16 kHz for LibriSpeech and 24 kHz for LibriTTS.
The abstract of the LibriSpeech paper mentions: 'The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz.'
The abstract of the LibriTTS paper mentions: 'The released corpus consists of 585 hours of speech data at 24kHz sampling rate from 2,456 speakers and the corresponding texts.'
I will open a pull request to update this from 8000 to 24000.
Okay, I just moved the test script in #817, so make sure that you branch from the latest upstream commits.
Also, please check the number of channels, too.
You can also shorten the duration (it does not really matter, though); 6 was picked for the YESNO dataset as that's the expected length there.
moto 🛵
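Putting those points together, the fixture generation might end up like the sketch below (24 kHz per the LibriTTS paper, mono, and a shorter duration), using the same `get_whitenoise`/`save_wav` test utilities as the diff above; the 2-second duration is an arbitrary choice:

```python
sample_rate = 24000  # LibriTTS's published sampling rate
data = get_whitenoise(sample_rate=sample_rate, duration=2, n_channels=1,
                      dtype='int16', seed=i)
save_wav(path, data, sample_rate)
```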
Add the LibriTTS dataset.
The output `utterance_id` is the full name of the audio file (for example, `utterance_id` = '84_121123_000007_000001'). Related to #550.
Notebook
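A quick illustration of that naming convention; the names of the last two fields are assumptions for this sketch (the dataset tuple only exposes `speaker_id`, `chapter_id`, and the full `utterance_id`):

```python
utterance_id = '84_121123_000007_000001'
# speaker and chapter match the dataset tuple; the last two field
# names are assumed for illustration
speaker_id, chapter_id, paragraph_id, sentence_id = utterance_id.split('_')
```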