
Add LibriTTS dataset #790

Merged: 10 commits, merged into pytorch:master on Jul 20, 2020
Conversation

jimchen90 (Contributor) commented Jul 16, 2020:

Add the LibriTTS dataset.

The output utterance_id is the full file name of the audio file
(for example, utterance_id = '84_121123_000007_000001').

Related to #550

Notebook

test/test_datasets.py (review thread, outdated)

def __getitem__(self, n: int) -> Tuple[Tensor, int, str, str, int, int, str]:
    fileid = self._walker[n]
    return load_libritts_item(fileid, self._path, self._ext_audio, self._ext_original_txt, self._ext_normalized_txt)
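The utterance_id format shown in the description ('84_121123_000007_000001') encodes the speaker and chapter IDs in its first two fields. A minimal, hypothetical parser (parse_utterance_id is an illustration, not part of this PR) could look like:

```python
def parse_utterance_id(utterance_id: str):
    # Hypothetical helper: the first two underscore-separated fields of a
    # LibriTTS utterance_id are the speaker and chapter IDs; the remaining
    # fields index the utterance within the chapter.
    speaker_id, chapter_id, *_ = utterance_id.split("_")
    return int(speaker_id), int(chapter_id)

print(parse_utterance_id("84_121123_000007_000001"))  # (84, 121123)
```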
Contributor:

nit: have you run black on this file?

Contributor (author):

I have now run black to reformat it.

@@ -94,6 +95,7 @@ def setUpClass(cls):
for label in cls.labels:
filename = f'{"_".join(str(l) for l in label)}.wav'
path = os.path.join(base_dir, filename)

Contributor:

nit: why is this changing?

Contributor (author):

I have changed it back.

class LIBRITTS(Dataset):
    """
    Create a Dataset for LibriTTS. Each item is a tuple of the form:
    waveform, sample_rate, original_utterance, normalized_utterance, speaker_id, chapter_id, utterance_id
Contributor:

Since the waveform is normalized, there is no bit-depth information to include in the data point. This looks good to me.

vincentqb (Contributor):

Except for the typos mentioned, LGTM. Please add the link to your notebook to the description of the pull request.

vincentqb (Contributor):

Oops: the pull request is still a draft, and one test is failing :) Let me know when the pull request is ready for review!

download_url(url, root, hash_value=checksum)
extract_archive(archive)

walker = walk_files(
Collaborator:

Please do not use walk_files; it returns files in an unpredictable order. See #794.

Contributor:

Good point. Since we are using walk_files in all other datasets, I'm ok with moving forward as it is for this pull request, and leaving the migration for a later time.
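For reference, a deterministic walk can be as simple as sorting the output of os.walk; walk_files_sorted below is a hypothetical sketch for a later migration, not torchaudio's API:

```python
import os

def walk_files_sorted(root: str, suffix: str):
    # Hypothetical deterministic replacement for walk_files: sorting
    # dirnames in place fixes os.walk's traversal order, and filenames
    # are sorted before yielding, so iteration order no longer depends
    # on the filesystem.
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort controls descent order
        for fname in sorted(filenames):
            if fname.endswith(suffix):
                yield os.path.join(dirpath, fname)
```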

codecov bot commented Jul 17, 2020:

Codecov Report

Merging #790 into master will increase coverage by 0.02%.
The diff coverage is 90.74%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #790      +/-   ##
==========================================
+ Coverage   89.87%   89.89%   +0.02%     
==========================================
  Files          34       35       +1     
  Lines        2666     2712      +46     
==========================================
+ Hits         2396     2438      +42     
- Misses        270      274       +4     
Impacted Files                    Coverage Δ
torchaudio/datasets/libritts.py   90.19% <90.19%> (ø)
torchaudio/datasets/__init__.py   100.00% <100.00%> (ø)
torchaudio/models/_wavernn.py     99.03% <100.00%> (+0.85%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 209858e...9d711d9. Read the comment docs.

def test_libritts(self):
    dataset = LIBRITTS(self.root_dir)
    samples = list(dataset)
    samples.sort(key=lambda s: s[4])
mthrok (Collaborator) commented Jul 17, 2020:

If you change walk_files to something else deterministic that returns items in lexicographical order, you do not need to perform the sort here. In #792, I found out that walk_files returns items in an unpredictable order, so I had to iterate through the whole dataset and sort it before comparing, which is very counterintuitive: I would not expect a Dataset class to return different samples for the same index.

If the dataset implementation returns items in the order you put them in, you can do the assertions directly, like:

for i, sample in enumerate(dataset):
    expected_ids = ...
    expected_data = ...
    ... 

jimchen90 (Contributor, author) commented Jul 17, 2020:

Thank you for the information. I will think about it. Is there a function you would suggest instead of walk_files?

vincentqb (Contributor) commented Jul 17, 2020:

Good catch. Let's at least add a FIXME comment linking to #792. Is there something else in this snippet that should be flagged with #792?

Collaborator:

> Thank you for the information. I will think about it. Is there a function you would suggest instead of walk_files?

FYI: in #791, the GTZAN dataset got rid of walk_files with a very simple implementation that adds pattern matching for file names. If LibriTTS also has file name patterns in the dataset, it should check those patterns, too. (I believe such a pattern matching option should be available on the utility function side, but that's a separate topic.)

Contributor:

> FYI: in #791, the GTZAN dataset got rid of walk_files with a very simple implementation that adds pattern matching for file names. If LibriTTS also has file name patterns in the dataset, it should check those patterns, too. (I believe such a pattern matching option should be available on the utility function side, but that's a separate topic.)

Thanks for pointing this out :) However, I wouldn't replicate this particular solution here until we have more general abstractions to do so. One of the strengths of our current dataset implementations is how simple they are to replicate and extend to other cases.
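For context, a pattern-matched walk in the spirit of the GTZAN rewrite (#791) fits in a few lines; find_utterances and its glob pattern are assumptions for illustration, not the actual GTZAN or LibriTTS implementation:

```python
from pathlib import Path

def find_utterances(root: str):
    # Hypothetical sketch: only files matching the assumed
    # <speaker>/<chapter>/<spk>_<chp>_<par>_<sent>.wav layout are
    # picked up, so stray files in the extracted folder are ignored,
    # and sorting makes the iteration order deterministic.
    return sorted(p.stem for p in Path(root).glob("*/*/*_*_*_*.wav"))
```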

@jimchen90 jimchen90 marked this pull request as ready for review July 17, 2020 16:54
os.makedirs(file_dir, exist_ok=True)
path = os.path.join(file_dir, filename)

data = get_whitenoise(sample_rate=8000, duration=6, n_channels=1, dtype='int16')
Collaborator:

I just realized that I forgot to add a seed to the get_whitenoise call in #792, so all the generated data are identical. Can you add a seed?

for i, utterance_id in enumerate(cls.utterance_ids):
    ...
    data = get_whitenoise(..., seed=i)
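The effect of per-item seeds can be illustrated with a small stand-in for the test utility (get_whitenoise_stub is hypothetical, not torchaudio's helper):

```python
import random

def get_whitenoise_stub(n: int = 4, seed: int = 0):
    # Stand-in for the test utility: with a distinct seed per utterance,
    # each generated sample differs from the others, yet every sample is
    # reproducible from its seed.
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

a = get_whitenoise_stub(seed=0)
b = get_whitenoise_stub(seed=1)
assert a != b                             # different seeds, different data
assert a == get_whitenoise_stub(seed=0)   # same seed, reproducible data
```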

Contributor (author):

The seed has been added.


for i, (waveform,
sample_rate,
original_utterance,


original_utterance -> original_text

Contributor (author):

This name has been changed.

for i, (waveform,
sample_rate,
original_utterance,
normalized_utterance,


normalized_utterance -> normalized_text

Contributor (author):

This name has been changed.

@jimchen90 jimchen90 requested a review from vincentqb July 18, 2020 19:59
vincentqb (Contributor):

LGTM.

Let's keep in mind that the current dataset implementations (for all of torchaudio's datasets) have two issues that this particular pull request does not address:

  • The order of the data points is not deterministic (due to os.walk mentioned above).
  • The set of files visited may be affected by modifications made to the extracted dataset folder (one solution proposed is to have some form of file name pattern matching for each dataset, mentioned above).

@jimchen90 jimchen90 merged commit 4b8aad7 into pytorch:master Jul 20, 2020
path = os.path.join(file_dir, filename)

data = get_whitenoise(sample_rate=8000, duration=6, n_channels=1, dtype='int16', seed=i)
save_wav(path, data, 8000)
Collaborator:

@jimchen90 I am checking LibriSpeech train-clean-100, but I do not see any file at 8 kHz; most of them seem to be 16000 Hz. Can you confirm?

jimchen90 (Contributor, author) commented Jul 23, 2020:

@mthrok You are right; I checked the published LibriSpeech and LibriTTS papers linked from their websites. LibriSpeech is 16 kHz and LibriTTS is 24 kHz.

The abstract of the LibriSpeech paper says: 'The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz.'

The abstract of the LibriTTS paper says: 'The released corpus consists of 585 hours of speech data at 24kHz sampling rate from 2,456 speakers and the corresponding texts.'

I will open a pull request to update this from 8000 Hz to 24 kHz.
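As a sanity check, a WAV file's sample rate can be read with the standard library's wave module; this is a generic sketch, not part of the PR (LibriTTS ships 24 kHz WAVs per the paper, so fixtures should use 24000 rather than 8000):

```python
import wave

def wav_sample_rate(path: str) -> int:
    # Read the frame rate straight from the WAV header so corpus files
    # or test fixtures can be checked without loading the audio.
    with wave.open(path, "rb") as f:
        return f.getframerate()
```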

mthrok (Collaborator) commented Jul 23, 2020 via email.

4 participants