
Add LibriTTS dataset #790

Merged: 10 commits, merged into pytorch:master on Jul 20, 2020
Conversation

jimchen90 (Contributor) commented Jul 16, 2020:

Add the LibriTTS dataset.

The output utterance_id is the full file name of the audio file
(for example, utterance_id = '84_121123_000007_000001').

Related to #550

Notebook

test/test_datasets.py (review thread, outdated)

def __getitem__(self, n: int) -> Tuple[Tensor, int, str, str, int, int, str]:
    fileid = self._walker[n]
    return load_libritts_item(fileid, self._path, self._ext_audio, self._ext_original_txt, self._ext_normalized_txt)
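The utterance_id format shown in the description ('84_121123_000007_000001') encodes the speaker and chapter IDs in its first two fields. A minimal, hypothetical parser (parse_utterance_id is an illustration, not part of this PR) could look like:

```python
def parse_utterance_id(utterance_id: str):
    # Hypothetical helper: the first two underscore-separated fields of a
    # LibriTTS utterance_id are the speaker and chapter IDs; the remaining
    # fields index the utterance within the chapter.
    speaker_id, chapter_id, *_ = utterance_id.split("_")
    return int(speaker_id), int(chapter_id)

print(parse_utterance_id("84_121123_000007_000001"))  # (84, 121123)
```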
Contributor:

nit: have you run black on this file?

Contributor (author):

I have now run black to reformat it.

@@ -94,6 +95,7 @@ def setUpClass(cls):
for label in cls.labels:
filename = f'{"_".join(str(l) for l in label)}.wav'
path = os.path.join(base_dir, filename)

Contributor:

nit: why is this changing?

Contributor (author):

I have changed it back.

class LIBRITTS(Dataset):
    """
    Create a Dataset for LibriTTS. Each item is a tuple of the form:
    waveform, sample_rate, original_utterance, normalized_utterance, speaker_id, chapter_id, utterance_id
Contributor:

Since the waveform is normalized, there is no bit-depth information to include in the data point. This looks good to me.

vincentqb (Contributor):

Except for the typos mentioned, LGTM. Please add the link to your notebook to the description of the pull request.

vincentqb (Contributor):

Oops: the pull request is still a draft, and one test is failing :) Let me know when the pull request is ready for review!

download_url(url, root, hash_value=checksum)
extract_archive(archive)

walker = walk_files(
Collaborator:

Please do not use walk_files; it returns files in an unpredictable order. See #794.

Contributor:

Good point. Since we are using walk_files in all other datasets, I'm ok with moving forward as it is for this pull request, and leaving the migration for a later time.
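For reference, a deterministic walk can be as simple as sorting the output of os.walk; walk_files_sorted below is a hypothetical sketch for a later migration, not torchaudio's API:

```python
import os

def walk_files_sorted(root: str, suffix: str):
    # Hypothetical deterministic replacement for walk_files: sorting
    # dirnames in place fixes os.walk's traversal order, and filenames
    # are sorted before yielding, so iteration order no longer depends
    # on the filesystem.
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort controls descent order
        for fname in sorted(filenames):
            if fname.endswith(suffix):
                yield os.path.join(dirpath, fname)
```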

codecov bot commented Jul 17, 2020:

Codecov Report

Merging #790 into master will increase coverage by 0.02%.
The diff coverage is 90.74%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #790      +/-   ##
==========================================
+ Coverage   89.87%   89.89%   +0.02%     
==========================================
  Files          34       35       +1     
  Lines        2666     2712      +46     
==========================================
+ Hits         2396     2438      +42     
- Misses        270      274       +4     
Impacted Files                    Coverage Δ
torchaudio/datasets/libritts.py   90.19% <90.19%> (ø)
torchaudio/datasets/__init__.py   100.00% <100.00%> (ø)
torchaudio/models/_wavernn.py     99.03% <100.00%> (+0.85%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 209858e...9d711d9. Read the comment docs.

def test_libritts(self):
    dataset = LIBRITTS(self.root_dir)
    samples = list(dataset)
    samples.sort(key=lambda s: s[4])
mthrok (Collaborator) commented Jul 17, 2020:

If you change walk_files to something else deterministic that returns items in lexicographical order, you do not need to perform the sort here. In #792, I found out that walk_files returns items in an unpredictable order, so I had to iterate through the whole dataset and sort it before comparing, which is very counterintuitive: I would not expect a Dataset class to return different samples for the same index.

If the dataset implementation returns items in the order you put them in, you can do the assertions directly, like:

for i, sample in enumerate(dataset):
    expected_ids = ...
    expected_data = ...
    ... 

jimchen90 (Contributor, author) commented Jul 17, 2020:

Thank you for the information. I will think about it. Is there a function you would suggest instead of walk_files?

vincentqb (Contributor) commented Jul 17, 2020:

Good catch. Let's at least add a FIXME comment linking to #792. Is there something else in this snippet that should be flagged with #792?

Collaborator:

> Thank you for the information. I will think about it. Is there a function you would suggest instead of walk_files?

FYI: in #791, the GTZAN dataset got rid of walk_files with a very simple implementation that adds pattern matching for file names. If LibriTTS also has file name patterns in the dataset, it should check those patterns, too. (I believe such a pattern matching option should be available on the utility function side, but that's a separate topic.)

Contributor:

> FYI: in #791, the GTZAN dataset got rid of walk_files with a very simple implementation that adds pattern matching for file names. If LibriTTS also has file name patterns in the dataset, it should check those patterns, too. (I believe such a pattern matching option should be available on the utility function side, but that's a separate topic.)

Thanks for pointing this out :) However, I wouldn't replicate this particular solution here until we have more general abstractions to do so. One of the strengths of our current dataset implementations is how simple they are to replicate and extend to other cases.
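For context, a pattern-matched walk in the spirit of the GTZAN rewrite (#791) fits in a few lines; find_utterances and its glob pattern are assumptions for illustration, not the actual GTZAN or LibriTTS implementation:

```python
from pathlib import Path

def find_utterances(root: str):
    # Hypothetical sketch: only files matching the assumed
    # <speaker>/<chapter>/<spk>_<chp>_<par>_<sent>.wav layout are
    # picked up, so stray files in the extracted folder are ignored,
    # and sorting makes the iteration order deterministic.
    return sorted(p.stem for p in Path(root).glob("*/*/*_*_*_*.wav"))
```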

@jimchen90 jimchen90 marked this pull request as ready for review July 17, 2020 16:54
os.makedirs(file_dir, exist_ok=True)
path = os.path.join(file_dir, filename)

data = get_whitenoise(sample_rate=8000, duration=6, n_channels=1, dtype='int16')
Collaborator:

I just realized that I forgot to add a seed to the get_whitenoise call in #792, so all the generated data are identical. Can you add a seed?

for i, utterance_id in enumerate(cls.utterance_ids):
    ...
    data = get_whitenoise(..., seed=i)
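The effect of per-item seeds can be illustrated with a small stand-in for the test utility (get_whitenoise_stub is hypothetical, not torchaudio's helper):

```python
import random

def get_whitenoise_stub(n: int = 4, seed: int = 0):
    # Stand-in for the test utility: with a distinct seed per utterance,
    # each generated sample differs from the others, yet every sample is
    # reproducible from its seed.
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

a = get_whitenoise_stub(seed=0)
b = get_whitenoise_stub(seed=1)
assert a != b                             # different seeds, different data
assert a == get_whitenoise_stub(seed=0)   # same seed, reproducible data
```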

Contributor (author):

The seed has been added.


for i, (waveform,
sample_rate,
original_utterance,


original_utterance -> original_text

Contributor (author):

This name has been changed.

for i, (waveform,
sample_rate,
original_utterance,
normalized_utterance,


normalized_utterance -> normalized_text

Contributor (author):

This name has been changed.

@jimchen90 jimchen90 requested a review from vincentqb July 18, 2020 19:59
vincentqb (Contributor):

LGTM.

Let's keep in mind that the current dataset implementations (for all of torchaudio's datasets) have two issues that this particular pull request does not address:

  • The order of the data points is not deterministic (due to os.walk mentioned above).
  • The set of files visited may be affected by modifications made to the extracted dataset folder (one solution proposed is to have some form of file name pattern matching for each dataset, mentioned above).

@jimchen90 jimchen90 merged commit 4b8aad7 into pytorch:master Jul 20, 2020
path = os.path.join(file_dir, filename)

data = get_whitenoise(sample_rate=8000, duration=6, n_channels=1, dtype='int16', seed=i)
save_wav(path, data, 8000)
Collaborator:

@jimchen90 I am checking LibriSpeech train-clean-100, but I do not see any file at 8 kHz; most of them seem to be 16000 Hz. Can you confirm?

jimchen90 (Contributor, author) commented Jul 23, 2020:

@mthrok You are right; I checked the published LibriSpeech and LibriTTS papers linked from their websites. LibriSpeech is 16 kHz and LibriTTS is 24 kHz.

The abstract of the LibriSpeech paper says: 'The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz.'

The abstract of the LibriTTS paper says: 'The released corpus consists of 585 hours of speech data at 24kHz sampling rate from 2,456 speakers and the corresponding texts.'

I will open a pull request to update this from 8000 Hz to 24 kHz.
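As a sanity check, a WAV file's sample rate can be read with the standard library's wave module; this is a generic sketch, not part of the PR (LibriTTS ships 24 kHz WAVs per the paper, so fixtures should use 24000 rather than 8000):

```python
import wave

def wav_sample_rate(path: str) -> int:
    # Read the frame rate straight from the WAV header so corpus files
    # or test fixtures can be checked without loading the audio.
    with wave.open(path, "rb") as f:
        return f.getframerate()
```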

mthrok (Collaborator) commented Jul 23, 2020 via email.

4 participants