Add tedlium dataset (all 3 releases) #882
Conversation
Pinging @vincentqb
Hi @jiwidi
Thanks for the contribution, and thanks for opening the PR in the early stage. This will make our review process easier and more fruitful.
I see you mostly follow the existing Dataset implementations, and that's a good start, but I found a lot of room for improvement in these existing implementations, so I added comments about them. Let me know what you think.
torchaudio/datasets/tedlium.py
Outdated
root: str,
release: str = RELEASE,
subset: str = None,
folder_in_archive: str = _RELEASE_CONFIGS[RELEASE]["folder_in_archive"],
Can you remove folder_in_archive from the signature? There is no necessity or benefit in making this a variable. Also, this variable is never used, as it's overwritten at the beginning of the constructor.
Yep, there is no reason why it would stay 👍
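For illustration, a hedged sketch of the trimmed signature, with folder_in_archive derived inside the constructor instead of being exposed as a parameter; the config dictionary shown is only an illustrative subset, not the PR's actual table:

    import os

    _RELEASE_CONFIGS = {"release1": {"folder_in_archive": "TEDLIUM_release1"}}  # illustrative subset

    class TEDLIUM:
        def __init__(self, root: str, release: str = "release1", subset: str = None, download: bool = False) -> None:
            # Derived from the release, not accepted as an argument
            folder_in_archive = _RELEASE_CONFIGS[release]["folder_in_archive"]
            self._path = os.path.join(root, folder_in_archive)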
torchaudio/datasets/tedlium.py
Outdated
download_url(url, root, hash_value=checksum)
extract_archive(archive)

walker = walk_files(self._path, suffix=self._ext_txt, prefix=False, remove_suffix=True)
I am advising people not to use walk_files, because this function prevents Dataset implementations from designating which directories they should look at, and it may traverse directories that are totally unnecessary and irrelevant, which is behavior that should be avoided.
Also, since you need new logic for going over the STM files, the fact that walk_files would need to be extended shows that it is not suitable for the purposes of this Dataset. So I suggest adding logic that does just the right amount of work, which is simply going over the subdirectories and checking the files. Please see another example here. Following this advice, @Abhi011999 found some data missing from VCTK 0.92, which is another reason walk_files should not be used: it obscures what the Dataset implementation should be looking at.
I am downloading all TEDLIUM datasets now so that we can discuss the details of how to traverse these directories. I will give my thoughts once I see the dataset structures.
That would be perfect; let's have a discussion about it when you have the data.
torchaudio/datasets/tedlium.py
Outdated
def load_tedlium_item(
    fileid: str, line: int, path: str, ext_audio: str, ext_txt: str
) -> Tuple[Tensor, int, str, int, int, int]:
Can you put this as a method of the TEDLIUM class? That way user code can customize the loading behavior.
Yes, makes sense
torchaudio/datasets/tedlium.py
Outdated
)

wave_path = os.path.join(path, "sph/", fileid)
waveform, sample_rate = torchaudio.load(wave_path + ext_audio)
Can you add a method that wraps torchaudio.load to the TEDLIUM class?
    def load_audio(self, path):
        return torchaudio.load(path)
The problem with using bare torchaudio.load is that the user cannot customize the loading behavior (such as disabling normalization or swapping the channels dimension). We could also pass such parameters down from the constructor of the TEDLIUM class, but that would make the constructor signature ugly.
Yes, makes sense
@mthrok -- What's the advantage of doing this over having torchaudio.load inside the load-item function? Is your suggestion to also eventually have a load_text method? One property of having a single load-item function was that it factored the item loading as a pure function taking an identifier and returning a data point.
@mthrok -- What's the advantage of doing this over having torchaudio.load inside the load-item function? Is your suggestion to also eventually have a load_text method? One property of having a single load-item function was that it factored the item loading as a pure function taking an identifier and returning a data point.
As far as I understood it, he wants to give the user the option to override the dataset's load function and use the optional parameters if they feel like it.
So the user could do:
    def custom_load(path):
        return torchaudio.load(path, normalize=False, channels_first=False)

    dt = torchaudio.datasets.TEDLIUM()
    dt.load_audio = custom_load
As normalize and channels_first are True by default and the load_item function doesn't change them, the user currently has no way to change them.
Another example is Windows users, who cannot use torchaudio.load because torchaudio on Windows does not support the sph format, so they need to provide something else in the custom load method.
As normalize and channels_first are True by default and the load_item function doesn't change them, the user has no ability to change them.
If this is indeed something desired, we could also consider exposing those parameters to the constructor instead. I wouldn't say this is enough alone to change the implementation through the addition of a load method.
In torchaudio, we went with the convention that waveforms are between -1 and 1 with batch first, see readme. This is in particular to avoid having to carry such parameters everywhere.
Another example is Windows users, who cannot use torchaudio.load because torchaudio on Windows does not support the sph format, so they need to provide something else in the custom load method.
This is an interesting point, and there's precedent for this for datasets using mp3. I see two paths for this:
- Offering a backend that supports sph on Windows
- Decoupling torchaudio.load from torchaudio.datasets
In the former, a user who decides to change the datasets would copy-paste the code and modify to their heart's content. In the latter, we need to decide how we would standardize this. Is any of torch{text,vision} doing something like this already? Thoughts?
I would say offering support for sph on Windows would be better: less hassle and moving around for the user. In the end, the users who use these built-in dataset loaders are the ones looking for a working, easy-to-use solution; if they want to customize loading or anything else, they will probably rewrite a dataset on their own.
torchaudio/datasets/tedlium.py
Outdated
waveform, sample_rate, utterance, speaker_id, chapter_id, utterance_id
"""

_ext_txt = ".stm"
I do not see a necessity or benefit in making this extension a variable. STM is an established format, and since it's text, any platform (including Windows) should be able to open it, unlike ".sph", which torchaudio on Windows cannot load.
It is used for the file walker, as we look for files with this extension, and I kept it as a variable for consistency with the LibriSpeech dataset:
audio/torchaudio/datasets/librispeech.py, lines 75 to 76 in c692fe9:
    _ext_txt = ".trans.txt"
    _ext_audio = ".flac"
It will probably go unused depending on what logic we decide on to iterate through the files. Can we take a look at this when we are done with the logic?
It is used for the file walker, as we look for files with this extension
Even if that's the case, I think it can be hardcoded where the walker function is called. However,
It will probably go unused depending on what logic we decide on to iterate through the files. Can we take a look at this when we are done with the logic?
Yeah, we can defer the change until then.
I moved the _ext_audio initialization to the constructor in my upcoming commit, so I removed this one as well. We can always go back.
torchaudio/datasets/tedlium.py
Outdated
walker = walk_files(self._path, suffix=".stm", prefix=False, remove_suffix=True)
self._walker = list(walker)
self._extended_walker = []
I'm not sure why we need an extended walker in place of updating the walker. It seems like this simply extracts information from a file. Could this be done in the load-item function?
It does simply extract information from the files contained in the original walker. I created it because we need this information before calling the load-item function, so that the dataset contains as many samples as there are total lines across all files (one line is one sentence) instead of as many samples as there are files.
This dataset is a bit tricky because each file contains a full TED talk, and no one in speech recognition (where this dataset is usually used) treats a full talk as a training sample. We split the transcript file and the audio file into multiple samples (sentences).
This means that once we instantiate the dataset we need to know how many samples we have (lines across all files), so the user can index any sample N at any time or shuffle the dataset. If we were to implement this in the load function, this functionality would be much harder. See the original post, where I explain in detail how the files are iterated.
I'm happy to discuss which method is best to achieve this functionality, but first I'm checking whether you understand why this extended walker is here; maybe I didn't explain it clearly in the original post.
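For context, each STM line describes one utterance segment. Below is a small, made-up example of parsing such a line; the field order follows the split used later in this thread's test code, not an official spec quoted here:

    # Illustrative STM line (not taken from the real dataset)
    line = "TalkID_2010 1 SpeakerID 17.22 21.53 <o,f0,male> this is one sentence of the talk"
    talk_id, _, speaker_id, start_time, end_time, identifier, transcript = line.split(" ", 6)
    # start_time/end_time are seconds into the talk's audio file; transcript is the sentence text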
Thanks for adding details :) What I meant is more about the particular use of _walker and _extended_walker. I would have suggested something like:
    self._walker = []
    for file in walk_files(self._path, suffix=".stm", prefix=False, remove_suffix=True):
        stm_path = os.path.join(self._path, "stm", file + ".stm")
        with open(stm_path) as f:
            l = len(f.readlines())
        self._walker.extend((file, line) for line in range(l))
or, to stick with generators,
    def _generate_walker(...):
        for file in walk_files(self._path, suffix=".stm", prefix=False, remove_suffix=True):
            stm_path = os.path.join(self._path, "stm", file + ".stm")
            with open(stm_path) as f:
                for line in range(len(f.readlines())):
                    yield (file, line)

    self._walker = list(_generate_walker(...))
Also, why just indexing and not storing the line directly, since the file is already being read? How big is the total text?
    def _generate_walker(...):
        for file in walk_files(self._path, suffix=".stm", prefix=False, remove_suffix=True):
            stm_path = os.path.join(self._path, "stm", file + ".stm")
            with open(stm_path) as f:
                for line in f.readlines():
                    yield (file, line)

    self._walker = list(_generate_walker(...))
Oh, I see what you mean now; yes, it will be cleaner to stick to the _walker name for the resulting walker.
I went ahead and checked the size of the walker list for release 3 (the biggest):
    import numpy as np
    np.array(test_dataset._extended_walker).nbytes
It is 316733528 bytes (~316 MB) when storing the full line and 20449080 bytes (~20 MB) with just the line identifier. I would say that is a considerable size, but it does remove an I/O operation each time we load an item. Does any other dataset take this much space in memory?
Regarding sticking with generators or not: since we convert it to a list anyway, the generator just adds more logic that we don't use. I would go with
    self._walker = []
    for file in walk_files(self._path, suffix=".stm", prefix=False, remove_suffix=True):
        stm_path = os.path.join(self._path, "stm", file + ".stm")
        with open(stm_path) as f:
            l = len(f.readlines())
        self._walker.extend((file, line) for line in range(l))
just because it is easier to read.
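For context, a rough sketch of how __getitem__ would then consume those (file, line) pairs; the _load_tedlium_item name follows the later discussion in this thread, and the exact details are assumptions:

    def __getitem__(self, n):
        # Each walker entry pairs an STM file id with a line number inside that file
        fileid, line = self._walker[n]
        return self._load_tedlium_item(fileid, line, self._path)

    def __len__(self):
        return len(self._walker)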
torchaudio/datasets/tedlium.py
Outdated
extract_archive(archive)

walker = walk_files(self._path, suffix=".stm", prefix=False, remove_suffix=True)
self._walker = list(walker)
In our current implementation, this would simply be the last command manipulating the walker (or the extended walker here).
Do you mean the lexicon file such as |
torchaudio/datasets/tedlium.py
Outdated
stm_path = os.path.join(self._path, "stm", file + ".stm")
with open(stm_path) as f:
    l = len(f.readlines())
self._extended_walker += [(file, line) for line in range(l)]
I checked the contents of STM files. For the case of release-3
$ wc -l TEDLIUM_release1/*/stm/*.stm | tail -1
58863 total
$ wc -l TEDLIUM_release2/*/stm/*.stm | tail -1
95033 total
$ wc -l TEDLIUM_release-3/data/stm/* | tail -1
268263 total
So if loading the whole release-3 dataset, self._extended_walker could take a couple of megabytes of memory.
Also, this will open the following number of files:
$ find TEDLIUM_release1 -type f -name '*.stm' | wc -l
793
$ find TEDLIUM_release2 -type f -name '*.stm' | wc -l
1514
$ find TEDLIUM_release-3 -type f -name '*.stm' | wc -l
2370
For the release-3 dataset, this initialization process will go over 2370 files sequentially. How long does this take?
Oh, is this what you report as 24 microseconds?
Yes, I reported the time of initializing the dataset, which involves going over all the files counting lines. I think that is quite fast and we shouldn't worry about it.
About the memory it occupies, I did a quick test for a comment from @vincentqb just above this one. I checked how much the walker would occupy if we stored the full line instead of just the line identifier, since we already loop over the file anyway, saving some I/O when loading each item:
I went ahead and checked the size of the walker list for release 3 (the biggest):
    import numpy as np
    np.array(test_dataset._extended_walker).nbytes
It is 316733528 bytes (~316 MB) when storing the full line and 20449080 bytes (~20 MB) with just the line identifier. I would say that is a considerable size, but it does remove an I/O operation each time we load an item. Does any other dataset take this much space in memory?
Yes, I reported the time of initializing the dataset, which involves going over all the files counting lines. I think that is quite fast and we shouldn't worry about it.
I did a quick benchmark on my end. Script:
    import time
    import torchaudio

    torchaudio.set_audio_backend('sox_io')

    n_rep = 100
    t0 = time.monotonic()
    for _ in range(n_rep):
        ds = torchaudio.datasets.TEDLIUM("../dataset/tedlium", release="release3", download=False)
    t1 = time.monotonic()
    print(len(ds._walker))
    print((t1 - t0) / n_rep)
Result:
    $ python foo.py
    268263
    0.21009648161940275
I did not get anywhere near microseconds, but it was only 0.2 seconds, so I agree that speed is not an issue here.
Regarding the memory consumption, I talked with @cpuhrsch; 20 megabytes is not a big deal. However, be aware that when using multiple DataLoader worker processes, each process will consume the same amount of memory.
Also, DataLoader has a memory leakage issue, pytorch/pytorch#13246 (current workaround: pytorch/pytorch#13246 (comment)).
Yes, that |
torchaudio/datasets/tedlium.py
Outdated
def __init__(
    self, root: str, release: str = "release1", subset: str = None, download: bool = False, audio_ext=".sph"
) -> None:
    """Constructor for TEDLIUM dataset
Can you move this docstring to the class-level docstring and merge them?
Also, you need to add this class to docs/source/datasets.rst for it to show up in the docs.
I just checked the other datasets; they have a similar docstring under the class description.
We have
"""
Create a Dataset for Tedlium. Each item is a tuple of the form:
[waveform, sample_rate, transcript, talk_id, speaker_id, identifier]
"""
Librispeech:
"""
Create a Dataset for LibriSpeech. Each item is a tuple of the form:
waveform, sample_rate, utterance, speaker_id, chapter_id, utterance_id
"""
Are you sure you want to move the big constructor docstring to class-level?
Are you sure you want to move the big constructor docstring to class-level?
Yes. The fact that the other datasets lack documentation for the constructor is not user-friendly.
It is better if each Dataset explains what options a user can change.
I will update the other datasets' documentation.
torchaudio/datasets/tedlium.py
Outdated
Args:
    root (str): Path containing dataset or target path where its downloaded if needed
    release (str, optional): TEDLIUM identifier (release1,release2,release3). Defaults to RELEASE.
    subset (str, optional): Subset of data(train,test,dev) supported for release 1,2. Defaults to Train/None.
I think it should be stricter about the subset value. For releases 1 and 2 the allowed values are train/dev/test, but for release 3 it has to be None.
Could you add a value check and raise a ValueError if it's not the right one for the given release?
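A minimal sketch of such a check, assuming it sits near the top of the constructor; the allowed values follow the comment above, while the exact error messages and the handling of a None default for releases 1 and 2 are assumptions:

    if release in ("release1", "release2"):
        if subset not in (None, "train", "dev", "test"):
            raise ValueError("subset must be 'train', 'dev' or 'test' (or None) for release 1 and 2")
    elif release == "release3":
        if subset is not None:
            raise ValueError("release3 has no subsets; pass subset=None")
    else:
        raise ValueError("release must be one of 'release1', 'release2', 'release3'")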
torchaudio/datasets/tedlium.py
Outdated
self._walker = []

# Create walker for all samples
for file in walk_files(self._path, suffix=".stm", prefix=False, remove_suffix=True):
Can we avoid using walk_files and just list the stm files? I checked the dataset and we only need to check one directory per configuration:
TEDLIUM_release1/dev/stm/*.stm
TEDLIUM_release1/test/stm/*.stm
TEDLIUM_release1/train/stm/*.stm
TEDLIUM_release2/dev/stm/*.stm
TEDLIUM_release2/test/stm/*.stm
TEDLIUM_release2/train/stm/*.stm
TEDLIUM_release-3/data/stm/*.stm
That is true. I used the walker since I saw it was used in other datasets, but listing all stm files will do the job, as all of them are supposed to be inside the same folder.
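A minimal sketch of that approach, assuming each configuration keeps all of its transcripts in a single stm directory as listed above; the helper name is illustrative:

    import os

    def list_stm_files(split_dir):
        # Look only inside e.g. TEDLIUM_release1/train/stm instead of walking the whole tree
        stm_dir = os.path.join(split_dir, "stm")
        return sorted(f[: -len(".stm")] for f in os.listdir(stm_dir) if f.endswith(".stm"))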
.. autoclass:: TEDLIUM
   :members: __getitem__
   :special-members: get_phoneme_dict
@mthrok What does special-members mean here? I interpret it as extra functions to include in the docs? That's why I included get_phoneme_dict.
These are Sphinx directives. Check out their documentation:
- :members: is where you list the members you want to document
- :special-members: is where you list special methods like __init__, __len__, __getitem__, etc.
I think the other documentation is wrong (__getitem__ should be under :special-members:, but it will not show up either way because they don't have a docstring).
I think you can just do .. autoclass:: TEDLIUM and the rest (get_phoneme_dict) will be handled.
You can build the documentation and check what the resulting documentation looks like:
cd docs
pip install -r requirements.txt
make html
# open ./build/html/index.html
torchaudio/datasets/tedlium.py
Outdated
from torchaudio.datasets.utils import (
    download_url,
    extract_archive,
    walk_files,
You can remove walk_files from the imports too.
torchaudio/datasets/tedlium.py
Outdated
# Create walker for all samples
self._walker = []
stm_path = os.path.join(self._path, "stm")
for file in os.listdir(stm_path):
Can you sort the files? Different OSes return files in a different order.
    for file in os.listdir(stm_path):
    for file in sorted(os.listdir(stm_path)):
The code looks much better now.
@jiwidi What else is left for this PR?
torchaudio/datasets/tedlium.py
Outdated
content = line.strip().split(maxsplit=1)
self.phoneme_dict[content[0]] = content[1:]  # content[1:] can be empty list

def load_tedlium_item(self, fileid: str, line: int, path: str) -> Tedlium_item:
Can you also mark this function as private?
def load_tedlium_item(self, fileid: str, line: int, path: str) -> Tedlium_item:
def _load_tedlium_item(self, fileid: str, line: int, path: str) -> Tedlium_item:
Tests! If we are okay with the logic for iterating files, I'll go ahead and create some tests based on the last updates I saw you make to the other dataset tests.
torchaudio/datasets/tedlium.py
Outdated
wave_path = os.path.join(path, "sph", fileid)
waveform, sample_rate = self._load_audio(wave_path + self._ext_audio, start_time=start_time, end_time=end_time)

return Tedlium_item(waveform, sample_rate, transcript, talk_id, speaker_id, identifier)
I can't find the discussion about NamedTuple, so I'm leaving a new comment here :)
There has been prior discussion about changing the type of data points (and in particular using NamedTuple), see internal document. Though there may be value in redefining what a data point is, I feel the discussion would delay this pull request, so I would avoid it for now.
I can't access the document, but regardless I have no problem with changing it back to the previous version.
We can delay the NamedTuple change to a PR that changes multiple datasets/files at once, rather than doing them one by one and ending up with inconsistent code. What do you think?
We can delay the NamedTuple change to a PR that changes multiple datasets/files at once, rather than doing them one by one and ending up with inconsistent code. What do you think?
Yes, I agree with your suggestion. It would be preferable to change them all at the same time once we settle on a specific format.
I oppose using a plain tuple here. Six heterogeneous items as a tuple give a very bad user experience and make code harder to read/maintain. Since a named tuple is compatible with a tuple, there is virtually no disadvantage to using a named tuple here.
The main disadvantages of using a namedtuple here are (1) inconsistent behavior with other datasets in audio and elsewhere, (2) it would make the user leverage an API that could change in the near future, and (3) it will slow down this particular pull request.
The reason we did not standardize around NamedTuple in the past was that it didn't allow for easy introspection of keys, and each dataset would have its own DataPoint NamedTuple. Please see the internal document for more discussion.
We are also using tuples with many items in other torchaudio datasets. That being said, we can revisit these design choices. Thoughts @zhangguanheng66 @fmassa @cpuhrsch ?
(1) inconsistent behavior with other datasets in audio and elsewhere
Can you elaborate on this? Each dataset returns a different number of items, so it does not look like they could be swapped easily in the first place. And using NamedTuple with consistent key names will make it possible to give a consistent user experience.
(2) it would make the user leverage an API that could change in the near future
Not sure what this means. Can you elaborate?
(3) it will slow down this particular pull request.
It's already a NamedTuple, so it seems that changing it to a tuple is what would slow down the process.
The reason we did not standardize around NamedTuple in the past was that it didn't allow for easy introspection of keys
When is it necessary or useful to perform key introspection?
each dataset would have its own DataPoint NamedTuple.
That sounds like the right approach. Each dataset returns a different number of items of different types, so they should be typed differently.
(1) inconsistent behavior with other datasets in audio and elsewhere
Can you elaborate on this? Each dataset returns a different number of items, so it does not look like they could be swapped easily in the first place. And using NamedTuple with consistent key names will make it possible to give a consistent user experience.
I was really just referring to using NamedTuple in this particular pull request. Moving toward consistently using NamedTuple in torchaudio/text/vision is fine by me.
(2) it would make the user leverage an API that could change in the near future
Not sure what this means. Can you elaborate?
Since we have not, as a team (torchaudio/text/vision), revisited the decision of using Tuple, using NamedTuple here exposes us to changing the API again should we settle on another option.
(3) it will slow down this particular pull request.
It's already a NamedTuple, so it seems that changing it to a tuple is what would slow down the process.
But we should avoid using NamedTuple until we, as a team, have made a decision on the API for a data point. The current decision had been Tuple, but I'm personally fine with revisiting that past decision.
The reason we did not standardize around NamedTuple in the past was that it didn't allow for easy introspection of keys
When is it necessary or useful to perform key introspection?
each dataset would have its own DataPoint NamedTuple.
That sounds like the right approach. Each dataset returns a different number of items of different types, so they should be typed differently.
I don't feel too strongly about either point. I am really just listing the reasons that we left in the internal document. :)
Hi @jiwidi Thanks for pinging me again, and sorry for missing your reply in the first place.
So because of 2., it is preferable if we can test the dataset implementation with the WAV format, even though the real dataset is in SPH (and I believe converting the entire dataset into WAV is also a valid use case for end users). For that purpose we make the dataset implementation able to change the target extension. So can you write your test with
torchaudio/datasets/tedlium.py
Outdated
Tedlium_item = namedtuple(
    "Tedlium_item", ["waveform", "sample_rate", "transcript", "talk_id", "speaker_id", "identifier"]
)
No problem!
torchaudio/datasets/tedlium.py
Outdated
""" | ||
start_time = int(float(start_time) * 16000) | ||
end_time = int(float(end_time) * 16000) | ||
return torchaudio.load(path, frame_offset=start_time, num_frames=end_time - start_time) |
I just realized that this will only work with the sox_io backend. Could you add a fall-back for the other backends?
    if torchaudio.get_audio_backend() == 'sox_io':
        return torchaudio.load(path, frame_offset=start_time, num_frames=end_time - start_time)
    return torchaudio.load(path)[:, start_time:end_time]
The current sox backend also supports reading a file segment. Not soundfile? If the datasets need to be aware of the backend, then we should see how we could evolve the backends so that this does not need to happen. Thoughts?
torchaudio.load(path)[:, start_time:end_time] will still work for both backends as far as I understand, but @mthrok pointed out it is more efficient to use torchaudio.load(path, frame_offset=start_time, num_frames=end_time - start_time) if sox_io is enabled.
We can either:
- Add support in the other backends for the same parameter names, so the optimized loading can be used with all of them (this could be better for the long term, unifying the backends), or
- Use the slicing operator and support all backends right away (see the sketch below).
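As a rough, non-authoritative sketch of how the dataset's audio-loading helper could combine both options — the frame conversion mirrors the snippet above, while the default sample rate and exact signature are assumptions:

    import torchaudio

    def _load_audio(self, path, start_time, end_time, sample_rate=16000):
        # STM times are given in seconds; convert them to frame indices
        start_frame = int(float(start_time) * sample_rate)
        end_frame = int(float(end_time) * sample_rate)
        if torchaudio.get_audio_backend() == "sox_io":
            # sox_io can decode only the requested segment
            return torchaudio.load(path, frame_offset=start_frame, num_frames=end_frame - start_frame)
        # Fallback for other backends: decode the whole file, then slice the waveform
        waveform, sr = torchaudio.load(path)
        return waveform[:, start_frame:end_frame], sr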
The current sox backend also supports reading a file segment. Not soundfile? If the datasets need to be aware of the backend, then we should see how we could evolve the backends so that this does not need to happen. Thoughts?
We have already planned to resolve this in the 0.9.0 release. See #903.
Don't worry, you've been very helpful. I will test it later today and get back if there are any problems; otherwise a commit will follow :) I think I can make it work with wav files as well, but since this is the only interaction of the library with sph files, I see value in keeping the test with the sph format, as it will serve as a check for this functionality. If we check for sph support in other tests, then I have no preference between wav and sph. Edit: saving sph files works perfectly with
load/save/info functions on
If you decide to test sph, then the following is (only the gist of) my recommendation.
    def _generate_dataset(root_dir, audio_ext, dataset_version, <other parameters, such as dataset versions>) -> List[tuple]:
        # Generates dataset with the given extension
        for foo in whatever:
            ...
            # Switch saving function
            if audio_ext == "wav":
                save_wav(file_path, data, sample_rate)  # use scipy-based test utility for wav
            else:
                torchaudio.save(file_path, data, sample_rate)  # use sox_io backend,
                # assuming that the backend is properly set.
                # See backend="sox_io" below for this.
        # Returns expected data

    # Note: Probably better to define a different test class for different TEDLIUM versions.
    class TestTedliumWav(TempDirMixin, TorchaudioTestCase):
        backend = "default"  # <- this test should work on all supported platforms

        @classmethod
        def setUpClass(cls):
            cls.root_dir = cls.get_base_temp_dir()
            _generate_dataset(cls.root_dir, "wav", ...)

        # define the test
        def testFoo(self):
            ...

    class TestTedliumSph(TempDirMixin, TorchaudioTestCase):
        backend = "sox_io"  # <- this test will run only on platforms where the "sox_io" backend is available

        @classmethod
        def setUpClass(cls):
            cls.root_dir = cls.get_base_temp_dir()
            _generate_dataset(cls.root_dir, "sph", ...)

        # define the test
        def testFoo(self):
            ...
Glad it works :)
I just pushed the test for tedlium. I saw the sph load is already tested elsewhere, as you mentioned, so I kept the test with
For this new commit I also changed the phoneme dictionary to be read only if the user actually calls the
The testing is very similar to the LibriSpeech one: I try to recreate the three releases' folder structure in a temporary folder and put some white noise there, then instantiate the TEDLIUM datasets and check that all the data is retrieved correctly when iterating over them.
@mthrok is it normal that this many unit tests fail due to download/connectivity issues? The tests run on my side, but we have to make sure they also run in the CI pipeline.
torchaudio/datasets/tedlium.py
Outdated
""" | ||
# Read phoneme dictionary | ||
if not hasattr(self, "phoneme_dict"): | ||
self.phoneme_dict = {} |
This design pattern might leave users with questions like "Which one should I use, phoneme_dict or get_phoneme_dict?" and "What are the differences?"
The following techniques are often used to improve this:
- Use an internal attribute name like _phoneme_dict. This does not prevent users from accessing the underlying object directly, but accessing an attribute prefixed with an underscore clearly sends the message that user code is not doing the right thing, so it's okay.
- Use the property decorator. Lazy initialization often fits well in a property attribute. Something like this:
    @property
    def phoneme_dict(self):
        if self._phoneme_dict is None:
            self._phoneme_dict = dict()
            # Fill the dictionary
        return self._phoneme_dict.copy()  # see below for `copy`
Also, instead of checking whether it's initialized with hasattr, it's more straightforward to initialize the actual attribute in the constructor as None, such as self._phoneme_dict = None, and perform the check as if self._phoneme_dict is None:.
Another concern is returning the dictionary reference. The user code might accidentally modify the dictionary, which can lead to a very subtle bug. Maybe it's better to return a copy of the dictionary, as return self._phoneme_dict.copy() (however, this is still a shallow copy, so the values point to the original lists and user code can still modify the lists; changing them to tuples might prevent such unintended modification, like self.phoneme_dict[content[0]] = tuple(content[1:])).
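Putting those suggestions together (a None sentinel set in the constructor, a property, tuple values, and returning a copy), a hedged sketch could look like the following; it is illustrative rather than the final implementation, and the real class of course takes more constructor arguments:

    from typing import Dict, Optional, Tuple

    class TEDLIUM:
        def __init__(self, dict_path: str) -> None:
            self.dict_path = dict_path
            self._phoneme_dict: Optional[Dict[str, Tuple[str, ...]]] = None  # filled lazily

        @property
        def phoneme_dict(self) -> Dict[str, Tuple[str, ...]]:
            if self._phoneme_dict is None:
                self._phoneme_dict = {}
                with open(self.dict_path, "r", encoding="utf-8") as f:
                    for line in f:
                        content = line.strip().split(maxsplit=1)
                        self._phoneme_dict[content[0]] = tuple(content[1:])  # may be an empty tuple
            return self._phoneme_dict.copy()  # shallow copy so callers cannot replace our entries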
Really good points, a commit with all the suggestions will follow 👍
@classmethod
def tearDownClass(cls):
    # In case of test failure
    tedlium.TEDLIUM._ext_audio = ".flac"
This hack is required in the case of librispeech_test, because that dataset does not receive any audio extension in the constructor. TEDLIUM simply accepts "audio_ext", so you do not need to do this (nor tedlium.TEDLIUM._ext_audio = ".wav"); you can just do dataset = tedlium.TEDLIUM(self.root_dir, audio_ext=".wav").
def test_tedlium(self):
    tedlium.TEDLIUM._ext_audio = ".wav"
    dataset = tedlium.TEDLIUM(self.root_dir)
Since the mocked datasets are generated for all the releases, how about testing all the releases (as separate test cases/methods)?
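A hedged sketch of what such per-release test methods could look like; _assert_dataset stands in for whatever shared checking helper the test file ends up using and is hypothetical:

    def test_tedlium_release1(self):
        dataset = tedlium.TEDLIUM(self.root_dir, release="release1", audio_ext=".wav")
        self._assert_dataset(dataset, self.samples["release1"])  # hypothetical shared helper

    def test_tedlium_release2(self):
        dataset = tedlium.TEDLIUM(self.root_dir, release="release2", audio_ext=".wav")
        self._assert_dataset(dataset, self.samples["release2"])

    def test_tedlium_release3(self):
        dataset = tedlium.TEDLIUM(self.root_dir, release="release3", audio_ext=".wav")
        self._assert_dataset(dataset, self.samples["release3"])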
for i, utterance in enumerate(UTTERANCES):
    talk_id, _, speaker_id, start_time, end_time, identifier, transcript = utterance.split(" ", 6)
    start_time = int(float(start_time)) * 16000
    end_time = int(float(end_time)) * 16000
Since there is a variable sample_rate = 16000 defined above, why not use it here? If something has to be changed in the future, it will be easier that way.
good catch
os.makedirs(dataset_dir, exist_ok=True)
sample_rate = 16000  # 16kHz
seed = 0
data = get_whitenoise(sample_rate=sample_rate, duration=10.00, n_channels=1, dtype="float32", seed=seed)
Mocked audio data has to be generated for each file being mocked, with a different seed value; otherwise all the generated files are exactly the same (for the sake of test reproducibility, the test helper functions are deterministic by default, i.e. they use a fixed seed value) and we do not know whether the dataset implementation is traversing the files in the expected order.
Oh, I get it now; it's one file with different segments.
Can we have multiple audio files per release? Testing cases with multiple objects often reveals bugs, so it's good practice. Here, we would like to make sure that the os.listdir function (which returns items in an OS-specific order) does not change the result across OSs (though our test is broken for non-Linux environments, which I have to fix soon).
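A small sketch of generating several distinct mock files per release by varying the seed; get_whitenoise and save_wav refer to the test utilities mentioned above, and the sph/ directory layout mirrors what the loader reads, but the exact names, paths, and signatures here are assumptions:

    import os

    sample_rate = 16000
    for seed, talk_id in enumerate(["TalkA", "TalkB", "TalkC"]):
        # A different seed per mocked file makes every waveform unique, so a wrong
        # traversal order (e.g. an unsorted os.listdir) would be caught by the assertions
        data = get_whitenoise(sample_rate=sample_rate, duration=10.0, n_channels=1, dtype="float32", seed=seed)
        save_wav(os.path.join(dataset_dir, "sph", talk_id + ".wav"), data, sample_rate)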
Just pushed a new version with the previous feedback, including a separate test for every release. More tests never hurt, and as you say, we could have missed some failures by only testing the first release. We also have different data to test between releases.
Hi @jiwidi Yes, it's broken for Windows and macOS, and I have to fix it at some point. These failures have nothing to do with your work.
# Create a samples list to compare with
cls.samples[release] = []
for i, utterance in enumerate(UTTERANCES):
for i, utterance in enumerate(UTTERANCES):
for utterance in UTTERANCES:
I don't think the enumerate is used.
good catch!
torchaudio/datasets/tedlium.py
Outdated
return len(self._filelist)

@property
def get_phoneme_dict(self):
A @property is accessed like an attribute and is invoked without parentheses, so get_phoneme_dict is strange. Just phoneme_dict makes more sense.
torchaudio/datasets/tedlium.py
Outdated
with open(self.dict_path, "r", encoding="utf-8") as f:
    for line in f.readlines():
        content = line.strip().split(maxsplit=1)
        self._phoneme_dict[content[0]] = content[1:]  # content[1:] can be empty list
What do you think of turning this into a tuple to prevent modification from user code? Does it affect the user experience?
    self._phoneme_dict[content[0]] = content[1:]  # content[1:] can be empty list
    self._phoneme_dict[content[0]] = tuple(content[1:])  # content[1:] can be empty list
good idea
@mthrok, new commit with the PR comment suggestions. What do you think of the state of this PR? There doesn't seem to be much left to do(?)
Hi @jiwidi
I had one more question about the form of phoneme_dict. Please check the comment.
It's also a good idea to add a test for it, but this PR has gotten super long, so it's okay. Other than that, I think we are ready to merge. Thanks for sticking with such a rigorous review. I enjoyed our interactions. :)
torchaudio/datasets/tedlium.py
Outdated
self._phoneme_dict = {}
with open(self.dict_path, "r", encoding="utf-8") as f:
    for line in f.readlines():
        content = line.strip().split(maxsplit=1)
Was maxsplit=1 always here? I thought this was splitting the line into a list of phonemes, but with maxsplit=1 it is a list with one string that contains all the phonemes.
With maxsplit=1, it's
    content = ['dani', 'D AA N IY']
    self._phoneme_dict = {'dani': ('D AA N IY', ), ...}
whereas without maxsplit, it's
    content = ['dani', 'D', 'AA', 'N', 'IY']
    self._phoneme_dict = {'dani': ('D', 'AA', 'N', 'IY'), ...}
Which one do you intend? I thought it was the latter.
It was always there, and the intent was the first option you mention:
    content = ['dani', 'D AA N IY']
    self._phoneme_dict = {'dani': ('D AA N IY', ), ...}
I'm used to seeing the phonemes as a single string instead of a list as in the second option, but that doesn't mean other people don't expect the second option. At least it is an easy change from the user's perspective.
Edit: I chatted with some colleagues and it seems like the second option would be more "standard", so I'll go with that one.
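For reference, a minimal illustration of the parsing that matches the chosen second option, reusing the sample entry from the comment above:

    line = "dani D AA N IY"
    content = line.strip().split()                    # no maxsplit: ['dani', 'D', 'AA', 'N', 'IY']
    phoneme_dict = {content[0]: tuple(content[1:])}   # {'dani': ('D', 'AA', 'N', 'IY')}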
torchaudio/datasets/tedlium.py
Outdated
[Tensor, int]: Audio tensor representation and sample rate
"""
start_time = int(float(start_time) * 16000)
end_time = int(float(end_time) * 16000)
Can you use the sample_rate variable here instead of the hard-coded 16000?
@mthrok I just pushed a new commit where I include a test for the dictionary for every release as well. More tests never hurt :) It also changes the values of the dictionary to the second option you described above and fixes wrong naming in a private variable. I also enjoyed the interaction here very much; very good coding-practice reviews you left me here 💯
Thanks for your work.
fixes wrong naming in a private variable.
Glad you caught this, I did not notice it :)
I also enjoyed the interaction here very much; very good coding-practice reviews you left me here 💯
Thanks for the nice words. Let me know in the future if there is something I can help you with. ;)
Sorry, I just noticed that the tedlium dataset test fails the style check. Do you have time to fix it? I tried to modify the PR, but I cannot push to your branch. If you are done with it, let me know; I will merge it first and create a follow-up PR.
Just pushed a new commit that should fix it
Thank you so much!
Hi!
So this is my pull request for adding TEDLIUM support (all 3 releases of TEDLIUM), referencing #765.
The code is not yet finished (I still need to add tests, read the phonetic dictionaries of each release, and do proper formatting), but I wanted to create the PR to get some feedback on the way I read/iterate the TEDLIUM files and whether the torchaudio maintainers are okay with it or have ideas on how to improve it. This way I make sure the core of the dataset loader is good to merge, and I can start working on the tests, or change the core before coding the tests. I will now explain how I iterate through the dataset.
The TEDLIUM dataset (all releases) is composed of .sph and .stm files. These files are split in the root folder into two different folders (sph and stm). Each .stm file contains the transcripts for a FULL TED talk, while the sibling .sph file contains the audio for the FULL TED talk. Since this dataset is used in speech recognition, people don't train on full TED-talk audio but rather split it into sentences. Each sentence corresponds to a line in the .stm file, which means that an .stm file contains as many training samples as it has lines.
The previous datasets that I found in torchaudio didn't encounter this case and treated every transcript file as one sample, so I had to make a small change to the self._walker used in the datasets. I created a new self._extended_walker list that contains as many occurrences of an .stm file (paired with a line number) as that file has lines. So, for example, if we had two stm files A (5 lines/5 samples) and B (3 lines/3 samples), self._extended_walker would look like [(A, 0), (A, 1), (A, 2), (A, 3), (A, 4), (B, 0), (B, 1), (B, 2)].
Every time we load an item i of the dataset we index this _extended_walker and retrieve the file containing the sample and the corresponding line; the line contains the transcript and the start/end times of the corresponding segment of the audio file. With all of this information we can load the line's transcript and the corresponding sub-audio from the full TED talk. This allows the dataset to work like any other torchaudio dataset (it can be iterated sequentially or shuffled). Looks great, right? Then why am I checking with the maintainers whether this is a good idea?
Well, in order to create this _extended_walker list we have to read how many lines each of our .stm files has while initializing the dataset, causing a "not optimal I/O operation". The operation happens in lines 119-123. It is fast enough not to represent any downside for the user (release 3 contains the highest number of files).
I'm open to edits and suggestions on the implementation of the TEDLIUM dataset, and please note it is still a work in progress.
Things that still need work:
- a _getdict function to return the phoneme dictionary of the dataset
Thanks