Add tedlium dataset (all 3 releases) #882
Conversation
Pinging @vincentqb
Hi @jiwidi
Thanks for the contribution, and thanks for opening the PR in the early stage. This will make our review process easier and more fruitful.
I see you mostly follow the existing Dataset implementations, and that's a good start, but I found a lot of room for improvement in these existing implementations, so I added comments about them. Let me know what you think.
torchaudio/datasets/tedlium.py
Outdated
root: str,
release: str = RELEASE,
subset: str = None,
folder_in_archive: str = _RELEASE_CONFIGS[RELEASE]["folder_in_archive"],
Can you remove folder_in_archive from the signature? There is no necessity or benefit in making this a variable. Also, this variable is never used, as it's overwritten at the beginning of the constructor.
Yep, there is no reason why it would stay 👍
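For illustration, a hedged sketch of the trimmed signature, with folder_in_archive derived inside the constructor instead of being exposed as a parameter; the config dictionary shown is only an illustrative subset, not the PR's actual table:

    import os

    _RELEASE_CONFIGS = {"release1": {"folder_in_archive": "TEDLIUM_release1"}}  # illustrative subset

    class TEDLIUM:
        def __init__(self, root: str, release: str = "release1", subset: str = None, download: bool = False) -> None:
            # Derived from the release, not accepted as an argument
            folder_in_archive = _RELEASE_CONFIGS[release]["folder_in_archive"]
            self._path = os.path.join(root, folder_in_archive)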
torchaudio/datasets/tedlium.py
Outdated
download_url(url, root, hash_value=checksum)
extract_archive(archive)

walker = walk_files(self._path, suffix=self._ext_txt, prefix=False, remove_suffix=True)
I am advising people not to use walk_files, because this function prevents Dataset implementations from designating which directories they should look at, and it may traverse directories that are totally unnecessary and irrelevant, which is behavior that should be avoided.
Also, since you need new logic for going over the STM files, the fact that walk_files would need to be extended shows that it is not suitable for the purposes of this Dataset. So I suggest adding logic that does just the right amount of work, which is simply going over the subdirectories and checking the files. Please see another example here. Following this advice, @Abhi011999 found some data missing from VCTK 0.92, which is another reason walk_files should not be used: it obscures what the Dataset implementation should be looking at.
I am downloading all TEDLIUM datasets now so that we can discuss the details of how to traverse these directories. I will give my thoughts once I see the dataset structures.
That would be perfect; let's have a discussion about it when you have the data.
torchaudio/datasets/tedlium.py
Outdated
def load_tedlium_item(
    fileid: str, line: int, path: str, ext_audio: str, ext_txt: str
) -> Tuple[Tensor, int, str, int, int, int]:
Can you put this as a method of the TEDLIUM class? That way user code can customize the loading behavior.
Yes, makes sense
torchaudio/datasets/tedlium.py
Outdated
)

wave_path = os.path.join(path, "sph/", fileid)
waveform, sample_rate = torchaudio.load(wave_path + ext_audio)
Can you add a method that wraps torchaudio.load to the TEDLIUM class?
    def load_audio(self, path):
        return torchaudio.load(path)
The problem with using bare torchaudio.load is that the user cannot customize the loading behavior (such as disabling normalization or swapping the channels dimension). We could also pass such parameters down from the constructor of the TEDLIUM class, but that would make the constructor signature ugly.
Yes, makes sense
@mthrok -- What's the advantage of doing this over having torchaudio.load inside the load-item function? Is your suggestion to also eventually have a load_text method? One property of having a single load-item function was that it factored the item loading as a pure function taking an identifier and returning a data point.
@mthrok -- What's the advantage of doing this over having torchaudio.load inside the load-item function? Is your suggestion to also eventually have a load_text method? One property of having a single load-item function was that it factored the item loading as a pure function taking an identifier and returning a data point.
As far as I understood it, he wants to give the user the option to override the dataset's load function and use the optional parameters if they feel like it.
So the user could do:
    def custom_load(path):
        return torchaudio.load(path, normalize=False, channels_first=False)

    dt = torchaudio.datasets.TEDLIUM()
    dt.load_audio = custom_load
As normalize and channels_first are True by default and the load_item function doesn't change them, the user currently has no way to change them.
Another example is Windows users, who cannot use torchaudio.load because torchaudio on Windows does not support the sph format, so they need to provide something else in the custom load method.
As normalize and channels_first are True by default and the load_item function doesn't change them, the user has no ability to change them.
If this is indeed something desired, we could also consider exposing those parameters to the constructor instead. I wouldn't say this is enough alone to change the implementation through the addition of a load method.
In torchaudio, we went with the convention that waveforms are between -1 and 1 with batch first, see readme. This is in particular to avoid having to carry such parameters everywhere.
Another example is Windows users, who cannot use torchaudio.load because torchaudio on Windows does not support the sph format, so they need to provide something else in the custom load method.
This is an interesting point, and there's precedent for this for datasets using mp3. I see two paths for this:
- Offering a backend that supports sph on Windows
- Decoupling torchaudio.load from torchaudio.datasets
In the former, a user who decides to change the datasets would copy-paste the code and modify to their heart's content. In the latter, we need to decide how we would standardize this. Is any of torch{text,vision} doing something like this already? Thoughts?
I would say offering support for sph on Windows would be better: less hassle and moving around for the user. In the end, the users who use these built-in dataset loaders are the ones looking for a working, easy-to-use solution; if they want to customize loading or anything else, they will probably rewrite a dataset on their own.
torchaudio/datasets/tedlium.py
Outdated
waveform, sample_rate, utterance, speaker_id, chapter_id, utterance_id
"""

_ext_txt = ".stm"
I do not see a necessity or benefit in making this extension a variable. STM is an established format, and since it's text, any platform (including Windows) should be able to open it, unlike ".sph", which torchaudio on Windows cannot load.
It is used for the file walker, as we look for files with this extension, and I kept it as a variable for consistency with the LibriSpeech dataset:
audio/torchaudio/datasets/librispeech.py, lines 75 to 76 in c692fe9:
    _ext_txt = ".trans.txt"
    _ext_audio = ".flac"
It will probably go unused depending on what logic we decide on to iterate through the files. Can we take a look at this when we are done with the logic?
It is used for the file walker, as we look for files with this extension
Even if that's the case, I think it can be hardcoded where the walker function is called. However,
It will probably go unused depending on what logic we decide on to iterate through the files. Can we take a look at this when we are done with the logic?
Yeah, we can defer the change until then.
I moved the _ext_audio initialization to the constructor in my upcoming commit, so I removed this one as well. We can always go back.
torchaudio/datasets/tedlium.py
Outdated
walker = walk_files(self._path, suffix=".stm", prefix=False, remove_suffix=True)
self._walker = list(walker)
self._extended_walker = []
I'm not sure why we need an extended walker in place of updating the walker. It seems like this simply extracts information from a file. Could this be done in the load-item function?
It does simply extract information from the files contained in the original walker. I created it because we need this information before calling the load-item function, so that the dataset contains as many samples as there are total lines across all files (one line is one sentence) instead of as many samples as there are files.
This dataset is a bit tricky because each file contains a full TED talk, and no one in speech recognition (where this dataset is usually used) treats a full talk as a training sample. We split the transcript file and the audio file into multiple samples (sentences).
This means that once we instantiate the dataset we need to know how many samples we have (lines across all files), so the user can index any sample N at any time or shuffle the dataset. If we were to implement this in the load function, this functionality would be much harder. See the original post, where I explain in detail how the files are iterated.
I'm happy to discuss which method is best to achieve this functionality, but first I'm checking whether you understand why this extended walker is here; maybe I didn't explain it clearly in the original post.
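For context, each STM line describes one utterance segment. Below is a small, made-up example of parsing such a line; the field order follows the split used later in this thread's test code, not an official spec quoted here:

    # Illustrative STM line (not taken from the real dataset)
    line = "TalkID_2010 1 SpeakerID 17.22 21.53 <o,f0,male> this is one sentence of the talk"
    talk_id, _, speaker_id, start_time, end_time, identifier, transcript = line.split(" ", 6)
    # start_time/end_time are seconds into the talk's audio file; transcript is the sentence text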
Thanks for adding details :) What I meant is more about the particular use of _walker and _extended_walker. I would have suggested something like:
    self._walker = []
    for file in walk_files(self._path, suffix=".stm", prefix=False, remove_suffix=True):
        stm_path = os.path.join(self._path, "stm", file + ".stm")
        with open(stm_path) as f:
            l = len(f.readlines())
        self._walker.extend((file, line) for line in range(l))
or, to stick with generators,
    def _generate_walker(...):
        for file in walk_files(self._path, suffix=".stm", prefix=False, remove_suffix=True):
            stm_path = os.path.join(self._path, "stm", file + ".stm")
            with open(stm_path) as f:
                for line in range(len(f.readlines())):
                    yield (file, line)

    self._walker = list(_generate_walker(...))
Also, why just indexing and not storing the line directly, since the file is already being read? How big is the total text?
    def _generate_walker(...):
        for file in walk_files(self._path, suffix=".stm", prefix=False, remove_suffix=True):
            stm_path = os.path.join(self._path, "stm", file + ".stm")
            with open(stm_path) as f:
                for line in f.readlines():
                    yield (file, line)

    self._walker = list(_generate_walker(...))
Oh, I see what you mean now; yes, it will be cleaner to stick to the _walker name for the resulting walker.
I went ahead and checked the size of the walker list for release 3 (the biggest):
    import numpy as np
    np.array(test_dataset._extended_walker).nbytes
It is 316733528 bytes (~316 MB) when storing the full line and 20449080 bytes (~20 MB) with just the line identifier. I would say that is a considerable size, but it does remove an I/O operation each time we load an item. Does any other dataset take this much space in memory?
Regarding sticking with generators or not: since we convert it to a list anyway, the generator just adds more logic that we don't use. I would go with
    self._walker = []
    for file in walk_files(self._path, suffix=".stm", prefix=False, remove_suffix=True):
        stm_path = os.path.join(self._path, "stm", file + ".stm")
        with open(stm_path) as f:
            l = len(f.readlines())
        self._walker.extend((file, line) for line in range(l))
just because it is easier to read.
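For context, a rough sketch of how __getitem__ would then consume those (file, line) pairs; the _load_tedlium_item name follows the later discussion in this thread, and the exact details are assumptions:

    def __getitem__(self, n):
        # Each walker entry pairs an STM file id with a line number inside that file
        fileid, line = self._walker[n]
        return self._load_tedlium_item(fileid, line, self._path)

    def __len__(self):
        return len(self._walker)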
torchaudio/datasets/tedlium.py
Outdated
extract_archive(archive)

walker = walk_files(self._path, suffix=".stm", prefix=False, remove_suffix=True)
self._walker = list(walker)
In our current implementation, this would simply be the last command manipulating the walker (or the extended walker here).
Do you mean the lexicon file such as |
torchaudio/datasets/tedlium.py
Outdated
stm_path = os.path.join(self._path, "stm", file + ".stm")
with open(stm_path) as f:
    l = len(f.readlines())
self._extended_walker += [(file, line) for line in range(l)]
I checked the contents of STM files. For the case of release-3
$ wc -l TEDLIUM_release1/*/stm/*.stm | tail -1
58863 total
$ wc -l TEDLIUM_release2/*/stm/*.stm | tail -1
95033 total
$ wc -l TEDLIUM_release-3/data/stm/* | tail -1
268263 total
So if loading the whole release-3 dataset, self._extended_walker could take a couple of megabytes of memory.
Also, this will open the following number of files:
$ find TEDLIUM_release1 -type f -name '*.stm' | wc -l
793
$ find TEDLIUM_release2 -type f -name '*.stm' | wc -l
1514
$ find TEDLIUM_release-3 -type f -name '*.stm' | wc -l
2370
For the release-3 dataset, this initialization process will go over 2370 files sequentially. How long does this take?
Oh, is this what you report as 24 microseconds?
Yes, I reported the time of initializing the dataset, which involves going over all the files counting lines. I think that is quite fast and we shouldn't worry about it.
About the memory it occupies, I did a quick test for a comment from @vincentqb just above this one. I checked how much the walker would occupy if we stored the full line instead of just the line identifier, since we already loop over the file anyway, saving some I/O when loading each item:
I went ahead and checked the size of the walker list for release 3 (the biggest):
    import numpy as np
    np.array(test_dataset._extended_walker).nbytes
It is 316733528 bytes (~316 MB) when storing the full line and 20449080 bytes (~20 MB) with just the line identifier. I would say that is a considerable size, but it does remove an I/O operation each time we load an item. Does any other dataset take this much space in memory?
Yes, I reported the time of initializing the dataset, which involves going over all the files counting lines. I think that is quite fast and we shouldn't worry about it.
I did a quick benchmark on my end. Script:
    import time
    import torchaudio

    torchaudio.set_audio_backend('sox_io')

    n_rep = 100
    t0 = time.monotonic()
    for _ in range(n_rep):
        ds = torchaudio.datasets.TEDLIUM("../dataset/tedlium", release="release3", download=False)
    t1 = time.monotonic()
    print(len(ds._walker))
    print((t1 - t0) / n_rep)
Result:
    $ python foo.py
    268263
    0.21009648161940275
I did not get anywhere near microseconds, but it was only 0.2 seconds, so I agree that speed is not an issue here.
Regarding the memory consumption, I talked with @cpuhrsch; 20 megabytes is not a big deal. However, be aware that when using multiple DataLoader worker processes, each process will consume the same amount of memory.
Also, DataLoader has a memory leakage issue, pytorch/pytorch#13246 (current workaround: pytorch/pytorch#13246 (comment)).
Yes, that |
torchaudio/datasets/tedlium.py
Outdated
def __init__(
    self, root: str, release: str = "release1", subset: str = None, download: bool = False, audio_ext=".sph"
) -> None:
    """Constructor for TEDLIUM dataset
Can you move this docstring to the class-level docstring and merge them?
Also, you need to add this class to docs/source/datasets.rst for it to show up in the docs.
I just checked the other datasets; they have a similar docstring under the class description.
We have
"""
Create a Dataset for Tedlium. Each item is a tuple of the form:
[waveform, sample_rate, transcript, talk_id, speaker_id, identifier]
"""
Librispeech:
"""
Create a Dataset for LibriSpeech. Each item is a tuple of the form:
waveform, sample_rate, utterance, speaker_id, chapter_id, utterance_id
"""
Are you sure you want to move the big constructor docstring to class-level?
Are you sure you want to move the big constructor docstring to class-level?
Yes. The fact that the other datasets lack documentation for the constructor is not user-friendly.
It is better if each Dataset explains what options a user can change.
I will update the other datasets' documentation.
torchaudio/datasets/tedlium.py
Outdated
Args:
    root (str): Path containing dataset or target path where its downloaded if needed
    release (str, optional): TEDLIUM identifier (release1,release2,release3). Defaults to RELEASE.
    subset (str, optional): Subset of data(train,test,dev) supported for release 1,2. Defaults to Train/None.
I think it should be stricter about the subset value. For releases 1 and 2 the allowed values are train/dev/test, but for release 3 it has to be None.
Could you add a value check and raise a ValueError if it's not the right one for the given release?
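A minimal sketch of such a check, assuming it sits near the top of the constructor; the allowed values follow the comment above, while the exact error messages and the handling of a None default for releases 1 and 2 are assumptions:

    if release in ("release1", "release2"):
        if subset not in (None, "train", "dev", "test"):
            raise ValueError("subset must be 'train', 'dev' or 'test' (or None) for release 1 and 2")
    elif release == "release3":
        if subset is not None:
            raise ValueError("release3 has no subsets; pass subset=None")
    else:
        raise ValueError("release must be one of 'release1', 'release2', 'release3'")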
torchaudio/datasets/tedlium.py
Outdated
self._walker = []

# Create walker for all samples
for file in walk_files(self._path, suffix=".stm", prefix=False, remove_suffix=True):
Can we avoid using walk_files and just list the stm files? I checked the dataset and we only need to check one directory per configuration:
TEDLIUM_release1/dev/stm/*.stm
TEDLIUM_release1/test/stm/*.stm
TEDLIUM_release1/train/stm/*.stm
TEDLIUM_release2/dev/stm/*.stm
TEDLIUM_release2/test/stm/*.stm
TEDLIUM_release2/train/stm/*.stm
TEDLIUM_release-3/data/stm/*.stm
That is true. I used the walker since I saw it was used in other datasets, but listing all stm files will do the job, as all of them are supposed to be inside the same folder.
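A minimal sketch of that approach, assuming each configuration keeps all of its transcripts in a single stm directory as listed above; the helper name is illustrative:

    import os

    def list_stm_files(split_dir):
        # Look only inside e.g. TEDLIUM_release1/train/stm instead of walking the whole tree
        stm_dir = os.path.join(split_dir, "stm")
        return sorted(f[: -len(".stm")] for f in os.listdir(stm_dir) if f.endswith(".stm"))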
.. autoclass:: TEDLIUM
   :members: __getitem__
   :special-members: get_phoneme_dict
@mthrok What does special-members mean here? I interpret it as extra functions to include in the docs? That's why I included get_phoneme_dict.
These are Sphinx directives. Check out their documentation:
- :members: is where you list the members you want to document
- :special-members: is where you list special methods like __init__, __len__, __getitem__, etc.
I think the other documentation is wrong (__getitem__ should be under :special-members:, but it will not show up either way because they don't have a docstring).
I think you can just do .. autoclass:: TEDLIUM and the rest (get_phoneme_dict) will be handled.
You can build the documentation and check what the resulting documentation looks like:
cd docs
pip install -r requirements.txt
make html
# open ./build/html/index.html
torchaudio/datasets/tedlium.py
Outdated
from torchaudio.datasets.utils import (
    download_url,
    extract_archive,
    walk_files,
You can remove walk_files from the imports too.
torchaudio/datasets/tedlium.py
Outdated
# Create walker for all samples
self._walker = []
stm_path = os.path.join(self._path, "stm")
for file in os.listdir(stm_path):
Can you sort the files? Different OSes return files in a different order.
    for file in os.listdir(stm_path):
    for file in sorted(os.listdir(stm_path)):
The code looks much better now.
@jiwidi What else is left for this PR?
torchaudio/datasets/tedlium.py
Outdated
content = line.strip().split(maxsplit=1)
self.phoneme_dict[content[0]] = content[1:]  # content[1:] can be empty list

def load_tedlium_item(self, fileid: str, line: int, path: str) -> Tedlium_item:
Can you also mark this function as private?
def load_tedlium_item(self, fileid: str, line: int, path: str) -> Tedlium_item:
def _load_tedlium_item(self, fileid: str, line: int, path: str) -> Tedlium_item:
Tests! If we are okay with the logic for iterating files, I'll go ahead and create some tests based on the last updates I saw you make to the other dataset tests.
torchaudio/datasets/tedlium.py
Outdated
wave_path = os.path.join(path, "sph", fileid)
waveform, sample_rate = self._load_audio(wave_path + self._ext_audio, start_time=start_time, end_time=end_time)

return Tedlium_item(waveform, sample_rate, transcript, talk_id, speaker_id, identifier)
I can't find the discussion about NamedTuple, so I'm leaving a new comment here :)
There has been prior discussion about changing the type of data points (and in particular using NamedTuple), see internal document. Though there may be value in redefining what a data point is, I feel the discussion would delay this pull request, so I would avoid it for now.
I can't access the document, but regardless I have no problem with changing it back to the previous version.
We can delay the NamedTuple change to a PR that changes multiple datasets/files at once, rather than doing them one by one and ending up with inconsistent code. What do you think?
We can delay the NamedTuple change to a PR that changes multiple datasets/files at once, rather than doing them one by one and ending up with inconsistent code. What do you think?
Yes, I agree with your suggestion. It would be preferable to change them all at the same time once we settle on a specific format.
I oppose using a plain tuple here. Six heterogeneous items as a tuple give a very bad user experience and make code harder to read/maintain. Since a named tuple is compatible with a tuple, there is virtually no disadvantage to using a named tuple here.
The main disadvantages of using a namedtuple here are (1) inconsistent behavior with other datasets in audio and elsewhere, (2) it would make the user leverage an API that could change in the near future, and (3) it will slow down this particular pull request.
The reason we did not standardize around NamedTuple in the past was that it didn't allow for easy introspection of keys, and each dataset would have its own DataPoint NamedTuple. Please see the internal document for more discussion.
We are also using tuples with many items in other torchaudio datasets. That being said, we can revisit these design choices. Thoughts @zhangguanheng66 @fmassa @cpuhrsch ?
(1) inconsistent behavior with other datasets in audio and elsewhere
Can you elaborate on this? Each dataset returns a different number of items, so it does not look like they could be swapped easily in the first place. And using NamedTuple with consistent key names will make it possible to give a consistent user experience.
(2) it would make the user leverage an API that could change in the near future
Not sure what this means. Can you elaborate?
(3) it will slow down this particular pull request.
It's already a NamedTuple, so it seems that changing it to a tuple is what would slow down the process.
The reason we did not standardize around NamedTuple in the past was that it didn't allow for easy introspection of keys
When is it necessary or useful to perform key introspection?
each dataset would have its own DataPoint NamedTuple.
That sounds like the right approach. Each dataset returns a different number of items of different types, so they should be typed differently.
(1) inconsistent behavior with other datasets in audio and elsewhere
Can you elaborate on this? Each dataset returns a different number of items, so it does not look like they could be swapped easily in the first place. And using NamedTuple with consistent key names will make it possible to give a consistent user experience.
I was really just referring to using NamedTuple in this particular pull request. Moving toward consistently using NamedTuple in torchaudio/text/vision is fine by me.
(2) it would make the user leverage an API that could change in the near future
Not sure what this means. Can you elaborate?
Since we have not, as a team (torchaudio/text/vision), revisited the decision of using Tuple, using NamedTuple here exposes us to changing the API again should we settle on another option.
(3) it will slow down this particular pull request.
It's already a NamedTuple, so it seems that changing it to a tuple is what would slow down the process.
But we should avoid using NamedTuple until we, as a team, have made a decision on the API for a data point. The current decision had been Tuple, but I'm personally fine with revisiting that past decision.
The reason we did not standardize around NamedTuple in the past was that it didn't allow for easy introspection of keys
When is it necessary or useful to perform key introspection?
each dataset would have its own DataPoint NamedTuple.
That sounds like the right approach. Each dataset returns a different number of items of different types, so they should be typed differently.
I don't feel too strongly about either point. I am really just listing the reasons that we left in the internal document. :)
Hi @jiwidi Thanks for pinging me again, and sorry for missing your reply in the first place.
So because of 2., it is preferable if we can test the dataset implementation with the WAV format, even though the real dataset is in SPH (and I believe converting the entire dataset into WAV is also a valid use case for end users). For that purpose we make the dataset implementation able to change the target extension. So can you write your test with
torchaudio/datasets/tedlium.py
Outdated
Tedlium_item = namedtuple(
    "Tedlium_item", ["waveform", "sample_rate", "transcript", "talk_id", "speaker_id", "identifier"]
)
No problem!
torchaudio/datasets/tedlium.py
Outdated
""" | ||
start_time = int(float(start_time) * 16000) | ||
end_time = int(float(end_time) * 16000) | ||
return torchaudio.load(path, frame_offset=start_time, num_frames=end_time - start_time) |
I just realized that this will only work with the sox_io backend. Could you add a fall-back for the other backends?
    if torchaudio.get_audio_backend() == 'sox_io':
        return torchaudio.load(path, frame_offset=start_time, num_frames=end_time - start_time)
    return torchaudio.load(path)[:, start_time:end_time]
The current sox backend also supports reading a file segment. Not soundfile? If the datasets need to be aware of the backend, then we should see how we could evolve the backends so that this does not need to happen. Thoughts?
torchaudio.load(path)[:, start_time:end_time] will still work for both backends as far as I understand, but @mthrok pointed out it is more efficient to use torchaudio.load(path, frame_offset=start_time, num_frames=end_time - start_time) if sox_io is enabled.
We can either:
- Add support in the other backends for the same parameter names, so the optimized loading can be used with all of them (this could be better for the long term, unifying the backends), or
- Use the slicing operator and support all backends right away (see the sketch below).
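As a rough, non-authoritative sketch of how the dataset's audio-loading helper could combine both options — the frame conversion mirrors the snippet above, while the default sample rate and exact signature are assumptions:

    import torchaudio

    def _load_audio(self, path, start_time, end_time, sample_rate=16000):
        # STM times are given in seconds; convert them to frame indices
        start_frame = int(float(start_time) * sample_rate)
        end_frame = int(float(end_time) * sample_rate)
        if torchaudio.get_audio_backend() == "sox_io":
            # sox_io can decode only the requested segment
            return torchaudio.load(path, frame_offset=start_frame, num_frames=end_frame - start_frame)
        # Fallback for other backends: decode the whole file, then slice the waveform
        waveform, sr = torchaudio.load(path)
        return waveform[:, start_frame:end_frame], sr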
The current sox backend also supports reading a file segment. Not soundfile? If the datasets need to be aware of the backend, then we should see how we could evolve the backends so that this does not need to happen. Thoughts?
We have already planned to resolve this in the 0.9.0 release. See #903.
Don't worry, you've been very helpful. I will test it later today and get back if there are any problems; otherwise a commit will follow :) I think I can make it work with wav files as well, but since this is the only interaction of the library with sph files, I see value in keeping the test with the sph format, as it will serve as a check for this functionality. If we check for sph support in other tests, then I have no preference between wav and sph. Edit: saving sph files works perfectly with
load/save/info functions on
If you decide to test sph, then the following is (only the gist of) my recommendation.
    def _generate_dataset(root_dir, audio_ext, dataset_version, <other parameters, such as dataset versions>) -> List[tuple]:
        # Generates dataset with the given extension
        for foo in whatever:
            ...
            # Switch saving function
            if audio_ext == "wav":
                save_wav(file_path, data, sample_rate)  # use scipy-based test utility for wav
            else:
                torchaudio.save(file_path, data, sample_rate)  # use sox_io backend,
                # assuming that the backend is properly set.
                # See backend="sox_io" below for this.
        # Returns expected data

    # Note: Probably better to define a different test class for different TEDLIUM versions.
    class TestTedliumWav(TempDirMixin, TorchaudioTestCase):
        backend = "default"  # <- this test should work on all supported platforms

        @classmethod
        def setUpClass(cls):
            cls.root_dir = cls.get_base_temp_dir()
            _generate_dataset(cls.root_dir, "wav", ...)

        # define the test
        def testFoo(self):
            ...

    class TestTedliumSph(TempDirMixin, TorchaudioTestCase):
        backend = "sox_io"  # <- this test will run only on platforms where the "sox_io" backend is available

        @classmethod
        def setUpClass(cls):
            cls.root_dir = cls.get_base_temp_dir()
            _generate_dataset(cls.root_dir, "sph", ...)

        # define the test
        def testFoo(self):
            ...
Glad it works :)
I just pushed the test for tedlium. I saw the sph load is already tested elsewhere, as you mentioned, so I kept the test with
For this new commit I also changed the phoneme dictionary to be read only if the user actually calls the
The testing is very similar to the LibriSpeech one: I try to recreate the three releases' folder structure in a temporary folder and put some white noise there, then instantiate the TEDLIUM datasets and check that all the data is retrieved correctly when iterating over them.
@mthrok is it normal that this many unit tests fail due to download/connectivity issues? The tests run on my side, but we have to make sure they also run in the CI pipeline.
torchaudio/datasets/tedlium.py
Outdated
""" | ||
# Read phoneme dictionary | ||
if not hasattr(self, "phoneme_dict"): | ||
self.phoneme_dict = {} |
This design pattern might leave users with questions like "Which one should I use, phoneme_dict or get_phoneme_dict?" and "What are the differences?"
The following techniques are often used to improve this:
- Use an internal attribute name like _phoneme_dict. This does not prevent users from accessing the underlying object directly, but accessing an attribute prefixed with an underscore clearly sends the message that user code is not doing the right thing, so it's okay.
- Use the property decorator. Lazy initialization often fits well in a property attribute. Something like this:
    @property
    def phoneme_dict(self):
        if self._phoneme_dict is None:
            self._phoneme_dict = dict()
            # Fill the dictionary
        return self._phoneme_dict.copy()  # see below for `copy`
Also, instead of checking whether it's initialized with hasattr, it's more straightforward to initialize the actual attribute in the constructor as None, such as self._phoneme_dict = None, and perform the check as if self._phoneme_dict is None:.
Another concern is returning the dictionary reference. The user code might accidentally modify the dictionary, which can lead to a very subtle bug. Maybe it's better to return a copy of the dictionary, as return self._phoneme_dict.copy() (however, this is still a shallow copy, so the values point to the original lists and user code can still modify the lists; changing them to tuples might prevent such unintended modification, like self.phoneme_dict[content[0]] = tuple(content[1:])).
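Putting those suggestions together (a None sentinel set in the constructor, a property, tuple values, and returning a copy), a hedged sketch could look like the following; it is illustrative rather than the final implementation, and the real class of course takes more constructor arguments:

    from typing import Dict, Optional, Tuple

    class TEDLIUM:
        def __init__(self, dict_path: str) -> None:
            self.dict_path = dict_path
            self._phoneme_dict: Optional[Dict[str, Tuple[str, ...]]] = None  # filled lazily

        @property
        def phoneme_dict(self) -> Dict[str, Tuple[str, ...]]:
            if self._phoneme_dict is None:
                self._phoneme_dict = {}
                with open(self.dict_path, "r", encoding="utf-8") as f:
                    for line in f:
                        content = line.strip().split(maxsplit=1)
                        self._phoneme_dict[content[0]] = tuple(content[1:])  # may be an empty tuple
            return self._phoneme_dict.copy()  # shallow copy so callers cannot replace our entries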
Really good points, a commit with all the suggestions will follow 👍
@classmethod
def tearDownClass(cls):
    # In case of test failure
    tedlium.TEDLIUM._ext_audio = ".flac"
This hack is required in the case of librispeech_test, because that dataset does not receive any audio extension in the constructor. TEDLIUM simply accepts "audio_ext", so you do not need to do this (nor tedlium.TEDLIUM._ext_audio = ".wav"); you can just do dataset = tedlium.TEDLIUM(self.root_dir, audio_ext=".wav").
def test_tedlium(self):
    tedlium.TEDLIUM._ext_audio = ".wav"
    dataset = tedlium.TEDLIUM(self.root_dir)
Since the mocked datasets are generated for all the releases, how about testing all the releases (as separate test cases/methods)?
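A hedged sketch of what such per-release test methods could look like; _assert_dataset stands in for whatever shared checking helper the test file ends up using and is hypothetical:

    def test_tedlium_release1(self):
        dataset = tedlium.TEDLIUM(self.root_dir, release="release1", audio_ext=".wav")
        self._assert_dataset(dataset, self.samples["release1"])  # hypothetical shared helper

    def test_tedlium_release2(self):
        dataset = tedlium.TEDLIUM(self.root_dir, release="release2", audio_ext=".wav")
        self._assert_dataset(dataset, self.samples["release2"])

    def test_tedlium_release3(self):
        dataset = tedlium.TEDLIUM(self.root_dir, release="release3", audio_ext=".wav")
        self._assert_dataset(dataset, self.samples["release3"])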
for i, utterance in enumerate(UTTERANCES):
    talk_id, _, speaker_id, start_time, end_time, identifier, transcript = utterance.split(" ", 6)
    start_time = int(float(start_time)) * 16000
    end_time = int(float(end_time)) * 16000
Since there is a variable sample_rate = 16000 defined above, why not use it here? If something has to be changed in the future, it will be easier that way.
good catch
os.makedirs(dataset_dir, exist_ok=True)
sample_rate = 16000  # 16kHz
seed = 0
data = get_whitenoise(sample_rate=sample_rate, duration=10.00, n_channels=1, dtype="float32", seed=seed)
Mocked audio data has to be generated for each file being mocked, with a different seed value; otherwise all the generated files are exactly the same (for the sake of test reproducibility, the test helper functions are deterministic by default, i.e. they use a fixed seed value) and we do not know whether the dataset implementation is traversing the files in the expected order.
Oh, I get it now; it's one file with different segments.
Can we have multiple audio files per release? Testing cases with multiple objects often reveals bugs, so it's good practice. Here, we would like to make sure that the os.listdir function (which returns items in an OS-specific order) does not change the result across OSs (though our test is broken for non-Linux environments, which I have to fix soon).
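A small sketch of generating several distinct mock files per release by varying the seed; get_whitenoise and save_wav refer to the test utilities mentioned above, and the sph/ directory layout mirrors what the loader reads, but the exact names, paths, and signatures here are assumptions:

    import os

    sample_rate = 16000
    for seed, talk_id in enumerate(["TalkA", "TalkB", "TalkC"]):
        # A different seed per mocked file makes every waveform unique, so a wrong
        # traversal order (e.g. an unsorted os.listdir) would be caught by the assertions
        data = get_whitenoise(sample_rate=sample_rate, duration=10.0, n_channels=1, dtype="float32", seed=seed)
        save_wav(os.path.join(dataset_dir, "sph", talk_id + ".wav"), data, sample_rate)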
Just pushed a new version with the previous feedback, including a separate test for every release. More tests never hurt, and as you say, we could have missed some failures by only testing the first release. We also have different data to test between releases.
Hi @jiwidi Yes, it's broken for Windows and macOS, and I have to fix it at some point. These failures have nothing to do with your work.
# Create a samples list to compare with
cls.samples[release] = []
for i, utterance in enumerate(UTTERANCES):
for i, utterance in enumerate(UTTERANCES):
for utterance in UTTERANCES:
I don't think the enumerate is used.
good catch!
torchaudio/datasets/tedlium.py
Outdated
return len(self._filelist)

@property
def get_phoneme_dict(self):
A @property is accessed like an attribute and is invoked without parentheses, so get_phoneme_dict is strange. Just phoneme_dict makes more sense.
torchaudio/datasets/tedlium.py
Outdated
with open(self.dict_path, "r", encoding="utf-8") as f:
    for line in f.readlines():
        content = line.strip().split(maxsplit=1)
        self._phoneme_dict[content[0]] = content[1:]  # content[1:] can be empty list
What do you think of turning this into a tuple to prevent modification from user code? Does it affect the user experience?
    self._phoneme_dict[content[0]] = content[1:]  # content[1:] can be empty list
    self._phoneme_dict[content[0]] = tuple(content[1:])  # content[1:] can be empty list
good idea
@mthrok, new commit with the PR comment suggestions. What do you think of the state of this PR? There doesn't seem to be much left to do(?)
Hi @jiwidi
I had one more question about the form of phoneme_dict. Please check the comment.
It's also a good idea to add a test for it, but this PR has gotten super long, so it's okay. Other than that, I think we are ready to merge. Thanks for sticking with such a rigorous review. I enjoyed our interactions. :)
torchaudio/datasets/tedlium.py
Outdated
self._phoneme_dict = {}
with open(self.dict_path, "r", encoding="utf-8") as f:
    for line in f.readlines():
        content = line.strip().split(maxsplit=1)
Was maxsplit=1 always here? I thought this was splitting the line into a list of phonemes, but with maxsplit=1 it is a list with one string that contains all the phonemes.
With maxsplit=1, it's
    content = ['dani', 'D AA N IY']
    self._phoneme_dict = {'dani': ('D AA N IY', ), ...}
whereas without maxsplit, it's
    content = ['dani', 'D', 'AA', 'N', 'IY']
    self._phoneme_dict = {'dani': ('D', 'AA', 'N', 'IY'), ...}
Which one do you intend? I thought it was the latter.
It was always there, and the intent was the first option you mention:
    content = ['dani', 'D AA N IY']
    self._phoneme_dict = {'dani': ('D AA N IY', ), ...}
I'm used to seeing the phonemes as a single string instead of a list as in the second option, but that doesn't mean other people don't expect the second option. At least it is an easy change from the user's perspective.
Edit: I chatted with some colleagues and it seems like the second option would be more "standard", so I'll go with that one.
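For reference, a minimal illustration of the parsing that matches the chosen second option, reusing the sample entry from the comment above:

    line = "dani D AA N IY"
    content = line.strip().split()                    # no maxsplit: ['dani', 'D', 'AA', 'N', 'IY']
    phoneme_dict = {content[0]: tuple(content[1:])}   # {'dani': ('D', 'AA', 'N', 'IY')}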
torchaudio/datasets/tedlium.py
Outdated
[Tensor, int]: Audio tensor representation and sample rate
"""
start_time = int(float(start_time) * 16000)
end_time = int(float(end_time) * 16000)
Can you use the sample_rate variable here instead of the hard-coded 16000?
@mthrok I just pushed a new commit where I include a test for the dictionary for every release as well. More tests never hurt :) It also changes the values of the dictionary to the second option you described above and fixes wrong naming in a private variable. I also enjoyed the interaction here very much; very good coding-practice reviews you left me here 💯
Thanks for your work.
fixes wrong naming in a private variable.
Glad you caught this, I did not notice it :)
I also enjoyed the interaction here very much; very good coding-practice reviews you left me here 💯
Thanks for the nice words. Let me know in the future if there is something I can help you with. ;)
Sorry, I just noticed that the tedlium dataset test fails the style check. Do you have time to fix it? I tried to modify the PR, but I cannot push to your branch. If you are done with it, let me know; I will merge it first and create a follow-up PR.
Just pushed a new commit that should fix it
Thank you so much!
Hi!
So this is my pull request for adding TEDLIUM support (all 3 releases of TEDLIUM), referencing #765.
The code is not yet finished (I still need to add tests, read the phonetic dictionaries of each release, and do proper formatting), but I wanted to create the PR to get some feedback on the way I read/iterate the TEDLIUM files and whether the torchaudio maintainers are okay with it or have ideas on how to improve it. This way I make sure the core of the dataset loader is good to merge, and I can start working on the tests, or change the core before coding the tests. I will now explain how I iterate through the dataset.
The TEDLIUM dataset (all releases) is composed of .sph and .stm files. These files are split in the root folder into two different folders (sph and stm). Each .stm file contains the transcripts for a FULL TED talk, while the sibling .sph file contains the audio for the FULL TED talk. Since this dataset is used in speech recognition, people don't train on full TED-talk audio but rather split it into sentences. Each sentence corresponds to a line in the .stm file, which means that an .stm file contains as many training samples as it has lines.
The previous datasets that I found in torchaudio didn't encounter this case and treated every transcript file as one sample, so I had to make a small change to the self._walker used in the datasets. I created a new self._extended_walker list that contains as many occurrences of an .stm file (paired with a line number) as that file has lines. So, for example, if we had two stm files A (5 lines/5 samples) and B (3 lines/3 samples), self._extended_walker would look like [(A, 0), (A, 1), (A, 2), (A, 3), (A, 4), (B, 0), (B, 1), (B, 2)].
Every time we load an item i of the dataset we index this _extended_walker and retrieve the file containing the sample and the corresponding line; the line contains the transcript and the start/end times of the corresponding segment of the audio file. With all of this information we can load the line's transcript and the corresponding sub-audio from the full TED talk. This allows the dataset to work like any other torchaudio dataset (it can be iterated sequentially or shuffled). Looks great, right? Then why am I checking with the maintainers whether this is a good idea?
Well, in order to create this _extended_walker list we have to read how many lines each of our .stm files has while initializing the dataset, causing a "not optimal I/O operation". The operation happens in lines 119-123. It is fast enough not to represent any downside for the user (release 3 contains the highest number of files).
I'm open to edits and suggestions on the implementation of the TEDLIUM dataset, and please note it is still a work in progress.
Things that still need work:
- a _getdict function to return the phoneme dictionary of the dataset
Thanks