CutSet multiplexing #565
Conversation
:param cut_sets: cut sets to be multiplexed.
    They can be either lazy or eager, but the resulting manifest will always be lazy.
:param weights: an optional weight for each CutSet, affects the probability of it being sampled.
@pzelasko why not set the weights proportional to cutset sizes by default? That way you would deplete all cutsets at roughly the same time on average. If we keep it uniform, we risk that a small cutset A is depleted quickly and, for the rest of the epoch, only the larger cutset B remains.
Leaving it up to the user, in case the cut sets are very large and opened lazily (I can also imagine `len` not being available for some types of lazy manifests in the future).
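When the sizes are known, a user can still get the size-proportional behavior themselves; a small sketch, assuming the cut sets support `len` and that `CutSet.mux` accepts unnormalized positive weights:

```python
cut_sets = [cuts_libri, cuts_giga]

# Size-proportional weights: on average, all sources deplete at the same time.
mixed = CutSet.mux(*cut_sets, weights=[len(cs) for cs in cut_sets])
```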
# Draw one item from the iterator chosen by the sampled index.
selected = iters[idx]
try:
    item = next(selected)
    yield item
Is it possible to also return the `idx`? We may need to select a different network to process the returned item depending on the returned `idx`.

For instance, when combining gigaspeech and librispeech in transducer training, the encoder network is shared and there is a separate decoder+joiner network for each dataset. If the returned item is from gigaspeech, we would run the decoder+joiner for gigaspeech; if the returned item is from librispeech, we would run the decoder+joiner for librispeech.

If this function returns only `item` without `idx`, it is difficult to tell which dataset the returned `item` was sampled from, and therefore difficult to select which decoder+joiner to run.
Hmm, that would be problematic: it would break the API that CutSet and CutSampler depend on. But there are two other ways that would be much easier:
- identify where a cut comes from based on the cut/supervision/recording ID
- extend the manifests by assigning a custom field that identifies which domain the cut belongs to (`cut.origin = "libri"`) before saving them to disk (see the sketch below)
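A minimal sketch of the tagging approach; it relies on lhotse storing unknown cut attributes among the custom fields, and the `tag_origin` helper and manifest paths are made up for illustration:

```python
from lhotse import CutSet

def tag_origin(cuts: CutSet, origin: str) -> CutSet:
    # Assigning an unknown attribute places it in the cut's custom fields,
    # so it survives the round-trip through JSONL serialization.
    def _set(cut):
        cut.origin = origin
        return cut
    return cuts.map(_set)

tag_origin(CutSet.from_file("librispeech_cuts.jsonl.gz"), "libri").to_file(
    "librispeech_cuts_tagged.jsonl.gz"
)
```

At training time, the model code can then branch on `cut.origin` to pick the matching decoder+joiner.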
Nice!
# A seeded RNG makes the multiplexing order reproducible.
rng = random.Random(self.seed)
iters = [iter(it) for it in self.iterators]
# Track which underlying iterators have been fully consumed.
exhausted = [False for _ in range(len(iters))]
# Keep drawing items until every source is depleted.
while not all(exhausted):
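Pieced together, the whole loop behaves roughly like this self-contained sketch (simplified relative to the PR; the actual weight handling may differ):

```python
import random

def mux_iter(iterators, weights, seed=0):
    # Repeatedly pick a source at random, in proportion to its weight,
    # and yield one item from it, until every source is exhausted.
    rng = random.Random(seed)
    iters = [iter(it) for it in iterators]
    exhausted = [False] * len(iters)
    while not all(exhausted):
        active = [i for i, done in enumerate(exhausted) if not done]
        idx = rng.choices(active, weights=[weights[i] for i in active])[0]
        try:
            yield next(iters[idx])
        except StopIteration:
            exhausted[idx] = True
```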
Also, can we add support for breaking out of the while loop as soon as a specified cutset is exhausted?
Sure, see #585
CC @danpovey I think you might find it interesting for dataset combination, the code would look roughly like this:
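(A plausible sketch, assuming this PR's `CutSet.mux` together with `DynamicBucketingSampler`; the paths and parameter values are illustrative.)

```python
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts_libri = CutSet.from_file("librispeech_cuts.jsonl.gz")
cuts_giga = CutSet.from_file("gigaspeech_cuts.jsonl.gz")

# A single multiplexed stream replaces ZipSampler's parallel streams, so the
# duration-bucket estimation sees cuts from every dataset.
cuts = CutSet.mux(cuts_libri, cuts_giga)

sampler = DynamicBucketingSampler(cuts, max_duration=200.0, shuffle=True)
```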
It completely circumvents the issue of mismatched duration buckets with ZipSampler, as all datasets participate in the estimation of duration buckets in this approach (you might want to increase the `num_cuts_for_bucket_estimation` param in the sampler for a good estimate).