
CutSet multiplexing #565

Merged 3 commits into master on Feb 3, 2022
Conversation

@pzelasko (Collaborator) commented on Feb 3, 2022:

CC @danpovey, I think you might find this interesting for dataset combination; the code would look roughly like this:

```python
libri_cuts = load_manifest_lazy(...)
giga_cuts = load_manifest_lazy(...)
tedlium_cuts = load_manifest_lazy(...)

muxed_cuts = CutSet.mux(libri_cuts, giga_cuts, tedlium_cuts, weights=[...], seed=10)

sampler = DynamicBucketingSampler(muxed_cuts, shuffle=True, num_buckets=..., ...)
```

This completely circumvents the issue of mismatched duration buckets that arises with ZipSampler, since in this approach all datasets participate in the estimation of duration buckets (you might want to increase the `num_cuts_for_bucket_estimation` parameter in the sampler to get a good estimate).

@pzelasko pzelasko added this to the v1.0 milestone Feb 3, 2022
@pzelasko pzelasko merged commit 825a884 into master Feb 3, 2022

```python
:param cut_sets: cut sets to be multiplexed.
    They can be either lazy or eager, but the resulting manifest will always be lazy.
:param weights: an optional weight for each CutSet, affects the probability of it being sampled.
```
Contributor:
@pzelasko why not set the weights proportional to cutset sizes by default? This way you would deplete all cutsets at the same time on average. If we keep it uniform, we are risking that a small cutset A is depleted fast and for the rest of the epoch, there will be only larger cutset B.

Collaborator Author (@pzelasko):
Leaving it up to the user in case the cut sets are very large and opened lazily (I can also imagine `len` not being available for some types of lazy manifests in the future).
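For eager cut sets where `len` is available, size-proportional weights can be computed by the user in a few lines. A minimal sketch (plain Python lists stand in for eager CutSets; the variable names are illustrative):

```python
# Hypothetical sketch: computing mux weights proportional to cut set sizes,
# which on average depletes all sources at the same time. Plain lists stand in
# for eager CutSets here; len() may be unavailable for lazy manifests, which is
# why lhotse leaves this to the user rather than doing it by default.
libri = ["libri-%d" % i for i in range(4)]
giga = ["giga-%d" % i for i in range(2)]

sizes = [len(libri), len(giga)]
total = sum(sizes)
weights = [s / total for s in sizes]  # larger cut sets get larger weights
```

The resulting `weights` list would then be passed to `CutSet.mux(..., weights=weights)`.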

```python
selected = iters[idx]
try:
    item = next(selected)
    yield item
```
Contributor:
Is it possible to also return the idx?

We may need to select a different network to process the returned item depending on the returned idx.

For instance, when combining gigaspeech and librispeech in transducer training, the encoder network is shared and there are two separate decoder+joiner networks for each dataset. If the returned item is from gigaspeech, we would run the decoder+joiner for gigaspeech; if the returned item is from librispeech, we would run the decoder+joiner for librispeech.

If this function returns only item without idx, it is difficult to tell which dataset the returned item is sampled from. Therefore, it's also difficult to select which decoder+joiner to run.

Collaborator Author (@pzelasko):
Hmm, that would be problematic: it would break the API that CutSet and CutSampler depend on. But there are two other ways that would be much easier:

  1. identify where a cut comes from based on its cut/supervision/recording ID;
  2. extend the manifests by assigning a custom field that identifies which domain the cut belongs to (e.g. `cut.origin = "libri"`) before saving them to disk.
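Option 1 could look roughly like this (a hypothetical helper; the ID prefixes are a naming convention the user would have to enforce, not something lhotse guarantees):

```python
# Hypothetical helper for option 1: recover the source dataset from the cut ID,
# assuming each corpus' cut IDs carry a recognizable prefix. The prefixes and
# the mapping below are illustrative assumptions, not part of lhotse's API.
def origin_from_id(cut_id: str) -> str:
    prefixes = {"libri": "librispeech", "giga": "gigaspeech", "ted": "tedlium"}
    for prefix, name in prefixes.items():
        if cut_id.startswith(prefix):
            return name
    return "unknown"
```

The training loop could then call this on each cut's ID to pick the matching decoder+joiner.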

@danpovey (Collaborator):

Nice!
Yes, this might be a good way to go. We can see which is better (might affect both speed and convergence).

```python
rng = random.Random(self.seed)
iters = [iter(it) for it in self.iterators]
exhausted = [False for _ in range(len(iters))]
while not all(exhausted):
```
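The excerpt above can be fleshed out into a self-contained sketch of the multiplexing loop. Plain iterables stand in for CutSets, and the function below is an illustration of the technique, not lhotse's exact implementation:

```python
import random

def mux(*iterables, weights=None, seed=0):
    """Yield items from several iterables, choosing the source at random
    (optionally weighted) until every iterable is exhausted."""
    rng = random.Random(seed)
    iters = [iter(it) for it in iterables]
    exhausted = [False] * len(iters)
    if weights is None:
        weights = [1] * len(iters)
    while not all(exhausted):
        # Sample an index only among the iterators that still have items left.
        active = [i for i, done in enumerate(exhausted) if not done]
        idx = rng.choices(active, weights=[weights[i] for i in active])[0]
        try:
            yield next(iters[idx])
        except StopIteration:
            exhausted[idx] = True

merged = list(mux([1, 2, 3], "ab", weights=[0.7, 0.3], seed=10))
# All five items appear exactly once, in a randomized interleaving.
```

Because exhausted sources are excluded from sampling, the stream continues until the last source runs dry, matching the "deplete everything" semantics discussed above.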
Contributor:
Also, can we support specifying that as soon as one cut set is exhausted, the while loop breaks?

Collaborator Author (@pzelasko):
Sure, see #585
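The requested behaviour could look roughly like the variant below: terminate the whole stream as soon as any source runs out. This is an illustrative sketch, not the implementation that landed in #585:

```python
import random

def mux_stop_early(*iterables, seed=0):
    """Like mux, but the stream ends as soon as ANY source is exhausted."""
    rng = random.Random(seed)
    iters = [iter(it) for it in iterables]
    while True:
        idx = rng.randrange(len(iters))
        try:
            yield next(iters[idx])
        except StopIteration:
            return  # the first exhausted source terminates the whole stream

# The two-element source bounds the stream: iteration stops once "xy"
# (or, improbably, the 100-element range) is depleted.
items = list(mux_stop_early(range(100), "xy", seed=0))
```

This is useful when a small dataset should cap the effective epoch length instead of the larger ones dominating its tail.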
