CutSet multiplexing #565
Conversation
:param cut_sets: cut sets to be multiplexed.
    They can be either lazy or eager, but the resulting manifest will always be lazy.
:param weights: an optional weight for each CutSet, affects the probability of it being sampled.
@pzelasko why not set the weights proportional to cutset sizes by default? That way you would deplete all cutsets at roughly the same time on average. If we keep it uniform, we risk that a small cutset A is depleted quickly and, for the rest of the epoch, only the larger cutset B remains.
Leaving it up to the user, in case the cut sets are very large and opened lazily (I can also imagine `len` not being available for some types of lazy manifests in the future).
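When the sizes are known, a user can still get the size-proportional behavior themselves; a small sketch, assuming the cut sets support `len` and that `CutSet.mux` accepts unnormalized positive weights:

```python
cut_sets = [cuts_libri, cuts_giga]

# Size-proportional weights: on average, all sources deplete at the same time.
mixed = CutSet.mux(*cut_sets, weights=[len(cs) for cs in cut_sets])
```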
# Draw one item from the iterator chosen by the sampled index.
selected = iters[idx]
try:
    item = next(selected)
    yield item
Is it possible to also return the `idx`? We may need to select a different network to process the returned item depending on the returned `idx`.

For instance, when combining gigaspeech and librispeech in transducer training, the encoder network is shared and there is a separate decoder+joiner network for each dataset. If the returned item is from gigaspeech, we would run the decoder+joiner for gigaspeech; if the returned item is from librispeech, we would run the decoder+joiner for librispeech.

If this function returns only `item` without `idx`, it is difficult to tell which dataset the returned `item` was sampled from, and therefore difficult to select which decoder+joiner to run.
Hmm, that would be problematic: it would break the API that CutSet and CutSampler depend on. But there are two other ways that would be much easier:
- identify where a cut comes from based on the cut/supervision/recording ID
- extend the manifests by assigning a custom field that identifies which domain the cut belongs to (`cut.origin = "libri"`) before saving them to disk (see the sketch below)
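A minimal sketch of the tagging approach; it relies on lhotse storing unknown cut attributes among the custom fields, and the `tag_origin` helper and manifest paths are made up for illustration:

```python
from lhotse import CutSet

def tag_origin(cuts: CutSet, origin: str) -> CutSet:
    # Assigning an unknown attribute places it in the cut's custom fields,
    # so it survives the round-trip through JSONL serialization.
    def _set(cut):
        cut.origin = origin
        return cut
    return cuts.map(_set)

tag_origin(CutSet.from_file("librispeech_cuts.jsonl.gz"), "libri").to_file(
    "librispeech_cuts_tagged.jsonl.gz"
)
```

At training time, the model code can then branch on `cut.origin` to pick the matching decoder+joiner.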
Nice!
# A seeded RNG makes the multiplexing order reproducible.
rng = random.Random(self.seed)
iters = [iter(it) for it in self.iterators]
# Track which underlying iterators have been fully consumed.
exhausted = [False for _ in range(len(iters))]
# Keep drawing items until every source is depleted.
while not all(exhausted):
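Pieced together, the whole loop behaves roughly like this self-contained sketch (simplified relative to the PR; the actual weight handling may differ):

```python
import random

def mux_iter(iterators, weights, seed=0):
    # Repeatedly pick a source at random, in proportion to its weight,
    # and yield one item from it, until every source is exhausted.
    rng = random.Random(seed)
    iters = [iter(it) for it in iterators]
    exhausted = [False] * len(iters)
    while not all(exhausted):
        active = [i for i, done in enumerate(exhausted) if not done]
        idx = rng.choices(active, weights=[weights[i] for i in active])[0]
        try:
            yield next(iters[idx])
        except StopIteration:
            exhausted[idx] = True
```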
Also, can we add support for breaking out of the while loop as soon as a specified cutset is exhausted?
Sure, see #585
CC @danpovey I think you might find it interesting for dataset combination, the code would look roughly like this:
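(A plausible sketch, assuming this PR's `CutSet.mux` together with `DynamicBucketingSampler`; the paths and parameter values are illustrative.)

```python
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts_libri = CutSet.from_file("librispeech_cuts.jsonl.gz")
cuts_giga = CutSet.from_file("gigaspeech_cuts.jsonl.gz")

# A single multiplexed stream replaces ZipSampler's parallel streams, so the
# duration-bucket estimation sees cuts from every dataset.
cuts = CutSet.mux(cuts_libri, cuts_giga)

sampler = DynamicBucketingSampler(cuts, max_duration=200.0, shuffle=True)
```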
It completely circumvents the issue of mismatched duration buckets with ZipSampler, as all datasets participate in the estimation of duration buckets in this approach (you might want to increase the `num_cuts_for_bucket_estimation` param in the sampler for a good estimate).