What type of data would create the most accurate model? #7

spicytigermeat · 2023-12-18T22:57:07Z

spicytigermeat
Dec 18, 2023

Upon reviewing the documentation from qiuqiao, there are three options for training data:

Full Labels (transcriptions with ph_dur)
Weak Labels (transcriptions without ph_dur)
Audio Only (just audio)

The model I have trained uses just full labels, but would having examples of all 3 make a more accurate model for inference?

qiuqiao · 2023-12-19T08:47:14Z

qiuqiao
Dec 19, 2023
Maintainer

Having examples of all three types of data does not necessarily make a more accurate model for inference.

If you keep the full label data constant and add additional weak label or audio only data, generally this can lead to better performance.

If you keep the total duration of data constant and move some of the full label data to weak label and audio only datasets, generally the performance will decrease.

In fact, during training, full label data also contributes to the losses used for weak label and audio only data, and those losses are back-propagated as well. The reverse does not hold: weak label data contributes nothing to the full label loss, and audio only data contributes nothing to the weak label or full label losses.
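
To illustrate that asymmetry, here is a minimal Python sketch. The loss names (`full_label`, `weak_label`, `audio_only`) and the `total_loss` helper are hypothetical placeholders chosen for this example, not identifiers from the actual training code, and the real losses may be weighted or combined differently.

```python
from enum import Enum


class LabelType(Enum):
    FULL = "full"              # transcription with ph_dur
    WEAK = "weak"              # transcription without ph_dur
    AUDIO_ONLY = "audio_only"  # just audio


# Hypothetical loss names: which loss terms each kind of sample can supervise.
APPLICABLE_LOSSES = {
    LabelType.FULL: ("full_label", "weak_label", "audio_only"),
    LabelType.WEAK: ("weak_label", "audio_only"),
    LabelType.AUDIO_ONLY: ("audio_only",),
}


def total_loss(label_type, losses):
    """Sum only the loss terms that this sample's labels can supervise."""
    return sum(losses[name] for name in APPLICABLE_LOSSES[label_type])


# Made-up loss values, purely for illustration.
example = {"full_label": 1.0, "weak_label": 2.0, "audio_only": 4.0}
for label_type in LabelType:
    print(label_type.value, total_loss(label_type, example))
```

Under this sketch a full label sample drives every loss term, while an audio only sample drives just one, which is consistent with the point that converting full label data into weaker forms removes training signal.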

spicytigermeat · 2023-12-19T18:15:21Z

spicytigermeat
Dec 19, 2023
Author

Thanks for the tips! So you're saying that, for the best results, I should train with Full Labels and only very small amounts of audio-only data?

1 reply

qiuqiao Dec 19, 2023
Maintainer

This is only the case when the total duration of the training data remains constant.
