"Official" train/dev/test? #1
Hi @hbredin. Because Hitachi folks started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models, we did the same in our latest end-to-end work and continue with this setup. This is mainly because the community seems to have adopted it and we wanted to be able to compare against existing results.
Thanks. That's very helpful. So all papers by Hitachi use part1 for fine-tuning and part2 for testing? What about updating the README with your answer? This would definitely help the community (in the same way …)
Yes, we used this setup.
Thanks for sharing. FYI, in our previous work we did 5-fold evaluation.
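For concreteness, the 5-fold evaluation mentioned above can be sketched as follows: the recordings are partitioned into five folds, each fold is scored by a model tuned on the remaining four, and the per-fold results are pooled. This is only an illustrative sketch of the general protocol, not the exact BUT recipe; the recording IDs are placeholders.

```python
def five_fold_splits(recordings, n_folds=5):
    """Yield (train, test) partitions of a recording list for k-fold evaluation."""
    # Deal recordings into n_folds round-robin folds.
    folds = [recordings[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        # Train/adaptation pool is everything outside fold k.
        train = [r for j, fold in enumerate(folds) if j != k for r in fold]
        yield train, test

# Placeholder recording IDs; in practice these would be CALLHOME file IDs.
recordings = [f"rec{i:03d}" for i in range(10)]
for train, test in five_fold_splits(recordings):
    assert not set(train) & set(test)  # folds are disjoint
```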
Yes, we used the same setup recently (cc @popcornell), where part1 was used for adaptation.
Thanks everyone for the comments.
Thanks everyone for your feedback!
There's one more thing that needs to be checked before our results really are comparable: the reference labels. Would it be possible to share them here as well?
The ones I used are shared here: https://github.com/google/speaker-id/tree/master/publications/LstmDiarization/evaluation/NIST_SRE2000 Disk 8 is CALLHOME, and Disk 6 is Switchboard.
Thanks @wq2012. That is what I started using as well.
Hi Herve, CALLHOME is LDC proprietary data that can only be obtained after purchase, and we believe we might violate copyright if we publish the reference files from it. We will consult with LDC about whether we can share our RTTM files directly here; it would be good to have it all together in the repository, but we prefer to be on the safe side and get approval first.
Hmm, are you sure? Is that the same version as the LDC CALLHOME? IIRC we simply searched Google and downloaded them from other publicly available domains, and thought they had already been publicly circulated.
Totally makes sense. Thanks!
@wq2012, there are several CALLHOME LDC datasets. That is why CALLHOME can refer to so many different sets in publications. We are waiting for a response from LDC and will post an update after we hear from them.
Thanks! But I don't think the references are included in any of the LDC Catalogs. |
For future reference, the RTTMs are also here: http://www.openslr.org/resources/10/sre2000-key.tar.gz |
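The keys shared above are RTTM files. For readers unfamiliar with the format, a minimal parser for the SPEAKER lines might look like the sketch below (fields follow the NIST RTTM convention: type, file, channel, onset, duration, two `<NA>` placeholders, speaker name, confidence); this is a simplified reader, not a full RTTM implementation.

```python
def parse_rttm(lines):
    """Return {file_id: [(onset, offset, speaker), ...]} from RTTM SPEAKER lines."""
    segments = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip non-speaker rows and blank lines
        file_id = fields[1]
        onset, dur = float(fields[3]), float(fields[4])
        speaker = fields[7]
        segments.setdefault(file_id, []).append((onset, onset + dur, speaker))
    return segments

# Hypothetical example lines in RTTM format:
example = [
    "SPEAKER iaaa 1 0.00 2.50 <NA> <NA> A <NA>",
    "SPEAKER iaaa 1 2.50 1.75 <NA> <NA> B <NA>",
]
```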
Hi Herve, is it right?
I guess this is for the Hitachi people to answer. Here is what I do, on my side:
I don't think the actual split of part 1 (into train and dev) is really critical.
I guess a good scenario is what Hervé described, where Part 1 is split. However, it can have the issue that the same speaker appears in the 75% used for training AND in the 25% used for validation, and that can lead to over-optimistic results on the validation set. Still, it is certainly correct in that the test set (Part 2) is never used for developing the model.

If I can add, I am afraid that many people are making decisions on Part 2 (which is the test set), and that should not be the case. Very few works report results on Part 2 without fine-tuning, or comparisons on Part 1 (without fine-tuning). I would not mind …
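The speaker-overlap concern above can be avoided with a speaker-disjoint split: group recordings that share any speaker (connected components over shared speakers), then assign whole groups to train or dev so no speaker appears on both sides. A minimal sketch, assuming a hypothetical mapping from recording IDs to speaker labels:

```python
from collections import defaultdict

def speaker_disjoint_split(rec2spk, dev_fraction=0.25):
    """Split recordings into (train, dev) with no speaker shared across sides.

    rec2spk: dict mapping recording id -> set of speaker ids (hypothetical labels).
    """
    # Union-find over recordings; recordings sharing a speaker get one root.
    parent = {r: r for r in rec2spk}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    spk2recs = defaultdict(list)
    for rec, spks in rec2spk.items():
        for s in spks:
            spk2recs[s].append(rec)
    for recs in spk2recs.values():
        for other in recs[1:]:
            parent[find(other)] = find(recs[0])
    groups = defaultdict(list)
    for rec in rec2spk:
        groups[find(rec)].append(rec)
    # Greedily fill the dev set with whole groups up to the target size.
    target = dev_fraction * len(rec2spk)
    train, dev = [], []
    for grp in sorted(groups.values(), key=len):
        (dev if len(dev) < target else train).extend(grp)
    return train, dev
```

Note that because CALLHOME conversations each involve several speakers, a purely random recording-level split cannot guarantee disjoint speakers; assigning whole connected components is what makes the guarantee hold, at the cost of a dev set whose size only approximates the requested fraction.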
@hbredin @fnlandini Thank you for your quick and thoughtful responses!
I have never reported results on CALLHOME because of the (apparent) lack of an official train/validation/test split (or at least validation/test split).
What experimental protocol does BUT use for reporting results?
Validation on part1, test on part2?
Validation on part2, test on part1?
Both?
cc @fnlandini