"Official" train/dev/test? #1

hbredin opened this issue Jun 12, 2022 · 21 comments

Comments

@hbredin
Contributor

hbredin commented Jun 12, 2022

I have never reported results on CALLHOME because of the (apparent) lack of an official train/validation/test split (or at least validation/test split).

What experimental protocol does BUT use for reporting results?
Validation on part1, test on part2?
Validation on part2, test on part1?
Both?

cc @fnlandini

@fnlandini
Contributor

Hi @hbredin
Thanks for bringing this up.
It is true that even our setup has evolved over time.
Following the setup that we inherited from JSALT 2016, in our original works with VBHMM clustering-based methods (i.e. 1 and 2) we reported results on the whole set, excluding the file iaeu because it had labeling errors.
Later on, following the partition from Kaldi, we used part1 as validation and part2 as test, and the other way around, for cross-validation and for tuning the VBx hyperparameters. Still, we reported results on the whole set, using oracle VAD.

However, because Hitachi folks started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models, we did the same in our latest end-to-end work and continue with this setup. This is mainly because the community seems to have adopted this setup and we wanted to be able to compare against existing results.

@hbredin
Contributor Author

hbredin commented Jun 14, 2022

Thanks. That's very helpful.

So all papers by Hitachi use part1 for fine-tuning and part2 for testing?

What about updating the README with your answer? This would definitely help the community (in the same way AMI-diarization-setup does for AMI).

@hbredin
Contributor Author

hbredin commented Jun 14, 2022

cc @desh2608 @sw005320 @wq2012

@sw005320

> However, because Hitachi folks started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models

Yes, we used this setup.

@wq2012

wq2012 commented Jun 14, 2022

Thanks for sharing. FYI, in our previous work we did 5-fold evaluation.

We randomly partition the dataset into five subsets and, each time, leave one subset out for evaluation while training UIS-RNN on the other four. We then combine the evaluations on the five subsets and report the averaged DER.

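For anyone wanting to reproduce that kind of protocol, here is a minimal sketch of the 5-fold loop over recording IDs. `train_model` and `compute_der` are hypothetical placeholders for whatever training and scoring code is used, and averaging the per-fold DERs (rather than pooling errors over all files) is an assumption the comment above does not pin down.

```python
import random

def five_fold_der(file_ids, train_model, compute_der, seed=0):
    """Randomly partition recordings into five subsets; for each fold,
    train on the other four subsets and evaluate on the held-out one,
    then average the per-fold DER."""
    ids = list(file_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::5] for i in range(5)]  # five roughly equal subsets

    per_fold_der = []
    for k in range(5):
        eval_ids = folds[k]
        train_ids = [f for i, fold in enumerate(folds) if i != k for f in fold]
        model = train_model(train_ids)                     # hypothetical training routine
        per_fold_der.append(compute_der(model, eval_ids))  # hypothetical DER scoring routine
    return sum(per_fold_der) / len(per_fold_der)
```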

@desh2608

> However, because Hitachi folks started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models, we did the same in our latest end-to-end work and continue with this setup. This is mainly because the community seems to have adopted this setup and we wanted to be able to compare against existing results.

Yes, we used the same setup recently (cc @popcornell) where part1 was used for adaptation.

@fnlandini
Contributor

Thanks everyone for the comments.
@hbredin I've added a pointer to this issue in the README, and we can keep it open for future reference.

@hbredin
Contributor Author

hbredin commented Jun 14, 2022

Thanks everyone for your feedback!
Let's make our (future) results comparable :)

@hbredin
Contributor Author

hbredin commented Jul 19, 2022

There's one more thing that needs to be checked before our results really are comparable: the reference labels. Would it be possible to share them here as well?

@wq2012

wq2012 commented Jul 19, 2022

> There's one more thing that needs to be checked before our results really are comparable: the reference labels. Would it be possible to share them here as well?

The ones I used are shared here: https://github.com/google/speaker-id/tree/master/publications/LstmDiarization/evaluation/NIST_SRE2000

Disk 8 is CALLHOME, and Disk 6 is SwitchBoard.

@hbredin
Contributor Author

hbredin commented Jul 21, 2022

Thanks @wq2012. That is what I started using as well.
Can anyone else confirm that this is the only version circulating in our community?

@MireiaDS

Hi Hervé,

Callhome is LDC proprietary data that can only be obtained after purchase, and we believe we might run into copyright issues if we publish the reference files for it.
But given that @wq2012 publicly shared his, yes, they are the same ones we use, with the exception that, as mentioned above, we do not use the file iaeu.

We will consult with LDC about whether we can directly share our RTTM files here. It would be good to have everything together in the repository, but we prefer to be on the safe side and get approval first.

@wq2012

wq2012 commented Jul 22, 2022

Hmm, are you sure?

Is that the same version as the LDC CALLHOME?

IIRC we simply searched Google and downloaded them from other publicly available domains, and thought they had already been publicly circulated.

@hbredin
Contributor Author

hbredin commented Jul 22, 2022

> We will consult with LDC about whether we can directly share our RTTM files here. It would be good to have everything together in the repository, but we prefer to be on the safe side and get approval first.

Totally makes sense. Thanks!

@MireiaDS

@wq2012, there are several CALLHOME LDC datasets. That is why CALLHOME can refer to so many different sets in publications.
This specific CALLHOME data is not that easy to find unless you know its origin. It is part of the 2000 NIST Speaker Recognition Evaluation, which can be found under LDC Catalog No. LDC2001S97.
The references were released as part of the NIST keys after the evaluation.

We are waiting for a response from LDC and will post an update once we hear from them.

@wq2012

wq2012 commented Jul 26, 2022

Thanks! But I don't think the references are included in any of the LDC Catalogs.

@fnlandini
Contributor

For future reference, the RTTMs are also here: http://www.openslr.org/resources/10/sre2000-key.tar.gz
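In case it helps, here is a rough sketch of fetching that key and reading the RTTM SPEAKER lines into Python. The exact file names and directory layout inside the tarball are not guaranteed, so treat the paths as placeholders.

```python
import tarfile
import urllib.request
from collections import defaultdict

KEY_URL = "http://www.openslr.org/resources/10/sre2000-key.tar.gz"

def load_rttm(path):
    """Collect SPEAKER segments of an RTTM file as
    {file_id: [(onset_seconds, duration_seconds, speaker), ...]}."""
    segments = defaultdict(list)
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            # Standard RTTM column order: type, file, channel, onset, duration, ..., speaker name
            file_id, onset, duration, speaker = fields[1], float(fields[3]), float(fields[4]), fields[7]
            segments[file_id].append((onset, duration, speaker))
    return segments

# Fetch and unpack the key; inspect the extracted directory to locate the RTTM file(s).
urllib.request.urlretrieve(KEY_URL, "sre2000-key.tar.gz")
with tarfile.open("sre2000-key.tar.gz") as tar:
    tar.extractall("sre2000-key")
```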

@jaehyoun

jaehyoun commented Mar 8, 2024

> Hi @hbredin Thanks for bringing this up. It is true that even our setup has evolved over time. Following the setup that we inherited from JSALT 2016, in our original works with VBHMM clustering-based methods (i.e. 1 and 2) we reported results on the whole set, excluding the file iaeu because it had labeling errors. Later on, following the partition from Kaldi, we used part1 as validation and part2 as test, and the other way around, for cross-validation and for tuning the VBx hyperparameters. Still, we reported results on the whole set, using oracle VAD.
>
> However, because Hitachi folks started (and others followed) using part1 for fine-tuning and reporting results on part2 with EEND models, we did the same in our latest end-to-end work and continue with this setup. This is mainly because the community seems to have adopted this setup and we wanted to be able to compare against existing results.

Hi Hervé,
So you mean that for the Hitachi EEND-EDA experiments:
Train set = Callhome part 1
Validation set = Callhome part 2
Test set = Callhome part 2

Is that right?

@hbredin
Contributor Author

hbredin commented Mar 8, 2024

I guess this is for the Hitachi people to answer here.
But I do hope that they are not using the same set for both validation and testing :)

Here is what I do, on my side:

  • use 75% of Callhome part 1 as train
  • use the remaining 25% of Callhome part 1 as validation
  • use Callhome part 2 as test

I don't think the actual split of part 1 (into train and dev) is really critical.
As long as part 2 never leaks into the various training steps (either train or validation) and we all report numbers on part 2, comparison should be fair.
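For concreteness, a minimal sketch of the split described above, assuming you already have the lists of Part 1 and Part 2 recording IDs (`part1_ids` and `part2_ids` are placeholder names); as noted, the exact random split of Part 1 is not critical.

```python
import random

def split_callhome(part1_ids, part2_ids, train_ratio=0.75, seed=42):
    """Split Part 1 recording IDs into train/validation and keep Part 2 as the held-out test set."""
    ids = sorted(part1_ids)
    random.Random(seed).shuffle(ids)       # fixed seed so the split is reproducible
    n_train = int(round(train_ratio * len(ids)))
    return {
        "train": ids[:n_train],            # ~75% of Part 1
        "validation": ids[n_train:],       # remaining ~25% of Part 1
        "test": list(part2_ids),           # Part 2, never used for training or validation
    }
```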

@fnlandini
Contributor

I guess a good scenario is what Hervé described, where he has a split of Part 1. However, it can have the issue that the same speaker appears in the 75% used for training AND in the 25% used for validation, which can lead to over-optimistic results on the validation set. But it is certainly correct in that the test set (Part 2) is never used for developing the model.

If I can add, I am afraid that many people are making decisions on Part 2 (which is the test set), and that should not be the case. Very few works report results on Part 2 without fine-tuning, or comparisons on Part 1 (without fine-tuning).
Something I've been doing recently is to make all my comparisons (and decisions) on Part 1 without fine-tuning, and only at the very end perform fine-tuning using Part 1 in order to report results on Part 2.
Also, when doing fine-tuning there is the question of how many epochs to run. I used the same number for all the methods I was comparing. This might still play in favor of one method or another, but at least there is no direct decision made on the test set (like running fine-tuning until the performance stops improving on Part 2).

I would not mind hearing others' opinions :)

@jaehyoun

jaehyoun commented Mar 11, 2024

@hbredin @fnlandini Thank you for your quick and thoughtful responses!!
