Reproducibility of WAT results #458

Open
flackbash opened this issue Sep 17, 2024 · 5 comments

Comments

@flackbash
flackbash commented Sep 17, 2024

Dear authors,

First of all, thank you for the great work you do in making entity linking results more comparable.

My question is specifically about GERBIL's WAT annotator:
I get different results when selecting WAT as an annotator in the A2KB task versus when I use my own NIF API which simply forwards requests from GERBIL to the official WAT API.

My setup is as follows:
I built my own NIF API which forwards the text GERBIL posts to the WAT API at https://wat.d4science.org/wat/tag/tag.
I do not provide any additional parameters to the WAT API.
I take the result from the WAT API, extract the span start and end from the fields start and end and the entity title from the title field.
I create an entity URI as follows (in Python):

from urllib.parse import quote
entity_uri = "http://dbpedia.org/resource/" + quote(wiki_title.replace(" ", "_"))

Then I send the span and the entity URI back to GERBIL.
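
For anyone reproducing this setup, the conversion step can be sketched as below. This is only a sketch of the forwarding logic described above, not GERBIL's or WAT's code. The helper name `wat_to_gerbil` is made up for illustration, and the score field name `rho` is an assumption about the WAT response that should be checked against the actual JSON.

```python
from urllib.parse import quote

DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def wat_to_gerbil(annotation, score_key="rho"):
    """Map one WAT annotation (a dict) to (start, end, entity_uri, score).

    `score_key` is an assumed field name; the score is None if it is absent.
    """
    title = annotation["title"]
    # Same URI construction as in the snippet above.
    entity_uri = DBPEDIA_PREFIX + quote(title.replace(" ", "_"))
    score = annotation.get(score_key)
    return annotation["start"], annotation["end"], entity_uri, score


# Hypothetical annotation, shaped like the fields named above.
example = {"start": 0, "end": 5, "title": "David Beckham", "rho": 0.42}
print(wat_to_gerbil(example))
```

Returning the score alongside the span and URI makes it easy to decide later whether to forward it in the NIF response or drop it.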

The results I get using this approach differ from those I get when simply selecting WAT as annotator in GERBIL. On KORE50 for example, I get a Micro InKB F1 score of 0.5512 using my NIF API and 0.5781 when selecting WAT as annotator.
See this experiment: http://gerbil.aksw.org/gerbil/experiment?id=202409170001

I was wondering if GERBIL sets any additional parameters in the call to the API or filters the returned entities by score using a threshold. Looking at the GERBIL code, I didn't see any of that though.
Can you confirm that GERBIL does not use additional API parameters and does not filter results by score? This would already help me to narrow down the problem.

I just realized that the results for the recognition task are the same, so the problem might be in the URI matching.
How exactly does GERBIL create URIs from the Wikipedia titles predicted by WAT?

Any other hints to where this discrepancy could come from are highly appreciated.

Many thanks in advance!

@MichaelRoeder
Member

Thank you for using GERBIL 🙂

I hope we can find the difference, together 👍

For A2KB, we send a request to "https://wat.d4science.org/wat/tag/tag". Apart from the document text and our API key, we do not use any additional parameters.

I assume that the difference comes from how we make use of confidence scores. We choose the confidence threshold that gives us the best Micro F1 score. You can find the chosen threshold in the "confidence threshold" column of the results. If you forward the confidence scores, too, you should achieve the same results.
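
To illustrate the idea, here is a minimal sketch of picking the threshold that maximizes Micro F1 against a gold standard. It is not GERBIL's actual implementation: the tuple layout, the function names, and the choice to keep annotations with `score >= threshold` are all assumptions made for the example.

```python
def micro_f1(tp, fp, fn):
    """Micro F1 from pooled true positive, false positive, false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(predictions, gold):
    """Return (threshold, f1) maximizing Micro F1.

    predictions: list of (doc_id, start, end, uri, score)
    gold:        set of (doc_id, start, end, uri)
    Assumption: an annotation is kept if score >= threshold.
    """
    best_t, best_f1 = None, -1.0
    # Only the observed scores need to be tried as candidate thresholds.
    for t in sorted({p[4] for p in predictions}):
        kept = {p[:4] for p in predictions if p[4] >= t}
        f1 = micro_f1(len(kept & gold), len(kept - gold), len(gold - kept))
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


gold = {("d1", 0, 5, "A"), ("d1", 10, 15, "B"), ("d2", 0, 3, "C")}
preds = [
    ("d1", 0, 5, "A", 0.9),
    ("d1", 10, 15, "X", 0.2),
    ("d2", 0, 3, "C", 0.6),
    ("d2", 5, 8, "Y", 0.1),
]
print(best_threshold(preds, gold))
```

In this toy data, the two wrong annotations have the lowest scores, so the selected threshold removes them and the score rises above what the unfiltered output would get.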

The received Wikipedia article title is used to directly create a DBpedia IRI. With our sameAs retrieval approach described in our journal paper, we should end up with a set of IRIs including the DBpedia and Wikipedia IRIs.
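
As a rough illustration only: the sketch below builds a small IRI set from a title and treats two entities as equal when their IRI sets overlap. The real sameAs retrieval performs lookups that are not reproduced here, and both the helper names and the overlap rule are assumptions of this sketch.

```python
from urllib.parse import quote


def iri_set_for_title(title):
    """Build an illustrative IRI set (DBpedia + English Wikipedia) for a title."""
    name = quote(title.replace(" ", "_"))
    return {
        "http://dbpedia.org/resource/" + name,
        "http://en.wikipedia.org/wiki/" + name,
    }


def same_entity(predicted_iris, gold_iris):
    """Assumed matching rule: two IRI sets match if they share at least one IRI."""
    return not predicted_iris.isdisjoint(gold_iris)


predicted = iri_set_for_title("Angela Merkel")
print(same_entity(predicted, {"http://en.wikipedia.org/wiki/Angela_Merkel"}))
```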

I hope that this issue didn't consume a lot of your time. Please let us know if you think that the behavior of GERBIL is unreasonable and should be changed or improved. 🙂

@flackbash
Author

Thanks a lot for the quick reply!
I had so far successfully ignored the confidence threshold column as I rarely scroll that far to the right...
Using the reported confidence thresholds, I can now reproduce the results, thank you for the clarification :)

My first intuition is that setting the confidence threshold individually for each benchmark gives systems like WAT, which delegate the task of finding a good confidence threshold to the user, an unfair advantage over other systems.
I know that systems like WAT, DBpedia Spotlight or TagMe sell this delegation as a feature and argue that it gives the user more control over precision vs. recall.
However, most other linkers could probably output some kind of confidence score, too, but they aim at providing a single setting that gives good results for most benchmarks instead of making the user figure out a threshold that works well.
In my opinion, setting the confidence threshold individually for each benchmark also does not represent a realistic scenario: a user in a real-world setting will most likely not set a confidence threshold for each piece of text that is processed, and setting the threshold such that the results are optimal would basically require generating a ground truth for the processed text.

I personally don't think it would be unfair to take the results that the API outputs as they are without any filtering at all since these are the results a user can expect, if they don't do any additional tweaking. Right now, the results are the upper bound of what a user can expect from the linker (without changing the API parameters).

It's an interesting problem and very relevant for me as I'm currently writing an analysis and comparison of different entity linkers, so I also need to figure out how to best deal with this... I would love to hear your point of view on it!

Again, thank you for the quick reply and clarifications, it really spared me a headache!

@MichaelRoeder
Member

Yes, I am also slightly unhappy with the way we implemented comparison. I think that we could offer much more information and insights to the user about the confidence scores and their impact on the evaluation scores.

While I agree with your negative points (results become an upper bound; the comparison can be seen as unfair since we use our knowledge about the test set gold standard to find the confidence threshold), I would like to point out that previous works had a "barrier" between systems with and without confidence scores and we tried to get rid of this separation. I also think that the confidence score is actually a nice, additional feature. On the other hand, I understand the argument that a user may not make use of it 😉

With respect to your comparison of linkers, I guess the main goal has to be the fairness of comparison. There could be different ways to handle it (I do not know the exact context of your work, so my suggestions might be wrong 😅):

  1. Ignore confidence scores and compare all systems based on what they provide. That would work but it is quite easy for others to argue against your results.

  2. Make use of confidence scores.
    2.1. Decide what to optimize for: we optimize the Micro F1 score, but you could also go for other Micro, Macro or weighted average scores.
    2.2. Decide on which data you base the optimization: the main disadvantage, already described above, is that the optimization is done during the evaluation based on the gold standard. In our use case, this is related to the way GERBIL works, but I agree that it is sub-optimal. You could also choose one of the following (better) strategies:

    • Run an evaluation on the training data and find the threshold based on these results. This is the "classic" approach but it also assumes that you differentiate between train and test datasets and that you have at least one training dataset for each test dataset.
    • Run the evaluation similar to a cross validation, i.e., you gather the evaluation data for all datasets. Then, you choose dataset 1, take it out, determine the best threshold based on all other datasets and get the evaluation results for dataset 1 by applying the threshold to the evaluation data that you gathered for this dataset. Then, you would repeat the same strategy for all other datasets.

    2.3. Make your critique of confidence scores a point of your work, i.e., analyze to which extent the evaluation results vary based on the confidence scores and how they are chosen. For example, compare the upper bound calculated by GERBIL to the evaluation values that you got.
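
The cross-validation-like strategy from 2.2 could be sketched roughly as follows. This is only an illustration under assumptions: the data layout (a dict of dataset name to predictions and gold annotations), the function names, and the `score >= threshold` rule are hypothetical, and at least two datasets with predictions are required.

```python
def counts(predictions, gold, threshold):
    """Return (tp, fp, fn) after keeping predictions with score >= threshold."""
    kept = {p[:-1] for p in predictions if p[-1] >= threshold}
    return len(kept & gold), len(kept - gold), len(gold - kept)


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def leave_one_out(datasets):
    """datasets: {name: (predictions, gold)}; prediction tuples end with a score.

    For each dataset, the threshold is tuned on all *other* datasets
    (pooled Micro F1) and then applied to the held-out one.
    """
    results = {}
    for held_out in datasets:
        others = [v for k, v in datasets.items() if k != held_out]
        candidates = sorted({p[-1] for preds, _ in others for p in preds})

        def pooled_f1(t):
            totals = [sum(c) for c in zip(*(counts(p, g, t) for p, g in others))]
            return f1(*totals)

        threshold = max(candidates, key=pooled_f1)
        results[held_out] = (threshold, f1(*counts(*datasets[held_out], threshold)))
    return results


datasets = {
    "A": ([("d", 0, 1, "x", 0.9), ("d", 2, 3, "y", 0.3)], {("d", 0, 1, "x")}),
    "B": ([("e", 0, 1, "u", 0.8), ("e", 2, 3, "v", 0.2)], {("e", 0, 1, "u")}),
}
print(leave_one_out(datasets))
```

In this toy example the threshold tuned on dataset A is too strict for dataset B, which is exactly the kind of gap to the per-dataset upper bound that such an analysis would expose.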

Your work sounds very interesting and I would like to know more about it. Feel free to write me a mail if you have questions or if you would like to discuss how we could support your work.

@flackbash
Author

Thanks a lot for your input on this!

One more aspect of this:
I'm trying to understand to which extent authors of entity linking papers are aware that a confidence threshold is set per benchmark in the GERBIL evaluation.
In the GERBIL papers https://svn.aksw.org/papers/2015/WWW_GERBIL/public.pdf and https://www.semantic-web-journal.net/system/files/swj1671.pdf the only reference I find is in the former paper. The paper mentions the SA2KB task. The description of the task mentions that the scores returned by a linker are used during the evaluation, but not how they are used.
The latter paper does not mention the task or how confidence thresholds are set from what I can see.
Was the difference between the SA2KB and the A2KB task originally exactly this, i.e., setting the confidence threshold?

The only reference I found in the GERBIL Wiki was here: https://github.com/dice-group/gerbil/wiki/Experiment-types.

Please let me know if I missed anything. I only skimmed through the papers and documentation.

Given that this behavior is, from what I can tell, currently not well documented, my assumption is that authors often (or at least sometimes) are not aware that if they provide a score to GERBIL, it will be used to tune the results using knowledge about the test data (please let me know if you think otherwise).
For example, the recently introduced linker SpEL, which yields state-of-the-art results on the AIDA-CoNLL benchmark, provides scores to GERBIL in its NIF API (and from what I can tell, the reported results in the paper, specifically table 5, are achieved using the individual confidence thresholds). However, I didn't find any mention of using a confidence threshold in the paper. They merely report that they compute a probability score and then predict the entity with the highest probability. From this it seems to me that they did not intend the scores to be used in this way and are not aware of how this affects their results, because something like this should definitely be mentioned in a paper.

I think it would help a lot if you would make it clearer in GERBIL's documentation how provided scores are used.

However, I still think it would be fairer to force systems (or the user) to decide on a threshold (I agree that making no use of the scores at all can also be problematic).

Thanks a lot for your recommendations and I'll definitely get back to the offer when more questions come up! I will probably do something along the lines of 2.3: Use a fixed threshold based on the recommended threshold from the paper or API documentation and then compare it to the upper bound results.

@MichaelRoeder
Member

  • improve the documentation about the usage of confidence scores

@MichaelRoeder MichaelRoeder reopened this Sep 25, 2024