DisagrementMeasure agreement calculation wrong for 100 agreement? #35

reckart · 2021-04-13T10:27:39Z

The KrippendorffAlphaAgreement is a disagreement measure. If there is a full agreement, then the expected and observed disagreement is calculated as 0.0 in

    @Override
    public double calculateAgreement()
    {
        double D_O = calculateObservedDisagreement();
        double D_E = calculateExpectedDisagreement();
        if (D_O == D_E) {
            return 0.0;
        }
        else {
            return 1.0 - (D_O / D_E);
        }
    }

However, a disagreement of 0 in this case does not yield an agreement of 1.0 but instead an agreement of 0.0.... seems wrong?

Suggested fix:

                double D_O = calculateObservedDisagreement();
                double D_E = calculateExpectedDisagreement();
                if (D_O == 0.0 && D_E == 0.0) {
                    return 1.0;
                }
                return 1.0 - (D_O / D_E);

The text was updated successfully, but these errors were encountered:

reckart · 2021-04-13T10:28:12Z

@chmeyer are you still watching this repo? Is this a bug or a feature?

chmeyer · 2021-04-17T10:03:12Z

@reckart sort of, where time permits :)

I would not recommend this fix. Although Krippendorff's measure internally uses disagreement modeling (i.e., observed disagreement D_O and expected disagreement D_E, it is still defined as an agreement measure. This is achieved by the "1 - " term in the result calculation "1 - (D_O / D_E)". That's why the method is called "calculateAgreement" rather than "calculateDisagreement".

Example: Imagine, we see an observed disagreement of 0.5 (~ half of the annotations are "wrong") and due to the annotations, we would expect a disagreement of 0.57 (this happens for example in a 2-raters, 4-items AA, AB, BA, BB study). This means that the raters did only slightly better than chance, so we see alpha = 1 - (0.5 / 0.57) = 1 - 0.88 = 0.12. If the raters produce only an observed disagreement of 0.25, then they do clearly better than chance and we would obtain alpha = 1 - (0.25 / 0.57) = 1 - 0.43 = 0.57. To reach acceptable agreement levels, the raters would need to produce even less observed disagreement (or the expected disagreement raises).

Coming back to your question, if there is no observed disagreement and no expected disagreement, this would mean that we have an empty study and thus nothing to judge. Returning an alpha = 1 agreement would be misleading IMHO. Returning alpha = 0 as it is now, is also debateable, as there is no clear definition for the situation. Thus, NaN would be an option, but for practicality reasons (e.g., writing numbers in a database, computing averages, etc.), we did choose 0 in the first place and probably that's fine to keep.

What do you think? Best wishes!

reckart · 2021-04-17T10:06:06Z

In my case, I found that if I have two annotators who both annotate the same unit with the same label, then the expected and observed disagreement are both 0 and in the current code this causes the agreement to be reported as 0 - but it is full agreement and thus should be reported as 1.

reckart · 2021-04-17T10:06:54Z

Coming back to your question, if there is no observed disagreement and no expected disagreement, this would mean that we have an empty study and thus nothing to judge.

So the study is not necessarily empty if expected/observed disagreement are both 0.

reckart · 2021-04-17T10:07:58Z

Maybe?

if (D_O == D_E) {
  return study.isEmpty() ? 0.0 : 1.0;
}

chmeyer · 2021-04-17T10:16:48Z

Well, D_O can be 0.0 if the raters agree on all items. But D_E does not fall 0 in a proper study. It can be 0 if there is only a single label, but then there is nothing to agree on, i.e. no question. In my opinion, this would not a real annotation study. But if we want support this use case, then, yes, the study.isEmpty solution should be a way.

reckart · 2021-04-17T10:21:17Z

I assume you mean by "real" study that there is a significant number of annotations :) In INCEpTION/WebAnno, we use calculate pairwise agreement between annotators. It is not uncommon to have cases where

the annotators did not find any item to be labelled in their documents
the annotators did find very few items to be labelled and agree on all of them

We can and probably should handle the first case (no items) directly in our code telling the users that there was no data to compare.
However, the second case I think be better handled here.

Thanks for the feedback!

chmeyer · 2021-04-17T10:31:47Z

Few labels is not the actual problem: the simplest case AA BB (2 raters agree on 2 items) returns alpha = 1. Also cases with 4 items work well and give a good agreement notion that captures the uncertainty, e.g., in AA AB AA BB (alpha = 0.53). But if there is nothing to decide, i.e. there is only a single label, then we could have 1000 items that are annotated with A by both raters without being able to tell the agreement as there is no expectation model. I am OK to set such cases to 1, but still they should be taken with a grain of salt.

- Adjust based on discussion at dkpro/dkpro-statistics#35

reckart added the 🐛Bug Something isn't working label Apr 13, 2021

reckart added a commit to inception-project/inception that referenced this issue Apr 17, 2021

#2191 - Agreement does not take tagset into account

ecb3bd3

- Adjust based on discussion at dkpro/dkpro-statistics#35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DisagrementMeasure agreement calculation wrong for 100 agreement? #35

DisagrementMeasure agreement calculation wrong for 100 agreement? #35

reckart commented Apr 13, 2021 •

edited

Loading

reckart commented Apr 13, 2021

chmeyer commented Apr 17, 2021

reckart commented Apr 17, 2021

reckart commented Apr 17, 2021

reckart commented Apr 17, 2021

chmeyer commented Apr 17, 2021

reckart commented Apr 17, 2021

chmeyer commented Apr 17, 2021

DisagrementMeasure agreement calculation wrong for 100 agreement? #35

DisagrementMeasure agreement calculation wrong for 100 agreement? #35

Comments

reckart commented Apr 13, 2021 • edited Loading

reckart commented Apr 13, 2021

chmeyer commented Apr 17, 2021

reckart commented Apr 17, 2021

reckart commented Apr 17, 2021

reckart commented Apr 17, 2021

chmeyer commented Apr 17, 2021

reckart commented Apr 17, 2021

chmeyer commented Apr 17, 2021

reckart commented Apr 13, 2021 •

edited

Loading