Entity Localization Bug: Sentence. Doubly-detected sentence in many papers. #187
Labels
bad-entity-detection
An issue or task related to an entity that was detected in the wrong place
bug
Something isn't working
entity-localization
An issue or task related to entity localization
sentences
An issue or task related to sentences
Description: In some papers, the same sentence gets detected twice. This can be observed by opening up that paper in the reader interface (i.e., https://scholarphi.semanticscholar.org/?file=https://arxiv.org/pdf/[PAPER_ID].pdf&preset=demo) and then entering the following CSS in the web inspector.
Here is an example of a duplicated sentence (the sentence "Our decoder..."), from paper 1702.01287v1:
Additional papers I have seen this for include (out of the list of papers that can be seen here #188):
Here are some additional notes:
How to fix: I don't know the definitive cause of the error.
However, I suspect that one cause is that our pipeline used to color multiple entities the same color. This was recently fixed in #180. It could be that when we run the pipeline again, we see many of these duplicates disappear.
If this issue persists, then here are some ideas for fixes. One potential fix is to deduplicate sentences; if two sentences have overlapping bounding boxes, then filter out one of them.
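As a rough illustration of the deduplication idea, here is a minimal sketch. The `BoundingBox` and `Sentence` classes, the intersection-over-union test, and the `0.5` threshold are all assumptions made for this example; they are not the pipeline's actual data structures or settings.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    # Hypothetical box format: page index plus a rectangle on that page.
    page: int
    left: float
    top: float
    width: float
    height: float


@dataclass
class Sentence:
    # Hypothetical sentence record; the real pipeline's schema may differ.
    id: str
    text: str
    boxes: List[BoundingBox]


def iou(a: BoundingBox, b: BoundingBox) -> float:
    """Intersection-over-union of two boxes; 0.0 if they are on different pages."""
    if a.page != b.page:
        return 0.0
    x_overlap = min(a.left + a.width, b.left + b.width) - max(a.left, b.left)
    y_overlap = min(a.top + a.height, b.top + b.height) - max(a.top, b.top)
    if x_overlap <= 0 or y_overlap <= 0:
        return 0.0
    intersection = x_overlap * y_overlap
    union = a.width * a.height + b.width * b.height - intersection
    return intersection / union if union > 0 else 0.0


def overlaps(s1: Sentence, s2: Sentence, threshold: float = 0.5) -> bool:
    """True if any box of s1 overlaps any box of s2 by at least `threshold` IoU."""
    return any(iou(a, b) >= threshold for a in s1.boxes for b in s2.boxes)


def deduplicate(sentences: List[Sentence], threshold: float = 0.5) -> List[Sentence]:
    """Keep the first sentence of each overlapping group; drop later ones."""
    kept: List[Sentence] = []
    for sentence in sentences:
        if not any(overlaps(sentence, k, threshold) for k in kept):
            kept.append(sentence)
    return kept
```

Keeping the first sentence in each overlapping group is an arbitrary choice. If junk sentences tend to come first, it may be better to prefer the sentence with more text instead.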
Another potential fix is to skip sentences that the sentence detector does not mark as "clean", i.e., sentences that contain fewer than 2 English words, in order to strip out LaTeX junk. This assumes that most doubly-detected sentences are detected twice because the second instance is a junk sentence that appeared right before or after a clean sentence but was colorized using the same character offsets.
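The "clean" filter could look something like the sketch below. The word test here (a whitespace-separated token made only of ASCII letters, at least two characters long) is a crude stand-in chosen for this example; the sentence detector's real cleanliness check may be defined differently, and the function names are hypothetical.

```python
import re
from typing import List

# Assumed definition of an "English word": two or more ASCII letters and
# nothing else, after stripping surrounding punctuation. LaTeX tokens such as
# "\begin{equation}" or "x_1" do not match.
WORD_PATTERN = re.compile(r"[A-Za-z]{2,}")
PUNCTUATION = ".,;:!?()[]\"'"


def count_english_words(text: str) -> int:
    """Count whitespace-separated tokens that look like plain English words."""
    tokens = (token.strip(PUNCTUATION) for token in text.split())
    return sum(1 for token in tokens if WORD_PATTERN.fullmatch(token))


def is_clean(text: str, min_words: int = 2) -> bool:
    """A sentence is treated as clean if it has at least `min_words` words."""
    return count_english_words(text) >= min_words


def filter_clean(sentences: List[str], min_words: int = 2) -> List[str]:
    """Drop sentences that fail the cleanliness check before colorization."""
    return [s for s in sentences if is_clean(s, min_words)]
```

For example, `filter_clean(["Our decoder uses attention.", "\\begin{equation} x_1 \\end{equation}"])` keeps only the first string under this definition.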
This issue may be low severity. I do not know whether it impacts the behavior of clutter; if it does not, we may be able to ignore this error.