Entity Localization Bug: Sentence. Doubly-detected sentence in many papers. #187

Open
andrewhead opened this issue Jan 4, 2021 · 0 comments
Labels
  • bad-entity-detection: An issue or task related to an entity that was detected in the wrong place
  • bug: Something isn't working
  • entity-localization: An issue or task related to entity localization
  • sentences: An issue or task related to sentences

andrewhead commented Jan 4, 2021

Description: In some papers, the same sentence is detected twice. To see this, open the paper in the reader interface (i.e., https://scholarphi.semanticscholar.org/?file=https://arxiv.org/pdf/[PAPER_ID].pdf&preset=demo) and enter the following CSS in the web inspector.

.sentence-annotation {
  background-color: rgba(0, 0, 255, 0.2);
}

Here is an example of a duplicated sentence (the sentence "Our decoder..."), from paper 1702.01287v1:

[image: screenshot of the duplicated "Our decoder..." sentence in paper 1702.01287v1]

Papers where I have seen this, out of the list of papers in #188, include:

  • 1701.07481v3
  • 1702.01287v1
  • 1701.02810v2
  • 1805.08660v1
  • 1906.00414v2
  • 1906.01502v1
  • 1908.00300v1
  • 1905.05475v2
  • 1706.08482v1
  • 1704.05838v1
  • 1903.00621v1
  • 1705.06566v2
  • 1806.02371v1
  • 1901.10159v1
  • 1811.12359v4
  • 1707.00683v3 (first page)
  • 1711.08028v4
  • 1905.10887v2
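To check several of these papers in a row, the reader URLs can be generated from the paper IDs. The following is a minimal sketch that only fills in the URL template from the description; the function name `reader_url` is a hypothetical helper, not part of the pipeline.

```python
# URL template copied from the issue description; {paper_id} replaces [PAPER_ID].
READER_URL = (
    "https://scholarphi.semanticscholar.org/"
    "?file=https://arxiv.org/pdf/{paper_id}.pdf&preset=demo"
)


def reader_url(paper_id: str) -> str:
    """Return the reader interface URL for an arXiv paper ID (hypothetical helper)."""
    return READER_URL.format(paper_id=paper_id)


if __name__ == "__main__":
    # A few of the affected papers listed above.
    for paper_id in ["1701.07481v3", "1702.01287v1", "1908.00300v1"]:
        print(reader_url(paper_id))
```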

Here are some additional notes:

  • For some papers, doubly-detected sentences seem to occur at the first sentence of the section (1908.00300v1, 1905.05475v2)

How to fix: I don't know the definitive cause of the error.

However, I suspect that one cause is that our pipeline used to color multiple entities the same color. This was recently fixed in #180. It could be that when we run the pipeline again, we see many of these duplicates disappear.

If this issue persists, then here are some ideas for fixes. One potential fix is to deduplicate sentences; if two sentences have overlapping bounding boxes, then filter out one of them.
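As a rough illustration of the deduplication idea, here is a minimal sketch. The `Box` and `Sentence` types, the `deduplicate` function, and the 0.5 overlap threshold are all assumptions made for this example; they are not the pipeline's actual data model, and the threshold would need tuning on real papers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box for one part of a sentence on a page."""
    page: int
    left: float
    top: float
    width: float
    height: float


@dataclass
class Sentence:
    """Hypothetical sentence entity with one or more bounding boxes."""
    id: str
    boxes: List[Box]


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0 if on different pages)."""
    if a.page != b.page:
        return 0.0
    x = min(a.left + a.width, b.left + b.width) - max(a.left, b.left)
    y = min(a.top + a.height, b.top + b.height) - max(a.top, b.top)
    return x * y if x > 0 and y > 0 else 0.0


def overlap_ratio(s1: Sentence, s2: Sentence) -> float:
    """Intersection area divided by the area of the smaller sentence.

    Assumes the boxes within one sentence do not overlap each other.
    """
    inter = sum(overlap_area(a, b) for a in s1.boxes for b in s2.boxes)
    area1 = sum(b.width * b.height for b in s1.boxes)
    area2 = sum(b.width * b.height for b in s2.boxes)
    smaller = min(area1, area2)
    return inter / smaller if smaller > 0 else 0.0


def deduplicate(sentences: List[Sentence], threshold: float = 0.5) -> List[Sentence]:
    """Keep the first of any group of sentences whose boxes overlap heavily."""
    kept: List[Sentence] = []
    for sentence in sentences:
        if all(overlap_ratio(sentence, k) < threshold for k in kept):
            kept.append(sentence)
    return kept
```

This keeps whichever sentence comes first in the input, so the caller decides which duplicate wins by choosing the order (for example, clean sentences first).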

Another potential fix is to skip processing detected sentences that are not marked as "clean" by the sentence detector, i.e., sentences that contain fewer than 2 English words, in order to strip out LaTeX junk. This assumes that most doubly-detected sentences are detected twice because the second instance is a junk sentence that appeared right before or after a clean sentence, but was colorized using the same character offsets.
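A minimal sketch of the "clean" filter, under stated assumptions: the sentence detector's real check is not shown in this issue, so the `vocabulary` set, the tokenization regex, and the function names below are hypothetical stand-ins. Only the threshold of 2 English words comes from the description above.

```python
import re
from typing import List, Set

# Assumption: a "word" is a run of ASCII letters; the real detector may differ.
TOKEN = re.compile(r"[A-Za-z]+")


def count_english_words(text: str, vocabulary: Set[str]) -> int:
    """Count tokens that appear in a caller-supplied English vocabulary."""
    return sum(1 for token in TOKEN.findall(text) if token.lower() in vocabulary)


def is_clean(text: str, vocabulary: Set[str], min_words: int = 2) -> bool:
    """A sentence is 'clean' if it has at least `min_words` English words."""
    return count_english_words(text, vocabulary) >= min_words


def filter_clean(sentences: List[str], vocabulary: Set[str]) -> List[str]:
    """Drop sentences that look like LaTeX junk before further processing."""
    return [s for s in sentences if is_clean(s, vocabulary)]


if __name__ == "__main__":
    vocab = {"our", "decoder", "is", "a", "model"}  # toy vocabulary for the demo
    candidates = ["Our decoder is a model.", r"\vspace{-2mm} \label{sec:x}"]
    print(filter_clean(candidates, vocab))
```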

This issue may be low severity. I do not know whether it impacts the behavior of clutter; if it does not, we may be able to ignore this error.

@andrewhead andrewhead added bug Something isn't working entity-localization An issue or task related to entity localization bad-entity-detection An issue or task related to an entity that was detected in the wrong place sentences An issue or task related to sentences labels Jan 4, 2021
@andrewhead andrewhead added this to the LaTeX Updates for Alpha milestone Jan 4, 2021