inconsistent behavior: accents in search #6815

Closed
sbecuwe opened this issue Aug 31, 2020 · 10 comments

Comments

@sbecuwe

sbecuwe commented Aug 31, 2020

JabRef version 5.1 on macOS 10.14.6

I have tested the latest development version from http://builds.jabref.org/master/ and the problem persists.

It seems that accents in author names are not handled consistently.

  • "Lef{\`e}vre, V." is found when entering "Lefevre" in the search field. The same holds for "Pr{\'e}vost, M." when entering "Prevost".

  • However, "M{\"u}hlbach, G." is not found when entering "Muhlbach" in the search field. The same holds for "D{\'{\i}}az, J." when entering "Diaz", and "Nj{\aa}stad, O." when entering "Njastad" or "Njaastad".

@k3KAW8Pnf7mkmdSMPHz27
Member

I'll start looking into this

@k3KAW8Pnf7mkmdSMPHz27
Member

k3KAW8Pnf7mkmdSMPHz27 commented Nov 9, 2020

I can't replicate this ("Lefevre" does not match "Lef{\`e}vre, V." for me).
@sbecuwe, if the problem remains, could you send me your settings for "Language", "Default encoding" and "Default library mode" (in JabRef Preferences, see below)?

[Image: Screenshot 2020-11-09 at 13.44.40]

@sbecuwe
Author

sbecuwe commented Nov 11, 2020

I tried both ISO-8859-1 and UTF-8. Language is English, library mode is BibTeX.
It turns out to be trickier than that.

  • When I use the original BibTeX file or a copy of it, the described behavior is reproducible.
  • When I make a copy of the original file and delete all lines apart from the ones in the attachment, Lefevre et al. is no longer found by the search operation...

I double checked the file type of all files: it's "BibTeX text file, ASCII text" according to the file command.

example.txt

@k3KAW8Pnf7mkmdSMPHz27
Member

k3KAW8Pnf7mkmdSMPHz27 commented Nov 11, 2020

@sbecuwe I don't know if you'd be willing to share the original BibTeX file with me (I am NOT a JabRef developer, I am a volunteer). It would be easier to try to pinpoint what is going on using the debugger rather than going through the code. Also, I appreciate that you seem to be spending more time on this issue than you were probably intending to =/

In a nutshell,

  1. The behaviour you described should not happen: LaTeX in the field content is resolved before/during the search, so "{\`e}" becomes "è", which does not match "e".
  2. My best guess is that you are not matching the author field, in which case the match is most likely against
    1. the citation key, or something using the same code. Does Mueh match M{\"u}hlbach, G.? (Though in theory, "Nj{\aa}stad, O." should match "Njastad" if this were true.)
    2. some completely different field, in which case I have no idea what is going on. (Does the search term Lefevre and not author = Lefevre find anything?)

@Siedlerchr Siedlerchr added the status: waiting-for-feedback The submitter or other users need to provide more information about the issue label Nov 22, 2020
@calixtus
Member

Related to the whole messy complex of the Latex-To-Unicode converter (#6155).

@koppor koppor removed the status: waiting-for-feedback The submitter or other users need to provide more information about the issue label Aug 16, 2021
@koppor
Member

koppor commented Aug 16, 2021

@LingZhang22
Contributor

Hi, may I have a try on this issue? Is it possible to get some guidance on where I should start? Or is there any particular information I should look into? Thanks a lot!

@koppor
Member

koppor commented Apr 3, 2022

It is an issue where one needs to think hard what to do. One also needs to craft test cases.

Hints on the current code behavior:

  1. The current search is implemented in-memory without the help of any indexing service, database, etc.
  2. org.jabref.model.entry.BibEntry#getLatexFreeField returns the content of a field without LaTeX (but with Unicode). Example: M{\"u}hlbach becomes Mühlbach
  3. The method is used in all three searcher classes (Ctrl+Click on getLatexFreeField):
    [Image: screenshot]
  4. The simplest search processor is org.jabref.model.search.rules.ContainBasedSearchRule.
  5. There, org.jabref.model.search.rules.ContainBasedSearchRule#applyRule checks a match of the search string for an entry.
  6. The matching is if (formattedFieldContent.contains(word)) {. Thus, Muhlbach will never match Mühlbach, because Java checks for character equivalence.
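
To illustrate point 6, here is a minimal standalone sketch. It does not use any JabRef classes; `ContainsDemo` and `matches` are hypothetical names, and the field content is assumed to be already LaTeX-free, as `getLatexFreeField` would return it.

```java
final class ContainsDemo {
    // Mirrors the check quoted above: a plain String#contains on the
    // LaTeX-free field content. 'ü' (U+00FC) and 'u' (U+0075) are
    // different characters, so no accent folding happens here.
    static boolean matches(String formattedFieldContent, String word) {
        return formattedFieldContent.contains(word);
    }

    public static void main(String[] args) {
        String field = "M\u00FChlbach, G."; // "Mühlbach, G."
        System.out.println(matches(field, "Muhlbach"));      // false
        System.out.println(matches(field, "M\u00FChlbach")); // true
    }
}
```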

For a quick fix, other equivalence checks should be implemented. Java's Collator is a good start. Maybe, you can create a pull request:

  1. Add test cases mirroring the issue to org.jabref.model.search.rules.ContainBasedSearchRuleTest. These tests should fail.
  2. Change the equality check in org.jabref.model.search.rules.ContainBasedSearchRule to use the Collator. The tests should then pass.
  3. Add CHANGELOG.md entry
  4. Submit pull request.
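
As a rough sketch of the Collator idea (not JabRef code; `CollatorDemo` and `equalIgnoringAccents` are hypothetical names, and the English locale is an assumption): at `Collator.PRIMARY` strength, accent differences are ignored when two whole strings are compared.

```java
import java.text.Collator;
import java.util.Locale;

final class CollatorDemo {
    // PRIMARY strength compares base letters only; accent (and case)
    // differences are treated as equal. The locale is an assumption.
    static Collator accentInsensitiveCollator() {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    static boolean equalIgnoringAccents(String a, String b) {
        return accentInsensitiveCollator().compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(equalIgnoringAccents("M\u00FChlbach", "Muhlbach")); // "Mühlbach" vs "Muhlbach"
        System.out.println(equalIgnoringAccents("M\u00FChlbach", "Mahlbach"));
    }
}
```

Note that `Collator` compares whole strings; it has no substring search, so a contains-style match would need extra work on top of this.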

@LingZhang22
Contributor

Thank you so much for your detailed instructions. It really helps a lot!

The code uses String's contains method, and I couldn't find an equivalent method in the Collator API, so I searched online and found Normalizer. The search seems to perform as expected, but I am not sure whether it has any underlying issues. I have created a PR for that. Please let me know if this is not a suitable API to use and I will fix it. Thanks again!
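
For reference, a minimal sketch of what a Normalizer-based check can look like. This is an illustration under assumptions, not necessarily what the PR does; `AccentFolding`, `stripAccents` and `containsIgnoringAccents` are hypothetical names.

```java
import java.text.Normalizer;

final class AccentFolding {
    // Decompose to NFD so that e.g. 'ü' becomes 'u' + U+0308, then drop
    // all combining marks (Unicode category M).
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Hypothetical accent-insensitive variant of the contains check.
    static boolean containsIgnoringAccents(String fieldContent, String word) {
        return stripAccents(fieldContent).contains(stripAccents(word));
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("M\u00FChlbach"));                           // Muhlbach
        System.out.println(containsIgnoringAccents("D\u00EDaz, J.", "Diaz"));        // true
        System.out.println(containsIgnoringAccents("Nj\u00E5stad, O.", "Njaastad")); // false
    }
}
```

One caveat with this approach: letters without a canonical decomposition, such as "ø" or "ß", are left unchanged, and transliterations like "Njaastad" are not covered.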

@calixtus
Member

calixtus commented Apr 9, 2022

Should be fixed by #8640, thanks to @LingZhang22. @sbecuwe, could you please test the current dev version (after backing up your files) and check whether this issue persists? Thanks!

@calixtus calixtus closed this as completed Apr 9, 2022