inconsistent behavior: accents in search #6815

sbecuwe · 2020-08-31T08:41:40Z

JabRef version 5.1 on macOS 10.14.6

I have tested the latest development version from http://builds.jabref.org/master/ and the problem persists.

It seems that accents in author names are not handled consistently.

"Lef{\`e}vre, V." is found when entering "Lefevre" in the search field. The same holds for "Pr{\'e}vost, M." when entering "Prevost".
However, "M{\"u}hlbach, G." is not found when entering "Muhlbach" in the search field. The same holds for "D{\'{\i}}az, J." when entering "Diaz", and "Nj{\aa}stad, O." when entering "Njastad" or "Njaastad".

k3KAW8Pnf7mkmdSMPHz27 · 2020-11-08T22:45:44Z

I'll start looking into this

k3KAW8Pnf7mkmdSMPHz27 · 2020-11-09T18:47:21Z

I can't replicate ("Lefevre" does not match "Lef{\`e}vre, V." for me).
@sbecuwe , if this problem still remains, could you send me your settings for "Language", "Default encoding" and "Default library mode" (in JabRef Preferences, see below)?

sbecuwe · 2020-11-11T11:01:30Z

I tried both ISO-8859-1 and UTF-8. Language is English, library mode is BibTeX.
It turns out to be more tricky.

When I use the original BibTeX file or a copy of it, the described behavior is reproducible.
When I make a copy of the original file and delete all lines, apart from the ones in the attachment, Lefevre et al is no longer found by the search operation...

I double checked the file type of all files: it's "BibTeX text file, ASCII text" according to the file command.

example.txt

k3KAW8Pnf7mkmdSMPHz27 · 2020-11-11T15:56:27Z

@sbecuwe I don't know if you'd be willing to share the original BibTeX file with me (I am NOT a JabRef developer, I am a volunteer). It would be easier to try to pinpoint what is going on using the debugger rather than going through the code. Also, I appreciate that you seem to be spending more time on this issue than you were probably intending to =/

In a nutshell,

The behaviour you described should not happen: the LaTeX in the field content is resolved before/during the search, so "{\`e}" becomes "è", which does not match "e".
My best guess is that you are not matching the author field, in which case the match most likely comes from
1. the citation key, or something using the same code (does Mueh match M{\"u}hlbach, G.? Though in theory "Nj{\aa}stad, O." should then match "Njastad"), or
2. some completely different field, in which case I have no idea what is going on (does the search term Lefevre, rather than author = Lefevre, find anything?).

calixtus · 2021-02-15T20:40:55Z

Related to the whole messy complex of the Latex-To-Unicode converter (#6155).

koppor · 2021-08-16T19:23:53Z

Use the discussion as test cases for Unicode conversion (especially the round-trip, see "Rework superscript: latex-to-unicode and unicode-to-latex roundtrip not working" #3644).

LingZhang22 · 2022-03-30T06:33:23Z

Hi, may I have a try on this issue? Is it possible to get some guidance on where I should start? Or is there any particular information I should look into? Thanks a lot!

koppor · 2022-04-03T22:43:16Z

It is an issue where one needs to think hard about what to do. One also needs to craft test cases.

Hints on the current code behavior:

The current search is implemented in-memory without the help of any indexing service, database, etc.
org.jabref.model.entry.BibEntry#getLatexFreeField returns the content of a field without LaTeX (but with Unicode). Example: M{\"u}hlbach becomes Mühlbach.
The method is used in all three searcher classes (Ctrl+Click on getLatexFreeField):
The simplest search processor is org.jabref.model.search.rules.ContainBasedSearchRule.
There, org.jabref.model.search.rules.ContainBasedSearchRule#applyRule checks a match of the search string for an entry.
The matching is if (formattedFieldContent.contains(word)) {. Thus, Muhlbach will never match Mühlbach, because Java checks for character equivalence.

For a quick fix, other equivalence checks should be implemented. Java's Collator is a good start. Maybe you can create a pull request:

1. Add test cases mirroring the issue to org.jabref.model.search.rules.ContainBasedSearchRuleTest. These tests should fail.
2. Change the equality check in org.jabref.model.search.rules.ContainBasedSearchRule to use the Collator. The tests should then pass.
3. Add a CHANGELOG.md entry.
4. Submit the pull request.
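
To illustrate the hint above, here is a minimal standalone sketch (not JabRef code) of why `String.contains` misses the accented name and how a `java.text.Collator` at `PRIMARY` strength treats the two spellings as equivalent. The class name `CollatorAccentDemo` and the helper `equalsIgnoringAccents` are made up for this example, and the choice of `Locale.ROOT` is an assumption. Non-ASCII letters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorAccentDemo {

    // Hypothetical helper: compares two strings on base letters only.
    // PRIMARY strength ignores accent (secondary) and case (tertiary) differences.
    static boolean equalsIgnoringAccents(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ROOT); // assumption: locale-neutral rules
        collator.setStrength(Collator.PRIMARY);
        // Decompose accented characters before comparing.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String latexFree = "M\u00fchlbach"; // "Muhlbach" with u-umlaut, as the LaTeX-free field would hold it

        // Plain containment compares characters exactly, so the unaccented query is not found.
        System.out.println(latexFree.contains("Muhlbach"));

        // The collator compares whole strings and disregards the accent.
        System.out.println(equalsIgnoringAccents(latexFree, "Muhlbach"));
    }
}
```

Note that `Collator` compares whole strings. It has no `contains` counterpart, so a substring search built on it would need extra work, for example comparing word by word.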

LingZhang22 · 2022-04-04T15:37:38Z

Thank you so much for your detailed instructions. It really helps a lot!

The code uses String's contains method, and I could not find an equivalent method in the Collator API, so I searched online and found Normalizer. With it, the search seems to perform as expected, but I am not sure whether it has any underlying issues. I have created a PR for that. Please let me know if this is not a suitable API to use, and I will fix it. Thanks again!
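
For readers who want to see the idea in isolation, the following is a minimal sketch of an accent-insensitive containment check built on `java.text.Normalizer`. It is not the code from the PR. The class name `AccentInsensitiveContains` and both helper methods are hypothetical, and the names from the original report are used as sample data (written with Unicode escapes for the accented letters).

```java
import java.text.Normalizer;

class AccentInsensitiveContains {

    // Hypothetical helper: decompose to NFD, then drop all combining marks,
    // so that an accented letter is reduced to its base letter.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Hypothetical helper: substring check on the accent-stripped forms of both strings.
    static boolean containsIgnoringAccents(String fieldContent, String word) {
        return stripAccents(fieldContent).contains(stripAccents(word));
    }

    public static void main(String[] args) {
        // LaTeX-free author fields as they would look after conversion to Unicode.
        String muehlbach = "M\u00fchlbach, G.";
        String njastad = "Nj\u00e5stad, O.";

        System.out.println(containsIgnoringAccents(muehlbach, "Muhlbach"));
        System.out.println(containsIgnoringAccents(njastad, "Njastad"));
        // "aa" is a transliteration, not an accent variant, so this spelling is not matched.
        System.out.println(containsIgnoringAccents(njastad, "Njaastad"));
    }
}
```

One limitation of this approach: letters without a canonical decomposition, such as "ø", are left unchanged by NFD, so they still would not match their unaccented look-alikes.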

calixtus · 2022-04-09T10:18:53Z

Should be fixed by #8640 thanks to @LingZhang22. @sbecuwe Could you please test the current dev version (after backing up your files) to see whether this issue persists? Thanks!

Siedlerchr added the search label Aug 31, 2020

Siedlerchr mentioned this issue Nov 8, 2020: Fixes exception in preview using regexp search and regexp search without specified field #7073 (merged)

Siedlerchr added the status: waiting-for-feedback label Nov 22, 2020

koppor removed the status: waiting-for-feedback label Aug 16, 2021

koppor added the project: GSoC label Mar 26, 2022

LingZhang22 mentioned this issue Apr 4, 2022: Fixed the inconsistent behavior for accents in search #8640 (merged)

calixtus closed this as completed Apr 9, 2022
