Fixes miss-parsed names in `AutomaticPersonsGroup` #7228

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-22T20:26:44Z

This fixes the parsing and matching of authors' last names. It is much slower than the previous solution, and uses more RAM, but I think that is hard/impossible to avoid. However, it is possible that the caching solution can be made smarter, but I leave the decision if it should, to the reviewers 😁

An alternative solution is moving the caching to AuthorList, it will make it easier to modify/improve at a later point in time. ~~It will make this PR take quite a bit of more time as I'd like to make AuthorList extend List in that case. See #6552 (comment) for a much longer discussion on that topic.~~
Given that the only thing that should be cached are the last names (and always latex-free), perhaps it has more in common with the autocomplete functionality. I have not looked into that.

Change in CHANGELOG.md described (if applicable)
Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
~~Screenshots added in PR description (for UI changes)~~
Checked documentation: Is the information available and up to date? If not created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

tobiasdiez

So, what was the problem in the end? It looks like the code is more or less equivalent to the one before, except that now the author parsing and latex2unicode conversion are in a different order (which does make sense!). Is this the only change?

Concerning the caching, I think this is not needed. The parsing of a string to an authorlist is already cached, and getting the last names with latex converted is also cached:

jabref/src/main/java/org/jabref/model/entry/AuthorList.java

Line 377 in afdb194

public String getAsLastNamesLatexFree(boolean oxfordComma) {

However, this code probably needs to be changed a bit to be usable for our purposes. I would suggest to add a getLastNameLatexFree in Author, which caches it's result. This method then can be used in the groups parsing, as well as in AuthorList.getAsLastNamesLatexFree (there is no need to have an additional cache there again).

src/test/java/org/jabref/model/groups/AutomaticPersonsGroupTest.java

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-30T17:04:29Z

So, what was the problem in the end? It looks like the code is more or less equivalent to the one before, except that now the author parsing and latex2unicode conversion are in a different order (which does make sense!). Is this the only change?

The other change is that the previous solution relied on the existing WordKeywordGroup SearchStrategies which did not parse authors. Hence, "Turing, Allan" ended up attempting to match the group "Turing" with the last name "Turing," (note the ,). There might be other differences as well but my "contribution" is to use the same code that parses the names for the groups to check if an author matches.
Note that among the test cases it is only createSubgroupsContainingCommaSeparatedLastNames that fails without these changes.

Now that I am looking at it again, I'll see if I can extend KeywordGroup instead. It should make the changes more readable.

Concerning the caching, I think this is not needed. The parsing of a string to an authorlist is already cached, and getting the last names with latex converted is also cached:

jabref/src/main/java/org/jabref/model/entry/AuthorList.java

Line 377 in afdb194

public String getAsLastNamesLatexFree(boolean oxfordComma) {

However, this code probably needs to be changed a bit to be usable for our purposes. I would suggest to add a getLastNameLatexFree in Author, which caches it's result. This method then can be used in the groups parsing, as well as in AuthorList.getAsLastNamesLatexFree (there is no need to have an additional cache there again).

In some cases the getAsLastNamesLatexFree contain "and" between the two last authors. Given that a latex-free string is allowed to contain "and" for other reasons (e.g., {Barnes and Noble}) I'd rather avoid using a plain String.
I don't have an opinion regarding adding an Author#getLastNameLatexFree method. Every time I look at this issue I end up thinking "this should be fixed in BibEntry#getResolvedFieldOrAliasLatexFree" in which case the AUTHOR_CACHE would cache latexfree AuthorLists, but in my opinion, it is a bit of a nightmare to implement.
Should I add the Author#getLastNameLatexFree method? At the very least it would allow

.map(Author::getLastNameLatexFree)

and removing the caching for the AuthorList method.

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-30T17:09:09Z

I just feel that if the caching for AuthorList.getAsLastNamesLatexFree is removed, then all the AuthorList.getAs...LatexFree should be removed. I'd guess the stream...Collectors.joining is fast enough for that.

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-30T20:17:00Z

Just a note... It seems that none of the groups use BibDatabase which means they all use entry.getField instead of entry.getResolvedFieldOrAlias.
I'd argue it would be out of scope for this issue, but it could lead to problems (perhaps it is unlikely to be used for the author/editor fields).

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-30T22:05:34Z

I just realized that AuthorList.getAsLastNamesLatexFree uses getLastOnly, not getLast. So I can't reuse the last name cache easily in AuthorList. I think that for the AutomaticPersonGroup it makes sense that the subgroup is named Neumann and not von Neumann.

src/main/java/org/jabref/model/groups/LastNameGroup.java

src/test/java/org/jabref/model/groups/AutomaticPersonsGroupTest.java

tobiasdiez · 2020-12-31T13:03:35Z

I just feel that if the caching for AuthorList.getAsLastNamesLatexFree is removed, then all the AuthorList.getAs...LatexFree should be removed. I'd guess the stream...Collectors.joining is fast enough for that.

I completely agree! The Author class should be responsible to calculate the latex-free version of names etc, and cache its result. Maybe, the most readable version is a method Author latexfree() which returns the complete latex2unicode converted representation of the author, so you can write author.latexfree().getLast(). That's something of value even beyond this ticket!
Would also make it pretty easy to cache in a flexible way by having only Author latexfree as a cached variable. It's also pretty extensible you get automatic caching for methods like getLastOnly.

Every time I look at this issue I end up thinking "this should be fixed in BibEntry#getResolvedFieldOrAliasLatexFree"

That would be indeed a good solution, but is sadly not possible. {Barnes and Noble} and Author is sadly different to Barnes and Noble and Author. The author specification of bibtex is just a mess, and relies on so many assumptions. The best would be if people would only rely on the explicit name format of biber/biblatex: AUTHOR = {given=Arnar, family=Vigfusson}, but that's a very long way to go.

k3KAW8Pnf7mkmdSMPHz27 · 2020-12-31T14:01:56Z

I just feel that if the caching for AuthorList.getAsLastNamesLatexFree is removed, then all the AuthorList.getAs...LatexFree should be removed. I'd guess the stream...Collectors.joining is fast enough for that.

I completely agree! The Author class should be responsible to calculate the latex-free version of names etc, and cache its result. Maybe, the most readable version is a method Author latexfree() which returns the complete latex2unicode converted representation of the author, so you can write author.latexfree().getLast(). That's something of value even beyond this ticket!
Would also make it pretty easy to cache in a flexible way by having only Author latexfree as a cached variable. It's also pretty extensible you get automatic caching for methods like getLastOnly.

Perhaps this is the good way out of this... I just have this itch to kill the Author class and make it a Map, preferably an EnumMap (thought that might not be possible)... I might be tunnel-visioning on that 😁
I'd argue it is out of scope of this issue. Rewriting the AuthorList#getAs... would be (one) of the benefits but it would be quite time consuming. Additionally, if someone wants to track down possible performance bottlenecks, it would be better in its own PR. You think opening up a new PR with Author.latexFree (Author.resolveLatex?) is the way to go (rather than continuing it here)?

Every time I look at this issue I end up thinking "this should be fixed in BibEntry#getResolvedFieldOrAliasLatexFree"

That would be indeed a good solution, but is sadly not possible. {Barnes and Noble} and Author is sadly different to Barnes and Noble and Author. The author specification of bibtex is just a mess, and relies on so many assumptions. The best would be if people would only rely on the explicit name format of biber/biblatex: AUTHOR = {given=Arnar, family=Vigfusson}, but that's a very long way to go.

In my opinion, that solution is possible. However, it would require changes to the author parser so that delimiters can be reconstructed. The surrounding braces are removed by Author, internal braces might be a bigger challenge. It would also tie in to the discussion about the mess outside the bibtex/biber/biblatex definitions of authors, if #2205 is still an issue (I have been eyeing it but not found the time to look at the details).

And thank you for multiple code review on new year as well! 🎆

tobiasdiez

The current changes are fine with me. The author caching of latexfree things can be done in a follow-up PR.

src/main/java/org/jabref/model/groups/LastNameGroup.java

k3KAW8Pnf7mkmdSMPHz27 · 2021-01-12T22:31:52Z

k3KAW8Pnf7mkmdSMPHz27 dismissed tobiasdiez’s stale review via edd24bd

Oups. Sorry, I did not know that was a thing. I just wanted to play around with what went wrong with the equality and hashCode (barring my thinking 🤡 ), since it should not hurt to have more "standard" equality and hashCode methods.

tobiasdiez

I've also only recently noticed this. I guess it's coming from the recent changes that PRs need to be reviewed before they are merged (that was always a standing requirement, but now github is configured for it is well). @koppor

koppor · 2021-01-18T19:45:41Z

k3KAW8Pnf7mkmdSMPHz27 dismissed tobiasdiez’s stale review via edd24bd

This is some automatic thing. Any change will dismiss a review. We want to be sure that approvals are "done right". The final approver can check whether the changes from the dismissed approval to the current one are OK.

Thus, no worries! 🎉 - And thank you for continuing on working on this.

…dtask * upstream/master: Upgrade citeproc to 3.x snapshot without graal (#7370) Fix Exception if no AzureInstrumentationKey is available (#7373) Update snapcraft source url (#7372) Fix checkstyle and adjust language GitBook: [master] 3 pages and 32 assets modified Add migration to special field (#7368) GitBook: [master] 5 pages and 29 assets modified Modify message at the duplicates found dialog (#7231) Fixes miss-parsed names in `AutomaticPersonsGroup` (#7228) Fix an issue where the password for a shared SQL database was only remembered if it was the same as the username (#7364) Fix harvard exporter by changing AuthorsFormatter (#7355) Bump styfle/cancel-workflow-action from 0.6.0 to 0.7.0 (#7363) Bump mockito-core from 3.7.0 to 3.7.7 (#7360) Bump org.beryx.jlink from 2.23.1 to 2.23.2 (#7361) Bump libreoffice from 7.0.3 to 7.0.4 (#7362) Export full month name instead of number in ms office (#7358)

* upstream/master: (217 commits) Fix handling of URL in file field (JabRef#7347) Fix expansion of bracketed expressions in RegExpBasedFileFinder (JabRef#7338) Refactor unlinked files (JabRef#7209) Add pressing enter when the search field is focused as a way to trigger search (JabRef#7377) Upgrade citeproc to 3.x snapshot without graal (JabRef#7370) Fix Exception if no AzureInstrumentationKey is available (JabRef#7373) Update snapcraft source url (JabRef#7372) Fix checkstyle and adjust language GitBook: [master] 3 pages and 32 assets modified Add migration to special field (JabRef#7368) GitBook: [master] 5 pages and 29 assets modified Modify message at the duplicates found dialog (JabRef#7231) Fixes miss-parsed names in `AutomaticPersonsGroup` (JabRef#7228) Fix an issue where the password for a shared SQL database was only remembered if it was the same as the username (JabRef#7364) Fix harvard exporter by changing AuthorsFormatter (JabRef#7355) Bump styfle/cancel-workflow-action from 0.6.0 to 0.7.0 (JabRef#7363) Bump mockito-core from 3.7.0 to 3.7.7 (JabRef#7360) Bump org.beryx.jlink from 2.23.1 to 2.23.2 (JabRef#7361) Bump libreoffice from 7.0.3 to 7.0.4 (JabRef#7362) Export full month name instead of number in ms office (JabRef#7358) ... # Conflicts: # external-libraries.md # src/main/java/module-info.java # src/main/java/org/jabref/gui/shared/SharedDatabaseLoginDialogViewModel.java

k3KAW8Pnf7mkmdSMPHz27 added 5 commits December 22, 2020 15:02

Add SearchStrategy injection

e82f01e

Fix caching, parsing and matching last names

0117f9d

Add test cases

cc187ee

Update CHANGELOG.md

f87503a

Merge branch 'master' into fix-for-issue-5833

7d749e6

k3KAW8Pnf7mkmdSMPHz27 marked this pull request as ready for review December 29, 2020 17:31

k3KAW8Pnf7mkmdSMPHz27 changed the title ~~[WIP] Fixes miss-parsed names in AutomaticPersonsGroup~~ Fixes miss-parsed names in AutomaticPersonsGroup Dec 29, 2020

tobiasdiez added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Dec 29, 2020

tobiasdiez reviewed Dec 30, 2020

View reviewed changes

src/test/java/org/jabref/model/groups/AutomaticPersonsGroupTest.java Outdated Show resolved Hide resolved

src/test/java/org/jabref/model/groups/AutomaticPersonsGroupTest.java Outdated Show resolved Hide resolved

k3KAW8Pnf7mkmdSMPHz27 added 6 commits December 30, 2020 14:02

Add LastNameGroup as a standalone group

de08b35

Fix test cases

b2a9fe9

Fix checkstyle

7566fa4

Add caching of last name

8a77d4e

Fix usage of latex free last names

5a90724

Fix mistake in latex free caching

8e185f9

k3KAW8Pnf7mkmdSMPHz27 added 3 commits December 30, 2020 17:26

Fix tests

0d7f5cd

Fix LastNameGroup for last name containing latex

d134521

Readability and JavaDoc

2ccb3ae

tobiasdiez reviewed Dec 31, 2020

View reviewed changes

src/main/java/org/jabref/model/groups/LastNameGroup.java Outdated Show resolved Hide resolved

tobiasdiez reviewed Dec 31, 2020

View reviewed changes

src/test/java/org/jabref/model/groups/AutomaticPersonsGroupTest.java Outdated Show resolved Hide resolved

tobiasdiez reviewed Dec 31, 2020

View reviewed changes

src/test/java/org/jabref/model/groups/AutomaticPersonsGroupTest.java Outdated Show resolved Hide resolved

k3KAW8Pnf7mkmdSMPHz27 added 2 commits December 31, 2020 10:01

Remove unused BibEntry

fbe3325

Flatten Stream

56cca39

tobiasdiez reviewed Jan 1, 2021

View reviewed changes

k3KAW8Pnf7mkmdSMPHz27 added 2 commits January 3, 2021 09:40

Fix test cases

6315c61

Remove duplicated test case

3028237

tobiasdiez previously approved these changes Jan 4, 2021

View reviewed changes

Siedlerchr reviewed Jan 4, 2021

View reviewed changes

src/main/java/org/jabref/model/groups/LastNameGroup.java Outdated Show resolved Hide resolved

k3KAW8Pnf7mkmdSMPHz27 added 4 commits January 4, 2021 08:25

Fix use of Objects.equals

990fb34

Merge branch 'master' into fix-for-issue-5833

88dea22

Fix equality check

c43da99

Optimized imports

450ba10

k3KAW8Pnf7mkmdSMPHz27 mentioned this pull request Jan 5, 2021

Suggestions for changes in caching latex free authors #7301

Merged

23 tasks

Add intellij generated equal and hashCode

edd24bd

k3KAW8Pnf7mkmdSMPHz27 dismissed tobiasdiez’s stale review via edd24bd January 12, 2021 22:28

tobiasdiez previously approved these changes Jan 13, 2021

View reviewed changes

Merge branch 'master' into fix-for-issue-5833

ba3ac43

koppor dismissed tobiasdiez’s stale review via ba3ac43 January 18, 2021 19:46

Fixed merge error

805187c

koppor merged commit 31ced14 into JabRef:master Jan 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes miss-parsed names in `AutomaticPersonsGroup` #7228

Fixes miss-parsed names in `AutomaticPersonsGroup` #7228

k3KAW8Pnf7mkmdSMPHz27 commented Dec 22, 2020 •

edited

Loading

tobiasdiez left a comment

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020 •

edited

Loading

tobiasdiez commented Dec 31, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 31, 2020 •

edited

Loading

tobiasdiez left a comment

k3KAW8Pnf7mkmdSMPHz27 commented Jan 12, 2021

tobiasdiez left a comment

koppor commented Jan 18, 2021

Fixes miss-parsed names in AutomaticPersonsGroup #7228

Fixes miss-parsed names in AutomaticPersonsGroup #7228

Conversation

k3KAW8Pnf7mkmdSMPHz27 commented Dec 22, 2020 • edited Loading

tobiasdiez left a comment

Choose a reason for hiding this comment

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020 • edited Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020 • edited Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020 • edited Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020 • edited Loading

tobiasdiez commented Dec 31, 2020 • edited Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 31, 2020 • edited Loading

tobiasdiez left a comment

Choose a reason for hiding this comment

k3KAW8Pnf7mkmdSMPHz27 commented Jan 12, 2021

tobiasdiez left a comment

Choose a reason for hiding this comment

koppor commented Jan 18, 2021

Fixes miss-parsed names in `AutomaticPersonsGroup` #7228

Fixes miss-parsed names in `AutomaticPersonsGroup` #7228

k3KAW8Pnf7mkmdSMPHz27 commented Dec 22, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 30, 2020 •

edited

Loading

tobiasdiez commented Dec 31, 2020 •

edited

Loading

k3KAW8Pnf7mkmdSMPHz27 commented Dec 31, 2020 •

edited

Loading