Store Citation Relations in LRU Cache #10980

Merged
merged 5 commits into from
Mar 5, 2024

cardionaut
Contributor

Citations and References under the "Citation Relations" Tab are now saved in an LRUMap instead of a regular HashMap.
Their size is now limited to a (somewhat arbitrarily chosen) 100 entries.
I chose to initialize them at max size to avoid resizing, although I am not sure whether this works as intended. Each Map stores Lists which themselves grow and potentially lead to reallocation of memory.
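A minimal sketch of the size-limited, least-recently-used behaviour described above. It uses the JDK's `LinkedHashMap` as a stand-in for `LRUMap`, so the class name `LruCacheSketch` and the `String` element type are assumptions made for illustration, not the code in this PR. Note that pre-sizing only pre-allocates the map's own table; the lists stored as values are still allocated and grown independently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for LRUMap, built on java.util.LinkedHashMap.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCacheSketch(int maxEntries) {
        // Size the table so maxEntries fit without rehashing at load factor 0.75;
        // accessOrder = true keeps entries ordered by recency of access.
        super((int) Math.ceil(maxEntries / 0.75) + 1, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked after each insertion; returning true evicts the least recently used entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, List<String>> cache = new LruCacheSketch<>(2);
        cache.put("10.1000/a", List.of("ref-1"));
        cache.put("10.1000/b", List.of("ref-2"));
        cache.get("10.1000/a");                   // touch "a", so "b" becomes the eldest
        cache.put("10.1000/c", List.of("ref-3")); // exceeds the limit and evicts "b"
        System.out.println(cache.keySet());       // prints [10.1000/a, 10.1000/c]
    }
}
```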

Mandatory checks

  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

Member

@koppor koppor left a comment

I think we lose data.

I would like to see an MVStore be used. Doesn't an MVStore support some kind of caching?

The MVStore is a bridge to the file system.

JabRef can create an MVStore in a way similar to org.jabref.logic.util.io.BackupFileUtil#getPathForNewBackupFileAndCreateDirectory

(backupDir is org.jabref.gui.desktop.os.NativeDesktop#getBackupDirectory)

private static final Map<String, List<BibEntry>> REFERENCES_MAP = new HashMap<>();
private static final Integer MAX_CACHED_ENTRIES = 100;
private static final Map<String, List<BibEntry>> CITATIONS_MAP = new LRUMap<>(MAX_CACHED_ENTRIES, MAX_CACHED_ENTRIES);
private static final Map<String, List<BibEntry>> REFERENCES_MAP = new LRUMap<>(MAX_CACHED_ENTRIES, MAX_CACHED_ENTRIES);

public List<BibEntry> getCitations(BibEntry entry) {
return CITATIONS_MAP.getOrDefault(entry.getDOI().map(DOI::getDOI).orElse(""), Collections.emptyList());
Member

What happens for entries which are not likely to be viewed, but are requested? I think the hash map then returns an empty list. Thus, we lose information.

Shouldn't the references be recalculated then (instead of returning Collections.emptyList())? But how? (I am not that deep into the code)

Contributor Author

In case the entry is requested but no longer in cache, the references and citations are recalculated.
This behaviour is the same as before, the only change is that the number of references and citations stored is now limited.
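The refetch-on-miss behaviour described here can be sketched as follows. All names (`RefetchOnMissSketch`, `fetcher`, `evict`) are hypothetical and are not taken from JabRef; the sketch only shows that a lookup for an evicted key triggers a new fetch instead of returning an empty result.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the JabRef code base.
class RefetchOnMissSketch {
    private final Map<String, List<String>> cache = new HashMap<>();
    private final Function<String, List<String>> fetcher;
    int fetchCount = 0; // exposed only so the example can show when a fetch happens

    RefetchOnMissSketch(Function<String, List<String>> fetcher) {
        this.fetcher = fetcher;
    }

    List<String> getCitations(String doi) {
        List<String> cached = cache.get(doi);
        if (cached == null) {
            // Cache miss (never fetched, or evicted): ask the remote service again.
            fetchCount++;
            cached = fetcher.apply(doi);
            cache.put(doi, cached);
        }
        return cached;
    }

    void evict(String doi) {
        cache.remove(doi);
    }

    public static void main(String[] args) {
        RefetchOnMissSketch repo = new RefetchOnMissSketch(doi -> List.of("citing-" + doi));
        repo.getCitations("10.1000/x"); // miss: fetches
        repo.getCitations("10.1000/x"); // hit: served from the cache
        repo.evict("10.1000/x");        // simulate an LRU eviction
        repo.getCitations("10.1000/x"); // miss again: fetches a second time
        System.out.println(repo.fetchCount); // prints 2
    }
}
```

Under this assumption, every eviction costs one additional remote request the next time the entry is opened.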

Contributor Author

If you still think we need to use an MVStore, then I'd need some more info on how and why.
At the moment I'm having trouble seeing the problem and the benefits of MVStore.

Member

In case the entry is requested but no longer in cache, the references and citations are recalculated.

I needed to go into the code and understand it for myself. It would have been nice to guide me to org.jabref.gui.entryeditor.citationrelationtab.BibEntryRelationsRepository#needToRefreshCitations.

I implemented a test case showing that it works: #10983

Member

If you still think we need to use an MVStore, then I'd need some more info on how and why. At the moment I'm having trouble seeing the problem and the benefits of MVStore.

Do you know about "rate limits" and blocking users for too many requests?

Quoting https://libguides.ucalgary.ca/c.php?g=732144&p=5260798

The API allows up to 100 requests per 5 minutes. To access a higher rate limit, complete the form to request authentication for your project.

That means that, for a large library, I cannot step through the references, because I could hit the rate limit. I know this is rare, but it could happen in the following setting: in a corporate environment, all requests go through a proxy. Thus, the rate limit applies not per person, but to the SUM of all persons. If 100 researchers work in parallel with JabRef, each researcher gets ONE request per 5 minutes. And TWO requests are needed per entry: one for the citing works and one for the cited-by works.
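The arithmetic of this scenario can be checked with a few lines. The class and method names are hypothetical; the numbers (100 requests per 5 minutes, 100 users behind one proxy, 2 requests per entry) come from the comment above, and an equal split of the shared limit is assumed.

```java
// Hypothetical helper for the rate-limit arithmetic; not part of JabRef.
class RateLimitBudgetSketch {
    // Entries each user can look up per rate-limit window when the limit is shared equally.
    static double entriesPerUserPerWindow(int requestsPerWindow, int users, int requestsPerEntry) {
        return (double) requestsPerWindow / users / requestsPerEntry;
    }

    public static void main(String[] args) {
        // 100 requests per 5-minute window, 100 researchers, 2 requests per entry.
        System.out.println(entriesPerUserPerWindow(100, 100, 2)); // prints 0.5
    }
}
```

A budget of 0.5 entries per 5-minute window means that each researcher could open the citation relations of one entry roughly every 10 minutes.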

Let's investigate MVStore. MVStore is a library that stores the values of a hash map on disk, NOT in memory. Thus, it takes less memory than a full in-memory hash map. MVStore routes each request for a map entry through to disk. See https://www.h2database.com/html/mvstore.html for details.


We can merge as is, but we should work on MVStore fast. Otherwise, companies with a corporate proxy (and there are many companies using one) will not be able to use that feature of JabRef any more.

Member

I tried out the feature and I usually get HTTP 429 and cannot see any citations...

Contributor Author

Ok, I see your point now.
I was not aware that we had such a limited number of requests and that companies work with a single request pool.
Then it might be best to wait until this is done properly, as the LRU cache would only help a small number of users but could greatly limit company users' experience.
What do you think @koppor?

Member

@cardionaut Thank you for coming back on this.

Proposal:

  1. Merge as is. I think 100 entries is a good choice.
  2. Work on MVStore in a follow-up PR 😅. Estimate: 100 lines of code, but scattered around JabRef. NativeDesktop needs to be touched, etc. The most difficult part will be closing the MVStore. Since DOIs are globally unique, one can close the MVStore when JabRef is shut down. This makes it "easier" (in comparison to Add FileMonitor for LaTeX citations #10937, where some closing logic was necessary for each tab). Nevertheless, this could become a back-and-forth development (meaning: code reviews requesting significant changes could come up). I hope you can invest the time and energy in this, @cardionaut. That feature would really help to make the citation relations usable, because the information for each DOI is stored independently of each library and is presented as soon as it is available. Follow-up requirement: refresh the DOI information if one week has passed since the last fetch. Maybe this can be baked into the hash map designed for the MVStore. Implementation hint: do NOT do it like org.jabref.logic.journals.JournalAbbreviationRepository, because there is no direct access to the MVStore there; new hash maps are created instead.
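One way the follow-up requirement (refresh after one week) could be modelled is to store the fetch time next to the cached relations. This is only a sketch under that assumption: `CachedRelationsSketch`, `MAX_AGE`, and `isStale` are hypothetical names, and nothing here uses the MVStore API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical value type: cached relations plus the time they were fetched.
class CachedRelationsSketch {
    static final Duration MAX_AGE = Duration.ofDays(7);

    final List<String> relatedDois;
    final Instant fetchedAt;

    CachedRelationsSketch(List<String> relatedDois, Instant fetchedAt) {
        this.relatedDois = relatedDois;
        this.fetchedAt = fetchedAt;
    }

    // True if more than MAX_AGE has passed since the last fetch.
    boolean isStale(Instant now) {
        return Duration.between(fetchedAt, now).compareTo(MAX_AGE) > 0;
    }

    public static void main(String[] args) {
        Instant fetched = Instant.parse("2024-03-05T00:00:00Z");
        CachedRelationsSketch cached = new CachedRelationsSketch(List.of("10.1000/a"), fetched);
        System.out.println(cached.isStale(fetched.plus(Duration.ofDays(6)))); // prints false
        System.out.println(cached.isStale(fetched.plus(Duration.ofDays(8)))); // prints true
    }
}
```

A lookup would then refetch only when the entry is missing or `isStale` returns true, which keeps the number of remote requests low.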

Contributor Author

@cardionaut cardionaut Mar 5, 2024
The reason will be displayed to describe this comment to others. Learn more.

@koppor Great, sounds like a plan.
I cannot make any promises about how quickly I can get this done, but I will try my best.
I work full-time and am still quite new to Java.
I'll set up a draft PR as soon as I have made some progress.

@koppor koppor mentioned this pull request Mar 4, 2024
koppor previously approved these changes Mar 4, 2024
@koppor koppor added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Mar 5, 2024
@calixtus calixtus added this pull request to the merge queue Mar 5, 2024
Merged via the queue into JabRef:main with commit f5efb34 Mar 5, 2024
20 checks passed
@cardionaut cardionaut deleted the fix-for-issue-10958 branch March 7, 2024 12:18
@cardionaut
Contributor Author

I am afraid someone else has to take over this issue, as I am currently unable to find the time and this is unlikely to change soon.
