Store Citation Relations in LRU Cache #10980
I think we lose data.
I would like to see an MVStore be used. Doesn't an MVStore support some kind of caching?
The MVStore is a bridge to the file system.
JabRef can create an MVStore file in a way similar to org.jabref.logic.util.io.BackupFileUtil#getPathForNewBackupFileAndCreateDirectory (backupDir is org.jabref.gui.desktop.os.NativeDesktop#getBackupDirectory).
private static final Map<String, List<BibEntry>> REFERENCES_MAP = new HashMap<>();
private static final Integer MAX_CACHED_ENTRIES = 100;
private static final Map<String, List<BibEntry>> CITATIONS_MAP = new LRUMap<>(MAX_CACHED_ENTRIES, MAX_CACHED_ENTRIES);
private static final Map<String, List<BibEntry>> REFERENCES_MAP = new LRUMap<>(MAX_CACHED_ENTRIES, MAX_CACHED_ENTRIES);

public List<BibEntry> getCitations(BibEntry entry) {
return CITATIONS_MAP.getOrDefault(entry.getDOI().map(DOI::getDOI).orElse(""), Collections.emptyList());
What happens for entries which are not likely to be viewed, but are requested? I think the hash map then returns an empty list. Thus, we lose information.
Shouldn't the references be recalculated then (instead of returning Collections.emptyList())? But how? (I am not that deep into the code.)
In case the entry is requested but no longer in cache, the references and citations are recalculated.
This behaviour is the same as before, the only change is that the number of references and citations stored is now limited.
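The recalculate-on-miss behaviour described here can be sketched as follows. This is a minimal illustration, not JabRef's code: it uses the JDK's LinkedHashMap in access order instead of the LRUMap from the diff, and the class name LruRelationCache, the String-based DOI keys, and the fetcher function are all hypothetical stand-ins.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a bounded cache that refetches evicted entries. */
class LruRelationCache {
    private final int maxEntries;
    private final Function<String, List<String>> fetcher; // assumed remote lookup
    private final Map<String, List<String>> cache;
    int fetchCount = 0; // counts how often the fetcher had to be called

    LruRelationCache(int maxEntries, Function<String, List<String>> fetcher) {
        this.maxEntries = maxEntries;
        this.fetcher = fetcher;
        // accessOrder = true: the eldest entry is the least recently used one
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, List<String>> eldest) {
                return size() > LruRelationCache.this.maxEntries;
            }
        };
    }

    /** Returns cached relations, or recalculates them if the entry was evicted. */
    List<String> getCitations(String doi) {
        List<String> cached = cache.get(doi);
        if (cached == null) {
            fetchCount++;
            cached = fetcher.apply(doi);
            cache.put(doi, cached);
        }
        return cached;
    }

    boolean isCached(String doi) {
        return cache.containsKey(doi); // does not change the access order
    }

    public static void main(String[] args) {
        LruRelationCache cache = new LruRelationCache(2, doi -> List.of("cites:" + doi));
        cache.getCitations("a");
        cache.getCitations("b");
        cache.getCitations("a"); // hit: "a" becomes most recently used
        cache.getCitations("c"); // evicts "b", the least recently used entry
        System.out.println(cache.isCached("b")); // false
        System.out.println(cache.getCitations("b")); // refetched, not an empty list
    }
}
```

The point of the sketch is that an eviction only costs an extra fetch; the caller never sees an empty list for an entry that merely fell out of the cache.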
If you still think we need to use an MVStore, then I'd need some more info on how and why.
At the moment I'm having trouble seeing the problem and the benefits of MVStore.
> In case the entry is requested but no longer in cache, the references and citations are recalculated.
I needed to go into the code and understand it for myself. It would have been nice to guide me to org.jabref.gui.entryeditor.citationrelationtab.BibEntryRelationsRepository#needToRefreshCitations.
I implemented a test case showing that it works: #10983
> If you still think we need to use an MVStore, then I'd need some more info on how and why. At the moment I'm having trouble seeing the problem and the benefits of MVStore.
Do you know about "rate limits" and blocking users for too many requests?
Quoting https://libguides.ucalgary.ca/c.php?g=732144&p=5260798
> The API allows up to 100 requests per 5 minutes. To access a higher rate limit, complete the form to request authentication for your project.
That means that, for a large library, I cannot step through the references, because I could hit the rate limit. I know this is rare, but it could happen in the following setting: in a corporate environment, all requests go through a proxy. Thus, the rate limit applies not per person but to the SUM of all persons. If 100 researchers work in parallel with JabRef, each researcher gets ONE request per 5 minutes. And TWO requests are needed per entry: one for the citing works and one for the cited-by works.
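The arithmetic behind this scenario can be checked with a small sketch. The numbers (100 requests per 5 minutes, 100 users behind one proxy, 2 requests per entry) come from the comment above; the class and method names are hypothetical, and the even split between users is a simplifying assumption.

```java
/** Hypothetical helper for the shared rate-limit arithmetic. */
class RateLimitBudget {

    /** Entries one user can look up per window when the limit is shared evenly. */
    static double entriesPerUserPerWindow(int limitPerWindow, int users, int requestsPerEntry) {
        double requestsPerUser = (double) limitPerWindow / users;
        return requestsPerUser / requestsPerEntry;
    }

    /** Minutes a user has to wait, on average, per entry. */
    static double minutesPerEntry(int limitPerWindow, int windowMinutes, int users, int requestsPerEntry) {
        return windowMinutes / entriesPerUserPerWindow(limitPerWindow, users, requestsPerEntry);
    }

    public static void main(String[] args) {
        // 100 requests per 5 minutes, shared by 100 researchers, 2 requests per entry
        System.out.println(entriesPerUserPerWindow(100, 100, 2)); // 0.5
        System.out.println(minutesPerEntry(100, 5, 100, 2));      // 10.0
    }
}
```

Under these assumptions each researcher could view the relations of one entry roughly every 10 minutes, which is why results that are never persisted become a problem behind a shared proxy.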
Let's investigate MVStore. MVStore is a library that stores the values of a hash map on disk, thus NOT in memory. It therefore takes less memory than a full in-memory hash map. MVStore routes each request for a map entry through to disk. See https://www.h2database.com/html/mvstore.html for details.
We can merge as is, but we should work on MVStore fast. Otherwise, companies with a corporate proxy (and there are many companies using one) will not be able to use that feature of JabRef any more.
I tried out the feature and I usually get HTTP 429 and cannot see any citations...
Ok, I see your point now.
I was not aware that we had such a limited number of requests and that companies work with a singular request pool.
Then it might be best to wait until this is done properly, as the LRU cache would only help a small number of users but could greatly limit company users' experience.
What do you think @koppor?
@cardionaut Thank you for coming back on this.
Proposal:
- Merge as is. I think 100 entries is a good choice.
- Work on MVStore in a follow-up PR 😅. Estimate: 100 lines of code, but scattered around JabRef; NativeDesktop needs to be touched, etc. The most difficult part will be closing the MVStore. Since DOIs are globally unique, one can close the MVStore when JabRef is shut down. This makes it "easier" (in comparison to Add FileMonitor for LaTeX citations #10937, where some closing logic was necessary for each tab). Nevertheless, this could become back-and-forth development (meaning: code reviews requesting significant changes could come up). I hope you can invest the time and energy in this, @cardionaut. That feature would really help to make the citation relations usable, because the information for each DOI is stored independently of each library and is presented as soon as it is available. Follow-up requirement: refresh the DOI information if one week has passed since the last fetch. Maybe this can be baked into the HashMap designed for the MVStore. Implementation hint: do NOT do it like org.jabref.logic.journals.JournalAbbreviationRepository, because there is no direct access to the MVStore there; instead, new hash maps are created.
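The follow-up requirement (refresh after one week) could be expressed roughly like this. It is only a sketch of the staleness check, using plain JDK types; the class TimestampedRelations and its fields are hypothetical, and how such a value would be serialized into an MVStore map is left open.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Hypothetical cache value: the related DOIs plus the time they were fetched. */
final class TimestampedRelations {
    static final Duration MAX_AGE = Duration.ofDays(7); // "one week" from the proposal

    private final List<String> relatedDois;
    private final Instant fetchedAt;

    TimestampedRelations(List<String> relatedDois, Instant fetchedAt) {
        this.relatedDois = List.copyOf(relatedDois);
        this.fetchedAt = fetchedAt;
    }

    List<String> relatedDois() {
        return relatedDois;
    }

    /** True if more than MAX_AGE has passed since the last fetch. */
    boolean needsRefresh(Instant now) {
        return Duration.between(fetchedAt, now).compareTo(MAX_AGE) > 0;
    }

    public static void main(String[] args) {
        Instant fetched = Instant.parse("2024-03-01T00:00:00Z");
        TimestampedRelations entry = new TimestampedRelations(List.of("10.1000/example"), fetched);
        System.out.println(entry.needsRefresh(fetched.plus(Duration.ofDays(6)))); // false
        System.out.println(entry.needsRefresh(fetched.plus(Duration.ofDays(8)))); // true
    }
}
```

Passing the current time in as a parameter, rather than calling Instant.now() inside the check, keeps the logic easy to test.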
@koppor Great, sounds like a plan.
I cannot make any promises about how quickly I can get this done, but I will try my best.
I work full-time and am still quite new to Java.
I'll set up a draft PR as soon as I have made some progress.
I am afraid someone else has to take over this issue, as I am currently unable to find the time and this is unlikely to change soon.
Citations and References under the "Citation Relations" Tab are now saved in an LRUMap instead of a regular HashMap.
Their size is now limited to a (somewhat arbitrarily chosen) 100 entries.
I chose to initialize them at max size to avoid resizing, although I am not sure whether this works as intended. Each Map stores Lists which themselves grow and potentially lead to reallocation of memory.
Mandatory checks
- CHANGELOG.md described in a way that is understandable for the average user (if applicable)