-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement an interface to import PDF metadata from multiple sources (XMP, Grobid, ...) #7929
Conversation
Implemented an Importer that querries Grobid for metadata of a pdf. The necessary Grobid functionality (retrieving BibTeX for a pdf) is not yet available in Grobid, but we opened a PR that implements it (kermitt2/grobid#800).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I annotated some thoughts I had during implementation, any comments are highly welcome!
src/main/java/org/jabref/logic/importer/util/GrobidService.java
Outdated
Show resolved
Hide resolved
src/test/java/org/jabref/logic/importer/fileformat/GrobidPdfMetadataImporterTest.java
Outdated
Show resolved
Hide resolved
src/test/java/org/jabref/logic/importer/fileformat/GrobidPdfMetadataImporterTest.java
Outdated
Show resolved
Hide resolved
src/main/java/org/jabref/logic/importer/util/GrobidService.java
Outdated
Show resolved
Hide resolved
src/main/java/org/jabref/logic/importer/fetcher/GrobidCitationFetcher.java
Outdated
Show resolved
Hide resolved
have you looked into the Apache Tika class? Could this be an option? |
I did not until now. I don't really see a benefit. I think using a decent library for the pdf-transfer is enough. |
This comment has been minimized.
This comment has been minimized.
Merge dialog, I like the general idea, but it has to be usable with small screens. |
Maybe its not even necessary to display the textfields on the right, since the information displayed is redundant to the togglebutton selected. |
The idea is:
|
That looks nice! Good work. I agree with Christoph, that the merge dialog needs to work also on smaller screens. For this, the n-merge approach might be problematic and probably the user is overwhelmed anyway by the information of > 3 choices. It may be worth a consideration to merge some of the extracted metadata automatically (say based on some priority list) and then only present the user two choices. For example, merge grobid + firstpage automatically, and then have a user interface with only the options "grobid/firstpage" and "xmp". That brings me to the question what is the "firstpage" extractor? Is this our self-written importer that only barely works with a small selection of IEEE documents (if I remember correctly)? In this case, I would actually argue that we just remove the FirstPage extractor and replace it completely by grobid. Based on the merging "grobid + firstpage" I got another idea: what about automatically merging all extracted metadata (grobid + xmp) per default, with the choice in the preferences to activate the advanced merge dialog? In the end, it's a tradeoff of how quickly users can import PDFs vs the quality of the extracted metadata. And if grobid is good enough, then I would argue that it's justified to value speed over a small quality improvement. It would be a bit of a magical feeling if you drop a PDF into JabRef, and an entry is directly created with almost-perfect metadata without any further user interaction. But on the other hand, there are certain user circles that have good-quality xmp metadata stored in their PDFs, so for them it would be frustrating if this is automatically overwritten by the subpar grobid metadata. For those users, the merge dialog has a lot of value and they can activate it in the preferences. What do you think? Final small remark: For me "Grobid" is an implementation detail and most users don't know (and don't need to know) what this is. So I propose to replace it by some generic name in the UI, something like "Automatically extracted". |
Thanks :)
Sure. This is just a first draft, lots of details to work on. Once all that empty space is gone, the whole thing will be much smaller I hope.
Actually, there will be even more than three. Grobid, Xmp, Firstpage, embedded bibtex, DOI fetcher (if any other importer detects a DOI, I would use the fetchers to get information from the DOI and add another column).
Thats the plan for later on. I wanted to first implement the merge to get a feeling for how good Grobid works, how likely it is for XMP metadata / embedded bibtex to be present...
Yes, thats the one. AFAIK it works very well for the small set of data it works with, right? I did not check the implementation yet, will it fail if the pdf is not in IEEE format? If so, I would opt to use it if it does not fail.
See above. This sounds like an argument to make the priority list sortable.
True, but all others are 'automatically extracted' as well. Before starting this project, I didn't know about XMP either. Maybe we should just use generic names for the options (Source 1, Source 2) and show details on mouse-over? |
Regarding the custom first page importer, it also checks if it finds a DOI on the page and then simply calls the DOI fetcher with that doi |
In that case, I would disable that feature from the first-page importer and do DOI lookup last. So if any importer finds a DOI, use the DOI fetcher. |
Headers and the entry editor are now placed in VBox/HBox containers around the table that displays the options. Users can (if necessary) scroll in h and v directions.
I'm a bit worried that displaying all of this information in one screen will be overwhelming. Maybe condense it into "Extracted" (Grobid, Firstpage, DOI merged) vs "Embedded" (XMP, bibtex merged) ?
Yes! Don't make everything configurable, especially not at the beginning.
This was more an argument/idea to merge all information by default. In the end, the user doesn't care what kind of sources we used, he just wants to have a high-quality entry when he puts a pdf into JabRef. |
To be able to use the DiffHighlighter, but the TextFlow is ugly because it grows beyond it's boundaries if the text is too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some remarks were made in PR #8002 .
After fixing them, this should be ready too.
Otherweise lgtm.
src/main/java/org/jabref/gui/mergeentries/MultiMergeEntriesViewModel.java
Outdated
Show resolved
Hide resolved
Two green checkmarks. Codacy is complaining about 4-space tabs instead of 2, but in every other file its done like here. So merging now. 🎉 |
We want to be able to import PDFs into JabRef and infer all Bib-data from the file itself.
This is done by using multiple importers. If they disagree about the metadata, we need a way to merge the conflicting data. As of #7947, this is done by prioritization of importers.
For pro-users, we want to have the option to do the merge manually using an n-way merge dialog. It can be triggered by clicking a button next to a linked (offline, pdf) file.
The dialog looks like a table. Each source will be represented by a column, each field by a row. There will be an additional, editable, column that represents the final entry.
Users can:
Sources will be the existing PDF importers and the importers implemented in #7947.
In a second step, we want to use this functionality to clean-up bib entries. Users may select an existing bib-entry and 'enhance' it by analyzing linked pdf files. In that case, the original entry will be displayed by an additional source-column.
TODO:
CHANGELOG.md
described in a way that is understandable for the average user (if applicable)