Accept application/x-bibtex for processHeaderDocument #800

btut · 2021-07-19T09:16:46Z

I extended the API to get BibTeX for the processHeaderDocument service.
Previously, AFAIK, it was only possible for the /api/processReferences and /api/processCitation services.

Implemented an Importer that queries Grobid for metadata of a PDF. The necessary Grobid functionality (retrieving BibTeX for a PDF) is not yet available in Grobid, but we opened a PR that implements it (kermitt2/grobid#800).
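For readers who want to try the new content negotiation, here is a minimal client-side sketch of the request this PR enables. It is not code from the PR. The host and port are taken from the curl examples in this thread, the multipart field name `input` is an assumption based on Grobid's other PDF services, and `build_header_request` is a hypothetical helper name.

```python
import uuid
import urllib.request

# Host and port as used in the curl examples in this thread (assumption for your setup).
GROBID_URL = "http://localhost:8070/api/processHeaderDocument"


def build_header_request(pdf_bytes: bytes, filename: str = "paper.pdf") -> urllib.request.Request:
    """Build (but do not send) a multipart POST asking for BibTeX instead of TEI."""
    boundary = uuid.uuid4().hex
    # The form field name "input" is an assumption, not confirmed by this thread.
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="input"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("ascii") + pdf_bytes + f"\r\n--{boundary}--\r\n".encode("ascii")
    return urllib.request.Request(
        GROBID_URL,
        data=body,
        method="POST",
        headers={
            # The Accept header is what selects the BibTeX serialization.
            "Accept": "application/x-bibtex",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Sending it would look like `urllib.request.urlopen(build_header_request(open("paper.pdf", "rb").read()))`, assuming a Grobid server with this PR is running locally.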

koppor · 2021-07-21T19:44:02Z

This PR is part of #GSoC and is a follow-up to #532.

kermitt2 · 2021-07-22T15:09:42Z

Hello @btut !

Thanks for the PR. Did you look at what the BibTeX results for the header metadata look like?

I vaguely remember that the extracted header fields are not mapped to exactly the same fields as for a citation string in the BiblioItem object (because they are handled and normalized differently), so the BibTeX serialization might not work correctly for headers - of course anyway many of the interesting extracted fields will be lost in the very limited BibTeX format as compared to the XML version.

So given your feedback, we might need to extend/review the toBibTeX() method in BiblioItem class (which I haven't looked at in 7-8 years :) to get this PR working.

btut · 2021-07-23T06:02:35Z

Hi!

> Did you look at what the BibTeX results for the header metadata look like?

I checked a handful of PDFs and the results seem fine. The only necessary field that seems to be missing is the date. I'll check the toBibTeX method to see what else might be lost.

> of course anyway many of the interesting extracted fields will be lost in the very limited BibTeX format as compared to the XML version

Sure, but I think the goal for someone that needs BibTeX is to cite it. Aside from the date, the BibTeX entry only lacks author details like e-mail and affiliation. There is no need for these details when citing.

> we might need to extend/review the toBibTeX() method in BiblioItem class

I'll have a look!

btut · 2021-07-26T08:15:38Z

Hello again!
I fixed a small problem with the dates in toBibTeX. I noticed that usually the normalized_publication_date field is populated, not the publication_date, so I used that one and used ISO 8601 formatting (used by BibTeX).

Unfortunately, for many papers that I used for testing, Grobid did not detect a date at all. This was the case both for BibTeX and TEI.
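To illustrate the behaviour described here (ISO 8601 output from the normalized date, and no date field at all when nothing was detected), a small sketch follows. It is not the Java code from this PR; `iso_date` and `bibtex_fields` are hypothetical names, and the normalized date is assumed to arrive as optional year, month and day integers.

```python
from typing import Optional


def iso_date(year: Optional[int], month: Optional[int] = None,
             day: Optional[int] = None) -> Optional[str]:
    """Return the most precise ISO 8601 string the known parts allow, or None."""
    if year is None:
        return None  # nothing detected: the caller should skip the field
    parts = [f"{year:04d}"]
    if month is not None:
        parts.append(f"{month:02d}")
        # A day is only meaningful when the month is known as well.
        if day is not None:
            parts.append(f"{day:02d}")
    return "-".join(parts)


def bibtex_fields(title: str, year: Optional[int] = None,
                  month: Optional[int] = None, day: Optional[int] = None) -> dict:
    """Collect fields for an entry, leaving out the date when it is unknown."""
    fields = {"title": title}
    date = iso_date(year, month, day)
    if date is not None:  # omit the field rather than serializing a null
        fields["date"] = date
    return fields
```

With this shape, an entry for a paper without a detected date simply has no `date` key instead of a literal `null` value.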

btut · 2021-07-28T20:00:51Z

> I'll have a look!

I tested this extensively with many PDFs and it looks good! Could we move on with this PR? We would like to use this feature in JabRef.

grobid-service/src/main/java/org/grobid/service/process/GrobidRestProcessFiles.java

grobid-core/src/main/java/org/grobid/core/data/BiblioItem.java

grobid-core/src/main/java/org/grobid/core/data/Date.java

kermitt2 · 2021-07-31T22:29:47Z

Hello @btut !

I submitted a review for changes: 1) reuse the existing ISO date normalization (more complete - it has been changed/moved recently, so you might need to merge your PR branch with the current master) and 2) review the returned MIME type.

Apart from that, I've seen one bug: when the surname is incorrectly recognized as a forename, we have a null in the BibTeX:

curl -X POST -H "Accept: application/x-bibtex" -d "citations=Griff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113" localhost:8070/api/processCitation

@article{-1,
  author = {null, Griff},
  journal = {Expert. Opin. Ther. Targets},
  year = {2002},
  pages = {103--113},
  volume = {6},
  number = {1}
}

lopez@work:~/grobid_client_python$ curl -X POST -H "Accept: " -d "citations=Griff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113" localhost:8070/api/processCitation
<biblStruct >
	<analytic>
		<title/>
		<author>
			<persName><forename type="first">Griff</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert. Opin. Ther. Targets</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="103" to="113" />
			<date type="published" when="2002" />
		</imprint>
	</monogr>
</biblStruct>

We might want to output no author at all in the BibTeX in this case?

The rest looks very good indeed, everything is well mapped as expected!

btut · 2021-08-01T14:18:50Z

Thanks for your review, @kermitt2!

> Apart from that, I've seen one bug: when the surname is incorrectly recognized as a forename, we have a null in the BibTeX

Great catch!

> We might want to output no author at all in the BibTeX in this case?

I think just outputting the firstname is better as it preserves that information at least. I implemented that in 2a69fa4, let me know if you disagree!
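The fallback described above can be sketched as follows. This is an illustration only, not the change made in 2a69fa4; `format_author` and `format_author_field` are hypothetical names, and authors are assumed to be (last name, first name) pairs in which either part may be missing.

```python
from typing import Iterable, Optional, Tuple


def format_author(last: Optional[str], first: Optional[str]) -> Optional[str]:
    """Format one author as 'Last, First', falling back to whichever part exists."""
    last = (last or "").strip()
    first = (first or "").strip()
    if last and first:
        return f"{last}, {first}"
    # Only one part (or none) is known: never emit the string "null".
    return last or first or None


def format_author_field(authors: Iterable[Tuple[Optional[str], Optional[str]]]) -> str:
    """Join all usable authors with BibTeX's ' and ' separator."""
    names = (format_author(last, first) for last, first in authors)
    return " and ".join(name for name in names if name)
```

For the "Griff" citation from the review, this yields `author = {Griff}` rather than `author = {null, Griff}`, which keeps the extracted name while avoiding the bogus surname.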

kermitt2 · 2021-08-01T15:01:26Z

Thanks a lot @btut for the changes !

btut · 2021-08-01T15:49:02Z

Thanks for your great work! Grobid seems to work very well!

kermitt2 · 2021-08-01T21:53:53Z

I was too fast (as often): the change is actually not passing the tests because of the way dates are now output in the BibTeX.

if I understand well `date` in BibTeX is always ISO normalized?
For a more "raw" date, BibTeX uses `year`, for instance for this:

Kolb, S., Wirtz G.: Towards Application Portability in Platform as a Service Proceedings of the 8th IEEE International Symposium on Service-Oriented System Engineering (SOSE), Oxford, United Kingdom, April 7 - 10, 2014.

we don't have a normalized date, because of the day range, so we would have something like this in the BibTeX:

year = {April 7 - 10, 2014},

we could also have something like this:

year = {2014},
month = {April},
day = {7 - 10},

and apparently the support for these BibTeX flavors depends on the style?

kermitt2 · 2021-08-01T23:22:41Z

Is something like this acceptable, with both date and year to reflect normalized versus raw extraction:

curl -X POST -H "Accept: application/x-bibtex" -d "citations=Kolb, S., Wirtz G.: Towards Application Portability in Platform as a Service Proceedings of the 8th IEEE International Symposium on Service-Oriented System Engineering (SOSE), Oxford, United Kingdom, April 7 - 10, 2014." localhost:8070/api/processCitation

@inproceedings{-1,
  author = {Kolb, S and Wirtz, G},
  booktitle = {Towards Application Portability in Platform as a Service Proceedings of the 8th IEEE International Symposium on Service-Oriented System Engineering (SOSE)},
  date = {2014},
  year = {April 7 - 10, 2014},
  address = {Oxford, United Kingdom}
}

note: implemented in PR #807

btut · 2021-08-02T05:51:16Z

I added some thoughts in #807.

> Is something like this acceptable, with both date and year to reflect normalized versus raw extraction

I think this would be a good way to go as it gives all possible data and the user can decide on what to use. I would go the other way around though (split normalized_publication_date into year, month and day and place the non-normalized date in the date field), because year should be numeric.

koppor · 2021-08-02T23:02:34Z

TLDR:

* we could also have something like this:

year = {2014},
month = {April},
day = {7 - 10},

Please try to output the following:

year = {2014},
month = 4,
date = {2014-04-07/2014-04-10},

Long answer:

* if I understand well `date` in BibTeX is always ISO normalized?

This is done by the BibTeX variant "biblatex". See https://ftp.rrzn.uni-hannover.de/pub/mirror/tex-archive/macros/latex/contrib/biblatex/doc/biblatex.pdf

"iso8601-2 Extended Format specification level 1", which is a "yes" to your question.

Nevertheless, date ranges can be specified:

However, normal BibTeX does not use that, it only uses year and month. In all cases, the year field should be filled to be compatible with most .bib processors out there. - The date field is more an optional thing and could be left out.

> we don't have a normalized date, because of the day range, so we would have something like this in the BibTeX:
> year = {April 7 - 10, 2014},

This should not happen. :)

> and apparently the support for these BibTeX flavors depends on the style?

Yes, it does. However, year and month are (nearly) always supported.
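The output suggested in the TLDR can be sketched like this. It is an illustration, not what was implemented in #807: `date_fields` and `render` are hypothetical helpers, and it assumes the start (and optional end) of the date range has already been parsed into full calendar dates.

```python
from datetime import date
from typing import Optional


def date_fields(start: date, end: Optional[date] = None) -> dict:
    """Build year/month for plain BibTeX plus an ISO 8601 date (range) for biblatex."""
    fields = {
        "year": f"{{{start.year}}}",  # braced, always filled for compatibility
        "month": str(start.month),    # bare integer, as in the suggestion above
    }
    iso = start.isoformat()
    if end is not None and end != start:
        # A range is written as start/end.
        iso = f"{iso}/{end.isoformat()}"
    fields["date"] = f"{{{iso}}}"
    return fields


def render(fields: dict) -> str:
    """Render the fields as indented BibTeX lines."""
    return "\n".join(f"  {key} = {value}," for key, value in fields.items())
```

For the SOSE example, `render(date_fields(date(2014, 4, 7), date(2014, 4, 10)))` produces the three lines requested above; since the date field is optional, a caller could drop it when only the year is known.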

* GrobidPdfMetadataImporter implemented
  Implemented an Importer that queries Grobid for metadata of a PDF. The necessary Grobid functionality (retrieving BibTeX for a PDF) is not yet available in Grobid, but we opened a PR that implements it (kermitt2/grobid#800).
* Fixed class when accessing resources
* Use FileHelper method to get extension
* Use jsoup to issue POST request
* Removed unnecessary field
* Reverted URLDownload
  It's no longer necessary to set the POST data by bytes as we use JSoup for that.
* Changelog entry
* Add pdf link to imported entry
* Remove citationkey from Grobid
  Grobid cannot predict a citationkey
* FirstPageImporter
* Fixed grammar mistake in CHANGELOG.md
  Co-authored-by: Christoph
* Fixed Grobid tests
* Fixed Grobid URL
* Checkstyle
* Fixed doc
* Checkstyle
* Use JSoup for plaintext citations as well
* Renamed FirstPageImporter to PdfVerbatimBibTextImporter
* Fixed getName (no importer)
* Renamed Grobid importer to match convention
* PdfEmbeddedBibTeXImporter
* Renamed PdfEmbeddedBibTeXImporter to PdfEmbeddedBibFileImporter
* Checkstyle
* Remove debug output
* Checkstyle
* PdfMergeMetadataImporter
* Add DOI and ISBN fetching in PdfMergeMetadataImporter
* Fixed concurrent list access
* Adapted tests to contain fetchable ID's
* Derive XMP preferences from importFormatPreferences
* Localization
* Use Importers in JabRef
* Remove unnecessary test documents
* Checkstyle
* Grobid Timeout
* Null-check
* Use MergeImporter as WebFetcher
  Users can perform a PDF import on already imported PDFs to improve the quality of the entry
* Only force BibTeX import if everything else fails
  Fixes #7984
* Prioritize non-bruteforce importers that
  When importing, try importers that can tell if they are suitable for a certain file format or not. Some importers only check if a file is present, not if it is in the correct format (isRecognizedFormat is always true if an existing file is given). They are used last.
  The List of importers now reflects that prioritization. It is not sorted by importer names anymore. The getter-methods getImportFormats and getImportFormatList still sort the List by name for the View.
* Checkstyle
* Fixed WebFetchersTest
* Grobid does not need localization
* Followup on removed Grobid localization
* Fixed tests
* Checkstyle
* Grobid Fetcher and Tests adapted to updated Grobid
* Adapted GrobidServiceTest to updated Grobid

Co-authored-by: Christoph

Accept application/x-bibtex for processHeaderDocument

3f87777

btut mentioned this pull request Jul 20, 2021

Implement an interface to import PDF metadata from multiple sources (XMP, Grobid, ...) JabRef/jabref#7929

Merged

5 tasks

Benedikt Tutzer added 2 commits July 26, 2021 10:05

Use ISO date format of normalized publication date

72d7117

Don't include date if ISO-string is null

7062857

kermitt2 requested changes Jul 31, 2021

kermitt2 added the enhancement label Jul 31, 2021

kermitt2 added this to the 0.7.1 milestone Jul 31, 2021

Benedikt Tutzer added 4 commits August 1, 2021 15:58

Merge branch 'master' of https://github.com/kermitt2/grobid into feat…

b75b44d

…ure/bibtexHeaderAndFulltext

Re-used iso-date formatting from TeiFormatter

34f4f4f

Return correct MIME-type

c1afa0a

Fix empty lastname

2a69fa4

If only a firstname is detected, use it without lastname.

kermitt2 approved these changes Aug 1, 2021

kermitt2 merged commit 1a6b103 into kermitt2:master Aug 1, 2021

kermitt2 mentioned this pull request Aug 5, 2021

Review date handling with header consolidation and date serialization in references #807

Merged
