Accept application/x-bibtex for processHeaderDocument #800

Merged (7 commits) on Aug 1, 2021

@btut btut commented Jul 19, 2021

I extended the API so that the processHeaderDocument service can return BibTeX.
Previously, AFAIK, this was only possible for the /api/processReferences and /api/processCitation services.
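
As a rough illustration of what the extended endpoint allows, the sketch below builds (but does not send) a POST request to /api/processHeaderDocument that asks for BibTeX through the Accept header. It is written in Python purely for illustration and is not part of this PR. The multipart field name `input` and the base URL with port 8070 follow Grobid's usual service conventions and should be treated as assumptions to check against your own deployment; `build_header_bibtex_request` is a hypothetical helper name.

```python
import urllib.request
import uuid


def build_header_bibtex_request(pdf_bytes, base_url="http://localhost:8070",
                                filename="paper.pdf"):
    """Build a multipart POST request asking Grobid for BibTeX header metadata.

    Assumptions: the PDF is uploaded in a form field named "input" and the
    service listens on base_url; neither is confirmed by this thread.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (f'Content-Disposition: form-data; name="input"; '
         f'filename="{filename}"\r\n').encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url}/api/processHeaderDocument",
        data=body,
        method="POST",
        headers={
            # The Accept header selects BibTeX instead of the default TEI XML.
            "Accept": "application/x-bibtex",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

With a Grobid server running, the request could then be sent with `urllib.request.urlopen(req)` and the response body decoded as text.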

btut added a commit to btut/jabref that referenced this pull request Jul 20, 2021
Implemented an Importer that queries Grobid for metadata of a pdf.
The necessary Grobid functionality (retrieving BibTeX for a pdf) is not yet
available in Grobid, but we opened a PR that implements it
(kermitt2/grobid#800).
@koppor
Contributor

koppor commented Jul 21, 2021

This PR is part of #GSoC and is a follow-up to #532.

@kermitt2
Owner

Hello @btut !

Thanks for the PR. Did you check what the BibTeX results for the header metadata look like?

I vaguely remember that the extracted header fields are not mapped to exactly the same fields in the BiblioItem object as those of a citation string (because they are handled and normalized differently), so the BibTeX serialization might not work correctly for headers. Of course, many of the interesting extracted fields will be lost anyway in the very limited BibTeX format as compared to the XML version.

So depending on your feedback, we might need to extend/review the toBibTeX() method in BiblioItem class (which I haven't looked at in 7-8 years :) to get this PR working.

@btut
Author

btut commented Jul 23, 2021

Hi!

Did you check what the BibTeX results for the header metadata look like?

I checked a handful of PDFs and the results seem fine. The only necessary field that seems to be missing is the date. I'll check the toBibTeX method to see what else might be lost.

of course anyway many of the interesting extracted fields will be lost in the very limited BibTeX format as compared to the XML version

Sure, but I think the goal of someone who needs BibTeX is to cite the paper. Aside from the date, the BibTeX entry only lacks author details like e-mail and affiliation, which are not needed for citing.

we might need to extend/review the toBibTeX() method in BiblioItem class

I'll have a look!

@btut
Author

btut commented Jul 26, 2021

Hello again!
I fixed a small problem with the dates in toBibTeX. I noticed that usually the normalized_publication_date field is populated, not the publication_date, so I used that one and formatted it as ISO 8601 (used by BibTeX).

Unfortunately, for many of the papers I used for testing, Grobid did not detect a date at all. This was the case for both BibTeX and TEI.
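
To make the date fix concrete, here is a minimal sketch of ISO 8601 formatting for a normalized date whose month and day may be missing. It is an illustrative Python stand-in, not the Java code in toBibTeX; the function name `iso_8601_date` and the integer year/month/day inputs are assumptions.

```python
def iso_8601_date(year, month=None, day=None):
    """Format a possibly partial date as YYYY, YYYY-MM or YYYY-MM-DD.

    Returns None when no year is known, mirroring the case where no
    date was detected at all. A day without a month is ignored.
    """
    if year is None:
        return None
    parts = [f"{year:04d}"]
    if month is not None:
        parts.append(f"{month:02d}")
        if day is not None:
            parts.append(f"{day:02d}")
    return "-".join(parts)
```

Dropping a day that has no month keeps the output a valid ISO 8601 prefix rather than an ambiguous string.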

@btut
Author

btut commented Jul 28, 2021

I'll have a look!

I tested this extensively with many PDFs and it looks good! Could we move on with this PR? We would like to use this feature in JabRef.

btut added a commit to btut/jabref that referenced this pull request Jul 30, 2021
Implemented an Importer that queries Grobid for metadata of a pdf.
The necessary Grobid functionality (retrieving BibTeX for a pdf) is not yet
available in Grobid, but we opened a PR that implements it
(kermitt2/grobid#800).
@kermitt2
Owner

Hello @btut !

I submitted a review requesting changes: 1) reuse the existing ISO date normalization (it is more complete, and it has been changed/moved recently, so you might need to merge your PR branch with the current master) and 2) review the returned MIME type.

Apart from that, I've seen one bug: when the surname is incorrectly recognized as a forename, we get a null in the BibTeX:

curl -X POST -H "Accept: application/x-bibtex" -d "citations=Griff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113" localhost:8070/api/processCitation

@article{-1,
  author = {null, Griff},
  journal = {Expert. Opin. Ther. Targets},
  year = {2002},
  pages = {103--113},
  volume = {6},
  number = {1}
}
curl -X POST -H "Accept: " -d "citations=Griff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113" localhost:8070/api/processCitation
<biblStruct >
	<analytic>
		<title/>
		<author>
			<persName><forename type="first">Griff</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert. Opin. Ther. Targets</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="103" to="113" />
			<date type="published" when="2002" />
		</imprint>
	</monogr>
</biblStruct>

We might want to output no author at all in the BibTeX in this case?

The rest looks very good indeed, everything is well mapped as expected!

@kermitt2 kermitt2 added this to the 0.7.1 milestone Jul 31, 2021
@btut
Author

btut commented Aug 1, 2021

Thanks for your review, @kermitt2!

Apart from that, I've seen one bug: when the surname is incorrectly recognized as a forename, we get a null in the BibTeX

Great catch!

We might want to output no author at all in the BibTeX in this case?

I think just outputting the first name is better, as it at least preserves that information. I implemented that in 2a69fa4, let me know if you disagree!
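
The behaviour described above (fall back to the first name when no surname was recognized, instead of printing null) could look roughly like the following. This is a hedged Python sketch and not the code from commit 2a69fa4; `format_author` and `format_authors` are hypothetical names, and authors are modelled as simple (surname, forename) pairs.

```python
def format_author(surname, forename):
    """Return 'Surname, Forename', or whichever part exists, or None."""
    surname = (surname or "").strip()
    forename = (forename or "").strip()
    if surname and forename:
        return f"{surname}, {forename}"
    # Keep the only available part rather than emitting a literal "null".
    return surname or forename or None


def format_authors(persons):
    """Join (surname, forename) pairs into a BibTeX author value."""
    names = [format_author(surname, forename) for surname, forename in persons]
    return " and ".join(name for name in names if name)
```

For the "Griff" example, `format_authors([(None, "Griff")])` gives `Griff`, so the author field keeps the one name that was extracted.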

@kermitt2
Owner

kermitt2 commented Aug 1, 2021

Thanks a lot @btut for the changes !

@kermitt2 kermitt2 merged commit 1a6b103 into kermitt2:master Aug 1, 2021
@btut
Author

btut commented Aug 1, 2021

Thanks for your great work! Grobid seems to work very well!

@kermitt2
Owner

kermitt2 commented Aug 1, 2021

I was too fast (as often): the change is actually not passing the tests because of the way dates are now output in the BibTeX:

  • if I understand correctly, date in BibTeX is always ISO-normalized?

  • for a more "raw" date, BibTeX uses year, for instance for this:

Kolb, S., Wirtz G.: Towards Application Portability in Platform as a Service Proceedings of the 8th IEEE International Symposium on Service-Oriented System Engineering (SOSE), Oxford, United Kingdom, April 7 - 10, 2014.

we don't have a normalized date, because of the day range, so we would have something like this in the BibTeX:

year = {April 7 - 10, 2014},

  • we can also have something like this:

year = {2014},
month = {April},
day = {7 - 10},

and apparently the support of these BibTeX flavors depends on the style?

@kermitt2
Owner

kermitt2 commented Aug 1, 2021

Is something like this acceptable, with both date and year to reflect normalized versus raw extraction:

curl -X POST -H "Accept: application/x-bibtex" -d "citations=Kolb, S., Wirtz G.: Towards Application Portability in Platform as a Service Proceedings of the 8th IEEE International Symposium on Service-Oriented System Engineering (SOSE), Oxford, United Kingdom, April 7 - 10, 2014." localhost:8070/api/processCitation
@inproceedings{-1,
  author = {Kolb, S and Wirtz, G},
  booktitle = {Towards Application Portability in Platform as a Service Proceedings of the 8th IEEE International Symposium on Service-Oriented System Engineering (SOSE)},
  date = {2014},
  year = {April 7 - 10, 2014},
  address = {Oxford, United Kingdom}
}

note: implemented in PR #807

@btut
Author

btut commented Aug 2, 2021

I added some thoughts in #807.

Is something like this acceptable, with both date and year to reflect normalized versus raw extraction

I think this would be a good way to go as it gives all possible data and the user can decide on what to use. I would go the other way around though (split normalized_publication_date into year, month and day and place the non-normalized date in the date field), because year should be numeric.

@koppor
Contributor

koppor commented Aug 2, 2021

TLDR:

* we can also have something like this:
year = {2014},
month = {April},
day = {7 - 10},

Please try to output the following:

year = {2014},
month = 4,
date = {2014-04-07/2014-04-10},

Long answer:

* if I understand correctly, `date` in BibTeX is always ISO-normalized?

This is done by the BibTeX variant "biblatex". See https://ftp.rrzn.uni-hannover.de/pub/mirror/tex-archive/macros/latex/contrib/biblatex/doc/biblatex.pdf

"iso8601-2 Extended Format specification level 1", which is a "yes" to your question.

Nevertheless, date ranges can be specified:

[image]

However, normal BibTeX does not use that; it only uses year and month. In all cases, the year field should be filled to be compatible with most .bib processors out there. The date field is more of an optional thing and could be left out.

we don't have a normalized date, because of the day range, so we would have something like this in the BibTeX:

year = {April 7 - 10, 2014},

This should not happen. :)

and apparently the support of these BibTeX flavors depends on the style?

Yes, it does. However, year and month are (nearly) always supported.
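
The output suggested in the TLDR above (a numeric year, a numeric month, and an ISO 8601 date that may be a range) can be sketched as follows. This is an illustrative Python example under stated assumptions, not what Grobid implements: `bibtex_date_fields` and `render_fields` are hypothetical helpers, and the start and end of the range are assumed to be already parsed into `datetime.date` values.

```python
from datetime import date


def bibtex_date_fields(start, end=None):
    """Build year, month and date fields for one date or a date range."""
    iso = start.isoformat()
    if end is not None and end != start:
        # biblatex-style range: start and end separated by a slash.
        iso = f"{iso}/{end.isoformat()}"
    return {
        "year": f"{{{start.year}}}",  # braced, e.g. {2014}
        "month": str(start.month),    # bare number, e.g. 4
        "date": f"{{{iso}}}",
    }


def render_fields(fields):
    """Render the fields one per line, in insertion order."""
    return "\n".join(f"{key} = {value}," for key, value in fields.items())


print(render_fields(bibtex_date_fields(date(2014, 4, 7), date(2014, 4, 10))))
```

Keeping year and month alongside date follows the point made above that plain BibTeX styles generally rely on those two fields, while date is optional.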

Siedlerchr added a commit to JabRef/jabref that referenced this pull request Aug 18, 2021
* GrobidPdfMetadataImporter implemented

Implemented an Importer that queries Grobid for metadata of a pdf.
The necessary Grobid functionality (retrieving BibTeX for a pdf) is not yet
available in Grobid, but we opened a PR that implements it
(kermitt2/grobid#800).

* Fixed class when accessing resources

* Use FileHelper method to get extension

* Use jsoup to issue POST request

* Removed unnecessary field

* Reverted URLDownload

It's no longer necessary to set the POST data by bytes as we use JSoup
for that.

* Changelog entry

* Add pdf link to imported entry

* Remove citationkey from Grobid

Grobid cannot predict a citationkey

* FirstPageImporter

* Fixed grammar mistake in CHANGELOG.md

Co-authored-by: Christoph <[email protected]>

* Fixed Grobid tests

* Fixed Grobid URL

* Checkstyle

* Fixed doc

* Checkstyle

* Use JSoup for plaintext citations as well

* Renamed FirstPageImporter to PdfVerbatimBibTextImporter

* Fixed getName (no importer)

* Renamed Grobid importer to match convention

* PdfEmbeddedBibTeXImporter

* Renamed PdfEmbeddedBibTeXImporter to PdfEmbeddedBibFileImporter

* Checkstyle

* Remove debug output

* Checkstyle

* PdfMergeMetadataImporter

* Add DOI and ISBN fetching in PdfMergeMetadataImporter

* Fixed concurrent list access

* Adapted tests to contain fetchable ID's

* Derive XMP preferences from importFormatPreferences

* Localization

* Use Importers in JabRef

* Remove unnecessary test documents

* Checkstyle

* Grobid Timeout

* Null-check

* Use MergeImporter as WebFetcher

Users can perform a PDF import on already imported pdf's to improve the
quality of the entry

* Only force BibTeX import if everything else fails

Fixes #7984

* Prioritize non-bruteforce importers that

When importing, try importers that can tell if they are suitable for a
certain file format or not.
Some importers only check if a file is present, not if it is in the correct
format (isRecognizedFormat is always true if an existing file is given).
They are used last.

The List of importers now reflects that prioritization. It is not sorted
by importer names anymore.
The getter-methods getImportFormats and getImportFormatList still sort
the List by name for the View.

* Checkstyle

* Fixed WebFetchersTest

* Grobid does not need localization

* Followup on removed Grobid localization

* Fixed tests

* Checkstyle

* Grobid Fetcher and Tests adapted to updated Grobid

* Adapted GrobidServiceTest to updated Grobid

Co-authored-by: Christoph <[email protected]>