Implement an interface to import PDF metadata from multiple sources (XMP, Grobid, ...) #7929

Merged
merged 128 commits into from
Aug 21, 2021
Changes from 1 commit
22f0241
GrobidPdfMetadataImporter implemented
btut Jul 20, 2021
8effaa9
Fixed class when accessing resources
btut Jul 20, 2021
5d487d2
Draft of merge dialog
btut Jul 20, 2021
96cd5cf
Default to first available entry
btut Jul 21, 2021
8b5510e
Changed layout
btut Jul 21, 2021
3a4a01a
Checkstyle
btut Jul 21, 2021
8314855
Bind buttons with equal content together
btut Jul 21, 2021
05964bc
Use TextArea only for multiline fields
btut Jul 23, 2021
0f64b1c
Use SplitPane
btut Jul 23, 2021
1260cf9
Fixed scaling of labels
btut Jul 23, 2021
97fb43d
Add tooltip for toggle buttons
btut Jul 23, 2021
733415f
Implemented loading BibEntries in background
btut Jul 23, 2021
620424c
Implemented DOI Lookup button
btut Jul 23, 2021
46bf75a
Changed Button content to TextFlow
btut Jul 24, 2021
f036112
Change DOI button to icon
btut Jul 28, 2021
a5d216c
Use FileHelper method to get extension
btut Jul 28, 2021
d9dc84e
Use ellipsing text flow
btut Jul 28, 2021
6e6b5bc
Ignore empty fields
btut Jul 28, 2021
3546715
Use jsoup to issue POST request
btut Jul 28, 2021
6906d11
Removed unnecessary field
btut Jul 28, 2021
c9e4c06
Reverted URLDownload
btut Jul 28, 2021
f319fc7
Enable VGrow
btut Jul 28, 2021
993ad84
Insets and DiffHighlighting
btut Jul 30, 2021
8a7b80f
GrobidPdfMetadataImporter implemented
btut Jul 20, 2021
6fe2a23
Fixed class when accessing resources
btut Jul 20, 2021
f99bc52
Use FileHelper method to get extension
btut Jul 28, 2021
1d64d80
Use jsoup to issue POST request
btut Jul 28, 2021
f591bfc
Removed unnecessary field
btut Jul 28, 2021
b2bd365
Reverted URLDownload
btut Jul 28, 2021
e458c77
Changelog entry
btut Jul 30, 2021
5478585
Add pdf link to imported entry
btut Jul 30, 2021
d0cc663
Remove citationkey from Grobid
btut Jul 30, 2021
2cd78fc
FirstPageImporter
btut Jul 30, 2021
eb22157
Fixed grammar mistake in CHANGELOG.md
btut Jul 30, 2021
3ac0094
Fixed Grobid tests
btut Jul 30, 2021
c87ed4e
Fixed Grobid URL
btut Jul 30, 2021
3d8c4da
Checkstyle
btut Jul 30, 2021
168b866
Fixed doc
btut Jul 30, 2021
42adea9
Checkstyle
btut Jul 30, 2021
980af83
MVVM split
btut Aug 1, 2021
73dc505
Use JSoup for plaintext citations as well
btut Aug 1, 2021
33cbc95
Merge branch 'improvement/morePdfImporters' into improvement/pdfMetad…
btut Aug 1, 2021
d841207
Actual MVVM
btut Aug 2, 2021
616e73d
Fixes and styling
btut Aug 2, 2021
e2a215e
Cleanup Diff-highlighting
btut Aug 2, 2021
44dfebd
Checkstyle
btut Aug 2, 2021
1841cdf
Prettier loading indicator
btut Aug 2, 2021
7ce7105
Renamed FirstPageImporter to PdfVerbatimBibTextImporter
btut Aug 4, 2021
53d8e9a
Fixed getName (no importer)
btut Aug 4, 2021
9080f14
Renamed Grobid importer to match convention
btut Aug 4, 2021
4c74d51
Fixed loading issue
btut Aug 5, 2021
2757be6
PdfEmbeddedBibTeXImporter
btut Aug 5, 2021
8a05c3e
Renamed PdfEmbeddedBibTeXImporter to PdfEmbeddedBibFileImporter
btut Aug 5, 2021
0c488ec
Checkstyle
btut Aug 5, 2021
02057f0
Remove debug output
btut Aug 5, 2021
3d66855
Checkstyle
btut Aug 5, 2021
fd8918b
PdfMergeMetadataImporter
btut Aug 5, 2021
56868f5
Add DOI and ISBN fetching in PdfMergeMetadataImporter
btut Aug 5, 2021
479a0bc
Fixed concurrent list access
btut Aug 5, 2021
cb6a910
Adapted tests to contain fetchable ID's
btut Aug 5, 2021
0b64ebd
Configurable diff-modes and styling
btut Aug 10, 2021
649049c
Localization
btut Aug 10, 2021
b6e3aaa
Refactor
btut Aug 10, 2021
7f78c9e
Merge branch 'main' of github.com:JabRef/jabref into improvement/pdfM…
btut Aug 10, 2021
e18eabd
Merge branch 'main' of github.com:JabRef/jabref into improvement/more…
btut Aug 10, 2021
1bf6409
Derive XMP preferences from importFormatPreferences
btut Aug 10, 2021
787e040
Localization
btut Aug 10, 2021
a3cdff9
Use Importers in JabRef
btut Aug 10, 2021
564988a
Remove unnecessary test documents
btut Aug 10, 2021
4db14f4
Fixed error introduced by refactor
btut Aug 10, 2021
ba13971
Fit field-editor-column to width
btut Aug 10, 2021
e3d279a
Checkstyle
btut Aug 11, 2021
25e7b2e
Localization in diff-mode
btut Aug 14, 2021
04eecaf
Grobid Timeout
btut Aug 14, 2021
b7e5b62
Null-check
btut Aug 14, 2021
5cbf919
Use MergeImporter as WebFetcher
btut Aug 14, 2021
1cb4dfc
Only force BibTeX import if everything else fails
btut Aug 16, 2021
3ab8ebb
Prioritize non-bruteforce importers that
btut Aug 16, 2021
7ba8b40
Checkstyle
btut Aug 16, 2021
eadbf67
Added explanaition on need for runInJavaFXThread
btut Aug 16, 2021
2b00f47
Styling for dark theme
btut Aug 16, 2021
18dbb67
Fixed WebFetchersTest
btut Aug 16, 2021
9a138b6
Added parse pdf metadata button to GUI
btut Aug 16, 2021
41de0d0
Changelog
btut Aug 16, 2021
4fdd850
Merge branch 'main' of github.com:JabRef/jabref into improvement/pdfM…
btut Aug 16, 2021
6cd9544
Fixed moving-text glitch
btut Aug 16, 2021
1f4bf84
Follow up on glitch-fix
btut Aug 16, 2021
0468e67
Checkstyle and localization
btut Aug 16, 2021
3d46df4
Grobid does not need localization
btut Aug 16, 2021
40b2759
Followup on removed Grobid localization
btut Aug 16, 2021
6324cf2
Fixed tests
btut Aug 16, 2021
54274be
Merge branch 'improvement/morePdfImporters' of github.com:btut/jabref…
btut Aug 16, 2021
089b025
Enable all importers
btut Aug 16, 2021
5cf2af7
Merge branch 'main' of github.com:JabRef/jabref into improvement/more…
btut Aug 16, 2021
b555ada
Checkstyle
btut Aug 16, 2021
a956a37
Merge branch 'improvement/morePdfImporters' of github.com:btut/jabref…
btut Aug 17, 2021
5994a5d
Merge branch 'main' of github.com:JabRef/jabref into improvement/pdfM…
btut Aug 17, 2021
44fee74
Improved display
btut Aug 18, 2021
7a98c8a
Modern switch statements
btut Aug 18, 2021
fb186e3
Fixed position of buttons in LinkedFilesEditor
calixtus Aug 18, 2021
63272bb
Merge remote-tracking branch 'btut/improvement/pdfMetadataImport' int…
calixtus Aug 18, 2021
9db1045
Merge branch 'main' of github.com:JabRef/jabref into improvement/pdfM…
btut Aug 18, 2021
8b1974c
Merge branch 'improvement/pdfMetadataImport' of github.com:btut/jabre…
btut Aug 18, 2021
9f69569
Collapse importers that yield no result
btut Aug 19, 2021
31b62a0
Settings for grobid
btut Aug 19, 2021
cdc9fb1
Merge branch 'main' of github.com:JabRef/jabref into useGrobidPreference
btut Aug 19, 2021
fcdb5a4
Use settings
btut Aug 19, 2021
7e918a3
Updated PdfImporter priorization
btut Aug 19, 2021
3440b32
Store opt-out preference
btut Aug 19, 2021
e5222ce
Partial implementation of opt-in/out dialogue
btut Aug 19, 2021
ef5444a
Show dialog before all Grobid actions
btut Aug 20, 2021
ffaeec5
Static code checks
btut Aug 20, 2021
4f4f398
Merge branch 'main' of github.com:JabRef/jabref into useGrobidPreference
btut Aug 20, 2021
e9ef2e3
Merge branch 'useGrobidPreference' into improvement/pdfMetadataImport
btut Aug 20, 2021
ae32a40
Use Grobid Settings and Opt-In dialog
btut Aug 20, 2021
bd6c8e4
Fix l10n issue (introduced in merge)
btut Aug 20, 2021
70a2e3e
Merge branch 'main' of github.com:JabRef/jabref into useGrobidPreference
btut Aug 20, 2021
51eb15d
Merge branch 'useGrobidPreference' into improvement/pdfMetadataImport
btut Aug 20, 2021
61b3b5b
Fixed missing import (introduced by merge)
btut Aug 20, 2021
3dbafbe
Merge branch 'useGrobidPreference' into improvement/pdfMetadataImport
btut Aug 20, 2021
69af125
Extract given-clause in test
btut Aug 21, 2021
f676003
Improved readability
btut Aug 21, 2021
1e97104
Changelog
btut Aug 21, 2021
43aaa05
Merge branch 'useGrobidPreference' into improvement/pdfMetadataImport
btut Aug 21, 2021
55a4653
Changelog update
btut Aug 21, 2021
f0afe0c
Merge branch 'useGrobidPreference' into improvement/pdfMetadataImport
btut Aug 21, 2021
87eded9
Renamed Entry to EntrySource
btut Aug 21, 2021
32c0a3d
Merge branch 'main' of github.com:JabRef/jabref into improvement/pdfM…
btut Aug 21, 2021
@@ -23,9 +23,10 @@

public class GrobidCitationFetcher implements SearchBasedFetcher {

public static final String GROBID_URL = "http://grobid.jabref.org:8070";

private static final Logger LOGGER = LoggerFactory.getLogger(GrobidCitationFetcher.class);

private static final String GROBID_URL = "http://grobid.jabref.org:8070";
private ImportFormatPreferences importFormatPreferences;
private GrobidService grobidService;

@@ -0,0 +1,95 @@
package org.jabref.logic.importer.fileformat;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.util.Objects;

import org.jabref.logic.importer.ImportFormatPreferences;
import org.jabref.logic.importer.Importer;
import org.jabref.logic.importer.ParserResult;
import org.jabref.logic.importer.util.GrobidService;
import org.jabref.logic.l10n.Localization;
import org.jabref.logic.util.StandardFileType;

/**
* Wraps the GrobidService function to be used as an Importer.
*/
public class GrobidPdfMetadataImporter extends Importer {

private final GrobidService grobidService;
private final ImportFormatPreferences importFormatPreferences;

public GrobidPdfMetadataImporter(String grobidServerURL, ImportFormatPreferences importFormatPreferences) {
this.grobidService = new GrobidService(grobidServerURL);
this.importFormatPreferences = importFormatPreferences;
}

@Override
public String getName() {
return Localization.lang("Grobid PDF metadata");
}

@Override
public StandardFileType getFileType() {
return StandardFileType.PDF;
}

@Override
public ParserResult importDatabase(BufferedReader reader) throws IOException {
Objects.requireNonNull(reader);
throw new UnsupportedOperationException(
"GrobidPdfMetadataImporter does not support importDatabase(BufferedReader reader). "
+ "Instead use importDatabase(Path filePath, Charset defaultEncoding).");
}

@Override
public ParserResult importDatabase(String data) throws IOException {
Objects.requireNonNull(data);
throw new UnsupportedOperationException(
"GrobidPdfMetadataImporter does not support importDatabase(String data). "
+ "Instead use importDatabase(Path filePath, Charset defaultEncoding).");
}

@Override
public ParserResult importDatabase(Path filePath, Charset defaultEncoding) {
Objects.requireNonNull(filePath);
try {
return new ParserResult(grobidService.processPDF(filePath, importFormatPreferences));
} catch (Exception exception) {
return ParserResult.fromError(exception);
}
}

@Override
public boolean isRecognizedFormat(BufferedReader reader) throws IOException {
Objects.requireNonNull(reader);
return false;
}

/**
* Returns whether the given file has a PDF file extension.
* Only the file name is checked; the file content is not inspected.
*/
@Override
public boolean isRecognizedFormat(Path filePath, Charset defaultEncoding) throws IOException {
Objects.requireNonNull(filePath);
String[] splittedFileName = filePath.getFileName().toString().split("\\.");
if (splittedFileName.length <= 1) {
return false;
}
String extension = splittedFileName[splittedFileName.length - 1];
return getFileType().getExtensions().contains(extension);
}

@Override
public String getId() {
return "grobidPdf";
}

@Override
public String getDescription() {
return "Wraps the GrobidService function to be used as an Importer.";
}
}
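
The extension check in `isRecognizedFormat` above splits the file name on dots and compares the last piece as-is. If `StandardFileType.PDF.getExtensions()` only lists lower-case extensions (an assumption, the list is not shown in this diff), a file named `paper.PDF` would be rejected. Below is a minimal sketch of a case-insensitive variant. `PdfExtensionCheck`, `extensionOf`, and `looksLikePdf` are hypothetical names used for illustration only and are not JabRef API.

```java
import java.nio.file.Path;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper class, not part of JabRef.
final class PdfExtensionCheck {

    // Returns the lower-cased extension without the dot, if the file name has one.
    static Optional<String> extensionOf(Path path) {
        Path fileName = path.getFileName();
        if (fileName == null) {
            return Optional.empty();
        }
        String name = fileName.toString();
        int dot = name.lastIndexOf('.');
        // No dot, only a leading dot (".hidden"), or a trailing dot: treat as "no extension".
        if (dot <= 0 || dot == name.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(name.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    // Assumption: "pdf" is the only extension of interest here.
    static boolean looksLikePdf(Path path) {
        return extensionOf(path).map("pdf"::equals).orElse(false);
    }

    public static void main(String[] args) {
        System.out.println(looksLikePdf(Path.of("paper.PDF")));
        System.out.println(looksLikePdf(Path.of("library.bib")));
    }
}
```

Unlike the `split("\\.")` version, this sketch treats a dot-file such as `.pdf` as having no extension. Whether that is desirable is a design choice.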
58 changes: 58 additions & 0 deletions src/main/java/org/jabref/logic/importer/util/GrobidService.java
@@ -3,9 +3,17 @@
import java.io.IOException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;

import org.jabref.logic.importer.ImportFormatPreferences;
import org.jabref.logic.importer.ParseException;
import org.jabref.logic.importer.fileformat.BibtexParser;
import org.jabref.logic.net.URLDownload;
import org.jabref.model.entry.BibEntry;
import org.jabref.model.util.DummyFileUpdateMonitor;

/**
* Implements an API to a GROBID server, as described at
@@ -19,6 +27,8 @@
*/
public class GrobidService {

public static final String HTTP_REQUEST_BOUNDARY = "---------------------------JabRefRequest";

public enum ConsolidateCitations {
NO(0), WITH_METADATA(1), WITH_DOI_ONLY(2);
private int code;
@@ -59,4 +69,52 @@ public String processCitation(String rawCitation, ConsolidateCitations consolida

return httpResponse;
}

public List<BibEntry> processPDF(Path filePath, ImportFormatPreferences importFormatPreferences) throws IOException, ParseException {
URLDownload urlDownload = new URLDownload(grobidServerURL
+ "/api/processHeaderDocument"); // shall we use processFulltextDocument?
urlDownload.setConnectTimeout(Duration.ofSeconds(150));
urlDownload.addHeader("Accept", MediaTypes.APPLICATION_BIBTEX);
urlDownload.addHeader("Content-Type", "multipart/form-data; boundary=" + HTTP_REQUEST_BOUNDARY);
urlDownload.setPostData(readPdf(filePath));
String httpResponse = urlDownload.asString();

if (httpResponse == null || httpResponse.equals("@misc{-1,\n author = {}\n}\n")) { // This filters empty BibTeX entries
throw new IOException("The GROBID server response does not contain anything.");
}

BibtexParser parser = new BibtexParser(importFormatPreferences, new DummyFileUpdateMonitor());
return parser.parseEntries(httpResponse);
}

private byte[] readPdf(Path filePath) throws IOException {
StringBuilder preFile = new StringBuilder();
preFile.append("--");
preFile.append(HTTP_REQUEST_BOUNDARY);
preFile.append("\r\n");
preFile.append("Content-Disposition: form-data; name=\"consolidateHeader\"\r\n\r\n1\r\n--");
preFile.append(HTTP_REQUEST_BOUNDARY);
preFile.append("\r\n");
preFile.append("Content-Disposition: form-data; name=\"input\"; filename=\"");
preFile.append(filePath.getFileName().toString());
preFile.append("\"\r\nContent-Type: application/pdf\r\n\r\n");
byte[] preFileBytes = preFile.toString().getBytes(StandardCharsets.UTF_8);

byte[] fileContent = Files.readAllBytes(filePath);

StringBuilder postFile = new StringBuilder();
postFile.append("\r\n--");
postFile.append(HTTP_REQUEST_BOUNDARY);
postFile.append("\r\n");
postFile.append("Content-Disposition: form-data; name=\"input\"\r\n\r\n\r\n--");
postFile.append(HTTP_REQUEST_BOUNDARY);
postFile.append("--\r\n");
byte[] postFileBytes = postFile.toString().getBytes(StandardCharsets.UTF_8);

byte[] post = new byte[preFileBytes.length + fileContent.length + postFileBytes.length];
System.arraycopy(preFileBytes, 0, post, 0, preFileBytes.length);
System.arraycopy(fileContent, 0, post, preFileBytes.length, fileContent.length);
System.arraycopy(postFileBytes, 0, post, preFileBytes.length + fileContent.length, postFileBytes.length);
return post;
}
}
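
`readPdf` above assembles the `multipart/form-data` body by hand: a `consolidateHeader` text part, the PDF as the `input` file part, and a closing boundary. The same layout can be sketched with a `ByteArrayOutputStream`, which avoids the manual `System.arraycopy` bookkeeping. This is an illustration under stated assumptions, not the PR's implementation: `MultipartBodySketch`, `build`, and `writeText` are hypothetical names, and the sketch leaves out the extra empty `input` part that `readPdf` appends after the file.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of JabRef.
final class MultipartBodySketch {

    // Same boundary string as HTTP_REQUEST_BOUNDARY in the diff above.
    static final String BOUNDARY = "---------------------------JabRefRequest";
    private static final String CRLF = "\r\n";

    static byte[] build(String fileName, byte[] pdfBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // Part 1: plain form field "consolidateHeader" with value 1.
        writeText(out, "--" + BOUNDARY + CRLF
                + "Content-Disposition: form-data; name=\"consolidateHeader\"" + CRLF + CRLF
                + "1" + CRLF);

        // Part 2: the PDF itself, written as raw bytes after its part headers.
        writeText(out, "--" + BOUNDARY + CRLF
                + "Content-Disposition: form-data; name=\"input\"; filename=\"" + fileName + "\"" + CRLF
                + "Content-Type: application/pdf" + CRLF + CRLF);
        out.write(pdfBytes, 0, pdfBytes.length);

        // Closing delimiter: boundary followed by two dashes.
        writeText(out, CRLF + "--" + BOUNDARY + "--" + CRLF);
        return out.toByteArray();
    }

    // Encodes header text with an explicit charset instead of the platform default.
    private static void writeText(ByteArrayOutputStream out, String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        byte[] body = build("example.pdf", new byte[] {0x25, 0x50, 0x44, 0x46});
        System.out.println(body.length + " bytes");
    }
}
```

The matching request header would be `Content-Type: multipart/form-data; boundary=` followed by the boundary without the leading `--`, as in `processPDF` above.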
12 changes: 9 additions & 3 deletions src/main/java/org/jabref/logic/net/URLDownload.java
@@ -69,7 +69,7 @@ public class URLDownload {

private final URL source;
private final Map<String, String> parameters = new HashMap<>();
private String postData = "";
private byte[] postData = null;
private Duration connectTimeout = DEFAULT_CONNECT_TIMEOUT;

/**
@@ -222,6 +222,12 @@ public void addHeader(String key, String value) {
}

public void setPostData(String postData) {
if (postData != null) {
this.postData = postData.getBytes();
}
}

public void setPostData(byte[] postData) {
if (postData != null) {
this.postData = postData;
}
@@ -339,10 +345,10 @@ private URLConnection openConnection() throws IOException {
for (Entry<String, String> entry : this.parameters.entrySet()) {
connection.setRequestProperty(entry.getKey(), entry.getValue());
}
if (!this.postData.isEmpty()) {
if (this.postData != null) {
connection.setDoOutput(true);
try (DataOutputStream wr = new DataOutputStream(connection.getOutputStream())) {
wr.writeBytes(this.postData);
wr.write(this.postData);
}
}
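
The switch from `wr.writeBytes(String)` to `wr.write(byte[])` in this hunk matters once the POST body is no longer plain ASCII. `DataOutputStream.writeBytes` writes one byte per `char` and discards the high eight bits, so any character above U+00FF is silently truncated, whereas `write(byte[])` sends an already encoded buffer unchanged. The following small demonstration uses only the JDK. `PostDataEncodingDemo` and its methods are hypothetical names for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of JabRef.
final class PostDataEncodingDemo {

    // Old behavior: one byte per char, high eight bits of each char are dropped.
    static byte[] viaWriteBytes(String data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeBytes(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // New behavior: the byte array is written exactly as given.
    static byte[] viaWrite(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String euro = "\u20ac"; // a single char outside the 8-bit range
        byte[] utf8 = euro.getBytes(StandardCharsets.UTF_8);
        System.out.println("writeBytes: " + viaWriteBytes(euro).length + " byte(s)");
        System.out.println("write:      " + viaWrite(utf8).length + " byte(s)");
    }
}
```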

2 changes: 2 additions & 0 deletions src/main/resources/l10n/JabRef_en.properties
@@ -2363,3 +2363,5 @@ Rebuild\ fulltext\ search\ index\ for\ current\ library?=Rebuild fulltext search
Rebuilding\ fulltext\ search\ index...=Rebuilding fulltext search index...
Failed\ to\ access\ fulltext\ search\ index=Failed to access fulltext search index
Found\ match\ in\ %0=Found match in %0

Grobid\ PDF\ metadata=Grobid PDF metadata
@@ -0,0 +1,71 @@
package org.jabref.logic.importer.fileformat;

import java.io.IOException;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

import org.jabref.logic.importer.ImportFormatPreferences;
import org.jabref.logic.importer.fetcher.GrobidCitationFetcher;
import org.jabref.logic.util.StandardFileType;
import org.jabref.model.entry.BibEntry;
import org.jabref.model.entry.field.StandardField;

import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.mockito.Answers;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

public class GrobidPdfMetadataImporterTest {

private GrobidPdfMetadataImporter importer;
private ImportFormatPreferences importFormatPreferences;

@BeforeEach
public void setUp() {
importFormatPreferences = mock(ImportFormatPreferences.class, Answers.RETURNS_DEEP_STUBS);
when(importFormatPreferences.getKeywordSeparator()).thenReturn(',');
importer = new GrobidPdfMetadataImporter(GrobidCitationFetcher.GROBID_URL, importFormatPreferences);
}

@Test
public void testsGetExtensions() {
assertEquals(StandardFileType.PDF, importer.getFileType());
}

@Test
public void testImportEntries() throws URISyntaxException {
Path file = Path.of(GrobidPdfMetadataImporterTest.class.getResource("LNCS-minimal.pdf").toURI());
List<BibEntry> bibEntries = importer.importDatabase(file, StandardCharsets.UTF_8).getDatabase().getEntries();

assertEquals(1, bibEntries.size());

BibEntry be0 = bibEntries.get(0);
assertEquals(Optional.of("Lastname, Firstname"), be0.getField(StandardField.AUTHOR));
assertEquals(Optional.of("Paper Title"), be0.getField(StandardField.TITLE));
}

@Test
public void testIsRecognizedFormat() throws IOException, URISyntaxException {
Path file = Path.of(GrobidPdfMetadataImporterTest.class.getResource("annotated.pdf").toURI());
assertTrue(importer.isRecognizedFormat(file, StandardCharsets.UTF_8));
}

@Test
public void testIsRecognizedFormatReject() throws IOException, URISyntaxException {
Path file = Path.of(PdfXmpImporterTest.class.getResource("BibtexImporter.examples.bib").toURI());
assertFalse(importer.isRecognizedFormat(file, StandardCharsets.UTF_8));
}

@Test
public void testGetCommandLineId() {
assertEquals("grobidPdf", importer.getId());
}
}