Added Fetcher for ISIDORE #10518
Merged
51 commits
fb2222e
ADD ISIDOREFetcher
7d65239
ADD ISIDOREFetcher to WebFetchers
0f87aed
Merge branch 'JabRef:main' into fix-for-issue-10423
u7492883 6d02f06
Merge branch 'JabRef:main' into fix-for-issue-10423
u7492883 25e4985
ADD ISIDOREFetcherTest.java
060b388
FIX ISIDOREFetcher.java
3037d9a
Merge branch 'fix-for-issue-10423' of https://github.com/u7492883/jab…
ea6988e
ADD ISIDORE privacy policy
403bf06
ADD ISIDORE fetcher to CHANGELOG.md
6d286fe
ADD issue number before link.
e50b370
REMOVE stacktrace message from ISIDOREFetcher.java
39496eb
FIX static analysis issue with ISIDOREFetcher.java.
664a933
REMOVED inverted booleans
bcd405c
FIX using replace instead of replaceALL
c08beb0
FIX string equals avoids null
38b9d33
Merge branch 'JabRef:main' into fix-for-issue-10423
u7492883 48def73
EDIT moved parser creation into constructor.
b246a0d
FIX added constant values and removed abstract (due to copyright risk…
86e4ba4
FIX made ISIDOREFetcherTest.java more readable.
ca69df7
EDIT use //s instead of looking for multiple spaces
81c2172
ADD comment about publisher format.
8f45b03
Merge branch 'main' into fix-for-issue-10423
koppor e15a3aa
EDIT made test cases more readable and intuitive based on comments fr…
0ca60b9
EDIT using StringJoiner and fixed exceptions based on comments.
4dacce8
Merge branch 'fix-for-issue-10423' of https://github.com/u7492883/jab…
065c4cf
Merge branch 'main' of https://github.com/jabref/jabref into fix-for-…
db4a9ad
ADD added message to CHANGELOG.md
620f719
FIX style compliace for CHANGELOG.md
0553443
FIX remove quotation marks from title
80e1088
EDIT moved documentBuilder out of Parser
4702d14
FIX moved Isidore Fetcher into unreleased. I missed something when me…
1e3f3a3
FIX test case after fixing quotation marks in title
ae49b38
Implement search based parser fetcher
Siedlerchr fdf40fd
add xml output
Siedlerchr b3e7d59
fck fetcher
Siedlerchr 5c09c36
fck fetcher
Siedlerchr cd6d630
add accept header
Siedlerchr 8e3eb37
Merge branch 'main' into fix-for-issue-10423
Siedlerchr 50e4205
fix fetcher
Siedlerchr d00eeb1
Fix checkstyle
koppor e96cfef
Adapt test to include new fetcher
koppor b8c8f68
Fix checkstyle issues
koppor 92cf060
Fix support for querying for authors
koppor 87f17bc
remove duplicate test
koppor 5b04053
Merge branch 'main' into fix-for-issue-10423
koppor d4d08e0
Add workaround of fielded terms.
koppor b71ef06
More modern "Word" class
koppor baa4038
openRewrite
koppor 42bac2b
Merge branch 'main' into fix-for-issue-10423
koppor a0dfb2d
Merge branch 'main' into fix-for-issue-10423
koppor 247ec3e
Fix checkstyle
koppor
248 changes: 248 additions & 0 deletions
src/main/java/org/jabref/logic/importer/fetcher/ISIDOREFetcher.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,248 @@ | ||
package org.jabref.logic.importer.fetcher; | ||
|
||
import java.io.IOException; | ||
import java.io.PushbackInputStream; | ||
import java.net.MalformedURLException; | ||
import java.net.URISyntaxException; | ||
import java.net.URL; | ||
import java.nio.charset.StandardCharsets; | ||
import java.util.ArrayList; | ||
import java.util.Collections; | ||
import java.util.List; | ||
import java.util.Optional; | ||
import java.util.StringJoiner; | ||
|
||
import javax.xml.parsers.DocumentBuilder; | ||
import javax.xml.parsers.DocumentBuilderFactory; | ||
import javax.xml.parsers.ParserConfigurationException; | ||
|
||
import org.jabref.logic.help.HelpFile; | ||
import org.jabref.logic.importer.FetcherException; | ||
import org.jabref.logic.importer.PagedSearchBasedParserFetcher; | ||
import org.jabref.logic.importer.Parser; | ||
import org.jabref.logic.importer.fetcher.transformers.ISIDOREQueryTransformer; | ||
import org.jabref.logic.net.URLDownload; | ||
import org.jabref.model.entry.BibEntry; | ||
import org.jabref.model.entry.field.StandardField; | ||
import org.jabref.model.entry.types.EntryType; | ||
import org.jabref.model.entry.types.StandardEntryType; | ||
|
||
import jakarta.ws.rs.core.MediaType; | ||
import org.apache.http.client.utils.URIBuilder; | ||
import org.apache.lucene.queryparser.flexible.core.nodes.QueryNode; | ||
import org.jooq.lambda.Unchecked; | ||
import org.slf4j.Logger; | ||
import org.slf4j.LoggerFactory; | ||
import org.w3c.dom.Document; | ||
import org.w3c.dom.Element; | ||
import org.w3c.dom.Node; | ||
import org.w3c.dom.NodeList; | ||
import org.xml.sax.SAXException; | ||
|
||
/** | ||
* Fetcher for <a href="https://isidore.science">ISIDORE</a>. | ||
* Will take in the link to the website or the last six digits that identify the reference | ||
* Uses <a href="https://isidore.science/api">ISIDORE's API</a>. | ||
*/ | ||
public class ISIDOREFetcher implements PagedSearchBasedParserFetcher { | ||
|
||
private static final Logger LOGGER = LoggerFactory.getLogger(ISIDOREFetcher.class); | ||
|
||
private static final String SOURCE_WEB_SEARCH = "https://api.isidore.science/resource/search"; | ||
|
||
private final DocumentBuilderFactory factory; | ||
|
||
public ISIDOREFetcher() { | ||
this.factory = DocumentBuilderFactory.newInstance(); | ||
} | ||
|
||
@Override | ||
public Parser getParser() { | ||
return xmlData -> { | ||
try { | ||
PushbackInputStream pushbackInputStream = new PushbackInputStream(xmlData); | ||
int data = pushbackInputStream.read(); | ||
if (data == -1) { | ||
return List.of(); | ||
} | ||
if (pushbackInputStream.available() < 5) { | ||
// Heuristic: if fewer than 5 bytes are available, we assume the response is an error message | ||
pushbackInputStream.unread(data); | ||
String error = new String(pushbackInputStream.readAllBytes(), StandardCharsets.UTF_8); | ||
throw new FetcherException(error); | ||
} | ||
|
||
pushbackInputStream.unread(data); | ||
DocumentBuilder builder = this.factory.newDocumentBuilder(); | ||
Document document = builder.parse(pushbackInputStream); | ||
|
||
// Assuming the root element represents an entry | ||
Element entryElement = document.getDocumentElement(); | ||
|
||
if (entryElement == null) { | ||
return Collections.emptyList(); | ||
} | ||
|
||
return parseXMl(entryElement); | ||
} catch (FetcherException e) { | ||
Unchecked.throwChecked(e); | ||
} catch (ParserConfigurationException | | ||
IOException | | ||
SAXException e) { | ||
Unchecked.throwChecked(new FetcherException("Issue with parsing link", e)); | ||
} | ||
return null; | ||
}; | ||
} | ||
|
||
@Override | ||
public URLDownload getUrlDownload(URL url) { | ||
URLDownload download = new URLDownload(url); | ||
download.addHeader("Accept", MediaType.APPLICATION_XML); | ||
return download; | ||
} | ||
|
||
@Override | ||
public URL getURLForQuery(QueryNode luceneQuery, int pageNumber) throws URISyntaxException, MalformedURLException, FetcherException { | ||
ISIDOREQueryTransformer queryTransformer = new ISIDOREQueryTransformer(); | ||
String transformedQuery = queryTransformer.transformLuceneQuery(luceneQuery).orElse(""); | ||
URIBuilder uriBuilder = new URIBuilder(SOURCE_WEB_SEARCH); | ||
uriBuilder.addParameter("q", transformedQuery); | ||
if (pageNumber > 1) { | ||
uriBuilder.addParameter("page", String.valueOf(pageNumber)); | ||
} | ||
uriBuilder.addParameter("replies", String.valueOf(getPageSize())); | ||
uriBuilder.addParameter("lang", "en"); | ||
uriBuilder.addParameter("output", "xml"); | ||
queryTransformer.getParameterMap().forEach((k, v) -> { | ||
uriBuilder.addParameter(k, v); | ||
}); | ||
|
||
URL url = uriBuilder.build().toURL(); | ||
LOGGER.debug("URL for query {}", url); | ||
return url; | ||
} | ||
|
||
private List<BibEntry> parseXMl(Element element) { | ||
var list = element.getElementsByTagName("isidore"); | ||
List<BibEntry> bibEntryList = new ArrayList<>(); | ||
|
||
for (int i = 0; i < list.getLength(); i++) { | ||
Element elem = (Element) list.item(i); | ||
var bibEntry = xmlItemToBibEntry(elem); | ||
bibEntryList.add(bibEntry); | ||
} | ||
return bibEntryList; | ||
} | ||
|
||
private BibEntry xmlItemToBibEntry(Element itemElement) { | ||
return new BibEntry(getType(itemElement.getElementsByTagName("types").item(0).getChildNodes())) | ||
.withField(StandardField.TITLE, itemElement.getElementsByTagName("title").item(0).getTextContent().replace("\"", "")) | ||
.withField(StandardField.AUTHOR, getAuthor(itemElement.getElementsByTagName("enrichedCreators").item(0))) | ||
.withField(StandardField.YEAR, itemElement.getElementsByTagName("date").item(0).getChildNodes().item(1).getTextContent().substring(0, 4)) | ||
.withField(StandardField.JOURNAL, getJournal(itemElement.getElementsByTagName("dc:source"))) | ||
.withField(StandardField.PUBLISHER, getPublishers(itemElement.getElementsByTagName("publishers").item(0))) | ||
.withField(StandardField.DOI, getDOI(itemElement.getElementsByTagName("ore").item(0).getChildNodes())); | ||
} | ||
|
||
private String getDOI(NodeList list) { | ||
for (int i = 0; i < list.getLength(); i++) { | ||
String content = list.item(i).getTextContent(); | ||
if (content.contains("DOI:")) { | ||
return content.replace("DOI: ", ""); | ||
} | ||
if (content.contains("doi:")) { | ||
return content.replace("info:doi:", ""); | ||
} | ||
} | ||
return ""; | ||
} | ||
|
||
/** | ||
* Gets the type of the document. ISIDORE only seems to have a select set of types, | ||
* and these differ from the types used by JabRef. | ||
*/ | ||
private EntryType getType(NodeList list) { | ||
for (int i = 0; i < list.getLength(); i++) { | ||
String type = list.item(i).getTextContent(); | ||
if (type.contains("article") || type.contains("Article")) { | ||
return StandardEntryType.Article; | ||
} | ||
if (type.contains("thesis") || type.contains("Thesis")) { | ||
return StandardEntryType.Thesis; | ||
} | ||
if (type.contains("book") || type.contains("Book")) { | ||
return StandardEntryType.Book; | ||
} | ||
} | ||
return StandardEntryType.Misc; | ||
} | ||
|
||
private String getAuthor(Node itemElement) { | ||
// Gets all the authors, separated with the word "and" | ||
// For some reason the author field sometimes has extra numbers and letters. | ||
StringJoiner stringJoiner = new StringJoiner(" and "); | ||
for (int i = 1; i < itemElement.getChildNodes().getLength(); i += 2) { | ||
String next = removeNumbers(itemElement.getChildNodes().item(i).getTextContent()).replaceAll("\\s+", " "); | ||
next = next.replace("\n", ""); | ||
if (next.isBlank()) { | ||
continue; | ||
} | ||
stringJoiner.add(next); | ||
} | ||
return stringJoiner.toString().trim().replaceAll("\\s+", " "); | ||
} | ||
|
||
/** | ||
* Removes the first digit and everything after it. ISIDORE author nodes sometimes carry a trailing number, e.g. "Patrick Bonnel 0766-05442 Patrick". | ||
*/ | ||
private String removeNumbers(String string) { | ||
for (int i = 0; i < string.length(); i++) { | ||
if (Character.isDigit(string.charAt(i))) { | ||
return string.substring(0, i); | ||
} | ||
} | ||
return string; | ||
} | ||
|
||
private String getPublishers(Node itemElement) { | ||
|
||
// In the XML file, the publishers node often lists multiple publishers, e.g. | ||
// <publisher origin="HAL CCSD">HAL CCSD</publisher> | ||
// <publisher origin="Elsevier">Elsevier</publisher> | ||
// Therefore this function simply gets all of them. | ||
if (itemElement == null) { | ||
return ""; | ||
} | ||
StringJoiner stringJoiner = new StringJoiner(", "); | ||
for (int i = 0; i < itemElement.getChildNodes().getLength(); i++) { | ||
if (itemElement.getChildNodes().item(i).getTextContent().isBlank()) { | ||
continue; | ||
} | ||
stringJoiner.add(itemElement.getChildNodes().item(i).getTextContent().trim()); | ||
} | ||
return stringJoiner.toString(); | ||
} | ||
|
||
private String getJournal(NodeList list) { | ||
if (list.getLength() == 0) { | ||
return ""; | ||
} | ||
String reference = list.item(list.getLength() - 1).getTextContent(); | ||
for (int i = 0; i < reference.length(); i++) { | ||
if (reference.charAt(i) == ',') { | ||
return reference.substring(0, i); | ||
} | ||
} | ||
return ""; | ||
} | ||
|
||
@Override | ||
public String getName() { | ||
return "ISIDORE"; | ||
} | ||
|
||
@Override | ||
public Optional<HelpFile> getHelpPage() { | ||
return Optional.of(HelpFile.FETCHER_ISIDORE); | ||
} | ||
} |
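The parser in `getParser()` reads one byte to detect an empty response and then pushes it back before handing the stream to the XML parser. A minimal, stand-alone sketch of that peek-and-unread pattern follows; the class `PeekDemo` and the method `peekThenReadAll` are hypothetical names used only for illustration and are not part of the PR. Note that `InputStream.available()` only returns an estimate of the bytes readable without blocking, so the "fewer than 5 bytes" check in the diff is a heuristic rather than a length check.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class PeekDemo {

    /**
     * Peeks at the first byte of the stream. Returns "" for an empty stream;
     * otherwise pushes the byte back and returns the complete content.
     */
    static String peekThenReadAll(InputStream raw) {
        try {
            // The default pushback buffer holds exactly one byte, which is all we need.
            PushbackInputStream in = new PushbackInputStream(raw);
            int first = in.read();
            if (first == -1) {
                return ""; // nothing to parse
            }
            in.unread(first); // restore the byte so downstream readers see the full content
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<isidore/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(peekThenReadAll(new ByteArrayInputStream(xml)));
        System.out.println(peekThenReadAll(new ByteArrayInputStream(new byte[0])).isEmpty());
    }
}
```

In the fetcher itself the restored stream is passed to `DocumentBuilder.parse` instead of being read into a string.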
Is this suggestion OK for you or do you see any issues?
Unfortunately, some of the author nodes are a bit odd. Sometimes they contain a string of numbers and a dash after the name and then repeat part of the name for no apparent reason, e.g. "Patrick Bonnel" becomes "Patrick Bonnel 0766-05442 Patrick". To solve this, I simply removed everything from the first number onward.
This is OK. You can even put that as a JavaDoc comment and as a test case.
What I meant: your lines 155 to 160 can be done with a one-line regex.
Using `string.replaceFirst("\\d.*", "")` is a concise and efficient way to achieve the same result. This regular expression will replace the first digit and everything that follows it with an empty string, effectively removing the numbers and everything after them.
In the regex used by this `replaceFirst`-based `removeNumbers`:
- `\\d` matches the first digit encountered.
- `.*` matches everything after the digit.
The `replaceFirst` method will then replace this matched portion with an empty string. If no match is found (i.e., if there are no digits), the original string remains unchanged. This is a clean and efficient way to achieve the desired behavior.
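The code from that suggestion did not survive on this page, so here is a minimal sketch of what a `replaceFirst`-based `removeNumbers` could look like, next to the loop from the diff for comparison. The class name `RemoveNumbersSketch` is hypothetical. One assumption worth flagging: the sketch adds the `(?s)` flag, because by default `.` does not match line terminators, and `getAuthor` calls `removeNumbers` on the raw text content before whitespace is normalized. Without the flag, text after a line break would be kept, unlike in the loop version.

```java
// Hypothetical class, for illustration only; not the code from the PR.
final class RemoveNumbersSketch {

    /** Loop version as it appears in the diff: cut at the first digit. */
    static String removeNumbersLoop(String string) {
        for (int i = 0; i < string.length(); i++) {
            if (Character.isDigit(string.charAt(i))) {
                return string.substring(0, i);
            }
        }
        return string;
    }

    /**
     * Regex version. "(?s)" lets '.' match line breaks as well, so everything
     * after the first digit is removed even in multi-line text.
     * Note: \d matches only ASCII digits here, whereas Character.isDigit
     * also accepts other Unicode digits.
     */
    static String removeNumbersRegex(String string) {
        return string.replaceFirst("(?s)\\d.*", "");
    }

    public static void main(String[] args) {
        String author = "Patrick Bonnel 0766-05442 Patrick";
        // Both variants keep the trailing space; getAuthor() trims the joined result later.
        System.out.println("[" + removeNumbersLoop(author) + "]");
        System.out.println("[" + removeNumbersRegex(author) + "]");
    }
}
```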
I have a second ChatGPT suggestion, but I don't know about performance gains. I tend to keep the above suggestion and optimize if there are performance issues.
Given your context, the method you've provided returns the portion of the string before the first number. Using a regular expression, we can accomplish the same task more concisely.
The `removeNumbers` method can be refactored to use the regular expression `^[^\\d]*`, which can be interpreted as:
- `^` asserts position at the start of a string.
- `[^\\d]*` matches zero or more non-digit characters.
The method works by matching as many non-digit characters as possible from the beginning of the string until it encounters a digit (or the end of the string). If a match is found, it returns that match; otherwise, it simply returns the original string.
The second option is good if `Pattern.compile(....)` is moved to a class constant.
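A minimal sketch of the second option with the pattern held in a class constant, so it is compiled once rather than on every call. The class name `LeadingNonDigits` and the constant name `LEADING_NON_DIGITS` are assumptions made for this illustration; the original suggested code is not shown on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, for illustration only; not the code from the PR.
final class LeadingNonDigits {

    // Compiled once per class load. "^[^\\d]*" matches the longest run of
    // non-digit characters at the start of the input (possibly empty).
    private static final Pattern LEADING_NON_DIGITS = Pattern.compile("^[^\\d]*");

    /** Returns the part of the string before the first digit. */
    static String removeNumbers(String string) {
        Matcher matcher = LEADING_NON_DIGITS.matcher(string);
        // The pattern can match the empty string, so find() succeeds for any input;
        // the fallback only mirrors the behaviour described in the comment above.
        return matcher.find() ? matcher.group() : string;
    }

    public static void main(String[] args) {
        System.out.println("[" + removeNumbers("Patrick Bonnel 0766-05442 Patrick") + "]");
        System.out.println("[" + removeNumbers("Jane Doe") + "]");
    }
}
```

Because a negated character class also matches line breaks, this variant cuts multi-line text at the first digit without needing an extra flag.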