Added Fetcher for ISIDORE #10518

Merged
merged 51 commits into from
Jan 16, 2024
fb2222e
ADD ISIDOREFetcher
Oct 17, 2023
7d65239
ADD ISIDOREFetcher to WebFetchers
Oct 17, 2023
0f87aed
Merge branch 'JabRef:main' into fix-for-issue-10423
u7492883 Oct 17, 2023
6d02f06
Merge branch 'JabRef:main' into fix-for-issue-10423
u7492883 Oct 17, 2023
25e4985
ADD ISIDOREFetcherTest.java
Oct 18, 2023
060b388
FIX ISIDOREFetcher.java
Oct 18, 2023
3037d9a
Merge branch 'fix-for-issue-10423' of https://github.com/u7492883/jab…
Oct 18, 2023
ea6988e
ADD ISIDORE privacy policy
Oct 18, 2023
403bf06
ADD ISIDORE fetcher to CHANGELOG.md
Oct 18, 2023
6d286fe
ADD issue number before link.
Oct 18, 2023
e50b370
REMOVE stacktrace message from ISIDOREFetcher.java
Oct 18, 2023
39496eb
FIX static analysis issue with ISIDOREFetcher.java.
Oct 18, 2023
664a933
REMOVED inverted booleans
Oct 18, 2023
bcd405c
FIX using replace instead of replaceALL
Oct 18, 2023
c08beb0
FIX string equals avoids null
Oct 19, 2023
38b9d33
Merge branch 'JabRef:main' into fix-for-issue-10423
u7492883 Oct 20, 2023
48def73
EDIT moved parser creation into constructor.
Oct 20, 2023
b246a0d
FIX added constant values and removed abstract (due to copyright risk…
Oct 20, 2023
86e4ba4
FIX made ISIDOREFetcherTest.java more readable.
Oct 20, 2023
ca69df7
EDIT use //s instead of looking for multiple spaces
Oct 21, 2023
81c2172
ADD comment about publisher format.
Oct 21, 2023
8f45b03
Merge branch 'main' into fix-for-issue-10423
koppor Oct 21, 2023
e15a3aa
EDIT made test cases more readable and intuitive based on comments fr…
Oct 22, 2023
0ca60b9
EDIT using StringJoiner and fixed exceptions based on comments.
Oct 22, 2023
4dacce8
Merge branch 'fix-for-issue-10423' of https://github.com/u7492883/jab…
Oct 22, 2023
065c4cf
Merge branch 'main' of https://github.com/jabref/jabref into fix-for-…
Oct 29, 2023
db4a9ad
ADD added message to CHANGELOG.md
Oct 29, 2023
620f719
FIX style compliace for CHANGELOG.md
Oct 29, 2023
0553443
FIX remove quotation marks from title
Oct 29, 2023
80e1088
EDIT moved documentBuilder out of Parser
Oct 29, 2023
4702d14
FIX moved Isidore Fetcher into unreleased. I missed something when me…
Oct 29, 2023
1e3f3a3
FIX test case after fixing quotation marks in title
Oct 29, 2023
ae49b38
Implement search based parser fetcher
Siedlerchr Nov 2, 2023
fdf40fd
add xml output
Siedlerchr Nov 2, 2023
b3e7d59
fck fetcher
Siedlerchr Nov 2, 2023
5c09c36
fck fetcher
Siedlerchr Nov 2, 2023
cd6d630
add accept header
Siedlerchr Nov 4, 2023
8e3eb37
Merge branch 'main' into fix-for-issue-10423
Siedlerchr Dec 26, 2023
50e4205
fix fetcher
Siedlerchr Dec 26, 2023
d00eeb1
Fix checkstyle
koppor Dec 26, 2023
e96cfef
Adapt test to include new fetcher
koppor Dec 26, 2023
b8c8f68
Fix checkstyle issues
koppor Dec 26, 2023
92cf060
Fix support for querying for authors
koppor Dec 26, 2023
87f17bc
remove duplicate test
koppor Dec 26, 2023
5b04053
Merge branch 'main' into fix-for-issue-10423
koppor Jan 8, 2024
d4d08e0
Add workaround of fielded terms.
koppor Jan 9, 2024
b71ef06
More modern "Word" class
koppor Jan 9, 2024
baa4038
openRewrite
koppor Jan 9, 2024
42bac2b
Merge branch 'main' into fix-for-issue-10423
koppor Jan 15, 2024
a0dfb2d
Merge branch 'main' into fix-for-issue-10423
koppor Jan 16, 2024
247ec3e
Fix checkstyle
koppor Jan 16, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -11,6 +11,7 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv

### Added

- We added a fetcher for [ISIDORE](https://isidore.science/), simply paste in the link into the text field or the last 6 digits in the link that identify that paper. [#10423](https://github.com/JabRef/jabref/issues/10423)
- When importing entries form the "Citation relations" tab, the field [cites](https://docs.jabref.org/advanced/entryeditor/entrylinks) is now filled according to the relationship between the entries. [#10572](https://github.com/JabRef/jabref/pull/10752)

### Changed
1 change: 1 addition & 0 deletions PRIVACY.md
@@ -72,6 +72,7 @@ These third-party services are the following:
| [The SAO/NASA Astrophysics Data System](https://ui.adsabs.harvard.edu/) | <https://ui.adsabs.harvard.edu/help/privacy/> |
| [Unpaywall](https://unpaywall.org/) | <https://unpaywall.org/legal/privacy> |
| [zbMATH Open](https://www.zbmath.org) | <https://zbmath.org/privacy-policy/> |
| [ISIDORE](https://isidore.science/) | <https://isidore.science/credit> |

[1]: *Note: The Mr. DLib service is used for the related articles tab in the entry editor and collects also your language, your browser and operating system (by default*disabled*).*

32 changes: 12 additions & 20 deletions src/main/java/org/jabref/logic/formatter/casechanger/Word.java
@@ -17,38 +17,30 @@ public final class Word {
* Set containing common lowercase function words
*/
public static final Set<String> SMALLER_WORDS;
public static final Set<Character> DASHES;
public static final Set<String> CONJUNCTIONS;

public static final Set<Character> DASHES = Set.of('-', '~', '⸗', '〰', '᐀', '֊', '־', '‐', '‑', '‒',
'–', '—', '―', '⁓', '⁻', '₋', '−', '⸺', '⸻',
'〜', '゠', '︱', '︲', '﹘', '﹣', '－');

// Conjunctions used as part of Title case capitalisation to specifically check if word is conjunction or not
public static final Set<String> CONJUNCTIONS = Set.of("and", "but", "for", "nor", "or", "so", "yet");

private final char[] chars;

private final boolean[] protectedChars;

static {
Set<String> smallerWords = new HashSet<>();
Set<Character> dashes = new HashSet<>();
Set<String> conjunctions = new HashSet<>();

// Conjunctions used as part of Title case capitalisation to specifically check if word is conjunction or not
conjunctions.addAll(Arrays.asList("and", "but", "for", "nor", "or", "so", "yet"));
// Articles
smallerWords.addAll(Arrays.asList("a", "an", "the"));

// Prepositions
smallerWords.addAll(Arrays.asList("above", "about", "across", "against", "along", "among", "around", "at", "before", "behind", "below", "beneath", "beside", "between", "beyond", "by", "down", "during", "except", "for", "from", "in", "inside", "into", "like", "near", "of", "off", "on", "onto", "since", "to", "toward", "through", "under", "until", "up", "upon", "with", "within", "without"));
// Conjunctions used as part of all case capitalisation to check if it is a small word or not
smallerWords.addAll(conjunctions);
// Dashes
dashes.addAll(Arrays.asList(
'-', '~', '⸗', '〰', '᐀', '֊', '־', '‐', '‑', '‒',
'–', '—', '―', '⁓', '⁻', '₋', '−', '⸺', '⸻',
'〜', '゠', '︱', '︲', '﹘', '﹣', '－'
));

// unmodifiable for thread safety
DASHES = dashes;

// unmodifiable for thread safety
CONJUNCTIONS = conjunctions;
// Conjunctions used as part of all case capitalisation to check if it is a small word or not
smallerWords.addAll(CONJUNCTIONS);

// unmodifiable for thread safety
SMALLER_WORDS = smallerWords.stream()
.map(word -> word.toLowerCase(Locale.ROOT))
.collect(Collectors.toUnmodifiableSet());
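
The hunk above swaps the `HashSet` construction in the static initializer for `Set.of(...)`. The following is a minimal sketch of two properties that come with that change, not code from the PR: the class and method names (`SetOfSketch`, `isUnmodifiable`, `rejectsDuplicates`) are made up for illustration. Sets returned by `Set.of` refuse modification, and `Set.of` throws `IllegalArgumentException` when the same element is listed twice, so every character in a literal such as `DASHES` has to be distinct.

```java
import java.util.Set;

final class SetOfSketch {
    // Same literal as the CONJUNCTIONS constant in the diff.
    static final Set<String> CONJUNCTIONS = Set.of("and", "but", "for", "nor", "or", "so", "yet");

    /** Returns true if the given set refuses modification, as Set.of results do. */
    static boolean isUnmodifiable(Set<String> set) {
        try {
            set.add("hypothetical");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    /** Returns true if Set.of rejects a literal that lists the same element twice. */
    static boolean rejectsDuplicates() {
        try {
            Set.of('-', '-');
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isUnmodifiable(CONJUNCTIONS));
        System.out.println(rejectsDuplicates());
    }
}
```

With the old `HashSet.addAll` approach a repeated element was silently ignored, whereas with `Set.of` it fails at class initialization.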
1 change: 1 addition & 0 deletions src/main/java/org/jabref/logic/help/HelpFile.java
@@ -36,6 +36,7 @@ public enum HelpFile {
FETCHER_IEEEXPLORE("collect/import-using-online-bibliographic-database#ieeexplore"),
FETCHER_INSPIRE("collect/import-using-online-bibliographic-database#inspire"),
FETCHER_ISBN("collect/add-entry-using-an-id"),
FETCHER_ISIDORE("collect/import-using-online-bibliographic-database#isidore"),
FETCHER_MEDLINE("collect/import-using-online-bibliographic-database#medline"),
FETCHER_OAI2_ARXIV("collect/import-using-online-bibliographic-database#arxiv"),
FETCHER_RFC("collect/add-entry-using-an-id"),
@@ -61,7 +61,7 @@ public class AuthorListParser {
/**
* Builds a new array of strings with stringbuilder. Regarding to the name affixes.
*
* @return New string with correct seperation
* @return New string with correct separation
*/
private static StringBuilder buildWithAffix(Collection<Integer> indexArray, List<String> nameList) {
StringBuilder stringBuilder = new StringBuilder();
2 changes: 2 additions & 0 deletions src/main/java/org/jabref/logic/importer/WebFetchers.java
@@ -28,6 +28,7 @@
import org.jabref.logic.importer.fetcher.GvkFetcher;
import org.jabref.logic.importer.fetcher.IEEE;
import org.jabref.logic.importer.fetcher.INSPIREFetcher;
import org.jabref.logic.importer.fetcher.ISIDOREFetcher;
import org.jabref.logic.importer.fetcher.IacrEprintFetcher;
import org.jabref.logic.importer.fetcher.IssnFetcher;
import org.jabref.logic.importer.fetcher.LOBIDFetcher;
@@ -105,6 +106,7 @@ public static Optional<IdFetcher<? extends Identifier>> getIdFetcherForField(Fie
public static SortedSet<SearchBasedFetcher> getSearchBasedFetchers(ImportFormatPreferences importFormatPreferences, ImporterPreferences importerPreferences) {
SortedSet<SearchBasedFetcher> set = new TreeSet<>(new CompositeSearchFirstComparator());
set.add(new ArXivFetcher(importFormatPreferences));
set.add(new ISIDOREFetcher());
set.add(new INSPIREFetcher(importFormatPreferences));
set.add(new GvkFetcher(importFormatPreferences));
set.add(new BvbFetcher());
248 changes: 248 additions & 0 deletions src/main/java/org/jabref/logic/importer/fetcher/ISIDOREFetcher.java
@@ -0,0 +1,248 @@
package org.jabref.logic.importer.fetcher;

import java.io.IOException;
import java.io.PushbackInputStream;
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.StringJoiner;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.jabref.logic.help.HelpFile;
import org.jabref.logic.importer.FetcherException;
import org.jabref.logic.importer.PagedSearchBasedParserFetcher;
import org.jabref.logic.importer.Parser;
import org.jabref.logic.importer.fetcher.transformers.ISIDOREQueryTransformer;
import org.jabref.logic.net.URLDownload;
import org.jabref.model.entry.BibEntry;
import org.jabref.model.entry.field.StandardField;
import org.jabref.model.entry.types.EntryType;
import org.jabref.model.entry.types.StandardEntryType;

import jakarta.ws.rs.core.MediaType;
import org.apache.http.client.utils.URIBuilder;
import org.apache.lucene.queryparser.flexible.core.nodes.QueryNode;
import org.jooq.lambda.Unchecked;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/**
* Fetcher for <a href="https://isidore.science">ISIDORE</a>.
* Will take in the link to the website or the last six digits that identify the reference
* Uses <a href="https://isidore.science/api">ISIDORE's API</a>.
*/
public class ISIDOREFetcher implements PagedSearchBasedParserFetcher {

private static final Logger LOGGER = LoggerFactory.getLogger(ISIDOREFetcher.class);

private static final String SOURCE_WEB_SEARCH = "https://api.isidore.science/resource/search";

private final DocumentBuilderFactory factory;

public ISIDOREFetcher() {
this.factory = DocumentBuilderFactory.newInstance();
}

@Override
public Parser getParser() {
return xmlData -> {
try {
PushbackInputStream pushbackInputStream = new PushbackInputStream(xmlData);
int data = pushbackInputStream.read();
if (data == -1) {
return List.of();
}
if (pushbackInputStream.available() < 5) {
// We guess, it's an error if less than 5
pushbackInputStream.unread(data);
String error = new String(pushbackInputStream.readAllBytes(), StandardCharsets.UTF_8);
throw new FetcherException(error);
}

pushbackInputStream.unread(data);
DocumentBuilder builder = this.factory.newDocumentBuilder();
Document document = builder.parse(pushbackInputStream);

// Assuming the root element represents an entry
Element entryElement = document.getDocumentElement();

if (entryElement == null) {
return Collections.emptyList();
}

return parseXMl(entryElement);
} catch (FetcherException e) {
Unchecked.throwChecked(e);
} catch (ParserConfigurationException |
IOException |
SAXException e) {
Unchecked.throwChecked(new FetcherException("Issue with parsing link", e));
}
return null;
};
}

@Override
public URLDownload getUrlDownload(URL url) {
URLDownload download = new URLDownload(url);
download.addHeader("Accept", MediaType.APPLICATION_XML);
return download;
}

@Override
public URL getURLForQuery(QueryNode luceneQuery, int pageNumber) throws URISyntaxException, MalformedURLException, FetcherException {
ISIDOREQueryTransformer queryTransformer = new ISIDOREQueryTransformer();
String transformedQuery = queryTransformer.transformLuceneQuery(luceneQuery).orElse("");
URIBuilder uriBuilder = new URIBuilder(SOURCE_WEB_SEARCH);
uriBuilder.addParameter("q", transformedQuery);
if (pageNumber > 1) {
uriBuilder.addParameter("page", String.valueOf(pageNumber));
}
uriBuilder.addParameter("replies", String.valueOf(getPageSize()));
uriBuilder.addParameter("lang", "en");
uriBuilder.addParameter("output", "xml");
queryTransformer.getParameterMap().forEach((k, v) -> {
uriBuilder.addParameter(k, v);
});

URL url = uriBuilder.build().toURL();
LOGGER.debug("URl for query {}", url);
return url;
}

private List<BibEntry> parseXMl(Element element) {
var list = element.getElementsByTagName("isidore");
List<BibEntry> bibEntryList = new ArrayList<>();

for (int i = 0; i < list.getLength(); i++) {
Element elem = (Element) list.item(i);
var bibEntry = xmlItemToBibEntry(elem);
bibEntryList.add(bibEntry);
}
return bibEntryList;
}

private BibEntry xmlItemToBibEntry(Element itemElement) {
return new BibEntry(getType(itemElement.getElementsByTagName("types").item(0).getChildNodes()))
.withField(StandardField.TITLE, itemElement.getElementsByTagName("title").item(0).getTextContent().replace("\"", ""))
.withField(StandardField.AUTHOR, getAuthor(itemElement.getElementsByTagName("enrichedCreators").item(0)))
.withField(StandardField.YEAR, itemElement.getElementsByTagName("date").item(0).getChildNodes().item(1).getTextContent().substring(0, 4))
.withField(StandardField.JOURNAL, getJournal(itemElement.getElementsByTagName("dc:source")))
.withField(StandardField.PUBLISHER, getPublishers(itemElement.getElementsByTagName("publishers").item(0)))
.withField(StandardField.DOI, getDOI(itemElement.getElementsByTagName("ore").item(0).getChildNodes()));
}

private String getDOI(NodeList list) {
for (int i = 0; i < list.getLength(); i++) {
String content = list.item(i).getTextContent();
if (content.contains("DOI:")) {
return content.replace("DOI: ", "");
}
if (list.item(i).getTextContent().contains("doi:")) {
return content.replace("info:doi:", "");
}
}
return "";
}

/**
* Get the type of the document, ISIDORE only seems to have select types, also their types are different to
* those used by JabRef.
*/
private EntryType getType(NodeList list) {
for (int i = 0; i < list.getLength(); i++) {
String type = list.item(i).getTextContent();
if (type.contains("article") || type.contains("Article")) {
return StandardEntryType.Article;
}
if (type.contains("thesis") || type.contains("Thesis")) {
return StandardEntryType.Thesis;
}
if (type.contains("book") || type.contains("Book")) {
return StandardEntryType.Book;
}
}
return StandardEntryType.Misc;
}

private String getAuthor(Node itemElement) {
// Gets all the authors, separated with the word "and"
// For some reason the author field sometimes has extra numbers and letters.
StringJoiner stringJoiner = new StringJoiner(" and ");
for (int i = 1; i < itemElement.getChildNodes().getLength(); i += 2) {
String next = removeNumbers(itemElement.getChildNodes().item(i).getTextContent()).replaceAll("\\s+", " ");
next = next.replace("\n", "");
if (next.isBlank()) {
continue;
}
stringJoiner.add(next);
}
return (stringJoiner.toString().substring(0, stringJoiner.length())).trim().replaceAll("\\s+", " ");
}

/**
* Remove numbers from a string and everything after the number, (helps with the author field).
*/
private String removeNumbers(String string) {
for (int i = 0; i < string.length(); i++) {
if (Character.isDigit(string.charAt(i))) {
return string.substring(0, i);
}
}
return string;
Comment on lines +200 to +205
Member

Suggested change
- for (int i = 0; i < string.length(); i++) {
- if (Character.isDigit(string.charAt(i))) {
- return string.substring(0, i);
- }
- }
- return string;
+ return string.replaceFirst("\\d.*", "");

Member

Is this suggestion OK for you or do you see any issues?

Contributor Author

Unfortunately some of the Author nodes are a bit weird. Sometimes they contain a string of numbers and a dash after the name and then repeat the name again for no apparent reason e.g. (Patrick Bonnel becomes Patrick Bonnel 0766-05442 Patrick). So to solve this I simply removed everything after the first number.

Member

This is OK. You can even put that as JavaDoc comment and as test case.

What I meant: Your lines 155 to 160 can be done with a one-line RegEx.
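
The behaviour explained in this thread (drop everything from the first digit onwards) could be written down as a JavaDoc comment plus a small check, roughly as follows. This is only a sketch: the class name `AuthorCleanupSketch` is hypothetical, the method body copies the loop from the diff, and the input is the example string quoted in the comment above.

```java
final class AuthorCleanupSketch {
    /**
     * Cuts the string at the first digit. ISIDORE author nodes sometimes carry an
     * identifier and a repeated name after the real name, for example
     * "Patrick Bonnel 0766-05442 Patrick", so everything from the first digit on is dropped.
     */
    static String removeNumbers(String string) {
        for (int i = 0; i < string.length(); i++) {
            if (Character.isDigit(string.charAt(i))) {
                return string.substring(0, i);
            }
        }
        return string;
    }

    public static void main(String[] args) {
        // The trailing space left before the cut is removed by trim().
        String cleaned = removeNumbers("Patrick Bonnel 0766-05442 Patrick").trim();
        System.out.println(cleaned); // prints "Patrick Bonnel"
    }
}
```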


Using string.replaceFirst("\\d.*", "") is a concise and efficient way to achieve the same result. This regular expression will replace the first digit and everything that follows it with an empty string, effectively removing the numbers and everything after them.

Here's the removeNumbers method using replaceFirst:

private String removeNumbers(String string) {
    return string.replaceFirst("\\d.*", "");
}

In the regex:

  • \\d matches the first digit encountered.
  • .* matches everything after the digit.

The replaceFirst method will then replace this matched portion with an empty string. If no match is found (i.e., if there are no digits), the original string remains unchanged. This is a clean and efficient way to achieve the desired behavior.
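
The loop and the regex agree on the author strings discussed here, but they are not identical in every case. The sketch below (hypothetical class and method names) contrasts them on two edge cases. `Character.isDigit` accepts any Unicode decimal digit, while `\d` in `java.util.regex` matches only ASCII `0-9` unless `UNICODE_CHARACTER_CLASS` is set. And `.` does not match line terminators unless `DOTALL` is set, so `replaceFirst` keeps any text after a line break.

```java
final class RemoveNumbersComparison {
    /** The loop from the diff: cut at the first Unicode digit. */
    static String withLoop(String string) {
        for (int i = 0; i < string.length(); i++) {
            if (Character.isDigit(string.charAt(i))) {
                return string.substring(0, i);
            }
        }
        return string;
    }

    /** The suggested one-liner: remove the first ASCII digit and the rest of its line. */
    static String withRegex(String string) {
        return string.replaceFirst("\\d.*", "");
    }

    public static void main(String[] args) {
        String plain = "Patrick Bonnel 0766-05442 Patrick";
        String arabicIndic = "Name \u0663 rest"; // U+0663 is a non-ASCII decimal digit
        String multiLine = "Name 1\nrest";

        System.out.println(withLoop(plain).equals(withRegex(plain)));             // true
        System.out.println(withLoop(arabicIndic).equals(withRegex(arabicIndic))); // false
        System.out.println(withLoop(multiLine).equals(withRegex(multiLine)));     // false
    }
}
```

Whether these differences matter depends on what ISIDORE actually returns. For single-line names with ASCII identifiers the two versions behave the same.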

Member

I have a second ChatGPT suggestion, but I don't know about performance gains. I tend to keep the above suggestion and optimize if there are performance issues

Given your context, the method you've provided returns the portion of the string before the first number. Using a regular expression, we can accomplish the same task more concisely.

Here's a refactored version of the removeNumbers method using regex:

import java.util.regex.*;

private String removeNumbers(String string) {
    Matcher m = Pattern.compile("^[^\\d]*").matcher(string);
    if (m.find()) {
        return m.group(0);
    }
    return string;
}

The regular expression ^[^\\d]* can be interpreted as:

  • ^ asserts position at the start of a string.
  • [^\\d]* matches zero or more non-digit characters.

The method works by matching as many non-digit characters as possible from the beginning of the string until it encounters a digit (or the end of the string). If a match is found, it returns that match; otherwise, it simply returns the original string.

Member

The second option is good, if Pattern.compile(....) is moved to a class constant.
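
A minimal sketch of that variant, with the pattern compiled once and held in a class constant. The class name `PrecompiledPatternSketch` and the constant name `LEADING_NON_DIGITS` are assumptions for illustration, not names from the PR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrecompiledPatternSketch {
    // Compiled once when the class is loaded instead of on every call.
    private static final Pattern LEADING_NON_DIGITS = Pattern.compile("^[^\\d]*");

    /** Returns the part of the string before the first ASCII digit. */
    static String removeNumbers(String string) {
        Matcher matcher = LEADING_NON_DIGITS.matcher(string);
        // The pattern also matches the empty string, so find() succeeds for any input;
        // the fallback branch only keeps the method total.
        return matcher.find() ? matcher.group() : string;
    }

    public static void main(String[] args) {
        System.out.println(removeNumbers("Patrick Bonnel 0766-05442 Patrick").trim());
    }
}
```

`Pattern` instances are immutable and safe to share between threads, while each call creates its own `Matcher`, so a static constant is a reasonable fit here.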

}

private String getPublishers(Node itemElement) {
u7492883 marked this conversation as resolved.
// In the XML file the publishers node often lists multiple publisher e.g.
// <publisher origin="HAL CCSD">HAL CCSD</publisher>
// <publisher origin="Elsevier">Elsevier</publisher>
// Therefore this function simply gets all of them.
if (itemElement == null) {
return "";
}
StringJoiner stringJoiner = new StringJoiner(", ");
for (int i = 0; i < itemElement.getChildNodes().getLength(); i++) {
if (itemElement.getChildNodes().item(i).getTextContent().isBlank()) {
continue;
}
stringJoiner.add(itemElement.getChildNodes().item(i).getTextContent().trim());
}
return stringJoiner.toString();
}

private String getJournal(NodeList list) {
if (list.getLength() == 0) {
return "";
}
String reference = list.item(list.getLength() - 1).getTextContent();
for (int i = 0; i < reference.length(); i++) {
if (reference.charAt(i) == ',') {
return reference.substring(0, i);
}
}
return "";
}

@Override
public String getName() {
return "ISIDORE";
}

@Override
public Optional<HelpFile> getHelpPage() {
return Optional.of(HelpFile.FETCHER_ISIDORE);
}
}
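
The parser in `getParser()` reads one byte to detect an empty response and then pushes it back before handing the stream on. The sketch below isolates that peek-and-unread step using only JDK classes. The class name `PeekSketch` and its methods are hypothetical, a byte array stands in for the HTTP response body, and the "fewer than 5 bytes available" error heuristic from the diff is deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class PeekSketch {
    /** Returns the full content as UTF-8 text, or an empty string if the stream has no bytes. */
    static String readOrEmpty(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in);
        int first = pushback.read();
        if (first == -1) {
            return ""; // nothing to parse
        }
        // Put the byte back so the consumer still sees the complete document.
        pushback.unread(first);
        return new String(pushback.readAllBytes(), StandardCharsets.UTF_8);
    }

    /** Convenience overload for in-memory data; wraps the checked exception. */
    static String readOrEmpty(byte[] data) {
        try {
            return readOrEmpty(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readOrEmpty("<isidore/>".getBytes(StandardCharsets.UTF_8)));
        System.out.println(readOrEmpty(new byte[0]).isEmpty());
    }
}
```

A `PushbackInputStream` created with the single-argument constructor has a one-byte pushback buffer, which is enough for this kind of single-byte peek.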