-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issue #12000: Parsing arXiv Id when importing a PDF with arXiv Id #12079
base: main
Are you sure you want to change the base?
Conversation
2. Add 'getArxivId' method to 'PdfContentImporter' class 3. Add logic to handle arxivId in 'getEntryFromPDFContent' method 4. Add 'extractArxivIdFromPage1' test method to 'PdfContentImporterTest' class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arxiv is not a Standard field
@@ -25,6 +25,7 @@ public enum StandardField implements Field { | |||
ARCHIVEPREFIX("archiveprefix"), | |||
ASSIGNEE("assignee", FieldProperty.PERSON_NAMES), | |||
AUTHOR("author", FieldProperty.PERSON_NAMES), | |||
ARXIVID("arXivId", FieldProperty.VERBATIM), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arxiv is no StandardField, you need to use eprint for this
see chapter 3.14.7 Electronic Publishing Information of the biblatex spec
http://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf
.withField(StandardField.AUTHOR, "Review Article") | ||
.withField(StandardField.TITLE, "British Journal of Nutrition (2008), 99, 1–11 doi: 10.1017/S0007114507795296 arXiv:2408.06224v1 q The Authors") | ||
.withField(StandardField.YEAR, "2024") | ||
.withField(StandardField.ARXIVID, "2408.06224v1"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be eprint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your feedback! I see, I will change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You then should also add the correpsonding eprinttype field to arxiv
… into 12000-fetch-arxiv-info-from-pdf
CHANGELOG.md
Outdated
@@ -11,6 +11,7 @@ Note that this project **does not** adhere to [Semantic Versioning](https://semv | |||
|
|||
### Added | |||
|
|||
- We added functionality to handle arXiv ID in `PdfContentImporter` and implemented related test case. [#12000](https://github.com/JabRef/jabref/issues/12000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments are user facing. A user does not see PdfContentImporter
in the UI. We are updating the docs at JabRef/user-documentation#537 - and the updated section should be linked.
private String getArxivId(String arxivId) { | ||
int pos; | ||
if (arxivId == null) { | ||
pos = curString.indexOf("arxiv"); | ||
if (pos < 0) { | ||
pos = curString.indexOf("arXiv"); | ||
} | ||
if (pos >= 0) { | ||
String arxivText = curString.substring(pos); | ||
return ArXivIdentifier.parse(arxivText).map(ArXivIdentifier::asString).orElse(null); | ||
} | ||
} | ||
return arxivId; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No! We have the class org.jabref.model.entry.identifier.ArXivIdentifier use this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still unfixed.
Pattern oldIdentifierPattern = Pattern.compile("(" + ARXIV_PREFIX + ")?\\s?:?\\s?(?<id>(?<classification>[a-z\\-]+(\\.[A-Z]{2})?)/\\d{7})(v(?<version>\\d+))?"); | ||
Matcher oldIdentifierMatcher = oldIdentifierPattern.matcher(identifier); | ||
if (oldIdentifierMatcher.matches()) { | ||
return getArXivIdentifier(oldIdentifierMatcher); | ||
} | ||
Pattern oldIdentifierPattern = Pattern.compile("(" + ARXIV_PREFIX + ")?\\s?:?\\s?(?<id>(?<classification>[a-z\\-]+(\\.[A-Z]{2})?)/\\d{7})(v(?<version>\\d+))?"); | ||
Matcher oldIdentifierMatcher = oldIdentifierPattern.matcher(identifier); | ||
if (oldIdentifierMatcher.matches()) { | ||
return getArXivIdentifier(oldIdentifierMatcher); | ||
} | ||
|
||
return Optional.empty(); | ||
return Optional.empty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong indent. Please revert.
@@ -123,4 +123,38 @@ British Journal of Nutrition (2008), 99, 1–11 doi: 10.1017/S0007114507795296 | |||
|
|||
assertEquals(Optional.of(entry), importer.getEntryFromPDFContent(firstPageContent, "\n")); | |||
} | |||
|
|||
@Test | |||
void extractArxivIdFromPage1() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also use a real arXiv PDF. I think, there is a link to one in the issue?
if (arxivId != null) { | ||
year = "20" + arxivId.substring(0, 2); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should go to the else
branch. - If there is no year then one can check for arxivId
and guess a year out of it.
private String getArxivId(String arxivId) { | ||
int pos; | ||
if (arxivId == null) { | ||
pos = curString.indexOf("arxiv"); | ||
if (pos < 0) { | ||
pos = curString.indexOf("arXiv"); | ||
} | ||
if (pos >= 0) { | ||
String arxivText = curString.substring(pos); | ||
return ArXivIdentifier.parse(arxivText).map(ArXivIdentifier::asString).orElse(null); | ||
} | ||
} | ||
return arxivId; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still unfixed.
Describe the changes you have made here: what, why, ...
Link the issue that will be closed, e.g., "Closes #333". If your PR closes a koppor issue, link it using its URL, e.g., "Closes koppor#47".
Add 'ARXIVID' field to 'StandardField' enum class
Add 'getArxivId' method to 'PdfContentImporter' class
Add logic to handle arxivId in 'getEntryFromPDFContent' method
Add 'extractArxivIdFromPage1' test method to 'PdfContentImporterTest' class
Closes #12000
Closes #12000
Mandatory checks
CHANGELOG.md
described in a way that is understandable for the average user (if applicable)