Pubmed XML import: html tags in title produce javax.xml.bind.JAXBElement@ #6302
Comments
Duplicate of #4273 Found a bunch of xsd:
Thank you. I appreciate your effort.
Unfortunately still relevant
This is still an issue in JabRef 5.3 (installed via flatpak on Pop!_OS 21.04)
The JAXBElements can also appear in JabRef as blank spaces. In either case, this is a troublesome issue because the affected tags often contain essential information such as p-values, keywords, and so on (the very reason for use of bold, italic, and underline emphasis). The issue also affects MathML tags, which adds to the complexity. Many reference managers and related tools either strip away all formatting or import titles and abstracts as plain text, which then needs cleaning to get rid of unwanted XML tags. Reading these fields as CDATA for the sake of the problem tags would introduce the problem of importing named characters (e.g., …).
Workaround
Find and replace unwanted XML tags with LaTeX before importing. JabRef displays the data with appropriate formatting and still converts XML character references as intended.
Nested tags and consecutive elements may need some cleanup after replacement, but in the event of error, or if you skip this step, the raw text will be visible in JabRef, and this is the best many commercial reference managers do (while offering little opportunity for workarounds or improvement; thank you JabRef maintainers 🙏). MathML elements are more complicated but not too difficult to deal with manually if necessary. Here is an example:
<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML">
<mml:mrow>
<mml:msubsup>
<mml:mi>η</mml:mi>
<mml:mi>p</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mrow>
</mml:math>
If I am not mistaken, this converts to LaTeX as $\eta_p^2$.
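The find-and-replace workaround described in this comment could be scripted roughly as follows. This is only a sketch under assumptions: the mapping from tags to LaTeX commands (`TAG_TO_LATEX`) and the function names are my own choices, not anything JabRef ships, and I have not verified which of these commands JabRef renders. The MathML helper only handles a single `msubsup` element like the one in the example.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from PubMed inline tags to LaTeX commands (illustrative only).
TAG_TO_LATEX = {
    "b": r"\textbf",
    "i": r"\textit",
    "u": r"\underline",
    "sup": r"\textsuperscript",
    "sub": r"\textsubscript",
}

# MathML namespace in the form ElementTree expects.
MML = "{http://www.w3.org/1998/Math/MathML}"

# Tiny, hypothetical symbol table; extend as needed.
GREEK = {"η": r"\eta"}


def tags_to_latex(xml_text: str) -> str:
    """Replace <b>, <i>, <u>, <sup>, <sub> pairs with LaTeX commands."""
    for tag, command in TAG_TO_LATEX.items():
        xml_text = xml_text.replace(f"<{tag}>", command + "{")
        xml_text = xml_text.replace(f"</{tag}>", "}")
    return xml_text


def msubsup_to_latex(mathml: str) -> str:
    """Convert the first msubsup (base, subscript, superscript) to LaTeX."""
    root = ET.fromstring(mathml)
    node = root.find(f".//{MML}msubsup")
    base, sub, sup = (child.text.strip() for child in node)
    base = GREEK.get(base, base)
    return f"{base}_{{{sub}}}^{{{sup}}}"
```

Running `tags_to_latex` over the downloaded XML file before importing would turn `<sup>18</sup>` into `\textsuperscript{18}`. Nested tags come out nested, but the result should still be checked by eye, as noted above.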
JabRef version 5.0 on Ubuntu 19.10
Steps to reproduce the behavior
Predicting Locally Advanced Rectal Cancer Response to Neoadjuvant Therapy With 18 F-FDG PET and MRI Radiomics Features
Pubmed should find the publication with this title.
Notice, on Pubmed's results web page, how this publication has the number 18 as a superscript in the title.
Copy the publication's PMID number. You can find it in the lower left corner. In this case it is:
30637502
Download the XML results file from Pubmed for this result.
Depending on whether you are using the old Pubmed website, or the new one, do as follows:
If using the old Pubmed website, with the results displayed, click on:
If using the new Pubmed website, there is no XML download, so use this other method instead, which downloads the XML result from Pubmed using the "wget" utility. The wget command line is like this:
wget -O pubmed_result.xml 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=30637502&retmode=xml'
where the PMID number you found above is placed after the:
id=
in the url used by wget.
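For anyone without wget, the same download can be sketched in Python using only the standard library. The URL and query parameters are the ones from the wget command above; the function names `efetch_url` and `download_pubmed_xml` are hypothetical helpers made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Same endpoint as in the wget command above.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmid: str) -> str:
    """Build the efetch URL for one PMID, requesting XML output."""
    query = urlencode({"db": "pubmed", "id": pmid, "retmode": "xml"})
    return f"{EFETCH_URL}?{query}"


def download_pubmed_xml(pmid: str, path: str = "pubmed_result.xml") -> None:
    """Download the XML record for a PMID and save it to a file."""
    with urlopen(efetch_url(pmid)) as response, open(path, "wb") as out:
        out.write(response.read())
```

Calling `download_pubmed_xml("30637502")` should write the same file that the wget command produces, assuming network access.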
Jabref imports the publication into a new library.
In Jabref, double click on the publication's row.
This opens the details panel.
In the details panel, click on the left-most tab called:
Required fields
Copy and paste the Title field's value.
You get:
Open the XML results file in a text editor. See the XML file inside the attached zip file.
Search in the XML file for the text:
<ArticleTitle>
You see that the contents of the <ArticleTitle> XML tag is:
Notice the:
<sup>
tag in the XML.
This <sup> tag is valid, according to Pubmed's DTD file, used to define what is valid inside Pubmed's XML output. See the DTD file inside the attached zip file.
Open the DTD file in a text editor.
In the DTD file, near the beginning, see the line:
This says that the following XML / HTML tags are allowed in text entities. The allowed tags are:
Elsewhere in the DTD it says that the article title is allowed to be a text.
When JabRef encounters these tags inside the value of the title, JabRef produces a text like javax.xml.bind.JAXBElement@ (as in the title of this issue),
instead of producing the text that was inside the tag.
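To illustrate what "the text that was inside the tag" means for a title with mixed content, here is a small Python sketch. It is not JabRef's code (JabRef is Java and, going by the output, uses JAXB), and the XML fragment is hand-written for this example on the assumption that the title is marked up with a single `<sup>` element. It only shows that reading the element's direct text loses everything from the first child tag onwards, while walking all text nodes recovers the full title.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment (an assumption), modelled on the title in this report.
FRAGMENT = (
    "<ArticleTitle>Predicting Response With <sup>18</sup>F-FDG PET"
    "</ArticleTitle>"
)


def flatten_text(element: ET.Element) -> str:
    """Join the text of an element and all of its descendants, in order."""
    return "".join(element.itertext())


title = ET.fromstring(FRAGMENT)

# Direct text only: stops at the first child element.
print(repr(title.text))

# All text nodes, including the content of <sup> and the text after it.
print(repr(flatten_text(title)))
```

A fix on the import side would presumably need to do the equivalent of `flatten_text` (or convert the tags to some markup) instead of converting each child element object to a string.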
Other fields besides the Title
I believe there is a similar problem with superscript and italics in the:
field as well, when importing from Pubmed's XML.
Jabref's Log was empty
Jabref's error console was empty, after importing the above XML file from Pubmed.
Checked XML validity against its DTD
I checked the XML file against its DTD, using the first three or four online DTD checkers that I found by googling for:
xml dtd validator online
All of the validators I tried replied that the XML is valid against its DTD.
Attachment
I attach a zip file:
Pubmed XML import superscript tag in title problem.zip
which contains two files:
An XML results file for the above Pubmed search, downloaded from Pubmed.
Pubmed's DTD file, used to define valid Pubmed XML output, downloaded from Pubmed here:
https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd
The above url of the DTD can be found inside the XML file, near the top of the file.
Thank you.