Pubmed XML import: html tags in title produce javax.xml.bind.JAXBElement@ #6302

Closed
damers77 opened this issue Apr 16, 2020 · 5 comments · Fixed by #9963
Labels
bug Confirmed bugs or reports that are very likely to be bugs fetcher

Comments

@damers77

damers77 commented Apr 16, 2020

JabRef version 5.0 on Ubuntu 19.10

Steps to reproduce the behavior

  1. In Pubmed, search for a publication which has a superscript or italics in its title.
  2. For example, in Pubmed copy and paste the following text into Pubmed's search bar, then hit Search:

Predicting Locally Advanced Rectal Cancer Response to Neoadjuvant Therapy With 18 F-FDG PET and MRI Radiomics Features

  1. Pubmed should find the publication with this title.

  2. Notice, on Pubmed's results web page, how this publication has the number 18 as a superscript in the title.

  3. Copy the publication's PMID number. You can find it in the lower left corner. In this case it is:
    30637502

  4. Download the XML results file from Pubmed for this result.
    Depending on whether you are using the old Pubmed website, or the new one, do as follows:

  • If using the old Pubmed website, with the results displayed, click on:

    • Send to
    • File
    • XML
    • Create file
    • save the file
  • If using the new Pubmed website, there is no XML download, so use this other method instead, which downloads the XML result from Pubmed using the "wget" utility. The wget command line looks like this:

wget -O pubmed_result.xml 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=30637502&retmode=xml'

where the PMID number you found above is placed after the:
id=

in the url used by wget.
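
The same URL can also be assembled in a script. A minimal Python sketch, assuming only the three query parameters visible in the wget command above (db, id, retmode); the helper name efetch_url is made up for illustration, and nothing is downloaded here:

```python
from urllib.parse import urlencode

# Base address copied from the wget command above.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(pmid: str) -> str:
    """Build the efetch URL for one PMID (hypothetical helper)."""
    query = urlencode({"db": "pubmed", "id": pmid, "retmode": "xml"})
    return f"{EFETCH}?{query}"

print(efetch_url("30637502"))
```

Passing the printed URL to wget, or to any HTTP client, should give the same XML file as the command above.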

  1. Open Jabref
  2. Menu:
  • File
  • Import into new library
  • open the XML results file from Pubmed you saved above.
    Jabref imports the publication into a new library.
  1. In Jabref, double click on the publication's row.
    This opens the details panel.

  2. In the details panel, click on the left-most tab called:
    Required fields

  3. Copy and paste the Title field's value.
    You get:

Predicting locally advanced rectal cancer response to neoadjuvant therapy with , javax.xml.bind.JAXBElement@4e4ed55f, F-FDG PET and MRI radiomics features.

  1. Notice the problem in the title: the superscript number 18 has been replaced with the Java-related string:

javax.xml.bind.JAXBElement@4e4ed55f

  1. Open the XML results file in a text editor. See the XML file inside the attached zip file.

  2. Search in the XML file for the text:
    <ArticleTitle>

  3. You see that the contents of the <ArticleTitle> XML tag is:

Predicting locally advanced rectal cancer response to neoadjuvant therapy with <sup>18</sup>F-FDG PET and MRI radiomics features.

  1. Notice the:
    <sup>

    tag in the XML.

  2. This <sup> tag is valid, according to Pubmed's DTD file, used to define what is valid inside Pubmed's XML output. See the DTD file inside the attached zip file.

  3. Open the DTD file in a text editor.

  4. In the DTD file, near the beginning, see the line:

<!ENTITY % text "#PCDATA | b | i | sup | sub | u" >

This says that the following XML / HTML tags are allowed in text entities. The allowed tags are:

  • <b>
  • <i>
  • <sup>
  • <sub>
  • <u>

Elsewhere in the DTD it says that the article title is allowed to contain this text entity.

When JabRef encounters one of these tags inside the value of the title, it produces text like:

javax.xml.bind.JAXBElement@4e4ed55f

instead of producing the text that was inside the tag.
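
The symptom looks like what happens when mixed content (text, then a child element, then more text) is turned into a string item by item. The following is only an analogy in Python, not JabRef's actual code: it assumes the title is held as a list of text runs and element objects, and shows how printing the element object differs from collecting the text inside it.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<ArticleTitle>Predicting locally advanced rectal cancer response to "
    "neoadjuvant therapy with <sup>18</sup>F-FDG PET and MRI radiomics "
    "features.</ArticleTitle>"
)
title = ET.fromstring(SAMPLE)

# Mixed content is a sequence of text runs and child elements.
content = [title.text]
for child in title:
    content.append(child)       # the <sup> element object itself
    content.append(child.tail)  # the text that follows </sup>

# Naive: stringify every item. The element prints as an object
# reference, much like the JAXBElement@... string in the report.
naive = ", ".join(str(item) for item in content)

# Alternative: collect the text of every node, including nested ones.
flat = "".join(title.itertext())

print(naive)
print(flat)
```

The first line printed has an object reference where the 18 should be; the second keeps the 18 but drops the superscript formatting.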

Other fields besides the Title

I believe there is a similar problem with superscript and italics in the:

  • Abstract

field as well, when importing from Pubmed's XML.

Jabref's Log was empty

Jabref's error console was empty, after importing the above XML file from Pubmed.

Checked XML validity against its DTD

I checked the XML file against its DTD using the first three or four online DTD checkers that I found by googling for:

xml dtd validator online

All of the validators I tried replied that the XML is valid against its DTD.

Attachment

I attach a zip file:

Pubmed XML import superscript tag in title problem.zip

which contains two files:

  1. An XML results file for the above Pubmed search, downloaded from Pubmed.

  2. Pubmed's DTD file, used to define valid Pubmed XML output, downloaded from Pubmed here:
    https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd

    The above url of the DTD can be found inside the XML file, near the top of the file.

Thank you.

@Siedlerchr
Member

Siedlerchr commented Apr 17, 2020

Duplicate of #4273
I will try to look into this issue again. It's damn complex.

Found a bunch of xsd:
https://www.ncbi.nlm.nih.gov/data_specs/schema/

@damers77
Author

Thank you. I appreciate your effort.
One idea is, for now, to only handle tags for superscript and italics; this would cover the majority of cases I have seen, of actual titles and abstracts in Pubmed.
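
A rough sketch of that narrower approach, in Python rather than JabRef's Java: wrap only <sup> and <i>, and keep the plain text of every other tag. The LaTeX commands chosen here (\textsuperscript, \textit) and the function name to_latex are assumptions for illustration, not anything JabRef defines.

```python
import xml.etree.ElementTree as ET

# Only these two tags get markup; the LaTeX forms are an assumption.
WRAPPERS = {
    "sup": ("\\textsuperscript{", "}"),
    "i": ("\\textit{", "}"),
}

def to_latex(node: ET.Element) -> str:
    """Flatten mixed content, wrapping <sup>/<i> and keeping other text as is."""
    out = node.text or ""
    for child in node:
        prefix, suffix = WRAPPERS.get(child.tag, ("", ""))
        out += prefix + to_latex(child) + suffix  # recurse for nested tags
        out += child.tail or ""                   # text after the closing tag
    return out

title = ET.fromstring(
    "<ArticleTitle>with <sup>18</sup>F-FDG and <i>in <b>vivo</b></i> data</ArticleTitle>"
)
print(to_latex(title))
```

Because the function walks the tree instead of matching strings, nested tags are handled without extra cleanup.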

@Siedlerchr Siedlerchr added fetcher bug Confirmed bugs or reports that are very likely to be bugs labels Apr 24, 2020
@Siedlerchr
Member

Unfortunately still relevant

@Siedlerchr Siedlerchr reopened this Jun 22, 2021
@JabRef JabRef deleted a comment from github-actions bot Jun 22, 2021
@github-actions github-actions bot closed this as completed Jul 7, 2021
@Siedlerchr Siedlerchr reopened this Jul 7, 2021
@JabRef JabRef deleted a comment from github-actions bot Jul 7, 2021
@pmagwene

This is still an issue in JabRef 5.3 (installed via flatpak on Pop!_OS 21.04)

@ryan-carpenter

The JAXBElements can also appear in JabRef as blank spaces. In either case, this is a troublesome issue because the affected tags often contain essential information such as p-values, keywords, and so on (the very reason for use of bold, italic, and underline emphasis). The issue also affects MathML tags, which adds to the complexity.

Many reference managers and related tools either strip away all formatting or import titles and abstracts as plain text, which then needs cleaning to get rid of unwanted XML tags. Reading these fields as CDATA for the sake of the problem tags would introduce the problem of importing named characters (e.g., &lt; and &gt;).

Workaround

Find and replace unwanted XML tags with LaTeX before importing. JabRef displays the data with appropriate formatting and still converts XML character references as intended.

Regex match                   Replacement    Note
<(b)>(.*?)</\1>               \\textbf{$2}   Bold
<(i)>(.*?)</\1>               \\textit{$2}   Italic
<(sub)>(.*?)</\1>             _{$2}          Subscript
<(sup)>(.*?)</\1>             ^{$2}          Superscript
<(sub|sup|b|i|u)>(.*?)</\1>   $2             Replace all with plain text

Nested tags and consecutive elements may need some cleanup after replacement, but in the event of error or if you skip this step the raw text will be visible in JabRef, and this is the best many commercial reference managers do (while offering little opportunity for workarounds or improvement; thank you JabRef maintainers 🙏).
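
For anyone who prefers to script the find-and-replace, here is a minimal Python sketch of the same rules. It assumes the italic rule is meant to produce \textit, and it writes the second capture group as \2 because that is Python's replacement syntax where the table uses $2. The names RULES, tags_to_latex, and strip_tags are made up for illustration.

```python
import re

# (pattern, replacement) pairs following the table above.
RULES = [
    (r"<(b)>(.*?)</\1>", r"\\textbf{\2}"),
    (r"<(i)>(.*?)</\1>", r"\\textit{\2}"),
    (r"<(sub)>(.*?)</\1>", r"_{\2}"),
    (r"<(sup)>(.*?)</\1>", r"^{\2}"),
]

def tags_to_latex(text: str) -> str:
    """Apply each rule in turn; other tags are left untouched."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text

def strip_tags(text: str) -> str:
    """Catch-all from the last table row: keep only the enclosed text."""
    return re.sub(r"<(sub|sup|b|i|u)>(.*?)</\1>", r"\2", text)

print(tags_to_latex("therapy with <sup>18</sup>F-FDG PET"))
print(strip_tags("H<sub>2</sub>O"))
```

Run it over the text of the downloaded XML file and save the result before importing. Character references such as &lt; are not touched, so JabRef should still convert them on import.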

MathML elements are more complicated but not too difficult to deal with manually if necessary. Here is an example:

<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML">
    <mml:mrow>
        <mml:msubsup>
            <mml:mi>η</mml:mi>
            <mml:mi>p</mml:mi>
            <mml:mn>2</mml:mn>
        </mml:msubsup>
    </mml:mrow>
</mml:math>

If I am not mistaken, this converts to LaTeX as η_p^2. I would consider it an improvement if JabRef took no action on <mml> tags and imported them as strings or perhaps kept only the data. Something like <Unsupported MathML expression: "η p 2"> would at least make it possible to see what I might be missing.
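
To make the suggested fallback concrete, here is a small Python sketch that keeps only the data of a MathML fragment and wraps it in the placeholder text proposed above. It uses a well-formed copy of the example fragment; the function name mathml_placeholder is hypothetical, and this is not how JabRef handles MathML.

```python
import xml.etree.ElementTree as ET

MATHML = """<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML">
    <mml:mrow>
        <mml:msubsup>
            <mml:mi>η</mml:mi>
            <mml:mi>p</mml:mi>
            <mml:mn>2</mml:mn>
        </mml:msubsup>
    </mml:mrow>
</mml:math>"""

def mathml_placeholder(fragment: str) -> str:
    """Keep only the text tokens of a MathML fragment (hypothetical helper)."""
    root = ET.fromstring(fragment)
    # itertext() also yields the whitespace between tags, so drop empty runs.
    tokens = [t.strip() for t in root.itertext() if t.strip()]
    return '<Unsupported MathML expression: "%s">' % " ".join(tokens)

print(mathml_placeholder(MATHML))
```

The structure (which token is the subscript and which the superscript) is lost, but the reader can at least see that something was there.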

@koppor koppor moved this to Normal priority in Prioritization Nov 10, 2022
@github-project-automation github-project-automation bot moved this from Normal priority to Done in Prioritization Jun 1, 2023