Add MathML support when importing PubMed #4273

Closed
1 task done
vchouraki opened this issue Aug 17, 2018 · 11 comments · Fixed by #9963
Labels: bug, build-system, fetcher, import

Comments

@vchouraki

vchouraki commented Aug 17, 2018

Hello,

Tasks

JabRef version 4.3.1 on Ubuntu 18.04 with jre1.8.0_181 downloaded from the Oracle website

Steps to reproduce:

  • Import a PubMed reference, e.g. "29872490"
  • Compare the result with the original title and abstract

Example using the above search:

Original title :

Genome-wide significant risk factors on chromosome 19 and the APOE locus

Imported title :

Genome-wide significant risk factors on chromosome 19 and the , javax.xml.bind.JAXBElement@6d2ba6e1, locus.

Original abstract :

"The apolipoprotein E (APOE) gene on chromosome 19q13.32, was the first, and remains the strongest, genetic risk factor for Alzheimer’s disease (AD). Additional signals associated with AD have been located in chromosome 19, including ABCA7 (19p13.3) and CD33 (19q13.41). The ABCA7 gene has been replicated in most populations. However, the contribution to AD of other signals close to APOE gene remains controversial. Possible explanations for inconsistency between reports include long range linkage disequilibrium (LRLD). We analysed the contribution of ABCA7 and CD33 loci to AD risk and explore LRLD patterns across APOE region. To evaluate AD risk conferred by ABCA7 rs4147929:G>A and CD33 rs3865444:C>A, we used a large Spanish population (1796 AD cases, 2642 controls). The ABCA7 rs4147929:G>A SNP effect was nominally replicated in the Spanish cohort and reached genome-wide significance after meta-analysis (odds ratio (OR)=1.15, 95% confidence interval (95% CI)=1.12–1.19; P = 1.60 x 10-19). CD33 rs3865444:C>A was not associated with AD in the dataset. The meta-analysis was also negative (OR=0.98, 95% CI=0.93–1.04; P=0.48). After exploring LRLD patterns between APOE and CD33 in several datasets, we found significant LD (D’ >0.20; P <0.030) between APOE-Ɛ2 and CD33 rs3865444C>A in two of five datasets, suggesting the presence of a non-universal long range interaction between these loci affecting to some populations. In conclusion, we provide here evidence of genetic association of the ABCA7 locus in the Spanish population and also propose a plausible explanation for the controversy on the contribution of CD33 to AD susceptibility."

Imported abstract :

"The apolipoprotein E ( ) gene on chromosome 19q13.32, was the first, and remains the strongest, genetic risk factor for Alzheimer's disease (AD). Additional signals associated with AD have been located in chromosome 19, including (19p13.3) and 19q13.41). The gene has been replicated in most populations. However, the contribution to AD of other signals close to gene remains controversial. Possible explanations for inconsistency between reports include long range linkage disequilibrium (LRLD). We analysed the contribution of and loci to AD risk and explore LRLD patterns across region. To evaluate AD risk conferred by rs4147929:G>A and rs3865444:C>A, we used a large Spanish population (1796 AD cases, 2642 controls). The rs4147929:G>A SNP effect was nominally replicated in the Spanish cohort and reached genome-wide significance after meta-analysis (odds ratio (OR)=1.15, 95% confidence interval (95% CI)=1.12-1.19; = 1.60 x 10 ). rs3865444:C>A was not associated with AD in the dataset. The meta-analysis was also negative (OR=0.98, 95% CI=0.93-1.04; =0.48). After exploring LRLD patterns between and in several datasets, we found significant LD (D' >0.20; <0.030) between -Ɛ2 and rs3865444C>A in two of five datasets, suggesting the presence of a non-universal long range interaction between these loci affecting to some populations. In conclusion, we provide here evidence of genetic association of the locus in the Spanish population and also propose a plausible explanation for the controversy on the contribution of to AD susceptibility."

Best,

Vincent

@vchouraki changed the title from 'Pubmed import replace italic text with extra space or "javax.xml.bind.JAXBElement@xxxxxx"' to 'Pubmed import replaces italic text with extra space or "javax.xml.bind.JAXBElement@xxxxxx"' on Aug 17, 2018
@Siedlerchr added the bug label on Aug 17, 2018
@Siedlerchr
Member

Siedlerchr commented Aug 17, 2018

I could confirm this in the latest master. The problem seems to be that our parser is not aware of the italic annotations in the XML.
We simply need to update our XSD schema.

  1. Download the current DTD: https://www.nlm.nih.gov/bsd/licensee/data_elements_doc.html
  2. Convert it to XSD (Visual Studio can do that)
  3. Adapt the Gradle xjc task

I tried around a bit, but I can't get it working. I am not an XML expert, but somehow the math namespace can't be resolved.
The latest article links to a zip package including all kinds of related XML DTD files and definitions: http://dtd.nlm.nih.gov/ncbi/pubmed/out/180601/pubmed_180601.zip

@bernhard-kleine

7 months ago and of low priority! I understand that you are under considerable stress, but medline import for some of us is of utmost importance.

@Siedlerchr
Member

Siedlerchr commented Mar 25, 2019

There seems to be a new DTD, and one needs a bunch of additional XML files.
https://www.ncbi.nlm.nih.gov/pmc/pmcdoc/dtd/

mathml-in-pubmed.mod
I found this other toolbox here: https://github.com/biopython/biopython/tree/master/Bio/Entrez/DTDs

@bernhard-kleine Well, PubMed is generally working, i.e. importing entries is possible. Only for the subset of entries that use italics in the title or abstract is the import not 100% correct. That's probably why it was labeled as
Trust me, it's not that nobody wants to fix this issue, but it's really complicated, as it involves external libraries and tons of external XML schemas.

As a side note, the E-utilities seem to be able to export JSON as well, so this might be an alternative.
https://www.ncbi.nlm.nih.gov/pmc/tools/get-metadata/

@Siedlerchr
Member

Unfortunately, this is still a problem. JAXB needs to handle MathML somehow.

@vchouraki
Author

This is still a problem (checked today with v5.3 portable on Windows 10). Did not check if there was any trouble with the Firefox extension though.

I understand that the people maintaining JabRef have other priorities. I think this issue could be closed, but then it would be fair to deactivate the Medline / PubMed fetcher and explicitly mention that PubMed results can be imported into JabRef manually through an nbib file generated from the PubMed website ("Send to" > "Citation manager").

@vchouraki
Author

This is still a problem (checked today with v5.7 portable on Windows 10).

Closing this issue, considering it as "wontfix"

@vchouraki closed this as not planned on Oct 25, 2022
@Siedlerchr
Member

Sorry, I have recently been looking into this issue again. Unfortunately, I was not able to get the XML parsing to work for this case. However, I would still leave this open for the future.

@Siedlerchr reopened this on Oct 25, 2022
@koppor moved this to Normal priority in Prioritization on Nov 10, 2022
@aqurilla
Contributor

Hi @Siedlerchr, in this issue the StAX approach works for italics, underline and bold, and we can extract just the text by ignoring those tags.

For the MathML tags are we looking for a full MathML to LaTeX conversion? Or are we just interested in extracting the character elements within the <math>...</math> tags?

E.g. the following would convert to ηp2 if we just extract the characters:

<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML">
    <mml:mrow>
        <mml:msubsup>
            <mml:mi>η</mml:mi>
            <mml:mi>p</mml:mi>
            <mml:mn>2</mml:mn>
         </mml:msubsup>
    </mml:mrow>
</mml:math>
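
For illustration, the "just extract the characters" option could look roughly like the sketch below. It is not the JabRef implementation: the class name `PubMedTextSketch` and the method `flatten` are hypothetical, and the only assumption about the input is that it is a single well-formed XML element. It uses the JDK's StAX reader, keeps all text outside a `math` element, and drops whitespace-only chunks (the indentation) inside it.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of JabRef: flattens mixed content to plain text.
final class PubMedTextSketch {

    /** Returns the character data of the XML fragment with every tag ignored. */
    static String flatten(String xml) {
        StringBuilder text = new StringBuilder();
        int mathDepth = 0; // > 0 while the cursor is inside a <math> element
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    switch (reader.next()) {
                        case XMLStreamConstants.START_ELEMENT:
                            // The local name ignores the namespace prefix (mml:math -> math).
                            if (mathDepth > 0 || "math".equals(reader.getLocalName())) {
                                mathDepth++;
                            }
                            break;
                        case XMLStreamConstants.END_ELEMENT:
                            if (mathDepth > 0) {
                                mathDepth--;
                            }
                            break;
                        case XMLStreamConstants.CHARACTERS:
                            // Inside MathML, skip the whitespace used for indentation.
                            if (mathDepth == 0 || !reader.isWhiteSpace()) {
                                text.append(reader.getText());
                            }
                            break;
                        default:
                            break;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String title = "<ArticleTitle>risk factors on chromosome 19 and the "
                + "<i>APOE</i> locus</ArticleTitle>";
        System.out.println(flatten(title));
    }
}
```

Passing the MathML snippet above to `flatten` yields the plain characters with the subscript and superscript structure lost, which is the trade-off of this option compared with a full MathML to LaTeX conversion.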

@Siedlerchr
Member

That's great to hear! For MathML, I'm not sure. I found an XSLT transform script for MathML:
https://tex.stackexchange.com/a/85643
https://github.com/davidcarlisle/web-xslt/tree/main/pmml2tex

Maybe that helps or can be used somehow; otherwise I would really go with the plain character parsing as in your example.
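
To make the XSLT route concrete, here is a minimal sketch of applying a stylesheet to MathML with the JDK's built-in `javax.xml.transform` API. The stylesheet is a hypothetical toy written for this example that only handles `msubsup`; it is not the linked pmml2tex script, and whether the JDK's XSLT processor can run that script has not been checked here. The class name `MathMlXsltSketch` and the method `toLatex` are likewise made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of JabRef: MathML -> LaTeX via a toy stylesheet.
final class MathMlXsltSketch {

    // Toy XSLT 1.0 stylesheet (assumption for this sketch): wraps <math> in $...$
    // and turns <msubsup> into base_{sub}^{sup}; other elements fall through to
    // XSLT's built-in rules, which simply copy their text.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
            + " xmlns:mml=\"http://www.w3.org/1998/Math/MathML\">"
            + "<xsl:output method=\"text\"/>"
            + "<xsl:strip-space elements=\"*\"/>"
            + "<xsl:template match=\"mml:math\">"
            + "<xsl:text>$</xsl:text><xsl:apply-templates/><xsl:text>$</xsl:text>"
            + "</xsl:template>"
            + "<xsl:template match=\"mml:msubsup\">"
            + "<xsl:apply-templates select=\"*[1]\"/>"
            + "<xsl:text>_{</xsl:text><xsl:apply-templates select=\"*[2]\"/>"
            + "<xsl:text>}^{</xsl:text><xsl:apply-templates select=\"*[3]\"/>"
            + "<xsl:text>}</xsl:text>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    /** Applies the toy stylesheet to a MathML fragment and returns the text output. */
    static String toLatex(String mathMl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(mathMl)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String mathMl = "<mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\">"
                + "<mml:mrow><mml:msubsup><mml:mi>\u03B7</mml:mi><mml:mi>p</mml:mi>"
                + "<mml:mn>2</mml:mn></mml:msubsup></mml:mrow></mml:math>";
        System.out.println(toLatex(mathMl));
    }
}
```

Note that the toy stylesheet leaves characters such as η untouched; a real conversion would also need to map them to LaTeX commands, which is part of what makes the full conversion more involved than plain character extraction.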

@koppor changed the title from 'Pubmed import replaces italic text with extra space or "javax.xml.bind.JAXBElement@xxxxxx"' to 'Add MathML support when importing PubMed' on Mar 18, 2023
@koppor
Member

koppor commented Mar 18, 2023

@vchouraki The italics issue was solved. Please check the latest build at https://builds.jabref.org/main/.

I renamed this issue because we are discussing MathML now. I hope the MathML support won't be a rabbit hole (see #6155).

@aqurilla
Contributor

Please assign this issue to me
