Can't extract article title from nested XML file #158

ZhangWoW123 · 2024-10-24T03:16:15Z

Thank you for developing and maintaining the pubmed_parser package. It is a great help for many PubMed-related analyses.

Describe the bug
I encountered an issue when using the package to extract PubMed information from XML files. Sometimes the article title is missing from the output, even though it exists in the source XML file.

To Reproduce
An example of this issue is PMID 39029957. In the XML file, the <ArticleTitle> section is structured as follows:

<ArticleTitle>
  <b>
    <b>OKN-007 is an Effective Anticancer Therapeutic Agent Targeting Inflammatory and Immune Metabolism Pathways in Endometrial Cancer.
    </b>
  </b>
</ArticleTitle>

When using medline_parser.parse_article_info, it calls the utils.stringify_children function, which only extracts the current layer and first layer of children. Since the title is within the second layer, the parsed title is empty.

Here is the code being executed:

import pandas as pd
import pubmed_parser as pp

filename = 'pubmed24n1476.xml.gz'

parsed_articles = pp.parse_medline_xml(
    filename,
    year_info_only=True,
    nlm_category=True,
    author_list=True
)
df = pd.DataFrame.from_dict(parsed_articles)
df[df["pmid"] == 39029957]

The XML record for this PMID is in the file pubmed24n1476.xml.gz, which can be downloaded from https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/

Expected behavior
I expect the function to extract the correct title from the XML. A temporary solution is modifying the utils.stringify_children function by replacing return "".join(filter(None, parts)) with return ''.join(root.xpath('.//text()')).strip(). However, I am unsure if this will cause other issues.
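The difference between the two return statements can be illustrated with a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml, and the helper names children_text and descendants_text are hypothetical stand-ins, not the package's actual functions. The first helper approximates the "current layer plus first layer of children" behaviour described above; the second approximates what './/text()' collects.

```python
import xml.etree.ElementTree as ET


def children_text(node):
    """Join the node's own text with the text and tail of its direct children."""
    parts = [node.text]
    for child in node:
        parts.extend([child.text, child.tail])
    return "".join(filter(None, parts)).strip()


def descendants_text(node):
    """Join every text fragment below the node, at any depth."""
    return "".join(node.itertext()).strip()


# Title wrapped in two levels of <b>, as in the reported record (text shortened).
nested = ET.fromstring("<ArticleTitle><b><b>A nested title.</b></b></ArticleTitle>")
# Ordinary title with a single level of inline markup.
flat = ET.fromstring("<ArticleTitle>A <i>flat</i> title.</ArticleTitle>")

print(repr(children_text(nested)))     # ''
print(repr(descendants_text(nested)))  # 'A nested title.'
print(children_text(flat) == descendants_text(flat))  # True
```

For titles with at most one level of markup both helpers agree, which would explain why the problem only shows up occasionally.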

Screenshots
Here is the screenshot for the XML source file.

nils-herrmann · 2024-10-24T09:03:59Z

Thank you for the thorough documentation of the bug @ZhangWoW123 !

The current implementation and proposed solution for stringify_children can be found on Stack Overflow. Although both approaches give similar results, there are minor differences:

The current implementation only extracts text from the children of the node
The proposed solution extracts text from all descendants of the node

To avoid breaking other functions like parse_pubmed_caption() let's create a new function stringify_descendants().

Michael-E-Rose · 2024-10-24T09:25:19Z

To avoid breaking other functions like parse_pubmed_caption() let's create a new function stringify_descendants().

I wonder whether that's necessary. Can we not generalize parse_article_info() to include one more level? What would it break elsewhere?

nils-herrmann · 2024-10-24T09:38:42Z

@Michael-E-Rose We could use stringify_descendants() for the other fields in parse_article_info(). In parse_pubmed_caption() the new function would parse not only the fig_caption but the fig_list-items (which we don't want):

<caption>
  <title>Aerosol delivery of sACE2<sub>2</sub>.v2.4&#x02010;IgG1 alleviates
      lung injury and improves survival of SARS&#x02010;CoV&#x02010;2 gamma
      variant infected K18&#x02010;hACE2 transgenic mice</title>
  <p>
      <list list-type="simple" id="emmm202216109-list-0002">
          <list-item id="emmm202216109-li-0004">
              <label>A</label>
              <p>K18&#x02010;hACE2 transgenic mice were inoculated with
...
</caption>

ZhangWoW123 · 2024-10-24T13:19:28Z

Thank you for the thorough documentation of the bug @ZhangWoW123 !

The current implementation and proposed solution for stringify_children can be found on Stack Overflow. Although both approaches give similar results, there are minor differences:

The current implementation only extracts text from the children of the node

The proposed solution extracts text from all descendants of the node

To avoid breaking other functions like parse_pubmed_caption() let's create a new function stringify_descendants().

Thank you so much @nils-herrmann for the great help! Looking forward to the updated package :)

Michael-E-Rose · 2024-10-28T08:36:41Z

I still don't get it. Why can we not change stringify_children()? What would break?

Given the complexity of the current codebase, not adding a new function is strictly preferable.

nils-herrmann · 2024-10-28T12:43:55Z

As seen above, parse_pubmed_caption() would break because it would then parse not only the children's text (i.e. the <title> text) but also the text of deeper descendants (i.e. the <list-item> text).

Michael-E-Rose · 2024-10-28T13:10:04Z

I thought the problem was that parse_article_info() doesn't parse enough, not that it parses too much?

Let me ask the other way around:
@ZhangWoW123 suggested this:

A temporary solution is modifying the utils.stringify_children function by replacing return "".join(filter(None, parts)) with return ''.join(root.xpath('.//text()')).strip(). However, I am unsure if this will cause other issues.

What other issues may it cause, and can we prevent them by changing existing functions?

nils-herrmann · 2024-11-12T17:32:03Z

Original problem: parse_article_info() does not parse enough. The reason is that stringify_children() only gets the text of the children.
Proposed solution: Use ''.join(root.xpath('.//text()')).strip() in stringify_children() which gets the text of all descendants.
Problem of proposed solution: In parse_pubmed_caption() we are interested in getting the text of the children not the descendants, i.e. the proposed solution gets too much text.
We need two functions because we want two different things: Getting text of children or getting the text of the descendants.
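The "too much text" problem can be sketched as follows. The caption below is a hypothetical, heavily shortened version of the example earlier in the thread, and the two helpers are illustrative stand-ins built on xml.etree.ElementTree, not the package's code. The sketch also assumes that the caption parser stringifies each direct child of <caption> in turn, which may not match the real implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened caption; not the real record from the thread.
CAPTION = (
    "<caption>"
    "<title>Aerosol delivery of sACE2<sub>2</sub>.v2.4</title>"
    "<p><list><list-item><label>A</label>"
    "<p>Mice were inoculated.</p></list-item></list></p>"
    "</caption>"
)


def children_text(node):
    """Text of the node plus text and tail of its direct children only."""
    parts = [node.text]
    for child in node:
        parts.extend([child.text, child.tail])
    return "".join(filter(None, parts)).strip()


def descendants_text(node):
    """Text of the node and of everything nested below it."""
    return "".join(node.itertext()).strip()


caption = ET.fromstring(CAPTION)
# Assumption: each direct child of <caption> is stringified separately.
shallow = [children_text(child) for child in caption]
deep = [descendants_text(child) for child in caption]

print(shallow)  # ['Aerosol delivery of sACE22.v2.4', '']
print(deep)     # ['Aerosol delivery of sACE22.v2.4', 'AMice were inoculated.']
```

Under these assumptions the title is identical either way, while the descendant-based helper additionally pulls in the list-item label and text.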

Michael-E-Rose · 2024-11-12T17:48:36Z

When would I want only the children, but not the other descendants? How often does it happen actually that there are children and descendants?

If I understand the example from OP correctly, then the nested title is kind of an anomaly.

nils-herrmann · 2024-11-13T12:01:19Z

We want only children in parse_pubmed_caption() because we parse the caption title (children) separately from the caption list items (descendants). Besides that case we always want the descendants.
We can change the code to parse only the caption title and use stringify_descendants() for that function too.

Yes, the nested title is an anomaly.
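The alternative mentioned above, parsing only the caption title and reading all of its descendants, might look roughly like this. It is a sketch under assumptions: caption_title_text is a hypothetical helper, the input is a simplified caption, and the standard library's ElementTree stands in for lxml.

```python
import xml.etree.ElementTree as ET


def caption_title_text(caption):
    """Return all text inside the caption's <title>, ignoring sibling elements."""
    title = caption.find("title")  # direct child only
    if title is None:
        return ""
    return "".join(title.itertext()).strip()


# Hypothetical caption: nested markup in the title, list items in a sibling <p>.
caption = ET.fromstring(
    "<caption>"
    "<title><b><b>Nested caption title</b></b></title>"
    "<p><list><list-item><p>Panel text.</p></list-item></list></p>"
    "</caption>"
)

print(caption_title_text(caption))  # Nested caption title
```

Selecting the <title> element first keeps the list items out, so a single descendant-based helper could serve both the article title and the caption title.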

ZhangWoW123 added the bug label Oct 24, 2024

nils-herrmann added a commit to nils-herrmann/pubmed_parser that referenced this issue Oct 24, 2024

titipata#158 New function to stringify all descendatns of a node. Test of an article with a nested title.

2c02ff2

nils-herrmann added a commit to nils-herrmann/pubmed_parser that referenced this issue Oct 24, 2024

titipata#158 New function to stringify all descendatns of a node. Test of an article with a nested title.

30cb4b8
