Can't extract article title from nested XML file #158

Open
ZhangWoW123 opened this issue Oct 24, 2024 · 10 comments

@ZhangWoW123
Contributor

Thank you for developing and maintaining the pubmed_parser package. It is a great help for many PubMed-related analyses.

Describe the bug
I encountered an issue when using the package to extract PubMed information from XML files. Sometimes the article title is missing from the output, even though it exists in the source XML file.

To Reproduce
An example of this issue is PMID 39029957. In the XML file, the <ArticleTitle> section is structured as follows:

<ArticleTitle>
  <b>
    <b>OKN-007 is an Effective Anticancer Therapeutic Agent Targeting Inflammatory and Immune Metabolism Pathways in Endometrial Cancer.
    </b>
  </b>
</ArticleTitle>

medline_parser.parse_article_info calls the utils.stringify_children function, which only extracts text from the current node and its direct children. Since the title text sits one level deeper, inside the second nested <b> element, the parsed title is empty.

Here is the code being executed:

import pandas as pd
import pubmed_parser as pp

filename = 'pubmed24n1476.xml.gz'

parsed_articles = pp.parse_medline_xml(
    filename,
    year_info_only=True,
    nlm_category=True,
    author_list=True
)
df = pd.DataFrame.from_dict(parsed_articles)
# Compare as strings so the filter matches whether pmid is stored as str or int
df[df["pmid"].astype(str) == "39029957"]

The XML record for this PMID is in the file pubmed24n1476.xml.gz, which can be downloaded from https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/

Expected behavior
I expect the function to extract the correct title from the XML. A temporary solution is modifying the utils.stringify_children function by replacing return "".join(filter(None, parts)) with return ''.join(root.xpath('.//text()')).strip(). However, I am unsure if this will cause other issues.
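The difference between the two return statements can be sketched as follows. This is only an illustration under assumptions, not the library's actual code: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, itertext() stands in for the './/text()' XPath, and the function names children_only_text and all_descendants_text are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import chain

# The nested title structure reported for PMID 39029957.
TITLE_XML = """<ArticleTitle>
  <b>
    <b>OKN-007 is an Effective Anticancer Therapeutic Agent Targeting Inflammatory and Immune Metabolism Pathways in Endometrial Cancer.
    </b>
  </b>
</ArticleTitle>"""


def children_only_text(node):
    """Hypothetical stand-in: join the node's own text with the text and
    tail of its direct children only (one level deep)."""
    parts = (
        [node.text]
        + list(chain(*([child.text, child.tail] for child in node)))
        + [node.tail]
    )
    return "".join(filter(None, parts))


def all_descendants_text(node):
    """Hypothetical stand-in: join the text of every descendant,
    similar in spirit to ''.join(root.xpath('.//text()')).strip()."""
    return "".join(node.itertext()).strip()


title = ET.fromstring(TITLE_XML)
shallow = children_only_text(title).strip()  # only whitespace is found at this depth
deep = all_descendants_text(title)           # reaches the inner <b> element

print(repr(shallow))
print(repr(deep))
```

With this input, the one-level version finds only whitespace, while the descendant version reaches the title text in the inner <b> element.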

Screenshots
Here is a screenshot of the XML source file.
[Image: Screenshot 2024-10-23 at 10 55 55 PM]

@nils-herrmann
Contributor

Thank you for the thorough documentation of the bug @ZhangWoW123 !

Both the current implementation of stringify_children and the proposed solution can be found on Stack Overflow. Although the two approaches give similar results, there are minor differences:

  • The current implementation only extracts text from the children of the node
  • The proposed solution extracts text from all descendants of the node

To avoid breaking other functions like parse_pubmed_caption(), let's create a new function stringify_descendants().

@Michael-E-Rose
Collaborator

To avoid breaking other functions like parse_pubmed_caption(), let's create a new function stringify_descendants().

I wonder whether that's necessary. Can we not generalize parse_article_info() to include one more level? What would it break elsewhere?

nils-herrmann added a commit to nils-herrmann/pubmed_parser that referenced this issue Oct 24, 2024
@nils-herrmann
Contributor

nils-herrmann commented Oct 24, 2024

@Michael-E-Rose We could use stringify_descendants() for the other fields in parse_article_info(). In parse_pubmed_caption(), however, the new function would parse not only the fig_caption but also the fig_list-items (which we don't want):

<caption>
  <title>Aerosol delivery of sACE2<sub>2</sub>.v2.4&#x02010;IgG1 alleviates
      lung injury and improves survival of SARS&#x02010;CoV&#x02010;2 gamma
      variant infected K18&#x02010;hACE2 transgenic mice</title>
  <p>
      <list list-type="simple" id="emmm202216109-list-0002">
          <list-item id="emmm202216109-li-0004">
              <label>A</label>
              <p>K18&#x02010;hACE2 transgenic mice were inoculated with
...
</caption>
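To make the concern concrete, here is a minimal sketch under assumptions: the caption below is a shortened, hypothetical version of the one above (the list-item text is a placeholder, since the original is truncated), and the standard library's xml.etree.ElementTree with itertext() stands in for lxml and the './/text()' XPath.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical caption; "Panel A description." is a placeholder.
CAPTION_XML = """<caption>
  <title>Aerosol delivery alleviates lung injury</title>
  <p>
    <list list-type="simple">
      <list-item>
        <label>A</label>
        <p>Panel A description.</p>
      </list-item>
    </list>
  </p>
</caption>"""

caption = ET.fromstring(CAPTION_XML)

# Descendant-wide extraction applied to the whole <caption> element:
# it collects the title text and also everything inside the list items.
all_text = "".join(caption.itertext())

print("Aerosol delivery alleviates lung injury" in all_text)
print("Panel A description." in all_text)
```

Both checks succeed, so the list-item text ends up mixed into what should be the caption text alone.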

@ZhangWoW123
Contributor Author

Thank you for the thorough documentation of the bug @ZhangWoW123 !

Both the current implementation of stringify_children and the proposed solution can be found on Stack Overflow. Although the two approaches give similar results, there are minor differences:

  • The current implementation only extracts text from the children of the node
  • The proposed solution extracts text from all descendants of the node

To avoid breaking other functions like parse_pubmed_caption(), let's create a new function stringify_descendants().

Thank you so much @nils-herrmann for the great help! Looking forward to the updated package :)

nils-herrmann added a commit to nils-herrmann/pubmed_parser that referenced this issue Oct 24, 2024
@Michael-E-Rose
Collaborator

I still don't get it. Why can we not change stringify_children()? What would break?

Given the complexity of the current codebase, not adding a new function is strictly preferable.

@nils-herrmann
Contributor

nils-herrmann commented Oct 28, 2024

As seen above, parse_pubmed_caption() would break with the proposed change, because it would then parse not only the text of the children (i.e. the <title> text) but also the text of deeper descendants (i.e. the <list-item> text).

@Michael-E-Rose
Collaborator

I thought the problem was that parse_article_info() doesn't parse enough, not that it parses too much?

Let me ask the other way around:
@ZhangWoW123 suggested this:

A temporary solution is modifying the utils.stringify_children function by replacing return "".join(filter(None, parts)) with return ''.join(root.xpath('.//text()')).strip(). However, I am unsure if this will cause other issues.

What other issues may it cause, and can we prevent them by changing existing functions?

@nils-herrmann
Contributor

nils-herrmann commented Nov 12, 2024

  • Original problem: parse_article_info() does not parse enough. The reason is that stringify_children() only gets the text of the children.

  • Proposed solution: Use ''.join(root.xpath('.//text()')).strip() in stringify_children(), which gets the text of all descendants.

  • Problem with the proposed solution: In parse_pubmed_caption() we are interested in the text of the children, not of all descendants, i.e. the proposed solution gets too much text.

  • We need two functions because we want two different things: the text of the children only, or the text of all descendants.

@Michael-E-Rose
Collaborator

When would I want only the children, but not the other descendants? How often does it actually happen that there are both children and deeper descendants?

If I understand the example from OP correctly, then the nested title is kind of an anomaly.

@nils-herrmann
Contributor

nils-herrmann commented Nov 13, 2024

We want only the children in parse_pubmed_caption() because we parse the caption title (children) separately from the caption list items (descendants). Apart from that case, we always want the descendants.
We can change the code so that it selects only the caption title, and then use stringify_descendants() in that function too.
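The idea of selecting the caption title first and then taking all of its descendant text could look roughly like this. It is a sketch under assumptions, not the actual pubmed_parser code: it uses the standard library's xml.etree.ElementTree rather than lxml, the body of stringify_descendants() is a guess based on this thread, caption_title() is a hypothetical helper, and the caption XML is a shortened placeholder.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical caption with inline markup inside the title.
CAPTION_XML = """<caption>
  <title>Aerosol delivery of sACE2<sub>2</sub>.v2.4-IgG1 alleviates lung injury</title>
  <p>
    <list list-type="simple">
      <list-item><label>A</label><p>Panel A description.</p></list-item>
    </list>
  </p>
</caption>"""


def stringify_descendants(node):
    """Assumed behaviour: concatenate the text of the node and all of its
    descendants, then strip surrounding whitespace."""
    return "".join(node.itertext()).strip()


def caption_title(caption):
    """Hypothetical helper: restrict extraction to the <title> child so
    that list items elsewhere in the caption are never visited."""
    title = caption.find("title")
    return stringify_descendants(title) if title is not None else ""


caption = ET.fromstring(CAPTION_XML)
title_text = caption_title(caption)
print(title_text)
```

Narrowing the node before extracting text keeps a single descendant-based function usable everywhere: nested inline markup such as <sub> inside the title is still captured, while the list items are left out.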

Yes, the nested title is an anomaly.
