Can't extract article title from nested XML file #158
Comments
Thank you for the thorough documentation of the bug @ZhangWoW123 ! The current implementation and proposed solution for
To avoid breaking other functions like |
I wonder whether that's necessary. Can we not generalize |
…t of an article with a nested title.
@Michael-E-Rose We could use <caption>
<title>Aerosol delivery of sACE2<sub>2</sub>.v2.4‐IgG1 alleviates
lung injury and improves survival of SARS‐CoV‐2 gamma
variant infected K18‐hACE2 transgenic mice</title>
<p>
<list list-type="simple" id="emmm202216109-list-0002">
<list-item id="emmm202216109-li-0004">
<label>A</label>
<p>K18‐hACE2 transgenic mice were inoculated with
...
</caption> |
Thank you so much @nils-herrmann for the great help! Looking forward to the updated package:) |
…t of an article with a nested title.
I still don't get it. Why can we not change Given the complexity of the current codebase, no new function is strictly preferable. |
As seen above, |
I thought the problem is, that Let me ask the other way around:
What other issues may it cause, and can we prevent them by changing existing functions? |
|
When would I want only the children, but not the other descendants? How often does it happen actually that there are children and descendants? If I understand the example from OP correctly, then the nested title is kind of an anomaly. |
We want only children in Yes, the nested title is an anomaly. |
Thank you for developing and maintaining the pubmed_parser package. It is a great help for my PubMed-related analysis.
Describe the bug
I encountered an issue when using the package to extract PubMed information from XML files. Sometimes the article title is missing from the output, even though it exists in the source XML file.
To Reproduce
An example of this issue is PMID 39029957. In the XML file, the <ArticleTitle> section is structured as follows:
medline_parser.parse_article_info calls the utils.stringify_children function, which only extracts the current layer and the first layer of children. Since the title is within the second layer, the parsed title is empty.
Here is the code being executed:
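The snippet below is not the code from the report. It is a minimal sketch of the behavior described above, assuming a helper that reads only the node's own text plus the text and tail of its direct children. The name children_only_text and both sample titles are made up for illustration, and the standard-library xml.etree.ElementTree stands in for the parser the package actually uses.

```python
import xml.etree.ElementTree as ET
from itertools import chain


def children_only_text(node):
    """Hypothetical helper: join the node's own text with the text and tail
    of its direct children only, ignoring deeper descendants."""
    parts = [node.text] + list(
        chain.from_iterable((child.text, child.tail) for child in node)
    )
    return "".join(filter(None, parts))


# Made-up title whose text sits two levels below <ArticleTitle>.
nested = ET.fromstring("<ArticleTitle><b><i>Nested</i> title</b></ArticleTitle>")
# Made-up title with markup only one level deep.
flat = ET.fromstring("<ArticleTitle>Flat <i>title</i> text</ArticleTitle>")

print(repr(children_only_text(nested)))  # ''
print(repr(children_only_text(flat)))    # 'Flat title text'
```

In the nested case, all of the text belongs to the grandchild element, so a helper that stops at the first layer of children finds nothing to join.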
The XML file for this PMID is pubmed24n1476.xml.gz, which can be downloaded from https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/
Expected behavior
I expect the function to extract the correct title from the XML. A temporary solution is to modify the utils.stringify_children function by replacing
return "".join(filter(None, parts))
with
return ''.join(root.xpath('.//text()')).strip()
However, I am unsure whether this will cause other issues.
Screenshots
Here is the screenshot for the XML source file.
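To make the temporary solution under "Expected behavior" concrete, here is a rough sketch of what collecting all descendant text does. It assumes that the standard library's Element.itertext() is an acceptable stand-in for the './/text()' XPath query. The name all_descendant_text and both XML samples are made up for illustration and are not taken from pubmed_parser or from the affected record.

```python
import xml.etree.ElementTree as ET


def all_descendant_text(node):
    """Hypothetical helper: join the text of the node and of every
    descendant, in document order, then trim surrounding whitespace."""
    return "".join(node.itertext()).strip()


# Made-up title whose text sits two levels below <ArticleTitle>.
title = ET.fromstring("<ArticleTitle><b><i>Nested</i> title</b></ArticleTitle>")
print(all_descendant_text(title))  # Nested title

# Made-up caption with two block-level children.
caption = ET.fromstring(
    "<caption><title>Fig. 1</title><p>Panel A shows the control.</p></caption>"
)
print(all_descendant_text(caption))  # Fig. 1Panel A shows the control.
```

The second call hints at one possible side effect: when the helper is applied to an element with several block-level children, their text is concatenated without a separator. Whether that matters depends on where the function is called, which this sketch does not attempt to answer.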