-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support streaming for pubmed #3740
Conversation
@albertvillanova just FYI, since you were so helpful with the previous pubmed issue :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @abhi-mosaic.
Below some comments/suggestions.
@@ -39,7 +39,7 @@ | |||
# The HuggingFace dataset library don't host the datasets but only point to the original files | |||
# This can be an arbitrary nested dict/list of URLs (see below in `_split_generators` method) | |||
# Note these URLs here are used by MockDownloadManager.create_dummy_data_list | |||
_URLs = [f"ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed22n{i:04d}.xml.gz" for i in range(1, 1115)] | |||
_URLs = [f"https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed22n{i:04d}.xml.gz" for i in range(1, 1115)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why you changed the protocol. PubMed download instructions recommend to use their FTP servers.
Moreover, the change of protocol may have a performance impact.
datasets/pubmed/pubmed.py
Outdated
try: | ||
tree = etree.parse(filename) | ||
root = tree.getroot() | ||
xmldict = self.xml_to_dictionnary(root) | ||
except etree.ParseError: | ||
logger.warning(f"Ignoring file {filename}, it is malformed") | ||
continue | ||
|
||
for article in xmldict["PubmedArticleSet"]["PubmedArticle"]: | ||
self.update_citation(article) | ||
new_article = default_article() | ||
|
||
with open(filename, "rt", encoding="utf-8") as f: | ||
xml_str = f.read() | ||
try: | ||
deepupdate(new_article, article) | ||
except Exception: | ||
logger.warning(f"Ignoring article {article}, it is malformed") | ||
root = etree.fromstring(xml_str) | ||
xmldict = self.xml_to_dictionnary(root) | ||
except etree.ParseError: | ||
logger.warning(f"Ignoring file {filename}, it is malformed") | ||
continue | ||
|
||
try: | ||
_ = self.info.features.encode_example(new_article) | ||
except Exception as e: | ||
logger.warning(f"Ignore example because {e}") | ||
continue | ||
yield id_, new_article | ||
id_ += 1 | ||
for article in xmldict["PubmedArticleSet"]["PubmedArticle"]: | ||
self.update_citation(article) | ||
new_article = default_article() | ||
|
||
try: | ||
deepupdate(new_article, article) | ||
except Exception: | ||
logger.warning(f"Ignoring article {article}, it is malformed") | ||
continue | ||
|
||
try: | ||
_ = self.info.features.encode_example(new_article) | ||
except Exception as e: | ||
logger.warning(f"Ignore example because {e}") | ||
continue | ||
yield id_, new_article | ||
id_ += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is necessary, once that xml.etree.ElementTree.parse
already supports streaming. See:
The only thing needs being changed is the import: as ET
instead of as etree
:
datasets/datasets/pubmed/pubmed.py
Line 19 in 71aa41c
import xml.etree.ElementTree as etree |
@@ -172,7 +172,7 @@ def create_dummy_data_list(self, path_to_dummy_data, data_url): | |||
# trick: if there are many shards named like `data.txt-000001-of-00300`, only use the first one | |||
is_tf_records = all(bool(re.findall("[0-9]{3,}-of-[0-9]{3,}", url)) for url in data_url) | |||
is_pubmed_records = all( | |||
url.startswith("ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed") for url in data_url | |||
url.startswith("https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed") for url in data_url |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above for the change of protocol.
IIRC streaming from FTP is not fully tested yet, so I'm fine with switching to HTTPS for now, as long as the download speed/availability is great |
@albertvillanova Thanks for pointing me to the Unfortunately I tried keeping the Error when using
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for your work and your profiling!
I see FTP doesn't allow streaming, as it timeouts while processing the XML (I guess).
If the performance is not negatively impacted, let's use HTTPS instead.
This PR makes some minor changes to the
pubmed
dataset to allow forstreaming=True
. Fixes #3739.Basically, I followed the C4 dataset which works in streaming mode as an example, and made the following changes:
ftp://
tohttps://
open
the filename and pass the XML contents toetree.fromstring(xml_str)
The Github diff tool makes it look like the changes are larger than they are, sorry about that.
I tested locally and the
pubmed
dataset now works in both normal and streaming modes. There is some overhead at the start of each shard in streaming mode as building the XML tree online is quite slow (each pubmed .xml.gz file is ~20MB), but the overhead gets amortized over all the samples in the shard. On my laptop with a single CPU worker I am able to stream at about ~600 samples/s.