
Support streaming for pubmed #3740

Merged: 3 commits, merged on Feb 18, 2022
Conversation

@abhi-mosaic (Contributor) commented on Feb 17, 2022:

This PR makes some minor changes to the pubmed dataset to allow for streaming=True. Fixes #3739.

Basically, I followed the C4 dataset (which works in streaming mode) as an example and made the following changes:

  • Change the URL prefix from ftp:// to https://
  • Explicitly open the file and pass the XML contents to etree.fromstring(xml_str) (see the sketch just below)
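
In code, the second change boils down to roughly the following pattern (a simplified sketch of the diff shown further down, not the exact final code):

# Read the (possibly streamed) file contents explicitly and parse the string,
# instead of handing a path/URL straight to etree.parse().
with open(filename, "rt", encoding="utf-8") as f:
    xml_str = f.read()
root = etree.fromstring(xml_str)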

The GitHub diff tool makes the changes look larger than they are, sorry about that.

I tested locally, and the pubmed dataset now works in both normal and streaming modes. There is some overhead at the start of each shard in streaming mode, as building the XML tree online is quite slow (each pubmed .xml.gz file is ~20 MB), but the overhead gets amortized over all the samples in the shard. On my laptop with a single CPU worker I can stream at roughly 600 samples/s.
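
For reference, a minimal smoke test along these lines (a sketch; variable names are illustrative) is:

from datasets import load_dataset

# streaming=True iterates over the remote shards without downloading everything first
pubmed_train_streaming = load_dataset("pubmed", split="train", streaming=True)

# pull a single example to exercise the download + XML parsing path end to end
print(next(iter(pubmed_train_streaming)))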

@abhi-mosaic (Contributor, PR author) commented:

@albertvillanova just FYI, since you were so helpful with the previous pubmed issue :)

@albertvillanova (Member) left a comment:

Thank you, @abhi-mosaic.

Below are some comments/suggestions.

@@ -39,7 +39,7 @@
 # The HuggingFace dataset library don't host the datasets but only point to the original files
 # This can be an arbitrary nested dict/list of URLs (see below in `_split_generators` method)
 # Note these URLs here are used by MockDownloadManager.create_dummy_data_list
-_URLs = [f"ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed22n{i:04d}.xml.gz" for i in range(1, 1115)]
+_URLs = [f"https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed22n{i:04d}.xml.gz" for i in range(1, 1115)]
@albertvillanova (Member) commented on Feb 17, 2022:

Not sure why you changed the protocol. The PubMed download instructions recommend using their FTP servers.

Moreover, the change of protocol may have a performance impact.

Comment on lines 361 to 386

-    try:
-        tree = etree.parse(filename)
-        root = tree.getroot()
-        xmldict = self.xml_to_dictionnary(root)
-    except etree.ParseError:
-        logger.warning(f"Ignoring file {filename}, it is malformed")
-        continue
-
-    for article in xmldict["PubmedArticleSet"]["PubmedArticle"]:
-        self.update_citation(article)
-        new_article = default_article()
-
-        try:
-            deepupdate(new_article, article)
-        except Exception:
-            logger.warning(f"Ignoring article {article}, it is malformed")
-            continue
-
-        try:
-            _ = self.info.features.encode_example(new_article)
-        except Exception as e:
-            logger.warning(f"Ignore example because {e}")
-            continue
-        yield id_, new_article
-        id_ += 1
+    with open(filename, "rt", encoding="utf-8") as f:
+        xml_str = f.read()
+    try:
+        root = etree.fromstring(xml_str)
+        xmldict = self.xml_to_dictionnary(root)
+    except etree.ParseError:
+        logger.warning(f"Ignoring file {filename}, it is malformed")
+        continue
+
+    for article in xmldict["PubmedArticleSet"]["PubmedArticle"]:
+        self.update_citation(article)
+        new_article = default_article()
+
+        try:
+            deepupdate(new_article, article)
+        except Exception:
+            logger.warning(f"Ignoring article {article}, it is malformed")
+            continue
+
+        try:
+            _ = self.info.features.encode_example(new_article)
+        except Exception as e:
+            logger.warning(f"Ignore example because {e}")
+            continue
+        yield id_, new_article
+        id_ += 1
@albertvillanova (Member) commented:

I don't think this is necessary, since xml.etree.ElementTree.parse already supports streaming.

The only thing that needs to be changed is the import: as ET instead of as etree:

import xml.etree.ElementTree as etree
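
A sketch of the suggested simplification (the streaming wrapper itself comes from the datasets library, so this is only illustrative):

import xml.etree.ElementTree as ET  # alias as "ET" so the datasets streaming patch can wrap ET.parse

# inside _generate_examples, parse the (local or streamed) file directly
tree = ET.parse(filename)
root = tree.getroot()
xmldict = self.xml_to_dictionnary(root)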

@@ -172,7 +172,7 @@ def create_dummy_data_list(self, path_to_dummy_data, data_url):
         # trick: if there are many shards named like `data.txt-000001-of-00300`, only use the first one
         is_tf_records = all(bool(re.findall("[0-9]{3,}-of-[0-9]{3,}", url)) for url in data_url)
         is_pubmed_records = all(
-            url.startswith("ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed") for url in data_url
+            url.startswith("https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed") for url in data_url
@albertvillanova (Member) commented:

Same as above for the change of protocol.

@lhoestq (Member) commented on Feb 17, 2022:

IIRC streaming from FTP is not fully tested yet, so I'm fine with switching to HTTPS for now, as long as the download speed/availability is great

@abhi-mosaic (Contributor, PR author) commented:

@albertvillanova Thanks for pointing me to the ET module replacement. It should look a lot cleaner now.

Unfortunately, I tried keeping the ftp:// protocol but was seeing timeout errors in streaming mode (see below). I don't think https:// performance is an issue: when I profiled the open(..) -> f.read() -> etree.fromstring(xml_str) code path, most of the time was spent on XML parsing rather than on the data download.
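
A rough way to reproduce that kind of timing split on a single shard (illustrative sketch; the shard URL and the use of fsspec here are assumptions, not the exact profiling code):

import time
import xml.etree.ElementTree as ET

import fsspec  # assumption: open one remote gzipped shard directly for the timing

url = "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed22n0001.xml.gz"

t0 = time.perf_counter()
with fsspec.open(url, "rt", compression="gzip", encoding="utf-8") as f:
    xml_str = f.read()              # download + gunzip
t1 = time.perf_counter()
root = ET.fromstring(xml_str)       # build the XML tree
t2 = time.perf_counter()

print(f"read: {t1 - t0:.1f}s  parse: {t2 - t1:.1f}s")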

Error when using ftp://:

Traceback (most recent call last):
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/venv/lib/python3.8/site-packages/fsspec/implementations/ftp.py", line 301, in _fetch_range
    self.fs.ftp.retrbinary(
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/ftplib.py", line 430, in retrbinary
    callback(data)
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/venv/lib/python3.8/site-packages/fsspec/implementations/ftp.py", line 293, in callback
    raise TransferDone
fsspec.implementations.ftp.TransferDone

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_pubmed_streaming.py", line 9, in <module>
    print (next(iter(pubmed_train_streaming)))
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/abhi-datasets/src/datasets/iterable_dataset.py", line 365, in __iter__
    for key, example in self._iter():
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/abhi-datasets/src/datasets/iterable_dataset.py", line 362, in _iter
    yield from ex_iterable
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/abhi-datasets/src/datasets/iterable_dataset.py", line 79, in __iter__
    yield from self.generate_examples_fn(**self.kwargs)
  File "/Users/abhinav/.cache/huggingface/modules/datasets_modules/datasets/pubmed/af552ed918e2841e8427203530e3cfed3a8bc3213041d7853bea1ca67eec683d/pubmed.py", line 362, in _generate_examples
    tree = ET.parse(filename)
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/abhi-datasets/src/datasets/streaming.py", line 65, in wrapper
    return function(*args, use_auth_token=use_auth_token, **kwargs)
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/abhi-datasets/src/datasets/utils/streaming_download_manager.py", line 636, in xet_parse
    return ET.parse(f, parser=parser)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/xml/etree/ElementTree.py", line 1202, in parse
    tree.parse(source, parser)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/xml/etree/ElementTree.py", line 595, in parse
    self._root = parser._parse_whole(source)
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/abhi-datasets/src/datasets/utils/streaming_download_manager.py", line 293, in read_with_retries
    out = read(*args, **kwargs)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/gzip.py", line 292, in read
    return self._buffer.read(size)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/gzip.py", line 479, in read
    if not self._read_gzip_header():
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/gzip.py", line 422, in _read_gzip_header
    magic = self._fp.read(2)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/gzip.py", line 96, in read
    self.file.read(size-self._length+read)
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/venv/lib/python3.8/site-packages/fsspec/spec.py", line 1485, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/venv/lib/python3.8/site-packages/fsspec/caching.py", line 153, in _fetch
    self.cache = self.fetcher(start, end)  # new block replaces old
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/venv/lib/python3.8/site-packages/fsspec/implementations/ftp.py", line 311, in _fetch_range
    self.fs.ftp.getmultiline()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/ftplib.py", line 224, in getmultiline
    line = self.getline()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/ftplib.py", line 206, in getline
    line = self.file.readline(self.maxline + 1)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

@albertvillanova (Member) left a comment:

Thanks a lot for your work and your profiling!

I see FTP doesn't allow streaming, as it times out while processing the XML (I guess).

If the performance is not negatively impacted, let's use HTTPS instead.

@albertvillanova merged commit e8a0dd2 into huggingface:master on Feb 18, 2022
Closes: Pubmed dataset does not work in streaming mode (#3739)