Support streaming for pubmed #3740

abhi-mosaic · 2022-02-17T00:18:22Z

This PR makes some minor changes to the pubmed dataset to allow for streaming=True. Fixes #3739.

Basically, I followed the C4 dataset which works in streaming mode as an example, and made the following changes:

Change URL prefix from ftp:// to https://
Explicilty open the filename and pass the XML contents to etree.fromstring(xml_str)

The Github diff tool makes it look like the changes are larger than they are, sorry about that.

I tested locally and the pubmed dataset now works in both normal and streaming modes. There is some overhead at the start of each shard in streaming mode as building the XML tree online is quite slow (each pubmed .xml.gz file is ~20MB), but the overhead gets amortized over all the samples in the shard. On my laptop with a single CPU worker I am able to stream at about ~600 samples/s.

abhi-mosaic · 2022-02-17T00:45:59Z

@albertvillanova just FYI, since you were so helpful with the previous pubmed issue :)

albertvillanova

Thank you, @abhi-mosaic.

Below some comments/suggestions.

albertvillanova · 2022-02-17T10:08:22Z

datasets/pubmed/pubmed.py

@@ -39,7 +39,7 @@
 # The HuggingFace dataset library don't host the datasets but only point to the original files
 # This can be an arbitrary nested dict/list of URLs (see below in `_split_generators` method)
 # Note these URLs here are used by MockDownloadManager.create_dummy_data_list
-_URLs = [f"ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed22n{i:04d}.xml.gz" for i in range(1, 1115)]
+_URLs = [f"https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed22n{i:04d}.xml.gz" for i in range(1, 1115)]


Not sure why you changed the protocol. PubMed download instructions recommend to use their FTP servers.

Moreover, the change of protocol may have a performance impact.

albertvillanova · 2022-02-17T10:13:15Z

datasets/pubmed/pubmed.py

-            try:
-                tree = etree.parse(filename)
-                root = tree.getroot()
-                xmldict = self.xml_to_dictionnary(root)
-            except etree.ParseError:
-                logger.warning(f"Ignoring file {filename}, it is malformed")
-                continue
-
-            for article in xmldict["PubmedArticleSet"]["PubmedArticle"]:
-                self.update_citation(article)
-                new_article = default_article()
-
+            with open(filename, "rt", encoding="utf-8") as f:
+                xml_str = f.read()
                try:
-                    deepupdate(new_article, article)
-                except Exception:
-                    logger.warning(f"Ignoring article {article}, it is malformed")
+                    root = etree.fromstring(xml_str)
+                    xmldict = self.xml_to_dictionnary(root)
+                except etree.ParseError:
+                    logger.warning(f"Ignoring file {filename}, it is malformed")
                    continue

-                try:
-                    _ = self.info.features.encode_example(new_article)
-                except Exception as e:
-                    logger.warning(f"Ignore example because {e}")
-                    continue
-                yield id_, new_article
-                id_ += 1
+                for article in xmldict["PubmedArticleSet"]["PubmedArticle"]:
+                    self.update_citation(article)
+                    new_article = default_article()
+
+                    try:
+                        deepupdate(new_article, article)
+                    except Exception:
+                        logger.warning(f"Ignoring article {article}, it is malformed")
+                        continue
+
+                    try:
+                        _ = self.info.features.encode_example(new_article)
+                    except Exception as e:
+                        logger.warning(f"Ignore example because {e}")
+                        continue
+                    yield id_, new_article
+                    id_ += 1


I don't think this is necessary, once that xml.etree.ElementTree.parse already supports streaming. See:

Extend support for streaming datasets that use ET.parse #3476

The only thing needs being changed is the import: as ET instead of as etree:

datasets/datasets/pubmed/pubmed.py

Line 19 in 71aa41c

import xml.etree.ElementTree as etree

albertvillanova · 2022-02-17T10:13:32Z

src/datasets/utils/mock_download_manager.py

@@ -172,7 +172,7 @@ def create_dummy_data_list(self, path_to_dummy_data, data_url):
        # trick: if there are many shards named like `data.txt-000001-of-00300`, only use the first one
        is_tf_records = all(bool(re.findall("[0-9]{3,}-of-[0-9]{3,}", url)) for url in data_url)
        is_pubmed_records = all(
-            url.startswith("ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed") for url in data_url
+            url.startswith("https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed") for url in data_url


Same as above for the change of protocol.

lhoestq · 2022-02-17T15:30:10Z

IIRC streaming from FTP is not fully tested yet, so I'm fine with switching to HTTPS for now, as long as the download speed/availability is great

abhi-mosaic · 2022-02-17T16:28:20Z

@albertvillanova Thanks for pointing me to the ET module replacement. It should look a lot cleaner now.

Unfortunately I tried keeping the ftp:// protocol but was seeing timeout errors? in streaming mode (below). I think the https:// performance is not an issue, when I was profiling the open(..) -> f.read() -> etree.fromstring(xml_str) codepath, most of the time was spent in the XML parsing rather than the data download.

Error when using ftp://:

Traceback (most recent call last):
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/venv/lib/python3.8/site-packages/fsspec/implementations/ftp.py", line 301, in _fetch_range
    self.fs.ftp.retrbinary(
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/ftplib.py", line 430, in retrbinary
    callback(data)
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/venv/lib/python3.8/site-packages/fsspec/implementations/ftp.py", line 293, in callback
    raise TransferDone
fsspec.implementations.ftp.TransferDone

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_pubmed_streaming.py", line 9, in <module>
    print (next(iter(pubmed_train_streaming)))
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/abhi-datasets/src/datasets/iterable_dataset.py", line 365, in __iter__
    for key, example in self._iter():
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/abhi-datasets/src/datasets/iterable_dataset.py", line 362, in _iter
    yield from ex_iterable
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/abhi-datasets/src/datasets/iterable_dataset.py", line 79, in __iter__
    yield from self.generate_examples_fn(**self.kwargs)
  File "/Users/abhinav/.cache/huggingface/modules/datasets_modules/datasets/pubmed/af552ed918e2841e8427203530e3cfed3a8bc3213041d7853bea1ca67eec683d/pubmed.py", line 362, in _generate_examples
    tree = ET.parse(filename)
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/abhi-datasets/src/datasets/streaming.py", line 65, in wrapper
    return function(*args, use_auth_token=use_auth_token, **kwargs)
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/abhi-datasets/src/datasets/utils/streaming_download_manager.py", line 636, in xet_parse
    return ET.parse(f, parser=parser)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/xml/etree/ElementTree.py", line 1202, in parse
    tree.parse(source, parser)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/xml/etree/ElementTree.py", line 595, in parse
    self._root = parser._parse_whole(source)
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/abhi-datasets/src/datasets/utils/streaming_download_manager.py", line 293, in read_with_retries
    out = read(*args, **kwargs)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/gzip.py", line 292, in read
    return self._buffer.read(size)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/gzip.py", line 479, in read
    if not self._read_gzip_header():
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/gzip.py", line 422, in _read_gzip_header
    magic = self._fp.read(2)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/gzip.py", line 96, in read
    self.file.read(size-self._length+read)
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/venv/lib/python3.8/site-packages/fsspec/spec.py", line 1485, in read
    out = self.cache._fetch(self.loc, self.loc + length)
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/venv/lib/python3.8/site-packages/fsspec/caching.py", line 153, in _fetch
    self.cache = self.fetcher(start, end)  # new block replaces old
  File "/Users/abhinav/Documents/mosaicml/hf_datasets/venv/lib/python3.8/site-packages/fsspec/implementations/ftp.py", line 311, in _fetch_range
    self.fs.ftp.getmultiline()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/ftplib.py", line 224, in getmultiline
    line = self.getline()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/ftplib.py", line 206, in getline
    line = self.file.readline(self.maxline + 1)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

albertvillanova

Thanks a lot for your work and your profiling!

I see FTP doesn't allow streaming, as it timeouts while processing the XML (I guess).

If the performance is not negatively impacted, let's use HTTPS instead.

abhi-mosaic added 2 commits February 16, 2022 16:07

update pubmed streaming

cc72d0f

update hardcoded exception

71aa41c

albertvillanova requested changes Feb 17, 2022

View reviewed changes

refactor to use ET

a15d864

albertvillanova approved these changes Feb 18, 2022

View reviewed changes

albertvillanova merged commit e8a0dd2 into huggingface:master Feb 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support streaming for pubmed #3740

Support streaming for pubmed #3740

abhi-mosaic commented Feb 17, 2022 •

edited by albertvillanova

Loading

abhi-mosaic commented Feb 17, 2022

albertvillanova left a comment

albertvillanova Feb 17, 2022 •

edited

Loading

albertvillanova Feb 17, 2022

albertvillanova Feb 17, 2022

lhoestq commented Feb 17, 2022

abhi-mosaic commented Feb 17, 2022

albertvillanova left a comment

Support streaming for pubmed #3740

Support streaming for pubmed #3740

Conversation

abhi-mosaic commented Feb 17, 2022 • edited by albertvillanova Loading

abhi-mosaic commented Feb 17, 2022

albertvillanova left a comment

Choose a reason for hiding this comment

albertvillanova Feb 17, 2022 • edited Loading

Choose a reason for hiding this comment

albertvillanova Feb 17, 2022

Choose a reason for hiding this comment

albertvillanova Feb 17, 2022

Choose a reason for hiding this comment

lhoestq commented Feb 17, 2022

abhi-mosaic commented Feb 17, 2022

albertvillanova left a comment

Choose a reason for hiding this comment

abhi-mosaic commented Feb 17, 2022 •

edited by albertvillanova

Loading

albertvillanova Feb 17, 2022 •

edited

Loading