From 72d067d21896af62b9770da0adb76b09ebc43ae4 Mon Sep 17 00:00:00 2001
From: Grigory Sizov
Date: Wed, 19 Oct 2022 08:21:20 -0700
Subject: [PATCH] Add docs on accessing Azure blob storage through fsspec (#836)

Summary:
### Changes

Adding an example of DataPipe usage with Azure Blob storage via `fsspec`, similar to
https://github.com/pytorch/data/pull/812. The example is placed into a new section in
`docs/source/tutorial.rst`.

Here is a screenshot showing that the code snippets in the tutorial work as expected:

[Screenshot 2022-10-18 at 19 33 49]

#### Minor note

Technically, `fsspec` [allows both path prefixes `abfs://` and `az://`](https://github.com/fsspec/adlfs/blob/f15c37a43afd87a04f01b61cd90294dd57181e1d/README.md?plain=1#L33) for Azure Blob storage Gen2 as synonyms. However, only `abfs://` works for us, for the following reason:

- If a path starts with `az`, the variable `fs.protocol` [here](https://github.com/pytorch/data/blob/768ecdae8b56af640a78e29f82864dc4f65df371/torchdata/datapipes/iter/load/fsspec.py#L82) is still `abfs`
- So the condition `root.startswith(protocol)` is false, and `is_local` is true
- As a result, the path "doubles" in [this line](https://github.com/pytorch/data/blob/768ecdae8b56af640a78e29f82864dc4f65df371/torchdata/datapipes/iter/load/fsspec.py#L95), like on this screenshot:

[Screenshot 2022-10-18 at 19 50 56]

This won't have any effect on users, however, as long as they use the `abfs://` prefix recommended in the tutorial.

Pull Request resolved: https://github.com/pytorch/data/pull/836

Reviewed By: NivekT

Differential Revision: D40483505

Pulled By: sgrigory

fbshipit-source-id: f03373aa4b376af8ea2ac3480fc133067caaa0ce
---
 docs/source/tutorial.rst | 43 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst
index 3370856ce..ad9469b26 100644
--- a/docs/source/tutorial.rst
+++ b/docs/source/tutorial.rst
@@ -298,7 +298,7 @@ recommend using the functional form of DataPipes.
 Working with Cloud Storage Providers
 ---------------------------------------------
 
-In this section, we show examples accessing AWS S3 and Google Cloud Storage with built-in``fsspec`` DataPipes.
+In this section, we show examples accessing AWS S3, Google Cloud Storage, and Azure Cloud Storage with built-in ``fsspec`` DataPipes.
 Although only those two providers are discussed here, with additional libraries, ``fsspec``
 DataPipes should allow you to connect with other storage systems as well (`list of known
 implementations `_).
@@ -384,3 +384,44 @@ directory ``applications``.
     # gcs:/uspto-pair/applications/05900035.zip/05900035/05900035-application_data.tsv, StreamWrapper<...>
     # gcs:/uspto-pair/applications/05900035.zip/05900035/05900035-continuity_data.tsv, StreamWrapper<...>
     # gcs:/uspto-pair/applications/05900035.zip/05900035/05900035-transaction_history.tsv, StreamWrapper<...>
+
+Accessing Azure Blob storage with ``fsspec`` DataPipes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This requires the installation of the libraries ``fsspec``
+(`documentation `_) and ``adlfs``
+(`adlfs GitHub repo `_).
+You can access data in Azure Data Lake Storage Gen2 by providing URIs starting with ``abfs://``.
+For example,
+`FSSpecFileLister `_ (``.list_files_by_fsspec(...)``)
+can be used to list files in a directory in a container:
+
+.. code:: python
+
+    from torchdata.datapipes.iter import IterableWrapper
+
+    storage_options = {'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}
+    dp = IterableWrapper(['abfs://CONTAINER/DIRECTORY']).list_files_by_fsspec(**storage_options)
+    print(list(dp))
+    # ['abfs://container/directory/file1.txt', 'abfs://container/directory/file2.txt', ...]
+
+You can also open files using `FSSpecFileOpener `_
+(``.open_files_by_fsspec(...)``) and stream them
+(if supported by the file format).
+
+Here is an example of loading a CSV file ``ecdc_cases.csv`` from a public container inside the
+directory ``curated/covid-19/ecdc_cases/latest``, belonging to account ``pandemicdatalake``.
+
+.. code:: python
+
+    from torchdata.datapipes.iter import IterableWrapper
+
+    dp = IterableWrapper(['abfs://public/curated/covid-19/ecdc_cases/latest/ecdc_cases.csv']) \
+        .open_files_by_fsspec(account_name='pandemicdatalake') \
+        .parse_csv()
+    print(list(dp)[:3])
+    # [['date_rep', 'day', ..., 'iso_country', 'daterep'],
+    # ['2020-12-14', '14', ..., 'AF', '2020-12-14'],
+    # ['2020-12-13', '13', ..., 'AF', '2020-12-13']]
+
+If necessary, you can also access data in Azure Data Lake Storage Gen1 by using URIs starting with
+``adl://`` and ``abfs://``, as described in the `README of adlfs repo `_.
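
Note (illustration, not part of the patch): the `abfs://` vs `az://` behaviour described in the "Minor note" above can be sketched with a small standalone snippet. The helper `treated_as_remote` below is hypothetical and is not the torchdata implementation; it only mimics the `root.startswith(protocol)` comparison mentioned above, assuming `fs.protocol` is reported as `abfs` for both URI spellings.

```python
# Hypothetical sketch of the prefix check described in the "Minor note".
# This is NOT the torchdata implementation; it only mimics the
# root.startswith(protocol) comparison, assuming adlfs reports "abfs"
# as the protocol for both URI spellings.

def treated_as_remote(root: str, fs_protocol: str) -> bool:
    """Return True if the root URI passes a startswith-style protocol check."""
    return root.startswith(fs_protocol)

# abfs:// matches the reported protocol, so the path is handled as remote.
print(treated_as_remote("abfs://container/directory", "abfs"))  # True

# az:// does not match "abfs", so the check falls through to the "local" branch
# and the path "doubles", which is why the tutorial recommends abfs://.
print(treated_as_remote("az://container/directory", "abfs"))    # False
```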