Adding random_split example to tutorial (#843)
Summary:
Pull Request resolved: #843

`random_split` is one of the most commonly used functionalities. It will be useful to include it in the tutorial so that users don't try to use `torch.utils.data.random_split` with `IterDataPipe`.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D40532302

Pulled By: NivekT

fbshipit-source-id: a9bcd9a000eaa62e0a624d27febf5081ba0d0e33
NivekT authored and facebook-github-bot committed Oct 20, 2022
1 parent 5ad07ae commit 222877d
Showing 2 changed files with 12 additions and 3 deletions.
13 changes: 11 additions & 2 deletions docs/source/tutorial.rst
@@ -9,6 +9,7 @@ Suppose that we want to load data from CSV files with the following steps:
- List all CSV files in a directory
- Load CSV files
- Parse CSV file and yield rows
- Split our dataset into training and validation sets

There are a few `built-in DataPipes <torchdata.datapipes.iter.html>`_ that can help us with the above operations.

@@ -19,6 +20,8 @@ There are a few `built-in DataPipes <torchdata.datapipes.iter.html>`_ that can help us with the above operations.
streams <generated/torchdata.datapipes.iter.FileOpener.html>`_
- ``CSVParser`` - `consumes file streams, parses the CSV contents, and returns one parsed line at a
time <generated/torchdata.datapipes.iter.CSVParser.html>`_
- ``RandomSplitter`` - `randomly splits samples from a source DataPipe into
groups <generated/torchdata.datapipes.iter.RandomSplitter.html>`_

As an example, the source code for ``CSVParser`` looks something like this:

@@ -48,9 +51,14 @@ class constructors. A pipeline can be assembled as the following:
datapipe = dp.iter.FileLister([FOLDER]).filter(filter_fn=lambda filename: filename.endswith('.csv'))
datapipe = dp.iter.FileOpener(datapipe, mode='rt')
datapipe = datapipe.parse_csv(delimiter=',')
N_ROWS = 10000 # total number of rows of data
train, valid = datapipe.random_split(total_length=N_ROWS, weights={"train": 0.5, "valid": 0.5}, seed=0)
for d in datapipe: # Iterating through the data
pass
for x in train: # Iterating through the training dataset
pass
for y in valid: # Iterating through the validation dataset
pass
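The buffer-less split that ``random_split`` performs can be pictured with a plain-Python sketch. This is a simplified illustration, not torchdata's actual implementation: ``RandomSplitter`` produces exact group sizes given ``total_length``, whereas this sketch assigns each sample independently as it streams by, so group sizes are only approximately proportional to the weights.

```python
import random

def random_split(source, weights, seed):
    """Simplified, buffer-less random split (illustration only).

    Each sample is assigned to a group as it streams past, using a
    seeded RNG, so the same seed always reproduces the same split.
    """
    rng = random.Random(seed)
    total = sum(weights.values())
    groups = {name: [] for name in weights}
    for sample in source:
        r = rng.random() * total
        acc = 0.0
        for name, w in weights.items():
            acc += w
            if r < acc:
                groups[name].append(sample)
                break
    return [groups[name] for name in weights]

# Split 10,000 rows roughly in half, reproducibly.
train, valid = random_split(range(10000), {"train": 0.5, "valid": 0.5}, seed=0)
```

The key property this sketch shares with ``RandomSplitter`` is that assignment is streaming and seeded: no buffer of samples is accumulated, and rerunning with the same seed yields the same split.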
You can find the full list of built-in `IterDataPipes here <torchdata.datapipes.iter.html>`_ and
`MapDataPipes here <torchdata.datapipes.map.html>`_.
@@ -422,5 +430,6 @@ directory ``curated/covid-19/ecdc_cases/latest``, belonging to account ``pandemi
# [['date_rep', 'day', ..., 'iso_country', 'daterep'],
# ['2020-12-14', '14', ..., 'AF', '2020-12-14'],
# ['2020-12-13', '13', ..., 'AF', '2020-12-13']]
If necessary, you can also access data in Azure Data Lake Storage Gen1 by using URIs starting with
``adl://`` and ``abfs://``, as described in the `README of the adlfs repo <https://github.com/fsspec/adlfs/blob/main/README.md>`_
2 changes: 1 addition & 1 deletion torchdata/datapipes/iter/util/randomsplitter.py
@@ -16,7 +16,7 @@
@functional_datapipe("random_split")
class RandomSplitterIterDataPipe(IterDataPipe):
r"""
Randomly split samples from a source DataPipe into groups(functional name: ``random_split``).
Randomly split samples from a source DataPipe into groups (functional name: ``random_split``).
Since there is no buffer, only ONE group of samples (i.e. one child DataPipe) can be iterated through
at any time. Attempts to iterate through multiple of them simultaneously will fail.
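One way to picture how a buffer-less split can still yield disjoint, consistent groups is that each child can replay the source with the same seeded RNG and keep only its own samples. The sketch below is hypothetical and is not the code of ``RandomSplitterIterDataPipe``; it trades the one-group-at-a-time restriction for reading the source once per child.

```python
import random

class SplitChild:
    """Hypothetical child of a buffer-less random split (not torchdata code).

    Every child replays the source with the same seed, so the per-sample
    group assignments agree across children; each child yields only the
    samples that fall into its own group and discards the rest.
    """

    def __init__(self, source, weights, seed, name):
        self.source = source
        self.weights = weights
        self.seed = seed
        self.name = name

    def __iter__(self):
        rng = random.Random(self.seed)
        total = sum(self.weights.values())
        for sample in self.source:
            r = rng.random() * total
            acc = 0.0
            for group, w in self.weights.items():
                acc += w
                if r < acc:
                    if group == self.name:
                        yield sample
                    break

weights = {"train": 0.5, "valid": 0.5}
train = SplitChild(range(10), weights, seed=0, name="train")
valid = SplitChild(range(10), weights, seed=0, name="valid")
```

Because the two children share the same seed, every sample lands in exactly one of them, and re-iterating a child is deterministic. A real buffer-less splitter avoids the repeated passes by sharing a single pass over the source, which is why the docstring above restricts iteration to one group at a time.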
