Update the Dataset documentation (deepjavalibrary#1686)
This makes a number of updates to the dataset documentation. It moves the list
of datasets to the main guide, which is more accessible than basicdataset's README
and can contain future datasets that are not in basicdataset.

It also updates the list of built-in datasets and adds a list of dataset
helpers. It improves the page for creating a custom dataset. Finally, there are
a few other miscellaneous doc improvements.
zachgk authored Jun 1, 2022
1 parent 8ed561b commit 890098c
Showing 19 changed files with 398 additions and 244 deletions.
16 changes: 12 additions & 4 deletions api/src/main/java/ai/djl/training/dataset/ArrayDataset.java
@@ -21,19 +21,27 @@

/**
* {@code ArrayDataset} is an implementation of {@link RandomAccessDataset} that consists entirely of
- * large {@link NDArray}s. There can be multiple data and label {@link NDArray}s within the dataset.
- * Each sample will be retrieved by indexing each {@link NDArray} along the first dimension.
+ * large {@link NDArray}s. It is recommended only for datasets small enough to fit in memory that
+ * come in array formats. Otherwise, consider using {@link RandomAccessDataset} directly.
+ *
+ * <p>There can be multiple data and label {@link NDArray}s within the dataset. Each sample will be
+ * retrieved by indexing each {@link NDArray} along the first dimension.
*
* <p>The following is an example of how to use ArrayDataset:
*
* <pre>
* ArrayDataset dataset = new ArrayDataset.Builder()
- *     .setData(data)
- *     .optLabels(label)
+ *     .setData(data1, data2)
+ *     .optLabels(labels1, labels2, labels3)
* .setSampling(20, false)
* .build();
* </pre>
*
+ * <p>Suppose you get a {@link Batch} from {@code trainer.iterateDataset(dataset)} or {@code
+ * dataset.getData(manager)}. The data of this batch will be an NDList with one NDArray for each
+ * data input; in this case, it would be 2 arrays. Similarly, the labels would have 3 arrays.
+ *
* @see Dataset
*/
public class ArrayDataset extends RandomAccessDataset {
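For reference, here is a minimal runnable sketch of building and iterating such a dataset (the shapes, sizes, and variable names are illustrative):

```java
import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDList;
import ai.djl.ndarray.NDManager;
import ai.djl.ndarray.types.Shape;
import ai.djl.training.dataset.ArrayDataset;
import ai.djl.training.dataset.Batch;

public class ArrayDatasetExample {
    public static void main(String[] args) throws Exception {
        try (NDManager manager = NDManager.newBaseManager()) {
            // Two data inputs and three labels, all sharing the same first (sample) dimension
            NDArray data1 = manager.randomUniform(0f, 1f, new Shape(100, 10));
            NDArray data2 = manager.randomUniform(0f, 1f, new Shape(100, 5));
            NDArray labels1 = manager.zeros(new Shape(100, 1));
            NDArray labels2 = manager.zeros(new Shape(100, 1));
            NDArray labels3 = manager.zeros(new Shape(100, 1));

            ArrayDataset dataset = new ArrayDataset.Builder()
                    .setData(data1, data2)
                    .optLabels(labels1, labels2, labels3)
                    .setSampling(20, false)
                    .build();

            for (Batch batch : dataset.getData(manager)) {
                NDList data = batch.getData();     // NDList of size 2: one NDArray per data input
                NDList labels = batch.getLabels(); // NDList of size 3: one NDArray per label
                batch.close();
            }
        }
    }
}
```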
4 changes: 3 additions & 1 deletion api/src/main/java/ai/djl/training/dataset/Dataset.java
@@ -21,7 +21,9 @@
/**
* An interface to represent a set of sample data/label pairs to train a model.
*
- * @see <a href="http://docs.djl.ai/docs/dataset.html">The guide on datasets</a>
+ * @see <a href="http://docs.djl.ai/docs/dataset.html">The guide to datasets</a>
+ * @see <a href="http://docs.djl.ai/docs/development/how_to_use_dataset.html">The guide to
+ *     implementing a custom dataset</a>
*/
public interface Dataset {

@@ -34,6 +34,11 @@
/**
* RandomAccessDataset represents a dataset that supports random access reads, i.e. it can access
* a specific data item given its index.
+ *
+ * <p>Almost all datasets in DJL extend, either directly or indirectly, {@link RandomAccessDataset}.
+ *
+ * @see <a href="http://docs.djl.ai/docs/development/how_to_use_dataset.html">The guide to
+ *     implementing a custom dataset</a>
*/
public abstract class RandomAccessDataset implements Dataset {

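As a small sketch of what random access means in practice, any subclass lets you fetch a single sample by index (assuming `dataset` has already been prepared):

```java
import ai.djl.ndarray.NDList;
import ai.djl.ndarray.NDManager;
import ai.djl.training.dataset.RandomAccessDataset;
import ai.djl.training.dataset.Record;

import java.io.IOException;

final class RandomAccessExample {

    /** Reads the sample at {@code index} directly, without iterating from the beginning. */
    static NDList sampleData(RandomAccessDataset dataset, NDManager manager, long index)
            throws IOException {
        Record record = dataset.get(manager, index);
        return record.getData(); // the matching labels are available via record.getLabels()
    }
}
```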
12 changes: 1 addition & 11 deletions basicdataset/README.md
@@ -4,17 +4,7 @@

This module contains a number of basic and standard datasets for the Deep Java Library (DJL). These datasets are used to train deep learning models.

- ## List of datasets
-
- This module contains the following datasets:
-
- - [MNIST](http://yann.lecun.com/exdb/mnist/) - A handwritten digits dataset
- - [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) - A dataset consisting of 60,000 32x32 color images in 10 classes
- - [Coco](http://cocodataset.org) - A large-scale object detection, segmentation, and captioning dataset that contains 1.5 million object instances
-   - You have to manually add `com.twelvemonkeys.imageio:imageio-jpeg:3.5` dependency to your project
- - [ImageNet](http://www.image-net.org/) - An image database organized according to the WordNet hierarchy
-   >**Note**: You have to manually download the ImageNet dataset due to licensing requirements.
- - [Pikachu](http://d2l.ai/chapter_computer-vision/object-detection-dataset.html) - 1000 Pikachu images of different angles and sizes created using an open source 3D Pikachu model
+ You can find the datasets provided by this module on our [docs](http://docs.djl.ai/docs/dataset.html).

## Documentation

@@ -34,6 +34,13 @@
/**
* Coco image detection dataset from http://cocodataset.org/#home.
*
+ * <p>Coco is a large-scale object detection, segmentation, and captioning dataset, although only
+ * object detection is implemented at this time. It contains 1.5 million object instances and is
+ * one of the standard benchmark object detection datasets.
+ *
+ * <p>To use this dataset, you have to manually add {@code
+ * com.twelvemonkeys.imageio:imageio-jpeg:3.5} as a dependency in your project.
+ *
* <p>Each image might have different {@link ai.djl.ndarray.types.Shape}s.
*/
public class CocoDetection extends ObjectDetectionDataset {
@@ -39,7 +39,14 @@
import java.util.Map;
import java.util.Optional;

- /** Pikachu image detection dataset that contains multiple Pikachus in each image. */
+ /**
+ * Pikachu image detection dataset.
+ *
+ * <p>It is based on a section from the <a
+ * href="http://d2l.ai/chapter_computer-vision/object-detection-dataset.html">Dive into Deep
+ * Learning book</a>. It contains 1000 Pikachu images of different angles and sizes created using
+ * an open source 3D Pikachu model. Each image contains only a single Pikachu.
+ */
public class PikachuDetection extends ObjectDetectionDataset {

private static final String VERSION = "1.0";
@@ -34,7 +34,11 @@
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

- /** A dataset for loading image files stored in a folder structure. */
+ /**
+ * A dataset for loading image files stored in a folder structure.
+ *
+ * <p>Usually, you want to use {@link ImageFolder} instead.
+ */
public abstract class AbstractImageFolder extends ImageClassificationDataset {

private static final Logger logger = LoggerFactory.getLogger(AbstractImageFolder.class);
@@ -34,6 +34,9 @@
/**
* CIFAR10 image classification dataset from https://www.cs.toronto.edu/~kriz/cifar.html.
*
+ * <p>It consists of 60,000 32x32 color images in 10 classes. Models can train on it in a few
+ * hours with a GPU.
+ *
* <p>Each sample is an image (in 3-D {@link NDArray}) with shape (32, 32, 3).
*/
public final class Cifar10 extends ArrayDataset {
@@ -33,9 +33,11 @@

/**
* FashionMnist is a dataset of Zalando article images
- * https://github.com/zalandoresearch/fashion-mnist.
+ * (https://github.com/zalandoresearch/fashion-mnist).
*
- * <p>Each sample is an image (in 3-D NDArray) with shape (28, 28, 1).
+ * <p>Each sample is a grayscale image (in 3-D NDArray) with shape (28, 28, 1).
+ *
+ * <p>It was created as a drop-in replacement for {@link Mnist} with a less simplistic task.
*/
public final class FashionMnist extends ArrayDataset {

@@ -24,6 +24,8 @@
/**
* A dataset for loading image files stored in a folder structure.
*
+ * <p>Below is an example directory layout for the image folder:
+ *
* <pre>
* The image folder should be structured as follows:
* root/shoes/Aerobic Shoes1.png
@@ -32,11 +34,35 @@
* root/boots/Black Boots.png
* root/boots/White Boots.png
* ...
- * root/pumps/Red Pumps
- * root/pumps/Pink Pumps
+ * root/pumps/Red Pumps.png
+ * root/pumps/Pink Pumps.png
* ...
*
* here shoes, boots, pumps are your labels
* </pre>
+ *
+ * <p>Here, the dataset will take the folder names (shoes, boots, pumps) in sorted order as your
+ * labels. Nested folder structures are not currently supported.
+ *
+ * <p>Then, you can create your instance of the dataset as follows:
+ *
+ * <pre>
+ * // set the image folder path
+ * Repository repository = Repository.newInstance("folder", Paths.get("/path/to/imagefolder/root"));
+ * ImageFolder dataset =
+ *         new ImageFolder.Builder()
+ *                 .setRepository(repository)
+ *                 .addTransform(new Resize(100, 100)) // use image transforms as necessary for your data
+ *                 .addTransform(new ToTensor()) // usually required as the last transform to convert images to tensors
+ *                 .setSampling(batchSize, true)
+ *                 .build();
+ *
+ * // call prepare before using
+ * dataset.prepare();
+ *
+ * // to get the synset or label names
+ * List&lt;String&gt; synset = dataset.getSynset();
+ * </pre>
*/
public final class ImageFolder extends AbstractImageFolder {

@@ -34,7 +34,11 @@
/**
* MNIST handwritten digits dataset from http://yann.lecun.com/exdb/mnist.
*
- * <p>Each sample is an image (in 3-D NDArray) with shape (28, 28, 1).
+ * <p>Each sample is a grayscale image (in 3-D NDArray) with shape (28, 28, 1).
+ *
+ * <p>It is a common starting dataset because it is small and trains within minutes. However, it is
+ * an overly easy task that even poor models perform well on. Instead, consider {@link
+ * FashionMnist}, which offers comparable speed with a more reasonable difficulty.
*/
public final class Mnist extends ArrayDataset {

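For example, a minimal sketch of loading MNIST (the usage, batch size, and progress reporting are illustrative choices):

```java
import ai.djl.basicdataset.cv.classification.Mnist;
import ai.djl.training.dataset.Dataset;
import ai.djl.training.util.ProgressBar;

public class MnistExample {
    public static void main(String[] args) throws Exception {
        Mnist mnist = Mnist.builder()
                .optUsage(Dataset.Usage.TRAIN)
                .setSampling(32, true) // batch size 32, shuffled
                .build();
        mnist.prepare(new ProgressBar()); // downloads the data on first use
        System.out.println("Samples: " + mnist.size());
    }
}
```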
@@ -39,6 +39,8 @@
* questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every
* question is a segment of text, or span, from the corresponding reading passage, or the question
* might be unanswerable.
+ *
+ * @see <a href="https://rajpurkar.github.io/SQuAD-explorer/">Dataset website</a>
*/
@SuppressWarnings("unchecked")
public class StanfordQuestionAnsweringDataset extends TextDataset implements RawDataset<Object> {
54 changes: 52 additions & 2 deletions docs/dataset.md
@@ -1,6 +1,6 @@
# Dataset

- A dataset (or data set) is a collection of data that are used for machine-learning training job.
+ A dataset (or data set) is a collection of data that is used for training a machine learning model.

Machine learning typically works with three datasets:

@@ -11,17 +11,67 @@ Machine learning typically works with three datasets:
- Validation dataset

The validation set is used to evaluate a given model during the training process. It helps machine learning
- engineers to fine-tune the [HyperParameter](https://github.com/deepjavalibrary/djl/blob/master/api/src/main/java/ai/djl/training/hyperparameter/param/Hyperparameter.java)
+ engineers to fine-tune the [HyperParameters](https://github.com/deepjavalibrary/djl/blob/master/api/src/main/java/ai/djl/training/hyperparameter/param/Hyperparameter.java)
at the model development stage.
The model doesn't learn from the validation dataset, and the validation dataset is optional.

- Test dataset

The test dataset provides the gold standard used to evaluate the model.
It is only used once a model is completely trained.
The test dataset should more accurately evaluate how the model will perform on new data.

See [Jason Brownlee’s article](https://machinelearningmastery.com/difference-test-validation-datasets/) for more detail.
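In DJL, if a source provides only a single dataset, one way to produce these splits is `RandomAccessDataset.randomSplit` — a minimal sketch, assuming `dataset` is a prepared `RandomAccessDataset` and the 8/1/1 ratio is an illustrative choice:

```java
// Split one dataset into training/validation/test subsets with an 80%/10%/10% ratio
RandomAccessDataset[] splits = dataset.randomSplit(8, 1, 1);
RandomAccessDataset trainingSet = splits[0];
RandomAccessDataset validationSet = splits[1];
RandomAccessDataset testSet = splits[2];
```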

## [Basic Dataset](../basicdataset/README.md)

DJL provides a number of built-in basic and standard datasets. These datasets are used to train deep learning models.
This module contains the following datasets:

### CV

#### Image Classification

- [MNIST](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/cv/classification/Mnist.html) - A small and fast handwritten digits dataset
- [Fashion MNIST](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/cv/classification/FashionMnist.html) - A small and fast clothing type classification dataset
- [CIFAR10](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/cv/classification/Cifar10.html) - A dataset consisting of 60,000 32x32 color images in 10 classes
- [ImageNet](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/cv/classification/ImageNet.html) - An image database organized according to the WordNet hierarchy
  >**Note**: You have to manually download the ImageNet dataset due to licensing requirements.

#### Object Detection

- [Pikachu](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/cv/PikachuDetection.html) - 1000 Pikachu images of different angles and sizes created using an open source 3D Pikachu model
- [Banana Detection](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/cv/BananaDetection.html) - A simple single-object detection dataset intended for testing

#### Other CV

- [Captcha](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/cv/classification/CaptchaDataset.html) - A dataset for a grayscale 6-digit CAPTCHA task
- [Coco](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/cv/CocoDetection.html) - A large-scale object detection, segmentation, and captioning dataset that contains 1.5 million object instances
  - You have to manually add the `com.twelvemonkeys.imageio:imageio-jpeg:3.5` dependency to your project

### NLP

#### Text Classification and Sentiment Analysis

- [AmazonReview](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/nlp/AmazonReview.html) - A sentiment analysis dataset of Amazon Reviews with their ratings
- [Stanford Movie Review](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/nlp/StanfordMovieReview.html) - A sentiment analysis dataset of movie reviews and sentiments sourced from IMDB
- [GoEmotions](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/nlp/GoEmotions.html) - A dataset classifying 50k curated Reddit comments into either 27 emotion categories or neutral

#### Unlabeled Text

- [Penn Treebank Text](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/nlp/PennTreebankText.html) - The text (not POS tags) from the Penn Treebank, a collection of Wall Street Journal stories
- [WikiText2](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/nlp/WikiText2.html) - A collection of over 100 million tokens extracted from good and featured articles on Wikipedia

#### Other NLP

- [Stanford Question Answering Dataset (SQuAD)](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/nlp/StanfordQuestionAnsweringDataset.html) - A reading comprehension dataset with text from Wikipedia articles
- [Tatoeba English French Dataset](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/nlp/TatoebaEnglishFrenchDataset.html) - An English-French translation dataset from the Tatoeba Project

### Tabular

- [Airfoil Self-Noise](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/tabular/AirfoilRandomAccess.html) - A 6-feature dataset from NASA tests of airfoils
- [Ames House Pricing](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/tabular/AmesRandomAccess.html) - An 80-feature dataset to predict house prices

### Time Series

- [Daily Delhi Climate](https://javadoc.io/doc/ai.djl/basicdataset/latest/ai/djl/basicdataset/tabular/DailyDelhiClimate.html) - A small time series dataset of daily Delhi climate measurements
66 changes: 66 additions & 0 deletions docs/development/add_dataset_to_djl.md
@@ -0,0 +1,66 @@
# Add a new dataset to DJL basic datasets

This document outlines the procedure to add new datasets to DJL.

## Step 1: Prepare the folder structure

1. Navigate to the `test/resources/mlrepo/dataset` folder and create a folder in it to store your dataset based on its category.
For example, `cv/ai/djl/basicdataset/mnist`.
2. Create a version folder within your newly created dataset's folder (e.g., `0.0.1`). The version should match your dataset version.

## Step 2: Create a `metadata.json` file

You need to create a `metadata.json` file for the repository to load the dataset. You can refer to the format in the `metadata.json` files for existing datasets to create your own.

**Note:** You need to update the sha1 hash of each file in your `metadata.json` file. Use the following command to get the sha1Hash value:

```shell
$ shasum -a 1 <file_name>
```

## Step 3: Create a Dataset implementation

Create a class that implements the dataset and loads it.
For more details on creating datasets, see the [dataset creation guide](how_to_use_dataset.md).
You should also look at examples of official DJL datasets such as [`AmesRandomAccess`](https://github.com/deepjavalibrary/djl/blob/master/basicdataset/src/main/java/ai/djl/basicdataset/tabular/AmesRandomAccess.java)
or [`Cifar10`](https://github.com/deepjavalibrary/djl/blob/master/basicdataset/src/main/java/ai/djl/basicdataset/cv/classification/Cifar10.java).

Then, add some tests for the dataset.
For testing, you can use a local repository such as:

```java
Repository repository = Repository.newInstance("testRepository", Paths.get("/test/resources/mlrepo"));
```
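A minimal test might then look like the following sketch, where `MyDataset`, its builder methods, and the repository path are hypothetical placeholders for your own implementation:

```java
import ai.djl.ndarray.NDManager;
import ai.djl.repository.Repository;
import ai.djl.training.dataset.Record;

import org.testng.Assert;
import org.testng.annotations.Test;

import java.nio.file.Paths;

public class MyDatasetTest {

    @Test
    public void testMyDataset() throws Exception {
        // Local test repository (path is illustrative)
        Repository repository =
                Repository.newInstance("testRepository", Paths.get("src/test/resources/mlrepo"));
        // MyDataset is a hypothetical placeholder for your dataset class
        MyDataset dataset =
                MyDataset.builder()
                        .optRepository(repository)
                        .setSampling(32, true)
                        .build();
        dataset.prepare();

        try (NDManager manager = NDManager.newBaseManager()) {
            Record record = dataset.get(manager, 0);
            Assert.assertEquals(record.getData().size(), 1);
        }
    }
}
```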

## Step 4: Update the datasets list

Add your dataset to the [list of built-in datasets](../dataset.md).

## Step 5: Upload metadata

The official DJL ML repository is located on an S3 bucket managed by the AWS DJL team.
You have to add the metadata and any dataset files to the repository.

For non-AWS team members, go straight to Step 6 and open a pull request.
Within the pull request, you can coordinate with an AWS member to add the necessary files.

For AWS team members, run the following command to upload your dataset to the S3 bucket:

```shell
$ ./gradlew syncS3
```

The ML repository in DJL is mainly a repository of metadata.
The metadata typically contains only links indicating where the actual data can be found.

However, some datasets can be distributed by DJL directly when that makes them easier to use and the dataset's license permits redistribution.
In that case, coordinate with an AWS team member in your pull request.

## Step 6: Open a PR to add your dataset implementation and metadata files to the git repository

**Note**: Avoid checking in binary files to git. Binary files should only be uploaded to the S3 bucket.

If you are relying on an AWS team member, you should leave your code using the local test repository.
If you try to use the official repository before it contains your metadata, the tests will not pass in CI.

Once an AWS team member adds your metadata, they will prompt you to update your PR to use the official repository.
7 changes: 4 additions & 3 deletions docs/development/add_model_to_model-zoo.md
@@ -1,6 +1,6 @@
- # Add a new model to the model zoo
+ # Add a new model to the DJL model zoo

- This document outlines the procedure to add new models into the model zoo.
+ This document outlines the procedure to add new models into the DJL model zoo.

## Step 1: Prepare the model files

@@ -31,6 +31,7 @@ For example, `image_classification/ai/djl/resnet`.
3. Copy model files into the version folder.

### Step 3: Create a `metadata.json` file

You need to create a `metadata.json` file for the model zoo to load the model. You can refer to the format in the `metadata.json` files for existing models to create your own.

For a model built as a DJL block, you must recreate the block before loading the parameters. As part of your `metadata.json` file, you should use the `arguments` property to specify the arguments required for the model loader to create another `Block` matching the one used to train the model.
@@ -51,7 +52,7 @@ Verify that your folder has the following files (see Step 1 for additional files)

The official DJL ML repository is located on an S3 bucket managed by the AWS DJL team.

- For non-team members, coordinate with a team member in your pull request to coordinate adding the necessary files.
+ For non-team members, coordinate with a team member in your pull request to add the necessary files.

For AWS team members, run the following command to upload your model to the S3 bucket:
