Doc to dataset (#18037)
* Link to the Datasets doc

* Remove unwanted file
sgugger authored Jul 6, 2022
1 parent be79cd7 commit 2e90c3d
Showing 16 changed files with 34 additions and 34 deletions.
docs/source/en/perf_train_gpu_one.mdx (2 changes: 1 addition & 1 deletion)

@@ -33,7 +33,7 @@ pip install transformers datasets accelerate nvidia-ml-py3

The `nvidia-ml-py3` library allows us to monitor the memory usage of the models from within Python. You might be familiar with the `nvidia-smi` command in the terminal - this library allows us to access the same information in Python directly.

- Then we create some dummy data. We create random token IDs between 100 and 30000 and binary labels for a classifier. In total we get 512 sequences each with length 512 and store them in a [`Dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=dataset#datasets.Dataset) with PyTorch format.
+ Then we create some dummy data. We create random token IDs between 100 and 30000 and binary labels for a classifier. In total we get 512 sequences each with length 512 and store them in a [`~datasets.Dataset`] with PyTorch format.


```py
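A minimal sketch of that dummy-data setup, assuming the shapes described in the paragraph above, could look like this:

```py
import numpy as np
from datasets import Dataset

seq_len, dataset_size = 512, 512

# Random token IDs between 100 and 30000, plus binary labels for a classifier
dummy_data = {
    "input_ids": np.random.randint(100, 30000, (dataset_size, seq_len)),
    "labels": np.random.randint(0, 2, (dataset_size,)),
}

ds = Dataset.from_dict(dummy_data)
ds.set_format("torch")  # index into the dataset and get PyTorch tensors back
```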
docs/source/en/preprocessing.mdx (2 changes: 1 addition & 1 deletion)

@@ -244,7 +244,7 @@ For example, the [MInDS-14](https://huggingface.co/datasets/PolyAI/minds14) data
'sampling_rate': 8000}
```

- 1. Use 🤗 Datasets' [`cast_column`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.cast_column) method to upsample the sampling rate to 16kHz:
+ 1. Use 🤗 Datasets' [`~datasets.Dataset.cast_column`] method to upsample the sampling rate to 16kHz:

```py
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
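For context, the full resampling step looks roughly like this; the `load_dataset` arguments follow the MInDS-14 example referenced in the hunk and are illustrative:

```py
from datasets import Audio, load_dataset

# MInDS-14 ships its audio at a sampling rate of 8kHz
minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

# Re-decode the audio column at 16kHz; resampling happens lazily, on access
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds[0]["audio"]["sampling_rate"]  # 16000
```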
docs/source/en/tasks/asr.mdx (2 changes: 1 addition & 1 deletion)

@@ -117,7 +117,7 @@ The preprocessing function needs to:
... return batch
```

- Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the map function by increasing the number of processes with `num_proc`. Remove the columns you don't need:
+ Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the map function by increasing the number of processes with `num_proc`. Remove the columns you don't need:

```py
>>> encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4)
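A hypothetical `prepare_dataset` to go with this `map` call might look as follows; the checkpoint and column names are assumptions, not the tutorial's exact code:

```py
from datasets import Audio, load_dataset
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")  # assumed checkpoint
minds = load_dataset("PolyAI/minds14", name="en-US", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare_dataset(batch):
    audio = batch["audio"]
    # Waveform -> model inputs; transcription -> label IDs
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
    return batch

# num_proc=4 fans the work out over four processes; the original columns are dropped
encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names, num_proc=4)
```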
docs/source/en/tasks/audio_classification.mdx (2 changes: 1 addition & 1 deletion)

@@ -129,7 +129,7 @@ The preprocessing function needs to:
... return inputs
```

- Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need, and rename `intent_class` to `label` because that is what the model expects:
+ Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need, and rename `intent_class` to `label` because that is what the model expects:

```py
>>> encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
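With `batched=True` the preprocessing function receives lists rather than single examples. A sketch consistent with that, reusing the `minds` dataset from the previous sketch, with an assumed checkpoint and max length:

```py
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")  # assumed checkpoint

def preprocess_function(examples):
    # examples["audio"] is a list of decoded audio dicts when batched=True
    audio_arrays = [x["array"] for x in examples["audio"]]
    return feature_extractor(audio_arrays, sampling_rate=16_000, max_length=16_000, truncation=True)

encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
encoded_minds = encoded_minds.rename_column("intent_class", "label")  # the name the model expects
```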
docs/source/en/tasks/image_classification.mdx (2 changes: 1 addition & 1 deletion)

@@ -95,7 +95,7 @@ Create a preprocessing function that will apply the transforms and return the `p
... return examples
```

- Use 🤗 Dataset's [`with_transform`](https://huggingface.co/docs/datasets/package_reference/main_classes.html?#datasets.Dataset.with_transform) method to apply the transforms over the entire dataset. The transforms are applied on-the-fly when you load an element of the dataset:
+ Use 🤗 Dataset's [`~datasets.Dataset.with_transform`] method to apply the transforms over the entire dataset. The transforms are applied on-the-fly when you load an element of the dataset:

```py
>>> food = food.with_transform(transforms)
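A minimal `transforms` function for `with_transform`, assuming a dataset `food` with a PIL-image `image` column:

```py
from torchvision.transforms import Compose, RandomResizedCrop, ToTensor

_transforms = Compose([RandomResizedCrop(224), ToTensor()])

def transforms(examples):
    # Runs lazily: only when an element is accessed, and nothing is written to the cache
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
    del examples["image"]
    return examples

food = food.with_transform(transforms)
food["train"][0]["pixel_values"]  # the transform fires here, on access
```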
docs/source/en/tasks/language_modeling.mdx (6 changes: 3 additions & 3 deletions)

@@ -118,7 +118,7 @@ Here is how you can create a preprocessing function to convert the list to a str
... return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)
```

- Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once and increasing the number of processes with `num_proc`. Remove the columns you don't need:
+ Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once and increasing the number of processes with `num_proc`. Remove the columns you don't need:

```py
>>> tokenized_eli5 = eli5.map(
@@ -245,7 +245,7 @@ At this point, only three steps remain:
```
</pt>
<tf>
- To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+ To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> tf_train_set = lm_dataset["train"].to_tf_dataset(
@@ -352,7 +352,7 @@ At this point, only three steps remain:
```
</pt>
<tf>
- To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+ To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> tf_train_set = lm_dataset["train"].to_tf_dataset(
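The full `to_tf_dataset` call in these tutorials follows this general shape; `lm_dataset` and `tokenizer` come from the surrounding guide, and the batch size is illustrative:

```py
from transformers import DataCollatorForLanguageModeling

# The collator pads each batch and, with mlm=False, builds causal-LM labels
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")

tf_train_set = lm_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
```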
docs/source/en/tasks/multiple_choice.mdx (4 changes: 2 additions & 2 deletions)

@@ -79,7 +79,7 @@ The preprocessing function needs to do:
... return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
```

- Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+ Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

```py
tokenized_swag = swag.map(preprocess_function, batched=True)
@@ -224,7 +224,7 @@ At this point, only three steps remain:
```
</pt>
<tf>
- To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs in `columns`, targets in `label_cols`, whether to shuffle the dataset order, batch size, and the data collator:
+ To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs in `columns`, targets in `label_cols`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
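The dictionary comprehension in the hunk above regroups a flat list of tokenized (context, ending) pairs back into one entry of four candidates per question. Roughly, with SWAG-style field names assumed:

```py
def preprocess_function(examples):
    # Pair each context with its four candidate endings
    first = [[ctx] * 4 for ctx in examples["sent1"]]
    second = [
        [f"{header} {examples[f'ending{j}'][i]}" for j in range(4)]
        for i, header in enumerate(examples["sent2"])
    ]
    # Flatten so the tokenizer sees all pairs at once...
    tokenized = tokenizer(sum(first, []), sum(second, []), truncation=True)
    # ...then regroup every four consecutive rows into one multiple-choice example
    return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized.items()}
```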
docs/source/en/tasks/question_answering.mdx (4 changes: 2 additions & 2 deletions)

@@ -126,7 +126,7 @@ Here is how you can create a function to truncate and map the start and end toke
... return inputs
```

- Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need:
+ Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove the columns you don't need:

```py
>>> tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
@@ -199,7 +199,7 @@ At this point, only three steps remain:
```
</pt>
<tf>
- To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and the start and end positions of an answer in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+ To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and the start and end positions of an answer in `columns`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> tf_train_set = tokenized_squad["train"].to_tf_dataset(
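The start/end mapping mentioned here leans on the tokenizer's offset mapping. A simplified, self-contained sketch of the idea (checkpoint and example are placeholders):

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed checkpoint

question = "What does extractive QA return?"
context = "Extractive question answering returns a span copied from the context."
answer_text = "a span copied from the context"
start_char = context.find(answer_text)
end_char = start_char + len(answer_text)

inputs = tokenizer(
    question,
    context,
    truncation="only_second",      # truncate the context, never the question
    return_offsets_mapping=True,   # (char_start, char_end) for every token
)

# sequence_ids() marks question tokens with 0 and context tokens with 1
offsets, seq_ids = inputs["offset_mapping"], inputs.sequence_ids()
start_token = next(i for i, (s, e) in enumerate(offsets) if seq_ids[i] == 1 and s <= start_char < e)
end_token = next(i for i, (s, e) in enumerate(offsets) if seq_ids[i] == 1 and s < end_char <= e)
```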
docs/source/en/tasks/sequence_classification.mdx (4 changes: 2 additions & 2 deletions)

@@ -66,7 +66,7 @@ Create a preprocessing function to tokenize `text` and truncate sequences to be
... return tokenizer(examples["text"], truncation=True)
```

- Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+ Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

```py
tokenized_imdb = imdb.map(preprocess_function, batched=True)
@@ -144,7 +144,7 @@ At this point, only three steps remain:
</Tip>
</pt>
<tf>
- To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+ To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> tf_train_set = tokenized_imdb["train"].to_tf_dataset(
docs/source/en/tasks/summarization.mdx (4 changes: 2 additions & 2 deletions)

@@ -85,7 +85,7 @@ The preprocessing function needs to:
... return model_inputs
```

- Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+ Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

```py
>>> tokenized_billsum = billsum.map(preprocess_function, batched=True)
@@ -160,7 +160,7 @@ At this point, only three steps remain:
```
</pt>
<tf>
- To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+ To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> tf_train_set = tokenized_billsum["train"].to_tf_dataset(
docs/source/en/tasks/token_classification.mdx (4 changes: 2 additions & 2 deletions)

@@ -126,7 +126,7 @@ Here is how you can create a function to realign the tokens and labels, and trun
... return tokenized_inputs
```

- Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to tokenize and align the labels over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+ Use 🤗 Datasets [`~datasets.Dataset.map`] function to tokenize and align the labels over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

```py
>>> tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)
@@ -199,7 +199,7 @@ At this point, only three steps remain:
```
</pt>
<tf>
- To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+ To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> tf_train_set = tokenized_wnut["train"].to_tf_dataset(
docs/source/en/tasks/translation.mdx (4 changes: 2 additions & 2 deletions)

@@ -87,7 +87,7 @@ The preprocessing function needs to:
... return model_inputs
```

- Use 🤗 Datasets [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
+ Use 🤗 Datasets [`~datasets.Dataset.map`] function to apply the preprocessing function over the entire dataset. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:

```py
>>> tokenized_books = books.map(preprocess_function, batched=True)
@@ -162,7 +162,7 @@ At this point, only three steps remain:
```
</pt>
<tf>
- To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset). Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:
+ To fine-tune a model in TensorFlow, start by converting your datasets to the `tf.data.Dataset` format with [`~datasets.Dataset.to_tf_dataset`]. Specify inputs and labels in `columns`, whether to shuffle the dataset order, batch size, and the data collator:

```py
>>> tf_train_set = tokenized_books["train"].to_tf_dataset(
docs/source/en/training.mdx (2 changes: 1 addition & 1 deletion)

@@ -169,7 +169,7 @@ The [`DefaultDataCollator`] assembles tensors into a batch for the model to trai

</Tip>

- Next, convert the tokenized datasets to TensorFlow datasets with the [`to_tf_dataset`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.to_tf_dataset) method. Specify your inputs in `columns`, and your label in `label_cols`:
+ Next, convert the tokenized datasets to TensorFlow datasets with the [`~datasets.Dataset.to_tf_dataset`] method. Specify your inputs in `columns`, and your label in `label_cols`:

```py
>>> tf_train_dataset = small_train_dataset.to_tf_dataset(
src/transformers/modeling_tf_utils.py (4 changes: 2 additions & 2 deletions)

@@ -1189,15 +1189,15 @@ def prepare_tf_dataset(
prefetch: bool = True,
):
"""
- Wraps a HuggingFace `datasets.Dataset` as a `tf.data.Dataset` with collation and batching. This method is
+ Wraps a HuggingFace [`~datasets.Dataset`] as a `tf.data.Dataset` with collation and batching. This method is
designed to create a "ready-to-use" dataset that can be passed directly to Keras methods like `fit()` without
further modification. The method will drop columns from the dataset if they don't match input names for the
model. If you want to specify the column names to return rather than using the names that match this model, we
recommend using `Dataset.to_tf_dataset()` instead.
Args:
dataset (`Any`):
- A `datasets.Dataset` to be wrapped as a `tf.data.Dataset`.
+ A [`~datasets.Dataset`] to be wrapped as a `tf.data.Dataset`.
batch_size (`int`, defaults to 8):
The size of batches to return.
shuffle (`bool`, defaults to `True`):
Expand Down
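Going by the signature shown in this hunk, usage would look something like the following; the checkpoints are placeholders and `tokenized_dataset` stands for any tokenized `datasets.Dataset`:

```py
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")

# Columns that don't match the model's input names are dropped automatically
tf_dataset = model.prepare_tf_dataset(
    tokenized_dataset,
    batch_size=8,
    shuffle=True,
    tokenizer=tokenizer,  # lets the method build a default collator that pads each batch
)

model.compile(optimizer="adam")  # Transformers TF models can compute loss internally
model.fit(tf_dataset)
```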