From 6f576c6b28b3bebe5bc6a01e0e252791efac2b3b Mon Sep 17 00:00:00 2001
From: fstahlberg <fstahlberg@google.com>
Date: Tue, 7 Dec 2021 13:53:31 +0100
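Subject: [PATCH] README: number setup steps, note partial .json.gz coverage

Prefix the four top-level section headings with step numbers, and state
that the .json.gz extraction path recovers most, not all, of the clean
target sentences, linking to issue #2 for details.
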
---
 README.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index a2d0481..163e846 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@ The following instructions have been tested in an
 [Anaconda](https://www.anaconda.com/) (version Anaconda3 2021.05) Python
 environment, but are expected to work in other Python 3 setups, too.
 
-### Install the dependencies
+### 1.) Install the dependencies
 
 Install the Abseil Python package with PIP:
 
@@ -16,7 +16,7 @@ Install the Abseil Python package with PIP:
 pip install absl-py
 ```
 
-### Download the C4\_200M corruptions
+### 2.) Download the C4\_200M corruptions
 
 Change to a new working directory and download the C4\_200M corruptions from
 [Kaggle Datasets](https://www.kaggle.com/felixstahlberg/the-c4-200m-dataset-for-gec):
@@ -43,7 +43,7 @@ second and third columns are byte start and end positions, and the fourth column
 contains the replacement text.
 
 
-### Extract C4\_200M target sentences from C4
+### 3.) Extract C4\_200M target sentences from C4
 
 C4\_200M uses a relatively small subset of C4 (200M sentences). There are two ways to obtain the C4\_200M target sentences: using TensorFlow Datasets or using the C4 version provided by [allenai](https://github.com/allenai/allennlp/discussions/5056).
 
@@ -74,7 +74,7 @@ The above reads 5 shards (00000 to 00004) at once and saves the target sentences
 
 #### Using the C4 Dataset in .json.gz Format
 
-Given a folder containing the C4 dataset compressed in `.json.gz` files as provided by [allenai](https://github.com/allenai/allennlp/discussions/5056), it is possible to fetch the clean target sentences as follows:
+Given a folder containing the C4 dataset compressed in `.json.gz` files as provided by [allenai](https://github.com/allenai/allennlp/discussions/5056), it is possible to fetch [most of the clean target sentences](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction/issues/2) as follows:
 
 ```
 python c4200m_get_target_sentences_json.py edits.tsv-00000-of-00010 /C4/en/target_sentences.tsv-00000-of-00010 &> get_target_sentences.log-00000-of-00010
@@ -86,7 +86,7 @@ Repeat for the remaining nine shards, optionally with trailing ampersand for par
 
 
 
-### Apply corruption edits
+### 4.) Apply corruption edits
 
 The mapping from the MD5 hash to the target sentence is written to
 `target_sentences.tsv*`: