Merge pull request #3 from Giovanni-Alzetta/allenai_script

Adding script to read allenai version of the dataset
google-research-datasets · Dec 7, 2021 · cb61541
2 parents 918c19f + 75a57a3
commit cb61541

Showing 3 changed files with 106 additions and 21 deletions.
diff --git a/README.md b/README.md
@@ -10,16 +10,12 @@ environment, but are expected to work in other Python 3 setups, too.
 
 ### Install the dependencies
 
-Install the TensorFlow Datasets and Abseil Python packages with PIP:
+Install the Abseil Python package with PIP:
 
 ```
-pip install tensorflow-datasets absl-py
+pip install absl-py
 ```
 
-### Setup C4 in TensorFlow Datasets
-
-Obtain the **C4 corpus version 2.2.1** by following [these instructions](https://www.tensorflow.org/datasets/catalog/c4). More recent versions such as version 3.0.1 provided by [allenai](https://github.com/allenai/allennlp/discussions/5056) may also [work](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction/issues/2).
-
 ### Download the C4\_200M corruptions
 
 Change to a new working directory and download the C4\_200M corruptions from
@@ -49,27 +45,24 @@ contains the replacement text.
 
 ### Extract C4\_200M target sentences from C4
 
-C4\_200M uses a relatively small subset of C4 (200M sentences). The `c4200m_get_target_sentences.py` script fetches the clean target sentences from C4 for a single shard:
+C4\_200M uses a relatively small subset of C4 (200M sentences). There are two ways to obtain the C4\_200M target sentences: using TensorFlow Datasets or using the C4 version provided by [allenai](https://github.com/allenai/allennlp/discussions/5056).
+
+#### Using TensorFlow Datasets
+
+Install the TensorFlow Datasets Python package with PIP:
 
 ```
-python c4200m_get_target_sentences.py edits.tsv-00000-of-00010 target_sentences.tsv-00000-of-00010 &> get_target_sentences.log-00000-of-00010
+pip install tensorflow-datasets
 ```
 
-The mapping from the MD5 hash to the target sentence is written to
-`target_sentences.tsv*`:
+Obtain the **C4 corpus version 2.2.1** by following [these instructions](https://www.tensorflow.org/datasets/catalog/c4). The `c4200m_get_target_sentences.py` script fetches the clean target sentences from C4 for a single shard:
 
 ```
-$ head -n 3 target_sentences.tsv-00000-of-00010
-
-00000002020d286371dd59a2f8a900e6	Bitcoin goes for $7,094 this morning, according to CoinDesk.
-00000069b517cf07c79124fae6ebd0d8	1. The effect of "widespread dud" targets two face up attack position monsters on the field.
-0000006dce3b7c10a6ad25736c173506	Capital Gains tax on the sale of properties for non-residents is set at 21% for 2014 and 20% in 2015 payable on profits earned on the difference of the property value between the year of purchase (purchase price plus costs) and the year of sale (sales price minus costs), based on the approved annual percentage increase on the base value approved by law.
+python c4200m_get_target_sentences.py edits.tsv-00000-of-00010 target_sentences.tsv-00000-of-00010 &> get_target_sentences.log-00000-of-00010
 ```
 
-Repeat for the remaining nine shards, optionally with trailing ampersand for parallel
-processing.
 
-You can also run the concurrent script with the `concurrent-runs` parameter 
+Repeat for the remaining nine shards, optionally with trailing ampersand for parallel processing. You can also run the concurrent script with the `concurrent-runs` parameter 
 to check multiple shards at the same time.
 
 ```
@@ -78,8 +71,34 @@ python c4200m_get_target_sentences_concurrent.py edits.tsv-00000-of-00010 target
 
 The above reads 5 shards (00000 to 00004) at once and saves the target sentences to their corresponding files.
 
+
+#### Using the C4 Dataset in .json.gz Format
+
+Given a folder containing the C4 dataset compressed in `.json.gz` files as provided by [allenai](https://github.com/allenai/allennlp/discussions/5056), it is possible to fetch the clean target sentences as follows:
+
+```
+python c4200m_get_target_sentences_json.py edits.tsv-00000-of-00010 /C4/en/ target_sentences.tsv-00000-of-00010 &> get_target_sentences.log-00000-of-00010
+```
+
+where we assume the training examples of the C4 dataset are located in `/C4/en/*train*.json.gz`.
+
+Repeat for the remaining nine shards, optionally with trailing ampersand for parallel processing.
+
+
+
 ### Apply corruption edits
 
+The mapping from the MD5 hash to the target sentence is written to
+`target_sentences.tsv*`:
+
+```
+$ head -n 3 target_sentences.tsv-00000-of-00010
+
+00000002020d286371dd59a2f8a900e6	Bitcoin goes for $7,094 this morning, according to CoinDesk.
+00000069b517cf07c79124fae6ebd0d8	1. The effect of "widespread dud" targets two face up attack position monsters on the field.
+0000006dce3b7c10a6ad25736c173506	Capital Gains tax on the sale of properties for non-residents is set at 21% for 2014 and 20% in 2015 payable on profits earned on the difference of the property value between the year of purchase (purchase price plus costs) and the year of sale (sales price minus costs), based on the approved annual percentage increase on the base value approved by law.
+```
+
 To generate the final parallel dataset the edits in `edit.tsv*` have to be
 applied to the sentences in `target_sentences.tsv*`:
 

diff --git a/c4200m_get_target_sentences.py b/c4200m_get_target_sentences.py
@@ -40,9 +40,8 @@ def main(argv):
         (len(target_sentences), len(remaining_hashes)))
   print("Writing C4_200M sentence pairs to %r..." % output_tsv_path)
   with open(output_tsv_path, "w") as output_tsv_writer:
-    with open(edits_tsv_path) as edits_tsv_reader:
-      while target_sentences:
-        output_tsv_writer.write("%s\t%s\n" % heapq.heappop(target_sentences))
+    while target_sentences:
+      output_tsv_writer.write("%s\t%s\n" % heapq.heappop(target_sentences))
 
 
 if __name__ == "__main__":

diff --git a/c4200m_get_target_sentences_json.py b/c4200m_get_target_sentences_json.py
@@ -0,0 +1,67 @@
+"""Looks up C4 sentences by their hashes and stores them in a TSV file.
+
+This script iterates over all *train*.json.gz files contained in a folder.
+To use it, download the c4/en/ folder following the instructions here:
+https://github.com/allenai/allennlp/discussions/5056
+
+To extract all examples (183894319 for the v3.0.1 dataset) you have to run
+the script on all edit files provided."""
+
+import hashlib
+import heapq
+import os.path
+
+from absl import app
+
+import gzip
+import json
+
+LOGGING_STEPS = 100000
+
+
+def main(argv):
+    if len(argv) != 4:
+        raise app.UsageError(
+            "python3 c4200m_get_target_sentences_json.py <edits-tsv> <dataset_dir> "
+            "<output-tsv>")
+    edits_tsv_path = argv[1]
+    dataset_dir = argv[2]
+    output_tsv_path = argv[3]
+
+    print("Loading C4_200M target sentence hashes from %r..." % edits_tsv_path)
+    remaining_hashes = set()
+    with open(edits_tsv_path) as edits_tsv_reader:
+        for tsv_line in edits_tsv_reader:
+            remaining_hashes.add(tsv_line.split("\t", 1)[0])
+    print("Searching for %d target sentences in the C4 dataset..." %
+          len(remaining_hashes))
+    target_sentences = []
+    num_tot_examples = 0
+    for file_json in os.listdir(dataset_dir):
+        if not (file_json.endswith('.json.gz') and 'train' in file_json):
+            continue
+        dataset_file = os.path.join(dataset_dir, file_json)
+        with gzip.open(dataset_file, 'r') as f_in:
+            # Initialized here so the count stays valid for an empty file.
+            num_done_examples = 0
+            for num_done_examples, example in enumerate(f_in, start=1):
+                example = json.loads(example)
+                for line in example["text"].split("\n"):
+                    line_md5 = hashlib.md5(line.encode("utf-8")).hexdigest()
+                    if line_md5 in remaining_hashes:
+                        heapq.heappush(target_sentences, (line_md5, line))
+                        remaining_hashes.remove(line_md5)
+                if not remaining_hashes:
+                    break
+                if (num_tot_examples + num_done_examples) % LOGGING_STEPS == 0:
+                    print("-- %d C4 examples done, %d sentences still to be found" %
+                          ((num_tot_examples + num_done_examples), len(remaining_hashes)))
+            num_tot_examples += num_done_examples
+        # The inner break only leaves the current file; stop scanning the rest too.
+        if not remaining_hashes:
+            break
+    print("Found %d target sentences (%d not found)." %
+          (len(target_sentences), len(remaining_hashes)))
+    print("Writing C4_200M sentence pairs to %r..." % output_tsv_path)
+    with open(output_tsv_path, "w") as output_tsv_writer:
+        while target_sentences:
+            output_tsv_writer.write("%s\t%s\n" % heapq.heappop(target_sentences))
+
+
+if __name__ == "__main__":
+    app.run(main)
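
The lookup that `c4200m_get_target_sentences_json.py` performs can be tried out on a small in-memory example. The sketch below is not part of this pull request: the helper names `md5_hex` and `find_target_sentences` and the two sample documents are made up for illustration. It assumes, as the script does, that each line of a `.json.gz` file is a JSON object with a `text` field and that target sentences are identified by the MD5 hash of a single text line.

```python
import gzip
import hashlib
import heapq
import io
import json


def md5_hex(line):
    """Hex MD5 digest of a UTF-8 encoded line (hypothetical helper)."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()


def find_target_sentences(json_lines, wanted_hashes):
    """Collects (hash, line) pairs for every text line whose hash is wanted.

    Returns the pairs sorted by hash and the set of hashes that were not found.
    """
    remaining = set(wanted_hashes)
    heap = []
    for raw in json_lines:
        example = json.loads(raw)
        for line in example["text"].split("\n"):
            line_md5 = md5_hex(line)
            if line_md5 in remaining:
                heapq.heappush(heap, (line_md5, line))
                remaining.remove(line_md5)
        if not remaining:
            break  # Everything was found; no need to read further.
    # Popping from the heap yields the pairs in ascending hash order.
    found = [heapq.heappop(heap) for _ in range(len(heap))]
    return found, remaining


# Build a tiny gzip-compressed JSON-lines "shard" in memory (sample data).
docs = [
    {"text": "Hello world.\nSecond line."},
    {"text": "Another document."},
]
buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8") as f_out:
    for doc in docs:
        f_out.write(json.dumps(doc) + "\n")
buf.seek(0)

# Two hashes that occur in the sample data and one that does not.
wanted = {
    md5_hex("Second line."),
    md5_hex("Another document."),
    md5_hex("not in the corpus"),
}
with gzip.open(buf, "rt", encoding="utf-8") as f_in:
    found, missing = find_target_sentences(f_in, wanted)

for line_md5, sentence in found:
    print("%s\t%s" % (line_md5, sentence))
print("%d not found" % len(missing))
```

Using a heap keyed on the hash means the output comes out sorted by hash, which matches the ordering of the `target_sentences.tsv*` sample shown in the README.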