Merge pull request #3 from Giovanni-Alzetta/allenai_script
Adding script to read allenai version of the dataset
fstahlberg authored Dec 7, 2021
2 parents 918c19f + 75a57a3 commit cb61541
Showing 3 changed files with 106 additions and 21 deletions.
55 changes: 37 additions & 18 deletions README.md
@@ -10,16 +10,12 @@ environment, but are expected to work in other Python 3 setups, too.

### Install the dependencies

-Install the TensorFlow Datasets and Abseil Python packages with PIP:
+Install the Abseil Python package with PIP:

```
-pip install tensorflow-datasets absl-py
+pip install absl-py
```

-### Setup C4 in TensorFlow Datasets
-
-Obtain the **C4 corpus version 2.2.1** by following [these instructions](https://www.tensorflow.org/datasets/catalog/c4). More recent versions such as version 3.0.1 provided by [allenai](https://github.com/allenai/allennlp/discussions/5056) may also [work](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction/issues/2).

### Download the C4\_200M corruptions

Change to a new working directory and download the C4\_200M corruptions from
@@ -49,27 +45,24 @@ contains the replacement text.

### Extract C4\_200M target sentences from C4

-C4\_200M uses a relatively small subset of C4 (200M sentences). The `c4200m_get_target_sentences.py` script fetches the clean target sentences from C4 for a single shard:
+C4\_200M uses a relatively small subset of C4 (200M sentences). There are two ways to obtain the C4\_200M target sentences: using TensorFlow Datasets or using the C4 version provided by [allenai](https://github.com/allenai/allennlp/discussions/5056).

+#### Using TensorFlow Datasets
+
+Install the TensorFlow Datasets Python package with PIP:

```
-python c4200m_get_target_sentences.py edits.tsv-00000-of-00010 target_sentences.tsv-00000-of-00010 &> get_target_sentences.log-00000-of-00010
+pip install tensorflow-datasets
```

-The mapping from the MD5 hash to the target sentence is written to
-`target_sentences.tsv*`:
+Obtain the **C4 corpus version 2.2.1** by following [these instructions](https://www.tensorflow.org/datasets/catalog/c4). The `c4200m_get_target_sentences.py` script fetches the clean target sentences from C4 for a single shard:

```
-$ head -n 3 target_sentences.tsv-00000-of-00010
-00000002020d286371dd59a2f8a900e6 Bitcoin goes for $7,094 this morning, according to CoinDesk.
-00000069b517cf07c79124fae6ebd0d8 1. The effect of "widespread dud" targets two face up attack position monsters on the field.
-0000006dce3b7c10a6ad25736c173506 Capital Gains tax on the sale of properties for non-residents is set at 21% for 2014 and 20% in 2015 payable on profits earned on the difference of the property value between the year of purchase (purchase price plus costs) and the year of sale (sales price minus costs), based on the approved annual percentage increase on the base value approved by law.
+python c4200m_get_target_sentences.py edits.tsv-00000-of-00010 target_sentences.tsv-00000-of-00010 &> get_target_sentences.log-00000-of-00010
```
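The TFDS-based script streams C4 and matches sentence hashes line by line, analogous to the `.json.gz` variant added by this commit. A minimal illustrative sketch of that lookup idea (not the repository's script; it assumes the `c4/en` 2.2.1 build is already prepared locally):

```
# Illustrative sketch, not the repository's script. Assumes the c4/en 2.2.1
# TFDS build has already been prepared locally.
import hashlib

import tensorflow_datasets as tfds

wanted = {"00000002020d286371dd59a2f8a900e6"}  # hashes from an edits.tsv shard
found = {}
ds = tfds.load("c4/en:2.2.1", split="train")
for example in ds.as_numpy_iterator():
    # Each C4 example is a document; C4_200M targets are its individual lines.
    for line in example["text"].decode("utf-8").split("\n"):
        md5 = hashlib.md5(line.encode("utf-8")).hexdigest()
        if md5 in wanted:
            found[md5] = line
            wanted.discard(md5)
    if not wanted:
        break
```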

-Repeat for the remaining nine shards, optionally with trailing ampersand for parallel
-processing.
-
-You can also run the concurrent script with the `concurrent-runs` parameter
+Repeat for the remaining nine shards, optionally with trailing ampersand for parallel processing. You can also run the concurrent script with the `concurrent-runs` parameter
to check multiple shards at the same time.

```
@@ -78,8 +71,34 @@ python c4200m_get_target_sentences_concurrent.py edits.tsv-00000-of-00010 target

The above reads 5 shards (00000 to 00004) at once and saves the target sentences to their corresponding files.
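For the ampersand-style parallelism, a hypothetical Python launcher (not part of the repository; shard file names follow the convention above) that runs the first five shards at once:

```
# Hypothetical launcher: run the per-shard lookup for shards 00000-00004 in
# parallel, mirroring the "trailing ampersand" approach described above.
import subprocess

procs = []
for shard in range(5):
    edits = "edits.tsv-%05d-of-00010" % shard
    out = "target_sentences.tsv-%05d-of-00010" % shard
    log = open("get_target_sentences.log-%05d-of-00010" % shard, "w")
    proc = subprocess.Popen(
        ["python", "c4200m_get_target_sentences.py", edits, out],
        stdout=log, stderr=subprocess.STDOUT)
    procs.append((proc, log))
for proc, log in procs:
    proc.wait()
    log.close()
```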


+#### Using the C4 Dataset in .json.gz Format
+
+Given a folder containing the C4 dataset compressed in `.json.gz` files as provided by [allenai](https://github.com/allenai/allennlp/discussions/5056), it is possible to fetch the clean target sentences as follows:
+
+```
+python c4200m_get_target_sentences_json.py edits.tsv-00000-of-00010 /C4/en/ target_sentences.tsv-00000-of-00010 &> get_target_sentences.log-00000-of-00010
+```
+
+where we assume the training examples of the C4 dataset are located in `/C4/en/*train*.json.gz`. Note that the script takes three arguments, so the dataset directory and the output TSV path are separate.
+
+Repeat for the remaining nine shards, optionally with trailing ampersand for parallel processing.
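Before launching, a quick sanity check (hypothetical snippet; the `/C4/en/` path is the assumption above) that the shards are actually visible:

```
# Hypothetical check: confirm the *train*.json.gz shards exist under the
# assumed /C4/en/ layout before running the lookup script.
import glob

shards = sorted(glob.glob("/C4/en/*train*.json.gz"))
print("found %d training shards" % len(shards))
```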



### Apply corruption edits

+The mapping from the MD5 hash to the target sentence is written to
+`target_sentences.tsv*`:
+
+```
+$ head -n 3 target_sentences.tsv-00000-of-00010
+00000002020d286371dd59a2f8a900e6 Bitcoin goes for $7,094 this morning, according to CoinDesk.
+00000069b517cf07c79124fae6ebd0d8 1. The effect of "widespread dud" targets two face up attack position monsters on the field.
+0000006dce3b7c10a6ad25736c173506 Capital Gains tax on the sale of properties for non-residents is set at 21% for 2014 and 20% in 2015 payable on profits earned on the difference of the property value between the year of purchase (purchase price plus costs) and the year of sale (sales price minus costs), based on the approved annual percentage increase on the base value approved by law.
+```

To generate the final parallel dataset, the edits in `edits.tsv*` have to be
applied to the sentences in `target_sentences.tsv*`:
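A minimal sketch of that step, assuming `edits.tsv*` columns of MD5 hash, span start, span end, and replacement text (the repository provides its own script for this; the file names below are illustrative):

```
# Hedged sketch, not the repository's own tool: apply span edits to the
# recovered target sentences, assuming (md5, start, end, replacement) columns.
from collections import defaultdict

targets = {}
with open("target_sentences.tsv-00000-of-00010") as f:
    for line in f:
        md5, sentence = line.rstrip("\n").split("\t", 1)
        targets[md5] = sentence

edits_by_hash = defaultdict(list)
with open("edits.tsv-00000-of-00010") as f:
    for line in f:
        md5, start, end, replacement = line.rstrip("\n").split("\t", 3)
        edits_by_hash[md5].append((int(start), int(end), replacement))

with open("sentence_pairs.tsv-00000-of-00010", "w") as out:
    for md5, edits in edits_by_hash.items():
        if md5 not in targets:
            continue  # target sentence was not found during extraction
        corrupted = targets[md5]
        # Apply spans right-to-left so earlier offsets stay valid.
        for start, end, replacement in sorted(edits, reverse=True):
            corrupted = corrupted[:start] + replacement + corrupted[end:]
        out.write("%s\t%s\n" % (corrupted, targets[md5]))
```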

5 changes: 2 additions & 3 deletions c4200m_get_target_sentences.py
@@ -40,9 +40,8 @@ def main(argv):
(len(target_sentences), len(remaining_hashes)))
print("Writing C4_200M sentence pairs to %r..." % output_tsv_path)
with open(output_tsv_path, "w") as output_tsv_writer:
-    with open(edits_tsv_path) as edits_tsv_reader:
-      while target_sentences:
-        output_tsv_writer.write("%s\t%s\n" % heapq.heappop(target_sentences))
+    while target_sentences:
+      output_tsv_writer.write("%s\t%s\n" % heapq.heappop(target_sentences))


if __name__ == "__main__":
67 changes: 67 additions & 0 deletions c4200m_get_target_sentences_json.py
@@ -0,0 +1,67 @@
"""Looks up C4 sentences by their hashes and stores them in a TSV file.
This scripts cycles on all *train*.json.gz files contained in a folder,
to use it you can download the c4/en/ folder following the instructions here:
https://github.com/allenai/allennlp/discussions/5056
To extract all examples (183894319 for the v3.0.1 dataset) you have to run
the script on all edit files provided."""

import gzip
import hashlib
import heapq
import json
import os.path

from absl import app

LOGGING_STEPS = 100000


def main(argv):
if len(argv) != 4:
raise app.UsageError(
"python3 c4200m_get_target_sentences.py <edits-tsv> <dataset_dir> "
"<output-tsv>")
edits_tsv_path = argv[1]
dataset_dir = argv[2]
output_tsv_path = argv[3]

print("Loading C4_200M target sentence hashes from %r..." % edits_tsv_path)
remaining_hashes = set()
with open(edits_tsv_path) as edits_tsv_reader:
for tsv_line in edits_tsv_reader:
remaining_hashes.add(tsv_line.split("\t", 1)[0])
print("Searching for %d target sentences in the C4 dataset..." %
len(remaining_hashes))
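  # Matches are collected as (md5, sentence) pairs in a heap so that the
  # output TSV comes out sorted by hash.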
target_sentences = []
num_tot_examples = 0
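  # Scan every *train*.json.gz shard in dataset_dir; each line of a shard is
  # one JSON document whose "text" field is hashed line by line below.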
for file_json in os.listdir(dataset_dir):
if not (file_json.endswith('.json.gz') and 'train' in file_json):
continue
dataset_file = os.path.join(dataset_dir, file_json)
with gzip.open(dataset_file, 'r') as f_in:
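      # gzip.open in binary mode yields bytes; json.loads accepts bytes and
      # decodes one JSON object per line.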
for num_done_examples, example in enumerate(f_in):
example = json.loads(example)
for line in example["text"].split("\n"):
line_md5 = hashlib.md5(line.encode("utf-8")).hexdigest()
if line_md5 in remaining_hashes:
heapq.heappush(target_sentences, (line_md5, line))
remaining_hashes.remove(line_md5)
if not remaining_hashes:
break
if (num_tot_examples + num_done_examples) % LOGGING_STEPS == 0:
print("-- %d C4 examples done, %d sentences still to be found" %
((num_tot_examples + num_done_examples), len(remaining_hashes)))
    num_tot_examples += num_done_examples + 1  # enumerate() is zero-based.
print("Found %d target sentences (%d not found)." %
(len(target_sentences), len(remaining_hashes)))
print("Writing C4_200M sentence pairs to %r..." % output_tsv_path)
with open(output_tsv_path, "w") as output_tsv_writer:
while target_sentences:
output_tsv_writer.write("%s\t%s\n" % heapq.heappop(target_sentences))


if __name__ == "__main__":
app.run(main)
