Add a chat data preprocessing script (#1239)
* Add a chat data preprocessing script

* add EOT at end of a chat

* update README.md

* apply pre-commit

---------

Co-authored-by: Quentin Anthony <[email protected]>
dmahan93 and Quentin-Anthony authored Jun 25, 2024
1 parent 2608972 commit 0e5f6db
Showing 6 changed files with 412 additions and 10 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/pull_request.yml
@@ -52,7 +52,7 @@ jobs:
- name: install pytest
run: python3 -m pip install pytest pytest-forked pyyaml requests wandb
- name: install torch
-run: python3 -m pip install torch
+run: python3 -m pip install torch
- name: install requirements
run: pip install -r requirements/requirements.txt
- name: Run Tests
12 changes: 6 additions & 6 deletions megatron/data/helpers.cpp
@@ -428,9 +428,9 @@ py::array build_mapping_impl(const py::array_t<int64_t>& docs_,
}

} // for (auto sent_index=sent_index_first; ...
-} // if (num_remain_sent > 1) {
-} // for (int doc=0; doc < num_docs; ++doc) {
-} // for (int epoch=0; epoch < num_epochs; ++epoch) {
+} // if (num_remain_sent > 1) {
+} // for (int doc=0; doc < num_docs; ++doc) {
+} // for (int epoch=0; epoch < num_epochs; ++epoch) {

if (!second) {
if (verbose) {
@@ -660,9 +660,9 @@ py::array build_blocks_mapping_impl(const py::array_t<int64_t>& docs_,
num_sent = 0;
}
} // for (auto sent_index=sent_index_first; ...
-} // if (num_remain_sent > 1) {
-} // for (int doc=0; doc < num_docs; ++doc) {
-} // for (int epoch=0; epoch < num_epochs; ++epoch) {
+} // if (num_remain_sent > 1) {
+} // for (int doc=0; doc < num_docs; ++doc) {
+} // for (int epoch=0; epoch < num_epochs; ++epoch) {

if (!second) {
if (verbose) {
1 change: 1 addition & 0 deletions megatron/model/norms.py
@@ -24,6 +24,7 @@ def get_norm(neox_args):
eps = neox_args.layernorm_epsilon
if neox_args.layernorm_fusion:
from .fused_layer_norm import MixedFusedLayerNorm
+
norm = MixedFusedLayerNorm
else:
norm = LayerNorm
8 changes: 5 additions & 3 deletions megatron/neox_arguments/arguments.py
@@ -794,7 +794,9 @@ def calculate_batch_parameters(

# either none of the three parameters are provided or just gradient_accumulation_step is provided
else:
assert False, "Either train_batch_size or train_micro_batch_size_per_gpu needs to be provided"
assert (
False
), "Either train_batch_size or train_micro_batch_size_per_gpu needs to be provided"
return int(train_batch), int(micro_batch), int(grad_acc)

@staticmethod
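
For context, the three batch settings reconciled by `calculate_batch_parameters` obey a single invariant — the standard DeepSpeed relation. A minimal gloss with hypothetical names, not the repository's code:

```python
def check_batch_params(train_batch, micro_batch, grad_acc, dp_world_size):
    # train_batch must equal micro_batch * grad_acc * dp_world_size;
    # given any two of the first three, the third can be derived.
    assert train_batch == micro_batch * grad_acc * dp_world_size, (
        f"Inconsistent batch parameters: {train_batch} != "
        f"{micro_batch} * {grad_acc} * {dp_world_size}"
    )
```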
@@ -1098,8 +1100,8 @@ def calculate_derived(self):
if "flash" in self.attention_config:
_flash_version = packaging.version.Version(version("flash-attn"))
if self.sliding_window_width is not None:
-assert (
-    _flash_version >= packaging.version.Version("2.3.0")
+assert _flash_version >= packaging.version.Version(
+    "2.3.0"
), f"Flash-Attention version ({str(_flash_version)}) must be >= 2.3.0 to support sliding window attention."
if self.pos_emb == "alibi":
if not _flash_version >= packaging.version.Version("2.4.0.post1"):
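
Stand-alone, the gate above is an ordinary `packaging` version comparison. A self-contained sketch, under the assumption that the module pulls `version` from `importlib.metadata`:

```python
from importlib.metadata import version

import packaging.version

# Compare the installed flash-attn version against the minimum required.
_flash_version = packaging.version.Version(version("flash-attn"))
assert _flash_version >= packaging.version.Version(
    "2.3.0"
), f"Flash-Attention version ({_flash_version}) must be >= 2.3.0 to support sliding window attention."
```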
51 changes: 51 additions & 0 deletions tools/datasets/README.md
@@ -93,6 +93,57 @@ output data:
--dataset-impl {lazy,cached,mmap}
Dataset implementation to use. Default: mmap
runtime:
--workers WORKERS Number of worker processes to launch
--log-interval LOG_INTERVAL
Interval between progress updates
```
## `preprocess_data_with_chat_template.py`
Similar to `preprocess_data.py`, but uses Hugging Face [chat templates](https://huggingface.co/docs/transformers/main/en/chat_templating) to
tokenize the data, supporting multiturn conversations and more complicated use cases.
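
For orientation, the underlying Hugging Face call looks roughly like this — a sketch with a placeholder tokenizer name; the script's internals may differ:

```python
from transformers import AutoTokenizer

# Placeholder name; the script takes this from --tokenizer-path.
tok = AutoTokenizer.from_pretrained("your/hf-tokenizer")

chat = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]

# Render the multiturn chat with the tokenizer's template, then tokenize it.
ids = tok.apply_chat_template(chat, tokenize=True)
text = tok.apply_chat_template(chat, tokenize=False)
```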

N.B. If you use this script, you **must** specify your data when training/finetuning with the following config options:
```json
"train_data_paths": ["train_documents"],
"test_data_paths": ["test_documents"],
"valid_data_paths": ["test_documents"],
"label_data_paths": ["label_documents"]
```

the `"data_path"` option will not work with `"label_data_paths"`.


```
usage: preprocess_data_with_chat_template.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--no-mask]
[--generation-role GENERATION_ROLE] [--only-last] [--num-docs NUM_DOCS]
--tokenizer-path TOKENIZER_PATH [--ftfy] --output-prefix OUTPUT_PREFIX
[--dataset-impl {lazy,cached,mmap}] [--workers WORKERS]
[--log-interval LOG_INTERVAL]
options:
-h, --help show this help message and exit
input data:
--input INPUT Path to input jsonl files or lmd archive(s) - if using multiple archives, put them in a comma-separated list
--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]
Space-separated list of keys to extract from the jsonl. Default: text
--no-mask If set, this will not mask any tokens in the input data.
--generation-role GENERATION_ROLE
The role of the model generating the chat, usually 'assistant'. Default: assistant
--only-last If set, this will mask everything except the last turn in the chat.
--num-docs NUM_DOCS Optional: Number of documents in the input data (if known) for an accurate progress bar.
tokenizer:
--tokenizer-path TOKENIZER_PATH
Path to HF Tokenizer.
--ftfy Use ftfy to clean text
output data:
--output-prefix OUTPUT_PREFIX
Path to binary output file without suffix
--dataset-impl {lazy,cached,mmap}
Dataset implementation to use. Default: mmap
runtime:
--workers WORKERS Number of worker processes to launch
--log-interval LOG_INTERVAL
Interval between progress updates
```
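
To make `--generation-role` and `--only-last` concrete: only tokens belonging to the generation role's turns should contribute to the loss. The sketch below is a hypothetical illustration of that masking, not the script's actual code — it assumes each chat prefix tokenizes to a prefix of the full tokenization, which only approximately holds under some templates:

```python
def loss_mask(chat, tokenizer, generation_role="assistant", only_last=False):
    """Return per-token 0/1 flags: 1 where the loss should apply."""
    ids = tokenizer.apply_chat_template(chat, tokenize=True)
    mask = [0] * len(ids)
    targets = [i for i, turn in enumerate(chat) if turn["role"] == generation_role]
    if only_last:
        targets = targets[-1:]  # keep loss on the last matching turn only
    for i in targets:
        # Tokens contributed by turn i: the full prefix through turn i,
        # minus the prefix through turn i - 1.
        start = len(tokenizer.apply_chat_template(chat[:i], tokenize=True)) if i else 0
        end = len(tokenizer.apply_chat_template(chat[: i + 1], tokenize=True))
        for j in range(start, min(end, len(mask))):
            mask[j] = 1
    return mask
```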