Add a chat data preprocessing script #1239

Merged
2 changes: 1 addition & 1 deletion .github/workflows/pull_request.yml
@@ -52,7 +52,7 @@ jobs:
- name: install pytest
run: python3 -m pip install pytest pytest-forked pyyaml requests wandb
- name: install torch
-run: python3 -m pip install torch
+run: python3 -m pip install torch
- name: install requirements
run: pip install -r requirements/requirements.txt
- name: Run Tests
12 changes: 6 additions & 6 deletions megatron/data/helpers.cpp
@@ -428,9 +428,9 @@ py::array build_mapping_impl(const py::array_t<int64_t>& docs_,
}

} // for (auto sent_index=sent_index_first; ...
-} // if (num_remain_sent > 1) {
-} // for (int doc=0; doc < num_docs; ++doc) {
-} // for (int epoch=0; epoch < num_epochs; ++epoch) {
+} // if (num_remain_sent > 1) {
+} // for (int doc=0; doc < num_docs; ++doc) {
+} // for (int epoch=0; epoch < num_epochs; ++epoch) {

if (!second) {
if (verbose) {
@@ -660,9 +660,9 @@ py::array build_blocks_mapping_impl(const py::array_t<int64_t>& docs_,
num_sent = 0;
}
} // for (auto sent_index=sent_index_first; ...
-} // if (num_remain_sent > 1) {
-} // for (int doc=0; doc < num_docs; ++doc) {
-} // for (int epoch=0; epoch < num_epochs; ++epoch) {
+} // if (num_remain_sent > 1) {
+} // for (int doc=0; doc < num_docs; ++doc) {
+} // for (int epoch=0; epoch < num_epochs; ++epoch) {

if (!second) {
if (verbose) {
1 change: 1 addition & 0 deletions megatron/model/norms.py
@@ -24,6 +24,7 @@ def get_norm(neox_args):
eps = neox_args.layernorm_epsilon
if neox_args.layernorm_fusion:
from .fused_layer_norm import MixedFusedLayerNorm
+
norm = MixedFusedLayerNorm
else:
norm = LayerNorm
8 changes: 5 additions & 3 deletions megatron/neox_arguments/arguments.py
@@ -794,7 +794,9 @@ def calculate_batch_parameters(

# either none of the three parameters are provided or just gradient_accumulation_step is provided
else:
assert False, "Either train_batch_size or train_micro_batch_size_per_gpu needs to be provided"
assert (
False
), "Either train_batch_size or train_micro_batch_size_per_gpu needs to be provided"
return int(train_batch), int(micro_batch), int(grad_acc)

@staticmethod
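
For context on the assertion in the hunk above: the three batch-size knobs are tied together by `train_batch = micro_batch * grad_acc * data_parallel_size`, so the missing values can only be derived when at least one of the two batch sizes is supplied. A minimal sketch of that resolution logic, assuming a `dp_world_size` parameter — an illustration, not the actual `calculate_batch_parameters` implementation:

```python
def resolve_batch_sizes(train_batch=None, micro_batch=None, grad_acc=None, dp_world_size=1):
    """Solve train_batch = micro_batch * grad_acc * dp_world_size for the missing values."""
    if train_batch is not None and micro_batch is not None:
        # Both batch sizes known: gradient accumulation is fully determined.
        grad_acc = train_batch // (micro_batch * dp_world_size)
    elif train_batch is not None:
        # Only the global batch known: default grad_acc to 1, derive the micro batch.
        grad_acc = grad_acc or 1
        micro_batch = train_batch // (grad_acc * dp_world_size)
    elif micro_batch is not None:
        # Only the micro batch known: default grad_acc to 1, derive the global batch.
        grad_acc = grad_acc or 1
        train_batch = micro_batch * grad_acc * dp_world_size
    else:
        # Neither batch size given (grad_acc alone is not enough) -- the same
        # failure mode as the assert in the diff above.
        raise ValueError(
            "Either train_batch_size or train_micro_batch_size_per_gpu needs to be provided"
        )
    return int(train_batch), int(micro_batch), int(grad_acc)
```
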
@@ -1098,8 +1100,8 @@ def calculate_derived(self):
if "flash" in self.attention_config:
_flash_version = packaging.version.Version(version("flash-attn"))
if self.sliding_window_width is not None:
-assert (
-    _flash_version >= packaging.version.Version("2.3.0")
+assert _flash_version >= packaging.version.Version(
+    "2.3.0"
), f"Flash-Attention version ({str(_flash_version)}) must be >= 2.3.0 to support sliding window attention."
if self.pos_emb == "alibi":
if not _flash_version >= packaging.version.Version("2.4.0.post1"):
51 changes: 51 additions & 0 deletions tools/datasets/README.md
@@ -93,6 +93,57 @@ output data:
--dataset-impl {lazy,cached,mmap}
Dataset implementation to use. Default: mmap

runtime:
--workers WORKERS Number of worker processes to launch
--log-interval LOG_INTERVAL
Interval between progress updates
```
## `preprocess_data_with_chat_template.py`
Similar to `preprocess_data.py`, but uses Hugging Face's [chat templates](https://huggingface.co/docs/transformers/main/en/chat_templating) to tokenize the data, supporting multi-turn conversations and more complicated use cases.
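
For background, chat-template tokenization goes through the tokenizer's `apply_chat_template` method, which renders a list of role/content turns with the model's template before tokenizing. A minimal standalone sketch (the model name is an arbitrary example, not something this PR depends on):

```python
from transformers import AutoTokenizer

# Any HF tokenizer that ships a chat template works; this one is just an example.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]

# Render the turns through the model's chat template, then tokenize.
token_ids = tokenizer.apply_chat_template(messages, tokenize=True)
print(tokenizer.decode(token_ids))
```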

N.B. If using this script, you **must** specify your data when training/finetuning with the following config options:
```json
"train_data_paths": ["train_documents"],
"test_data_paths": ["test_documents"],
"valid_data_paths": ["test_documents"],
"label_data_paths": ["label_documents"]
```

the `"data_path"` option will not work with `"label_data_paths"`.


```
usage: preprocess_data_with_chat_template.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--no-mask]
[--generation-role GENERATION_ROLE] [--only-last] [--num-docs NUM_DOCS]
--tokenizer-path TOKENIZER_PATH [--ftfy] --output-prefix OUTPUT_PREFIX
[--dataset-impl {lazy,cached,mmap}] [--workers WORKERS]
[--log-interval LOG_INTERVAL]

options:
-h, --help show this help message and exit

input data:
--input INPUT Path to input jsonl files or lmd archive(s) - if using multiple archives, put them in a comma-separated list
--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]
space-separated list of keys to extract from jsonl. Default: text
--no-mask If set, this will not mask any tokens in the input data.
--generation-role GENERATION_ROLE
The role of the model generating the chat, usually 'assistant'. Default: assistant
--only-last If set, this will mask everything except the last turn in the chat.
--num-docs NUM_DOCS Optional: Number of documents in the input data (if known) for an accurate progress bar.

tokenizer:
--tokenizer-path TOKENIZER_PATH
Path to HF Tokenizer.
--ftfy Use ftfy to clean text

output data:
--output-prefix OUTPUT_PREFIX
Path to binary output file without suffix
--dataset-impl {lazy,cached,mmap}
Dataset implementation to use. Default: mmap

runtime:
--workers WORKERS Number of worker processes to launch
--log-interval LOG_INTERVAL
Interval between progress updates
```
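
A typical invocation might look like the following; the paths and worker count are placeholders, while the flags all come from the usage text above:

```bash
python tools/datasets/preprocess_data_with_chat_template.py \
  --input data/chats.jsonl \
  --jsonl-keys messages \
  --tokenizer-path checkpoints/tokenizer \
  --output-prefix data/chats \
  --dataset-impl mmap \
  --workers 8
```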