jsonl to binidx tool

This repository is a greatly simplified version of https://github.com/EleutherAI/gpt-neox that ONLY converts .jsonl files into .bin and .idx files. It can be used to prepare datasets for RWKV models (see https://github.com/BlinkDL/RWKV-LM).

The current RWKV models use the GPT-NeoX tokenizer, 20B_tokenizer.json:

python tools/preprocess_data.py --input ./sample.jsonl --output-prefix ./data/sample --vocab ./20B_tokenizer.json --dataset-impl mmap --tokenizer-type HFTokenizer --append-eod

The multilingual rwkv-4-world models use a new tokenizer, rwkv_vocab_v20230424.txt:

python tools/preprocess_data.py --input ./sample.jsonl --output-prefix ./data/sample --vocab ./rwkv_vocab_v20230424.txt --dataset-impl mmap --tokenizer-type RWKVTokenizer --append-eod

Sample of the jsonl format (one JSON object per line, one line for each document):

{"text": "This is the first document."}
{"text": "Hello\nWorld"}
{"text": "1+1=2\n1+2=3\n2+2=4"}

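Such a file can be generated by code like this:

import json
# meta and text hold one document's values; out is a file opened for writing in UTF-8
ss = json.dumps({"meta": meta, "text": text}, ensure_ascii=False)
out.write(ss + "\n")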
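The following is a minimal, self-contained sketch of how a jsonl file in this shape might be written and then checked before conversion. The helper names `write_jsonl` and `read_texts`, the `"source"` key inside `meta`, and the temporary output path are all hypothetical and chosen only for illustration. They are not part of this repository.

```python
import json
import os
import tempfile


def write_jsonl(docs, path):
    """Write one JSON object per line; docs is an iterable of (meta, text) pairs."""
    with open(path, "w", encoding="utf-8") as out:
        for meta, text in docs:
            # json.dumps escapes newlines inside text, so each document stays on one line.
            ss = json.dumps({"meta": meta, "text": text}, ensure_ascii=False)
            out.write(ss + "\n")


def read_texts(path):
    """Read the file back, checking that every line is an object with a string "text" field."""
    texts = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                raise ValueError(f"line {line_no}: missing string 'text' field")
            texts.append(record["text"])
    return texts


# Hypothetical documents; the "source" meta key is an illustrative assumption.
docs = [
    ({"source": "example"}, "This is the first document."),
    ({"source": "example"}, "Hello\nWorld"),
    ({"source": "example"}, "1+1=2\n1+2=3\n2+2=4"),
]
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
write_jsonl(docs, path)
texts = read_texts(path)
```

Reading the file back this way is a cheap check that multi-line documents were escaped rather than split across lines, which would break the one-line-per-document layout.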


Abel2076/json2binidx_tool
