easy HF dataset doremi? #10

brando90 · 2023-08-21T18:32:39Z

Is there a Hugging Face (HF) compatible dataset I can use? For example, I currently load C4 like this:

dataset = load_dataset("c4", "en", streaming=True, split="train").with_format("torch")
remove_columns = ["text", "timestamp", "url"]
but would instead like to write:

dataset = load_dataset("doremi", "en", streaming=True, split="train").with_format("torch")
remove_columns = ["text", "timestamp", "url"]
and thereby automatically use the DoReMi weights?

brando90 · 2023-08-21T18:32:50Z

https://huggingface.co/papers/2305.10429

sangmichaelxie · 2023-09-25T22:11:05Z

We don't currently have such a dataset on Hugging Face, but we will let you know if we decide to publish one. One issue is that the weights are applied at the chunk level: they set the sampling probability for the tokenized examples, not for the raw documents.
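
To illustrate what chunk-level weighting means, here is a minimal sketch using only the Python standard library. It is not the DoReMi implementation: the domain names, weights, token IDs, and the helpers `chunk_tokens` and `sample_chunks` are all hypothetical placeholders. The idea shown is that documents are first tokenized and cut into fixed-length chunks, and only then is each chunk drawn according to its domain's weight.

```python
import random
from itertools import cycle


def chunk_tokens(token_ids, chunk_len):
    """Split a flat token list into fixed-length chunks, dropping any remainder."""
    last_start = len(token_ids) - chunk_len + 1
    return [token_ids[i:i + chunk_len] for i in range(0, last_start, chunk_len)]


def sample_chunks(chunks_by_domain, domain_weights, num_samples, seed=0):
    """Yield (domain, chunk) pairs, picking the domain of each chunk by weight."""
    rng = random.Random(seed)
    domains = list(chunks_by_domain)
    weights = [domain_weights[d] for d in domains]
    # Cycle through each domain's chunks so a small domain can be revisited.
    iterators = {d: cycle(chunks_by_domain[d]) for d in domains}
    for _ in range(num_samples):
        domain = rng.choices(domains, weights=weights, k=1)[0]
        yield domain, next(iterators[domain])


# Hypothetical data: integer ranges stand in for already-tokenized documents.
chunks_by_domain = {
    "web": chunk_tokens(list(range(0, 40)), chunk_len=8),
    "code": chunk_tokens(list(range(100, 124)), chunk_len=8),
}
# Hypothetical weights, not the values reported for DoReMi.
domain_weights = {"web": 0.7, "code": 0.3}

samples = list(sample_chunks(chunks_by_domain, domain_weights, num_samples=1000))
web_fraction = sum(1 for domain, _ in samples if domain == "web") / len(samples)
```

Because the weights apply to chunks, a long document contributes many chunks, so applying the same weights to raw documents would generally give a different token mixture. For per-domain datasets that are already on the Hub, `datasets.interleave_datasets` with its `probabilities` argument offers a similar weighted mix, but it samples raw examples unless the datasets are tokenized and chunked first.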
