This repository has been archived by the owner on Oct 18, 2024. It is now read-only.

Converting text corpora to HDF5 format #7

Open
Remorax opened this issue Jul 19, 2021 · 1 comment


Remorax commented Jul 19, 2021

Hello,

Thank you for providing access to this wonderful repository. It is truly interesting and will be very helpful to me in my university experiments.

However, could you please let me know how to convert a new text corpus to the HDF5 format expected by your code? Specifically, I would like to know how to generate:

  1. hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5 and hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5
  2. test_128 and test_512
  3. uncased_L-12_H-768_A-12 and uncased_L-24_H-1024_A-16

If any further details are required, please let me know. I look forward to hearing from you soon.

yifding (Owner) commented Jul 20, 2021

Hey,

Thanks for raising the question.
So basically, hetseq generates the HDF5 files following the logic provided by NVIDIA at https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/data/create_datasets_from_start.sh, applied to the downloaded Wikipedia dump. You may need to adapt that code for your own customized data.
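To get a feel for what such a shard looks like, here is a minimal sketch that writes and inspects a toy HDF5 file with `h5py`. The dataset key names below follow NVIDIA's BERT pretraining-data layout as I understand it and are an assumption; inspect a real shard with `h5py.File(...).keys()` to confirm them for your copy.

```python
import numpy as np
import h5py

# Toy dimensions mirroring the phase-1 naming: seq_len 128, max_pred 20.
n, seq_len, max_pred = 4, 128, 20

# Write a tiny synthetic shard (all zeros/ones, just to show the layout).
# Key names are an ASSUMPTION based on NVIDIA's create_pretraining_data output.
with h5py.File("toy_shard.hdf5", "w") as f:
    f.create_dataset("input_ids", data=np.zeros((n, seq_len), dtype=np.int32))
    f.create_dataset("input_mask", data=np.ones((n, seq_len), dtype=np.int8))
    f.create_dataset("segment_ids", data=np.zeros((n, seq_len), dtype=np.int8))
    f.create_dataset("masked_lm_positions", data=np.zeros((n, max_pred), dtype=np.int32))
    f.create_dataset("masked_lm_ids", data=np.zeros((n, max_pred), dtype=np.int32))
    f.create_dataset("next_sentence_labels", data=np.zeros((n,), dtype=np.int8))

# Read it back and list every dataset with its shape.
with h5py.File("toy_shard.hdf5", "r") as f:
    for key in sorted(f.keys()):
        print(key, f[key].shape)
```

Running the same read loop on one of the real `hdf5_lower_case_1_seq_len_128_...` files is a quick way to verify that your converted corpus matches the expected schema before launching training.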

  1. "hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5" and "hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5" are basically the two data formats for BERT phase 1 and phase 2. I generated these two from Wikipedia; you can use them directly to run BERT with hetseq.
  2. "test_128" and "test_512" are just subsets of 1, meant for debugging and fast runs.
  3. "uncased_L-12_H-768_A-12" and "uncased_L-24_H-1024_A-16" are just TensorFlow checkpoints; the only thing we need from them is the "vocab" file, which transforms "tokens" into their "input_ids".
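The vocab-to-input_ids step in point 3 can be sketched as a simple dictionary lookup. The tiny in-memory vocab below stands in for the `vocab.txt` shipped inside `uncased_L-12_H-768_A-12` (a real vocab has roughly 30k entries, one token per line, with line number as id); the helper name and the seq_len of 8 are illustrative, not hetseq's actual API.

```python
# Toy stand-in for vocab.txt: token -> id. In the real file, the id is
# simply the token's line number.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "hello": 4, "world": 5}

def tokens_to_input_ids(tokens, vocab, seq_len=8):
    """Map tokens to ids, sending out-of-vocab tokens to [UNK],
    then truncate/pad to a fixed sequence length."""
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    ids = ids[:seq_len]
    ids += [vocab["[PAD]"]] * (seq_len - len(ids))
    return ids

print(tokens_to_input_ids(["[CLS]", "hello", "world", "[SEP]"], vocab))
# → [2, 4, 5, 3, 0, 0, 0, 0]
```

A full pipeline would also apply WordPiece sub-token splitting before the lookup, but the id mapping itself is no more than this.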

Please let me know if this answers your question or if you need any further help.
