This repository has been archived by the owner on Oct 18, 2024. It is now read-only.
Thank you for providing access to this wonderful repository. It is very interesting and will be very helpful to me in my university experiments.
However, could you please let me know how to convert a new text corpus to the HDF5 format expected by your code? Specifically, I would like to know how to generate:
hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5 and hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5
test_128 and test_512
uncased_L-12_H-768_A-12 and uncased_L-24_H-1024_A-16
If any further details are required, please let me know. I look forward to hearing from you soon.
"hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5 and hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5" are the two datasets for BERT phase 1 and phase 2. I generated both from Wikipedia, so you can use them as they are to run BERT with hetseq.
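The two directory names spell out the preprocessing settings, which is useful to know when preparing a new corpus with matching parameters. The snippet below is only a sketch for reading those settings back out of a name; `parse_dataset_name` is a hypothetical helper, not part of hetseq, and the meaning given to each field in the comments is an assumption based on the usual BERT pretraining-data naming.

```python
import re

# Hypothetical helper (not part of hetseq): split a dataset directory name
# into the settings it appears to encode.
_NAME_PATTERN = re.compile(
    r"hdf5_lower_case_(?P<lower_case>\d+)"
    r"_seq_len_(?P<seq_len>\d+)"
    r"_max_pred_(?P<max_pred>\d+)"
    r"_masked_lm_prob_(?P<masked_lm_prob>[\d.]+)"
    r"_random_seed_(?P<random_seed>\d+)"
    r"_dupe_factor_(?P<dupe_factor>\d+)"
)


def parse_dataset_name(name):
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized dataset name: {name!r}")
    fields = match.groupdict()
    return {
        # Assumed meanings, following common BERT preprocessing conventions.
        "lower_case": bool(int(fields["lower_case"])),      # uncased text
        "seq_len": int(fields["seq_len"]),                  # max tokens per example
        "max_pred": int(fields["max_pred"]),                # max masked tokens per example
        "masked_lm_prob": float(fields["masked_lm_prob"]),  # masking probability
        "random_seed": int(fields["random_seed"]),
        "dupe_factor": int(fields["dupe_factor"]),          # passes over the corpus
    }


phase1 = parse_dataset_name(
    "hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15"
    "_random_seed_12345_dupe_factor_5"
)
phase2 = parse_dataset_name(
    "hdf5_lower_case_1_seq_len_512_max_pred_80_masked_lm_prob_0.15"
    "_random_seed_12345_dupe_factor_5"
)
print(phase1["seq_len"], phase1["max_pred"])  # 128 20
print(phase2["seq_len"], phase2["max_pred"])  # 512 80
```

Read this way, the two names differ only in sequence length (128 vs 512) and maximum predictions per sequence (20 vs 80).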
"test_128 and test_512" are just small subsets of the two datasets above, meant for debugging and quick runs.
"uncased_L-12_H-768_A-12 and uncased_L-24_H-1024_A-16" are just TensorFlow checkpoints. The only thing we need from them is the "vocab" file, which is used to transform "tokens" into their "input_ids".
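To illustrate what "transform tokens into input_ids" means, here is a minimal sketch, assuming the vocab is a plain text file with one token per line where the line index is the id (the layout of the `vocab.txt` in Google's released BERT checkpoints). `load_vocab` and `convert_tokens_to_ids` are illustrative names, not hetseq functions, and the toy vocab stands in for the real file. A real pipeline would also run WordPiece tokenization first, which is omitted here.

```python
def load_vocab(lines):
    """Map each token to its line index (one token per line)."""
    vocab = {}
    for index, line in enumerate(lines):
        vocab[line.rstrip("\n")] = index
    return vocab


def convert_tokens_to_ids(vocab, tokens, unk_token="[UNK]"):
    """Look up each token, falling back to the id of the unknown token."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


# Toy stand-in for the contents of a vocab file; with a real checkpoint you
# would pass an open file handle for its vocab.txt instead of this list.
toy_vocab_lines = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world"]

vocab = load_vocab(toy_vocab_lines)
input_ids = convert_tokens_to_ids(vocab, ["[CLS]", "hello", "world", "unseen", "[SEP]"])
print(input_ids)  # [2, 5, 6, 1, 3]
```

The token "unseen" is not in the toy vocab, so it maps to the id of "[UNK]".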
Please let me know if this answers your question or if you need any more help.