How do I set vocabulary table for grapheme-based model? #1
Hi, the vocabulary table should contain "all symbols that you expect your model to predict". If the model is expected to predict English vocabulary, for example, then the vocabulary table should be composed of English words, such as "I, you, he, she" and so forth; if the model is asked to predict phonemes, then the vocabulary should contain phonemes (this is exactly the case for TIMIT). Take a concrete example:
and the "labels" field in the TFRecord file can be created like:

```python
def make_example(labels):
    feature_lists = tf.train.FeatureLists(feature_list={
        'labels': tf.train.FeatureList(feature=[
            tf.train.Feature(bytes_list=tf.train.BytesList(value=[p.encode()]))
            for p in labels
        ]),
    })
    return tf.train.SequenceExample(feature_lists=feature_lists)

with tf.python_io.TFRecordWriter('data.tfrecords') as writer:
    writer.write(make_example(['how', 'are', 'you']).SerializeToString())
```

In your case, the transcriptions are composed of English letters and spaces. Isn't that like the example above? I mean, each transcription is a sentence composed of English words. If so, then the question is what you expect your model to predict. If you want to build a word-based ASR, then the example above is all you need; if you want to build a character-based ASR (it is less likely one would train an English character-based ASR, in my opinion), then you are definitely correct. Thank you for the questions :)
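Following the word-based option described above, a vocabulary table simply maps every symbol the model should predict to an integer id. Here is a minimal sketch of building such a table from raw transcripts; the function name `build_vocab` and the special tokens are illustrative assumptions, not names taken from this repository.

```python
# Sketch: build a word-level vocabulary table from transcripts.
# build_vocab and the special tokens are hypothetical, for illustration only.

def build_vocab(transcripts, specials=("<pad>", "<sos>", "<eos>")):
    """Map every symbol the model is expected to predict to an integer id."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for sentence in transcripts:
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

vocab = build_vocab(["how are you", "how do you do"])
# Special tokens take ids 0-2; words are numbered in order of first appearance.
```

The same scheme works for phoneme- or character-based models: only the choice of what counts as a "symbol" changes.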
Also, is TIMIT the only one you've tried?
Also, I hope you can share the source code for the data-preparation part, rather than just a brief idea.
Hi,
Thank you for the suggestions.
Thank you so much.
@WindQAQ By the way, Google has recently proposed a state-of-the-art ASR model based on LAS, and according to it, word pieces are the best option for output units.
@mjhanphd Also, is any paper or article about that state-of-the-art ASR model available?
Also, I have released scripts for processing and training on the VCTK dataset. If you are interested, you can check
In my 2nd test, it finished normally, but the weird thing is
I've never run into this kind of problem, at least on TIMIT and VCTK. In addition, it seems that the result on LibriSpeech is not so good? What is the edit distance on a random batch from the training set?
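The edit distance mentioned here is the standard Levenshtein distance between the reference and hypothesized token sequences. A minimal sketch (not code from this repository) using the usual rolling-array dynamic program:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences,
    i.e. the minimum number of insertions, deletions, and
    substitutions needed to turn ref into hyp."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance for ref[:i], hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds the diagonal (i-1, j-1) entry
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]
```

Passing word lists gives word-level distance; passing strings gives character-level distance.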
Suppose we are given only the audio features and corresponding transcripts, in the form of TFRecords where 'inputs' are lists of audio features and 'labels' are lists of transcripts.
Here each transcript might be composed of English letters plus spaces.
So, how should I define the vocabulary table in this case?
I guess it would be something like below:
A
B
(...)
Z
Is this correct?
Also, I wonder how I should represent the "space".
Should it really be just a space, or is there a special way to represent it?
Thank you :)
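For the character-based setup asked about here, one common pattern is to give the space an explicit named token so it survives whitespace-sensitive file parsing. A minimal sketch; the token name `<space>` and the file name `vocab.txt` are assumptions, not something this repository prescribes:

```python
# Sketch: write a grapheme vocabulary file for English transcripts,
# with the space represented by an explicit "<space>" token (an
# assumed convention, not taken from this repository).
import string

graphemes = ["<space>"] + list(string.ascii_uppercase)  # 27 symbols total

with open("vocab.txt", "w") as f:
    f.write("\n".join(graphemes) + "\n")
```

At preprocessing time each transcript would then be mapped character by character, replacing literal spaces with the `<space>` token before lookup.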