This repository has been archived by the owner on Jul 8, 2019. It is now read-only.

How do I set vocabulary table for grapheme-based model? #1

Open
mjhanphd opened this issue Jun 28, 2018 · 11 comments
Labels
good first issue Good for newcomers

Comments

@mjhanphd

mjhanphd commented Jun 28, 2018

Suppose I have only audio features and their corresponding transcripts, stored as TFRecords where 'inputs' holds lists of audio features and 'labels' holds lists of transcript tokens.
Each transcript is composed of the English alphabet plus spaces.
How should I define the vocabulary table in this case?
I guess it would be something like the following:

A
B
(...)
Z

Is this correct?
Also, I wonder how I should represent the "space".
Should it literally be a space character, or is there some special way to represent it?
Thank you :)

@WindQAQ
Owner

WindQAQ commented Jun 28, 2018

Hi, the vocabulary table should contain all the symbols you expect your model to predict. If the model is expected to predict English words, for example, then the vocabulary table should be composed of English words such as "I", "you", "he", "she", and so forth; if the model is asked to predict phonemes, then the vocabulary should contain phonemes (this is exactly the case for TIMIT).

To take a concrete example:
If the transcription is the sentence "how are you" and you want a word-based ASR model (one that predicts English words), then the vocabulary table should be

how
are
you

and the "labels" field in the TFRecord file can be created like this:

import tensorflow as tf

def make_example(labels):
    # Store each label token as a bytes feature in the 'labels' feature list.
    feature_lists = tf.train.FeatureLists(feature_list={
        'labels': tf.train.FeatureList(feature=[
            tf.train.Feature(bytes_list=tf.train.BytesList(value=[p.encode()]))
            for p in labels
        ]),
    })

    return tf.train.SequenceExample(feature_lists=feature_lists)

with tf.python_io.TFRecordWriter('data.tfrecords') as writer:
    writer.write(make_example(['how', 'are', 'you']).SerializeToString())

In your case, the transcriptions are composed of English letters and spaces. Isn't that just like the example above? That is, each transcription is a sentence composed of English words. If so, the question is what you want your model to predict. If you want to build a word-based ASR, the example above is all you need; if you want to build a character-based ASR (which, in my opinion, is a less likely choice for English), then your table is definitely correct.
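For the character-based case, a minimal sketch (not from this repository; the `char_tokenize` helper and the `<space>` token are illustrative assumptions) of turning a transcript into character-level label tokens might look like this:

```python
def char_tokenize(transcript, space_token="<space>"):
    """Split a transcript into character symbols, mapping ' ' to a token."""
    return [space_token if c == " " else c for c in transcript]

def build_vocab(symbols):
    """Assign each symbol a contiguous integer id."""
    return {sym: idx for idx, sym in enumerate(symbols)}

tokens = char_tokenize("how are you")
# tokens == ['h', 'o', 'w', '<space>', 'a', 'r', 'e', '<space>', 'y', 'o', 'u']

vocab = build_vocab(sorted(set(tokens)))
label_ids = [vocab[t] for t in tokens]
```

The resulting token list could then be passed to `make_example` above in place of the word list.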

Thank you for the questions :)

@mjhanphd
Author

mjhanphd commented Jun 28, 2018

Also, is TIMIT the only dataset you've tried?
The basic idea of seq2seq models is to let the model learn the direct relationship between audio and transcript, without additional modules such as a pronunciation dictionary.
In this sense, I believe it would be really helpful if you provided results for truly end-to-end approaches, either word-based or character-based.

@mjhanphd
Author

Also, I hope you will share the source code for the data-preparation part, rather than just a brief sketch of the idea.

@WindQAQ
Owner

WindQAQ commented Jun 28, 2018

Hi,

  1. If you want to add a space symbol, I recommend replacing the space with a special token such as `<space>` (in both the transcriptions and the vocabulary table).
  2. I have tried a Chinese character-based model on my lab's corpus, and I am currently working on the Librispeech dataset, also with a word-based model. The first one gets promising results; the second is still a work in progress.
  3. The reason I haven't released the data-preparation code is that TIMIT does not seem to be a free corpus (correct me if I'm wrong). Also, my TIMIT processing requires several prerequisites (e.g., Kaldi), which are not trivial for everyone to install. After I finish processing the Librispeech dataset, I will release all the Librispeech code on my GitHub and provide a one-click script to run it.
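Point 1 could be sketched as follows (an illustrative snippet, not code from this repository; the `collect_symbols` helper and the `eng-char.table` file name are assumptions): scan the transcripts, map each literal space to the `<space>` token, and write one symbol per line as a vocabulary table.

```python
def collect_symbols(transcripts, space_token="<space>"):
    """Gather the set of character symbols, replacing ' ' with a token."""
    symbols = set()
    for line in transcripts:
        for c in line:
            symbols.add(space_token if c == " " else c)
    return sorted(symbols)

transcripts = ["how are you", "i am fine"]
symbols = collect_symbols(transcripts)

# Write one symbol per line, matching the vocabulary-table layout above.
with open("eng-char.table", "w") as f:
    f.write("\n".join(symbols) + "\n")
```

The same space-to-token replacement would need to be applied when tokenizing each transcript, so the labels and the table stay consistent.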

Thank you for the suggestions.

@mjhanphd
Author

Thank you so much.

@WindQAQ
Owner

WindQAQ commented Jul 26, 2018

@mjhanphd
In fact, I trained a word-based ASR on Librispeech, but it ran out of resources on my machine because of the enormous amount of data. Alternatively, I am using VCTK to train a word-based ASR, which is still running.

Could you share your results for char-based ASR on Librispeech?

Thank you very much!

@mjhanphd
Author

mjhanphd commented Jul 26, 2018

@WindQAQ
I tried training a char-based model on Librispeech.
But I stopped it after 7 days of training because the loss seemed to have stopped decreasing, staying around 1.8.
Anyway, I tried to run evaluation on the test set, but it crashed for an unknown reason.

By the way, Google has recently proposed a state-of-the-art ASR model based on LAS, and according to that work, word pieces are the best choice of output units.
Do you have any plans to try them?

@WindQAQ
Owner

WindQAQ commented Jul 27, 2018

@mjhanphd
Is there any error message for reference? Indeed, I find that tensorflow.estimator.Estimator.evaluate and tensorflow.estimator.Estimator.predict are extremely slow compared with tensorflow.estimator.train_and_evaluate, and I cannot figure out why.

Also, is there any paper or article about that state-of-the-art ASR model available?
Thank you for answering and helping.

@WindQAQ WindQAQ added the good first issue Good for newcomers label Jul 27, 2018
@WindQAQ
Owner

WindQAQ commented Jul 27, 2018

Also, I have released the scripts for processing and training on the VCTK dataset. If you are interested, check vctk/ for more details. I will update the results after the run finishes.

@mjhanphd
Author

In my second test, it finished normally, but the weird thing is that
the resulting file contains only 2 decoding results, while "test-clean.tfrecords" contains many more instances.
I ran the command below
python infer.py --data ../deepSpeech/data/librispeech/processed/test-clean/test-clean.tfrecords --vocab ./misc/eng-char.table --model_dir ./model_libri --save ./output
and the resulting file follows:
1 A N D I W A S A M O R E T H A N I W A S A M O R E T H A N I W A S A M O R E T H A N (... "I W A S A M O R E T H A N" repeated many more times ...)
2 T H E S T A R T W A S A S T R A N G E S T A R T A N D T H E S T A R T O F T H E S T A R T A N D T H E S T A R T O F T H E S T A R T A N D (... "T H E S T A R T O F T H E S T A R T A N D" repeated many more times ...)

@WindQAQ
Owner

WindQAQ commented Jul 31, 2018

I've never run into this kind of problem, at least on TIMIT and VCTK. In addition, it seems the results on Librispeech are not so good? What is the edit distance on a random batch from the training set?
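The edit-distance check suggested above can be sketched with a plain Levenshtein distance between the predicted and reference token sequences (an illustrative snippet, not code from this repository); a looping prediction like the one pasted earlier would yield a very large distance relative to the reference length.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

# Error rate = distance normalized by reference length.
ref = "how are you".split()
hyp = "how are are you".split()
wer = edit_distance(ref, hyp) / len(ref)  # one insertion -> distance 1, wer == 1/3
```

Using word tokens gives a WER; using the character symbols (with the `<space>` token) gives a CER instead.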
