This repository has been archived by the owner on Jul 8, 2019. It is now read-only.

How do I set vocabulary table for grapheme-based model? #1

Open
mjhanphd opened this issue Jun 28, 2018 · 11 comments
Labels
good first issue Good for newcomers

Comments

@mjhanphd

mjhanphd commented Jun 28, 2018

Suppose I have only audio features and their corresponding transcripts, stored as TFRecords where 'inputs' holds lists of audio features and 'labels' holds lists of transcript tokens.
Each transcript is composed of the English alphabet plus spaces.
How should I define the vocabulary table in this case?
I guess it would be something like the following:

A
B
(...)
Z

Is this correct?
Also, I wonder how I should represent the "space".
Should it literally be a space character, or is there some special way to represent it?
Thank you :)

@WindQAQ
Owner

WindQAQ commented Jun 28, 2018

Hi, the vocabulary table should contain all the symbols you expect your model to predict. If the model is expected to predict English words, for example, then the vocabulary table should be composed of English words such as "I", "you", "he", "she", and so forth; if the model is asked to predict phonemes, then the vocabulary should contain phonemes (this is exactly the case for TIMIT).

To take a concrete example:
If the transcription is the sentence "how are you" and you want a word-based ASR model (one that predicts English words), then the vocabulary table should be

how
are
you

and the "labels" field in the TFRecord file can be created like this:

import tensorflow as tf

def make_example(labels):
    # Store each label token as a bytes feature in the 'labels' feature list.
    feature_lists = tf.train.FeatureLists(feature_list={
        'labels': tf.train.FeatureList(feature=[
            tf.train.Feature(bytes_list=tf.train.BytesList(value=[p.encode()]))
            for p in labels
        ]),
    })

    return tf.train.SequenceExample(feature_lists=feature_lists)

with tf.python_io.TFRecordWriter('data.tfrecords') as writer:
    writer.write(make_example(['how', 'are', 'you']).SerializeToString())

In your case, the transcriptions are composed of English letters and spaces. Isn't that just like the example above? That is, each transcription is a sentence composed of English words. If so, the question is what you want your model to predict. If you want to build a word-based ASR, the example above is all you need; if you want to build a character-based ASR (which, in my opinion, is a less likely choice for English), then your table is definitely correct.
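For the character-based case, a minimal sketch (not from this repository; the `char_tokenize` helper and the `<space>` token are illustrative assumptions) of turning a transcript into character-level label tokens might look like this:

```python
def char_tokenize(transcript, space_token="<space>"):
    """Split a transcript into character symbols, mapping ' ' to a token."""
    return [space_token if c == " " else c for c in transcript]

def build_vocab(symbols):
    """Assign each symbol a contiguous integer id."""
    return {sym: idx for idx, sym in enumerate(symbols)}

tokens = char_tokenize("how are you")
# tokens == ['h', 'o', 'w', '<space>', 'a', 'r', 'e', '<space>', 'y', 'o', 'u']

vocab = build_vocab(sorted(set(tokens)))
label_ids = [vocab[t] for t in tokens]
```

The resulting token list could then be passed to `make_example` above in place of the word list.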

Thank you for the questions :)

@mjhanphd
Author

mjhanphd commented Jun 28, 2018

Also, is TIMIT the only dataset you've tried?
The basic idea of seq2seq models is to let the model learn the direct relationship between audio and transcript, without additional modules such as a pronunciation dictionary.
In this sense, I believe it would be really helpful if you provided results for truly end-to-end approaches, either word-based or character-based.

@mjhanphd
Author

Also, I hope you will share the source code for the data-preparation part, rather than just a brief sketch of the idea.

@WindQAQ
Owner

WindQAQ commented Jun 28, 2018

Hi,

  1. If you want to add a space symbol, I recommend replacing the space with a special token such as `<space>` (in both the transcriptions and the vocabulary table).
  2. I have tried a Chinese character-based model on my lab's corpus, and I am currently working on the Librispeech dataset, also with a word-based model. The first one gets promising results; the second is still a work in progress.
  3. The reason I haven't released the data-preparation code is that TIMIT does not seem to be a free corpus (correct me if I'm wrong). Also, my TIMIT processing requires several prerequisites (e.g., Kaldi), which are not trivial for everyone to install. After I finish processing the Librispeech dataset, I will release all the Librispeech code on my GitHub and provide a one-click script to run it.
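Point 1 could be sketched as follows (an illustrative snippet, not code from this repository; the `collect_symbols` helper and the `eng-char.table` file name are assumptions): scan the transcripts, map each literal space to the `<space>` token, and write one symbol per line as a vocabulary table.

```python
def collect_symbols(transcripts, space_token="<space>"):
    """Gather the set of character symbols, replacing ' ' with a token."""
    symbols = set()
    for line in transcripts:
        for c in line:
            symbols.add(space_token if c == " " else c)
    return sorted(symbols)

transcripts = ["how are you", "i am fine"]
symbols = collect_symbols(transcripts)

# Write one symbol per line, matching the vocabulary-table layout above.
with open("eng-char.table", "w") as f:
    f.write("\n".join(symbols) + "\n")
```

The same space-to-token replacement would need to be applied when tokenizing each transcript, so the labels and the table stay consistent.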

Thank you for the suggestions.

@mjhanphd
Author

Thank you so much.

@WindQAQ
Owner

WindQAQ commented Jul 26, 2018

@mjhanphd
In fact, I trained a word-based ASR on Librispeech, but it ran out of resources on my machine because of the enormous amount of data. Alternatively, I am using VCTK to train a word-based ASR, which is still running.

Could you share your results for char-based ASR on Librispeech?

Thank you very much!

@mjhanphd
Author

mjhanphd commented Jul 26, 2018

@WindQAQ
I tried training a char-based model on Librispeech.
But I stopped it after 7 days of training because the loss seemed to have stopped decreasing, staying around 1.8.
Anyway, I tried to run evaluation on the test set, but it crashed for an unknown reason.

By the way, Google has recently proposed a state-of-the-art ASR model based on LAS, and according to that work, word pieces are the best choice of output units.
Do you have any plans to try them?

@WindQAQ
Owner

WindQAQ commented Jul 27, 2018

@mjhanphd
Is there any error message for reference? Indeed, I find that tensorflow.estimator.Estimator.evaluate and tensorflow.estimator.Estimator.predict are extremely slow compared with tensorflow.estimator.train_and_evaluate, and I cannot figure out why.

Also, is there any paper or article about that state-of-the-art ASR model available?
Thank you for answering and helping.

@WindQAQ WindQAQ added the good first issue Good for newcomers label Jul 27, 2018
@WindQAQ
Owner

WindQAQ commented Jul 27, 2018

Also, I have released the scripts for processing and training on the VCTK dataset. If you are interested, check vctk/ for more details. I will update the results after the run finishes.

@mjhanphd
Author

In my second test, it finished normally, but the weird thing is that
the resulting file contains only 2 decoding results, while "test-clean.tfrecords" contains many more instances.
I ran the command below
python infer.py --data ../deepSpeech/data/librispeech/processed/test-clean/test-clean.tfrecords --vocab ./misc/eng-char.table --model_dir ./model_libri --save ./output
and the resulting file follows:
1 A N D I W A S A M O R E T H A N I W A S A M O R E T H A N I W A S A M O R E T H A N (... "I W A S A M O R E T H A N" repeated many more times ...)
2 T H E S T A R T W A S A S T R A N G E S T A R T A N D T H E S T A R T O F T H E S T A R T A N D T H E S T A R T O F T H E S T A R T A N D (... "T H E S T A R T O F T H E S T A R T A N D" repeated many more times ...)

@WindQAQ
Owner

WindQAQ commented Jul 31, 2018

I've never run into this kind of problem, at least on TIMIT and VCTK. In addition, it seems the results on Librispeech are not so good? What is the edit distance on a random batch from the training set?
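The edit-distance check suggested above can be sketched with a plain Levenshtein distance between the predicted and reference token sequences (an illustrative snippet, not code from this repository); a looping prediction like the one pasted earlier would yield a very large distance relative to the reference length.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

# Error rate = distance normalized by reference length.
ref = "how are you".split()
hyp = "how are are you".split()
wer = edit_distance(ref, hyp) / len(ref)  # one insertion -> distance 1, wer == 1/3
```

Using word tokens gives a WER; using the character symbols (with the `<space>` token) gives a CER instead.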
