[WIP] Build lexicon and lts rules from scratch #17

Merged
merged 18 commits into from
Mar 15, 2016


zeehio
Contributor

@zeehio zeehio commented Mar 1, 2016

Do not merge yet, as I am still testing.

I have edited lang/cmulex/make_cmulex so it automatically downloads and compiles speech_tools, festival and festvox. Then the script downloads the festlex_CMU lexicon and re-creates the lexicon and LTS rules in flite format.

To run it, simply:

(cd lang/cmulex; ./make_cmulex)

I just patched this right now and I still have to test that everything works as expected. It takes several hours to create everything so it is not feasible to run it on every build. I'll see in some hours if the computation has finished and if everything is ok.

This script may help understand where the flite lexicon and letter to sound rules data comes from (helping on issue #16) and also is a step towards fixing #15.

Additionally by improving it and with the right lexicon we can start working on the lexicon and letter to sound rules in other languages. The lack of UTF-8 support in the speech tools & festival & festvox suite will make multilingual support harder to solve, but we'll deal with one problem at a time.

Pinging issues #3, #4 and #5, as this is a small step towards multilingual support.

@LongBoolean
Collaborator

@zeehio
It appears that the rebuild_cmulex branch is branching out from master. https://github.com/MycroftAI/mimic/network
I don't know if this will cause merge conflicts with the development branch (if it does, no problem; we can always revert), but in the future I would recommend branching out from the development branch.

@forslund
Collaborator

forslund commented Mar 1, 2016

And also: great work! I wish I had had this yesterday; it would have saved me a lot of time finding the dependencies and which variables I needed to export =)

@zeehio
Contributor Author

zeehio commented Mar 1, 2016

I did a rebase on top of development, thanks @LongBoolean !

I fixed the issues @forslund mentioned in the code. Since I rebased the commits, some of your comments may be lost. (Both the check before download in festlex_CMU and the ~/projects path have been fixed.)

I went to sleep yesterday and the laptop was doing the lts step. I forgot that I was running each step manually, so it will take me more time to check that everything is working (I am leaving for work now).

@zeehio
Contributor Author

zeehio commented Mar 1, 2016

Everything runs, but the resulting lexicon and LTS rules are not good: Tests fail and phoneme prediction is bad. I will try to take a look into this in ~12 hours. Feel free to work on it if you want.

@forslund
Collaborator

forslund commented Mar 2, 2016

I can also report success after letting it run overnight. I haven't looked at the quality yet (I'm not sure I have the experience required either).

@zeehio
Contributor Author

zeehio commented Mar 2, 2016

Run make mimic and then try bin/mimic -t "hello world". I am hearing "heelo"... Maybe I did something wrong.

@forslund
Collaborator

forslund commented Mar 2, 2016

@zeehio can testsuite/lex_lookup be used to check the result against the input lex data?

@zeehio
Contributor Author

zeehio commented Mar 2, 2016

Yes, the unittest lex_test fails

@forslund
Collaborator

forslund commented Mar 2, 2016

testsuite/lex_lookup tells the story well:

A couple of examples:

word    input                       current mimic    my generated
h       (((ey ch) 1))               (ey1 ch)         [null]
hello   (((hh ax) 0) ((l ow) 1))    (hh ax0 l ow1)   (ih0 l l ax0)
you     (((y uw) 1))                (y uw1)          (iy0 ax0 ax0)

input is from /lang/cmulex/festival/lib/dicts/cmu/cmudict-0.4.out
current mimic is what master generates
my generated is what I get after running `make_cmulex`

Not quite sure what's gone wrong, but since both current and my generated should be based on cmudict 0.4, they should be more similar. I made a log of the generation progress; could that be of use to anyone?

@zeehio
Contributor Author

zeehio commented Mar 2, 2016

I plan to track each of the steps of the phonetic transcription to locate the problem tomorrow (24 hours from now).

All the words present in the original lexicon should be transcribed correctly.

I have all the intermediate files and logs on my laptop as well, so I do not need any more logs for now.

If you want to try by yourself, the steps for building the lexicon and letter to sound rules are described here. Basically:

  1. Compile lexicon
  2. Train letter to sound rules models
  3. Prune the lexicon words that are predicted correctly by the letter to sound rules.
  4. Compress with Huffman encoding
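
Step 3 can be sketched roughly as follows. This is only an illustration, not the code that make_cmulex runs: `prune_lexicon` and `toy_lts` are hypothetical names, the entry format (word, tuple of phones) is an assumption, and `toy_lts` is a one-phone-per-letter stand-in for the trained letter to sound model. Keeping words that have more than one pronunciation is also an assumption of this sketch.

```python
from collections import defaultdict


def prune_lexicon(entries, predict):
    """Drop entries whose pronunciation the LTS predictor already gets right.

    entries: list of (word, phones) pairs, phones being a tuple of strings.
    predict: callable mapping a word to a tuple of phones (the LTS model).
    Words with more than one distinct pronunciation are always kept.
    """
    pronunciations = defaultdict(set)
    for word, phones in entries:
        pronunciations[word].add(tuple(phones))

    kept = []
    for word, phones in entries:
        ambiguous = len(pronunciations[word]) > 1
        if ambiguous or tuple(predict(word)) != tuple(phones):
            kept.append((word, tuple(phones)))
    return kept


def toy_lts(word):
    """Hypothetical stand-in for trained LTS rules: one phone per letter."""
    table = {"a": "ae", "c": "k", "d": "d", "g": "g", "o": "aa", "t": "t"}
    return tuple(table[ch] for ch in word if ch in table)


lexicon = [
    ("cat", ("k", "ae", "t")),   # predicted correctly, so it is pruned
    ("dog", ("d", "ao", "g")),   # toy_lts says "aa", so it stays
    ("at", ("ae", "t")),         # two pronunciations, so both stay
    ("at", ("ax", "t")),
]
pruned = prune_lexicon(lexicon, toy_lts)
print(pruned)
```

At look-up time the idea is then: use the lexicon entry if the word is present, and fall back to the letter to sound rules otherwise.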

@zeehio
Contributor Author

zeehio commented Mar 8, 2016

Just to provide an update on this issue, all the steps from lexicon compilation, training the letter to sound rules models and pruning the lexicon seem to be all right. I have written a python script able to load the pruned_lex.out intermediate file and the cmu_lts_rules.scm intermediate file and both files are consistent and work as expected.

This means that the error must be either in the last make_cmulex lex line or in the compresslex step, which uses a Huffman-style encoding written in shell:

###########################################################################
##             Author:  Alan W Black ([email protected])                    ##
##               Date:  December 2004                                    ##
###########################################################################
##                                                                       ##
##  Make a Huffman table data                                            ##
##    done by finding the top singletons, bigrams, trigrams ... and byte ##
##    coding them                                                        ##
##                                                                       ##
##  But this isn't full huffman coding yet                               ##
##                                                                       ##
##  Hmm, this should probably be written in something other than shell   ##
########################################################################### 

I'm thinking of rewriting this Huffman coding from scratch, given that I have a poor knowledge of awk scripting and that the awk code is not clearly documented (at least to me).
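
As a starting point for such a rewrite, here is a minimal sketch of the idea described in that header: find the most useful single symbols, bigrams and trigrams, and give each one a byte code. It is not a port of the awk code, and like the original it is not full Huffman coding. The function names, the ranking rule (occurrences times characters saved) and the use of plain strings as entries are all assumptions made for illustration.

```python
from collections import Counter


def build_table(entries, max_n=3, table_size=256):
    """Return a list of n-grams; the index of each n-gram is its byte code."""
    counts = Counter()
    for entry in entries:
        for n in range(2, max_n + 1):
            for i in range(len(entry) - n + 1):
                counts[entry[i:i + n]] += 1

    # Every single character gets a code, so any entry can be encoded.
    table = sorted({ch for entry in entries for ch in entry})
    if len(table) > table_size:
        raise ValueError("too many distinct symbols for one-byte codes")

    # Rank longer n-grams by how many characters they save in total.
    ranked = sorted(counts.items(),
                    key=lambda item: (-item[1] * (len(item[0]) - 1), item[0]))
    for gram, _ in ranked:
        if len(table) >= table_size:
            break
        table.append(gram)
    return table


def encode(entry, table):
    """Greedy longest-match encoding of one entry into bytes."""
    code_of = {gram: code for code, gram in enumerate(table)}
    longest = max(len(gram) for gram in table)
    out = bytearray()
    i = 0
    while i < len(entry):
        for n in range(min(longest, len(entry) - i), 0, -1):
            code = code_of.get(entry[i:i + n])
            if code is not None:
                out.append(code)
                i += n
                break
        else:
            raise ValueError("symbol not in table: %r" % entry[i])
    return bytes(out)


def decode(data, table):
    """Inverse of encode: map each byte back to its n-gram."""
    return "".join(table[code] for code in data)


entries = ["hello", "help", "yellow"]
table = build_table(entries)
encoded = encode("hello", table)
print(len("hello"), "characters ->", len(encoded), "bytes")
```

A round trip through `encode` and `decode` on every entry is a cheap check to run before generating any C source files.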

And the final step will be generating the C source files. I still don't have a clear idea of what the format of the LTS rules files and the lexicon is. I think I have read somewhere that the LTS rules are written in flite as a finite state machine. I will have to take a look into it if nobody else does...

@zeehio
Contributor Author

zeehio commented Mar 8, 2016

Aha! Exporting LC_ALL is the right thing to do (LANG does not set all the LC variables). MIMICDIR had to be exported also or some script in compresslex missed it.
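
For anyone driving the build from a Python wrapper, the same fix could look roughly like this. It is a sketch under assumptions: `build_env` and `run_step` are hypothetical helpers, the locale value "C" is a guess (the value that was actually exported is not stated here), and the path passed as `mimic_dir` depends on where the script is started.

```python
import os
import subprocess


def build_env(mimic_dir, locale="C"):
    """Copy the current environment and set the two variables explicitly.

    LC_ALL takes precedence over LANG and over the individual LC_*
    variables, which is why setting LANG alone is not enough.
    """
    env = dict(os.environ)
    env["LC_ALL"] = locale                         # assumed value
    env["MIMICDIR"] = os.path.abspath(mimic_dir)   # visible to child scripts
    return env


def run_step(command, mimic_dir):
    """Run one build step with the prepared environment."""
    return subprocess.run(command, env=build_env(mimic_dir), check=True)


env = build_env(".")
print(env["LC_ALL"], env["MIMICDIR"])
# Hypothetical usage from lang/cmulex: run_step(["./make_cmulex"], "../..")
```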

Now "hello" seems to work. I will rerun everything and check the unit tests. I will come back tomorrow!

@rhdunn
Contributor

rhdunn commented Mar 8, 2016

In my testing, I have found that repeated build runs produce the same output. Therefore, the differences in the generated output must come either from differences between the cmudict in the festlex_CMU.tar.gz file and the one used to build the flite cmulex, or from something in the build scripts. I haven't tested whether different versions of festival/speech-tools produce different output.
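
One way to check that two build runs really produce the same output is to hash every generated file and compare. The sketch below is an assumption about how such a check could be done, not the method used for the testing above; `digest_tree`, `changed_files` and the file name `lexicon.out` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def digest_tree(root):
    """Map each file under root (by relative path) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix():
            hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*")) if path.is_file()
    }


def changed_files(first, second):
    """List files that differ, or exist in only one of the two trees."""
    a, b = digest_tree(first), digest_tree(second)
    return sorted(name for name in a.keys() | b.keys()
                  if a.get(name) != b.get(name))


# Small demonstration on two throwaway directories.
with tempfile.TemporaryDirectory() as tmp:
    run_a, run_b = Path(tmp, "run_a"), Path(tmp, "run_b")
    for run in (run_a, run_b):
        run.mkdir()
        (run / "lexicon.out").write_text("same bytes\n")
    identical = changed_files(run_a, run_b)
    (run_b / "lexicon.out").write_text("different bytes\n")
    differing = changed_files(run_a, run_b)

print(identical, differing)
```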

@forslund
Collaborator

forslund commented Mar 8, 2016

@zeehio great news! Looking forward to an update.
@rhdunn I'm not sure I get your point. What you're saying is probably correct but shouldn't most of the words be similar at least? Hello should be pretty similar no matter the version.

(There will probably be some differences, and the test_lex unit test is pretty stupid: it will report false errors if the pronunciation of some words is updated.)

@rhdunn
Contributor

rhdunn commented Mar 8, 2016

@forslund My point was two-fold:

  1. performing a build on the same input produces the same output -- this is important for repeatable builds;
  2. the cmudict file (or changes on top of that file) used to generate the cmulex data is different to that used in the build -- this results in the initial build being different.

Point (2) should not matter for updating the dictionary.

The test should use the data from the dictionary to compare the generated (cmulex) and dictionary (cmudict) pronunciations. That way, it will avoid false errors.
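
A rough sketch of such a comparison, using the two notations that appear in the lex_lookup examples earlier in this thread: the dictionary side is syllabified with a stress digit per syllable, and the generated side is a flat phone list with the stress digit attached to the vowel. The vowel set and all function names here are assumptions for illustration, not part of the existing test suite.

```python
import re

# Assumed vowel set of the CMU phone set as used by festival.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh",
          "er", "ey", "ih", "iy", "ow", "oy", "uh", "uw"}


def parse_sexp(text):
    """Parse a parenthesised expression into nested lists of strings."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token
        items = []
        while tokens[pos] != ")":
            items.append(read())
        pos += 1  # skip the closing parenthesis
        return items

    return read()


def flatten_syllables(syllables):
    """Turn [[phones, stress], ...] into a flat list, stress on the vowel."""
    flat = []
    for phones, stress in syllables:
        for phone in phones:
            flat.append(phone + stress if phone in VOWELS else phone)
    return flat


def pronunciation_matches(dictionary_entry, generated):
    """Compare a dictionary pronunciation with a generated flat one."""
    expected = flatten_syllables(parse_sexp(dictionary_entry))
    return expected == parse_sexp(generated)


good = pronunciation_matches("(((hh ax) 0) ((l ow) 1))", "(hh ax0 l ow1)")
bad = pronunciation_matches("(((hh ax) 0) ((l ow) 1))", "(ih0 l l ax0)")
print(good, bad)
```

Because the expected value is read from the dictionary rather than hard-coded, an updated pronunciation in the dictionary does not turn into a false error.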

@forslund
Collaborator

forslund commented Mar 8, 2016

@rhdunn agreed, the test should be smarter. Once we can build the dict into a workable format for mimic, we should add it to the repo and update the test to look up the correct pronunciation from it.

@zeehio
Contributor Author

zeehio commented Mar 11, 2016

I will rebase and squash as soon as I have tested everything again to get rid of all these dirty commits.

@forslund
Collaborator

I'll set my computer to run this tonight. Should I expect a longer build time than previously?

@zeehio
Contributor Author

zeehio commented Mar 11, 2016

No, it is as slow as always. I wanted to make it faster (<1 hour) though, and I may still be able to.

@forslund
Collaborator

That would be great but don't feel pressure to make it run faster right now. When tested I'll merge this and you/we/I/someone can work on speeding things up in another pull request.

@zeehio
Contributor Author

zeehio commented Mar 11, 2016

Still failing on my computer. The build currently takes ~4 hours, but the good thing is that soon I will be able to reproduce the same (bad) build in ~30 minutes. Hopefully that will make my testing faster.

Also by rewriting parts of the process I do more checking on partial results. It seems the LTS model festival builds is fine, but then the model converted to flite/mimic fails.

@rhdunn
Contributor

rhdunn commented Mar 11, 2016

I've noticed that festival 1.96 appears to be faster than festival 2.4 at building the LTS model. My local builds are also faster (~2 hours with 2.4 and ~20 minutes with 1.96), though this depends on what else the computer is doing, as festival consumes 100% of the CPU.

Once the festival LTS model has been created, you could back up the festival directory to skip those steps while testing the festival to flite conversion logic.

@forslund
Collaborator

@zeehio I built the lex from your latest push and it's quite a lot better than my first build, but as you probably know it's not quite correct yet.

The problems seem to be with the sounds for "ee" (sleet, fleet, etc.) and "ch" (chair, church, etc.).

As always, if there is something you'd like me to test or run for you, just say the word.

@forslund forslund mentioned this pull request Mar 12, 2016
@zeehio zeehio force-pushed the rebuild_cmulex branch 5 times, most recently from 9aa6cb7 to ebc7e2f Compare March 15, 2016 00:52
@zeehio zeehio force-pushed the rebuild_cmulex branch 4 times, most recently from 6fa1296 to 3baa277 Compare March 15, 2016 01:59
@zeehio
Contributor Author

zeehio commented Mar 15, 2016

Good news (testing is welcome).

  • Now the lexicon and the letter to sound rules (lts) seem to be correct.
  • The build time of the lexicon and LTS rules has been reduced from ~4-6 hours to ~30 minutes.
  • It requires python. I have tested it with python-2.7 and python-3.5.

Building the lexicon and lts rules

  • The lexicon is based on the cmudict that comes with festlex_CMU (festival CMU lexicon).
  • Following these instructions the lexicon is preprocessed removing short words before training LTS rules.
  • The letter to sound rules trained with the filtered lexicon predict ~65% of the filtered lexicon. This is similar to the values reported at https://www.cs.cmu.edu/~awb/papers/ISCA01/flite/node7.html. I say similar because I don't care about the exact value: I have not split the lexicon into train/test subsets, so I can't give an accurate prediction percentage (and that 65% is over-optimistic). In any case, feel free to test how good/bad the LTS rules are.
  • Once the LTS rules are trained, the lexicon is pruned removing the words that the LTS model can predict and that have no ambiguity (no homographs). By doing that we reduce the lexicon size making look-ups faster.
  • Unit tests have been written to test that the lexicon works and so do the LTS rules.
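
To get a less optimistic number than the ~65% above, the lexicon could be split into train and test subsets before training. The sketch below only illustrates the measurement: the per-letter "model" is a toy stand-in for the real LTS training done by festival, and the function names, the every-third-entry split and the tiny lexicon are all assumptions.

```python
from collections import Counter, defaultdict


def split_lexicon(entries, every=5):
    """Deterministic split: every `every`-th entry goes to the test set."""
    train = [e for i, e in enumerate(entries) if i % every != 0]
    test = [e for i, e in enumerate(entries) if i % every == 0]
    return train, test


def train_letter_model(train):
    """Toy stand-in for LTS training: most frequent phone for each letter.

    Only uses entries where letters and phones align one to one.
    """
    votes = defaultdict(Counter)
    for word, phones in train:
        if len(word) == len(phones):
            for letter, phone in zip(word, phones):
                votes[letter][phone] += 1
    best = {letter: c.most_common(1)[0][0] for letter, c in votes.items()}
    return lambda word: tuple(best.get(ch, "?") for ch in word)


def word_accuracy(predict, entries):
    """Fraction of entries whose whole pronunciation is predicted exactly."""
    if not entries:
        return 0.0
    correct = sum(1 for word, phones in entries
                  if predict(word) == tuple(phones))
    return correct / len(entries)


lexicon = [
    ("tab", ("t", "ae", "b")),
    ("bat", ("b", "ae", "t")),
    ("cat", ("k", "ae", "t")),
    ("cod", ("k", "aa", "d")),
    ("cab", ("k", "ae", "b")),
    ("act", ("ae", "k", "t")),
]
train, test = split_lexicon(lexicon, every=3)
model = train_letter_model(train)
train_accuracy = word_accuracy(model, train)
test_accuracy = word_accuracy(model, test)
print(train_accuracy, test_accuracy)
```

On this toy data the score on the training words is perfect while the held-out score is not, which is the same effect that makes a figure measured on the training lexicon over-optimistic.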

Testing and merging

If you are happy with the result and you don't mind having all those commits in the history, feel free to merge this pull request. If you want me to work further on ordering the commits better, I can try to do it (or feel free to do it yourself if you want).

Edit

I'd rather merge this once we are happy with it and create another pull request later to deal with issue #15

@forslund
Collaborator

Outstanding! It took ~45 minutes to build for me and I haven't found any problems at all.

I agree that this should be separate from issue #15. I'm going to look at the details during lunch (or this evening, depending on time) and then I'll probably merge it.

Once more great work, it is much appreciated.

forslund added a commit that referenced this pull request Mar 15, 2016
Build lexicon and lts rules from scratch
@forslund forslund merged commit 54a8c88 into MycroftAI:development Mar 15, 2016
@zeehio zeehio deleted the rebuild_cmulex branch July 15, 2016 15:08
4 participants