[WIP] Build lexicon and lts rules from scratch #17

zeehio · 2016-03-01T01:29:22Z

Do not merge yet, as I am still testing.

I have edited lang/cmulex/make_cmulex so it automatically downloads and compiles speech_tools, festival and festvox. Then the script downloads the festlex_CMU lexicon and re-creates the lexicon and LTS rules in flite format.

To run it, simply:

(cd lang/cmulex; ./make_cmulex)

I just patched this right now and I still have to test that everything works as expected. It takes several hours to create everything so it is not feasible to run it on every build. I'll see in some hours if the computation has finished and if everything is ok.

This script may help understand where the flite lexicon and letter to sound rules data comes from (helping on issue #16) and also is a step towards fixing #15.

Additionally by improving it and with the right lexicon we can start working on the lexicon and letter to sound rules in other languages. The lack of UTF-8 support in the speech tools & festival & festvox suite will make multilingual support harder to solve, but we'll deal with one problem at a time.

Pinging issues #3, #4 and #5, as this is a small step towards multilingual support.

LongBoolean · 2016-03-01T04:16:13Z

@zeehio
It appears that the rebuild_cmulex branch is branching out from master. https://github.com/MycroftAI/mimic/network
I don't know if this will cause merge conflicts with the development branch (if it does, no problem; we can always revert), but in the future I would recommend branching out from the development branch.

forslund · 2016-03-01T07:54:48Z

And also: Great work! I wish I had had this yesterday; it would have saved me a lot of time finding the dependencies and working out which variables I needed to export =)

zeehio · 2016-03-01T08:01:09Z

I did a rebase on top of development, thanks @LongBoolean !

I fixed the issues @forslund mentioned in the code. Since I rebased the commits, some of your comments may be lost. (Both the check before download in festlex_CMU and the ~/projects path have been fixed.)

I went to sleep yesterday and the laptop was doing the lts step. I forgot that I was running each step manually, so it will take me more time to check that everything is working (I am leaving for work now).

zeehio · 2016-03-01T09:58:41Z

Everything runs, but the resulting lexicon and LTS rules are not good: Tests fail and phoneme prediction is bad. I will try to take a look into this in ~12 hours. Feel free to work on it if you want.

forslund · 2016-03-02T08:17:30Z

I can also report success after letting it run overnight. I haven't looked at the quality yet (I'm not sure I have the experience required, either).

zeehio · 2016-03-02T08:25:48Z

Run `make mimic` and try `bin/mimic -t "hello world"`. I am hearing "heelo"... Maybe I did something wrong.

forslund · 2016-03-02T08:29:29Z

@zeehio can testsuite/lex_lookup be used to check the result against the input lex data?

zeehio · 2016-03-02T08:31:41Z

Yes, the unittest lex_test fails

forslund · 2016-03-02T21:29:34Z

testsuite/lex_lookup tells the story well:

A couple of examples:

| word | input | current mimic | my generated |
| --- | --- | --- | --- |
| h | (((ey ch) 1)) | (ey1 ch) | [null] |
| hello | (((hh ax) 0) ((l ow) 1)) | (hh ax0 l ow1) | (ih0 l l ax0) |
| you | (((y uw) 1)) | (y uw1) | (iy0 ax0 ax0) |

input is from /lang/cmulex/festival/lib/dicts/cmu/cmudict-0.4.out
current mimic is what master generates
my generated is what I get after running `make_cmulex`

I'm not quite sure what's gone wrong, but since both current and my generated should be built from cmudict 0.4, they should be much more similar. I made a log of the generation progress; could that be of use to anyone?

zeehio · 2016-03-02T22:39:51Z

I plan to track each of the steps of the phonetic transcription to locate the problem tomorrow (24 hours from now).

All the words present in the original lexicon should be transcribed correctly.

I have all the intermediate files and logs on my laptop as well, so I do not need any more logs for now.

If you want to try by yourself, the steps for building the lexicon and letter to sound rules are described here. Basically:

1. Compile the lexicon.
2. Train the letter to sound rules models.
3. Prune the lexicon words that are predicted correctly by the letter to sound rules.
4. Compress with Huffman encoding.

zeehio · 2016-03-08T02:03:22Z

Just to provide an update on this issue, all the steps from lexicon compilation, training the letter to sound rules models and pruning the lexicon seem to be all right. I have written a python script able to load the pruned_lex.out intermediate file and the cmu_lts_rules.scm intermediate file and both files are consistent and work as expected.

This means that the error must be either in the last make_cmulex lex line or in the compresslex step, which uses a Huffman encoding implemented in shell:

###########################################################################
##             Author:  Alan W Black ([email protected])                    ##
##               Date:  December 2004                                    ##
###########################################################################
##                                                                       ##
##  Make a Huffman table data                                            ##
##    done by finding the top singletons, bigrams, trigrams ... and byte ##
##    coding them                                                        ##
##                                                                       ##
##  But this isn't full huffman coding yet                               ##
##                                                                       ##
##  Hmm, this should probably be written in something other than shell   ##
###########################################################################

I'm thinking of rewriting this Huffman coding from scratch, given that I have a poor knowledge of awk scripting and that the awk code is not clearly documented (at least to me).
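To make the idea in that header concrete, here is a rough Python sketch of "find the top singletons, bigrams, trigrams and byte-code them". It is not a port of the awk code: the function names (`build_table`, `encode`, `decode`), the table size of 255 and the greedy longest-match strategy are all assumptions made for illustration.

```python
from collections import Counter


def build_table(entries, max_n=3, table_size=255):
    """Pick the symbols to byte-code.

    Every single character gets a code (so any entry can be encoded);
    the remaining slots go to the most frequent bigrams, trigrams, ...
    """
    singles = sorted({ch for entry in entries for ch in entry})
    counts = Counter()
    for entry in entries:
        for n in range(2, max_n + 1):
            for i in range(len(entry) - n + 1):
                counts[entry[i:i + n]] += 1
    room = max(table_size - len(singles), 0)
    frequent = [gram for gram, _ in counts.most_common(room)]
    return singles + frequent  # the index in this list is the code


def encode(entry, table):
    """Greedy longest-match encoding into a list of integer codes."""
    code_of = {gram: i for i, gram in enumerate(table)}
    longest = max(len(gram) for gram in table)
    codes, i = [], 0
    while i < len(entry):
        for n in range(min(longest, len(entry) - i), 0, -1):
            code = code_of.get(entry[i:i + n])
            if code is not None:
                codes.append(code)
                i += n
                break
        else:
            raise ValueError("symbol not in table: %r" % entry[i])
    return codes


def decode(codes, table):
    """Invert encode() by concatenating the table entries."""
    return "".join(table[code] for code in codes)
```

A round trip such as `decode(encode("hello", table), table)` should give back the original entry, which is the property a rewrite would need to unit-test first.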

And the final step will be generating the C source files. I still don't have a clear idea of the format of the LTS rules files and the lexicon. I think I have read somewhere that the LTS rules are written in flite as a finite state machine. I will have to take a look into it if nobody else does...

zeehio · 2016-03-08T02:31:36Z

Aha! Exporting LC_ALL is the right thing to do (LANG does not set all the LC variables). MIMICDIR had to be exported also or some script in compresslex missed it.
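For anyone driving the build from Python rather than from the shell, the same fix could look roughly like this. It is only a sketch: `build_env` is a hypothetical helper, and the `"C"` locale value is an assumption, since the comment above does not say which value was exported.

```python
import os


def build_env(mimic_dir, base_env=None):
    """Return a copy of the environment with LC_ALL and MIMICDIR set.

    LC_ALL overrides every LC_* category (LC_COLLATE, LC_CTYPE, ...),
    whereas LANG is only a fallback for categories that are not set.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LC_ALL"] = "C"          # assumption: the value is not given in the thread
    env["MIMICDIR"] = mimic_dir  # must be exported so the compresslex scripts see it
    return env
```

It would be passed to the build with something like `subprocess.check_call(["./make_cmulex"], cwd="lang/cmulex", env=build_env("/path/to/mimic"))`, where the mimic path is a placeholder.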

Now "hello" seems to work. I will rerun everything and check the unit tests. I will come back tomorrow!

rhdunn · 2016-03-08T07:12:06Z

In my testing, I have found that repeated build runs produce the same output. Therefore, the differences in the generated output must come either from differences between the cmudict in the festlex_CMU.tar.gz file and what was used to build the flite cmulex, or from something in the build scripts. I haven't tested whether different versions of festival/speech-tools produce different output.

forslund · 2016-03-08T07:29:34Z

@zeehio great news! Looking forward to an update.
@rhdunn I'm not sure I get your point. What you're saying is probably correct but shouldn't most of the words be similar at least? Hello should be pretty similar no matter the version.

(There will probably be some differences, and the test_lex unit test is pretty stupid: it will report false errors if the pronunciation of some words is updated.)

rhdunn · 2016-03-08T13:08:36Z

@forslund My point was two-fold:

1. performing a build on the same input produces the same output -- this is important for repeatable builds;
2. the cmudict file (or changes on top of that file) used to generate the cmulex data is different to that used in the build -- this results in the initial build being different.

Point (2) should not matter for updating the dictionary.

The test should use the data from the dictionary to compare the generated (cmulex) and dictionary (cmudict) pronunciations. That way, it will avoid false errors.
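A minimal sketch of such a test in Python, assuming dictionary entries in the festival syllable format quoted earlier in this thread, e.g. `(((hh ax) 0) ((l ow) 1))`. The names `flatten_syllables` and `find_mismatches`, the `lookup` callable and the vowel list are assumptions for illustration; keying the dictionary by word also ignores homographs.

```python
import re

# Assumption: vowel symbols of the phone set; stress digits attach to these only.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh",
          "er", "ey", "ih", "iy", "ow", "oy", "uh", "uw"}


def flatten_syllables(entry):
    """Turn '(((hh ax) 0) ((l ow) 1))' into ['hh', 'ax0', 'l', 'ow1']."""
    phones = []
    # Each syllable looks like '((p1 p2 ...) stress)'.
    for body, stress in re.findall(r"\(\(([^()]*)\)\s*(\d)\)", entry):
        for phone in body.split():
            phones.append(phone + stress if phone in VOWELS else phone)
    return phones


def find_mismatches(dictionary, lookup):
    """Return the words whose generated phones differ from the dictionary.

    dictionary: maps a word to its festival-format entry string.
    lookup: callable returning the generated phone list for a word.
    """
    bad = []
    for word, entry in dictionary.items():
        if lookup(word) != flatten_syllables(entry):
            bad.append(word)
    return bad
```

Because the expected pronunciations are read from the dictionary itself, updating the dictionary updates the test expectations with it.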

forslund · 2016-03-08T13:20:20Z

@rhdunn agreed, the test should be smarter. Once we can build the dict into a workable format for mimic, we should add it to the repo and update the test to look up the correct pronunciation from it.

zeehio · 2016-03-11T07:41:05Z

I will rebase and squash as soon as I have tested everything again to get rid of all these dirty commits.

forslund · 2016-03-11T10:02:52Z

I'll set my computer to run this tonight. Should I expect a longer build time than previously?

zeehio · 2016-03-11T10:17:17Z

No, as slow as always. I wanted to make it faster (<1hour) though and I may still be able to.

forslund · 2016-03-11T10:21:17Z

That would be great but don't feel pressure to make it run faster right now. When tested I'll merge this and you/we/I/someone can work on speeding things up in another pull request.

zeehio · 2016-03-11T12:19:02Z

Still failing on my computer. It takes ~4 hours now, but the good thing is that soon I will be able to reproduce the same (bad) build in ~30 minutes. Hopefully that will make my testing faster.

Also by rewriting parts of the process I do more checking on partial results. It seems the LTS model festival builds is fine, but then the model converted to flite/mimic fails.

rhdunn · 2016-03-11T13:02:26Z

I've noticed that festival 1.96 appears faster than festival 2.4 at building the LTS model. My local builds are also faster (~2hrs with 2.4 and ~20min with 1.96), and the time depends on what else the computer is doing, as festival consumes 100% of the CPU.

Once the festival LTS model has been created, you could back up the festival directory to skip those steps while testing the festival to flite conversion logic.

forslund · 2016-03-12T21:19:55Z

@zeehio I built the lex from your latest push and it's quite a lot better than my first build but as you probably know it's not quite correct yet.

The remaining problems seem to be with the sounds from "ee" (sleet, fleet, etc.) and "ch" (chair, church, etc.).

As always, if there is something you'd like me to test or run for you, just say the word.


zeehio · 2016-03-15T02:40:19Z

Good news (testing is welcome).

Now the lexicon and the letter to sound rules (lts) seem to be correct.
The build time of the lexicon and LTS rules has been reduced from ~4-6 hours to ~30 minutes.
It requires python. I have tested it with python-2.7 and python-3.5.

Building the lexicon and lts rules

The lexicon is based on the cmudict that comes with festlex_CMU (festival CMU lexicon).
Following these instructions the lexicon is preprocessed removing short words before training LTS rules.
The letter to sound rules trained with the filtered lexicon predict ~65% of the filtered lexicon. This is similar to the values reported in https://www.cs.cmu.edu/~awb/papers/ISCA01/flite/node7.html. I say similar because I don't care about the exact value: I have not split the lexicon into train/test subsets, so I can't give an accurate prediction percentage (and that 65% is over-optimistic). In any case, feel free to test how good/bad the LTS rules are.
Once the LTS rules are trained, the lexicon is pruned removing the words that the LTS model can predict and that have no ambiguity (no homographs). By doing that we reduce the lexicon size making look-ups faster.
Unit tests have been written to test that the lexicon works and so do the LTS rules.
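The pruning step described above can be sketched in a few lines of Python. This is an illustration, not the script from this pull request: `prune_lexicon` and `predict` are hypothetical names, and treating any word that appears more than once as a homograph is a simplifying assumption.

```python
from collections import Counter


def prune_lexicon(lexicon, predict):
    """Keep only the entries the LTS model gets wrong, plus all homographs.

    lexicon: list of (word, phones) pairs; a word that appears more than
             once is treated as a homograph and is never pruned.
    predict: callable mapping a word to the phones the LTS rules produce.
    """
    occurrences = Counter(word for word, _ in lexicon)
    kept = []
    for word, phones in lexicon:
        is_homograph = occurrences[word] > 1
        if is_homograph or predict(word) != phones:
            kept.append((word, phones))
    return kept
```

At lookup time a word missing from the pruned lexicon falls through to the LTS rules, which by construction reproduce the pronunciation that was removed.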

Testing and merging

If you are happy with the result and you don't mind having all those commits in the history, feel free to merge this pull request. If you want me to work further on ordering the commits better, I can try to do it (or feel free to do it yourself if you want).

Edit

I'd rather merge this once we are happy with it and create another pull request later to deal with issue #15

forslund · 2016-03-15T08:02:16Z

Outstanding! Took ~45 minutes to build for me and I haven't found any problems at all.

I agree that this should be kept separate from issue #15. I'm going to look into the details during lunch (or this evening, depending on time) and then I'll probably merge it.

Once more great work, it is much appreciated.

Build lexicon and lts rules from scratch

zeehio force-pushed the rebuild_cmulex branch from 77be620 to 8f480b4 Compare March 1, 2016 07:57

forslund mentioned this pull request Mar 12, 2016

Spanish Language #3

Open

zeehio force-pushed the rebuild_cmulex branch 5 times, most recently from 9aa6cb7 to ebc7e2f Compare March 15, 2016 00:52

zeehio force-pushed the rebuild_cmulex branch 4 times, most recently from 6fa1296 to 3baa277 Compare March 15, 2016 01:59

zeehio added 6 commits March 15, 2016 03:05

Build lexicon and lts rules from scratch

9d61748

- Fix lexicon pruning - Export LC_ALL and MIMICDIR - Export PATH

Write python script to verify LTS rules and lexicon

1922e5e

Try to improve the lexicon filtering for LTS Training

2a16cda

Use python script for lexicon pruning

fb77af8

Benchmark make_cmulex reporting start time on each step

129e86e

Use python script to filter lexicon for training lts rules

9180f05

As described in http://www.cstr.ed.ac.uk/projects/festival/manual/festival_13.html#SEC44 Short words are removed before training LTS rules

zeehio force-pushed the rebuild_cmulex branch from 3baa277 to 17baa17 Compare March 15, 2016 02:05

Quote paths

508c65c

zeehio force-pushed the rebuild_cmulex branch from 17baa17 to 924e027 Compare March 15, 2016 02:10

zeehio added 10 commits March 15, 2016 03:11

Start homograph support in python script

f959996

make_cmulex exits on error

b4bbf23

The python script is able to test the LTS rules. Improved python docstring a bit

b6b95c0

Do not prune homographs

9ac0e62

Parametrize lexicon filtering for lts rules training

c4b1dd6

Remove unused code

22476e6

Split and update unit tests for lexicon and lts rules

4df24c5

Reverted make_lts from previous flite version. We may explore re-introducing the new code in the future

60adffa

python2/3 compatibility. Rename python script according to usage

1046738

New lexicon and lts model

1f80f53

zeehio force-pushed the rebuild_cmulex branch from 924e027 to 1f80f53 Compare March 15, 2016 02:11

zeehio mentioned this pull request Mar 15, 2016

Some words are pronounced incorrectly. #15

Closed

forslund added a commit that referenced this pull request Mar 15, 2016

Merge pull request #17 from zeehio/rebuild_cmulex

54a8c88

Build lexicon and lts rules from scratch

forslund merged commit 54a8c88 into MycroftAI:development Mar 15, 2016

zeehio deleted the rebuild_cmulex branch July 15, 2016 15:08
