[WIP] Build lexicon and lts rules from scratch #17
@zeehio |
And also: great work! I wish I had had this yesterday; it would have saved me a lot of time finding the dependencies and the variables I needed to export =) |
I did a rebase on top of development, thanks @LongBoolean! I fixed the issues @forslund mentioned in the code. Because I rebased the commits, part of your comments may be lost. (Both the check before download in festlex_CMU and the ~/projects path have been fixed.) I went to sleep yesterday and the laptop was doing the |
Everything runs, but the resulting lexicon and LTS rules are not good: Tests fail and phoneme prediction is bad. I will try to take a look into this in ~12 hours. Feel free to work on it if you want. |
I can also report success after letting it run overnight. I haven't looked at the quality yet (I'm not sure I have the experience required either). |
@zeehio can |
Yes, the unittest lex_test fails |
A couple of examples:
input is from
Not quite sure what's gone wrong, but since both the current lexicon and my generated one should be cmudict 0.4, they should be more similar. I made a log of the generation progress; could that be of use to anyone? |
I plan to track each step of the phonetic transcription to locate the problem tomorrow (24 hours from now). All the words present in the original lexicon should be transcribed correctly. I have all the intermediate files and logs on my laptop as well, so I do not need any more logs for now. If you want to try it yourself, the steps for building the lexicon and letter to sound rules are described here. Basically:
|
Just to provide an update on this issue: all the steps, from lexicon compilation through training the letter to sound rules models to pruning the lexicon, seem to be all right. I have written a Python script able to load the
This means that the error must be either in the last
I'm thinking of rewriting this Huffman coding from scratch, given that I have poor knowledge of awk scripting and that the awk code is not clearly documented (at least to me). The final step will be generating the C source files. I still don't have a clear idea of what the format of the LTS rules files and the lexicon is. I think I have read somewhere that the LTS rules are written in flite as a finite state machine; I will have to take a look into it if nobody else does... |
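For readers unfamiliar with the technique, here is a minimal sketch of textbook Huffman coding over characters. It only illustrates the general algorithm; it is not assumed to match the table format or the substring scheme that the awk scripts in compresslex actually produce, and the function names are hypothetical.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a {symbol: bitstring} table built with Huffman's algorithm."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A lone symbol still needs a one-bit code.
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text, codes):
    """Concatenate the code of every symbol in text."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Invert encode(); works because Huffman codes are prefix-free."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

A round trip such as `decode(encode(s, codes), codes) == s` is a cheap sanity check to run on partial results before generating any C source files.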
Aha! Exporting LC_ALL is the right thing to do (LANG does not set all the LC variables). MIMICDIR also had to be exported; otherwise some script in compresslex missed it. Now "hello" seems to work. I will rerun everything and check the unit tests. I will come back tomorrow! |
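The point about exporting can be illustrated with a small sketch: a child process only sees variables that are present in the environment it is started with, and LC_ALL takes precedence over LANG and the individual LC_* categories. The locale value "C" and the MIMICDIR path below are assumptions made for illustration; the thread does not state the values actually used, and the helper names are hypothetical.

```python
import os
import subprocess
import sys


def build_env(mimicdir, locale="C"):
    """Copy the current environment and add the variables child scripts need."""
    env = dict(os.environ)
    env["LC_ALL"] = locale      # assumption: "C"; the thread does not give the value
    env["MIMICDIR"] = mimicdir  # hypothetical path supplied by the caller
    return env


def child_sees(var, env):
    """Start a child process with env and return its value for one variable."""
    code = "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], ''))"
    result = subprocess.run(
        [sys.executable, "-c", code, var],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Checking `child_sees("MIMICDIR", env)` before launching a multi-hour build is a quick way to catch a variable that was set but never exported.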
In my testing, I have found that repeated build runs produce the same output. Therefore, either differences between the cmudict in the festlex_CMU.tar.gz file and what was used to build the flite cmulex, or something in the build scripts, are causing the generated output to be different. I haven't tested whether different versions of festival/speech-tools produce different output. |
@zeehio great news! Looking forward to an update. (There will probably be some differences and the |
@forslund My point was two-fold:
Point (2) should not matter for updating the dictionary. The test should use the data from the dictionary to compare the generated (cmulex) and dictionary (cmudict) pronunciations. That way, it will avoid false errors. |
@rhdunn agreed, the test should be smarter. Once we can build the dict into a workable format for mimic, we should add it to the repo and update the test to look up the correct pronunciation from it. |
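A minimal sketch of such a dictionary-driven test follows. It assumes a plain-text cmudict-style input (`WORD  PH1 PH2 ...`, `;;;` comments, `WORD(1)` for alternative pronunciations), which may differ from the format shipped in festlex_CMU, and `predict` is a hypothetical callable standing in for whatever the generated cmulex returns.

```python
def parse_dict(lines):
    """Parse 'WORD  PH1 PH2 ...' lines into {word: set of pronunciations}."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        word, *phones = line.split()
        # Alternative pronunciations are marked as WORD(1), WORD(2), ...
        word = word.split("(")[0].lower()
        entries.setdefault(word, set()).add(tuple(p.lower() for p in phones))
    return entries


def check_words(words, reference, predict):
    """Return (word, predicted, expected) for every mismatch on known words."""
    failures = []
    for word in words:
        expected = reference.get(word.lower())
        if expected is None:
            continue  # not in the dictionary: nothing to compare against
        predicted = tuple(predict(word))
        if predicted not in expected:
            failures.append((word, predicted, expected))
    return failures
```

Because expected pronunciations come from the dictionary itself, a dictionary update changes the expectations along with the lexicon, which avoids false errors from hard-coded phoneme strings.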
I will rebase and squash as soon as I have tested everything again to get rid of all these dirty commits. |
I'll set my computer to run this tonight. Should I expect a longer build time than previously? |
No, as slow as always. I wanted to make it faster (<1 hour), though, and I may still be able to. |
That would be great, but don't feel pressure to make it run faster right now. Once it is tested I'll merge this, and you/we/I/someone can work on speeding things up in another pull request. |
Still failing on my computer. The good thing is that, although it takes ~4 hours now, I will soon be able to reproduce the same (bad) build in ~30 minutes. Hopefully that will make my testing faster. Also, by rewriting parts of the process I can do more checking on partial results. It seems the LTS model festival builds is fine, but the model converted to flite/mimic fails. |
I've noticed that festival 1.96 appears faster than festival 2.4 at building the LTS model. My local builds are also faster (~2hrs with 2.4 and ~20min with 1.96); the time depends on what else the computer is doing, as festival consumes 100% of the CPU. Once the festival LTS model has been created, you could back up the festival directory to skip those steps while testing the festival to flite conversion logic. |
@zeehio I built the lex from your latest push and it's quite a lot better than my first build, but as you probably know it's not quite correct yet. The problems seem to be with the sounds from "ee" (sleet, fleet, etc.) and "ch" (chair, church, etc.). As always, if there is something you'd like me to test or run for you, just say the word. |
9aa6cb7 to ebc7e2f
6fa1296 to 3baa277
- Fix lexicon pruning - Export LC_ALL and MIMICDIR - Export PATH
As described in http://www.cstr.ed.ac.uk/projects/festival/manual/festival_13.html#SEC44 Short words are removed before training LTS rules
…oducing the new code in the future
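The commit note above about removing short words before training the LTS rules amounts to a simple filter over the lexicon entries. The sketch below assumes entries are `(word, phones)` pairs; the `min_letters` default is an illustrative assumption, not a value taken from this pull request.

```python
def drop_short_words(entries, min_letters=4):
    """Keep only entries whose word has at least min_letters characters.

    min_letters=4 is an assumed default for illustration only.
    """
    return [(word, phones) for word, phones in entries if len(word) >= min_letters]
```

The filter affects only the training set for the rules; the short words themselves can still be looked up in the lexicon directly.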
Good news (testing is welcome).
Building the lexicon and lts rules
Testing and merging
If you are happy with the result and you don't mind having all those commits in the history, feel free to merge this pull request. If you want me to tidy up the commit history further, I can try to do it (or feel free to do it yourself if you want).
Edit
I'd rather merge this once we are happy with it and create another pull request later to deal with issue #15 |
Outstanding! It took ~45 minutes to build for me and I haven't found any problems at all. I agree that this should be kept separate from issue #15. I'm going to look at the details during lunch (or this evening, depending on time) and then I'll probably merge it. Once more, great work; it is much appreciated. |
Build lexicon and lts rules from scratch
Do not merge yet, as I am still testing.
I have edited lang/cmulex/make_cmulex so it automatically downloads and compiles speech_tools, festival and festvox. Then the script downloads the festlex_CMU lexicon and re-creates the lexicon and LTS rules in flite format.
To run it simply
I patched this just now and I still have to test that everything works as expected. It takes several hours to create everything, so it is not feasible to run it on every build. I'll check in a few hours whether the computation has finished and whether everything is OK.
This script may help to understand where the flite lexicon and letter to sound rules data come from (helping with issue #16), and it is also a step towards fixing #15.
Additionally, by improving it, and with the right lexicon, we can start working on the lexicon and letter to sound rules for other languages. The lack of UTF-8 support in the speech tools, festival and festvox suite will make multilingual support harder to solve, but we'll deal with one problem at a time.
Pinging issues #3, #4 and #5, as this is a small step towards multilingual support.