Some words are pronounced incorrectly. #15

Closed
redsteakraw opened this issue Feb 27, 2016 · 22 comments

@redsteakraw

Some words are pronounced incorrectly.

The two that come to mind in my testing are
Atheism and Penis

Atheism is pronounced by Mimic

A thigh ism

Theism, Theist and Atheist are pronounced correctly, so it is a bit puzzling that atheism is pronounced differently.

Penis is pronounced by mimic like the words

pen is

it should be pronounced like

pee nis

Now I tested this out with a few voices and had identical results.

@rhdunn
Contributor

rhdunn commented Feb 27, 2016

Mimic (and flite, which it is based on) uses the CMU pronunciation dictionary version 0.4 to derive its pronunciations for American English. This dictionary contains a large number of pronunciation errors, inconsistencies and mixed accents, so the accuracy of the pronunciations varies.

For the words you highlighted, cmudict 0.4 contains:

ATHEISM  AH0 TH AY1 S AH0 M   
ATHEIST  EY1 TH IY0 AH0 S T
ATHEISTIC  EY2 TH IY0 IH1 S T IH0 K
ATHEISTS  EY1 TH IY0 AH0 S T S
PENIS  P EH1 N IH0 S

This highlights what I mentioned above and explains why mimic/flite are pronouncing those words incorrectly.

In cmudict 0.6d, these are:

ATHEISM  AH0 TH AY1 S AH0 M
ATHEISM(2)  EY1 TH IY0 IH2 Z AH0 M
ATHEIST  EY1 TH IY0 AH0 S T
ATHEISTIC  EY2 TH IY0 IH1 S T IH0 K
ATHEISTS  EY1 TH IY0 AH0 S T S
ATHEISTS(2)  EY1 TH IY0 AH0 S S
ATHEISTS(3)  EY1 TH IY0 AH0 S
PENIS  P IY1 N IH0 S

So penis has been corrected in that version, but atheism is only correct in the alternate pronunciation.
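
To make the variant markers concrete, here is a minimal Python sketch that groups cmudict lines by word, treating the unnumbered entry as the primary pronunciation and `(2)`, `(3)` as alternates. The function name `parse_cmudict` and the regular expression are illustrative assumptions, not part of mimic, flite or cmudict-tools.

```python
import re
from collections import defaultdict

# "WORD" or "WORD(2)", then whitespace, then the phoneme string.
ENTRY_RE = re.compile(r"^([^\s(]+)(?:\(([^)]+)\))?\s+(.+?)\s*$")

def parse_cmudict(lines):
    """Group pronunciations by word, in file order (first entry = primary)."""
    entries = defaultdict(list)
    for line in lines:
        if line.startswith(";;;"):
            continue  # cmudict comment line
        match = ENTRY_RE.match(line)
        if match:
            word, _variant, phones = match.groups()
            entries[word].append(phones.split())
    return dict(entries)

# Lines copied from the cmudict 0.6d listing above.
CMUDICT_06D = [
    "ATHEISM  AH0 TH AY1 S AH0 M",
    "ATHEISM(2)  EY1 TH IY0 IH2 Z AH0 M",
    "PENIS  P IY1 N IH0 S",
]

entries = parse_cmudict(CMUDICT_06D)
print(" ".join(entries["ATHEISM"][0]))  # primary: the "a thigh ism" reading
print(" ".join(entries["ATHEISM"][1]))  # alternate: the expected reading
```

A consumer that only ever takes the first entry for a word would therefore still pick the unwanted pronunciation of atheism from 0.6d.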

@ryanleesipes

Can we easily update the dictionary version then?



@rhdunn
Contributor

rhdunn commented Feb 28, 2016

There is a make_cmulex script in the lang/cmulex directory. I'm not sure how easy this is to run, though, as I have not tried it. It does require festival (which requires speech-tools), and the festlex_CMU.tar.gz file from http://www.cstr.ed.ac.uk/downloads/festival/2.4/. That script hard-codes various references, so may need some work.

I have github repositories of both festival and speech-tools that include various back-ported build fixes. Versions 1.95 and earlier require older systems with gcc 2.95 -- I have built these in a Debian Woody chroot.

@zeehio
Contributor

zeehio commented Mar 15, 2016

Once #17 is merged, we will be able to address this issue by adding to or correcting the lexicon.

Concerns about updating the dictionary

The lexicon we are currently using has part of speech (POS) information for some words. This POS information can be used to disambiguate the pronunciation of words. For instance, "live" as a verb ("I live here") vs. "live" as a noun ("The post office will not ship live animals."). More recent versions of the cmu_dict do not have POS information:

LIVE  L AY1 V
LIVE(2)  L IH1 V

I am a bit concerned about how this lack of POS information can affect mimic's ability to resolve homograph ambiguities.

Proposal

My current idea is to add all the new cmudict words using a 'missing' POS field (there is no problem on that). Alternative pronunciations will be automatically discarded and if bugs arise we will see how to deal with them. Our "base dictionary" will still be our current dictionary, so any word that already has POS information (such as "live") will not lose it.

If anyone is aware of a free lexicon with POS information and phonetic transcriptions, suggestions are welcome.
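
As a rough illustration of the proposal, the following Python sketch merges a newer dictionary into a base lexicon: base words keep their POS-tagged entries, new words are added with a missing POS, and alternative pronunciations are dropped. The data layout and the name `merge_lexicons` are assumptions made for this example, and the POS tags on "live" are placeholders.

```python
def merge_lexicons(base, new_entries):
    """Add words from a newer dictionary without disturbing the base lexicon.

    base:        dict mapping word -> list of (pos, phones) pairs
    new_entries: iterable of (word, phones) pairs in file order, so the
                 first pair seen for a word is its primary pronunciation
    """
    merged = {word: list(prons) for word, prons in base.items()}
    for word, phones in new_entries:
        if word in merged:
            # Either a base word (keeps its POS-tagged entries) or an
            # alternative pronunciation of a word that was already added.
            continue
        merged[word] = [(None, phones)]  # None stands for the 'missing' POS
    return merged

# Toy data: a one-word base lexicon and four lines from a newer cmudict.
base = {"live": [("n", "l ay1 v"), ("v", "l ih1 v")]}
new_entries = [
    ("live", "l ay1 v"),
    ("live", "l ih1 v"),
    ("atheism", "ah0 th ay1 s ah0 m"),
    ("atheism", "ey1 th iy0 ih2 z ah0 m"),
]
merged = merge_lexicons(base, new_entries)
print(merged["live"])
print(merged["atheism"])
```

In this toy run "atheism" keeps only its first listed pronunciation, which is the kind of case the proposal expects to deal with as bugs arise.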

@rhdunn
Contributor

rhdunn commented Mar 15, 2016

Regarding updating the dictionary:

  1. cmudict-0.4.diff contains the changes made to the base cmudict-0.4.scm generated from the cmudict source file.

  2. I have a cmudict-tools project that can be used to help maintain the pronunciations. This has the ability to generate festival format dictionaries (e.g. cmudict-tools --format festival print cmudict). This uses the value in brackets ((2) in your LIVE example) in the POS field, so in cmudict markup this would be:

    LIVE(n)  L AY1 V
    LIVE(v)  L IH1 V
    
  3. I am maintaining a mirror of the CMU pronunciation dictionary in my cmudict repository, which is tracking the old versions and the changes made to it in the different branches of the cmusphinx project.

  4. There is a potential license conflict between the festlex_CMU changes and the changes made to the cmudict file after version 0.6d. This is because the COPYING file in festlex_CMU contains the requirement:

    3. Original authors' names are not deleted.
    

    and the current maintainer of the cmudict (Alex Rudnicky) removed the original header, first added in version 0.2, that credited authorship to Bob Weide (the original maintainer). Additionally, the cmudict-0.4.scm file from festlex_CMU preserves that header (albeit with the text converted to lower case), whereas cmudict-0.4.out does not.

  5. The cmudict versions 0.1 to 0.7 are available in the Public Domain. Versions 0.5 and 0.7 don't have an official release, but Alex Rudnicky created a reconstructed version in cmusphinx commit 7825 which I have tagged as cmudict-0.7. Versions after this have been released under a 2-clause BSD license (source and binary distributions must retain the copyright notice and license text). I don't know how compatible these are with the changes made in the festlex_CMU files (POS tags and additional words).
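
Point 2 above describes the bracketed value ending up in the POS field of the festival entry. A minimal Python sketch of that idea follows; it is not the cmudict-tools implementation, the name `to_festlex` is hypothetical, and the only phoneme change it makes is lower-casing.

```python
import re

# "WORD" or "WORD(context)", then whitespace, then the phoneme string.
LINE_RE = re.compile(r"^([^\s(]+)(?:\(([^)]+)\))?\s+(.+?)\s*$")

def to_festlex(line):
    """Render one cmudict line as a ("word" pos (phonemes)) entry."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"not a cmudict entry: {line!r}")
    word, context, phones = match.groups()
    pos = context if context is not None else "nil"  # nil = no POS given
    phone_list = " ".join(phones.lower().split())
    return f'("{word.lower()}" {pos} ({phone_list}))'

print(to_festlex("LIVE(n)  L AY1 V"))  # ("live" n (l ay1 v))
print(to_festlex("LIVE(v)  L IH1 V"))  # ("live" v (l ih1 v))
print(to_festlex("CHAIR  CH EH1 R"))   # ("chair" nil (ch eh1 r))
```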

@forslund
Collaborator

@rhdunn pretty interesting stuff, and it will probably be useful. I've tested it very briefly, and I might be using it wrong, but it didn't accept the command line you gave. I got the following message:
cmudict-tools: error: argument --format: invalid choice: 'festival' (choose from 'festlex', 'cmudict-weide', 'sphinx', 'cmudict', 'cmudict-new', 'json')

To get it running I used the festlex option (./cmudict-tools --format festlex print [dict]). This in turn seems to have made the format of the output differ slightly from the cmudict-0.4.out found in festlex_CMU.tar.gz used with make_lex.

For example chair in 0.4:
("chair" nil (((ch eh r) 1)))
generated from your cmudict repo with cmudict-tools
("chair" nil (ch eh1 r))

This is what mimic produces after make_cmulex, so it might be all right. It's just a bit confusing for people like me who generally don't know what's going on =) (I need to find a good write-up of all this and read through it).

Is festlex the flag you meant or is there another flag that I'm missing?

@zeehio
Contributor

zeehio commented Mar 16, 2016

@rhdunn using your dictionary looks great!

@forslund, when make_cmulex calls the python script I wrote, the syllable structure is flattened following what festival did.

@rhdunn
Contributor

rhdunn commented Mar 16, 2016

@forslund Yes, festlex is the flag I meant. I also meant that it generates the cmudict-0.4.scm format. Both have the form:

("word" pos (pronunciation))

The .scm version (which cmudict-tools generates) is a direct phoneme replacement for phonemes in cmudict (with the addition of using ax for ah0). The .out version groups phonemes based on the syllables, and pronunciation has the form:

((pronunciation) stress) ... ((pronunciation) stress)

with the vowel stress number moving to the syllable group.
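
The relationship between the two forms can be sketched in Python by flattening the syllable groups of the .out form back into a flat phoneme list. This is only an illustration under stated assumptions: the syllables are passed in as Python data rather than parsed from Scheme, the vowel set is my own listing of the vowel phonemes, and appending the stress digit to every vowel (including ax) is a simplification rather than what festival or the mimic helper script necessarily does.

```python
# Vowel phonemes in lower case; only these carry a stress digit.
VOWELS = {
    "aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh",
    "er", "ey", "ih", "iy", "ow", "oy", "uh", "uw",
}

def flatten_syllables(syllables):
    """Turn [(phones, stress), ...] into a flat list of phonemes.

    The stress number of each syllable group is moved back onto the
    vowel(s) inside that group.
    """
    flat = []
    for phones, stress in syllables:
        for phone in phones:
            flat.append(f"{phone}{stress}" if phone in VOWELS else phone)
    return flat

# ("chair" nil (((ch eh r) 1)))  ->  ("chair" nil (ch eh1 r))
print(flatten_syllables([(["ch", "eh", "r"], 1)]))  # ['ch', 'eh1', 'r']
```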

If you look in the Makefile for festlex_CMU.tar.gz (festival/lib/dicts/cmu/Makefile) the scm to out conversion is done by:

cmudict-0.4.out: cmudict-0.4.scm cmudict_extensions.scm
        cat cmudict-0.4.scm cmudict_extensions.scm >all.scm
        ${ESTDIR}/../festival/bin/festival -b cmudict_compile.scm
        rm -f all.scm

The cmudict_compile.scm script is doing:

(load "cmulex.scm")
(lex.compile "all.scm" "cmudict-0.4.out")

which is what part of make_cmulex is doing during the build, so you can run something similar if the .out file is missing. Something like:

if [ ! -e cmudict-0.4.out ]
then
    cat cmudict-0.4.scm cmudict_extensions.scm >all.scm
    $FESTIVAL --heap 10000000 -b '(begin (load "cmulex.scm") (lex.compile "all.scm" "cmudict-0.4.out"))'
fi

Regarding documentation, the information available about the process is sparse and disjointed. I have built up my experience from trying to understand the code and searching for material online.

@rhdunn
Contributor

rhdunn commented Apr 14, 2016

Hi,

I have created an American English Pronunciation Dictionary (AmEPD) based on cmudict 0.7 (the last Public Domain version of the dictionary). This includes:

  1. Removing mixed forms (CAT-1), spelling based initialisms (IBM) and hyphenated words (there are too many hyphenated word variants and hyphenated words will primarily only vary by stress);
  2. Using AX for COMMA and AXR for LETTER unstressed vowels;
  3. Making the pronunciation consistent (ongoing) and reducing the variant count (ongoing);
  4. Adding part of speech tags and context information (for when different pronunciations share the same part of speech) -- I have completed an initial pass on this, so it should be a reasonable basis to work from;
  5. I have also corrected the pronunciation of ATHEIST noted in this issue.

There is still a lot of cleanup and consistency checking to do, but this should be a useful starting point.

NOTE: The part of speech tags used here are different to the ones used by festival. The tags for AmEPD are described in the amepd.ttl file in the amepd project, while the festlex-CMU tags are described in the festlex.ttl file of my pos-tags project. The festlex tags are different to the wp39 and wp20 tags (also described in pos-tags) used by the festival TTS program.

@forslund
Collaborator

Hi @rhdunn!

Sounds interesting, I'll try to convert it for mimic.

Meant to come back to you about the cmudict-0.7 but forgot. I created a branch using your cmu-dict repo and tool (see rhdunn-cmudict).

When testing we found that the change from ah0 to ax makes the pronunciation slightly different, so we kept the old dict for now. Is the difference intended, or do we need to update the voices for this to sound ok?

Also, some of the emphasis levels aren't supported by mimic (I reduced those to levels that are included in mimic). Do you have an opinion on how this should be handled?

@rhdunn
Contributor

rhdunn commented Apr 15, 2016

@forslund Do you mean changing /AX/ to /AH0/? The festlex dictionary replaces /AH0/ with /AX/ (see the cmu2ft script in festlex-CMU). Thus, "the" is DH AX in festlex and DH AH0 in cmudict. The cmudict does not have /AX/ and /AXR/, while my amepd does. NOTE: festlex does not use /AXR/.

The cmudict uses the /AH/ vowel for STRUT and commA words, and /ER/ for NURSE and lettER words. When festlex converts /AH0/ to /AX/ (and as transcribed in the cmudict), contrast in several words is lost (esp. for um-, un- and up- words).

For the stressed levels, 2 is used for secondary stress in the cmudict and is not present in festlex. From the cmu2ft script, festlex is using stress level 1 for these phonemes. This can currently be done using tr 2 1 on the output of the conversion process. To be more robust, I should modify cmudict-tools so the festvox phoneset does not have secondary stress and maps it to primary stress (2 -> 1).
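
The two phoneme rules discussed here (AH0 becomes AX, secondary stress 2 is folded into primary stress 1) can be sketched in Python as follows. This is not the cmu2ft script or cmudict-tools; the function name is hypothetical, and leaving all other phonemes unchanged apart from lower-casing is an assumption.

```python
def cmudict_to_festlex_phones(phones):
    """Apply the AH0 -> AX and 2 -> 1 rules to a list of cmudict phonemes."""
    mapped = []
    for phone in phones:
        phone = phone.lower()
        if phone == "ah0":
            phone = "ax"              # unstressed /AH/ becomes /AX/
        elif phone.endswith("2"):
            phone = phone[:-1] + "1"  # secondary stress mapped to primary
        mapped.append(phone)
    return mapped

print(cmudict_to_festlex_phones("DH AH0".split()))
# ['dh', 'ax']
print(cmudict_to_festlex_phones("EY1 TH IY0 IH2 Z AH0 M".split()))
# ['ey1', 'th', 'iy0', 'ih1', 'z', 'ax', 'm']
```

Working on the phoneme list alone leaves the word field untouched, whereas a blanket tr 2 1 over the whole output would also rewrite any 2 that appears elsewhere on a line.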

@rhdunn
Contributor

rhdunn commented Apr 15, 2016

NOTE: I will also need to modify cmudict-tools to handle part-of-speech. The "remove variants" command will currently strip the words containing POS information :(. It needs to be intelligent about which entry it selects -- the way I have set up the amepd is for the first entries to be the common ones and the ones that should be used if no additional disambiguation is supported.

I should also add support for mapping between vocabularies, making it easier to map from the amepd context vocabulary to the festlex one.

@zeehio
Contributor

zeehio commented Apr 15, 2016

Thanks @rhdunn! I had not seen the cmu2ft script! Your cmudict-tool makes our lives easier!

@forslund
Collaborator

@rhdunn, oh dear... I mixed them up! That explains it...
Rebuilding the cmulex using the cmudict 0.7 and cmu2ft instead of cmudict-tool + my manual conversion sounds better.

I'm gonna throw fortune at it to test more strings but so far so good.

@rhdunn
Contributor

rhdunn commented Apr 30, 2016

I have updated the cmudict-tools program so that the festlex phonemes work as they do in the cmu2ft script. Things still to support:

  1. Mapping between context tagsets, e.g. from the cainteoir tagset used in my amepd to the festlex tagset used in the festival cmudict (such as mapping det to dt). I am looking into this at the moment.
  2. Being able to remove pronunciation variants, but keep part-of-speech based variants.
  3. Only keeping the first (word, part-of-speech) entry.

@rhdunn
Contributor

rhdunn commented May 3, 2016

I have the above working now with the latest cmudict-tools, so you can run:

git clone git@github.com:rhdunn/amepd.git
cd amepd
git checkout amepd-0.1-1
cmudict-tools --format=festlex --output-context=festlex --remove-duplicate-contexts print cmudict > cmudict.scm

This will give you a cmudict.scm file that is in the same format as cmudict-0.4.scm, so should be usable by the mimic dictionary build process.

NOTE: Some entries cannot be disambiguated by part of speech alone, e.g.:

AXES(noun)  AE1 K S IH0 Z #@@{ "root": "AXE" }@@
AXES(verb)  AE1 K S IH0 Z #@@{ "root": "AXE" }@@
AXES(noun)  AE1 K S IY0 Z #@@{ "root": "AXIS" }@@

so the output will look odd in the festival format, as only the first two of those entries will be included.
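
The effect described for the AXES entries can be reproduced with a small Python sketch that keeps only the first entry for each (word, part-of-speech) pair. It is a guess at the observable behaviour, not the cmudict-tools code, and the name `keep_first_per_context` is made up for this example.

```python
def keep_first_per_context(entries):
    """Keep the first (word, context, phones) entry per (word, context)."""
    seen = set()
    kept = []
    for word, context, phones in entries:
        key = (word, context)
        if key in seen:
            continue  # a later entry with the same word and context
        seen.add(key)
        kept.append((word, context, phones))
    return kept

# The three AXES entries from the amepd excerpt above.
axes = [
    ("AXES", "noun", "AE1 K S IH0 Z"),  # root AXE
    ("AXES", "verb", "AE1 K S IH0 Z"),  # root AXE
    ("AXES", "noun", "AE1 K S IY0 Z"),  # root AXIS
]

for entry in keep_first_per_context(axes):
    print(entry)
# ('AXES', 'noun', 'AE1 K S IH0 Z')
# ('AXES', 'verb', 'AE1 K S IH0 Z')
```

The plural of "axis" is the entry that is lost, because it shares both spelling and part of speech with the plural of "axe".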

The dictionary contains fixes for the words reported in the initial summary of the issue above.

@forslund
Collaborator

forslund commented May 3, 2016

Cool! I'll try it out as soon as I get time (my best guess: tomorrow). Getting an updated dict and closing this issue would be great.

Also, I'll see if I can make a guide on how to update the dictionary using @zeehio's scripts together with your dict and tool.

@zeehio
Contributor

zeehio commented May 3, 2016

Hopefully I will find some time for mimic this weekend.

Yesterday I realized that "mycroft" needs to be added to the dictionary.

Thanks for working hard on this!

@forslund
Collaborator

forslund commented May 4, 2016

Yeah, it might be a good idea to add Mycroft =)

I tested the amepd dictionary, and make_cmulex_helper.py seems to stumble on the ' character.
I'm not sure what the correct way to handle it is. @zeehio do you have any suggestions? (mimic may even strip all special characters from the input text, making these entries hard to use without some serious rewriting.)

Removing all lines that use the character produces nice results: both penis and atheism are pronounced correctly. I need to test some more, though.

@rhdunn is the upstream dictionary interested in keeping the pronunciation of "Mycroft", or should we keep that as a local patch? And thanks for the hard work!

@rhdunn
Contributor

rhdunn commented May 4, 2016

@forslund I will be adding Mycroft shortly as part of the updates I am making to the dictionary post 0.1.

I have checked and ' entries are not in the festival dictionaries. You can use:

grep -vF "'"

to filter out ' characters, i.e.:

cmudict-tools --format=festlex --output-context=festlex --remove-duplicate-contexts print cmudict | grep -vF "'" > cmudict.scm

Mimic/flite handle this by splitting off 's and classing it as a possessive-ending part of speech. For example, using -pw (print words):

$ bin/flite -pw -t "How is Sarah's dog?"
how is sarah 's dog 
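
For anyone doing the filtering inside a script rather than with grep, a minimal Python equivalent of grep -vF "'" might look like this. The function name is hypothetical and the second sample line is made up purely for illustration.

```python
def drop_apostrophe_entries(lines):
    """Keep only the lines that contain no ' character (like grep -vF "'")."""
    return [line for line in lines if "'" not in line]

sample = [
    '("chair" nil (ch eh1 r))',
    '("chair\'s" nil (ch eh1 r z))',  # made-up entry containing an apostrophe
]

print(drop_apostrophe_entries(sample))
```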

@forslund
Collaborator

forslund commented May 4, 2016

I used a similar but more complicated grep-line =)

Thanks for clearing up the 's issue, I'll just remove the lines involved using grep. I'm going to make a clean rebuild tomorrow and create a proper pull request so people can start trying out your dict!

@zeehio
Contributor

zeehio commented May 30, 2016

Given that this has been merged already this issue can be closed.

Huge thanks to @rhdunn and @forslund for doing all the hard work!

@zeehio zeehio closed this as completed May 30, 2016