
WANT: Train Bad and Contrast, not just Good

Andy Glew edited this page Mar 25, 2022 · 3 revisions

Example: training speech recognition of TOC vs POC

Just now I have been dictating quite a bit of text with the acronym TOC (Table Of Contents). Let's assume that I am pronouncing each of the letters T O C -- my desire to be able to have the non-letter pronunciation tock=>TOC is a separate issue.

However, every attempt to say "TOC" is being recognized as "POC". Both terms are common in my technical work: "Table Of Contents" and "Point of Convergence" (see TOC-vs-POC).

I suppose that if I trained on a large enough collection of text Dragon might learn to use TOC in some contexts and POC in others. But let's skip past that, especially since I often use these terms in situations where I am not dictating large amounts of text in flow, but where I might be dictating components of names in computer programs, or editing diagrams.

Dragon's vocabulary editor allows me to input custom words with both written and spoken forms, both in computer-readable ASCII text:

  • POC\POC <-- pre-existing in my standard vocabulary
  • TOC\TOC <-- pre-existing in my standard vocabulary
  • TOC\T OC <-- I added this as a custom word
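As a sketch of how such written\spoken entries could be represented (the parsing details here are an assumption for illustration, not Dragon's actual file format): the backslash separates the written form from the spoken form, a missing backslash means the two are the same, and several spoken forms may map to one written form.

```python
# Dragon-style vocabulary entries, "written\spoken"; format assumed
# from the examples above, not from Dragon documentation.
entries = [
    r"POC\POC",   # pre-existing in the standard vocabulary
    r"TOC\TOC",   # pre-existing in the standard vocabulary
    r"TOC\T OC",  # custom: spelled-out spoken form
]

def parse(entry):
    # split on the first backslash; no backslash => spoken == written
    written, sep, spoken = entry.partition("\\")
    return written, (spoken if sep else written)

vocab = {}
for e in entries:
    written, spoken = parse(e)
    # several spoken forms may map to the same written form
    vocab.setdefault(written, []).append(spoken)

print(vocab)
```

With the three entries above this yields one written form "TOC" carrying two spoken forms, "TOC" and "T OC", which is exactly the disambiguation the custom entry is meant to add.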
When a misrecognition occurs, if I am in a speech-enabled application, I can say "correct that" or "correct POC" and tell it to emit "TOC" instead. But I am seldom in such a speech-enabled application. Almost never.

When I am dictating into a non-speech-enabled application and a misrecognition occurs ... well, as mentioned elsewhere it is usually best not to do too much training ... but if I were to train (risking losing all such training the next time my Dragon user profile gets corrupted, i.e. within a few weeks):

I can use the vocabulary editor to train a word or a phrase at a time, possibly in groups but each isolated from the others:

  • select the written\spoken forms to train
  • hit the train button
  • Dragon presents the spoken form
  • I pronounce it
and repeat.

It is even possible to select the written/spoken forms of POC and TOC to train at the same time, in a group. But AFAICT these are isolated from each other.

Let's think how you might explain this to a human taking dictation or learning the language:

You might say "The letters P O C are the acronym POC. The letters T O C are the acronym TOC".

And then proceed to "Can you hear the difference when I alternate between POC and TOC? POC as in Point Of Convergence. TOC as in Table Of Contents. POC vs TOC. P vs T. POC TOC POC TOC POC TOC. POC POC TOC TOC".

Yes, I know, Dragon doesn't really have any concept of phonemes. Nor, for the most part, do any of the new generation of ML/DNN speech recognizers. We humans usually do not know what is happening in the middle of the neural nets. Understanding what features the neural net is computing is a research topic.

But it should be possible to train an ML recognizer on such contrasting inputs. After all, these recognizers are trained by minimizing errors on labeled examples.

And the key thing here is to not train "POC" and "TOC" in isolation, but to train them in contrast to each other.

And furthermore... it is just as important to train on bad examples as on good ones: to train the recognizer to reject bad mappings as well as to accept good ones.
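A minimal sketch of what "train in contrast" could mean, using a toy perceptron-style recognizer over made-up "acoustic" features (the features, learning rate, and margin are illustrative assumptions, nothing like Dragon's internals): each training utterance carries both a positive label and an explicit confusable negative, and every update pushes the correct word's score up while pushing the confusable word's score down.

```python
import random

random.seed(0)

def make_sample(label):
    # toy acoustic features: feature 0 stands in for the T-vs-P
    # burst, feature 1 is irrelevant noise
    burst = 1.0 if label == "TOC" else -1.0
    return [burst + random.gauss(0, 0.3), random.gauss(0, 0.3)]

# contrastive pairs: each utterance names what it IS and what it is NOT
pairs = [(make_sample(lbl), lbl, "POC" if lbl == "TOC" else "TOC")
         for lbl in ["TOC", "POC"] * 50]

w = {"TOC": [0.0, 0.0], "POC": [0.0, 0.0]}

def score(word, x):
    return sum(wi * xi for wi, xi in zip(w[word], x))

lr = 0.1
for x, pos, neg in pairs:
    # only update when the margin between the contrasting words is
    # violated: raise the correct word, lower the confusable one
    if score(pos, x) - score(neg, x) < 1.0:
        for i in range(len(x)):
            w[pos][i] += lr * x[i]
            w[neg][i] -= lr * x[i]

test = [(make_sample(lbl), lbl) for lbl in ["TOC", "POC"] * 10]
accuracy = sum(max(w, key=lambda word: score(word, x)) == lbl
               for x, lbl in test) / len(test)
print(f"held-out accuracy: {accuracy:.0%}")
```

The point of the sketch is the update rule: training POC and TOC in isolation would only ever raise each word's own score, whereas the paired update also teaches the recognizer what each utterance is not.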

Yes, I know, it is unclear how to weight such small and isolated training fragments. Training on documents allows context to be taken into account, and allows less dependence on pronunciation, i.e. the actual sound. We know that ML speech recognizers are not really listening for the same sort of phonetic cues that humans listen for. Too bad.

Avoid training pronunciation, unless you really have to

As described elsewhere (TBD link), Dragon training gets thrown out when your Dragon user profile gets corrupted, which for me happens every few weeks. So we avoid training pronunciation, preferring to rely on written\spoken forms.

Discarding Dragon pronunciation training is unfortunate, but should not be fundamental.

E.g. if the voice/sound recording corresponding to a document were preserved, you can imagine retraining on it when re-creating a profile. But "corresponding to a document" is not the right granularity, since even the best of us seldom dictate an entire document without stumbling.

Indeed, it is usually recommended that users not preserve the voice/sound recording accompanying all documents. It doesn't help much, it consumes disk space, and it often consumes CPU cycles and other resources that interfere with pleasant usage of Dragon speech recognition.

Perhaps the best one could hope for would be to have smaller fragments of voice/sound/pronunciation mapping to similarly small fragments of recognized ASCII text (whether emitted as dictation text or recognized as commands).

"Small fragments" does not necessarily mean at the fine granularity of a vocabulary entry.

But slightly larger fragments, like these contrasting examples, might be suitable:

"POC as in Point Of Convergence. TOC as in Table Of Contents. POC not TOC. P vs T. POC TOC POC TOC POC TOC. POC POC TOC TOC".

Not just speech: other trainable user interfaces

This wish applies to speech recognition, but also to any user interface, particularly natural user interface, that involves training.

e.g. handwriting recognition. Whether time/stroke-based, or timeless image-based.

e.g. gesture recognition. Whether on a pointing device such as a mouse or trackpad, or in three dimensions.

Probably also things like facial recognition, although facial recognition applies more to authentication and invasions of privacy, and is not so much used by actual users to control their computer.

Note that sound recordings and images have standard data formats, whereas time-based pen strokes or gestures do not, except perhaps for video or EEG, which might apply to lipreading or sign language.
