Beginning with a smart brain

The "starting" brain of chatbot is not an empty brain. Ideallly, we are not chatting with a blank stale and it already knows things about most words in the English dictionary. So it already knows quite a lot! It does not now the meaning of individual words, but it knows the relationships between certain words.

Knowledge representation

There already exist many models we can leverage for representing knowledge. In particular, a popular line of work determines the semantic relationship between pairs of words. A popular such program is word2vec. It fixes a small number $n$ and encodes every word as a continuous-valued vector in $R^n$, with the property that the dot product of two word vectors is high if and only if the words are semantically correlated (meaning that they frequently appear in the same context).

As such, we can use this to guess if two words are synonyms or not.
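As a concrete illustration, here is a minimal sketch of querying pretrained word vectors for similarity. It assumes the gensim library and its downloadable "word2vec-google-news-300" vectors; any pretrained word2vec vectors would work the same way.

```python
# Minimal sketch: word similarity with pretrained word2vec vectors.
# Assumes gensim is installed and can download the Google News vectors.
import gensim.downloader as api

model = api.load("word2vec-google-news-300")  # returns KeyedVectors

# Cosine similarity between word vectors; semantically related words score high.
print(model.similarity("car", "automobile"))   # relatively high
print(model.similarity("car", "banana"))       # relatively low
```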

Even better, given three words, we can answer questions like these:

Question. Man is to Woman as King is to ?? Answer. Queen.

Question. Beijing is to China as ?? is to Japan Answer. Tokyo

Even better yet, word2vec can answer other, even more complex analogy questions.
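A sketch of how such analogy questions can be asked, again assuming gensim and the pretrained Google News vectors: the query vector is roughly king - man + woman, and we look for the nearest word to it.

```python
# Minimal sketch: analogy queries via vector arithmetic (king - man + woman ~ queen).
# Assumes gensim and the pretrained Google News vectors.
import gensim.downloader as api

model = api.load("word2vec-google-news-300")

# "Man is to Woman as King is to ??"
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected top answer: "queen"

# "Beijing is to China as ?? is to Japan"
print(model.most_similar(positive=["Beijing", "Japan"], negative=["China"], topn=1))
# expected top answer: "Tokyo"
```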

What did we gain?

Instead of doing a very rough word parsing that must match a fairly rigid pattern, we now have matches between words that may have no syntactic relationship. For example, Tokyo and Beijing share no letters in common. The reason word2vec can do this is that it doesn't look at the letters that make up the words; instead, it uses statistics about how frequently certain groups of words appear in the same context to compute the relationships between words. Then it packages, or encodes, those statistics in a low-dimensional vector space so that it remains manageable for the end user.
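To make the contrast concrete, here is a small sketch (reusing the pretrained vectors from the earlier examples, so the same gensim assumption applies) showing that two words with no characters in common can still come out as closely related:

```python
# Minimal sketch: character overlap vs. word2vec similarity.
# Assumes the pretrained Google News vectors via gensim, as above.
import gensim.downloader as api

model = api.load("word2vec-google-news-300")

# At the character level, the two words share nothing.
print(set("tokyo") & set("beijing"))          # set() -- no letters in common

# At the vector level they are close, because they occur in similar contexts.
print(model.similarity("Tokyo", "Beijing"))   # expected to be fairly high
```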

word2vec references

I suggest we start with word2vec, the model described just above. It has been shown to be highly successful and powerful for NLP tasks.

Code implementations with tutorials:

Papers:

Blog articles:

Source datasets

Good examples of source datasets on which to train word2vec include Wikipedia articles, Google News articles, and English dictionaries. There are probably many others as well.
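As a rough sketch of how training on such a corpus could look (assuming gensim 4.x and a hypothetical corpus.txt file with one pre-tokenized sentence per line, e.g. extracted from a Wikipedia dump):

```python
# Minimal sketch: training word2vec from scratch on a plain-text corpus.
# Assumes gensim 4.x; "corpus.txt" is a hypothetical file with one
# whitespace-tokenized sentence per line (e.g. extracted from Wikipedia).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus.txt")

model = Word2Vec(
    sentences,
    vector_size=100,   # dimension n of the word vectors
    window=5,          # context window used for co-occurrence statistics
    min_count=5,       # ignore very rare words
    workers=4,
)

model.wv.save("wiki_vectors.kv")   # keep only the word vectors for later queries
```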

Yet another model: Domain-Specific Classifier

I implemented the Domain-Specific Classifier (DSC) model in R. It is similar to word2vec, but simpler and more basic. Still, it might be useful as a sort of low-level word2vec. It runs very quickly even with large datasets.