Beginning with a smart brain
The "starting" brain of chatbot is not an empty brain. Ideallly, we are not chatting with a blank stale and it already knows things about most words in the English dictionary. So it already knows quite a lot! It does not now the meaning of individual words, but it knows the relationships between certain words.
There already exists many models we can leverage for representing knowledge, in particular, a popular field is to determine the relationship between pairs of words for semantics relationship between. A popular such program is word2vec. It finds a low number
As such, we can use this to guess if two words are synonyms or not.
Even better, given 3 words, we can answer questions like these:
Question. Man is to Woman as King is to ?? Answer. Queen.
Question. Beijing is to China as ?? is to Japan Answer. Tokyo
Better yet, word2vec can answer other, even more complex questions.
Instead of doing a very rough word parsing which must match a fairly rigid pattern, we now have matches between words which may have no syntactical relationship. For example, Tokyo and Beijing have no letters in common. The reason word2vec can do this is that it doesn't look at the letters that make up the words; instead, it uses statistics to compute the relationship between words, based on how frequently certain groups of words appear in the same context. Then, it packages or encodes those statistics in a low-dimensional vector space so that it remains manageable for the end-user.
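As a concrete illustration of the similarity and analogy queries described above, here is a minimal sketch using the gensim library (linked in the blog articles below) and pretrained Google News vectors. The file name is an assumption; any word2vec-format vector file would work.

```python
from gensim.models import KeyedVectors

# Load pretrained 300-dimensional word vectors (file name assumed;
# any vectors saved in word2vec format can be loaded this way).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Synonym check: cosine similarity between two word vectors.
print(vectors.similarity("car", "automobile"))

# Analogy: Man is to Woman as King is to ?? (expect "queen")
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=1))

# Analogy: Beijing is to China as ?? is to Japan (expect "Tokyo")
print(vectors.most_similar(positive=["Beijing", "Japan"],
                           negative=["China"], topn=1))
```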
I suggest we start with word2vec, the model described just above. It has been shown to be highly successful and powerful for NLP tasks.
Code implementations with tutorials:
- https://code.google.com/archive/p/word2vec/
- https://www.tensorflow.org/versions/r0.10/tutorials/word2vec/index.html
- http://deeplearning4j.org/word2vec
- https://github.com/dav/word2vec
Papers:
- http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf
- https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
Blog articles:
- https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
- http://rare-technologies.com/making-sense-of-word2vec/
- https://www.quora.com/How-does-word2vec-work
- https://radimrehurek.com/gensim/models/word2vec.html
Good examples of source datasets on which to train word2vec include Wikipedia articles, Google News articles, and English dictionaries. There are probably many others as well.
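To sketch how training on such a corpus might look, here is a minimal example with gensim, assuming the corpus has already been extracted to a plain-text file with one sentence per line; the file name and hyperparameters are illustrative, not prescribed.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Corpus assumed to be pre-tokenized plain text, one sentence per line
# (e.g. extracted from a Wikipedia or Google News dump).
sentences = LineSentence("corpus.txt")

# Train a small skip-gram model; these hyperparameters are illustrative.
model = Word2Vec(sentences,
                 vector_size=100,  # dimensionality of the vectors ("size" in gensim < 4.0)
                 window=5,         # context window size
                 min_count=5,      # ignore rare words
                 sg=1,             # 1 = skip-gram, 0 = CBOW
                 workers=4)

model.save("word2vec.model")
print(model.wv.most_similar("king", topn=5))
```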
I implemented the Domain-Specific Classifier (DSC) model in R. It is similar to word2vec, but is simpler and more basic. Still, it might be useful as a sort of low-level word2vec. It runs very quickly even with large datasets: