
Handling sentences based on statistical learning


The other templates that can be used to handle sentences are based on, but not limited to, the following:

Non-exhaustive list of sentence handling functions

  • is_handling
  • is_question
  • ability_statement
  • ability_question
  • rating_statement
  • rating_question

What else do we need? Can you provide suggestions?

  • has_handling (for has/have; very similar to is, but denotes possession rather than characterization, so we can be slightly more precise about the "types" of relationships)
  • has_question
  • describe_statement (same as is_question)
  • describe_question
  • type_statement (When the user asks the bot to describe a new type. For example, what is E.coli? Is it an entity? Yes. Is it a person or an animal (say no if it is neither)? No. What is it? A bacteria.)
  • type_question (Describe "bacteria". Bacteria is a type. It fits in the following hierarchy: {Person, Animal, Bacteria} -> Entity. I do not know anything else about it yet, but I know some examples of bacteria: E.coli)
  • is_question (What is "E.coli"? An E.coli is an entity, more precisely, a bacteria. I do not know anything else about it yet.)
  • describe_question (Describe "E.coli". An E.coli is an entity, more precisely, a bacteria. I do not know anything else about it yet.)
  • tellmemore_handling (Similar to "is", but always appends rather than modifies or replaces)
  • forget_handling (Forget or erase something the chatbot knows. The chatbot can "pretend" to forget things or refuse this request)
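
To make the intent of these handlers concrete, here is a hedged sketch (Python chosen only for illustration; none of these identifiers exist in the current code) of a registry that maps each handling function name to a callable:

```python
# Hypothetical registry mapping handling function names to callables.
HANDLERS = {}

def handler(name):
    """Register a handling function under the given name."""
    def register(func):
        HANDLERS[name] = func
        return func
    return register

@handler("has_handling")
def has_handling(sentence, knowledge):
    # Store a possession relationship, e.g. "Phil has 2 dogs."
    subject, _, rest = sentence.partition(" has ")
    knowledge.setdefault(subject, []).append(("has", rest.rstrip(".")))
    return f"Okay, {subject} has {rest}"

@handler("forget_handling")
def forget_handling(sentence, knowledge):
    # Erase (or pretend to erase) everything known about the named subject.
    subject = sentence.replace("Forget about ", "").rstrip(".!")
    knowledge.pop(subject, None)
    return f"I no longer remember anything about {subject}."
```

Whatever mechanism picks the name (string parsing today, a statistical classifier later), the sentence would then simply be routed through HANDLERS[name].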

Why do we need a lot of these?

I'm hoping that one day, the chatbot will be good enough that instead of using string parsing to classify a statement into the right category, one can use the tools of statistical learning theory. More precisely, this falls in the field of Natural Language Processing (NLP), a part of Machine Learning (ML) theory. One would use techniques such as bag-of-words, together with large datasets of pre-labelled (sentence, handling_function) pairs, to decide on the correct handling function.
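
As a purely illustrative sketch of that pipeline (assuming Python and scikit-learn, neither of which the project is committed to, and with made-up training pairs), the whole idea fits in a few lines:

```python
# Minimal bag-of-words classifier over (sentence, handling_function) pairs.
# The training sentences and labels below are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "I have 2 dogs.",
    "What is your name?",
    "Describe E.coli.",
    "Can you swim?",
]
labels = ["has_handling", "is_question", "describe_question", "ability_question"]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(sentences, labels)

# Pick the handling function for a new sentence instead of string parsing.
print(model.predict(["Do you have a cat?"]))
```

In practice the pre-labelled dataset would need to be far larger, and the input text could also include the previous few sentences as context.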

This has at least 2 advantages:

  • It may be more "human-like": it might make errors and sometimes parse the same question differently (especially once it also looks at the context, since the decision then depends on the last few sentences, not just the current one).
  • It helps resolve parsing conflicts: what happens when multiple templates can parse the same sentence? How do we decide which template should be used? We never have this problem with the statistical learning approach.

Why does it work better? One of the reasons is that we're no longer parsing based on syntax only; we're also using the semantics implicitly (from the information stored in the pre-labelled datasets, which would have to be manually labelled by a human).

Drawbacks? We need to produce large datasets of pre-labelled (sentence, handling_function) pairs. This can be quite time-consuming, as they need to be produced by a human, one sentence at a time. However, the model can be trained once on a different computer, and the already-trained model can then be used on the target computer. So there is no large initialization cost to running the chatbot.
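
For instance, continuing the earlier sketch, the fitted model could be serialized once and simply loaded on the target computer (joblib is just one possible choice; the file name is invented):

```python
import joblib

# On the training machine, after model.fit(...) has been run once:
joblib.dump(model, "handling_model.joblib")

# On the machine running the chatbot: load the already-trained model and use it.
model = joblib.load("handling_model.joblib")
print(model.predict(["Do you have a cat?"]))
```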

Moreover, we can try to harness numerous other tools from NLP to make the statistical learning approach stronger.

Tool to help create datasets:

To help create datasets, the chatbot would be modified so that it always writes an output file of the conversation in real time whenever someone has a conversation with it (a sketch of how this could be done follows the two examples below). The conversations would generally have the form <user, sentenceId> and <chatbot, handling_func_usedId, responseId>, with an optional header at the beginning. For example:

# Header: Ignore any line that starts with #
# datetime = 2016-09-18 10:42:00 (UTC-5:00)
# Conversation length: 7 sentences
chatbot, greeting1
user, sentence1
chatbot, handling_func_used1, response1
user, sentence2
chatbot, handling_func_used2, response2
user, sentence3
chatbot, handling_func_used3, response3

For example:

$ cat convo1
chatbot, give_greeting, What is your name?
user, Phil
chatbot, give_compliment, I like your name, Phil.
user, I have 2 dogs.
chatbot, has_handling, Okay, Phil has 2 dogs.
user, What is your name?
chatbot, has_handling, My name has 2 values: Chatbot, Phil!
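
One possible way to produce such a file in real time (a hedged sketch, assuming a Python chatbot; every function and file name here is invented) is to append and flush one line per turn as the conversation happens:

```python
from datetime import datetime, timezone

def open_conversation_log(path):
    # Append mode so an interrupted chatbot never loses earlier lines.
    log = open(path, "a", encoding="utf-8")
    log.write("# Header: Ignore any line that starts with #\n")
    log.write(f"# datetime = {datetime.now(timezone.utc):%Y-%m-%d %H:%M:%S} (UTC)\n")
    return log

def log_user(log, sentence):
    log.write(f"user, {sentence}\n")
    log.flush()  # written in real time, not only when the program exits

def log_chatbot(log, handling_func, response):
    log.write(f"chatbot, {handling_func}, {response}\n")
    log.flush()
```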

Correct the chatbot answers

This would be a two-step process. The first step would be to create a corrected dataset. For example, there would be another tool called chatbot_correct_conversation. It would take a conversation file such as convo1 and output convo1_corrected. At every chatbot response, it would show the context (if any) and ask whether the correct handling function was used. If yes, answer Y; otherwise, answer N and type the name of the handling function that should have been used. Optionally, one could also type an example of a proper response. (A rough code sketch follows the example.)

Example:

$ chatbot_correct_conversation convo1
Line number: 1
Context (2 previous exchanges): N/A
Chatbot answer: What is your name?
Chatbot handling_func: give_greeting

Is this correct? [Y/N]
Y

Line number: 3
Context (2 previous exchanges):
chatbot> What is your name?
user> Phil
Chatbot answer: I like your name, Phil!
Chatbot handling_func: give_compliment

Is this correct? [Y/N]
Y

Line number: 5
Context (2 previous exchanges):
chatbot> What is your name?
user> Phil
chatbot> I like your name, Phil.
user> I have 2 dogs.
Chatbot answer: Okay, Phil has 2 dogs.
Chatbot handling_func: has_handling

Is this correct? [Y/N]
Y

Line number: 7
Context (2 previous exchanges):
chatbot> I like your name, Phil.
user> I have 2 dogs.
chatbot> Okay, Phil has 2 dogs.
user> What is your name?
Chatbot answer: My name has 2 values: Chatbot, Phil!
Chatbot handling_func: has_handling

Is this correct? [Y/N]
N
Correction handling function: is_question
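
A rough sketch of what chatbot_correct_conversation could look like (the tool does not exist yet; Python and every detail below are assumptions):

```python
import sys

def correct_conversation(in_path, out_path, context_size=2):
    # Keep only real conversation lines; '#' header lines are skipped.
    lines = [l.rstrip("\n") for l in open(in_path, encoding="utf-8")
             if l.strip() and not l.startswith("#")]
    corrected = []
    for i, line in enumerate(lines):
        parts = line.split(", ", 2)
        if not line.startswith("chatbot, ") or len(parts) < 3:
            corrected.append(line)  # user lines and bare greetings pass through
            continue
        _, func, response = parts
        print(f"Line number: {i + 1}")
        print(f"Context ({context_size} previous exchanges):")
        for prev in lines[max(0, i - 2 * context_size):i] or ["N/A"]:
            print(prev)
        print(f"Chatbot answer: {response}")
        print(f"Chatbot handling_func: {func}\n")
        if input("Is this correct? [Y/N] ").strip().upper() != "Y":
            func = input("Correction handling function: ").strip() or func
        corrected.append(f"chatbot, {func}, {response}")
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(corrected) + "\n")

if __name__ == "__main__":
    correct_conversation(sys.argv[1], sys.argv[1] + "_corrected")
```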

Tool to create a new dataset for training a statistical learning model from a corrected dataset

Secondly, there should be a tool called chatbot_create_learning_dataset, which creates or appends to existing datasets for the purpose of statistical learning from an existing conversation. One would first call chatbot_correct_conversation to create convo1_corrected from convo1. Then, one would call chatbot_create_learning_dataset convo1_corrected convo1_newdataset to automatically put the conversation into the right format. This tool could be called automatically by chatbot_correct_conversation itself, but it is good to keep the intermediate raw and corrected datasets too.

  • This tool could work with various parameters that can be set at runtime, such as

-- Number of context lines to keep: Number (Default 2).
-- Keep chatbot answers in context: Y/N (Default N).

This would automate the process of creating datasets that include context and would require far less typing (a rough sketch of such a tool appears at the end of this page).

  • Finally, the tool could be integrated directly into the chatbot and used while the user is having a conversation (after receiving the chatbot's answer, the user would then immediately correct it if necessary).
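
To close, here is a hedged sketch of what chatbot_create_learning_dataset could do with a corrected conversation, mirroring the runtime parameters listed above (again, every name, flag, and format detail is an assumption):

```python
import sys

def create_learning_dataset(corrected_path, dataset_path,
                            context_lines=2, keep_chatbot_context=False):
    lines = [l.rstrip("\n") for l in open(corrected_path, encoding="utf-8")
             if l.strip() and not l.startswith("#")]
    # Append so several corrected conversations can feed the same dataset.
    with open(dataset_path, "a", encoding="utf-8") as out:
        for i, line in enumerate(lines):
            parts = line.split(", ", 2)
            if not line.startswith("chatbot, ") or len(parts) < 3 or i == 0:
                continue
            if not lines[i - 1].startswith("user, "):
                continue
            _, handling_func, _ = parts
            # The user sentence being classified, optionally prefixed by context.
            context = lines[max(0, i - 1 - context_lines):i - 1]
            if not keep_chatbot_context:
                context = [c for c in context if not c.startswith("chatbot, ")]
            text = " ".join(c.split(", ", 2)[-1] for c in context + [lines[i - 1]])
            out.write(f"{text}\t{handling_func}\n")

if __name__ == "__main__":
    create_learning_dataset(sys.argv[1], sys.argv[2])
```

Each output line then pairs a sentence (optionally prefixed with its context) with the handling function that should classify it, which is exactly the (sentence, handling_function) format the statistical learning approach needs.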