This article discusses:
- Options for when to connect your client application to the Direct Line Speech (DLS) channel and your bot, and how long to keep that connection open
- The built-in 5-minute websocket disconnect policy
- The need to use a Cognitive Services Speech authorization token, since it is recommended that client applications never handle speech subscription keys
The text below assumes you are deploying a bot. However, similar considerations apply if you host your dialog using Custom Commands.
The text below and the C# pseudo code assume your client application uses keyword activation. The considerations regarding connection logic are similar if your application uses microphone button ("push to talk") activation only.
Option 1: A web-socket between the client application, the DLS channel and your bot is established when the client application opens. The web-socket connection remains open until the client application closes, with the exception of the 5-minute idle disconnect described later in this section.
Pros
- Lowest latency - There is no need to create the web-socket connection at the moment the user says the keyword or presses the microphone button.
- State of conversation can remain in memory in the bot instance between dialogs
Cons
- An always-on connection with an instance of your bot is present even for a client application that is not active. This prevents scaling resources up and down based on actual usage.
Set up a web-socket connection with DLS and the bot using DialogServiceConnector, then start listening for the keyword. Audio will not stream to the cloud until the keyword is detected:
// Set up cloud connection, reading audio from default microphone
var botConfig = BotFrameworkConfig.FromSubscription("YourSpeechSubscriptionKey", "YourServiceRegion");
botConfig.Language = "en-US";
connector = new DialogServiceConnector(botConfig);
// Implement event handlers
connector.ActivityReceived += (sender, activityReceivedEventArgs) => { ... };
...
// Start listening for keyword
var model = KeywordRecognitionModel.FromFile("YourKeywordModelFileName");
connector.StartKeywordRecognitionAsync(model);
When the keyword is detected, audio automatically starts streaming to the cloud. The bot gets the recognized text and replies with one or more Bot Framework activities. The client application receives and processes the activities, including playback of the attached TTS audio stream. The client application may initiate a second turn by calling ListenOnceAsync() based on a bot hint, as sketched below.
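For illustration, an ActivityReceived handler along these lines could honor the bot's hint. This is only a sketch: it assumes Newtonsoft.Json (JObject) for parsing the activity JSON, and relies on the "inputHint" field with value "expectingInput" from the Bot Framework activity schema:
connector.ActivityReceived += (sender, activityReceivedEventArgs) =>
{
    // The activity arrives as a JSON string following the Bot Framework activity schema
    var activity = JObject.Parse(activityReceivedEventArgs.Activity);
    // Play back any attached TTS audio (activityReceivedEventArgs.Audio) first...
    if ((string)activity["inputHint"] == "expectingInput")
    {
        // The bot expects a reply - start the next listen turn without requiring the keyword
        connector.ListenOnceAsync();
    }
};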
Close the web-socket connection with DLS and the bot:
// Unregister event handlers
connector.ActivityReceived -= ...
// Close cloud connection
connector.Dispose();
Option 2: A web-socket between the client application, DLS and your bot is established when on-device keyword detection fires, or on microphone button press ("push to talk"). It closes after the current dialog is done.
Pros
- Bot instances are only spun up to handle active clients, so resources can scale with actual usage
Cons
- Higher latency - the connection needs to be established after keyword activation or microphone button press
- If state needs to be preserved between dialogs with a client, it must be stored at the end of one dialog and fetched at the beginning of the next
Set up the on-device keyword recognizer, without a cloud connection:
var micAudioConfig = AudioConfig.FromDefaultMicrophoneInput();
var keywordRecognizer = new KeywordRecognizer(micAudioConfig);
// Implement event handlers
keywordRecognizer.Recognized += (sender, keywordRecognitionEventArgs) => { ... };
...
// Start listening for keyword (no cloud connection needed!)
var model = KeywordRecognitionModel.FromFile("YourKeywordModelFileName");
keywordRecognizer.RecognizeOnceAsync(model);
Set up a web-socket connection with DLS and the bot using DialogServiceConnector, and stream audio to the cloud starting from right before the keyword. Do the dialog turns. Close the web-socket at the end of the dialog:
// In the Recognized event handler:
var result = keywordRecognitionEventArgs.Result;
if (result.Reason == ResultReason.RecognizedKeyword)
{
    // Get an audio data stream which starts from right before the keyword
    var audioDataStream = AudioDataStream.FromResult(result);
    // Create a dialog connector using an audio input stream
    var audioInputStream = AudioInputStream.CreatePushStream();
    var streamAudioConfig = AudioConfig.FromStreamInput(audioInputStream);
    var botConfig = BotFrameworkConfig.FromSubscription("YourSpeechSubscriptionKey", "YourServiceRegion");
    var connector = new DialogServiceConnector(botConfig, streamAudioConfig);
    // Implement event handlers
    connector.ActivityReceived += (sender, activityReceivedEventArgs) => { ... };
    // Start a thread here to pump audio from audioDataStream to audioInputStream.
    // This will be simplified in a future version of the Speech SDK.
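    // One possible pump, for illustration only (the 3200-byte buffer size is an arbitrary assumption):
    Task.Run(() =>
    {
        var buffer = new byte[3200];
        uint bytesRead;
        while ((bytesRead = audioDataStream.ReadData(buffer)) > 0)
        {
            audioInputStream.Write(buffer, (int)bytesRead);
        }
        // Signal end of stream to the connector once the keyword stream is exhausted
        audioInputStream.Close();
    });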
    ...
    // Do dialog turns
    connector.ListenOnceAsync();
    ...
    // When the dialog is done, release dialog resources and close the cloud connection
    audioDataStream.DetachInput();
    audioDataStream.Dispose();
    connector.ActivityReceived -= ...
    connector.Dispose();
    streamAudioConfig.Dispose();
    audioInputStream.Dispose();
}
Stop listening for the keyword:
keywordRecognizer.Recognized -= ...
keywordRecognizer.Dispose();
Option 3: A web-socket between the client application, DLS and your bot is established when on-device keyword detection fires, or on button press ("push to talk"). It remains open for a predetermined amount of time beyond the end of the current dialog.
This is a compromise between Options 1 and 2. If your scenario typically calls for a user to initiate several dialogs with your bot in a short period of time, this may be the best solution.
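There is no single prescribed implementation for this option. One minimal sketch is to reuse the Option 2 flow but add an idle timer, disposing the connector only when the timer fires. The 2-minute linger window below is an arbitrary example value:
// Create once: a one-shot timer that fires only if no new dialog starts in time
var idleTimer = new System.Timers.Timer(TimeSpan.FromMinutes(2).TotalMilliseconds);
idleTimer.AutoReset = false;
idleTimer.Elapsed += (sender, elapsedEventArgs) =>
{
    // No new interaction within the linger window - close the cloud connection
    connector.Dispose();
};
// At the end of every dialog, instead of disposing the connector immediately:
idleTimer.Stop();
idleTimer.Start();
// On the next keyword detection or button press, stop the timer and reuse the open
// connector if it is still alive; otherwise recreate it as in Option 2.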
The Direct Line Speech (DLS) channel closes the websocket connection with a client application that has not been active for 5 minutes, where active means the client has either:
- Sent audio via ListenOnceAsync or keyword activation (StartKeywordRecognitionAsync), or
- Sent a Bot-Framework activity (SendActivityAsync)
When the 5-minute disconnect happens, the client application will receive a Canceled event from the connector with the following arguments:
Reason = CancellationReason.Error
ErrorCode = CancellationErrorCode.ConnectionFailure
ErrorDetails = "Connection was closed by the remote host. Error code: 1000. Error details: Exceeded maximum websocket connection idle duration (> 300000ms) SessionId: ..."
The client application does not need to act on this event. The Speech SDK has built-in reconnect logic: the next time audio or an activity is sent up, the connection will be re-established. This means your event handler should ignore this particular cancellation error, by doing the following (C# pseudo code):
private void ConnectorCanceled(object sender, SpeechRecognitionCanceledEventArgs e)
{
    if (e.Reason == CancellationReason.Error && e.ErrorCode == CancellationErrorCode.ConnectionFailure
        && e.ErrorDetails.Contains("1000"))
    {
        // Do nothing... perhaps just log the event
    }
    else ...
}
Since any interaction (audio or an activity being sent up) after 5 minutes of idle time requires first re-establishing the websocket connection with DLS, the first response from the bot after an idle period will have slightly higher latency.
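If your application can anticipate the next interaction (for example, the user picks up the device), one way to mask this latency is to re-open the connection proactively. A minimal sketch using the connector's ConnectAsync method:
// Proactively re-establish the websocket so the next turn does not pay the connection cost
connector.ConnectAsync();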
[to be written...]
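In the meantime, here is a rough sketch of the token-based flow mentioned at the top of this article. Note two assumptions: tokens issued by the Cognitive Services token service expire after roughly 10 minutes, so long-running clients need to refresh them; and the key-for-token exchange shown inline below belongs in a secure service you own, never in the client itself:
// Exchange the speech subscription key for a short-lived authorization token.
// In production this exchange runs on your server, not in the client.
var httpClient = new HttpClient();
httpClient.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", "YourSpeechSubscriptionKey");
var response = await httpClient.PostAsync(
    "https://YourServiceRegion.api.cognitive.microsoft.com/sts/v1.0/issueToken", null);
var authToken = await response.Content.ReadAsStringAsync();
// Build the bot config from the token, so the client never holds the key
var botConfig = BotFrameworkConfig.FromAuthorizationToken(authToken, "YourServiceRegion");
var connector = new DialogServiceConnector(botConfig);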