Azure OpenAI: audio transcription and translation (Azure#38460)
* squashed commit: whisper transcription/translation support

* Update to latest TypeSpec

* rebase codegen to mvp tsp PR

* Merge, snap, suppression cleanup

* PR feedback: remove errant <auto-generated/> tags in /custom

* Test refresh, incl. temporary BYOD rollback

* test fix for recordings and omitting multipart audio bodies

* PR feedback: idiomatic response formats

* PR feedback: fully distinguish translation and transcription types

* PR feedback: keep well-(enough)-known names for Srt,Vtt

* test: revert accidentally included local swap to live test mode

* full test update, including RAI adjustments

* merged .tsp snap (regen pending)

* code regen after merge and tsp snap

* CHANGELOG update

* Incorporate fabulous PR feedback. Thank you, Jose!
trrwilson authored Sep 21, 2023
1 parent bc5e855 commit 494a50b
Showing 79 changed files with 2,709 additions and 409 deletions.
27 changes: 26 additions & 1 deletion sdk/openai/Azure.AI.OpenAI/CHANGELOG.md
@@ -4,11 +4,36 @@

### Features Added

- Audio Transcription and Audio Translation using OpenAI Whisper models is now supported. See [OpenAI's API
reference](https://platform.openai.com/docs/api-reference/audio) or the [Azure OpenAI
quickstart](https://learn.microsoft.com/azure/ai-services/openai/whisper-quickstart) for detailed overview and
background information.
- The new methods `GetAudioTranscription` and `GetAudioTranslation` (with async counterparts) expose these capabilities on `OpenAIClient`
  - Transcription produces text in the spoken language of the provided audio data (for supported languages), together with any optional associated metadata
  - Translation produces English text reflecting the provided audio data, together with any optional associated metadata
- These methods work for both Azure OpenAI and non-Azure `api.openai.com` client configurations; a minimal sketch follows this list, and the README section in this commit adds complete examples
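
A minimal, hedged sketch of the new surface (an already-configured `OpenAIClient` named `client` and raw audio bytes in `audioBytes` are assumed here, not part of this changelog):

```C#
// Hedged sketch: 'client' and 'audioBytes' are assumed to exist; see the README examples in this commit for full setup.
var transcriptionOptions = new AudioTranscriptionOptions { AudioData = BinaryData.FromBytes(audioBytes) };
Response<AudioTranscription> transcription = client.GetAudioTranscription("my-whisper-deployment", transcriptionOptions);
Console.WriteLine(transcription.Value.Text);

var translationOptions = new AudioTranslationOptions { AudioData = BinaryData.FromBytes(audioBytes) };
Response<AudioTranslation> translation = client.GetAudioTranslation("my-whisper-deployment", translationOptions);
Console.WriteLine(translation.Value.Text); // translated to English
```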

### Breaking Changes

- The underlying representation of `PromptFilterResults` (for `Completions` and `ChatCompletions`) has had its response
body key changed from `prompt_annotations` to `prompt_filter_results`
- **Prior versions of the `Azure.AI.OpenAI` library may no longer populate `PromptFilterResults` as expected.** Upgrading to this version is highly recommended if you rely on Azure OpenAI content moderation annotations for input data
- If a library version upgrade is not immediately possible, use `Response<T>.GetRawResponse()` and manually extract the `prompt_filter_results` object from the top level of the `Completions` or `ChatCompletions` response `Content` payload, as sketched below
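
For reference, a rough, hedged sketch of that manual extraction (the `client`, `deploymentName`, and `completionsOptions` variables are hypothetical placeholders, and this is only one way to read the payload):

```C#
// Hedged sketch of the manual workaround for older library versions.
Response<Completions> response = await client.GetCompletionsAsync(deploymentName, completionsOptions);

// Parse the raw payload (System.Text.Json) and read the top-level 'prompt_filter_results' element manually.
using JsonDocument payload = JsonDocument.Parse(response.GetRawResponse().Content);
if (payload.RootElement.TryGetProperty("prompt_filter_results", out JsonElement promptFilterResults))
{
    Console.WriteLine(promptFilterResults.ToString());
}
```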

### Bugs Fixed

### Other Changes

- This library version accounts for the `prompt_filter_results` key change described above and will once again deserialize `PromptFilterResults` appropriately
- `PromptFilterResults` and `ContentFilterResults` are now exposed on the result classes for streaming Completions and Chat Completions: `Streaming(Chat)Completions.PromptFilterResults` reports an index-sorted list of all prompt annotations received so far, while `Streaming(Chat)Choice.ContentFilterResults` reflects the latest content annotations received during streaming (see the sketch below)
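
A hedged sketch of reading those streaming filter results; the streaming call and enumeration names (`GetChatCompletionsStreamingAsync`, `GetChoicesStreaming`) reflect the beta-era surface and are assumptions beyond what this changelog states:

```C#
// Hedged sketch: 'client', the deployment name, and 'chatCompletionsOptions' are assumed to already exist.
Response<StreamingChatCompletions> streamingResponse = await client.GetChatCompletionsStreamingAsync(
    "my-gpt-deployment",
    chatCompletionsOptions);
StreamingChatCompletions streamingChatCompletions = streamingResponse.Value;

await foreach (StreamingChatChoice choice in streamingChatCompletions.GetChoicesStreaming())
{
    // Latest-received content annotations for this choice, as described above.
    var contentFilterResults = choice.ContentFilterResults;
}

// Index-sorted list of all prompt annotations received so far, as described above.
var promptFilterResults = streamingChatCompletions.PromptFilterResults;
```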

## 1.0.0-beta.7 (2023-08-25)

45 changes: 45 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/README.md
@@ -399,6 +399,51 @@ Response<ImageGenerations> imageGenerations = await client.GetImageGenerationsAs
Uri imageUri = imageGenerations.Value.Data[0].Url;
```

### Transcribe audio data with Whisper speech models

```C# Snippet:TranscribeAudio
using Stream audioStreamFromFile = File.OpenRead("myAudioFile.mp3");
BinaryData audioFileData = BinaryData.FromStream(audioStreamFromFile);

var transcriptionOptions = new AudioTranscriptionOptions()
{
AudioData = audioFileData,
ResponseFormat = AudioTranscriptionFormat.Verbose,
};

Response<AudioTranscription> transcriptionResponse = await client.GetAudioTranscriptionAsync(
deploymentId: "my-whisper-deployment", // whisper-1 as model name for non-Azure OpenAI
transcriptionOptions);
AudioTranscription transcription = transcriptionResponse.Value;

// When using Simple, SRT, or VTT formats, only transcription.Text will be populated
Console.WriteLine($"Transcription ({transcription.Duration.Value.TotalSeconds}s):");
Console.WriteLine(transcription.Text);
```
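
When `Verbose` is requested, per-segment details are also available on the result. A hedged sketch of reading them follows; the segment type and member names are assumptions not confirmed by this diff:

```C#
// Hedged sketch: iterate verbose transcription segments (AudioTranscriptionSegment with Start/End/Text is assumed).
foreach (AudioTranscriptionSegment segment in transcription.Segments)
{
    Console.WriteLine($"[{segment.Start} -> {segment.End}] {segment.Text}");
}
```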

### Translate audio data to English with Whisper speech models

```C# Snippet:TranslateAudio
using Stream audioStreamFromFile = File.OpenRead("mySpanishAudioFile.mp3");
BinaryData audioFileData = BinaryData.FromStream(audioStreamFromFile);

var translationOptions = new AudioTranslationOptions()
{
AudioData = audioFileData,
ResponseFormat = AudioTranslationFormat.Verbose,
};

Response<AudioTranslation> translationResponse = await client.GetAudioTranslationAsync(
deploymentId: "my-whisper-deployment", // whisper-1 as model name for non-Azure OpenAI
translationOptions);
AudioTranslation translation = translationResponse.Value;

// When using Simple, SRT, or VTT formats, only translation.Text will be populated
Console.WriteLine($"Translation ({translation.Duration.Value.TotalSeconds}s):");
// .Text will be translated to English (ISO-639-1 "en")
Console.WriteLine(translation.Text);
```

## Troubleshooting

When you interact with Azure OpenAI using the .NET SDK, errors returned by the service correspond to the same HTTP status codes returned for [REST API][openai_rest] requests.
121 changes: 119 additions & 2 deletions sdk/openai/Azure.AI.OpenAI/api/Azure.AI.OpenAI.netstandard2.0.cs

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion sdk/openai/Azure.AI.OpenAI/assets.json
@@ -2,5 +2,5 @@
"AssetsRepo": "Azure/azure-sdk-assets",
"AssetsRepoPrefixPath": "net",
"TagPrefix": "net/openai/Azure.AI.OpenAI",
"Tag": "net/openai/Azure.AI.OpenAI_a0250cd0f1"
"Tag": "net/openai/Azure.AI.OpenAI_52e82965d8"
}
48 changes: 48 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/src/Custom.Suppressions/OpenAIClient.cs
@@ -0,0 +1,48 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

using System.Threading;
using Azure.Core;

namespace Azure.AI.OpenAI
{
[CodeGenSuppress("GetCompletions", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetCompletionsAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetChatCompletions", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetChatCompletionsAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetEmbeddings", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetEmbeddingsAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetChatCompletionsWithAzureExtensions", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetChatCompletionsWithAzureExtensions", typeof(string), typeof(ChatCompletionsOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetChatCompletionsWithAzureExtensionsAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetChatCompletionsWithAzureExtensionsAsync", typeof(string), typeof(ChatCompletionsOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranscriptionAsPlainText", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranscriptionAsPlainText", typeof(string), typeof(AudioTranscriptionOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranscriptionAsPlainTextAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranscriptionAsPlainTextAsync", typeof(string), typeof(AudioTranscriptionOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranscriptionAsResponseObject", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranscriptionAsResponseObject", typeof(string), typeof(AudioTranscriptionOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranscriptionAsResponseObjectAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranscriptionAsResponseObjectAsync", typeof(string), typeof(AudioTranscriptionOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranslationAsPlainText", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranslationAsPlainText", typeof(string), typeof(AudioTranslationOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranslationAsPlainTextAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranslationAsPlainTextAsync", typeof(string), typeof(AudioTranslationOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranslationAsResponseObject", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranslationAsResponseObject", typeof(string), typeof(AudioTranslationOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranslationAsResponseObjectAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranslationAsResponseObjectAsync", typeof(string), typeof(AudioTranslationOptions), typeof(CancellationToken))]
[CodeGenSuppress("CreateGetCompletionsRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("CreateGetChatCompletionsRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("CreateGetEmbeddingsRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("CreateGetChatCompletionsWithAzureExtensionsRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("CreateGetAudioTranscriptionAsPlainTextRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("CreateGetAudioTranscriptionAsResponseObjectRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("CreateGetAudioTranslationAsPlainTextRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("CreateGetAudioTranslationAsResponseObjectRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
public partial class OpenAIClient
{
}
}
12 changes: 12 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/src/Custom/AudioTaskLabel.cs
@@ -0,0 +1,12 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

namespace Azure.AI.OpenAI
{
internal readonly partial struct AudioTaskLabel
{
// CUSTOM CODE NOTE: here to demote visibility to internal.
}
}
@@ -0,0 +1,30 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

using System.Text.Json;

namespace Azure.AI.OpenAI
{
public partial class AudioTranscription
{
internal static AudioTranscription FromResponse(Response response)
{
if (response.Headers.ContentType.Contains("text/plain"))
{
return new AudioTranscription(
text: response.Content.ToString(),
internalAudioTaskLabel: null,
language: null,
duration: default,
segments: null);
}
else
{
using var document = JsonDocument.Parse(response.Content);
return DeserializeAudioTranscription(document.RootElement);
}
}
}
}
15 changes: 15 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/src/Custom/AudioTranscription.cs
@@ -0,0 +1,15 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

namespace Azure.AI.OpenAI
{
public partial class AudioTranscription
{
// CUSTOM CODE NOTE: included to demote visibility of 'task'

/// <summary> The label that describes which operation type generated the accompanying response data. </summary>
internal AudioTaskLabel? InternalAudioTaskLabel { get; }
}
}
32 changes: 32 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/src/Custom/AudioTranscriptionFormat.cs
@@ -0,0 +1,32 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

using System;
using Azure.Core;

namespace Azure.AI.OpenAI
{
public readonly partial struct AudioTranscriptionFormat : IEquatable<AudioTranscriptionFormat>
{
/// <summary>
/// Specifies that a transcription response should provide plain, unannotated text with no additional metadata.
/// </summary>
[CodeGenMember("Json")]
public static AudioTranscriptionFormat Simple { get; } = new AudioTranscriptionFormat(SimpleValue);

/// <summary>
/// Specifies that a transcription response should provide plain, unannotated text with additional metadata
/// including timings, probability scores, and other processing details.
/// </summary>
[CodeGenMember("VerboseJson")]
public static AudioTranscriptionFormat Verbose { get; } = new AudioTranscriptionFormat(VerboseValue);

// (Note: text is hidden as its behavior is redundant with 'json' when using a shared, strongly-typed response
// value container)
/// <summary> Use a response body that is plain text containing the raw, unannotated transcription. </summary>
[CodeGenMember("Text")]
internal static AudioTranscriptionFormat InternalPlainText { get; } = new AudioTranscriptionFormat(InternalPlainTextValue);
}
}
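
Beyond `Simple` and `Verbose`, the commit notes above indicate that the well-known `Srt` and `Vtt` format names are retained. A minimal, hedged usage sketch (assuming a configured `client`):

```C#
// Hedged sketch: request SubRip (.srt) output; per the README comments, only Text is populated for SRT/VTT.
using Stream audioStream = File.OpenRead("myAudioFile.mp3");
var options = new AudioTranscriptionOptions
{
    AudioData = BinaryData.FromStream(audioStream),
    ResponseFormat = AudioTranscriptionFormat.Srt, // AudioTranscriptionFormat.Vtt is the WebVTT counterpart
};
Response<AudioTranscription> srtResponse = await client.GetAudioTranscriptionAsync("my-whisper-deployment", options);
Console.WriteLine(srtResponse.Value.Text);
```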
@@ -0,0 +1,37 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

using System.Net.Http;
using Azure.Core;

namespace Azure.AI.OpenAI
{
public partial class AudioTranscriptionOptions
{
internal virtual RequestContent ToRequestContent()
{
var content = new MultipartFormDataRequestContent();
content.Add(new StringContent(InternalNonAzureModelName), "model");
content.Add(new ByteArrayContent(AudioData.ToArray()), "file", "@file.wav");
if (Optional.IsDefined(ResponseFormat))
{
content.Add(new StringContent(ResponseFormat.ToString()), "response_format");
}
if (Optional.IsDefined(Prompt))
{
content.Add(new StringContent(Prompt), "prompt");
}
if (Optional.IsDefined(Temperature))
{
content.Add(new StringContent($"{Temperature}"), "temperature");
}
if (Optional.IsDefined(Language))
{
content.Add(new StringContent(Language), "language");
}
return content;
}
}
}
37 changes: 37 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/src/Custom/AudioTranscriptionOptions.cs
@@ -0,0 +1,37 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

using System;

namespace Azure.AI.OpenAI
{
public partial class AudioTranscriptionOptions
{
/// <summary>
/// The audio data to transcribe. This must be the binary content of a file in one of the supported media formats:
/// flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.
/// <para>
/// To assign a byte[] to this property use <see cref="BinaryData.FromBytes(byte[])"/>; to assign a stream, use
/// <see cref="BinaryData.FromStream(System.IO.Stream)"/>. The data is transmitted as the binary "file" field of a
/// multipart/form-data request rather than as a Base64-encoded string.
/// </para>
/// </summary>
public BinaryData AudioData { get; set; }

/// <summary> Initializes a new instance of AudioTranscriptionOptions. </summary>
public AudioTranscriptionOptions()
{ }

internal string InternalNonAzureModelName { get; set; }
}
}
@@ -0,0 +1,30 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

using System.Text.Json;

namespace Azure.AI.OpenAI
{
public partial class AudioTranslation
{
internal static AudioTranslation FromResponse(Response response)
{
if (response.Headers.ContentType.Contains("text/plain"))
{
return new AudioTranslation(
text: response.Content.ToString(),
internalAudioTaskLabel: null,
language: null,
duration: default,
segments: null);
}
else
{
using var document = JsonDocument.Parse(response.Content);
return DeserializeAudioTranslation(document.RootElement);
}
}
}
}
14 changes: 14 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/src/Custom/AudioTranslation.cs
@@ -0,0 +1,14 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

namespace Azure.AI.OpenAI
{
/// <summary> Result information for an operation that translated spoken audio into written text. </summary>
public partial class AudioTranslation
{
/// <summary> The label that describes which operation type generated the accompanying response data. </summary>
internal AudioTaskLabel? InternalAudioTaskLabel { get; }
}
}
32 changes: 32 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/src/Custom/AudioTranslationFormat.cs
@@ -0,0 +1,32 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

using System;
using Azure.Core;

namespace Azure.AI.OpenAI
{
public readonly partial struct AudioTranslationFormat : IEquatable<AudioTranslationFormat>
{
/// <summary>
/// Specifies that a translation response should provide plain, unannotated text with no additional metadata.
/// </summary>
[CodeGenMember("Json")]
public static AudioTranslationFormat Simple { get; } = new AudioTranslationFormat(SimpleValue);

/// <summary>
/// Specifies that a translation response should provide plain, unannotated text with additional metadata
/// including timings, probability scores, and other processing details.
/// </summary>
[CodeGenMember("VerboseJson")]
public static AudioTranslationFormat Verbose { get; } = new AudioTranslationFormat(VerboseValue);

// (Note: text is hidden as its behavior is redundant with 'json' when using a shared, strongly-typed response
// value container)
/// <summary> Use a response body that is plain text containing the raw, unannotated translation. </summary>
[CodeGenMember("Text")]
internal static AudioTranslationFormat InternalPlainText { get; } = new AudioTranslationFormat(InternalPlainTextValue);
}
}