Azure OpenAI: audio transcription and translation (Azure#38460)
* squashed commit: whisper transcription/translation support

* Update to latest TypeSpec

* rebase codegen to mvp tsp PR

* Merge, snap, suppression cleanup

* PR feedback: remove errant <auto-generated/> tags in /custom

* Test refresh, incl. temporary BYOD rollback

* test fix for recordings and omitting multipart audio bodies

* PR feedback: idiomatic response formats

* PR feedback: fully distinguish translation and transcription types

* PR feedback: keep well-(enough)-known names for Srt,Vtt

* test: revert accidentally included local swap to live test mode

* full test update, including RAI adjustments

* merged .tsp snap (regen pending)

* code regen after merge and tsp snap

* CHANGELOG update

* Incorporate fabulous PR feedback. Thank you, Jose!
trrwilson authored Sep 21, 2023
1 parent bc5e855 commit 494a50b
Showing 79 changed files with 2,709 additions and 409 deletions.
27 changes: 26 additions & 1 deletion sdk/openai/Azure.AI.OpenAI/CHANGELOG.md
@@ -4,11 +4,36 @@

### Features Added

- Audio Transcription and Audio Translation using OpenAI Whisper models is now supported. See [OpenAI's API
reference](https://platform.openai.com/docs/api-reference/audio) or the [Azure OpenAI
quickstart](https://learn.microsoft.com/azure/ai-services/openai/whisper-quickstart) for detailed overview and
background information.
- The new methods `GetAudioTranscription` and `GetAudioTranslation` (with async counterparts) expose these capabilities on `OpenAIClient`
  - Transcription produces text in the spoken language of the provided audio data (for supported languages), together with any optional associated metadata
  - Translation produces English text reflecting the provided audio data, together with any optional associated metadata
- These methods work for both Azure OpenAI and non-Azure `api.openai.com` client configurations; a minimal sketch follows this list, and the README section in this commit adds complete examples
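
A minimal, hedged sketch of the new surface (an already-configured `OpenAIClient` named `client` and raw audio bytes in `audioBytes` are assumed here, not part of this changelog):

```C#
// Hedged sketch: 'client' and 'audioBytes' are assumed to exist; see the README examples in this commit for full setup.
var transcriptionOptions = new AudioTranscriptionOptions { AudioData = BinaryData.FromBytes(audioBytes) };
Response<AudioTranscription> transcription = client.GetAudioTranscription("my-whisper-deployment", transcriptionOptions);
Console.WriteLine(transcription.Value.Text);

var translationOptions = new AudioTranslationOptions { AudioData = BinaryData.FromBytes(audioBytes) };
Response<AudioTranslation> translation = client.GetAudioTranslation("my-whisper-deployment", translationOptions);
Console.WriteLine(translation.Value.Text); // translated to English
```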

### Breaking Changes

- The underlying representation of `PromptFilterResults` (for `Completions` and `ChatCompletions`) has had its response
body key changed from `prompt_annotations` to `prompt_filter_results`
- **Prior versions of the `Azure.AI.OpenAI` library may no longer populate `PromptFilterResults` as expected.** Upgrading to this version is highly recommended if you rely on Azure OpenAI content moderation annotations for input data
- If a library version upgrade is not immediately possible, use `Response<T>.GetRawResponse()` and manually extract the `prompt_filter_results` object from the top level of the `Completions` or `ChatCompletions` response `Content` payload, as sketched below
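
For reference, a rough, hedged sketch of that manual extraction (the `client`, `deploymentName`, and `completionsOptions` variables are hypothetical placeholders, and this is only one way to read the payload):

```C#
// Hedged sketch of the manual workaround for older library versions.
Response<Completions> response = await client.GetCompletionsAsync(deploymentName, completionsOptions);

// Parse the raw payload (System.Text.Json) and read the top-level 'prompt_filter_results' element manually.
using JsonDocument payload = JsonDocument.Parse(response.GetRawResponse().Content);
if (payload.RootElement.TryGetProperty("prompt_filter_results", out JsonElement promptFilterResults))
{
    Console.WriteLine(promptFilterResults.ToString());
}
```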

### Bugs Fixed

### Other Changes

- This library version accounts for the `prompt_filter_results` key change described above and will once again deserialize `PromptFilterResults` appropriately
- `PromptFilterResults` and `ContentFilterResults` are now exposed on the result classes for streaming Completions and Chat Completions: `Streaming(Chat)Completions.PromptFilterResults` reports an index-sorted list of all prompt annotations received so far, while `Streaming(Chat)Choice.ContentFilterResults` reflects the latest content annotations received during streaming (see the sketch below)
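
A hedged sketch of reading those streaming filter results; the streaming call and enumeration names (`GetChatCompletionsStreamingAsync`, `GetChoicesStreaming`) reflect the beta-era surface and are assumptions beyond what this changelog states:

```C#
// Hedged sketch: 'client', the deployment name, and 'chatCompletionsOptions' are assumed to already exist.
Response<StreamingChatCompletions> streamingResponse = await client.GetChatCompletionsStreamingAsync(
    "my-gpt-deployment",
    chatCompletionsOptions);
StreamingChatCompletions streamingChatCompletions = streamingResponse.Value;

await foreach (StreamingChatChoice choice in streamingChatCompletions.GetChoicesStreaming())
{
    // Latest-received content annotations for this choice, as described above.
    var contentFilterResults = choice.ContentFilterResults;
}

// Index-sorted list of all prompt annotations received so far, as described above.
var promptFilterResults = streamingChatCompletions.PromptFilterResults;
```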

## 1.0.0-beta.7 (2023-08-25)

45 changes: 45 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/README.md
@@ -399,6 +399,51 @@ Response<ImageGenerations> imageGenerations = await client.GetImageGenerationsAs
Uri imageUri = imageGenerations.Value.Data[0].Url;
```

### Transcribe audio data with Whisper speech models

```C# Snippet:TranscribeAudio
using Stream audioStreamFromFile = File.OpenRead("myAudioFile.mp3");
BinaryData audioFileData = BinaryData.FromStream(audioStreamFromFile);

var transcriptionOptions = new AudioTranscriptionOptions()
{
AudioData = audioFileData,
ResponseFormat = AudioTranscriptionFormat.Verbose,
};

Response<AudioTranscription> transcriptionResponse = await client.GetAudioTranscriptionAsync(
deploymentId: "my-whisper-deployment", // whisper-1 as model name for non-Azure OpenAI
transcriptionOptions);
AudioTranscription transcription = transcriptionResponse.Value;

// When using Simple, SRT, or VTT formats, only transcription.Text will be populated
Console.WriteLine($"Transcription ({transcription.Duration.Value.TotalSeconds}s):");
Console.WriteLine(transcription.Text);
```
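
When `Verbose` is requested, per-segment details are also available on the result. A hedged sketch of reading them follows; the segment type and member names are assumptions not confirmed by this diff:

```C#
// Hedged sketch: iterate verbose transcription segments (AudioTranscriptionSegment with Start/End/Text is assumed).
foreach (AudioTranscriptionSegment segment in transcription.Segments)
{
    Console.WriteLine($"[{segment.Start} -> {segment.End}] {segment.Text}");
}
```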

### Translate audio data to English with Whisper speech models

```C# Snippet:TranslateAudio
using Stream audioStreamFromFile = File.OpenRead("mySpanishAudioFile.mp3");
BinaryData audioFileData = BinaryData.FromStream(audioStreamFromFile);

var translationOptions = new AudioTranslationOptions()
{
AudioData = audioFileData,
ResponseFormat = AudioTranslationFormat.Verbose,
};

Response<AudioTranslation> translationResponse = await client.GetAudioTranslationAsync(
deploymentId: "my-whisper-deployment", // whisper-1 as model name for non-Azure OpenAI
translationOptions);
AudioTranslation translation = translationResponse.Value;

// When using Simple, SRT, or VTT formats, only translation.Text will be populated
Console.WriteLine($"Translation ({translation.Duration.Value.TotalSeconds}s):");
// .Text will be translated to English (ISO-639-1 "en")
Console.WriteLine(translation.Text);
```

## Troubleshooting

When you interact with Azure OpenAI using the .NET SDK, errors returned by the service correspond to the same HTTP status codes returned for [REST API][openai_rest] requests.
121 changes: 119 additions & 2 deletions sdk/openai/Azure.AI.OpenAI/api/Azure.AI.OpenAI.netstandard2.0.cs

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion sdk/openai/Azure.AI.OpenAI/assets.json
@@ -2,5 +2,5 @@
"AssetsRepo": "Azure/azure-sdk-assets",
"AssetsRepoPrefixPath": "net",
"TagPrefix": "net/openai/Azure.AI.OpenAI",
"Tag": "net/openai/Azure.AI.OpenAI_a0250cd0f1"
"Tag": "net/openai/Azure.AI.OpenAI_52e82965d8"
}
48 changes: 48 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/src/Custom.Suppressions/OpenAIClient.cs
@@ -0,0 +1,48 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

using System.Threading;
using Azure.Core;

namespace Azure.AI.OpenAI
{
[CodeGenSuppress("GetCompletions", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetCompletionsAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetChatCompletions", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetChatCompletionsAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetEmbeddings", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetEmbeddingsAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetChatCompletionsWithAzureExtensions", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetChatCompletionsWithAzureExtensions", typeof(string), typeof(ChatCompletionsOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetChatCompletionsWithAzureExtensionsAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetChatCompletionsWithAzureExtensionsAsync", typeof(string), typeof(ChatCompletionsOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranscriptionAsPlainText", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranscriptionAsPlainText", typeof(string), typeof(AudioTranscriptionOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranscriptionAsPlainTextAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranscriptionAsPlainTextAsync", typeof(string), typeof(AudioTranscriptionOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranscriptionAsResponseObject", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranscriptionAsResponseObject", typeof(string), typeof(AudioTranscriptionOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranscriptionAsResponseObjectAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranscriptionAsResponseObjectAsync", typeof(string), typeof(AudioTranscriptionOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranslationAsPlainText", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranslationAsPlainText", typeof(string), typeof(AudioTranslationOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranslationAsPlainTextAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranslationAsPlainTextAsync", typeof(string), typeof(AudioTranslationOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranslationAsResponseObject", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranslationAsResponseObject", typeof(string), typeof(AudioTranslationOptions), typeof(CancellationToken))]
[CodeGenSuppress("GetAudioTranslationAsResponseObjectAsync", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("GetAudioTranslationAsResponseObjectAsync", typeof(string), typeof(AudioTranslationOptions), typeof(CancellationToken))]
[CodeGenSuppress("CreateGetCompletionsRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("CreateGetChatCompletionsRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("CreateGetEmbeddingsRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("CreateGetChatCompletionsWithAzureExtensionsRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("CreateGetAudioTranscriptionAsPlainTextRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("CreateGetAudioTranscriptionAsResponseObjectRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("CreateGetAudioTranslationAsPlainTextRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
[CodeGenSuppress("CreateGetAudioTranslationAsResponseObjectRequest", typeof(string), typeof(RequestContent), typeof(RequestContext))]
public partial class OpenAIClient
{
}
}
12 changes: 12 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/src/Custom/AudioTaskLabel.cs
@@ -0,0 +1,12 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

namespace Azure.AI.OpenAI
{
internal readonly partial struct AudioTaskLabel
{
// CUSTOM CODE NOTE: here to demote visibility to internal.
}
}
@@ -0,0 +1,30 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

using System.Text.Json;

namespace Azure.AI.OpenAI
{
public partial class AudioTranscription
{
internal static AudioTranscription FromResponse(Response response)
{
if (response.Headers.ContentType.Contains("text/plain"))
{
return new AudioTranscription(
text: response.Content.ToString(),
internalAudioTaskLabel: null,
language: null,
duration: default,
segments: null);
}
else
{
using var document = JsonDocument.Parse(response.Content);
return DeserializeAudioTranscription(document.RootElement);
}
}
}
}
15 changes: 15 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/src/Custom/AudioTranscription.cs
@@ -0,0 +1,15 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

namespace Azure.AI.OpenAI
{
public partial class AudioTranscription
{
// CUSTOM CODE NOTE: included to demote visibility of 'task'

/// <summary> The label that describes which operation type generated the accompanying response data. </summary>
internal AudioTaskLabel? InternalAudioTaskLabel { get; }
}
}
32 changes: 32 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/src/Custom/AudioTranscriptionFormat.cs
@@ -0,0 +1,32 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

using System;
using Azure.Core;

namespace Azure.AI.OpenAI
{
public readonly partial struct AudioTranscriptionFormat : IEquatable<AudioTranscriptionFormat>
{
/// <summary>
/// Specifies that a transcription response should provide plain, unannotated text with no additional metadata.
/// </summary>
[CodeGenMember("Json")]
public static AudioTranscriptionFormat Simple { get; } = new AudioTranscriptionFormat(SimpleValue);

/// <summary>
/// Specifies that a transcription response should provide plain, unannotated text with additional metadata
/// including timings, probability scores, and other processing details.
/// </summary>
[CodeGenMember("VerboseJson")]
public static AudioTranscriptionFormat Verbose { get; } = new AudioTranscriptionFormat(VerboseValue);

// (Note: text is hidden as its behavior is redundant with 'json' when using a shared, strongly-typed response
// value container)
/// <summary> Use a response body that is plain text containing the raw, unannotated transcription. </summary>
[CodeGenMember("Text")]
internal static AudioTranscriptionFormat InternalPlainText { get; } = new AudioTranscriptionFormat(InternalPlainTextValue);
}
}
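
Beyond `Simple` and `Verbose`, the commit notes above indicate that the well-known `Srt` and `Vtt` format names are retained. A minimal, hedged usage sketch (assuming a configured `client`):

```C#
// Hedged sketch: request SubRip (.srt) output; per the README comments, only Text is populated for SRT/VTT.
using Stream audioStream = File.OpenRead("myAudioFile.mp3");
var options = new AudioTranscriptionOptions
{
    AudioData = BinaryData.FromStream(audioStream),
    ResponseFormat = AudioTranscriptionFormat.Srt, // AudioTranscriptionFormat.Vtt is the WebVTT counterpart
};
Response<AudioTranscription> srtResponse = await client.GetAudioTranscriptionAsync("my-whisper-deployment", options);
Console.WriteLine(srtResponse.Value.Text);
```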
@@ -0,0 +1,37 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

using System.Net.Http;
using Azure.Core;

namespace Azure.AI.OpenAI
{
public partial class AudioTranscriptionOptions
{
internal virtual RequestContent ToRequestContent()
{
var content = new MultipartFormDataRequestContent();
content.Add(new StringContent(InternalNonAzureModelName), "model");
content.Add(new ByteArrayContent(AudioData.ToArray()), "file", "@file.wav");
if (Optional.IsDefined(ResponseFormat))
{
content.Add(new StringContent(ResponseFormat.ToString()), "response_format");
}
if (Optional.IsDefined(Prompt))
{
content.Add(new StringContent(Prompt), "prompt");
}
if (Optional.IsDefined(Temperature))
{
content.Add(new StringContent($"{Temperature}"), "temperature");
}
if (Optional.IsDefined(Language))
{
content.Add(new StringContent(Language), "language");
}
return content;
}
}
}
37 changes: 37 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/src/Custom/AudioTranscriptionOptions.cs
@@ -0,0 +1,37 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

using System;

namespace Azure.AI.OpenAI
{
public partial class AudioTranscriptionOptions
{
/// <summary>
/// The audio data to transcribe. This must be the binary content of a file in one of the supported media formats:
/// flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm.
/// <para>
/// To assign a byte[] to this property use <see cref="BinaryData.FromBytes(byte[])"/>; to assign a stream, use
/// <see cref="BinaryData.FromStream(System.IO.Stream)"/>. The data is transmitted as the binary "file" field of a
/// multipart/form-data request rather than as a Base64-encoded string.
/// </para>
/// </summary>
public BinaryData AudioData { get; set; }

/// <summary> Initializes a new instance of AudioTranscriptionOptions. </summary>
public AudioTranscriptionOptions()
{ }

internal string InternalNonAzureModelName { get; set; }
}
}
@@ -0,0 +1,30 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

using System.Text.Json;

namespace Azure.AI.OpenAI
{
public partial class AudioTranslation
{
internal static AudioTranslation FromResponse(Response response)
{
if (response.Headers.ContentType.Contains("text/plain"))
{
return new AudioTranslation(
text: response.Content.ToString(),
internalAudioTaskLabel: null,
language: null,
duration: default,
segments: null);
}
else
{
using var document = JsonDocument.Parse(response.Content);
return DeserializeAudioTranslation(document.RootElement);
}
}
}
}
14 changes: 14 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/src/Custom/AudioTranslation.cs
@@ -0,0 +1,14 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

namespace Azure.AI.OpenAI
{
/// <summary> Result information for an operation that translated spoken audio into written text. </summary>
public partial class AudioTranslation
{
/// <summary> The label that describes which operation type generated the accompanying response data. </summary>
internal AudioTaskLabel? InternalAudioTaskLabel { get; }
}
}
32 changes: 32 additions & 0 deletions sdk/openai/Azure.AI.OpenAI/src/Custom/AudioTranslationFormat.cs
@@ -0,0 +1,32 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#nullable disable

using System;
using Azure.Core;

namespace Azure.AI.OpenAI
{
public readonly partial struct AudioTranslationFormat : IEquatable<AudioTranslationFormat>
{
/// <summary>
/// Specifies that a translation response should provide plain, unannotated text with no additional metadata.
/// </summary>
[CodeGenMember("Json")]
public static AudioTranslationFormat Simple { get; } = new AudioTranslationFormat(SimpleValue);

/// <summary>
/// Specifies that a translation response should provide plain, unannotated text with additional metadata
/// including timings, probability scores, and other processing details.
/// </summary>
[CodeGenMember("VerboseJson")]
public static AudioTranslationFormat Verbose { get; } = new AudioTranslationFormat(VerboseValue);

// (Note: text is hidden as its behavior is redundant with 'json' when using a shared, strongly-typed response
// value container)
/// <summary> Use a response body that is plain text containing the raw, unannotated translation. </summary>
[CodeGenMember("Text")]
internal static AudioTranslationFormat InternalPlainText { get; } = new AudioTranslationFormat(InternalPlainTextValue);
}
}