From 1a397b0153e3e75f51c71a7a0507372a1b0791a3 Mon Sep 17 00:00:00 2001
From: Simon Ottenhaus <69142583+simonottenhauskenbun@users.noreply.github.com>
Date: Fri, 5 Apr 2024 14:19:53 +0200
Subject: [PATCH] community: add tutorial for offline use of
 pyannote/speaker-diarization-3.1

---
 README.md                                   |   2 +
 .../offline_usage_speaker_diarization.ipynb | 172 ++++++++++++++++++
 2 files changed, 174 insertions(+)
 create mode 100644 tutorials/community/offline_usage_speaker_diarization.ipynb

diff --git a/README.md b/README.md
index 2c6a889f1..b3df6eabc 100644
--- a/README.md
+++ b/README.md
@@ -71,6 +71,8 @@ for turn, _, speaker in diarization.itertracks(yield_label=True):
 - [Introduction to speaker diarization](https://umotion.univ-lemans.fr/video/9513-speech-segmentation-and-speaker-diarization/) / JSALT 2023 summer school / 90 min
 - [Speaker segmentation model](https://www.youtube.com/watch?v=wDH2rvkjymY) / Interspeech 2021 / 3 min
 - [First release of pyannote.audio](https://www.youtube.com/watch?v=37R_R82lfwA) / ICASSP 2020 / 8 min
+- Community contributions (not maintained by the core team)
+  - 2024-04-05 > [Offline speaker diarization (speaker-diarization-3.1)](tutorials/community/offline_usage_speaker_diarization.ipynb) by [Simon Ottenhaus](https://github.com/simonottenhauskenbun)

 ## Benchmark

diff --git a/tutorials/community/offline_usage_speaker_diarization.ipynb b/tutorials/community/offline_usage_speaker_diarization.ipynb
new file mode 100644
index 000000000..932742628
--- /dev/null
+++ b/tutorials/community/offline_usage_speaker_diarization.ipynb
@@ -0,0 +1,172 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Offline Speaker Diarization (speaker-diarization-3.1)\n",
    "\n",
    "This notebook gives a short introduction to using the [speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1) pipeline with local models.\n",
    "\n",
    "In order to use local models, you first need to download them from Hugging Face and place them in a local folder.\n",
    "Then you need to create a local config file, similar to the one on Hugging Face, but with local model paths.\n",
    "\n",
    "❗ **Naming of the model files is REALLY important! See the end of the notebook for details.** ❗\n",
    "\n",
    "## Get the models\n",
    "\n",
    "1. Install the `pyannote.audio` package: `!pip install pyannote.audio`\n",
    "2. Create a Hugging Face account: https://huggingface.co/join\n",
    "3. Accept the [pyannote/segmentation-3.0](https://hf.co/pyannote/segmentation-3.0) user conditions\n",
    "4. Create a local folder `models` and place all downloaded files there, either manually or scripted (see the download sketch below):\n",
    "    1. [wespeaker-voxceleb-resnet34-LM](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM/blob/main/pytorch_model.bin), to be saved as `models/pyannote_model_wespeaker-voxceleb-resnet34-LM.bin`\n",
    "    2. [segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0/blob/main/pytorch_model.bin), to be saved as `models/pyannote_model_segmentation-3.0.bin`\n",
    "\n",
    "Running `ls models` should show the following files:\n",
    "```\n",
    "pyannote_model_segmentation-3.0.bin (5.7 MB)\n",
    "pyannote_model_wespeaker-voxceleb-resnet34-LM.bin (26 MB)\n",
    "```\n",
    "\n",
    "❗ **Make sure the 'wespeaker-voxceleb-resnet34-LM' model is named 'pyannote_model_wespeaker-voxceleb-resnet34-LM.bin'!** ❗"
   ]
  },
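  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of clicking through the browser, you can also script the download. The cell below is a minimal sketch, not part of the official pipeline instructions: it assumes the `huggingface_hub` package (installed together with `pyannote.audio`) and uses `HF_TOKEN` as a placeholder for an access token of an account that has accepted the user conditions above.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# minimal download sketch (assumptions: huggingface_hub is available;\n",
    "# HF_TOKEN is a placeholder for your own access token)\n",
    "import shutil\n",
    "from pathlib import Path\n",
    "\n",
    "from huggingface_hub import hf_hub_download\n",
    "\n",
    "HF_TOKEN = \"hf_...\"  # placeholder, replace with your token\n",
    "\n",
    "models_dir = Path(\"models\")\n",
    "models_dir.mkdir(exist_ok=True)\n",
    "\n",
    "# map each Hugging Face repo to the local file name expected by the config below\n",
    "repos = {\n",
    "    \"pyannote/wespeaker-voxceleb-resnet34-LM\": \"pyannote_model_wespeaker-voxceleb-resnet34-LM.bin\",\n",
    "    \"pyannote/segmentation-3.0\": \"pyannote_model_segmentation-3.0.bin\",\n",
    "}\n",
    "for repo_id, local_name in repos.items():\n",
    "    # hf_hub_download returns the path of the file in the local HF cache\n",
    "    cached_path = hf_hub_download(repo_id=repo_id, filename=\"pytorch_model.bin\", token=HF_TOKEN)\n",
    "    # copy it out of the cache under the required file name\n",
    "    shutil.copy(cached_path, models_dir / local_name)\n"
   ]
  },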
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Config for local models\n",
    "\n",
    "Create a local config file, similar to the one on Hugging Face ([speaker-diarization-3.1/blob/main/config.yaml](https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/config.yaml)), but with local model paths.\n",
    "\n",
    "Contents of `models/pyannote_diarization_config.yaml`:\n",
    "\n",
    "```yaml\n",
    "version: 3.1.0\n",
    "\n",
    "pipeline:\n",
    "  name: pyannote.audio.pipelines.SpeakerDiarization\n",
    "  params:\n",
    "    clustering: AgglomerativeClustering\n",
    "    # embedding: pyannote/wespeaker-voxceleb-resnet34-LM # if you want to use the HF model\n",
    "    embedding: models/pyannote_model_wespeaker-voxceleb-resnet34-LM.bin # if you want to use the local model\n",
    "    embedding_batch_size: 32\n",
    "    embedding_exclude_overlap: true\n",
    "    # segmentation: pyannote/segmentation-3.0 # if you want to use the HF model\n",
    "    segmentation: models/pyannote_model_segmentation-3.0.bin # if you want to use the local model\n",
    "    segmentation_batch_size: 32\n",
    "\n",
    "params:\n",
    "  clustering:\n",
    "    method: centroid\n",
    "    min_cluster_size: 12\n",
    "    threshold: 0.7045654963945799\n",
    "  segmentation:\n",
    "    min_duration_off: 0.0\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the local pipeline\n",
    "\n",
    "**Hint**: The paths in the config are relative to the current working directory, not relative to the config file.\n",
    "If you start your notebook/script from a different directory, you can temporarily use `os.chdir` to 'emulate' config-relative paths.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "from pyannote.audio import Pipeline\n",
    "\n",
    "\n",
    "def load_pipeline_from_pretrained(path_to_config: str | Path) -> Pipeline:\n",
    "    # resolve to an absolute path first, so it still works after os.chdir below\n",
    "    path_to_config = Path(path_to_config).resolve()\n",
    "\n",
    "    print(f\"Loading pyannote pipeline from {path_to_config}...\")\n",
    "    # the paths in the config are relative to the current working directory\n",
    "    # so we need to change the working directory to the model path\n",
    "    # and then change it back\n",
    "\n",
    "    cwd = Path.cwd().resolve()  # store current working directory\n",
    "\n",
    "    # first .parent is the folder of the config, second .parent is the folder containing the 'models' folder\n",
    "    cd_to = path_to_config.parent.parent.resolve()\n",
    "\n",
    "    print(f\"Changing working directory to {cd_to}\")\n",
    "    os.chdir(cd_to)\n",
    "\n",
    "    pipeline = Pipeline.from_pretrained(path_to_config)\n",
    "\n",
    "    print(f\"Changing working directory back to {cwd}\")\n",
    "    os.chdir(cwd)\n",
    "\n",
    "    return pipeline\n",
    "\n",
    "\n",
    "PATH_TO_CONFIG = \"path/to/your/pyannote_diarization_config.yaml\"\n",
    "pipeline = load_pipeline_from_pretrained(PATH_TO_CONFIG)"
   ]
  },
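  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once loaded, the local pipeline is used exactly like the Hugging Face hosted one (cf. the snippet in the README). A minimal usage sketch, where `audio.wav` is a placeholder for your own recording:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# run the locally loaded pipeline; \"audio.wav\" is a placeholder for your own file\n",
    "diarization = pipeline(\"audio.wav\")\n",
    "\n",
    "# print one line per detected speaker turn\n",
    "for turn, _, speaker in diarization.itertracks(yield_label=True):\n",
    "    print(f\"start={turn.start:.1f}s stop={turn.end:.1f}s {speaker}\")"
   ]
  },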
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notes on file naming (pyannote-audio 3.1.1)\n",
    "\n",
    "pyannote.audio uses some internal logic to determine the model type.\n",
    "\n",
    "The function `PretrainedSpeakerEmbedding(...)` in [speaker_verification.py](https://github.com/pyannote/pyannote-audio/blob/develop/pyannote/audio/pipelines/speaker_verification.py#L712) uses the file path of the model to infer the model type:\n",
    "\n",
    "```python\n",
    "def PretrainedSpeakerEmbedding(\n",
    "    embedding: PipelineModel,\n",
    "    device: torch.device = None,\n",
    "    use_auth_token: Union[Text, None] = None,\n",
    "):\n",
    "    # ...\n",
    "    if isinstance(embedding, str) and \"pyannote\" in embedding:\n",
    "        return PyannoteAudioPretrainedSpeakerEmbedding(\n",
    "            embedding, device=device, use_auth_token=use_auth_token\n",
    "        )\n",
    "\n",
    "    elif isinstance(embedding, str) and \"speechbrain\" in embedding:\n",
    "        return SpeechBrainPretrainedSpeakerEmbedding(\n",
    "            embedding, device=device, use_auth_token=use_auth_token\n",
    "        )\n",
    "\n",
    "    elif isinstance(embedding, str) and \"nvidia\" in embedding:\n",
    "        return NeMoPretrainedSpeakerEmbedding(embedding, device=device)\n",
    "\n",
    "    elif isinstance(embedding, str) and \"wespeaker\" in embedding:\n",
    "        return ONNXWeSpeakerPretrainedSpeakerEmbedding(embedding, device=device)  # <-- this is called, but wespeaker-voxceleb-resnet34-LM is not an ONNX model\n",
    "\n",
    "    else:\n",
    "        # fallback to pyannote in case we are loading a local model\n",
    "        return PyannoteAudioPretrainedSpeakerEmbedding(\n",
    "            embedding, device=device, use_auth_token=use_auth_token\n",
    "        )\n",
    "```\n",
    "\n",
    "The [wespeaker-voxceleb-resnet34-LM](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM/blob/main/pytorch_model.bin) model is not an ONNX model, but a `PyannoteAudioPretrainedSpeakerEmbedding`. So if `wespeaker` appears in the file name (and `pyannote` does not), the code infers the model type incorrectly. If `pyannote` appears somewhere in the file name, the model type is inferred correctly, because the first `if` branch matches before the `wespeaker` check is reached. This is why the file must be named `pyannote_model_wespeaker-voxceleb-resnet34-LM.bin`.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}