Add streaming parameter #197

Merged · 51 commits · Sep 24, 2024

Changes from all commits
9a3d8da
test: new modules
KevKibe Sep 18, 2024
0baaad2
test: new modules
KevKibe Sep 18, 2024
e0b7a49
test: revert
KevKibe Sep 18, 2024
e1e8961
fix(training): repository creation
KevKibe Sep 18, 2024
33f6e48
update: add pseudo labelling eval
KevKibe Sep 18, 2024
927a9d2
fix: missing import
KevKibe Sep 18, 2024
4fce347
fix: missing import
KevKibe Sep 18, 2024
bf6547c
fix: generation_num_beams parameter
KevKibe Sep 18, 2024
9d96d96
fix: training scheduler
KevKibe Sep 20, 2024
5511946
fix: training scheduler
KevKibe Sep 20, 2024
55c4ac1
fix: training scheduler
KevKibe Sep 20, 2024
4aed3ce
fix: add print statement for len pred_str eval_preds
KevKibe Sep 20, 2024
7b4918d
fix: add print statement for len raw datasets split
KevKibe Sep 20, 2024
7b9b65d
update(data): add streaming parameter
KevKibe Sep 22, 2024
02b827f
update(docs): troubleshoot page
KevKibe Sep 23, 2024
2b568c5
update: test dataset streaming
KevKibe Sep 23, 2024
077a515
fix: ruff formatting
KevKibe Sep 23, 2024
7b54569
fix: ruff formatting
KevKibe Sep 23, 2024
2a39def
fix: refactore HF_WRITE_TOKEN to HF_TOKEN
KevKibe Sep 23, 2024
4ccc70c
fix: add verbosity to pytest commands
KevKibe Sep 23, 2024
bd3cb7d
fix: add num_speakers parameter
KevKibe Sep 23, 2024
e5a11bb
fix: logging
KevKibe Sep 23, 2024
b8c345c
fix: ruff formatting
KevKibe Sep 23, 2024
a95557f
fix: revert to print statements
KevKibe Sep 23, 2024
db82315
fix: filter_eot_tokens
KevKibe Sep 23, 2024
ca0e55f
fix: compute_metrics
KevKibe Sep 23, 2024
e5b1a71
fix: pseudo labelling
KevKibe Sep 23, 2024
404dec8
fix: filter_eot_tokens, compute metrics parameters
KevKibe Sep 24, 2024
04ab848
fix: remove unused imports
KevKibe Sep 24, 2024
62b09d2
update: add lr_scheduler_type as default parameter constant_with_warmup
KevKibe Sep 24, 2024
abcacc9
update(docs): video demo version
KevKibe Sep 24, 2024
b6b5af6
Merge branch 'main' into optimize-data-prep
KevKibe Sep 24, 2024
a6ade6b
configure setup.py
KevKibe Sep 24, 2024
0da2cb4
update(deployment): add num_workers, language to SpeechTranscriptionP…
KevKibe Sep 24, 2024
17c851f
fix: torch, torchvision package versions
KevKibe Sep 24, 2024
30db1df
fix: compute type in load_asr_model
KevKibe Sep 24, 2024
4df9657
fix: compute type in load_asr_model
KevKibe Sep 24, 2024
e8c7187
fix: compute type in load_asr_model
KevKibe Sep 24, 2024
8b7c572
fix: compute type in load_asr_model
KevKibe Sep 24, 2024
9503e95
comment out install_requires
KevKibe Sep 24, 2024
83ca782
update: model config parameters
KevKibe Sep 24, 2024
a0cbe27
fix: ruff formatting
KevKibe Sep 24, 2024
b2d30bb
fix: get_decoder_prompt_ids method
KevKibe Sep 24, 2024
cd8d095
fix: language parameter
KevKibe Sep 24, 2024
0fce305
fix: model.to("cuda") config
KevKibe Sep 24, 2024
4dc94e7
fix: Trainer missing parameter
KevKibe Sep 24, 2024
a5db923
uncomment out install_requires
KevKibe Sep 24, 2024
c50cc00
update
KevKibe Sep 24, 2024
79cef66
update
KevKibe Sep 24, 2024
a5dde5a
comment out install_requires
KevKibe Sep 24, 2024
22f6d05
update trainer
KevKibe Sep 24, 2024
5 changes: 2 additions & 3 deletions .github/workflows/deployment.speech_inference_tests.yaml
@@ -42,7 +42,6 @@ jobs:

- name: Run tests
env:
HF_READ_TOKEN: ${{ secrets.HF_READ_TOKEN }}
HF_WRITE_TOKEN: ${{ secrets.HF_WRITE_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
WANDB_TOKEN: ${{ secrets.WANDB_TOKEN }}
run: pytest src/tests/test_model_optimization.py src/tests/test_transcription_pipeline.py
run: pytest -vv src/tests/test_model_optimization.py src/tests/test_transcription_pipeline.py
7 changes: 3 additions & 4 deletions .github/workflows/training.model_prep_test.yaml
@@ -1,4 +1,4 @@
name: Test training.model_trainer Module.
name: Test training.model_prep Module.

on: [pull_request]

@@ -42,7 +42,6 @@ jobs:

- name: Run tests
env:
HF_READ_TOKEN: ${{ secrets.HF_READ_TOKEN }}
HF_WRITE_TOKEN: ${{ secrets.HF_WRITE_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
WANDB_TOKEN: ${{ secrets.WANDB_TOKEN }}
run: pytest src/tests/test_model_prep.py
run: pytest -vv src/tests/test_model_prep.py
5 changes: 2 additions & 3 deletions .github/workflows/training.model_trainer_tests.yaml
@@ -42,7 +42,6 @@ jobs:

- name: Run tests
env:
HF_READ_TOKEN: ${{ secrets.HF_READ_TOKEN }}
HF_WRITE_TOKEN: ${{ secrets.HF_WRITE_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
WANDB_TOKEN: ${{ secrets.WANDB_TOKEN }}
run: pytest src/tests/test_model_trainer.py
run: pytest -vv src/tests/test_model_trainer.py
7 changes: 3 additions & 4 deletions .github/workflows/training_tests.yaml
@@ -1,4 +1,4 @@
name: Test Data and Model Prep Modules
name: Test Data Loading and Processing Modules

on: [pull_request]

@@ -42,7 +42,6 @@ jobs:

- name: Run tests
env:
HF_READ_TOKEN: ${{ secrets.HF_READ_TOKEN }}
HF_WRITE_TOKEN: ${{ secrets.HF_WRITE_TOKEN }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
WANDB_TOKEN: ${{ secrets.WANDB_TOKEN }}
run: pytest src/tests/test_audio_processor.py src/tests/test_data_prep.py src/tests/test_load_dataset.py
run: pytest -vv src/tests/test_audio_processor.py src/tests/test_data_prep.py src/tests/test_load_dataset.py
4 changes: 3 additions & 1 deletion DOCS/gettingstarted.md
@@ -4,6 +4,7 @@
## Usage Demo on Colab(v0.9.12)
- Refer to documentation below for updated instructions and guides.
<iframe width="560" height="315" src="https://www.youtube.com/embed/NHSV8ZyhMVA?si=6217bgwGGUavm-Nq" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

## Prerequisites

- Sign up to HuggingFace and get your token keys use this [guide](https://huggingface.co/docs/hub/en/security-tokens).
@@ -66,6 +67,7 @@ processed_dataset = process.load_dataset(
feature_extractor=feature_extractor,
tokenizer=tokenizer,
processor=feature_processor,
streaming=True,
train_num_samples = None, # Optional: int - Number of samples to load into training dataset, default the whole training set.
test_num_samples = None ) # Optional: int - Number of samples to load into test dataset, default the whole test set.
# Set None to load the entire dataset
@@ -112,7 +114,7 @@ trainer.train(
from training.merge_lora import Merger

# Merge PEFT fine-tuned model weights with the base model weights
Merger.merge_lora_weights(hf_model_id="your-finetuned-model-name-on-huggingface-hub", huggingface_write_token = " ")
Merger.merge_lora_weights(hf_model_id="your-finetuned-model-name-on-huggingface-hub", huggingface_token = " ")
```

## Step 7: Test Model using an Audio File
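The getting-started hunks above only show the new `streaming` argument and the `huggingface_token` rename in place. Below is a minimal usage sketch of both loading modes, pieced together from these docs and the updated tests later in this PR; the `Dataset` import path and the sample values are assumptions, since the test file's import lines are collapsed in this diff.

```python
import os

# Import path is an assumption; the tests in this PR construct `Dataset` directly,
# but their import statements are not visible in the diff.
from training.load_data import Dataset

data_loader = Dataset(
    huggingface_token=os.environ.get("HF_TOKEN"),  # single HF_TOKEN replaces HF_READ_TOKEN/HF_WRITE_TOKEN
    dataset_name="mozilla-foundation/common_voice_16_1",
    language_abbr=["af"],
)

# streaming=True iterates the dataset lazily instead of downloading it up front;
# streaming=False keeps the previous download-then-load behaviour.
streamed_dataset = data_loader.load_dataset(streaming=True, train_num_samples=10, test_num_samples=10)
batch_dataset = data_loader.load_dataset(streaming=False, train_num_samples=10, test_num_samples=10)
```

Both calls return objects with `"train"` and `"test"` splits, as exercised in `test_audio_processor.py` further down.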
23 changes: 19 additions & 4 deletions DOCS/troubleshoot.md
@@ -1,10 +1,18 @@
## Troubleshooting Tips

- If you encounter trouble installing `africanwhisper` package on Kaggle, see: <br>
[Issue #142](https://github.com/KevKibe/African-Whisper/issues/142)
- If you encounter trouble installing `africanwhisper` package on Kaggle, and encounter the error:
```commandline
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.10/site-packages/aiohttp-3.9.1.dist-info/METADATA'
```
Execute this command before installing the package:
```commandline
!rm /opt/conda/lib/python3.10/site-packages/aiohttp-3.9.1.dist-info -rdf
```
see [Issue #142](https://github.com/KevKibe/African-Whisper/issues/142) for more info.


- If you encounter this error installing `africanwhisper` package on Colab:
```
```commandline
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy 3.7.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.12.3 which is incompatible.
torchtext 0.18.0 requires torch>=2.3.0, but you have torch 2.2.2 which is incompatible.
@@ -15,4 +23,11 @@ WARNING: The following packages were previously imported in this runtime:
[pydevd_plugins]
You must restart the runtime in order to use newly installed versions.
```
- Restart the kernel and continue with the next step.
restart the kernel and continue with the next step.

- If you encounter the error:
```commandline
TypeError: expected string or bytes-like object
```
upgrade `pandas` version to `2.2.2` and restart kernel

5 changes: 3 additions & 2 deletions requirements.txt
@@ -20,8 +20,9 @@ faster-whisper==1.0.3
python-dotenv==1.0.1
pyannote.audio==3.2.0
nltk==3.8.1
torchvision==0.17.2
torchvision
ctranslate2==4.3.1
pandas==2.0.3
pandas==2.2.2
fastapi==0.111.0
uvicorn==0.30.1
tqdm
15 changes: 9 additions & 6 deletions setup.py
@@ -25,13 +25,16 @@
"python-dotenv==1.0.1",
"pyannote-audio==3.2.0",
"nltk==3.8.1",
"torchvision==0.17.2",
"torchvision",
"ctranslate2==4.3.1",
"pandas==2.0.3",
"pandas==2.2.2",
"huggingface_hub",
"soundfile",
"tqdm"
]

DEPLOYMENT_DEPS = [
"torch==2.3.1",
"torch",
"transformers==4.42.3",
"pydantic==2.7.3",
"prometheus-client==0.20.0",
@@ -41,15 +44,15 @@
"faster-whisper==1.0.3",
"pyannote-audio==3.2.0",
"nltk==3.8.1",
"torchvision==0.17.2",
"torchvision",
"ctranslate2==4.3.1",
"pandas==2.2.1",
"pandas==2.2.2",
]
ALL_DEPS = BASE_DEPS + DEPLOYMENT_DEPS

setup(
name="africanwhisper",
version="0.9.12",
version="0.9.13",
author="Kevin Kibe",
author_email="[email protected]",
package_dir={"": "src"},
6 changes: 3 additions & 3 deletions src/deployment/faster_whisper/load_asr_model.py
@@ -46,7 +46,7 @@ def load_asr_model(whisper_arch,
print("No language specified, language will be first be detected for each audio file (increases inference time).")
tokenizer = None

default_asr_options = {
default_asr_options = { # explore temperature_increment_on_fallback parameter
"beam_size": 5,
"best_of": 5,
"patience": 1,
@@ -57,15 +57,15 @@
"compression_ratio_threshold": 2.4,
"log_prob_threshold": -1.0,
"no_speech_threshold": 0.6,
"condition_on_previous_text": False,
"condition_on_previous_text": False, # explore True
"prompt_reset_on_temperature": 0.5,
"initial_prompt": None,
"prefix": None,
"suppress_blank": True,
"suppress_tokens": [-1],
"without_timestamps": True,
"max_initial_timestamp": 0.0,
"word_timestamps": False,
"word_timestamps": False, # Explore True
"prepend_punctuations": "\"'“¿([{-",
"append_punctuations": "\"'.。,,!!??::”)]}、",
"suppress_numerals": False,
4 changes: 2 additions & 2 deletions src/deployment/requirements.txt
@@ -8,7 +8,7 @@ python-dotenv==1.0.1
faster-whisper==1.0.3
pyannote.audio==3.1.1
nltk==3.8.1
torchvision==0.17.2
torchvision
ctranslate2==4.1.0
pandas==2.2.1
pandas==2.2.2
python-multipart==0.0.9
18 changes: 10 additions & 8 deletions src/deployment/speech_inference.py
@@ -40,15 +40,15 @@ def convert_model_to_optimized_format(self) -> None:
else:
print(f"Model {self.model_name} is already in CTranslate2 format")

def load_transcription_model(self) -> object:
def load_transcription_model(self, beam_size: int = 5, language = None) -> object:
"""
Loads the ASR model for transcription.

Returns:
object: Loaded ASR model.
"""
asr_options = {
"beam_size": 5,
"beam_size": beam_size,
"patience": 1.0,
"length_penalty": 1.0,
"temperatures": 0,
@@ -61,14 +61,15 @@
"suppress_numerals": True,
}
model_dir = None
compute_type = "bfloat16" if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else "float32"
# compute_type = "bfloat16" if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else "float16"
compute_type = "float16" if torch.cuda.is_available() else "float32"
model = load_asr_model(
whisper_arch = self.model_name,
device=self.device,
device_index=0,
device_index=0, #for multi-gpu processing
download_root=model_dir,
compute_type=compute_type,
language=None,
language=language,
asr_options=asr_options,
vad_options={"vad_onset": 0.500, "vad_offset": 0.363},
threads=8
@@ -88,7 +89,6 @@ class SpeechTranscriptionPipeline:
batch_size (int): Number of audio segments to process per batch.
chunk_size (int): Duration of each audio chunk for processing.
huggingface_token (str): Read token for accessing Huggingface API.
model_name (str): Name of the model to be used for transcription.
"""
def __init__(self,
audio_file_path: str,
@@ -101,7 +101,7 @@ def __init__(self,
self.device = 0 if torch.cuda.is_available() else "cpu"
self.batch_size = batch_size
self.chunk_size = chunk_size
self.huggingface_token = huggingface_token
self.huggingface_token = huggingface_token,


def transcribe_audio(self, model) -> Dict:
@@ -155,21 +155,23 @@ def align_transcription(self, transcription_result: Dict, alignment_model: str =

def diarize_audio(self,
alignment_result: Dict,
num_speakers: int = 1,
min_speakers: int = 1,
max_speakers: int = 3) -> Dict:
"""
Diarizes the audio and assigns speakers to each segment.

Args:
alignment_result (Dict): Alignment result to be diarized.
num_speakers (int, optional): Number of speakers. Defaults to 1.
min_speakers (int, optional): Minimum number of speakers. Defaults to 1.
max_speakers (int, optional): Maximum number of speakers. Defaults to 3.

Returns:
Dict: Diarization result with speakers assigned to segments.
"""
diarize_model = DiarizationPipeline(token=self.huggingface_token, device=self.device)
diarize_segments = diarize_model(self.audio, min_speakers=min_speakers, max_speakers=max_speakers)
diarize_segments = diarize_model(self.audio, num_speakers = num_speakers, min_speakers=min_speakers, max_speakers=max_speakers)
diarization_result = assign_word_speakers(diarize_segments, alignment_result)
return diarization_result

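For context, here is a hedged sketch of how the new `beam_size`, `language`, and `num_speakers` parameters surface at call time. The `ModelOptimization` class name, the import paths, and the exact `SpeechTranscriptionPipeline` constructor arguments are assumptions, since only parts of `speech_inference.py` are visible in this diff.

```python
import os

# Names marked "assumed" are not visible in this diff and are illustrative only.
from deployment.speech_inference import ModelOptimization, SpeechTranscriptionPipeline  # paths assumed

optimizer = ModelOptimization(model_name="your-finetuned-model-name-on-huggingface-hub")  # class name assumed
optimizer.convert_model_to_optimized_format()

# beam_size and language are the new load_transcription_model parameters in this PR.
model = optimizer.load_transcription_model(beam_size=5, language="af")

pipeline = SpeechTranscriptionPipeline(
    audio_file_path="sample.wav",  # illustrative values
    batch_size=16,
    chunk_size=30,
    huggingface_token=os.environ.get("HF_TOKEN"),
)

transcription = pipeline.transcribe_audio(model)
alignment = pipeline.align_transcription(transcription)
# num_speakers is the new diarization parameter, alongside min_speakers/max_speakers.
diarization = pipeline.diarize_audio(alignment, num_speakers=2, min_speakers=1, max_speakers=3)
```

Whether `num_speakers` overrides or complements `min_speakers`/`max_speakers` depends on the underlying pyannote diarization pipeline, which this diff does not show.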
47 changes: 39 additions & 8 deletions src/tests/test_audio_processor.py
@@ -17,20 +17,29 @@ def setUp(self):
"""
# Load dataset
self.data_loader = Dataset(
huggingface_token = os.environ.get("HF_WRITE_TOKEN"),
huggingface_token = os.environ.get("HF_TOKEN"),
dataset_name="mozilla-foundation/common_voice_16_1",
language_abbr=["yi", "ti"]
language_abbr=["af"]
)
self.dataset = self.data_loader.load_dataset(train_num_samples=10, test_num_samples=10)
has_train_sample = any(True for _ in self.dataset["train"])
self.dataset_streaming = self.data_loader.load_dataset(streaming=True, train_num_samples=10, test_num_samples=10)
self.dataset_batch = self.data_loader.load_dataset(streaming=False, train_num_samples=10, test_num_samples=10)

has_train_sample = any(True for _ in self.dataset_streaming["train"])
assert has_train_sample, "Train dataset is empty!"

has_test_sample = any(True for _ in self.dataset_streaming["test"])
assert has_test_sample, "Test dataset is empty!"

has_train_sample = any(True for _ in self.dataset_batch["train"])
assert has_train_sample, "Train dataset is empty!"

has_test_sample = any(True for _ in self.dataset["test"])
has_test_sample = any(True for _ in self.dataset_batch["test"])
assert has_test_sample, "Test dataset is empty!"

# Initialize model preparation
self.model_prep = WhisperModelPrep(
model_id="openai/whisper-small",
language = ["af"],
model_id="openai/whisper-tiny",
processing_task="transcribe",
use_peft=False
)
@@ -42,7 +51,14 @@

# Initialize AudioDataProcessor
self.processor = AudioDataProcessor(
dataset=self.dataset,
dataset=self.dataset_streaming,
feature_extractor=self.feature_extractor,
tokenizer=self.tokenizer,
feature_processor=self.feature_processor
)

self.processor_batch = AudioDataProcessor(
dataset=self.dataset_batch,
feature_extractor=self.feature_extractor,
tokenizer=self.tokenizer,
feature_processor=self.feature_processor
@@ -53,7 +69,7 @@ def test_resampled_dataset(self):
Test the resampled_dataset method.
"""
# Arrange
sample_dataset = self.dataset
sample_dataset = self.dataset_streaming

# Act & Assert
for split, samples in sample_dataset.items():
@@ -63,5 +79,20 @@
self.assertIn("labels", resampled_data)
self.assertEqual(resampled_data["audio"]["sampling_rate"], 16000)

def test_resampled_dataset_batch(self):
"""
Test the resampled_dataset method.
"""
# Arrange
sample_dataset = self.dataset_batch

# Act & Assert
for split, samples in sample_dataset.items():
for sample in samples:
resampled_data = self.processor_batch.resampled_dataset(sample)
self.assertIn("input_features", resampled_data)
self.assertIn("labels", resampled_data)
self.assertEqual(resampled_data["audio"]["sampling_rate"], 16000)

if __name__ == '__main__':
unittest.main()