no dysfluencies #10

Open
bisserai opened this issue Oct 24, 2024 · 1 comment

Comments

@bisserai

Hello,

First of all, thanks for developing this tool and making it available! I'm trying to use CrisperWhisper to annotate a naturalistic language production experiment in German. The files are 1 min each, with a single speaker per recording answering open-ended questions. I'm running the code below on CPU and get at most one dysfluency in a 17 s excerpt that contains 4 "ehm"s, and I can't find a way to improve this. My colleague tried running it on a server; it is much faster there, but the dysfluency detection does not improve.

Thanks for your help!
bissera

I have a MacBook Pro with a 2.3 GHz quad-core Intel Core i7 processor and 32 GB of RAM, running macOS Sonoma 14.4.1. My IDE is VS Code 1.94.1.

The code in my Jupyter notebook:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available, otherwise CPU with full precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "nyrahealth/CrisperWhisper"

# Load the model weights and the matching processor (tokenizer + feature extractor)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

# ASR pipeline with word-level timestamps, transcribing German
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps="word",
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"language": "<|de|>", "task": "transcribe"},
)
@LaurinmyReha
Contributor

Hello,

Thank you for using it. Your code looks good. The problem is that for German we had far fewer disfluencies and worked largely with synthetic data, hoping that the ability to detect disfluencies would transfer over from English. We have also observed that disfluency detection for German is not yet satisfactory. However, we have now constructed a dataset containing over 100,000 annotated fillers for German and will retrain CrisperWhisper soon, so I hope the updated version will be more helpful for German.

In the meantime, I have seen some improvements for German by increasing the beam size and checking out beams that are a bit less probable. In your case, taking the beams containing the most fillers could be a simple heuristic that improves what you are looking for.
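The heuristic above could be sketched roughly as follows. This is only an illustration under several assumptions, not the maintainers' implementation: the helper names (`count_fillers`, `pick_most_fillers`, `transcribe_candidates`) and the filler spellings in `GERMAN_FILLERS` are hypothetical and should be adapted to what the model actually outputs for your recordings. `transcribe_candidates` calls `model.generate` directly with `num_beams` and `num_return_sequences` instead of going through the pipeline, and assumes that your transformers version returns several hypotheses for Whisper and that the clip is no longer than 30 s. Word-level timestamps are not produced on this path.

```python
import re
import unicodedata

# Assumed filler spellings (hypothetical); adjust to the tokens the model emits.
GERMAN_FILLERS = frozenset({"äh", "ähm", "ehm", "hm"})


def count_fillers(text, fillers=GERMAN_FILLERS):
    """Count the words in `text` that are in the filler set (case-insensitive)."""
    normalized = unicodedata.normalize("NFC", text).lower()
    return sum(1 for word in re.findall(r"\w+", normalized) if word in fillers)


def pick_most_fillers(candidates, fillers=GERMAN_FILLERS):
    """Return the candidate transcript with the most fillers.

    `max` keeps the first maximal element, so if the candidates are ordered
    from most to least probable, ties go to the more probable beam.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda text: count_fillers(text, fillers))


def transcribe_candidates(model, processor, audio, sampling_rate=16000, num_beams=5):
    """Return `num_beams` beam-search hypotheses for one short clip (<= 30 s).

    Assumes `model` and `processor` were loaded as in the original post and
    `audio` is a 1-D float array sampled at `sampling_rate`.
    """
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    features = inputs.input_features.to(model.device, dtype=model.dtype)
    ids = model.generate(
        features,
        num_beams=num_beams,
        num_return_sequences=num_beams,  # keep every beam, not just the best one
        language="de",
        task="transcribe",
    )
    return processor.batch_decode(ids, skip_special_tokens=True)
```

With a loaded model, `pick_most_fillers(transcribe_candidates(model, processor, audio))` would select the hypothesis to keep. Since this favours fillers over model probability, it may also pick up hallucinated fillers, so spot-checking against the audio seems advisable.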
