Whisper-Large-v2 Model for Audio Transcription: Repeated and Missing Transcription Text
Hi everyone,
I have been using the whisper-large-v2 model to transcribe audio files that are about 10 to 20 minutes long. I have tried many different techniques to improve the accuracy of the transcription, but so far, nothing has worked.
One of the main issues I am facing is that transcribed text is repeated across chunk boundaries, while other text is missing altogether. I have tried normalizing the audio to [-1, 1], as well as tuning hyperparameters such as chunk_length_s and stride_length_s. However, the best parameters vary from file to file, and no single setting generalizes.
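For illustration, scaling audio to [-1, 1] can be done with simple peak normalization. This is a minimal sketch assuming NumPy; the helper name `peak_normalize` is hypothetical and not necessarily what was used here. Note that librosa.load already returns floating-point samples in roughly this range, so this step may change very little.

```python
import numpy as np


def peak_normalize(y, eps=1e-8):
    """Scale a waveform so that its largest absolute sample equals 1.0.

    `peak_normalize` is an illustrative helper, not part of any library.
    Silent (all-zero) or empty input is returned unchanged.
    """
    y = np.asarray(y, dtype=np.float32)
    peak = float(np.max(np.abs(y))) if y.size else 0.0
    if peak < eps:
        return y
    return y / peak


# Example: the loudest sample (-2.0) is mapped to -1.0
normalized = peak_normalize([0.5, -2.0, 1.0])
print(normalized)
```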
I am using the following environment:
- whisper-large-v2
- torch 1.10.0+cu111
- transformers 4.28.1 (latest)
Here is the code I am using:
import librosa
from pydub import AudioSegment
from transformers import pipeline

MODEL_NAME = "openai/whisper-large-v2"  # assumed Hub id for whisper-large-v2
pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    device="cuda:0",
    generate_kwargs={"task": "transcribe"},
)
# Convert the MP3 to WAV, then load it and resample to 16 kHz
sound = AudioSegment.from_file("sample.mp3", format="mp3")
sound.export("sample.wav", format="wav")
y, sr = librosa.load("sample.wav", sr=None)
y_resampled = librosa.resample(y, orig_sr=sr, target_sr=16000)
outputs = pipe(
    y_resampled, return_timestamps=True,
    generate_kwargs={"task": "transcribe", "language": "<|zh|>"},
    chunk_length_s=30, stride_length_s=5, batch_size=16, max_new_tokens=512,
)
print(outputs["text"])
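Since return_timestamps=True is set, the pipeline output should also contain a "chunks" list of {"timestamp": (start, end), "text": ...} entries, which can help locate where text is repeated or dropped. The following is a rough diagnostic sketch only: the helper name `find_chunk_issues`, the `max_gap_s` threshold, and the sample data are all made up for illustration and do not come from a real run.

```python
def find_chunk_issues(chunks, max_gap_s=2.0):
    """Flag consecutive chunks with identical text or large time gaps.

    `chunks` is assumed to look like outputs["chunks"] from the pipeline:
    a list of dicts with "timestamp" (start, end) and "text" keys.
    """
    issues = []
    for prev, cur in zip(chunks, chunks[1:]):
        prev_end = prev["timestamp"][1]
        cur_start = cur["timestamp"][0]
        # Identical neighbouring text suggests a duplicated segment
        if prev["text"].strip() == cur["text"].strip():
            issues.append(("repeat", cur["timestamp"], cur["text"].strip()))
        # A long silence between segments may indicate dropped speech
        # (timestamps can be None, so guard before subtracting)
        if prev_end is not None and cur_start is not None:
            if cur_start - prev_end > max_gap_s:
                issues.append(("gap", (prev_end, cur_start), ""))
    return issues


# Hand-written sample in the assumed output shape (not real model output)
sample_chunks = [
    {"timestamp": (0.0, 4.0), "text": " hello everyone"},
    {"timestamp": (4.0, 9.0), "text": " welcome to the show"},
    {"timestamp": (9.0, 14.0), "text": " welcome to the show"},
    {"timestamp": (25.0, 30.0), "text": " see you next time"},
]
for issue in find_chunk_issues(sample_chunks):
    print(issue)
```

A flagged gap is only a hint, since genuine silence in the recording looks the same as missing text.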
Overall, I feel that the model's performance is not yet good enough for me to use it in production. I would appreciate any suggestions or advice on how to improve the accuracy of the transcription.
Thank you.
Hi there,
Any ideas on this? I am having the same problem using very similar code and parameters. From my tests, it looks like the chunking algorithm is the most likely cause: https://huggingface.co/blog/asr-chunking
Playing around with the stride parameters sometimes gives me good results. But when I use the same parameters for a different file, parts may be missing again.
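To make the stride idea concrete, here is a rough sketch of how overlapping windows could be laid out for chunk_length_s=30 and stride_length_s=5, following the general scheme described in the blog post. The function name `chunk_windows` is made up for illustration, and the pipeline's real chunking and merging logic may differ from this; it only shows which part of each window would be kept if the stride on each side were simply discarded.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end, keep_start, keep_end) tuples, in seconds.

    Each window overlaps its neighbours by the stride on both sides.
    The "keep" range is the window minus its strides, except that the
    first window keeps its left edge and the last keeps its right edge.
    Illustrative only, not the library's implementation.
    """
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than 2 * stride_length_s")
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        is_first = start == 0.0
        is_last = end >= total_s
        keep_start = start if is_first else start + stride_length_s
        keep_end = end if is_last else end - stride_length_s
        windows.append((start, end, keep_start, keep_end))
        if is_last:
            break
        start += step
    return windows


# A 70-second file is covered by three overlapping 30-second windows
for window in chunk_windows(70.0):
    print(window)
```

In this layout the kept ranges tile the audio without gaps, so if a word is cut differently in two neighbouring windows, the merge step is where duplicated or dropped text could plausibly creep in.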
I am comparing against the OpenAI Whisper API, and it always gets it right. So it must be solvable, obviously :)
Thx so much
Andi
Hello Good People,
Any update on this? I'm experiencing a similar problem where the transcription contains repeated text or is missing some content, particularly when the audio length exceeds 30 seconds.
Regards
Abaddon - The Knight of Hell