What if we segment the audio first and then transcribe, though? It's some extra compute to throw in, but IMO it would result in better output!
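A minimal sketch of that idea, assuming a local file (`speech.wav` is a placeholder), silence-based segmentation via librosa, and a small Whisper checkpoint loaded through the Transformers pipeline; the split threshold and model choice are illustrative, not tuned recommendations:

```python
# Hypothetical sketch: segment on silence first, then transcribe each segment.
import librosa
from transformers import pipeline

# Whisper checkpoint is illustrative; any ASR checkpoint would work here.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Load the audio at 16 kHz, the sampling rate Whisper expects.
audio, sr = librosa.load("speech.wav", sr=16_000)  # placeholder path

# Split into non-silent intervals (sample indices); top_db is a rough threshold.
intervals = librosa.effects.split(audio, top_db=30)

# Transcribe each segment independently and stitch the texts back together.
texts = [
    asr({"raw": audio[start:end], "sampling_rate": sr})["text"]
    for start, end in intervals
]
print(" ".join(t.strip() for t in texts))
```

Splitting on silence keeps each chunk short, which is the same intuition behind the chunked long-form algorithm discussed in the post below.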
Ayaan Sharif
Ayaan-Sharif
AI & ML interests
NLP, LLM, TEXT, Languages
Recent Activity
replied to sanchit-gandhi's post · 4 days ago
Why does returning timestamps help Whisper reduce hallucinations? 🧐
Empirically, most practitioners have found that setting `return_timestamps=True` helps reduce hallucinations, particularly when doing long-form evaluation with Transformers’ “chunked” algorithm.
But why does this work?
My interpretation is that forcing the model to predict timestamps is contradictory to hallucinations. Suppose you have the transcription:
```markdown
The cat sat on the on the on the mat.
```
Here we have a repeated hallucination of “on the”. If we ask the model to predict timestamps, then the “on the” has to contribute to the overall segment-level timing, e.g.:
```markdown
<|0.00|> The cat sat on the on the on the mat.<|5.02|>
```
However, it’s impossible to fit 3 copies of “on the” within the time allocation given to the segment, so the probability for this hallucinatory sequence becomes lower, and the model actually predicts the correct transcription with highest probability:
```markdown
<|0.00|> The cat sat on the mat.<|5.02|>
```
In this sense, the end timestamp is the opposite of the initial timestamp constraint described in Section 4.5 of the paper https://huggingface.co/papers/2212.04356 → it helps the model remove extra words at the end of the sequence (whereas the initial timestamp helps when the model ignores words at the start), but the overall principle is the same (using timestamps to improve the probability of more realistic sequences).
Leaving it open to you: why do you think timestamps reduce Whisper hallucinations?
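For anyone who wants to try this, here is a minimal sketch of the setup discussed above, using the Transformers ASR pipeline with the chunked long-form algorithm; the checkpoint and the `long_audio.wav` path are placeholders:

```python
# Sketch: chunked long-form transcription with and without timestamp prediction.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",  # illustrative checkpoint
    chunk_length_s=30,                # enables the "chunked" long-form algorithm
)

# Plain decoding: more prone to repeated / hallucinated phrases on long audio.
plain = asr("long_audio.wav")  # placeholder path
print(plain["text"])

# Timestamped decoding: each segment must fit its predicted time window,
# which empirically suppresses repeated hallucinations.
timestamped = asr("long_audio.wav", return_timestamps=True)
for chunk in timestamped["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```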
liked a Space · 5 days ago
R1ckShi/FunClip
liked a Space · 5 days ago
sanchit-gandhi/whisper-jax
Organizations
None yet
Ayaan-Sharif's activity
upvoted a collection · 13 days ago
reacted to vladbogo's post with 👍 · 24 days ago
Post
Panda-70M is a new large-scale video dataset comprising 70 million high-quality video clips, each paired with a textual caption, designed to serve as pre-training data for video understanding tasks.
Key Points:
* Automatic Caption Generation: Utilizes an automatic pipeline with multiple cross-modality teacher models to generate captions for video clips.
* Fine-tuned Caption Selection: Employs a fine-tuned retrieval model to select the most appropriate caption from multiple candidates for each video clip.
* Improved Performance: Pre-training on Panda-70M shows significant performance gains in video captioning, text-video retrieval, and text-driven video generation.
Paper: Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (2402.19479)
Project page: https://snap-research.github.io/Panda-70M/
Code: https://github.com/snap-research/Panda-70M
Congrats to the authors @tschen, @aliaksandr-siarohin et al. for their work!
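To make the caption-selection step concrete, here is an illustrative sketch (not the Panda-70M pipeline itself) of picking the best caption from several candidates with a cross-modal retrieval model, here CLIP scoring a single representative frame; the frame path and candidate captions are placeholders:

```python
# Illustrative only: choose the best of several candidate captions for a frame
# by cross-modal similarity, loosely analogous to Panda-70M's retrieval step.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.open("frame.jpg")  # placeholder: a representative frame from the clip
candidates = [
    "a dog running on the beach",
    "a person surfing at sunset",
    "a group of birds flying over water",
]

# Score each caption against the frame and keep the highest-scoring one.
inputs = processor(text=candidates, images=frame, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image  # shape: (1, num_candidates)
best = candidates[logits.argmax(dim=-1).item()]
print(best)
```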
multi gpu setup when?
2
#5 opened about 1 month ago by Ayaan-Sharif
reacted to merve's post with ❤️ · 2 months ago
Post
Another great week in open ML!
Here's a small recap 🫰🏻
Model releases
⏯️ Video Language Models
AI at Meta released Vision-CAIR/LongVU_Qwen2_7B, a new state-of-the-art long video LM model based on DINOv2, SigLIP, Qwen2 and Llama 3.2
💬 Small language models
Hugging Face released HuggingFaceTB/SmolLM2-1.7B, a family of new smol language models with Apache 2.0 license that come in sizes 135M, 360M and 1.7B, along with datasets.
Meta released facebook/MobileLLM-1B, a new family of on-device LLMs of sizes 125M, 350M and 600M
🖼️ Image Generation
Stability AI released stabilityai/stable-diffusion-3.5-medium, a 2B model with commercially permissive license
🖼️💬Any-to-Any
gpt-omni/mini-omni2, the closest reproduction of GPT-4o, is released: a new LLM that can take image, text, and audio input and output speech!
Dataset releases
🖼️ Spawning/PD12M, a new captioning dataset of 12.4 million examples generated using Florence-2