What if we segment the audio first and then transcribe, though? It's some extra compute to throw in, but IMO it would result in better output!
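A minimal sketch of that idea, assuming a local file (`speech.wav` is a placeholder), silence-based segmentation via librosa, and a small Whisper checkpoint loaded through the Transformers pipeline; the split threshold and model choice are illustrative, not tuned recommendations:

```python
# Hypothetical sketch: segment on silence first, then transcribe each segment.
import librosa
from transformers import pipeline

# Whisper checkpoint is illustrative; any ASR checkpoint would work here.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Load the audio at 16 kHz, the sampling rate Whisper expects.
audio, sr = librosa.load("speech.wav", sr=16_000)  # placeholder path

# Split into non-silent intervals (sample indices); top_db is a rough threshold.
intervals = librosa.effects.split(audio, top_db=30)

# Transcribe each segment independently and stitch the texts back together.
texts = [
    asr({"raw": audio[start:end], "sampling_rate": sr})["text"]
    for start, end in intervals
]
print(" ".join(t.strip() for t in texts))
```

Splitting on silence keeps each chunk short, which is the same intuition behind the chunked long-form algorithm discussed in the post below.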
Ayaan Sharif
Ayaan-Sharif
AI & ML interests
NLP, LLM, TEXT, Languages
Recent Activity
replied to sanchit-gandhi's post · 4 days ago
Why does returning timestamps help Whisper reduce hallucinations? 🧐
Empirically, most practitioners have found that setting `return_timestamps=True` helps reduce hallucinations, particularly when doing long-form evaluation with Transformers’ “chunked” algorithm.
But why does this work?
My interpretation is that forcing the model to predict timestamps is contradictory to hallucinations. Suppose you have the transcription:
```markdown
The cat sat on the on the on the mat.
```
Here we have a repeated hallucination of “on the”. If we ask the model to predict timestamps, then the “on the” has to contribute to the overall segment-level timing, e.g.:
```markdown
<|0.00|> The cat sat on the on the on the mat.<|5.02|>
```
However, it’s impossible to fit 3 copies of “on the” within the time allocation given to the segment, so the probability for this hallucinatory sequence becomes lower, and the model actually predicts the correct transcription with highest probability:
```markdown
<|0.00|> The cat sat on the mat.<|5.02|>
```
In this sense, the end timestamp is the opposite of the initial timestamp constraint described in Section 4.5 of the paper https://huggingface.co/papers/2212.04356 → it helps the model remove extra words at the end of the sequence (whereas the initial timestamp helps when the model ignores words at the start), but the overall principle is the same (using timestamps to improve the probability of more realistic sequences).
Leaving it open to you: why do you think timestamps reduce Whisper hallucinations?
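For anyone who wants to try this, here is a minimal sketch of the setup discussed above, using the Transformers ASR pipeline with the chunked long-form algorithm; the checkpoint and the `long_audio.wav` path are placeholders:

```python
# Sketch: chunked long-form transcription with and without timestamp prediction.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",  # illustrative checkpoint
    chunk_length_s=30,                # enables the "chunked" long-form algorithm
)

# Plain decoding: more prone to repeated / hallucinated phrases on long audio.
plain = asr("long_audio.wav")  # placeholder path
print(plain["text"])

# Timestamped decoding: each segment must fit its predicted time window,
# which empirically suppresses repeated hallucinations.
timestamped = asr("long_audio.wav", return_timestamps=True)
for chunk in timestamped["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```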
liked a Space · 5 days ago
R1ckShi/FunClip
liked a Space · 5 days ago
sanchit-gandhi/whisper-jax
Organizations
None yet
Ayaan-Sharif's activity
upvoted a collection · 13 days ago
reacted to vladbogo's post with 👍 · 24 days ago
Post
Panda-70M is a new large-scale video dataset comprising 70 million high-quality video clips, each paired with a textual caption, designed to serve as pre-training data for video understanding tasks.
Key Points:
* Automatic Caption Generation: Utilizes an automatic pipeline with multiple cross-modality teacher models to generate captions for video clips.
* Fine-tuned Caption Selection: Employs a fine-tuned retrieval model to select the most appropriate caption from multiple candidates for each video clip.
* Improved Performance: Pre-training on Panda-70M shows significant performance gains in video captioning, text-video retrieval, and text-driven video generation.
Paper: Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (2402.19479)
Project page: https://snap-research.github.io/Panda-70M/
Code: https://github.com/snap-research/Panda-70M
Congrats to the authors @tschen, @aliaksandr-siarohin et al. for their work!
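To make the caption-selection step concrete, here is an illustrative sketch (not the Panda-70M pipeline itself) of picking the best caption from several candidates with a cross-modal retrieval model, here CLIP scoring a single representative frame; the frame path and candidate captions are placeholders:

```python
# Illustrative only: choose the best of several candidate captions for a frame
# by cross-modal similarity, loosely analogous to Panda-70M's retrieval step.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

frame = Image.open("frame.jpg")  # placeholder: a representative frame from the clip
candidates = [
    "a dog running on the beach",
    "a person surfing at sunset",
    "a group of birds flying over water",
]

# Score each caption against the frame and keep the highest-scoring one.
inputs = processor(text=candidates, images=frame, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image  # shape: (1, num_candidates)
best = candidates[logits.argmax(dim=-1).item()]
print(best)
```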
multi gpu setup when?
2
#5 opened about 1 month ago by Ayaan-Sharif
reacted to merve's post with ❤️ · 2 months ago
Post
Another great week in open ML!
Here's a small recap 🫰🏻
Model releases
⏯️ Video Language Models
AI at Meta released Vision-CAIR/LongVU_Qwen2_7B, a new state-of-the-art long video LM model based on DINOv2, SigLIP, Qwen2 and Llama 3.2
💬 Small language models
Hugging Face released HuggingFaceTB/SmolLM2-1.7B, a family of new smol language models with Apache 2.0 license that come in sizes 135M, 360M and 1.7B, along with datasets.
Meta released facebook/MobileLLM-1B, a new family of on-device LLMs of sizes 125M, 350M and 600M
🖼️ Image Generation
Stability AI released stabilityai/stable-diffusion-3.5-medium, a 2B model with commercially permissive license
🖼️💬Any-to-Any
gpt-omni/mini-omni2, the closest reproduction of GPT-4o, is released: a new LLM that can take image, text, and audio input and output speech!
Dataset releases
🖼️ Spawning/PD12M, a new captioning dataset of 12.4 million examples generated using Florence-2