Umitcan Sahin

ucsahin

AI & ML interests

Visual Language Models, Large Language Models, Vision Transformers

Recent Activity

liked a dataset about 22 hours ago
muhammetfatihaktug/bilim_teknik_mini_colpali
liked a dataset about 22 hours ago
selimc/tr-textbook-ColPali
liked a model about 22 hours ago
selimc/turkish-colpali

Organizations

None yet

ucsahin's activity

reacted to singhsidhukuldeep's post with 🔥 14 days ago
Exciting News in AI: JinaAI Releases JINA-CLIP-v2!

The team at Jina AI has just released a groundbreaking multilingual multimodal embedding model that's pushing the boundaries of text-image understanding. Here's why this is a big deal:

🚀 Technical Highlights:
- Dual encoder architecture combining a 561M parameter Jina XLM-RoBERTa text encoder and a 304M parameter EVA02-L14 vision encoder
- Supports 89 languages with 8,192 token context length
- Processes images up to 512×512 pixels with 14×14 patch size
- Implements FlashAttention2 for text and xFormers for vision processing
- Uses Matryoshka Representation Learning for efficient vector storage (see the sketch below)
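
Matryoshka Representation Learning trains embeddings so that a prefix of the full vector is itself a usable embedding: you can keep only the first few hundred dimensions and renormalize, trading a little quality for much cheaper storage. A minimal sketch of that trick, assuming 1024-D output vectors and a 256-D target (the function name and dimensions are illustrative, not the model's API):

```python
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` components of MRL-trained vectors and
    L2-renormalize them so cosine similarity remains meaningful."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Stand-in for model output: four 1024-D vectors -> 256-D (75% smaller)
full = np.random.randn(4, 1024).astype(np.float32)
small = truncate_matryoshka(full, dim=256)
print(small.shape)  # (4, 256)
```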

⚡️ Under The Hood:
- Multi-stage training process with progressive resolution scaling (224→384→512)
- Contrastive learning using InfoNCE loss in both directions (sketched after this list)
- Trained on a massive multilingual dataset including 400M English and 400M multilingual image-caption pairs
- Incorporates specialized datasets for document understanding, scientific graphs, and infographics
- Uses hard negative mining with 7 negatives per positive sample
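
"InfoNCE in both directions" is the standard CLIP-style symmetric contrastive objective over in-batch pairs. A minimal PyTorch sketch, assuming paired image/text embeddings and a fixed temperature (the function name and batch shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(img_emb: torch.Tensor,
                       txt_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style InfoNCE in both directions: row i of img_emb is the
    positive for row i of txt_emb; all other rows act as negatives."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (B, B) similarities
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy batch of 8 paired embeddings
loss = symmetric_info_nce(torch.randn(8, 512), torch.randn(8, 512))
```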

📊 Performance:
- Outperforms previous models on visual document retrieval (52.65% nDCG@5)
- Achieves 89.73% image-to-text and 79.09% text-to-image retrieval on CLIP benchmark
- Strong multilingual performance across 30 languages
- Maintains performance even with 75% dimension reduction (256D vs 1024D)

🎯 Key Innovation:
The model solves the long-standing challenge of unifying text-only and multi-modal retrieval systems while adding robust multilingual support. Perfect for building cross-lingual visual search systems!
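
As a usage sketch: the Hub repository ships custom modeling code with `encode_text` / `encode_image` helpers and a `truncate_dim` argument; the exact method names and arguments here follow my reading of the model card and should be verified against it before use:

```python
from transformers import AutoModel

# Loads the Hub's custom modeling code; the API below is an assumption
# based on the model card, not a guaranteed transformers interface.
model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

queries = ["a red bicycle", "ein rotes Fahrrad"]    # cross-lingual queries
txt = model.encode_text(queries, truncate_dim=256)  # Matryoshka-truncated vectors
img = model.encode_image(["bike_photo.jpg"])        # hypothetical local image path

print(txt @ img.T)  # similarity of each query to the image
```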

Kudos to the research team at Jina AI for this impressive advancement in multimodal AI!
New activity in ucsahin/TR-VLM-DPO-Dataset about 1 month ago
reacted to merve's post with 🔥👀👍 about 1 month ago
The authors of ColPali trained a retrieval model based on SmolVLM 🤠 vidore/colsmolvlm-alpha
TL;DR:

- ColSmolVLM performs better than ColPali and DSE-Qwen2 on all English tasks

- ColSmolVLM is more memory efficient than ColQwen2 💗
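
For context, ColPali-family retrievers (including ColSmolVLM) embed a query into token vectors and a document page into patch vectors, then score the pair with MaxSim late interaction. A minimal sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_tokens: torch.Tensor, page_patches: torch.Tensor) -> torch.Tensor:
    """ColBERT/ColPali-style late interaction: each query token is matched
    to its single best page patch, and the best-match scores are summed.
    query_tokens: (Q, D); page_patches: (P, D); both L2-normalized."""
    sim = query_tokens @ page_patches.t()  # (Q, P) token-patch similarities
    return sim.max(dim=1).values.sum()     # best patch per token, summed

q = F.normalize(torch.randn(16, 128), dim=-1)    # query token embeddings
p = F.normalize(torch.randn(1024, 128), dim=-1)  # page patch embeddings
score = maxsim_score(q, p)                       # higher = better match
```
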
reacted to ezgikorkmaz's post with 🚀 about 2 months ago
reacted to merve's post with 🤗👀 about 2 months ago
OmniVision-968M: a new local VLM for edge devices, fast & small but performant
💨 a new vision language model with 9x fewer image tokens, super efficient
📖 aligned with DPO for reducing hallucinations
⚡️ Apache 2.0 license 🔥

Demo hf.co/spaces/NexaAIDev/omnivlm-dpo-demo
Model https://huggingface.co/NexaAIDev/omnivision-968M