jebadiah greenwood

Jebadiah

AI & ML interests

None yet

Recent Activity

updated a model 4 days ago
Jebadiah/Luna-dream-02

Organizations

Void

Jebadiah's activity

New activity in featherless-ai/try-this-model 5 months ago

jeiku/Aura-NeMo-12B
#2 opened 5 months ago by Jebadiah
reacted to merve's post with πŸš€ 6 months ago
Forget any document retrievers, use ColPali πŸ’₯πŸ’₯

Document retrieval is usually done through OCR + layout detection, but a lot of information is lost along the way. Stop doing that! πŸ€“

ColPali uses a vision language model instead, which is better at document understanding πŸ“‘
ColPali: vidore/colpali (MIT license!)
Blog post: https://huggingface.co/blog/manu/colpali
The authors also released a new benchmark for document retrieval:
ViDoRe Benchmark: vidore/vidore-benchmark-667173f98e70a1c0fa4db00d
ViDoRe Leaderboard: vidore/vidore-leaderboard

ColPali marries the idea of modern vision language models with retrieval 🀝

The authors apply contrastive fine-tuning to SigLIP on documents, and pool the outputs (they call it BiSigLip). Then they feed the patch embedding outputs to PaliGemma and create BiPali πŸ–‡οΈ
BiPali natively feeds image patch embeddings into an LLM, which enables ColBERT-like late interaction computations between text tokens and image patches (hence the name ColPali!) 🀩
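
A minimal sketch of that ColBERT-style late interaction ("MaxSim") scoring in plain PyTorch: each query token embedding is matched against every image patch embedding, and the per-token maxima are summed into one relevance score. The embedding dimension, patch count and random tensors are illustrative assumptions, not ColPali's actual interface.

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_emb: torch.Tensor, patch_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: for each query token, take its best-matching
    image patch similarity, then sum those maxima into a single score.

    query_emb: (num_query_tokens, dim)  L2-normalized query token embeddings
    patch_emb: (num_patches, dim)       L2-normalized image patch embeddings
    """
    sim = query_emb @ patch_emb.T          # (num_query_tokens, num_patches) cosine similarities
    return sim.max(dim=1).values.sum()     # best patch per query token, summed over the query

# Toy example with random embeddings standing in for ColPali outputs
torch.manual_seed(0)
dim = 128  # assumed embedding size for illustration
query = F.normalize(torch.randn(12, dim), dim=-1)                      # 12 query tokens
pages = [F.normalize(torch.randn(1024, dim), dim=-1) for _ in range(3)]  # 3 document pages

scores = torch.stack([late_interaction_score(query, p) for p in pages])
best_page = scores.argmax().item()  # index of the most relevant page
print(scores, best_page)
```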

The authors created the ViDoRe benchmark by collecting PDF documents and generating queries with Claude-3 Sonnet.
ColPali seems to be the most performant model on ViDoRe. Not only that, it is also way faster than traditional PDF parsers!
reacted to merve's post with πŸ˜ŽπŸ‘€πŸ‘πŸ§ πŸ€― 7 months ago
EPFL and Apple (at @EPFL-VILAB) just released 4M-21: a single any-to-any model that can do anything from text-to-image generation to generating depth masks! πŸ™€
4M is a multimodal training framework introduced by Apple and EPFL.
The resulting model takes image and text as input and outputs image and text 🀩

Models: EPFL-VILAB/4m-models-660193abe3faf4b4d98a2742
Demo: EPFL-VILAB/4M
Paper: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (2406.09406)

This model consists of a transformer encoder and decoder, where the key to multimodality lies in the input and output data: everything is represented as tokens, and the output tokens are decoded to generate bounding boxes, image pixels, captions and more!

This model also learnt to generate Canny edge maps, SAM edges and other conditioning signals for steerable text-to-image generation πŸ–ΌοΈ
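
To make the any-to-any idea concrete, here is a minimal, hypothetical sketch of an encoder-decoder transformer over discrete tokens: once every modality is tokenized, the same model can map any input sequence to any output sequence. It uses a single shared placeholder vocabulary and PyTorch's built-in nn.Transformer, which is a simplification for illustration, not 4M's actual architecture or code.

```python
import torch
import torch.nn as nn

class TinyAnyToAny(nn.Module):
    """Minimal encoder-decoder over discrete tokens: any input modality in,
    any output modality out, as long as both are expressed as token IDs.
    Vocabulary size, depth and width are arbitrary placeholders, not 4M's."""

    def __init__(self, vocab_size=4096, dim=256, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.transformer = nn.Transformer(
            d_model=dim, nhead=heads,
            num_encoder_layers=layers, num_decoder_layers=layers,
            batch_first=True,
        )
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # src_tokens: tokenized input modality (e.g. image patches or a caption)
        # tgt_tokens: tokenized target modality (e.g. depth map or bounding boxes)
        src = self.embed(src_tokens)
        tgt = self.embed(tgt_tokens)
        out = self.transformer(src, tgt)
        return self.head(out)  # logits over the shared token vocabulary

# Toy forward pass: "image" tokens in, "caption" tokens out
model = TinyAnyToAny()
image_tokens = torch.randint(0, 4096, (1, 196))   # pretend image-tokenizer output
caption_tokens = torch.randint(0, 4096, (1, 16))  # pretend text-tokenizer output
logits = model(image_tokens, caption_tokens)
print(logits.shape)  # torch.Size([1, 16, 4096])
```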

The authors only added image-to-all capabilities to the demo, but you can try using this model for text-to-image generation as well ☺️