minlik's Collections
Multimodal
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Paper • 2306.17107 • Published • 11
On the Hidden Mystery of OCR in Large Multimodal Models
Paper • 2305.07895 • Published
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966 • Published • 8
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published • 50
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
Paper • 2311.11810 • Published • 1
OCR-free Document Understanding Transformer
Paper • 2111.15664 • Published • 2
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Paper • 2210.03347 • Published • 3
Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction
Paper • 2310.11016 • Published
Nougat: Neural Optical Understanding for Academic Documents
Paper • 2308.13418 • Published • 36
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
Paper • 2403.00522 • Published • 45
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper • 2403.07508 • Published • 75
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 41
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
Paper • 2404.05225 • Published • 1
LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding
Paper • 2403.14252 • Published
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models
Paper • 2405.15738 • Published • 44
CogVLM: Visual Expert for Pretrained Language Models
Paper • 2311.03079 • Published • 24
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Paper • 2405.20204 • Published • 35
OpenVLA: An Open-Source Vision-Language-Action Model
Paper • 2406.09246 • Published • 37
Unveiling Encoder-Free Vision-Language Models
Paper • 2406.11832 • Published • 51
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
Paper • 2407.04172 • Published • 23
LLaVA-OneVision: Easy Visual Task Transfer
Paper • 2408.03326 • Published • 60
Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published • 93
CogVLM2: Visual Language Models for Image and Video Understanding
Paper • 2408.16500 • Published • 57