danielz01's Collections: VLFM
Kosmos-2.5: A Multimodal Literate Model • arXiv:2309.11419
Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities • arXiv:2311.05698
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks • arXiv:2311.06242
PolyMaX: General Dense Prediction with Mask Transformer • arXiv:2311.05770
Learning Vision from Models Rivals Learning Vision from Data • arXiv:2312.17742
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities • arXiv:2401.12168
A Survey of Resource-efficient LLM and Multimodal Foundation Models • arXiv:2401.08092
From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities • arXiv:2401.15071
Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization • arXiv:2401.15914
MouSi: Poly-Visual-Expert Vision-Language Models • arXiv:2401.17221
StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis • arXiv:2401.17093
DataComp: In Search of the Next Generation of Multimodal Datasets • arXiv:2304.14108
Question Aware Vision Transformer for Multimodal Reasoning • arXiv:2402.05472
arXiv:2309.16671
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling • arXiv:2402.12226
CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models • arXiv:2402.15021
DeepSeek-VL: Towards Real-World Vision-Language Understanding • arXiv:2403.05525
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding • arXiv:2403.01487
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models • arXiv:2404.07973
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models • arXiv:2404.13013
An Introduction to Vision-Language Modeling • arXiv:2405.17247
Dense Connector for MLLMs • arXiv:2405.13800