flow2023's Collections
MLLM
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
Paper
•
2312.16862
•
Published
•
30
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Paper
•
2312.17172
•
Published
•
27
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Paper
•
2401.01974
•
Published
•
5
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
Paper
•
2401.01885
•
Published
•
27
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Paper
•
2401.01335
•
Published
•
64
Improving Text Embeddings with Large Language Models
Paper
•
2401.00368
•
Published
•
79
Distilling Vision-Language Models on Millions of Videos
Paper
•
2401.06129
•
Published
•
15
Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk
Paper
•
2401.05033
•
Published
•
16
LEGO: Language Enhanced Multi-modal Grounding Model
Paper
•
2401.06071
•
Published
•
10
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Paper
•
2401.04575
•
Published
•
14
Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers
Paper
•
2401.04695
•
Published
•
11
Paper
•
2401.04088
•
Published
•
158
Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively
Paper
•
2401.02955
•
Published
•
21
Understanding LLMs: A Comprehensive Overview from Training to Inference
Paper
•
2401.02038
•
Published
•
62
Can Large Language Models Understand Context?
Paper
•
2402.00858
•
Published
•
22
StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis
Paper
•
2401.17093
•
Published
•
19
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Paper
•
2401.16420
•
Published
•
55
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper
•
2401.15947
•
Published
•
49
Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization
Paper
•
2401.15914
•
Published
•
7
MM-LLMs: Recent Advances in MultiModal Large Language Models
Paper
•
2401.13601
•
Published
•
45
Small Language Model Meets with Reinforced Vision Vocabulary
Paper
•
2401.12503
•
Published
•
32
Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment
Paper
•
2401.12474
•
Published
•
35
Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text
Paper
•
2401.12070
•
Published
•
43
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Paper
•
2401.12168
•
Published
•
26
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper
•
2403.09611
•
Published
•
125
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper
•
2403.05525
•
Published
•
40
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Paper
•
2403.04132
•
Published
•
38
FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models
Paper
•
2402.10986
•
Published
•
77
Linear Transformers with Learnable Kernel Functions are Better In-Context Models
Paper
•
2402.10644
•
Published
•
79
TravelPlanner: A Benchmark for Real-World Planning with Language Agents
Paper
•
2402.01622
•
Published
•
34
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
Paper
•
2403.15042
•
Published
•
25
When Do We Not Need Larger Vision Models?
Paper
•
2403.13043
•
Published
•
25
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Paper
•
2404.07972
•
Published
•
46
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Paper
•
2404.07973
•
Published
•
30
BRAVE: Broadening the visual encoding of vision-language models
Paper
•
2404.07204
•
Published
•
18
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
Paper
•
2404.14396
•
Published
•
18
PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation
Paper
•
2404.13026
•
Published
•
23
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
Paper
•
2404.12387
•
Published
•
38
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper
•
2404.12390
•
Published
•
24
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
Paper
•
2404.19752
•
Published
•
22
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Paper
•
2404.16790
•
Published
•
7
Many-Shot In-Context Learning in Multimodal Foundation Models
Paper
•
2405.09798
•
Published
•
26
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Paper
•
2406.04325
•
Published
•
72
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Paper
•
2406.09403
•
Published
•
19
Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
Paper
•
2406.06469
•
Published
•
24
Mixture-of-Agents Enhances Large Language Model Capabilities
Paper
•
2406.04692
•
Published
•
55
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Paper
•
2406.11768
•
Published
•
20
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Paper
•
2406.19389
•
Published
•
52
SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation
Paper
•
2406.19215
•
Published
•
29
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Paper
•
2406.15334
•
Published
•
8
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Paper
•
2407.03320
•
Published
•
93
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
Paper
•
2407.04051
•
Published
•
35
HEMM: Holistic Evaluation of Multimodal Foundation Models
Paper
•
2407.03418
•
Published
•
8
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
Paper
•
2407.01906
•
Published
•
34
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper
•
2408.05211
•
Published
•
47
Task-oriented Sequential Grounding in 3D Scenes
Paper
•
2408.04034
•
Published
•
8
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Paper
•
2408.12528
•
Published
•
50
Law of Vision Representation in MLLMs
Paper
•
2408.16357
•
Published
•
92
CogVLM2: Visual Language Models for Image and Video Understanding
Paper
•
2408.16500
•
Published
•
56
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Paper
•
2408.15998
•
Published
•
84
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Paper
•
2408.15881
•
Published
•
21
Building and better understanding vision-language models: insights and future directions
Paper
•
2408.12637
•
Published
•
124
OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs
Paper
•
2409.05152
•
Published
•
30
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper
•
2409.02889
•
Published
•
55
OLMoE: Open Mixture-of-Experts Language Models
Paper
•
2409.02060
•
Published
•
77
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper
•
2408.16725
•
Published
•
52
MIO: A Foundation Model on Multimodal Tokens
Paper
•
2409.17692
•
Published
•
53
Aria: An Open Multimodal Native Mixture-of-Experts Model
Paper
•
2410.05993
•
Published
•
107
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
Paper
•
2410.19168
•
Published
•
19