OnePiece123's Collections
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
Paper • 2406.17294 • Published • 11
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Paper • 2406.19389 • Published • 53
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Paper • 2406.20076 • Published • 9
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
Paper • 2407.02869 • Published • 18
Unveiling Encoder-Free Vision-Language Models
Paper • 2406.11832 • Published • 51
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
Paper • 2407.04051 • Published • 36
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
Paper • 2407.06135 • Published • 21
Vision language models are blind
Paper • 2407.06581 • Published • 83
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Paper • 2407.07895 • Published • 40
SEED-Story: Multimodal Long Story Generation with Large Language Model
Paper • 2407.08683 • Published • 22