btjhjeon's Collections: Multimodal Dataset
- SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers (arXiv:2407.09413)
- MAVIS: Mathematical Visual Instruction Tuning (arXiv:2407.08739)
- Kvasir-VQA: A Text-Image Pair GI Tract Dataset (arXiv:2409.01437)
- MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct (arXiv:2409.05840)
- InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning (arXiv:2409.12568)
- LVD-2M: A Long-take Video Dataset with Temporally Dense Captions (arXiv:2410.10816)
- Personalized Visual Instruction Tuning (arXiv:2410.07113)
- Harnessing Webpage UIs for Text-Rich Visual Understanding (arXiv:2410.13824)
- EMMA: End-to-End Multimodal Model for Autonomous Driving (arXiv:2410.23262)
- BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions (arXiv:2411.07461)
- EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation (arXiv:2411.08380)
- LLaVA-o1: Let Vision Language Models Reason Step-by-Step (arXiv:2411.10440)
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection (arXiv:2411.14794)
- VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format (arXiv:2411.17991)
- GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation (arXiv:2411.18499)
- VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation (arXiv:2412.00927)
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale (arXiv:2412.05237)
- CompCap: Improving Multimodal Large Language Models with Composite Captions (arXiv:2412.05243)
- Maya: An Instruction Finetuned Multilingual Multimodal Model (arXiv:2412.07112)
- MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation (arXiv:2412.07147)
- InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption (arXiv:2412.09283)
- Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding (arXiv:2412.17295)
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search (arXiv:2412.18319)
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining (arXiv:2501.00958)