Papers - an Imotech Collection

LinFusion: 1 GPU, 1 Minute, 16K Image

Paper • 2409.02097 • Published Sep 3, 2024 • 32

Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

Paper • 2409.11406 • Published Sep 17, 2024 • 25

Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User's Casual Sketches

Paper • 2408.04567 • Published Aug 8, 2024 • 24

CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets

Paper • 2406.13897 • Published May 30, 2024 • 12

Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion

Paper • 2407.13759 • Published Jul 18, 2024 • 17

POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation

Paper • 2407.14931 • Published Jul 20, 2024 • 20

OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person

Paper • 2407.16224 • Published Jul 23, 2024 • 27

DistilDIRE: A Small, Fast, Cheap and Lightweight Diffusion Synthesized Deepfake Detection

Paper • 2406.00856 • Published Jun 2, 2024 • 11

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

Paper • 2407.16741 • Published Jul 23, 2024 • 68

3D Question Answering for City Scene Understanding

Paper • 2407.17398 • Published Jul 24, 2024 • 22

Improving 2D Feature Representations by 3D-Aware Fine-Tuning

Paper • 2407.20229 • Published Jul 29, 2024 • 7

SAM 2: Segment Anything in Images and Videos

Paper • 2408.00714 • Published Aug 1, 2024 • 109

RelBench: A Benchmark for Deep Learning on Relational Databases

Paper • 2407.20060 • Published Jul 29, 2024 • 7

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

Paper • 2408.02545 • Published Aug 5, 2024 • 35

MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization

Paper • 2408.02555 • Published Aug 5, 2024 • 28

Synthesizing Text-to-SQL Data from Weak and Strong LLMs

Paper • 2408.03256 • Published Aug 6, 2024 • 10

LLaVA-OneVision: Easy Visual Task Transfer

Paper • 2408.03326 • Published Aug 6, 2024 • 59

Transformer Explainer: Interactive Learning of Text-Generative Models

Paper • 2408.04619 • Published Aug 8, 2024 • 155

FruitNeRF: A Unified Neural Radiance Field based Fruit Counting Framework

Paper • 2408.06190 • Published Aug 12, 2024 • 17

Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents

Paper • 2408.07060 • Published Aug 13, 2024 • 40

Imagen 3

Paper • 2408.07009 • Published Aug 13, 2024 • 61

TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Paper • 2408.09174 • Published Aug 17, 2024 • 51

LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation

Paper • 2408.13252 • Published Aug 23, 2024 • 24

MuCodec: Ultra Low-Bitrate Music Codec

Paper • 2409.13216 • Published Sep 20, 2024 • 22

Training Language Models to Self-Correct via Reinforcement Learning

Paper • 2409.12917 • Published Sep 19, 2024 • 135

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

Paper • 2409.12961 • Published Sep 19, 2024 • 24

FlexiTex: Enhancing Texture Generation with Visual Guidance

Paper • 2409.12431 • Published Sep 19, 2024 • 11

3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt

Paper • 2409.12892 • Published Sep 19, 2024 • 5

SpaceBlender: Creating Context-Rich Collaborative Spaces Through Generative 3D Scene Blending

Paper • 2409.13926 • Published Sep 20, 2024 • 5

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

Paper • 2409.15278 • Published Sep 23, 2024 • 22

Improvements to SDXL in NovelAI Diffusion V3

Paper • 2409.15997 • Published Sep 24, 2024 • 11

Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale

Paper • 2409.17115 • Published Sep 25, 2024 • 60

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

Paper • 2409.18125 • Published Sep 26, 2024 • 33

Game4Loc: A UAV Geo-Localization Benchmark from Game Data

Paper • 2409.16925 • Published Sep 25, 2024 • 6

DressRecon: Freeform 4D Human Reconstruction from Monocular Video

Paper • 2409.20563 • Published Sep 30, 2024 • 7

Posterior-Mean Rectified Flow: Towards Minimum MSE Photo-Realistic Image Restoration

Paper • 2410.00418 • Published Oct 1, 2024 • 9

SyntheOcc: Synthesize Geometric-Controlled Street View Images through 3D Semantic MPIs

Paper • 2410.00337 • Published Oct 1, 2024 • 10

Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation

Paper • 2410.00890 • Published Oct 1, 2024 • 18

Law of the Weakest Link: Cross Capabilities of Large Language Models

Paper • 2409.19951 • Published Sep 30, 2024 • 53

Illustrious: an Open Advanced Illustration Model

Paper • 2409.19946 • Published Sep 30, 2024 • 13

From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

Paper • 2410.01215 • Published Oct 2, 2024 • 30

3DGS-DET: Empower 3D Gaussian Splatting with Boundary Guidance and Box-Focused Sampling for 3D Object Detection

Paper • 2410.01647 • Published Oct 2, 2024 • 28

Addition is All You Need for Energy-efficient Language Models

Paper • 2410.00907 • Published Oct 1, 2024 • 144

MIGA: Mixture-of-Experts with Group Aggregation for Stock Market Prediction

Paper • 2410.02241 • Published Oct 3, 2024 • 7

CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction

Paper • 2410.01273 • Published Oct 2, 2024 • 9

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Paper • 2410.01912 • Published Oct 2, 2024 • 13

DOTS: Learning to Reason Dynamically in LLMs via Optimal Reasoning Trajectories Search

Paper • 2410.03864 • Published Oct 4, 2024 • 10

Seeker: Enhancing Exception Handling in Code with LLM-based Multi-Agent Approach

Paper • 2410.06949 • Published Oct 9, 2024 • 5

Data Selection via Optimal Control for Language Models

Paper • 2410.07064 • Published Oct 9, 2024 • 8

IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation

Paper • 2410.07171 • Published Oct 9, 2024 • 41

Does Spatial Cognition Emerge in Frontier Models?

Paper • 2410.06468 • Published Oct 9, 2024 • 2

MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

Paper • 2410.03450 • Published Oct 4, 2024 • 36

Agent S: An Open Agentic Framework that Uses Computers Like a Human

Paper • 2410.08164 • Published Oct 10, 2024 • 24

PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

Paper • 2410.05265 • Published Oct 7, 2024 • 29

LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models

Paper • 2410.09732 • Published Oct 13, 2024 • 54

Toward General Instruction-Following Alignment for Retrieval-Augmented Generation

Paper • 2410.09584 • Published Oct 12, 2024 • 47

Animate-X: Universal Character Image Animation with Enhanced Motion Representation

Paper • 2410.10306 • Published Oct 14, 2024 • 54

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

Paper • 2410.10563 • Published Oct 14, 2024 • 38

Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations

Paper • 2410.10792 • Published Oct 14, 2024 • 29

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Paper • 2410.10594 • Published Oct 14, 2024 • 24

Rethinking Data Selection at Scale: Random Selection is Almost All You Need

Paper • 2410.09335 • Published Oct 12, 2024 • 16

Baichuan-Omni Technical Report

Paper • 2410.08565 • Published Oct 11, 2024 • 84

Mentor-KD: Making Small Language Models Better Multi-step Reasoners

Paper • 2410.09037 • Published Oct 11, 2024 • 4

SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights

Paper • 2410.09008 • Published Oct 11, 2024 • 16

Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining

Paper • 2410.08102 • Published Oct 10, 2024 • 19

StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization

Paper • 2410.08815 • Published Oct 11, 2024 • 43

From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

Paper • 2410.06456 • Published Oct 9, 2024 • 35

Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis

Paper • 2410.08261 • Published Oct 10, 2024 • 49

FlatQuant: Flatness Matters for LLM Quantization

Paper • 2410.09426 • Published Oct 12, 2024 • 12

Harnessing Webpage UIs for Text-Rich Visual Understanding

Paper • 2410.13824 • Published Oct 17, 2024 • 29

MobA: A Two-Level Agent System for Efficient Mobile Task Automation

Paper • 2410.13757 • Published Oct 17, 2024 • 31

Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

Paper • 2410.13360 • Published Oct 17, 2024 • 8

AERO: Softmax-Only LLMs for Efficient Private Inference

Paper • 2410.13060 • Published Oct 16, 2024 • 4

MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

Paper • 2410.13754 • Published Oct 17, 2024 • 74

MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models

Paper • 2410.13370 • Published Oct 17, 2024 • 35

Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion

Paper • 2410.13674 • Published Oct 17, 2024 • 15

DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation

Paper • 2410.13726 • Published Oct 17, 2024 • 10

HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

Paper • 2410.10812 • Published Oct 14, 2024 • 15

AutoTrain: No-code training for state-of-the-art models

Paper • 2410.15735 • Published Oct 21, 2024 • 58

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Paper • 2410.13861 • Published Oct 17, 2024 • 52

SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation

Paper • 2410.14745 • Published Oct 17, 2024 • 45

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Paper • 2410.16153 • Published Oct 21, 2024 • 43

Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception

Paper • 2410.12788 • Published Oct 16, 2024 • 23

DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes

Paper • 2410.18084 • Published Oct 23, 2024 • 13

Lightweight Neural App Control

Paper • 2410.17883 • Published Oct 23, 2024 • 9

ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

Paper • 2410.13924 • Published Oct 17, 2024 • 6

LOGO -- Long cOntext aliGnment via efficient preference Optimization

Paper • 2410.18533 • Published Oct 24, 2024 • 42

Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch

Paper • 2410.18693 • Published Oct 24, 2024 • 40

Framer: Interactive Frame Interpolation

Paper • 2410.18978 • Published Oct 24, 2024 • 36

Unbounded: A Generative Infinite Game of Character Life Simulation

Paper • 2410.18975 • Published Oct 24, 2024 • 35

Distill Visual Chart Reasoning Ability from LLMs to MLLMs

Paper • 2410.18798 • Published Oct 24, 2024 • 19

WAFFLE: Multi-Modal Model for Automated Front-End Development

Paper • 2410.18362 • Published Oct 24, 2024 • 11

mistralai/Pixtral-12B-Base-2409

Updated Oct 30, 2024 • 69

Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning

Paper • 2410.19290 • Published Oct 25, 2024 • 10

Continuous Speech Synthesis using per-token Latent Diffusion

Paper • 2410.16048 • Published Oct 21, 2024 • 29

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

Paper • 2410.19168 • Published Oct 24, 2024 • 19

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

Paper • 2404.16710 • Published Apr 25, 2024 • 75

GPT-4o System Card

Paper • 2410.21276 • Published Oct 25, 2024 • 82

Neural Fields in Robotics: A Survey

Paper • 2410.20220 • Published Oct 26, 2024 • 4

Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Paper • 2410.21220 • Published Oct 28, 2024 • 10

AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant

Paper • 2410.18603 • Published Oct 24, 2024 • 32

DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation

Paper • 2410.18666 • Published Oct 24, 2024 • 19

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Paper • 2410.21465 • Published Oct 28, 2024 • 11

CLEAR: Character Unlearning in Textual and Visual Modalities

Paper • 2410.18057 • Published Oct 23, 2024 • 200

AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

Paper • 2410.20424 • Published Oct 27, 2024 • 39

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Paper • 2410.23168 • Published Oct 30, 2024 • 24

NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks

Paper • 2410.20650 • Published Oct 28, 2024 • 16

Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders

Paper • 2410.22366 • Published Oct 28, 2024 • 77

Language Models can Self-Lengthen to Generate Long Texts

Paper • 2410.23933 • Published Oct 31, 2024 • 17

SelfCodeAlign: Self-Alignment for Code Generation

Paper • 2410.24198 • Published Oct 31, 2024 • 23

Navigating the Unknown: A Chat-Based Collaborative Interface for Personalized Exploratory Tasks

Paper • 2410.24032 • Published Oct 31, 2024 • 9

M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

Paper • 2410.21157 • Published Oct 28, 2024 • 6

Face Anonymization Made Simple

Paper • 2411.00762 • Published Nov 1, 2024 • 7

HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models

Paper • 2410.22901 • Published Oct 30, 2024 • 8

CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes

Paper • 2411.00771 • Published Nov 1, 2024 • 9

AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents

Paper • 2410.24024 • Published Oct 31, 2024 • 48

Training-free Regional Prompting for Diffusion Transformers

Paper • 2411.02395 • Published Nov 4, 2024 • 25

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Paper • 2411.02355 • Published Nov 4, 2024 • 46

MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D

Paper • 2411.02336 • Published Nov 4, 2024 • 23

GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details

Paper • 2411.03047 • Published Nov 5, 2024 • 8

HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

Paper • 2411.02959 • Published Nov 5, 2024 • 64

TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

Paper • 2411.04709 • Published Nov 5, 2024 • 25

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

Paper • 2411.04905 • Published Nov 7, 2024 • 111

RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval

Paper • 2411.04752 • Published Nov 7, 2024 • 16

SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Paper • 2411.05007 • Published Nov 7, 2024 • 16

M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding

Paper • 2411.04952 • Published Nov 7, 2024 • 28

BitNet a4.8: 4-bit Activations for 1-bit LLMs

Paper • 2411.04965 • Published Nov 7, 2024 • 63

M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

Paper • 2411.06176 • Published Nov 9, 2024 • 44

Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models

Paper • 2411.07126 • Published Nov 11, 2024 • 28

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

Paper • 2411.07199 • Published Nov 11, 2024 • 46

CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM

Paper • 2411.04954 • Published Nov 7, 2024 • 8

PramaLLC/BEN

Image Segmentation • Updated Nov 21, 2024 • 222 • 78

SAMPart3D: Segment Any Part in 3D Objects

Paper • 2411.07184 • Published Nov 11, 2024 • 26

LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models

Paper • 2411.09595 • Published Nov 14, 2024 • 71

MagicQuill: An Intelligent Interactive Image Editing System

Paper • 2411.09703 • Published Nov 14, 2024 • 60

Large Language Models Can Self-Improve in Long-context Reasoning

Paper • 2411.08147 • Published Nov 12, 2024 • 62

GenXD: Generating Any 3D and 4D Scenes

Paper • 2411.02319 • Published Nov 4, 2024 • 20

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Paper • 2411.10440 • Published Nov 15, 2024 • 111

SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

Paper • 2411.10510 • Published Nov 15, 2024 • 8

Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

Paper • 2411.10669 • Published Nov 16, 2024 • 10

SlimLM: An Efficient Small Language Model for On-Device Document Assistance

Paper • 2411.09944 • Published Nov 15, 2024 • 12

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

Paper • 2411.10640 • Published Nov 16, 2024 • 44

Continuous Speculative Decoding for Autoregressive Image Generation

Paper • 2411.11925 • Published Nov 18, 2024 • 15

RedPajama: an Open Dataset for Training Large Language Models

Paper • 2411.12372 • Published Nov 19, 2024 • 47

ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements

Paper • 2411.12044 • Published Nov 18, 2024 • 13

Building Trust: Foundations of Security, Safety and Transparency in AI

Paper • 2411.12275 • Published Nov 19, 2024 • 10

SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning

Paper • 2411.10161 • Published Nov 15, 2024 • 8

SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration

Paper • 2411.10958 • Published Nov 17, 2024 • 51

SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

Paper • 2411.11922 • Published Nov 18, 2024 • 18

Ultra-Sparse Memory Network

Paper • 2411.12364 • Published Nov 19, 2024 • 19

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

Paper • 2411.14199 • Published Nov 21, 2024 • 29

Natural Language Reinforcement Learning

Paper • 2411.14251 • Published Nov 21, 2024 • 27

Patience Is The Key to Large Language Model Reasoning

Paper • 2411.13082 • Published Nov 20, 2024 • 7

DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Paper • 2411.14347 • Published Nov 21, 2024 • 13

MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Paper • 2411.13807 • Published Nov 21, 2024 • 11

Hymba: A Hybrid-head Architecture for Small Language Models

Paper • 2411.13676 • Published Nov 20, 2024 • 39

Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions

Paper • 2411.14405 • Published Nov 21, 2024 • 58

TÜLU 3: Pushing Frontiers in Open Language Model Post-Training

Paper • 2411.15124 • Published Nov 22, 2024 • 56

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

Paper • 2411.14982 • Published Nov 22, 2024 • 16

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Paper • 2411.14794 • Published Nov 22, 2024 • 12

MyTimeMachine: Personalized Facial Age Transformation

Paper • 2411.14521 • Published Nov 21, 2024 • 20

Adapting Vision Foundation Models for Robust Cloud Segmentation in Remote Sensing Images

Paper • 2411.13127 • Published Nov 20, 2024 • 4

Material Anything: Generating Materials for Any 3D Object via Diffusion

Paper • 2411.15138 • Published Nov 22, 2024 • 42

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Paper • 2411.17465 • Published Nov 26, 2024 • 76

Learning 3D Representations from Procedural 3D Programs

Paper • 2411.17467 • Published Nov 25, 2024 • 8

TEXGen: a Generative Diffusion Model for Mesh Textures

Paper • 2411.14740 • Published Nov 22, 2024 • 15

ROICtrl: Boosting Instance Control for Visual Generation

Paper • 2411.17949 • Published Nov 27, 2024 • 82

DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching

Paper • 2411.17786 • Published Nov 26, 2024 • 12

Adaptive Blind All-in-One Image Restoration

Paper • 2411.18412 • Published Nov 27, 2024 • 4

Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS

Paper • 2411.18478 • Published Nov 27, 2024 • 32

GRAPE: Generalizing Robot Policy via Preference Alignment

Paper • 2411.19309 • Published Nov 28, 2024 • 42

FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

Paper • 2411.18552 • Published Nov 27, 2024 • 17

Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

Paper • 2411.19146 • Published Nov 28, 2024 • 13

MATATA: a weak-supervised MAthematical Tool-Assisted reasoning for Tabular Applications

Paper • 2411.18915 • Published Nov 28, 2024 • 8

Reverse Thinking Makes LLMs Stronger Reasoners

Paper • 2411.19865 • Published Nov 29, 2024 • 19

LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification

Paper • 2411.19638 • Published Nov 29, 2024 • 6

Scaling Transformers for Low-Bitrate High-Quality Speech Coding

Paper • 2411.19842 • Published Nov 29, 2024 • 10

TinyFusion: Diffusion Transformers Learned Shallow

Paper • 2412.01199 • Published about 1 month ago • 14

o1-Coder: an o1 Replication for Coding

Paper • 2412.00154 • Published Nov 29, 2024 • 41

SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

Paper • 2412.00174 • Published Nov 29, 2024 • 22

The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning

Paper • 2412.00568 • Published Nov 30, 2024 • 14

VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

Paper • 2412.01822 • Published 30 days ago • 14

Art-Free Generative Models: Art Creation Without Graphic Art Knowledge

Paper • 2412.00176 • Published Nov 29, 2024 • 8

HUGSIM: A Real-Time, Photo-Realistic and Closed-Loop Simulator for Autonomous Driving

Paper • 2412.01718 • Published 30 days ago • 2

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

Paper • 2412.01292 • Published about 1 month ago • 12

SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance

Paper • 2412.02687 • Published 29 days ago • 108

PaliGemma 2: A Family of Versatile VLMs for Transfer

Paper • 2412.03555 • Published 28 days ago • 119

Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion

Paper • 2412.03515 • Published 28 days ago • 25

NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training

Paper • 2412.02030 • Published 29 days ago • 18

MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

Paper • 2412.03558 • Published 28 days ago • 15

CleanDIFT: Diffusion Features without Noise

Paper • 2412.03439 • Published 28 days ago • 12

Mimir: Improving Video Diffusion Models for Precise Text Understanding

Paper • 2412.03085 • Published 28 days ago • 12

Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

Paper • 2412.04455 • Published 27 days ago • 36

Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Paper • 2412.06781 • Published 23 days ago • 18

StyleMaster: Stylize Your Video with Artistic Generation and Translation

Paper • 2412.07744 • Published 22 days ago • 19

Are Your LLMs Capable of Stable Reasoning?

Paper • 2412.13147 • Published 15 days ago • 90

GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training

Paper • 2412.11863 • Published 16 days ago • 2

TidyBot++: An Open-Source Holonomic Mobile Manipulator for Robot Learning

Paper • 2412.10447 • Published 21 days ago • 5

The Open Source Advantage in Large Language Models (LLMs)

Paper • 2412.12004 • Published 16 days ago • 9

Smaller Language Models Are Better Instruction Evolvers

Paper • 2412.11231 • Published 17 days ago • 25

RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation

Paper • 2412.11919 • Published 16 days ago • 33

Byte Latent Transformer: Patches Scale Better Than Tokens

Paper • 2412.09871 • Published 19 days ago • 79

Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models

Paper • 2412.12606 • Published 15 days ago • 41

Compressed Chain of Thought: Efficient Reasoning Through Dense Representations

Paper • 2412.13171 • Published 15 days ago • 31

VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation

Paper • 2412.10704 • Published 18 days ago • 15

Seeker: Towards Exception Safety Code Generation with Intermediate Language Agents Framework

Paper • 2412.11713 • Published 16 days ago • 5

RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment

Paper • 2412.13746 • Published 14 days ago • 9

Phi-4 Technical Report

Paper • 2412.08905 • Published 20 days ago • 93

Evaluating and Aligning CodeLLMs on Human Preference

Paper • 2412.05210 • Published 26 days ago • 47

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Paper • 2412.09501 • Published 20 days ago • 43

Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

Paper • 2412.08737 • Published 21 days ago • 51

FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

Paper • 2412.09626 • Published 20 days ago • 19

Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse

Paper • 2409.11242 • Published Sep 17, 2024 • 5

GenEx: Generating an Explorable World

Paper • 2412.09624 • Published 20 days ago • 86

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Paper • 2412.13663 • Published 14 days ago • 113

AniDoc: Animation Creation Made Easier

Paper • 2412.14173 • Published 14 days ago • 49

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Paper • 2412.14161 • Published 14 days ago • 45

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Paper • 2412.14171 • Published 14 days ago • 23

Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN

Paper • 2412.13795 • Published 14 days ago • 18

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

Paper • 2412.15204 • Published 13 days ago • 31

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Paper • 2412.14475 • Published 13 days ago • 52

Progressive Multimodal Reasoning via Active Retrieval

Paper • 2412.14835 • Published 13 days ago • 68

Qwen2.5 Technical Report

Paper • 2412.15115 • Published 13 days ago • 333

CAD-Recode: Reverse Engineering CAD Code from Point Clouds

Paper • 2412.14042 • Published 14 days ago • 5

Predicting the Original Appearance of Damaged Historical Documents

Paper • 2412.11634 • Published 16 days ago • 4

AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities

Paper • 2412.14123 • Published 14 days ago • 11

Parallelized Autoregressive Visual Generation

Paper • 2412.15119 • Published 13 days ago • 47

SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation

Paper • 2412.13649 • Published 14 days ago • 18

MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

Paper • 2412.14590 • Published 13 days ago • 12

Multi-LLM Text Summarization

Paper • 2412.15487 • Published 12 days ago • 5

IDOL: Instant Photorealistic 3D Human Creation from a Single Image

Paper • 2412.14963 • Published 13 days ago • 5

DepthLab: From Partial to Complete

Paper • 2412.18153 • Published 8 days ago • 31

SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval

Paper • 2412.15443 • Published 12 days ago • 7

RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response

Paper • 2412.14922 • Published 13 days ago • 80

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Paper • 2412.18319 • Published 8 days ago • 29

YuLan-Mini: An Open Data-efficient Language Model

Paper • 2412.17743 • Published 9 days ago • 57

MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

Paper • 2412.18072 • Published 8 days ago • 14

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published 26 days ago • 121

Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models

Paper • 2412.18605 • Published 8 days ago • 17

Automating the Enterprise with Foundation Models

Paper • 2405.03710 • Published May 3, 2024 • 1

1.58-bit FLUX

Paper • 2412.18653 • Published 8 days ago • 47

Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Paper • 2412.18525 • Published 8 days ago • 46

Bringing Objects to Life: 4D generation from 3D objects

Paper • 2412.20422 • Published 3 days ago • 27

Edicho: Consistent Image Editing in the Wild

Paper • 2412.21079 • Published 1 day ago • 16

Slow Perception: Let's Perceive Geometric Figures Step-by-step

Paper • 2412.20631 • Published 2 days ago • 8

OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System

Paper • 2412.20005 • Published 4 days ago • 9

EXAONE 3.5: Series of Large Language Models for Real-world Use Cases

Paper • 2412.04862 • Published 26 days ago • 48