sometimesanotion PRO

AI & ML interests

Agentic LLM services, model merging, finetunes, distillation

sometimesanotion's activity

updated a model about 5 hours ago

sometimesanotion/Qwen2.5-14B-Vimarckoso-v3

Text Generation • Updated about 5 hours ago • 307 • 7

New activity in hotmailuser/QwenSlerp2-14B about 6 hours ago

This should be an interesting merge

#1 opened about 8 hours ago by sometimesanotion

updated a model about 6 hours ago

sometimesanotion/Lamarck-14B-v0.4-Qwenvergence

Text Generation • Updated about 6 hours ago • 120 • 1

liked a model about 9 hours ago

mradermacher/Lamarck-14B-v0.6-rc4-i1-GGUF

Updated about 13 hours ago • 2

updated a model about 19 hours ago

sometimesanotion/Lamarck-14B-v0.6

Text Generation • Updated about 19 hours ago • 4 • 1

reacted to singhsidhukuldeep's post with 👍 about 20 hours ago

Groundbreaking Research Alert: Rethinking RAG with Cache-Augmented Generation (CAG)

Researchers from National Chengchi University and Academia Sinica have introduced a paradigm-shifting approach that challenges the conventional wisdom of Retrieval-Augmented Generation (RAG).

Instead of the traditional retrieve-then-generate pipeline, their innovative Cache-Augmented Generation (CAG) framework preloads documents and precomputes key-value caches, eliminating the need for real-time retrieval during inference.

Technical Deep Dive:
- CAG preloads external knowledge and precomputes KV caches, storing them for future use
- The system processes documents only once, regardless of subsequent query volume
- During inference, it loads the precomputed cache alongside user queries, enabling rapid response generation
- The cache reset mechanism truncates the tokens appended during a session, restoring the preloaded cache so it can serve the next inference session without being recomputed
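
The preload, reuse, and reset steps above can be illustrated with a toy sketch. This is not the authors' implementation: `ToyKVCache`, `preload`, and `answer` are hypothetical names, whitespace splitting stands in for tokenization, and the fake key/value strings stand in for the tensors a real model would compute. It only shows the bookkeeping: documents are processed once, each query is appended to the stored cache, and truncation restores the preloaded prefix.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Hypothetical stand-in for a transformer KV cache: one entry per token."""
    entries: list = field(default_factory=list)

    def append(self, tokens):
        # A real model would run a forward pass here; this toy version only
        # records a fake (key, value) pair per token.
        for tok in tokens:
            self.entries.append((f"k:{tok}", f"v:{tok}"))

    def truncate(self, length):
        # Cache reset: drop everything appended after the preloaded prefix.
        del self.entries[length:]

    def __len__(self):
        return len(self.entries)


def preload(documents):
    """Process the documents once; return the cache and its prefix length."""
    cache = ToyKVCache()
    for doc in documents:
        cache.append(doc.split())  # whitespace split as an assumed tokenizer
    return cache, len(cache)


def answer(cache, prefix_len, query):
    """Append the query to the preloaded cache, then reset by truncation."""
    cache.append(query.split())
    tokens_seen = len(cache)      # context a real model would attend over
    cache.truncate(prefix_len)    # restore the preloaded state
    return tokens_seen


docs = ["CAG preloads documents", "and precomputes KV caches"]
cache, prefix_len = preload(docs)            # done once, up front
seen_first = answer(cache, prefix_len, "what does CAG preload")
seen_second = answer(cache, prefix_len, "why")
```

After both calls the cache is back to its preloaded length, so the documents are never reprocessed no matter how many queries follow.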

Performance Highlights:
- Achieved superior BERTScore metrics compared to both sparse and dense retrieval RAG systems
- Demonstrated up to 40x faster generation times compared to traditional approaches
- Particularly effective with both SQuAD and HotPotQA datasets, showing robust performance across different knowledge tasks

Why This Matters:
The approach significantly reduces system complexity, eliminates retrieval latency, and mitigates common RAG pipeline errors. As LLMs continue evolving with expanded context windows, this methodology becomes increasingly relevant for knowledge-intensive applications.