DeepSeek V4: 1M Context, 10x KV Cache Savings, and Ultra-Low Pricing
DeepSeek released V4 with major long-context efficiency gains: at a 1M-token context, V4 Pro uses 27% of the FLOPs and 10% of the KV cache of V3.2. Two models shipped, V4 Pro (1.6T parameters, 49B active) and V4 Flash (284B parameters, 13B active), both with native 1M context. V4 Pro Max adds higher reasoning effort and competes with Opus 4.6 and GPT 5.4 on knowledge and agentic benchmarks. The speed and memory savings come from a hybrid attention stack that interleaves two layer types: compressed sparse attention, which compresses every 4 KV tokens and then applies sparse top-K selection, and heavily compressed attention, which compresses every 128 tokens. The models are optimized for agent use cases and are available through DeepSeek’s API with low per-token pricing and built-in context caching. They are also accessible on chat.deepseek.com, and the weights are on Hugging Face.
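To give a rough sense of what the two compression ratios mean at 1M tokens, here is a minimal Python sketch. It only counts how many compressed KV entries remain per layer at ratios of 4 and 128, and shows a generic top-K pick over compressed entries. The function names (`compressed_entries`, `top_k_blocks`) are hypothetical and this is not DeepSeek's implementation. The counts do not reproduce the 10% KV cache figure, which also depends on details not covered here, such as how the two layer types are mixed.

```python
import math


def compressed_entries(context_tokens: int, ratio: int) -> int:
    """Entries left if every `ratio` consecutive KV tokens are merged into one (illustrative assumption)."""
    return math.ceil(context_tokens / ratio)


def top_k_blocks(scores: list[float], k: int) -> list[int]:
    """Indices of the k highest-scoring compressed entries, returned in positional order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


context = 1_000_000
csa_entries = compressed_entries(context, 4)    # ratio quoted for compressed sparse attention
hca_entries = compressed_entries(context, 128)  # ratio quoted for the heavier compression

print(f"ratio 4:   {csa_entries:,} entries per layer")
print(f"ratio 128: {hca_entries:,} entries per layer")

# Toy relevance scores for four compressed entries; keep the best two.
print(top_k_blocks([0.1, 0.9, 0.3, 0.7], k=2))
```

In this sketch the sparse layer still stores all of its compressed entries; top-K only limits how many of them each query attends to, which is one plausible reading of the description above.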
00:00 V4 Launch Highlights
00:18 Models and Benchmarks
00:52 Hybrid Attention Explained
01:43 Efficiency and Use Cases
02:03 Agents and Pricing
02:52 Why Million Context Matters
03:43 Access and Wrap Up
