DeepSeek V4-Flash and V4-Pro: 1M-token open-weight LLMs with Hybrid Attention

DeepSeek just dropped V41Flash and V41Pro with a 1,000,0001token context window and a so1called Hybrid Attention Architecture 1 and yes, that matters more than the marketing one1liner. Open models that actually hold a million tokens without collapsing into token1by1token oracle mode change how you architect retrieval, caching, and agent orchestration.

The engineering trade is obvious: a 1M token KV/attention state blows past the memory assumptions of most inference stacks. If your serving layer still treats the context window as a fixed 32k or 64k buffer, you'll either waste GPU memory or throttle throughput rebuilding contexts in CPU land. Hybrid Attention generally mixes local and global (or sparse) attention patterns to make very long ranges tractable, but it doesn't remove the practical costs: KV cache size, attention compute, IO to offload tiers, and the complexity of streaming updates for multi-agent sessions.

Two operational implications to be blunt: first, retrieval-augmented workflows can stop hacking around window limits with aggressive chunking that loses structure. You can now keep more of a document (or a chain of agent actions) in-context and let the model reason over full traces. Second, inference infra will become the bottleneck, not model capability. Teams that haven't invested in memory tiering (GPU HBM + CPU DRAM + local NVMe offload + remote shard sets) will see either sky-high costs or poor latencies.

OpenAI's counterpoint this week was a smaller GPT-4o-family variant tuned for function calling and structured outputs, which can outperform GPT-3.5 Turbo on some benchmarks while being cheaper per token. Anthropic pushed Claude further toward agentic workflows, with dynamic multi-agent orchestration and tunable reasoning depth. NVIDIA and others updated their large-model stacks and docs (NeMo/Megatron tooling in particular), and Cohere, Alibaba, and smaller communities kept expanding open-weight rosters.

Put simply: vendors are bifurcating the problem. Use smaller, cost-efficient models for routine API patterns, and route complex, long-context planning to new open models that can actually hold giant traces. That routing sounds obvious, but the infrastructure required to do it without exploding cost is not.

If you run inference, expect to re-architect three subsystems:

KV cache and offload: move beyond "rebuild per request." Expect to implement GPU-resident KV caches with NVMe-backed offload and efficient serialization for partial context grafting.
Retriever and chunking logic: fewer tiny chunks, more semantic segments plus overlap strategies that exploit a 1M window. Retrieval precision matters less than topology 1 you want the right segments in the same attention neighborhood.
Observability and chargebacks: measure token-level compute and memory residency. Traditional p95 latency metrics hide the real cost: a single 1M token request can dominate daily spend.

This week's releases also accelerated agent frameworks that need long context: multi-agent handoffs, plan-and-execute traces, cross-agent memory. If you're building agent orchestration, read how Bedrock-style knowledge bases and Claude's new multi-agent primitives are being used 1 they're the plumbing that lets long contexts be useful instead of noisy. (See Amazon Bedrock AgentCore: Managed Knowledge Base and Web Search for Platform Teams and the recent write-up on Alibaba Qwen 3.6-Plus.)

Opinion: this is overdue and disruptive. Open-weight long-context is the single biggest infrastructure pressure point of 2026 so far. If your stack still assumes 32k token ceilings or relies on brittle chunking heuristics, you1re going to pay for it in either cost or developer velocity. The right call is to treat long-context as a first-class resource: design tiered memory, billing by token-residency, and routing rules that match intent to capability.

Prediction: the next six months will be about inference economics 1 not model quality. Expect startups and cloud vendors to ship NVMe offload primitives, GPU-resident KV services, and new token-aware load balancers. The teams that win will be the ones that stop treating models as remote black boxes and start treating context as a scarce, measurable system resource.

DeepSeek V4-Flash and V4-Pro: 1M-token open-weight LLMs with Hybrid Attention

Sources

Alibaba Qwen 3.6-Plus: agentic LLM for tool orchestration and multimodal coding

OpenAI model-picker, pricing, and Assistants/Realtime API changes for GPT-4o (model defaults & routing)

Zhipu GLM-5.1: permissive open-weight release with competitive coding and reasoning