Google just changed the calculus for running agentic workloads on GCP: Vertex AI added Gemini 3.1 (including Flash and FlashLite variants) and Veo 3.1 to its public preview roster while simultaneously reducing Agent Engine runtime prices and, crucially, started billing Sessions, Memory Bank, and Code Execution as distinct line items. That mix makes token spend the dominant cost vector, and it forces platform teams to treat Vertex as an expensive, stateful dependency rather than a cheap API call.
The blunt numbers matter because you can now model actual spend. Third-party estimates show input/output token rates vary widely between Pro and Flash/Flash-Lite variants; Flash-Lite-style variants are markedly cheaper on a per-token basis. Those differentials mean model selection alone can be an order-of-magnitude lever on monthly bills for the same application behavior.
Layer on the Agent Engine billing model: runtime billed per vCPU-hour, memory billed per GB-hour, plus metered sessions, agent memory (managed memory for agents), and code execution events. Runtime prices were reduced, which nudges customers toward serverized agent patterns, but breaking out sessions and agent memory as separate meters will surprise teams that assumed tokens were the only cost. In short: tokens dominate, but agent scaffolding (state, long-running runtime, code execution) is no longer "free."
Why platform engineers should care
Teams building microservices on GKE or Cloud Run already have a familiar pattern: stateless frontends call managed AI endpoints for heavy lifting. These release notes and pricing changes make it explicit that you must isolate three cost domains:
- Tokens: per-request generative costs (input/output tokens) which scale with conversation length and multimodal payloads.
- Agent infrastructure: vCPU and memory for Agent Engine runtime and any code execution you permit.
- Index & search infra: Vector Search node-hours and per-prediction costs for online retrieval.
If you keep agent state and vector indexes in the same project as frontend services, a buggy feedback loop or exploratory ML experiment can drive all three meters simultaneously. The practical best practice is already obvious: isolate high-token, agentic workloads into their own project/account and enforce hard token quotas, session limits, and separate cost monitoring.
This is the right call from Google. Breaking out Agent Engine usage and managed agent memory billing is uncomfortable short-term, but it forces teams to think about operational boundaries and attack surfaces. The alternative opaque, bundled pricing encouraged sloppy designs where conversational state and vector indexes ballooned without visibility.
Architecture implications for GKE and Cloud Run
Use stateless GKE/Cloud Run services for request translation, auth, and lightweight enrichment. Push stateful agents, session stores, and vector indexes to Vertex AI managed services (Vector Search, agent memory) only when you need the operational simplicity but assume those managed services will be the primary drivers of dollar spend. Where latency and cost both matter, consider hybrid approaches: cheaper flash-lite models for real-time hints, and gated Pro model calls for high-value responses.
Also revisit resource sizing and autoscaling. With runtime billed per vCPU-hour, idle agent VMs are now a visible cost; aggressive scale-to-zero and eviction policies on Cloud Run or careful node-pool autoscaling on GKE matter again.
Two final thoughts
First: expect your monthly bill to flip from "compute + storage" to "tokens + stacked agent meters." Instrument token counts in telemetry and expose them in billing alerts. Second: this is another data point in the cloud-provider playbook make the managed AI primitives cheap to adopt but explicit to operate. Teams that treat Vertex AI like a free utility will get hit; teams that bake token-aware limits and project isolation into their platform will be the ones that sleep through the next model refresh.
If you want a refresher on how Vertex AI and Cloud Run/GKE have been converging, I covered the previous Gemini 2.5 wave and Cloud Run GPU updates here. Take that as a hint: this isn't a small billing tweak it's a nudge toward cleaner separation of concerns in AI platform design.