Google Cloud updates: Agent Platform remote MCP GA, GKE Gateway API v1.5, zero-scale clusters, and pipeline pricing

Agent Platform remote MCP GA, GKE Gateway API v1.5 support, zero-scale clusters, and Agent Platform per-pipeline pricing: operational and cost impacts.

May 24, 2026·6 min read·AI researched · AI written · AI reviewed

Overview

Google Cloud's recent release notes bring together three operational changes: Agent Platform remote MCP reaching GA, GKE support for Gateway API v1.5 with controller conformance, and infrastructure cost controls (zero-scale clusters plus per-pipeline pricing for Agent Platform). Together these affect deployment topology, control-plane contracts, and run-cost modeling for platform teams running AI pipelines and customer services on GKE.

Agent Platform remote MCP server — GA and practical implications

Google's remote MCP (Model Context Protocol) server moving to General Availability signals a production-grade contract for exposing model metadata and pipeline interactions to external tooling. Treat the GA API as stable for integration planning, but follow standard hardening and operational steps.

Operational guidance:

  • Contract stability: plan integrations (IDE plugins, CI/CD steps, audit agents) against the GA API and avoid depending on preview behavior.
  • Security and isolation: treat the MCP endpoint as a sensitive control-plane API. Place it behind private connectivity (VPC Service Controls, Private Service Connect) or require mTLS and strict RBAC.
  • Billing impact: per recent pricing information, Agent Platform lists a per-Pipeline Run fee (previously waived during Preview). Model your pipelines for this marginal cost: batching, consolidating thin pipelines, and pre-processing reduce billable run counts.

Illustrative cost example: 10,000 pipeline executions/month × $0.03 per run (an assumed figure; check the current price list) = $300/month. Use this to evaluate whether consolidating pipelines or adding local preprocessing would cut the run count enough to matter.
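
To make the consolidation trade-off concrete, here is a minimal Python sketch. The $0.03 fee is the illustrative figure from the example above, not a published price, and `batch_size` is a hypothetical knob for how many work items a single pipeline run processes.

```python
import math

# Assumption: illustrative fee from the example above, not a published price.
ILLUSTRATIVE_FEE_PER_RUN = 0.03


def monthly_run_cost(items_per_month: int, batch_size: int = 1,
                     fee_per_run: float = ILLUSTRATIVE_FEE_PER_RUN) -> float:
    """Return the monthly per-run fee when each run processes `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # A partially filled final batch still costs one full run.
    runs = math.ceil(items_per_month / batch_size)
    return round(runs * fee_per_run, 2)


if __name__ == "__main__":
    for size in (1, 5, 20):
        print(f"batch_size={size}: ${monthly_run_cost(10_000, size):.2f}/month")
```

Batching only helps if a single run can process several items without breaching its own time or memory limits, so treat the batch size as something to measure rather than assume.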

GKE Gateway API v1.5 support and controller conformance

GKE added support for Gateway API v1.5 in recent releases, and Google reports its GKE Gateway controller has passed core conformance tests for v1.5. If you standardize on the Gateway API for multi-controller portability or advanced L7 features, validate behavior in your environment.

What to validate first:

  • API compatibility: exercise the Gateway primitives you rely on (GatewayClass, Gateway, HTTPRoute, and TCPRoute where your controller and release channel support it). Pay attention to route precedence, header manipulation, and filter semantics introduced or stabilized in v1.5.
  • Field and behavior changes: reconcile manifests if you previously used alpha/beta fields; update to the stable v1 names and semantics.
  • Observability: verify the controller exposes status/conditions and metrics expected by your SRE tooling; map these to ingress SLOs.
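
The status check in the last item can be scripted. The sketch below assumes you have already fetched a Gateway as JSON (for example with `kubectl get gateway <name> -o json`) and loaded it into a dict. `unmet_conditions` is a hypothetical helper name. `Accepted` and `Programmed` are the Gateway condition types defined by Gateway API v1; whether your controller reports additional ones is something to verify in your cluster.

```python
from typing import Any


def unmet_conditions(resource: dict[str, Any],
                     required: tuple[str, ...] = ("Accepted", "Programmed")) -> list[str]:
    """Return the required condition types that are missing or not "True"."""
    conditions = resource.get("status", {}).get("conditions", [])
    status_by_type = {c.get("type"): c.get("status") for c in conditions}
    return [t for t in required if status_by_type.get(t) != "True"]


if __name__ == "__main__":
    # Hand-written sample shaped like a Gateway object; not real cluster output.
    sample = {
        "kind": "Gateway",
        "metadata": {"name": "app-gateway"},
        "status": {
            "conditions": [
                {"type": "Accepted", "status": "True"},
                {"type": "Programmed", "status": "False"},
            ]
        },
    }
    problems = unmet_conditions(sample)
    print("healthy" if not problems else f"not ready: {problems}")
```

A check like this fits naturally into a post-deploy gate: fail the rollout if the list is non-empty after a timeout.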

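Minimal example (gke-gateway-class is a placeholder; run kubectl get gatewayclass to list the classes your cluster exposes, for example gke-l7-global-external-managed):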

# gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: app-gateway
  namespace: gke-system
spec:
  gatewayClassName: gke-gateway-class
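  listeners:
    - name: http
      protocol: HTTP
      port: 80
      # By default only routes in the Gateway's own namespace may attach.
      # app-route lives in "default", so cross-namespace attachment is
      # allowed here; narrow this to a Selector in production.
      allowedRoutes:
        namespaces:
          from: All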
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-route
  namespace: default
spec:
  parentRefs:
    - name: app-gateway
      namespace: gke-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: backend-svc
          port: 8080

Notes: the Gateway API v1 is the stable surface; if your environment still requires v1beta1 for compatibility, test and migrate to v1 when ready. Confirm the exact gatewayClassName your GKE release exposes.

Zero-scale clusters in GKE — setup and workload placement

Zero-scale clusters reduce idle node costs by allowing user-facing node pools to scale down to zero while preserving the control plane and system node pool(s). Use this pattern to separate always-on cluster infrastructure from user workloads.

Recommended pattern:

  • Keep a minimal always-on system node pool for kube-system and platform agents.
  • Place user workloads on secondary node pools with autoscaler min=0 and an appropriate max for demand.
  • Use node taints/labels and pod tolerations/nodeSelectors so workloads land on intended pools.

Example: create a secondary node pool that can scale to zero and taint it to isolate workloads.

gcloud container node-pools create secondary-pool \
  --cluster=my-cluster \
  --region=us-central1 \
  --machine-type=e2-standard-4 \
  --enable-autoscaling --min-nodes=0 --max-nodes=10 \
  --node-taints=pool=secondary:NoSchedule \
  --num-nodes=1

Example Deployment that uses a nodeSelector and a matching toleration to schedule on that pool:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: secondary-pool
      tolerations:
        - key: "pool"
          operator: "Equal"
          value: "secondary"
          effect: "NoSchedule"
      containers:
        - name: api
          image: gcr.io/myproj/api:1.2.3
          ports:
            - containerPort: 8080
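
Before rolling this out, it can help to lint pod specs against the pool's taints. The sketch below is a simplified Python approximation of Kubernetes toleration matching; it ignores affinity rules, resource requests, and tolerationSeconds. `tolerates` and `targets_pool` are hypothetical helper names, not part of any Kubernetes client library.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single pod toleration matches a single node taint."""
    operator = toleration.get("operator", "Equal")  # Equal is the default operator
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")
    # An empty effect on the toleration matches every taint effect.
    if effect and effect != taint["effect"]:
        return False
    if operator == "Exists":
        # An empty key with Exists matches every taint key.
        return key == "" or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def targets_pool(pod_spec: dict, pool_name: str, pool_taints: list[dict]) -> bool:
    """Check that the pod selects the pool and tolerates its blocking taints."""
    selector = pod_spec.get("nodeSelector", {})
    if selector.get("cloud.google.com/gke-nodepool") != pool_name:
        return False
    tolerations = pod_spec.get("tolerations", [])
    # PreferNoSchedule is a soft preference, so only hard effects are checked.
    blocking = [t for t in pool_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in blocking)


if __name__ == "__main__":
    pod_spec = {
        "nodeSelector": {"cloud.google.com/gke-nodepool": "secondary-pool"},
        "tolerations": [
            {"key": "pool", "operator": "Equal", "value": "secondary", "effect": "NoSchedule"}
        ],
    }
    taints = [{"key": "pool", "value": "secondary", "effect": "NoSchedule"}]
    print(targets_pool(pod_spec, "secondary-pool", taints))
```

The nodeSelector and the toleration do different jobs: the selector pulls the pod toward the pool, while the toleration only permits it to land there. A pod with the toleration but no selector can still be scheduled elsewhere.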

Operational caveats:

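  • Cold-start latency: when a node pool scales from 0, expect node provisioning and image-pull delay before pods become ready. A pool only reaches zero when no pods are scheduled on it, so a Deployment with a fixed replica count keeps nodes alive. For latency-sensitive services, scale on leading metrics so capacity is requested early, or keep a small warm pool to meet SLOs.
  • Stateful workloads: avoid node-local state on zero-scale pools; use durable storage (network-attached PersistentVolumes, Cloud SQL, etc.) and ensure PVC binding and StorageClass behavior are appropriate.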
  • System add-ons: verify CNI, CSI drivers, and logging/monitoring agents do not assume permanent node presence.

Pricing and platform signals — per-pipeline fees and modeling

Agent Platform pricing now includes a per-Pipeline Run fee (previously waived during Preview). This introduces a transaction-aware cost dimension in addition to resource-hour modeling.

Actionable steps:

  • Add run-count columns to cost models: pipeline run counts, average execution time, and retry/success rates.
  • Reduce noisy retries and enforce idempotency to avoid duplicate billable runs.
  • Consider consolidation: fuse multiple thin pipelines into a single execution when appropriate to reduce per-run charges.
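
As a starting point for the cost model described above, here is a minimal Python estimator sketch. It assumes that every retry is billed as its own pipeline run and reuses the illustrative $0.03 fee from earlier; both are assumptions to check against your actual billing data. `PipelineProfile` and `estimate_monthly_fees` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineProfile:
    name: str
    runs_per_month: int  # logical runs requested per month
    retry_rate: float    # average extra attempts per run, e.g. 0.25


def billable_runs(profile: PipelineProfile) -> int:
    # Assumption: each retry is billed as a separate pipeline run.
    return round(profile.runs_per_month * (1 + profile.retry_rate))


def estimate_monthly_fees(profiles: list[PipelineProfile],
                          fee_per_run: float = 0.03) -> dict[str, float]:
    """Return the per-pipeline fee and a total, in dollars per month."""
    fees = {p.name: round(billable_runs(p) * fee_per_run, 2) for p in profiles}
    fees["total"] = round(sum(fees.values()), 2)
    return fees


if __name__ == "__main__":
    profiles = [
        PipelineProfile("ingest", runs_per_month=8_000, retry_rate=0.25),
        PipelineProfile("score", runs_per_month=2_000, retry_rate=0.0),
    ]
    for name, fee in estimate_monthly_fees(profiles).items():
        print(f"{name}: ${fee:.2f}")
```

Keeping retry rate as an explicit input makes it easy to see how much of the bill comes from failed attempts rather than useful work.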

Beyond per-run fees, expect more usage-based metrics to surface (inference per request, context-store I/O). Instrument and attribute AI-related costs back to teams and features.

Broader MCP and inference integrations

Release notes mention preview features such as a Memorystore-backed Valkey remote MCP option and an AI Inference Single Message Transform (SMT) for Pub/Sub. These integrate inference and model context into messaging and caching layers; they can reduce latency, but they also change failure domains.

Operational checks:

  • If using AI Inference SMT for Pub/Sub, load-test message size, latency, and throughput; add robust dead-letter handling.
  • Evaluate Memorystore for short-lived model metadata and routing, and verify eviction and consistency semantics for your routing logic.

Practical takeaways

  • Deployment topology: segregate always-on control/system pools from autoscaled user pools (min=0) and control scheduling with taints/labels. Anticipate cold starts and design SLOs accordingly.
  • Ingress contracts: upgrade and test on GKE releases that include Gateway API v1.5 support; validate controller conformance and status/metric exposure.
  • Cost attribution: add pipeline-run metrics to billing dashboards, enforce idempotency, and consider pipeline consolidation to reduce per-run charges.
  • Harden MCP: deploy MCP endpoints with private connectivity, mTLS, RBAC, and audit logging. Test runtime limits and backpressure behavior to avoid metadata loss or billing surprises.
  • Prototype inference transforms and caches in staging: measure tail latency, throughput, and failure modes; use circuit breakers and dead-letter queues.

If your platform combines AI orchestration and Kubernetes hosting, use this release to align topology (zero-scale worker pools), contract (Gateway API v1.5), and cost attribution (per-run billing). The technical changes are manageable; the cross-functional work on policies, SLOs, and billing attribution is the heavier lift.
