GKE Maintenance Controls: Per-Node-Pool Exclusions, 90‑Day No-Upgrade Windows, and Data-Cache SSDs

GKE adds per-node-pool maintenance exclusions and 90-day no-upgrades windows, plus an ephemeral local SSD dataCacheCount API. Operational guidance for SREs.

June 11, 2026 · 6 min read · AI researched · AI written · AI reviewed

Google Cloud's recent release-note consolidation delivered several incremental but operationally meaningful changes across GKE, Vertex AI, and networking. The unifying theme is finer-grained control: node-pool-level maintenance settings, an API knob for local-SSD data caches, and additional hosted models and networking options that affect hybrid and multi-cloud architectures. Platform teams should codify these controls in policy and automation rather than treat them as one-off workarounds.

What changed in GKE maintenance controls

GKE now supports per-node-pool maintenance exclusions and extends the "No upgrades" maintenance window up to 90 days. Previously, maintenance exclusions were cluster-scoped, forcing teams to either disable release-channel behavior entirely or accept coarse, cluster-wide windows.

How the new controls behave (operational summary):

  • Per-node-pool exclusions let you suppress automatic node upgrades for specific pools (for example, pools that host stateful workloads or specialized hardware) while allowing other pools to follow the release-channel cadence.
  • The longer "No upgrades" window (up to 90 days) gives teams more time to coordinate hardware refreshes, driver validation, or major-version testing before the pool returns to its normal upgrade cadence.

Operational impacts and risks:

  • Upgrade drift: Long exclusions increase the risk that node OS, kernel, and kubelet versions diverge from your tested fleet. Require explicit review and automated drift detection for excluded pools.
  • Scheduling correctness: Multi-tenant clusters should ensure that pods needing newer node features cannot land on excluded pools that may lag on kernel or kubelet capabilities. Use nodeSelectors, nodeAffinity, or admission controls to enforce placement rules.
  • CI/CD and promotion: Image and workload promotion pipelines that gate by node or OS features must consume node-pool metadata to make correct placement decisions.

In short: you gain finer control, but must compensate with tighter governance, placement rules, and automated checks.
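
One way to express the placement rule from the scheduling point above is a required node-affinity term that keeps a workload off an excluded pool. The sketch below is illustrative only: the Deployment name, the image, and the pool name pinned-stateful-pool are hypothetical, and it assumes nodes carry the cloud.google.com/gke-nodepool label that GKE sets on node-pool members.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: needs-current-nodes          # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: needs-current-nodes
  template:
    metadata:
      labels:
        app: needs-current-nodes
    spec:
      affinity:
        nodeAffinity:
          # Hard requirement: never schedule onto the excluded pool.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-nodepool
                operator: NotIn
                values:
                - pinned-stateful-pool   # hypothetical pool under a maintenance exclusion
      containers:
      - name: app
        image: example.com/team/app:1.0.0   # placeholder image
```

A per-workload rule like this is easy to omit, so an admission policy that requires or injects it is likely the more durable enforcement point in multi-tenant clusters.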

GKE data-cache API: ephemeral local SSDs (dataCacheCount)

The API surface now exposes a node-pool level configuration to control how many local SSD devices are exposed per node. In Config Connector / ContainerCluster terms this appears under nodePools[].config.nodeConfig.ephemeralStorageLocalSsdConfig.dataCacheCount. The intent is to support a data-cache pattern using ephemeral local SSDs for short-lived, throughput-sensitive workloads such as batch ML staging, local shuffles, or ephemeral feature caches during preprocessing.

Technical benefits:

  • Local SSDs reduce I/O variability and avoid the extra network hop of remote block storage for temporary, throughput-sensitive paths.
  • The API knob (dataCacheCount) lets operators provision a consistent number of local devices per node without custom instance templates or manual node provisioning.

Practical adoption considerations:

  • Treat local SSD node pools as specialized hardware: isolate them with labels and taints, scale and budget separately, and include them in IaC templates.
  • Scheduling: enforce nodeSelector/nodeAffinity and tolerations so only compatible workloads land on data-cache nodes. Ensure your controllers and schedulers avoid falling back to persistent disks when local SSDs are required.
  • Data lifecycle: local SSDs are ephemeral and not durable across node replacement. Persist checkpoints and model artifacts to durable storage (for example, GCS) at intervals that match how much recomputation you can tolerate after losing a node.
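
On the workload side, the isolation and scheduling points above might look like the following. This is a minimal sketch under stated assumptions: the pool name data-cache-pool follows the example manifest in this section, while the taint dedicated=data-cache:NoSchedule, the Pod name, and the image are hypothetical and must match however the pool is actually tainted.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: preprocess-batch              # hypothetical name
spec:
  # Pin the Pod to the data-cache pool via the GKE node-pool label.
  nodeSelector:
    cloud.google.com/gke-nodepool: data-cache-pool
  # Tolerate the taint that keeps general workloads off the pool.
  tolerations:
  - key: dedicated                    # hypothetical taint key
    operator: Equal
    value: data-cache                 # hypothetical taint value
    effect: NoSchedule
  restartPolicy: Never
  containers:
  - name: worker
    image: example.com/team/preprocess:1.0.0   # placeholder image
```

Because nodeSelector is a hard constraint, the Pod stays Pending when the pool has no capacity instead of landing on a node without local SSDs.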

Example manifest snippet (Config Connector ContainerCluster, partial) showing field placement in a node pool:

```yaml
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: my-gke-cluster
spec:
  nodePools:
  - name: data-cache-pool          # dedicated pool for cache-backed workloads
    config:
      nodeConfig:
        ephemeralStorageLocalSsdConfig:
          dataCacheCount: 2        # local SSD devices per node
    initialNodeCount: 3            # starting size of the pool
```

This example enables two local SSD devices per node in a dedicated pool. Validate local SSD throughput with automated benchmarks before opening the pool for production workloads.

Vertex AI and hosted model patterns

Vertex AI's Model Garden continues to expand hosted model choices (for example, Anthropic Claude Opus 4.7 and additional Gemini-based variants). More hosted models simplify some multi-model agent or inference architectures but also create operational requirements:

  • Model routing and throttling: your inference gateway must support multi-model routing, per-model quotas, and billing attribution.
  • Observability and audit: instrument per-model telemetry so requests, latency, and cost are attributable to a specific model and tenant.
  • Compliance: hosted models have varying data-processing boundaries. Verify DPA, encryption, and retention controls before routing regulated data through hosted endpoints.

If you’re migrating from self-hosted models, hosted options change operational trade-offs; capture per-model IAM, telemetry, and cost controls in your migration plan.

Networking, logging exports, and hybrid topologies

Recent networking notes include an EXTERNAL_PASSTHROUGH option for forwarding rules/backend services (useful when you need source-IP–preserving handoffs to external appliances) and Partner Cross-Cloud Interconnect options for predictable cross-cloud bandwidth between GCP and AWS (partner availability and pricing vary by region). Logging export improvements include finer filtering by namespace/ingestion labels and additional CMEK options for exported logs.

Architectural guidance:

  • Use EXTERNAL_PASSTHROUGH when you need the load balancer to forward traffic without SNAT to third-party appliances or bare-metal endpoints that require preserved source IPs.
  • Evaluate Partner Cross-Cloud Interconnect in lab tests for replication and backup workflows where throughput and latency SLAs matter; compare pricing and availability against VPN/site-to-site options.
  • Centralize export policies and use namespace/ingestion filters to limit exported data and apply CMEK only where regulatory requirements demand it.

Concrete checklist for platform teams

  1. Rework upgrade governance for node pools
  • Treat node-pool exclusions as an intentional lifecycle choice. Add CI/CD checkpoints, require explicit exclusion durations, and automate alerts for pools excluded beyond defined thresholds.
  2. Formalize node-pool specialization
  • Codify data-cache node pools in IaC, autoscaling rules, cost-center tags, and scheduling policies. Benchmark local SSD throughput automatically before production rollout.
  3. Enforce per-model controls for inference
  • Add multi-model routing, per-model quotas, and per-model telemetry for cost attribution and billing. Enforce model-level IAM and observability.
  4. Revisit hybrid networking
  • Add EXTERNAL_PASSTHROUGH and Partner Cross-Cloud Interconnect options to network diagrams and runbooks where source-IP preservation or predictable cross-cloud bandwidth is required.
  5. Tighten observability exports
  • Use the new filtering and CMEK controls to send only required namespaces to high-cost sinks and to meet regulatory encryption requirements.
  6. Update runbooks and playbooks
  • Add explicit steps for node-pool-specific failures, eviction scenarios on specialized pools, and model-serving fallbacks if data-cache nodes are degraded.

Prioritized projects

  • Automate an upgrade-exclusion review job that flags node pools excluded for longer than 30, 60, or 90 days and triggers human review at each threshold.
  • Create a data-cache node-pool template with automated throughput validation.
  • Extend inference proxies to tag telemetry with model id and cost center and to enforce per-model rate limits.
  • Pilot Partner Cross-Cloud Interconnect for one critical replication workflow and measure SLO compliance.

Summary

Individually these changes are small; cumulatively they change operational patterns. Treat the new controls (per-node-pool exclusions, the data-cache SSD setting, additional hosted models, and refined networking and export controls) as infrastructure primitives. Codify policies, automate guardrails, and treat specialized node pools and models as first-class components of your platform.
