GKE Maintenance Controls: Per-Node-Pool Exclusions, 90‑Day No-Upgrade Windows, and Data-Cache SSDs

Google Cloud's recent release-note consolidation delivered several incremental but operationally meaningful changes across GKE, Vertex AI, and networking. The unifying theme is finer-grained control: node-pool-level maintenance controls, an API setting for local-SSD data caches, and additional hosted models and networking options that affect hybrid and multi-cloud architectures. Platform teams should codify these in policy and automation rather than treat them as one-off workarounds.

What changed in GKE maintenance controls

GKE now supports per-node-pool maintenance exclusions and extends the "No upgrades" maintenance window up to 90 days. Previously, maintenance exclusions were cluster-scoped, forcing teams to either disable release-channel behavior entirely or accept coarse, cluster-wide windows.

How the new controls behave (operational summary):

Per-node-pool exclusions let you suppress automatic node upgrades for specific pools (for example, pools that host stateful workloads or specialized hardware) while allowing other pools to follow the release-channel cadence.
The longer "No upgrades" window (up to 90 days) gives teams more time to coordinate hardware refreshes, driver validation, or major-version testing before nodes must return to the normal upgrade cadence.
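
The 90-day ceiling can be checked in automation before an exclusion request is submitted. The following is a minimal sketch, not a GKE client: `ExclusionRequest` and `validate_exclusion` are hypothetical names, and the 90-day limit is taken from the description above rather than read from any API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Limit as described in this article; treat it as a policy constant.
MAX_NO_UPGRADE_WINDOW = timedelta(days=90)


@dataclass(frozen=True)
class ExclusionRequest:
    """Hypothetical record for a requested per-node-pool exclusion."""
    node_pool: str
    start: datetime
    end: datetime


def validate_exclusion(req: ExclusionRequest) -> list[str]:
    """Return human-readable problems; an empty list means the request passes."""
    problems = []
    if req.end <= req.start:
        problems.append("end must be after start")
    elif req.end - req.start > MAX_NO_UPGRADE_WINDOW:
        problems.append(
            f"window is {(req.end - req.start).days} days; "
            f"maximum is {MAX_NO_UPGRADE_WINDOW.days}"
        )
    return problems


if __name__ == "__main__":
    start = datetime(2025, 1, 1, tzinfo=timezone.utc)
    request = ExclusionRequest("stateful-pool", start, start + timedelta(days=120))
    print(validate_exclusion(request))
```

Running a check like this in a pull-request pipeline keeps over-long windows out of IaC before they reach the cluster.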

Operational impacts and risks:

Upgrade drift: Long exclusions increase the risk that node OS, kernel, and kubelet versions diverge from your tested fleet. Require explicit review and automated drift detection for excluded pools.
Scheduling correctness: Multi-tenant clusters should ensure that pods needing newer node features cannot land on excluded pools that may lag on kernel or kubelet capabilities. Use nodeSelectors, nodeAffinity, or admission controls to enforce placement rules.
CI/CD and promotion: Image and workload promotion pipelines that gate by node or OS features must consume node-pool metadata to make correct placement decisions.
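
The drift check mentioned above can be sketched as a comparison of each pool's version against the control plane. This is an illustration under assumptions: the version strings are made up, the pool inventory is a plain dict that you would populate from your own tooling, and the allowed skew of one minor version is a policy choice rather than a GKE rule.

```python
import re


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version such as '1.29.4-gke.1043002'."""
    match = re.match(r"^v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def find_drifted_pools(
    control_plane: str,
    pools: dict[str, str],
    max_minor_skew: int = 1,  # assumed policy threshold
) -> dict[str, int]:
    """Return {pool_name: minor_skew} for pools lagging by more than max_minor_skew."""
    cp_major, cp_minor = parse_major_minor(control_plane)
    drifted = {}
    for name, version in pools.items():
        major, minor = parse_major_minor(version)
        if major != cp_major:
            raise ValueError(f"pool {name!r} is on a different major version")
        skew = cp_minor - minor
        if skew > max_minor_skew:
            drifted[name] = skew
    return drifted


if __name__ == "__main__":
    inventory = {"default": "1.30.2-gke.100", "gpu-excluded": "1.27.8-gke.200"}
    print(find_drifted_pools("1.30.5-gke.300", inventory))
```

A real job would also compare node image and kernel versions; this sketch covers only the Kubernetes minor version.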

In short: you gain finer control, but must compensate with tighter governance, placement rules, and automated checks.

GKE data-cache API: ephemeral local SSDs (dataCacheCount)

The API surface now exposes a node-pool level configuration to control how many local SSD devices are exposed per node. In Config Connector / ContainerCluster terms this appears under nodePools[].config.nodeConfig.ephemeralStorageLocalSsdConfig.dataCacheCount. The intent is to support a data-cache pattern using ephemeral local SSDs for short-lived, throughput-sensitive workloads such as batch ML staging, local shuffles, or ephemeral feature caches during preprocessing.

Technical benefits:

Local SSDs reduce I/O variability and avoid the extra network hop of remote block storage for temporary, throughput-sensitive paths.
The API knob (dataCacheCount) lets operators provision a consistent number of local devices per node without custom instance templates or manual node provisioning.

Practical adoption considerations:

Treat local SSD node pools as specialized hardware: isolate them with labels and taints, scale and budget separately, and include them in IaC templates.
Scheduling: enforce nodeSelector/nodeAffinity and tolerations so only compatible workloads land on data-cache nodes. Ensure your controllers and schedulers avoid falling back to persistent disks when local SSDs are required.
Data lifecycle: local SSDs are ephemeral and not durable across node replacement. Persist checkpoints and model artifacts to durable storage (for example, GCS) at intervals that match how much recomputation you can tolerate.
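
The scheduling rule above can be enforced with an admission-style check on the pod spec. The sketch below is one possible shape, assuming a hypothetical node label and taint key (`example.com/data-cache`); it inspects only `nodeSelector` and `tolerations`, so a policy based on `nodeAffinity` would need an additional branch.

```python
# Hypothetical label and taint used to isolate data-cache node pools.
REQUIRED_LABEL = ("example.com/data-cache", "true")
REQUIRED_TAINT_KEY = "example.com/data-cache"


def placement_problems(pod_spec: dict) -> list[str]:
    """Return reasons why a pod spec would not be pinned to data-cache nodes."""
    problems = []
    key, value = REQUIRED_LABEL
    if pod_spec.get("nodeSelector", {}).get(key) != value:
        problems.append(f"missing nodeSelector {key}={value}")
    tolerations = pod_spec.get("tolerations", [])
    if not any(t.get("key") == REQUIRED_TAINT_KEY for t in tolerations):
        problems.append(f"missing toleration for taint {REQUIRED_TAINT_KEY}")
    return problems


if __name__ == "__main__":
    spec = {
        "nodeSelector": {"example.com/data-cache": "true"},
        "tolerations": [
            {"key": "example.com/data-cache", "operator": "Exists", "effect": "NoSchedule"}
        ],
    }
    print(placement_problems(spec))      # compliant spec
    print(placement_problems({}))        # spec with no placement rules
```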

Example manifest snippet (Config Connector ContainerCluster, partial) showing field placement in a node pool:

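# Partial manifest: only the fields relevant to the data-cache setting are shown.
# Verify the field path against the CRD schema of your Config Connector version.
apiVersion: container.cnrm.cloud.google.com/v1beta1
kind: ContainerCluster
metadata:
  name: my-gke-cluster
spec:
  nodePools:
  - name: data-cache-pool          # dedicated pool for data-cache workloads
    config:
      nodeConfig:
        ephemeralStorageLocalSsdConfig:
          dataCacheCount: 2        # local SSD devices per node
    initialNodeCount: 3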

This example enables two local SSD devices per node in a dedicated pool. Validate local SSD throughput with automated benchmarks before opening the pool for production workloads.
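
A throughput gate for such a pool could look like the sketch below. It is a rough sequential-write probe using only the standard library, not a replacement for a dedicated tool such as fio; the mount path and the minimum threshold are assumptions you would replace with values for your own nodes.

```python
import os
import tempfile
import time


def sequential_write_mib_per_s(directory: str, total_mib: int = 64, block_mib: int = 4) -> float:
    """Write total_mib of random data to a temp file in `directory` and return MiB/s."""
    block = os.urandom(block_mib * 1024 * 1024)
    blocks = total_mib // block_mib
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        start = time.perf_counter()
        for _ in range(blocks):
            handle.write(block)
        handle.flush()
        os.fsync(handle.fileno())  # include the time to reach the device
        elapsed = time.perf_counter() - start
    return (blocks * block_mib) / elapsed


def passes_gate(measured_mib_per_s: float, minimum_mib_per_s: float) -> bool:
    """Return True when measured throughput meets the required minimum."""
    return measured_mib_per_s >= minimum_mib_per_s


if __name__ == "__main__":
    # The system temp directory stands in for the local SSD mount path here.
    rate = sequential_write_mib_per_s(tempfile.gettempdir(), total_mib=32)
    print(f"{rate:.1f} MiB/s, gate passed: {passes_gate(rate, 100.0)}")
```

Small write sizes are dominated by caching effects, so a production gate should use a larger `total_mib` and repeat the measurement.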

Vertex AI and hosted model patterns

Vertex AI's Model Garden continues to expand hosted model choices (for example, Anthropic Claude Opus 4.7 and additional Gemini-based variants). More hosted models simplify some multi-model agent or inference architectures but also create operational requirements:

Model routing and throttling: your inference gateway must support multi-model routing, per-model quotas, and billing attribution.
Observability and audit: instrument per-model telemetry so requests, latency, and cost are attributable to a specific model and tenant.
Compliance: hosted models have varying data-processing boundaries. Verify DPA, encryption, and retention controls before routing regulated data through hosted endpoints.
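
The routing, throttling, and attribution requirements above can be combined in a small gateway component. The sketch below uses a token bucket per model and returns a telemetry record for every decision; the class names, the model id, and the record fields are hypothetical, and no Vertex AI API is called.

```python
import time
from typing import Optional


class TokenBucket:
    """Token bucket allowing `capacity` bursts, refilled at `rate_per_s`."""

    def __init__(self, rate_per_s: float, capacity: float) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.updated: Optional[float] = None  # set on first call

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.updated is not None:
            refill = (now - self.updated) * self.rate_per_s
            self.tokens = min(self.capacity, self.tokens + refill)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ModelRouter:
    """Applies per-model rate limits and emits attributable telemetry records."""

    def __init__(self, limits: dict[str, tuple[float, float]]) -> None:
        # limits maps model id -> (refill rate per second, burst capacity)
        self._buckets = {m: TokenBucket(r, c) for m, (r, c) in limits.items()}

    def route(self, model_id: str, tenant: str, cost_center: str,
              now: Optional[float] = None) -> dict:
        bucket = self._buckets.get(model_id)
        if bucket is None:
            decision = "unknown_model"
        elif bucket.allow(now):
            decision = "allowed"
        else:
            decision = "throttled"
        return {"model_id": model_id, "tenant": tenant,
                "cost_center": cost_center, "decision": decision}


if __name__ == "__main__":
    router = ModelRouter({"model-a": (1.0, 2.0)})
    for _ in range(3):
        print(router.route("model-a", "tenant-1", "cc-42"))
```

Keeping the decision and the attribution tags in one record means the same event can feed both quota dashboards and cost reports.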

If you’re migrating from self-hosted models, hosted options shift operational effort away from serving infrastructure and toward access, telemetry, and cost control; capture per-model IAM, telemetry, and cost controls in your migration plan.

Networking, logging exports, and hybrid topologies

Recent networking notes include an EXTERNAL_PASSTHROUGH option for forwarding rules/backend services (useful when you need source-IP–preserving handoffs to external appliances) and Partner Cross-Cloud Interconnect options for predictable cross-cloud bandwidth between GCP and AWS (partner availability and pricing vary by region). Logging export improvements include finer filtering by namespace/ingestion labels and additional CMEK options for exported logs.

Architectural guidance:

Use EXTERNAL_PASSTHROUGH when you need the load balancer to forward traffic without SNAT to third-party appliances or bare-metal endpoints that require preserved source IPs.
Evaluate Partner Cross-Cloud Interconnect in lab tests for replication and backup workflows where throughput and latency SLAs matter; compare pricing and availability against VPN/site-to-site options.
Centralize export policies and use namespace/ingestion filters to limit exported data and apply CMEK only where regulatory requirements demand it.
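
Centralized export policy is easier to review when filters are generated from one allow-list. The sketch below builds a filter string in the standard Cloud Logging query syntax for the `k8s_container` resource type; whether the newer filtering controls accept exactly this form is an assumption, and the namespace names are illustrative.

```python
import re

# Kubernetes namespace names follow the DNS label format.
_NAMESPACE_RE = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def build_namespace_filter(namespaces: list[str]) -> str:
    """Build a log filter matching container logs from the given namespaces."""
    if not namespaces:
        raise ValueError("at least one namespace is required")
    for ns in namespaces:
        if _NAMESPACE_RE.fullmatch(ns) is None:
            raise ValueError(f"invalid namespace name: {ns!r}")
    clauses = " OR ".join(
        f'resource.labels.namespace_name="{ns}"' for ns in sorted(set(namespaces))
    )
    return f'resource.type="k8s_container" AND ({clauses})'


if __name__ == "__main__":
    print(build_namespace_filter(["payments", "audit"]))
```

Validating names before interpolation avoids malformed filters when the allow-list comes from user-editable configuration.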

Concrete checklist for platform teams

Rework upgrade governance for node pools

Treat node-pool exclusions as an intentional lifecycle choice. Add CI/CD checkpoints, require explicit exclusion durations, and automate alerts for pools excluded beyond defined thresholds.

Formalize node-pool specialization

Codify data-cache node pools in IaC, autoscaling rules, cost-center tags, and scheduling policies. Benchmark local SSD throughput automatically before production rollout.

Enforce per-model controls for inference

Add multi-model routing and per-model quotas, and feed per-model telemetry into cost attribution and billing. Enforce model-level IAM and observability.

Revisit hybrid networking

Add EXTERNAL_PASSTHROUGH and Partner Cross-Cloud Interconnect options to network diagrams and runbooks where source-IP preservation or predictable cross-cloud bandwidth is required.

Tighten observability exports

Use the new filtering and CMEK controls to send only required namespaces to high-cost sinks and to meet regulatory encryption requirements.

Update runbooks and playbooks

Add explicit steps for node-pool–specific failures, eviction scenarios on specialized pools, and model-serving fallbacks if data-cache nodes are degraded.

Prioritized projects

Automate an upgrade-exclusion review job that flags node pools excluded for more than 30, 60, or 90 days and triggers human review.
Create a data-cache node-pool template with automated throughput validation.
Extend inference proxies to tag telemetry with model id and cost center and to enforce per-model rate limits.
Pilot Partner Cross-Cloud Interconnect for one critical replication workflow and measure SLO compliance.
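
The first project, the exclusion review job, can start as a small bucketing function. This is a sketch under assumptions: the input is a hypothetical mapping from pool name to the date its exclusion began, which a real job would build from cluster metadata, and the output is the highest threshold each pool has exceeded.

```python
from datetime import date
from typing import Optional

# Review thresholds in days, checked from highest to lowest.
THRESHOLDS_DAYS = (90, 60, 30)


def review_bucket(excluded_since: date, today: date) -> Optional[int]:
    """Return the highest threshold exceeded, or None if under all thresholds."""
    age_days = (today - excluded_since).days
    for threshold in THRESHOLDS_DAYS:
        if age_days > threshold:
            return threshold
    return None


def pools_needing_review(exclusions: dict[str, date], today: date) -> dict[str, int]:
    """Return {pool_name: exceeded_threshold} for pools that need human review."""
    flagged = {}
    for pool, since in exclusions.items():
        bucket = review_bucket(since, today)
        if bucket is not None:
            flagged[pool] = bucket
    return flagged


if __name__ == "__main__":
    sample = {
        "stateless": date(2025, 5, 20),
        "stateful": date(2025, 4, 1),
        "gpu": date(2025, 1, 1),
    }
    print(pools_needing_review(sample, today=date(2025, 6, 1)))
```

The returned threshold can drive escalation, for example a ticket above 30 days and a page above 90.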

Summary

Individually these changes are small; cumulatively they change operational patterns. Treat the new controls (per-node-pool exclusions, the data-cache SSD setting, additional hosted models, and refined networking and export controls) as infrastructure primitives. Codify policies, automate guardrails, and treat specialized node pools and models as first-class components of your platform.
