GKE per-node-pool maintenance exclusions, 90-day no-upgrade window, and concurrent node-pool upgrades (Preview)

GKE just handed platform teams two practical levers and one sharp edge: you can now exclude maintenance per node pool, extend a “No upgrades” exclusion to 90 days, and, in Preview, run concurrent node-pool upgrades instead of the old one-at-a-time choreography. That combination makes a lot of the brittle upgrade scripts people have been living with unnecessary, but it also moves responsibility from Google’s default sequencing to operator policy and test discipline.

What changed, precisely

Per-node-pool maintenance exclusions: Node pools can opt out of automated maintenance independently of the cluster or release-channel settings. This is the natural extension of the existing cluster-level exclusions and makes it straightforward to protect pools that host stateful or latency-sensitive services.
90-day "No upgrades" exclusion: A "No upgrades" maintenance exclusion can now last up to 90 days. That’s a long freeze relative to GKE’s normal upgrade cadence, and it is explicitly supported as a first-class control.
Concurrent node pool upgrades (Preview): Administrators can set a maximum number of node pools to upgrade at the same time instead of the default serial (one-at-a-time) behavior. This is a Preview feature and exposes a concurrency knob for upgrade throughput.
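
To make the per-pool semantics concrete, here is a minimal Python sketch of how a platform team might model exclusions in its own tooling. It is not the GKE API: the `Exclusion` and `NodePool` classes, the pool names, and the dates are all hypothetical, and only the 90-day cap comes from the feature description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# The 90-day cap comes from the text above; everything else is illustrative.
MAX_NO_UPGRADES = timedelta(days=90)


@dataclass(frozen=True)
class Exclusion:
    """A hypothetical "No upgrades" window attached to one node pool."""
    start: datetime
    end: datetime

    def __post_init__(self) -> None:
        if self.end <= self.start:
            raise ValueError("exclusion must end after it starts")
        if self.end - self.start > MAX_NO_UPGRADES:
            raise ValueError("exclusion is longer than 90 days")

    def covers(self, moment: datetime) -> bool:
        return self.start <= moment < self.end


@dataclass(frozen=True)
class NodePool:
    name: str
    exclusion: Optional[Exclusion] = None


def upgradable_pools(pools: List[NodePool], now: datetime) -> List[str]:
    """Names of pools that have no exclusion in force at `now`."""
    return [
        p.name for p in pools
        if p.exclusion is None or not p.exclusion.covers(now)
    ]


utc = timezone.utc
freeze = Exclusion(datetime(2025, 1, 1, tzinfo=utc), datetime(2025, 3, 1, tzinfo=utc))
pools = [NodePool("general"), NodePool("stateful-db", freeze), NodePool("gpu")]
print(upgradable_pools(pools, datetime(2025, 1, 15, tzinfo=utc)))
# prints ['general', 'gpu']
```

The point of the model is that the frozen pool drops out of the upgrade set while the other pools keep moving, which is exactly what cluster-level exclusions could not express.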

Why this matters, and what it forces you to do differently

For modern clusters with many specialized node pools (GPU, high-memory, tainted pools for stateful workloads, preemptible/spot pools), per-pool exclusions are overdue. Teams have been working around cluster-level controls with fragile label-and-schedule tricks, manual cordons, or custom cron-driven drains. Per-pool exclusions are the right abstraction: protect the pools that matter without stalling the rest of the cluster.

The 90-day window is useful for long-running projects, regulatory freezes, or slow-moving device certification cycles. But 90 days is long enough to turn into a version-drift and security problem if the freeze is treated as a permanent setting. If you flip a pool to "No upgrades" for 90 days and never track the CVE fixes and kubelet patches that accumulate in the meantime, you’ll pay dearly in incident response later. In short: this is a tool for planned freezes, not an excuse to ignore maintenance.
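
One way to keep a freeze honest is to audit it like any other piece of tracked debt. The sketch below is a hypothetical audit helper, not something GKE provides: the `Freeze` record, the owner field, the 14-day warning threshold, and the patch release dates are all made-up inputs you would feed from your own inventory and release notes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Freeze:
    """A hypothetical record of one planned "No upgrades" freeze."""
    pool: str
    owner: str
    start: date
    end: date


def audit_freeze(freeze: Freeze, patch_dates: List[date], today: date,
                 warn_days: int = 14) -> Dict[str, object]:
    """Summarize how much time is left and how many patches were skipped."""
    # Patches released while the pool was frozen are the debt to repay later.
    missed = sum(1 for d in patch_dates if freeze.start <= d <= today)
    days_left = (freeze.end - today).days
    if days_left < 0:
        status = "expired"
    elif days_left <= warn_days:
        status = "expiring"
    else:
        status = "active"
    return {
        "pool": freeze.pool,
        "owner": freeze.owner,
        "status": status,
        "days_left": days_left,
        "missed_patches": missed,
    }


# Illustrative data only: a 90-day freeze and four invented patch dates.
freeze = Freeze("stateful-db", "team-data", date(2025, 1, 1), date(2025, 4, 1))
patches = [date(2024, 12, 15), date(2025, 1, 20), date(2025, 2, 18), date(2025, 3, 25)]
report = audit_freeze(freeze, patches, today=date(2025, 3, 20))
print(report["status"], report["days_left"], report["missed_patches"])
# prints: expiring 12 2
```

Running something like this on a schedule gives every freeze an owner, an end date, and a visible count of what it has skipped.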

Concurrent upgrades are the real operational lever. If you operate dozens of node pools, serial upgrades are painfully slow, and a single delayed or failed upgrade stalls every pool queued behind it. Upgrading n pools at a time shortens the total time spent in maintenance, but only if your workloads tolerate simultaneous drain events across pools. Pod Disruption Budgets (PDBs), StatefulSet partitioning, DaemonSet availability, CSI driver behavior under concurrent drains, and cluster-autoscaler interactions all become first-order concerns.
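
The trade-off is easy to put numbers on. The following sketch assumes a simplified wave model (pools are upgraded in fixed batches, and each batch takes as long as its slowest pool), which may not match how GKE actually schedules concurrent upgrades. The durations, pool names, and replica counts are invented, and the PDB check is a deliberately pessimistic worst case in which every replica on a draining pool is unavailable at once.

```python
from typing import Dict, List


def upgrade_waves(pools: List[str], max_concurrent: int) -> List[List[str]]:
    """Split pools into batches of at most `max_concurrent`."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return [pools[i:i + max_concurrent]
            for i in range(0, len(pools), max_concurrent)]


def estimated_minutes(durations: Dict[str, int], max_concurrent: int) -> int:
    """Wall-clock estimate: each wave lasts as long as its slowest pool."""
    waves = upgrade_waves(list(durations), max_concurrent)
    return sum(max(durations[p] for p in wave) for wave in waves)


def breaches_min_available(replicas_by_pool: Dict[str, int],
                           min_available: int, wave: List[str]) -> bool:
    """Worst case: all replicas on the wave's pools are down together."""
    total = sum(replicas_by_pool.values())
    draining = sum(replicas_by_pool.get(p, 0) for p in wave)
    return total - draining < min_available


# Hypothetical per-pool upgrade durations in minutes.
durations = {"general": 40, "gpu": 60, "highmem": 30, "spot": 20}
print(estimated_minutes(durations, 1))  # prints 150
print(estimated_minutes(durations, 2))  # prints 90

# Hypothetical workload: 5 replicas spread over three pools, minAvailable = 3.
replicas = {"general": 2, "highmem": 2, "spot": 1}
for wave in upgrade_waves(list(durations), 2):
    print(wave, breaches_min_available(replicas, 3, wave))
# prints ['general', 'gpu'] False
# prints ['highmem', 'spot'] True
```

In this toy example, two-way concurrency cuts the estimate from 150 to 90 minutes, but the second wave pairs the two pools that together hold most of one workload's replicas. That is the kind of cross-pool coupling serial upgrades never exposed.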

Here’s my blunt take: Google is doing the right thing by shifting control down to the entity that actually knows which workloads must be frozen. But they’re also outsourcing risk. Concurrent upgrades without better guardrails (node-pool surge limits, clear visibility into cross-pool eviction rates, or automatic rollbacks on PDB violations) will lead to outages for teams that treat this as a button to speed things up without testing.

Operational checklist (short)

Treat a 90-day "No upgrades" as a scheduled, auditable freeze; track security backports separately.
Add upgrade concurrency to your staging runs: simulate drains across the same set of pools you’ll touch in prod.
Audit PDBs and DaemonSet tolerations — concurrent drains expose gaps fast.
Monitor node pool-level upgrade state and pod eviction metrics during preview runs; set alerts for abnormal eviction spikes.
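
For the last item, a simple rolling-baseline check is often enough to start with. This is a sketch under assumptions, not a recommended alerting rule: the per-minute eviction counts are invented, and the window size, threshold multiplier, and minimum deviation floor are arbitrary starting values you would tune against your own metrics.

```python
from statistics import mean, pstdev
from typing import List


def eviction_spikes(counts: List[int], window: int = 5,
                    threshold: float = 3.0, min_dev: float = 1.0) -> List[int]:
    """Return indices whose count exceeds the trailing baseline.

    The baseline is the mean of the previous `window` samples plus
    `threshold` standard deviations, with a floor on the deviation so
    that a perfectly flat baseline does not flag tiny changes.
    """
    spikes = []
    for i in range(window, len(counts)):
        base = counts[i - window:i]
        limit = mean(base) + threshold * max(pstdev(base), min_dev)
        if counts[i] > limit:
            spikes.append(i)
    return spikes


# Hypothetical evictions per minute during a preview upgrade run.
per_minute = [2, 3, 2, 4, 3, 2, 25, 3]
print(eviction_spikes(per_minute))  # prints [6]
```

Because the baseline trails the data, a sustained burst raises its own threshold after the first alert, so pair this with an absolute ceiling if you care about long plateaus as well as sudden jumps.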

Final thought

These changes signal that Google accepts clusters are heterogeneous by design, not by accident. Giving operators per-pool exclusions and a concurrency knob is overdue and practical, but it’s not risk-free. Expect platform teams that treat the 90-day freeze as deliberate policy to become the slow, safe operators; expect teams that treat concurrency as a performance hack to be the source of the next wave of “it worked in canary” incidents. Pick your side, and write a test to prove it.
