Gemini 3.1 Pro & Flash‑Lite Preview: Vertex AI, Gemini API, Cloud Run worker pools, and BigQuery quotas

Google Cloud's latest wave landed as three coordinated moves that matter to platform teams: Gemini 3.1 is being split into a Pro model behind Vertex AI and a Flash‑Lite variant exposed through the Gemini API; Cloud Run worker pools hit GA; and billing controls (including CUD scope changes and BigQuery token quotas) are getting real enforcement. None of these is a headline feature on its own, but together they change where capability, control, and cost live.

Gemini 3.1: Pro in Vertex, Flash‑Lite for the API

Google is previewing Gemini 3.1 Pro inside Vertex AI and enterprise tiers, while rolling out a Flash‑Lite build to developers via the Gemini API and Google AI tools. That's a deliberate split: the full, higher-capacity Pro model lives where enterprise features belong (private endpoints, VPC Service Controls, enterprise audit logs), and a trimmed Flash‑Lite variant is what general API consumers get.

This is the right call from an operational perspective: high-capability models should sit behind enterprise controls. But it also creates new operational friction, because model parity can no longer be assumed across endpoints. Expect to change SDK wrappers, CI/CD pipelines, and model selection logic in your MLOps stack to target either Vertex-backed Pro models or the Flash‑Lite API endpoint. If you treat the Gemini API as a single homogeneous model provider, you'll be surprised by capability and cost differences.
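One way to make the endpoint split explicit in model selection logic is a small lookup that refuses to route a workload to the wrong endpoint. The following is a minimal sketch under stated assumptions: the tier names, the `ModelTarget` type, the `select_target` helper, and both model IDs are hypothetical placeholders, not identifiers confirmed by Google.

```python
from dataclasses import dataclass
from enum import Enum


class Endpoint(Enum):
    VERTEX_AI = "vertex-ai"
    GEMINI_API = "gemini-api"


@dataclass(frozen=True)
class ModelTarget:
    endpoint: Endpoint
    model_id: str  # hypothetical placeholder, not a confirmed model name


# Hypothetical mapping: replace the IDs with whatever your endpoints expose.
TARGETS = {
    "pro": ModelTarget(Endpoint.VERTEX_AI, "example-pro-model"),
    "flash-lite": ModelTarget(Endpoint.GEMINI_API, "example-flash-lite-model"),
}


def select_target(tier: str, needs_enterprise_controls: bool) -> ModelTarget:
    """Return the target for a tier, failing loudly on a mismatched endpoint."""
    target = TARGETS.get(tier)
    if target is None:
        raise ValueError(f"unknown model tier: {tier!r}")
    if needs_enterprise_controls and target.endpoint is not Endpoint.VERTEX_AI:
        # Assumption from the article: enterprise controls live on Vertex AI.
        raise ValueError(f"tier {tier!r} is not served from Vertex AI")
    return target
```

Failing at selection time, rather than at request time, lets a CI check catch a deployment that points an enterprise-gated workload at the developer endpoint.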

Two practical follow-ons you should expect immediately:

Token accounting will diverge. Pro vs Flash‑Lite token costs and throughput differ; instrumenting per-endpoint token usage in your observability pipelines will become mandatory.
Access and compliance controls won't be the same. Vertex AI provides enterprise-level gating; the Gemini API remains more developer-friendly but offers less enterprise isolation.
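Per-endpoint token accounting can start as a very small in-process aggregator before it is wired into a real metrics backend. This is a sketch only: the `EndpointUsage` class, its field names, and the endpoint labels are illustrative assumptions, and the token counts are expected to come from whatever usage metadata your client returns.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageTotals:
    requests: int = 0
    input_tokens: int = 0
    output_tokens: int = 0
    latency_ms: float = 0.0

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    @property
    def mean_latency_ms(self) -> float:
        return self.latency_ms / self.requests if self.requests else 0.0


class EndpointUsage:
    """Accumulates token and latency totals keyed by endpoint label."""

    def __init__(self) -> None:
        self._totals = defaultdict(UsageTotals)

    def record(self, endpoint: str, input_tokens: int,
               output_tokens: int, latency_ms: float) -> None:
        totals = self._totals[endpoint]
        totals.requests += 1
        totals.input_tokens += input_tokens
        totals.output_tokens += output_tokens
        totals.latency_ms += latency_ms

    def snapshot(self) -> dict:
        # Return a plain dict so callers cannot create keys by accident.
        return dict(self._totals)
```

Keeping the endpoint as the aggregation key means Pro and Flash‑Lite traffic never blend into one number, which is the property the cost comparison depends on.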

Cloud Run worker pools — GA for pull and non‑HTTP workloads

Cloud Run worker pools graduated to GA. Worker pools are a resource type aimed at pull-based, non-HTTP workloads (message consumers, cron-style jobs, long-running background workers) instead of classic request/response services.

Practically, that means you can now treat Cloud Run as a managed execution environment for event-driven, at-least-once workloads without shoehorning them into HTTP interfaces or juggling Pub/Sub push adapters. Worker pools also change autoscaling semantics and resource boundaries; platform engineers should revisit service templates, IAM bindings for task schedulers, and observability on concurrency metrics.

If you're still mapping every asynchronous job to a Kubernetes Deployment simply for pull semantics, worker pools are the feature you should evaluate first.
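The shape of a pull-based worker under at-least-once delivery can be sketched without any cloud dependency. In this illustration an in-memory `queue.Queue` stands in for the real message source, and `run_worker`, the message dict layout, and `max_attempts` are all hypothetical; a production worker would block on its subscription instead of returning when the queue is empty.

```python
import queue
from typing import Callable


def run_worker(source: queue.Queue,
               handle: Callable[[dict], None],
               max_attempts: int = 3) -> set:
    """Drain `source`, skipping duplicates and retrying failed messages.

    Returns the set of message IDs that were processed successfully.
    """
    done = set()
    attempts = {}
    while True:
        try:
            msg = source.get_nowait()
        except queue.Empty:
            return done  # demo only: a real worker would keep waiting
        msg_id = msg["id"]
        if msg_id in done:
            continue  # duplicate delivery: already processed, so skip it
        try:
            handle(msg)
            done.add(msg_id)  # mark as done only after the handler succeeds
        except Exception:
            attempts[msg_id] = attempts.get(msg_id, 0) + 1
            if attempts[msg_id] < max_attempts:
                source.put(msg)  # requeue for another attempt
```

Because a message is marked done only after the handler returns, a crash mid-handler leads to redelivery rather than loss, which is why the handler itself should be idempotent.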

Billing: CUD scope change and BigQuery token quotas go GA

Small but operationally significant: Google changed CUD scope behavior for certain legacy Cloud Billing accounts (created before June 16, 2026), making billing-account-level scope available when there are no active resource-level commitments. That effectively enables CUD sharing for a cohort of legacy accounts. If your organization has legacy billing accounts, this can unlock immediate cost optimization; if not, it's a reminder that billing scope is now a first-class operational variable. (We covered this behavioral change in depth earlier.)

Meanwhile, daily token quotas for BigQuery generative AI functions are now GA. Token-based usage has shown up as a surprise on some bills; daily token quotas give platform teams a direct throttle for cost control and incident isolation.
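To make the throttling idea concrete, here is an application-side analogue of a daily token quota. It does not show how the BigQuery quota itself is configured, which this article does not describe; the `DailyTokenBudget` class and its method names are hypothetical and only illustrate the reset-per-day, reject-when-exhausted behavior.

```python
from datetime import date


class DailyTokenBudget:
    """Tracks token spend per calendar day and rejects requests over the cap."""

    def __init__(self, limit: int) -> None:
        if limit < 0:
            raise ValueError("limit must be non-negative")
        self.limit = limit
        self._day = None
        self._used = 0

    def try_consume(self, tokens: int, today: date) -> bool:
        """Reserve `tokens` for `today`; return False if the cap would be exceeded."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if today != self._day:
            # First request of a new day: reset the counter.
            self._day = today
            self._used = 0
        if self._used + tokens > self.limit:
            return False
        self._used += tokens
        return True

    @property
    def used(self) -> int:
        return self._used
```

Passing the date in explicitly keeps the class testable and makes the reset boundary (which time zone counts as "a day") a visible decision rather than a hidden one.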

What to do Monday morning

Start by mapping where you call Gemini (Vertex AI vs. the Gemini API) and add per-endpoint token and latency metrics. Audit CI and model-targeting logic so deployments pick the correct endpoint. For Cloud Run, evaluate worker pools for message consumers and long-running tasks; you'll likely remove several awkward HTTP shims. Finally, check billing-account settings for accounts created before June 16, 2026, and apply token quotas to BigQuery generative AI functions wherever they back external-facing features.

This release tranche is less about one new toy and more about responsibility: capability is being parceled into enterprise controls while lighter-weight models and execution primitives are being productized for developers. Platform teams that keep tooling, quotas, and billing rules tightly coupled to endpoints will win; the ones that assume API parity will be surprised by both their bills and their models' behavior. Watch for SDK updates and MLOps patches over the next two sprints; they won't be optional.

Sources

Cloud Run Worker Pools GA — Pull-based non-HTTP workers as a first-class Cloud Run resource

Gemini 3.1 Pro & Flash-Lite previewed on Vertex AI and Gemini API; Cloud Run worker pools GA
