Google Cloud: Gemini 3.1 Flash‑Lite & Pro previews, Cloud Run worker pools GA, Fractional G4s, and gcloud/url-map updates

Gemini 3.1 Flash‑Lite/Pro previews, Cloud Run worker pools GA, Fractional G4 GPUs, and gcloud/url-map updates — operational guidance for platform and SRE teams.

May 29, 2026·6 min read·AI researched · AI written · AI reviewed

Google Cloud's recent release notes and previews introduce several operational knobs that platform and SRE teams should evaluate: two Gemini 3.1 variants in preview, Cloud Run worker pools reaching GA as a first-class resource for pull-based workloads, and infrastructure/CLI changes (Fractional G4 GPU attachments, new gcloud flags, and a beta cachePolicy for URL maps). Together they give teams new levers for isolation, cost control, and low-disruption operations when running generative models on standard cloud infrastructure.

Gemini 3.1: Flash‑Lite and Pro previews — how to apply them

Google announced two 3.1 variants in preview: Gemini 3.1 Flash‑Lite and Gemini 3.1 Pro. Treat them as distinct inference tiers with different cost, latency, and fidelity trade-offs.

  • Gemini 3.1 Flash‑Lite: positioned for high-volume, cost-sensitive text workloads where per-token cost and throughput matter. Typical uses include high-QPS classification, short completions, and telemetry enrichment.

  • Gemini 3.1 Pro: positioned for higher-fidelity workloads that require deeper reasoning, coding assistance, or complex analytic responses.

Operational implications

  • Tiering and routing: implement traffic routing (API gateway, model-router, or feature-flag layer) to send latency- or cost-sensitive requests to Flash‑Lite and higher-fidelity requests to Pro. Capture token usage per request to feed cost-aware routing.

  • Vertex AI deployment: both variants appear in the Vertex AI preview deployment flow. Integrate model lifecycle operations (versioning, A/B rollout, canary) into existing pipelines and IaC that manage endpoints and deployments.

  • Observability: Flash‑Lite deployments should focus on cost-per-inference and tail latency under load; Pro deployments should add fidelity signals (consistency, hallucination detection). Include model-variant and token-usage metadata in traces and metrics (OpenTelemetry or your APM of choice).
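The tiering rule described above can be sketched as a small routing function. This is a minimal illustration, not a reference implementation: the tier identifiers, task names, and thresholds are all hypothetical and would need to come from your own workload policy and the preview documentation.

```python
from dataclasses import dataclass

# Hypothetical tier identifiers; the real preview model IDs may differ.
FLASH_LITE = "gemini-3.1-flash-lite"
PRO = "gemini-3.1-pro"

# Assumed task classes that justify the higher-fidelity tier.
PRO_TASKS = frozenset({"code_review", "multi_step_analysis"})


@dataclass(frozen=True)
class InferenceRequest:
    task: str               # e.g. "classification", "code_review"
    prompt_tokens: int      # estimated input size
    latency_budget_ms: int  # caller's end-to-end latency budget


def choose_tier(req: InferenceRequest,
                max_lite_prompt_tokens: int = 4000,
                min_pro_latency_ms: int = 2000) -> str:
    """Return the model tier for a request.

    Pro is chosen when the task needs it and the latency budget can absorb
    it, or when the prompt exceeds the (assumed) Flash-Lite size policy.
    Everything else goes to Flash-Lite.
    """
    if req.task in PRO_TASKS and req.latency_budget_ms >= min_pro_latency_ms:
        return PRO
    if req.prompt_tokens > max_lite_prompt_tokens:
        return PRO
    return FLASH_LITE
```

Keeping the decision in one pure function makes it easy to unit-test and to log the chosen tier alongside token usage for each request.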

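Example: the generic Vertex AI endpoint deployment flow, shown with placeholder IDs. Gemini models are normally served as Google-managed models and called through the API rather than deployed onto machine types you choose, so confirm in the preview documentation whether this flow applies to the 3.1 variants before automating it:

# Create an endpoint (regional)
gcloud ai endpoints create \
  --region=us-central1 \
  --display-name=gemini-3-1-flash-lite-endpoint

# Deploy a model to the endpoint. ENDPOINT_ID is returned by the create step;
# MODEL_ID must refer to a model that is already uploaded or registered.
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --model=MODEL_ID \
  --display-name=gemini-3-1-flash-lite-deploy \
  --region=us-central1 \
  --machine-type=n1-standard-8 \
  --min-replica-count=1 \
  --max-replica-count=5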

Adjust machine types and replica counts to meet your throughput and latency SLAs. Use preview pricing and guidance to set autoscaling thresholds and cost alerts.

Cloud Run worker pools GA: a named resource for pull-based workloads

Cloud Run worker pools are GA as a region-scoped resource for non-HTTP, pull-based workloads. This changes how teams manage background processing by separating compute profile and lifecycle from per-job deployment artifacts.

Key patterns and benefits

  • Worker pool as a resource: create and manage named worker pools to standardize compute profiles (CPU and memory limits, VPC connectors, service accounts) and reuse them across jobs.

  • Non-HTTP, pull-based semantics: target workloads consuming Pub/Sub, Cloud Tasks, or custom queues where a request/response HTTP surface is unnecessary.

  • Security and isolation: attach dedicated service accounts and IAM roles to worker pools to enforce least-privilege boundaries. Combine with VPC egress settings for network isolation.

  • Operational lifecycle: worker pools enable centralized configuration for compute and networking while jobs remain lightweight to deploy and update.
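To make the pull-based semantics concrete, here is a minimal sketch of the loop a worker container might run. It uses Python's in-process queue.Queue as a stand-in for Pub/Sub or Cloud Tasks, so the function name, retry limit, and dead-letter counting are illustrative assumptions rather than any Cloud Run or Pub/Sub API.

```python
import queue
from typing import Callable


def drain(q: "queue.Queue[str]", handle: Callable[[str], None],
          max_attempts: int = 3) -> dict:
    """Pull messages until the queue is empty, retrying failures.

    A message that still fails after max_attempts is counted as
    dead-lettered instead of being retried forever. Attempts are tracked
    by message value, so this sketch assumes messages are unique.
    """
    attempts = {}
    stats = {"ok": 0, "dead": 0}
    while True:
        try:
            msg = q.get_nowait()      # pull; no inbound HTTP request involved
        except queue.Empty:
            return stats
        attempts[msg] = attempts.get(msg, 0) + 1
        try:
            handle(msg)
            stats["ok"] += 1          # success plays the role of an ack
        except Exception:
            if attempts[msg] < max_attempts:
                q.put(msg)            # requeue, similar to a nack
            else:
                stats["dead"] += 1
```

A real worker would block on the subscription instead of returning when the queue is empty, and would export the ok/dead counters as metrics.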

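Example flow: create a worker pool and reference it from a Cloud Run job manifest. Treat the create subcommand and the workerPool manifest field as assumptions to verify against gcloud run worker-pools --help and the Job schema, since Cloud Run's documentation describes worker pools as standalone resources deployed with their own container image.

# Create a worker pool (regional)
gcloud run worker-pools create my-worker-pool \
  --region=us-central1 \
  --project=my-project \
  --service-account=worker-sa@my-project.iam.gserviceaccount.com

# Example Cloud Run job manifest referencing the worker pool.
# The heredoc delimiter is quoted so the shell expands nothing inside it.
cat > job.yaml <<'EOF'
apiVersion: run.googleapis.com/v1
kind: Job
metadata:
  name: batch-processor
spec:
  template:        # execution template
    spec:
      template:    # task template
        spec:
          # Assumed field; confirm it exists in the Job schema.
          workerPool: projects/my-project/locations/us-central1/workerPools/my-worker-pool
          containers:
            - image: gcr.io/my-project/batch-processor:stable
              env:
                - name: QUEUE_NAME
                  value: projects/my-project/topics/my-topic
          maxRetries: 3
EOF

# jobs replace takes the manifest path as a positional argument
gcloud run jobs replace job.yaml --region=us-central1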

Use worker pools to centralize network egress, VPC access, and credential handling while keeping per-job deployments simple.

Infra and CLI/networking updates

Several release-note items affect capacity planning and change automation:

  • Fractional G4 GPU attachments: finer-grained GPU sizing lets you right-size GPU capacity for inference fleets and preprocessing jobs, improving utilization and cost efficiency.

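  • gcloud compute instances update --minimal-action: sets the least action (no-effect, refresh, or restart) that the update must perform on the instance, mirroring the minimalAction parameter of the Compute Engine instances.update API. The API's companion mostDisruptiveAllowedAction parameter rejects an update that would need anything more disruptive. Use this in automation where in-place changes are safe, and add validations rather than assuming every update is non-disruptive.

Example (apply a metadata update with the least disruptive action; confirm with gcloud compute instances update --help which properties the command accepts, since metadata is conventionally set with gcloud compute instances add-metadata):

gcloud compute instances update my-instance-1 \
  --zone=us-central1-a \
  --metadata=CONFIG_VERSION=v2 \
  --minimal-action=no-effect

  • Beta cachePolicy for compute URL maps: URL maps can now include cachePolicy fields (beta). This makes it possible to manage caching behavior at the load-balancer routing level (path- or prefix-specific TTLs and cache modes) and keep cache rules colocated with routing logic.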

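Conceptual url-map route rule with cachePolicy (field names and values are assumed to mirror the backend-service cdnPolicy and should be checked against the beta reference; routeRules sit under a pathMatchers entry):

routeRules:
  - priority: 0
    matchRules:
      - prefixMatch: "/static/"
    service: https://www.googleapis.com/compute/v1/projects/my-project/global/backendServices/static-backend
    cachePolicy:
      # In cdnPolicy, defaultTtl and maxTtl are not accepted with
      # USE_ORIGIN_HEADERS, so a mode that honors them is used here.
      cacheMode: CACHE_ALL_STATIC
      defaultTtl: 3600
      maxTtl: 86400
      negativeCaching: true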
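When piloting route-level caching it helps to write down the TTL you expect before measuring. The sketch below encodes the TTL rules of Cloud CDN's existing cdnPolicy cache modes and assumes, without confirmation, that the beta url-map cachePolicy behaves the same way. The function name is hypothetical, and the response is assumed to be cacheable static content.

```python
from typing import Optional


def effective_ttl(cache_mode: str, origin_max_age: Optional[int],
                  default_ttl: int = 3600,
                  max_ttl: int = 86400) -> Optional[int]:
    """Return the expected cache TTL in seconds, or None if not cached.

    origin_max_age is the max-age/s-maxage the origin sent, or None if the
    origin sent no caching directive.
    """
    if cache_mode == "FORCE_CACHE_ALL":
        return default_ttl                    # origin headers are ignored
    if cache_mode == "USE_ORIGIN_HEADERS":
        return origin_max_age                 # None means "do not cache"
    if cache_mode == "CACHE_ALL_STATIC":
        if origin_max_age is None:
            return default_ttl                # fall back when origin is silent
        return min(origin_max_age, max_ttl)   # origin value capped by maxTtl
    raise ValueError(f"unknown cache mode: {cache_mode}")
```

For instance, with the defaults above an origin that sends max-age=604800 under CACHE_ALL_STATIC would be expected to cache for 86400 seconds, which is a useful value to compare against observed cache behavior.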

These incremental changes reduce friction for large-scale deployments by offering lower-disruption updates, more granular cache control, and more flexible GPU sizing.
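Right-sizing with fractional GPUs reduces to picking the smallest slice that still fits the workload. The following sketch assumes a hypothetical set of slice sizes and assumes that GPU memory scales proportionally with the slice. Neither is confirmed by the release notes, so check the G4 documentation for the real options before using numbers like these.

```python
from fractions import Fraction

# Hypothetical slice sizes in ascending order; the real options may differ.
AVAILABLE_SLICES = (Fraction(1, 8), Fraction(1, 4), Fraction(1, 2), Fraction(1))


def smallest_slice(required_gb: float, full_gpu_gb: float,
                   headroom: float = 0.2) -> Fraction:
    """Pick the smallest slice whose memory covers the need plus headroom.

    Assumes memory is split proportionally to the slice size.
    """
    needed = required_gb * (1 + headroom)
    for s in AVAILABLE_SLICES:
        if float(s) * full_gpu_gb >= needed:
            return s
    raise ValueError("workload does not fit on a single GPU")
```

The headroom parameter leaves room for activation spikes and fragmentation; tune it from observed peak memory rather than model size alone.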

Practical operational recommendations

  • Architect model tiers: define which workloads should go to Flash‑Lite vs Pro and implement routing, cost-aware fallbacks, and per-tenant model selection in your model-serving gateway.

  • Standardize worker compute: build a small catalog of worker-pool specs (e.g., small/medium/large) with standardized IAM and VPC settings, and migrate batch/cron jobs to those pools iteratively.

  • Update autoscaling and capacity planning: include fractional GPU options and Flash‑Lite throughput profiles in right-sizing exercises. Tag autoscaler metrics with model-variant to avoid noisy signals across heterogeneous model classes.

  • Adopt low-disruption update modes carefully: integrate --minimal-action into deployment automation where safe and add pipeline checks to determine when a full restart is necessary.

  • Pilot route-level caching: test cachePolicy on non-critical URL maps to validate cache-control semantics end-to-end before broader rollout.
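The pipeline check recommended for low-disruption updates can be sketched as a gate that compares the action a change needs with the most disruptive action the pipeline permits. The action names follow the values the Compute Engine API documents for minimalAction; the mapping from changed properties to actions is purely hypothetical and should be replaced with what you verify for your own instance types.

```python
# Ordered from least to most disruptive.
ACTIONS = ("no-effect", "refresh", "restart")

# Hypothetical mapping of changed properties to the action they require.
REQUIRED_ACTION = {
    "labels": "no-effect",
    "metadata": "refresh",
    "machine_type": "restart",
}


def required_action(changed_fields: list) -> str:
    """Return the most disruptive action that any changed field needs.

    Unknown fields are treated as needing a restart, the safe default.
    """
    levels = [ACTIONS.index(REQUIRED_ACTION.get(f, "restart"))
              for f in changed_fields]
    return ACTIONS[max(levels, default=0)]


def is_allowed(changed_fields: list, most_disruptive_allowed: str) -> bool:
    """True if the update stays within the allowed level of disruption."""
    needed = ACTIONS.index(required_action(changed_fields))
    return needed <= ACTIONS.index(most_disruptive_allowed)
```

A pipeline could call is_allowed before issuing the update and route anything that returns False to a change window that tolerates a restart.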

Actionable next steps

  1. Create a model-variant policy enumerating workloads for Flash‑Lite vs Pro and implement routing rules.
  2. Build a worker-pool catalog and migrate representative jobs to verify IAM and network isolation.
  3. Add token-usage and model-cost metrics to APM and billing dashboards and link them to autoscaling policies.
  4. Pilot cachePolicy on a test URL map and measure behavior under production-like traffic.
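Step 3 can start with a per-request cost attribute. The sketch below is a rough illustration: the prices are placeholders rather than published pricing, and the attribute names are hypothetical, not an OpenTelemetry convention.

```python
# Placeholder USD prices per million tokens; replace with published pricing.
PRICE_PER_MTOK = {
    "flash-lite": {"input": 0.10, "output": 0.40},
    "pro": {"input": 2.00, "output": 12.00},
}


def request_cost_usd(variant: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    price = PRICE_PER_MTOK[variant]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def usage_attributes(variant: str, input_tokens: int, output_tokens: int) -> dict:
    """Build a flat record suitable for attaching to a span or metric."""
    return {
        "model.variant": variant,
        "tokens.input": input_tokens,
        "tokens.output": output_tokens,
        "cost.usd": round(request_cost_usd(variant, input_tokens, output_tokens), 6),
    }
```

Emitting the variant as its own attribute lets dashboards and autoscaling signals be split by model tier instead of averaging across them.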

These changes are incremental but provide important operational levers. Focus on model routing, worker isolation, and low-disruption change paths to realize cost, isolation, and performance improvements as generative AI moves into production workloads.
