Kubernetes v1.36 Beta: Adjusting Job Resources on the Fly for Suspended Workloads

Introduction

Kubernetes v1.36 promotes to beta the ability to modify container resource requests and limits in the pod template of a suspended Job. First introduced as alpha in v1.35, this feature lets queue controllers and cluster administrators adjust CPU, memory, GPU, and other extended resource specifications on a Job while it is suspended, before it starts or resumes execution. This closes a gap in batch and machine learning workflows, where resource demands are not always known when the Job is created.

Why Mutable Pod Resources for Suspended Jobs?

Batch and machine learning workloads often have resource requirements that depend on current cluster capacity, queue priorities, and the availability of specialized hardware such as GPUs. Before this feature, the resource fields in a Job’s pod template were immutable once the Job was created. If a queue controller such as Kueue determined that a suspended Job should run with different resources, the only recourse was to delete and recreate the Job. That approach is expensive and disruptive because the original Job’s metadata, status, and history are lost.

This feature offers a more graceful path. For example, a Job created by a CronJob can run with reduced resources when the cluster is heavily loaded rather than failing outright. Queue controllers can also adjust resource allocation dynamically, which can improve overall cluster utilization and Job success rates.

Example: Machine Learning Training Job

Consider a machine learning training Job that initially requests 4 GPUs:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example-abcd123
  labels:
    app.kubernetes.io/name: trainer
spec:
  suspend: true
  template:
    metadata:
      annotations:
        kubernetes.io/description: "ML training, ID abcd123"
    spec:
      containers:
      - name: trainer
        image: example-registry.example.com/training:2026-04-23T150405.678
        resources:
          requests:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
          limits:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
      restartPolicy: Never

A queue controller managing cluster resources might determine that only 2 GPUs are available. With this feature, the controller can update the Job’s resource requests before resuming it:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example-abcd123
  labels:
    app.kubernetes.io/name: trainer
spec:
  suspend: true
  template:
    metadata:
      annotations:
        kubernetes.io/description: "ML training, ID abcd123"
    spec:
      containers:
      - name: trainer
        image: example-registry.example.com/training:2026-04-23T150405.678
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
            example-hardware-vendor.com/gpu: "2"
          limits:
            cpu: "4"
            memory: "16Gi"
            example-hardware-vendor.com/gpu: "2"
      restartPolicy: Never

After the resources are updated, the controller resumes the Job by setting spec.suspend to false, and the new Pods are created with the adjusted resource specifications. This process avoids deletion and preserves all associated metadata and history.
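The update-then-resume sequence above can be expressed as two JSON Patch (RFC 6902) documents. The following is a minimal sketch, not a controller implementation: the helper names resources_patch and resume_patch are hypothetical, and it assumes the container to change is addressed by its index in the containers list and that requests and limits are set to the same values, as in the manifests above.

```python
import json

# Extended resource name taken from the example manifest above.
GPU_RESOURCE = "example-hardware-vendor.com/gpu"


def resources_patch(container_index, cpu, memory, gpus):
    """Build a JSON Patch that replaces one container's resources.

    Hypothetical helper: requests and limits are set to the same values,
    mirroring the example manifests.
    """
    amounts = {"cpu": cpu, "memory": memory, GPU_RESOURCE: gpus}
    path = f"/spec/template/spec/containers/{container_index}/resources"
    return [{
        "op": "replace",
        "path": path,
        "value": {"requests": dict(amounts), "limits": dict(amounts)},
    }]


def resume_patch():
    """Build a JSON Patch that resumes the Job by clearing spec.suspend."""
    return [{"op": "replace", "path": "/spec/suspend", "value": False}]


if __name__ == "__main__":
    # Print both patch bodies as JSON, in the order they would be applied.
    print(json.dumps(resources_patch(0, "4", "16Gi", "2")))
    print(json.dumps(resume_patch()))
```

Each printed document could be applied with kubectl patch job training-job-example-abcd123 --type=json -p '<patch>', sending the resource patch first and the resume patch second so that new Pods are created from the updated template.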

How It Works

The Kubernetes API server relaxes the immutability constraint on pod template resource fields specifically for Jobs that are suspended. No new API types are introduced; the existing Job and pod template structures accommodate the change through a controlled relaxation of validation rules. The feature is beta and enabled by default in v1.36, so cluster operators do not need to enable a feature gate explicitly.

Key technical aspects include:

Resource field mutability is allowed only when spec.suspend is true.
Changes apply to container-level resource requests and limits, including extended resources.
A controller or user updates the Job object with the new pod template resources, and the API server validates the change.
When the Job is resumed (suspend set to false), the new pod template is used to create Pods.
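To make the rule in the list above concrete, here is a simplified model of the check in Python. It is an illustration only and not the API server's validation code: the function names are hypothetical, Jobs are plain dictionaries, the suspended state is read from the existing (old) object as an assumption, and any other pod template fields that Kubernetes may allow to change on a suspended Job are deliberately ignored.

```python
import copy


def _template_without_resources(job):
    """Return a copy of the pod template with container resources removed."""
    template = copy.deepcopy(job["spec"]["template"])
    for container in template["spec"].get("containers", []):
        container.pop("resources", None)
    return template


def template_update_allowed(old_job, new_job):
    """Model of the relaxed rule: resources may change only while suspended.

    Assumption: the suspended state is taken from the old object.
    """
    if old_job["spec"]["template"] == new_job["spec"]["template"]:
        return True  # nothing changed in the template
    if not old_job["spec"].get("suspend", False):
        return False  # template changes are rejected for non-suspended Jobs
    # Suspended: accept only if the templates match once resources are ignored.
    return _template_without_resources(old_job) == _template_without_resources(new_job)


def example_job(suspend, gpus, image="example-registry.example.com/training:demo"):
    """Hypothetical minimal Job used only to exercise the model."""
    resources = {"requests": {"example-hardware-vendor.com/gpu": gpus}}
    container = {"name": "trainer", "image": image, "resources": resources}
    return {"spec": {"suspend": suspend,
                     "template": {"spec": {"containers": [container]}}}}
```

In this model, changing the GPU count on a suspended Job is accepted, while the same change on a running Job, or an image change on a suspended Job, is rejected.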

Use Cases for Mutable Resources

Queue Controllers: Kueue and similar controllers can adjust resources based on cluster availability and job priorities, reducing the need for preemption or job rejection.
CronJob Adaptability: A Job created by a CronJob can have its resource footprint reduced during periods of high cluster load, so that it still runs, albeit more slowly, rather than failing.
Cost Optimization: Administrators can delay resource-intensive Jobs until cheaper or more abundant compute becomes available, then adjust resources accordingly before resumption.
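As an illustration of the queue controller case, the sketch below shows one hypothetical sizing policy: scale CPU and memory in proportion to the number of GPUs that can actually be granted. The function name, the proportional rule, and the use of whole cores and whole Gi are all assumptions made for this example; real controllers such as Kueue apply their own admission logic.

```python
GPU_RESOURCE = "example-hardware-vendor.com/gpu"


def scale_to_grant(cpu_cores, memory_gi, gpus_requested, gpus_available):
    """Scale a request down in proportion to the GPUs that can be granted.

    Returns a resources mapping with Kubernetes-style quantity strings,
    or None when no GPU is available and the Job should stay suspended.
    """
    if gpus_requested <= 0:
        raise ValueError("gpus_requested must be positive")
    granted = min(gpus_requested, gpus_available)
    if granted <= 0:
        return None
    # Integer arithmetic keeps the quantities whole; never drop below 1.
    cpu = max(1, cpu_cores * granted // gpus_requested)
    memory = max(1, memory_gi * granted // gpus_requested)
    return {"cpu": str(cpu), "memory": f"{memory}Gi", GPU_RESOURCE: str(granted)}
```

With the training Job from the earlier example (8 CPUs, 32Gi, 4 GPUs) and only 2 GPUs available, this policy yields 4 CPUs, 16Gi, and 2 GPUs, matching the updated manifest shown earlier.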

Benefits and Limitations

This feature provides significant operational flexibility for batch and ML workloads. However, it comes with some important considerations:

Scope: Only Jobs with spec.suspend: true can have their pod template resources modified. The pod template resources of active, running Jobs remain immutable.
Metadata preservation: Unlike the delete-and-recreate approach, the Job’s labels, annotations, and status are retained.
Security: Only users or controllers that are permitted to update or patch the Job can modify its resources, so existing access controls continue to apply.

Getting Started

To use this feature, you need a Kubernetes cluster running v1.36 or later. The feature is enabled by default. You can suspend a Job by setting spec.suspend: true, update the pod template’s resources section, and then resume the Job. For queue controllers, integrate with the Kubernetes API to watch suspended Jobs and apply resource modifications programmatically.
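For tooling that needs to find candidate Jobs, the first step is to select the ones that are currently suspended. The sketch below filters a Job list of the shape returned by kubectl get jobs -o json (an object with an items array). The function name and the sample data are hypothetical, and a production controller would typically use a watch through a Kubernetes client library rather than parsing command output.

```python
import json


def suspended_job_names(job_list):
    """Return the names of Jobs in a list object whose spec.suspend is true."""
    names = []
    for job in job_list.get("items", []):
        if job.get("spec", {}).get("suspend") is True:
            names.append(job["metadata"]["name"])
    return names


# Hypothetical sample data, trimmed to the fields the filter reads.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "training-job-example-abcd123"},
     "spec": {"suspend": true}},
    {"metadata": {"name": "report-job"},
     "spec": {"suspend": false}},
    {"metadata": {"name": "cleanup-job"},
     "spec": {}}
  ]
}
""")

if __name__ == "__main__":
    # Print one suspended Job name per line.
    for name in suspended_job_names(SAMPLE):
        print(name)
```

Jobs without a spec.suspend field are treated as not suspended, which matches the field's default of false.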

For more details, refer to the official Kubernetes documentation on job suspension and resource management for containers.

Conclusion

Promoting mutable pod resources for suspended Jobs to beta in Kubernetes v1.36 is a meaningful step toward more resource-efficient batch processing. Because resources can be adjusted without deleting and recreating the Job, Kubernetes becomes better suited to dynamic, large-scale workloads. Features like this reflect the project’s continued work on flexible and adaptable scheduling mechanisms.