Kubernetes v1.36: Tackling Controller Staleness with Atomic FIFO and Enhanced Observability

Controllers in Kubernetes rely on local caches to react quickly to cluster changes. However, staleness—when a controller's cache is outdated—can lead to incorrect actions, missed updates, or delays. Starting with v1.36, Kubernetes introduces the AtomicFIFO feature in client-go, which helps keep caches consistent even when events arrive out of order. This improvement not only mitigates staleness but also provides better observability into controller behavior. Below, we answer common questions about these updates.

What is staleness in Kubernetes controllers?

Staleness occurs when a controller's internal cache does not reflect the current state of the cluster. Controllers maintain a local copy of objects they care about, updated by watching the Kubernetes API server via informers. The cache gives fast reads, but if it falls behind—for instance, after a controller restart or during API server downtime—the controller may operate on outdated information. This outdated view is referred to as staleness. It can cause subtle bugs that are hard to diagnose because the controller seems to behave correctly until an inconsistency leads to a wrong decision.

How does staleness affect controller behavior?

Staleness can manifest in three main ways: incorrect actions, such as deleting a resource that should exist; no action when a resource requires attention; and slow action, where the controller takes too long to respond. These issues often stem from assumptions made by the controller author—for example, assuming the cache is always up-to-date. In production, staleness is usually discovered only after it has already caused harm, making it a critical problem to address.
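To make the first two failure modes concrete, here is a minimal sketch of a toy reconciler. The names (Action, Decide) are hypothetical and do not come from client-go; the point is only that the decision logic cannot tell a fresh observation from a stale one.

```go
package main

// Action is the step a toy reconciler decides to take.
type Action string

const (
	ScaleUp   Action = "scale-up"
	ScaleDown Action = "scale-down"
	Nothing   Action = "nothing"
)

// Decide compares the desired replica count with the count the
// controller believes exists and picks an action. It has no way to
// know whether "observed" came from a fresh or a stale cache.
func Decide(desired, observed int) Action {
	switch {
	case observed < desired:
		return ScaleUp
	case observed > desired:
		return ScaleDown
	default:
		return Nothing
	}
}
```

Suppose the desired count is 3. If three replicas are actually running but a stale cache still reports 2, Decide(3, 2) returns ScaleUp, which is an incorrect action. If only two are running but the cache still reports 3, Decide(3, 3) returns Nothing, and the missing replica goes unnoticed.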

What causes a controller's cache to become stale?

Several scenarios lead to staleness:

Controller restart: After a restart, the controller must rebuild its cache by re-watching the API server. Until the initial list and watch complete, the cache is empty or incomplete.
API server outage: If the API server is temporarily unavailable, watch connections break and no updates reach the controller, allowing the cache to become outdated.
Out-of-order events: Even with a functioning watch, events may arrive in a non-sequential order (e.g., due to network issues or informer resyncs), causing the cache to reflect an inconsistent state.
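The out-of-order case can be illustrated with a deliberately naive cache that applies every event in arrival order. This is a sketch under stated assumptions, not how client-go stores objects: Event, NaiveCache, and Replay are hypothetical names, and the integer Version stands in for a real resource version purely for illustration.

```go
package main

// Event is a simplified watch event. An int version is used here
// only to make ordering visible; it is an assumption of this sketch.
type Event struct {
	Key     string
	Version int
	Phase   string
}

// NaiveCache keeps the last event seen per key, whatever its version.
type NaiveCache map[string]Event

// Apply overwrites the cached entry without checking the version.
func (c NaiveCache) Apply(e Event) {
	c[e.Key] = e
}

// Replay feeds events to a fresh cache in the given arrival order.
func Replay(events []Event) NaiveCache {
	c := NaiveCache{}
	for _, e := range events {
		c.Apply(e)
	}
	return c
}
```

If version 2 of "pod-a" (Running) arrives before version 1 (Pending), this cache ends up holding the older Pending state, so every read after that point is stale even though the watch never broke.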

What new features in Kubernetes v1.36 help mitigate staleness?

Kubernetes v1.36 introduces the AtomicFIFO feature (feature gate: AtomicFIFO) in the client-go library. This enhancement builds on the existing FIFO queue used by informers to process events. AtomicFIFO ensures that batches of operations—such as the initial list of objects used to populate a cache—are handled atomically. This prevents the queue from being in an inconsistent state when events arrive out of order, thereby reducing staleness. The feature is available to any client using client-go, and it has been adopted by highly contended controllers in kube-controller-manager.

How does atomic FIFO processing work in practice?

Before AtomicFIFO, events added to the informer queue were processed in the order they were received. If a batch of events (like the initial list) arrived out of sequence, the queue could end up with a stale view. With AtomicFIFO, the entire batch, whether it comes from a list operation or a set of watch events, is committed to the queue as one atomic unit. The queue therefore never exposes a half-applied batch and always holds a consistent snapshot of the cluster state. Even if events later arrive out of order, atomic handling keeps the cache from passing through an inconsistent intermediate state. The result is a more reliable controller that avoids the subtle bugs caused by cache inconsistency.
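The general idea of committing a batch as one unit can be sketched with a small in-memory store. This is not the AtomicFIFO implementation; Object, Store, and ReplaceAtomic are hypothetical names chosen for illustration, and the sketch only shows the "build aside, then swap under one lock" pattern.

```go
package main

import "sync"

// Object is a simplified cached object (not a client-go type).
type Object struct {
	Key     string
	Version int
}

// Store is a toy cache guarded by a single lock.
type Store struct {
	mu    sync.RWMutex
	items map[string]Object
}

// NewStore returns an empty Store.
func NewStore() *Store {
	return &Store{items: map[string]Object{}}
}

// ReplaceAtomic builds the next snapshot off to the side and swaps it
// in under one lock acquisition, so readers observe either the old
// contents or the new contents, never a partially applied batch.
func (s *Store) ReplaceAtomic(batch []Object) {
	next := make(map[string]Object, len(batch))
	for _, o := range batch {
		next[o.Key] = o
	}
	s.mu.Lock()
	s.items = next
	s.mu.Unlock()
}

// Get returns the cached object for key, if present.
func (s *Store) Get(key string) (Object, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	o, ok := s.items[key]
	return o, ok
}

// Len returns the number of cached objects.
func (s *Store) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.items)
}
```

Applying the same batch one object at a time would let a concurrent reader see a mix of old and new entries. Swapping the whole map in one step removes that window.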

How does this improvement enhance observability into controllers?

AtomicFIFO also improves observability. With the new feature, clients using client-go can inspect the cache to determine the latest resource version that has been successfully processed. This allows operators and developers to monitor whether the controller's cache is up-to-date or lagging. By exposing the current resource version of the queue, teams can set alerts for staleness, track how quickly the controller catches up after restarts, and debug issues more effectively. This increased visibility helps prevent the “it seemed fine until it wasn’t” scenario that often plagues controller-based systems.
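As a rough sketch of the alerting idea, the helper below compares the last processed resource version against the latest one known from elsewhere (for example, a direct API read). It assumes both values have already been obtained as strings and that they parse as decimal integers, which is a simplification made for this example. ParseVersion, Lag, and ShouldAlert are hypothetical helpers, not client-go functions, and the numeric difference is only a rough distance, not a count of missed events.

```go
package main

import (
	"fmt"
	"strconv"
)

// ParseVersion converts a resource version string to a number.
// Assumption of this sketch: the version is a decimal integer.
func ParseVersion(rv string) (uint64, error) {
	v, err := strconv.ParseUint(rv, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("resource version %q is not numeric: %w", rv, err)
	}
	return v, nil
}

// Lag reports how far the processed version trails the latest known
// version. It returns zero when the cache is caught up or ahead.
func Lag(latest, processed uint64) uint64 {
	if processed >= latest {
		return 0
	}
	return latest - processed
}

// ShouldAlert reports whether the lag exceeds a chosen threshold.
// The threshold is a hypothetical tuning knob for this example.
func ShouldAlert(latest, processed, threshold uint64) bool {
	return Lag(latest, processed) > threshold
}
```

A monitoring loop could call ShouldAlert periodically and export the Lag value as a gauge, which also shows how quickly the controller catches up after a restart.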

Which controllers benefit most from these changes?

While any controller using client-go can benefit, the v1.36 improvements are especially impactful for highly contended controllers in kube-controller-manager. These controllers handle frequent updates to many objects (e.g., endpoints, replicasets) and are more prone to event ordering issues. By adopting AtomicFIFO, they maintain a consistent cache even under heavy load or during informer resyncs. This leads to fewer incorrect actions, faster reconciliation, and reduced operational risk. The changes are backward-compatible and require no code changes from controller authors beyond enabling the feature gate.