Unlocking 64KB Pages on 4KB Kernels: Two Innovative Approaches

At the 2026 Linux Storage, Filesystem, Memory Management, and BPF Summit, memory-management developers explored two distinct strategies for letting user-space processes benefit from 64KB base pages while the kernel itself continues using smaller 4KB pages. This capability can boost performance for memory-intensive workloads by reducing TLB pressure and improving cache locality, without requiring the kernel to change its own page size. Below, we break down the key questions and answers that emerged from these discussions.

Why would anyone want 64KB pages when the kernel uses 4KB?

Using larger base pages reduces the number of Translation Lookaside Buffer (TLB) entries needed to map a given range of virtual memory. This decreases TLB misses and can substantially speed up memory access, especially for applications with large working sets. However, larger pages also increase internal fragmentation and memory waste. The challenge is to let specific processes opt into 64KB pages without forcing the entire kernel to switch, which would break compatibility and increase memory overhead significantly. The two sessions proposed different ways to achieve this selectivity.
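
As a rough illustration of the TLB arithmetic (assuming, for simplicity, one TLB entry per mapped page and no huge pages):

    1 GiB working set / 4 KiB pages  = 262,144 distinct translations
    1 GiB working set / 64 KiB pages =  16,384 distinct translations

That is a sixteen-fold reduction in the number of translations competing for TLB slots, which is where the speedup for large working sets comes from.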

What was the first approach discussed at the summit?

The first session focused on making the base page size a per-process property. In this model, each process would be allowed to choose its own page size at creation time or via a system call. Since every process already has its own page-table tree, the kernel could manage a mixture of 4KB and 64KB page-table entries on the same system, with each tree using one size throughout. This approach requires careful handling of shared memory and mmap() regions, but it gives applications maximum flexibility. For example, a database server could request 64KB pages for its buffer pool while a lightweight shell process continues with 4KB pages.
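
No such interface exists today, so the following user-space sketch is purely illustrative: it imagines a prctl()-style request of the kind the session described, with PR_SET_PAGE_SIZE as a made-up option name.

    /* Hypothetical: PR_SET_PAGE_SIZE is not a real prctl() option. */
    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_PAGE_SIZE
    #define PR_SET_PAGE_SIZE 0x5000   /* made-up value for illustration */
    #endif

    int main(void)
    {
        /* Ask the kernel to use 64KB base pages for this process. */
        if (prctl(PR_SET_PAGE_SIZE, 64 * 1024, 0, 0, 0) != 0)
            perror("prctl(PR_SET_PAGE_SIZE)");  /* EINVAL on current kernels */
        return 0;
    }

On a real kernel this call would fail with EINVAL; it is shown only to make the proposed programming model concrete.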

How would the per-process page size be implemented?

Implementation would extend the mm_struct (memory descriptor) with a field indicating the preferred base page size. When the kernel allocates pages for a process, it would select the appropriate size from the buddy allocator or a separate pool. Page-table entries would need to reflect the chosen size, meaning the same virtual address might be mapped with 4KB PTEs for one process and 64KB PTEs for another. Challenges include handling fork(), exec(), and system calls like mremap() or mprotect() that operate on memory regions. The solution likely involves lazy conversion or splitting of large pages when a child process inherits a different page size.
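
A minimal kernel-side sketch of the idea, assuming a hypothetical base_page_shift field in mm_struct (no such field exists upstream; the names here are invented for illustration):

    /* Sketch only: base_page_shift and alloc_base_folio() are invented
     * names, not taken from any posted patch set. */
    #include <linux/gfp.h>
    #include <linux/mm.h>

    static struct folio *alloc_base_folio(struct mm_struct *mm, gfp_t gfp)
    {
            /* base_page_shift would be 12 for 4KB pages, 16 for 64KB. */
            unsigned int order = mm->base_page_shift - PAGE_SHIFT;

            /* Order 0 is a single 4KB page; order 4 is a 64KB block. */
            return folio_alloc(gfp, order);
    }

Fault handling, reclaim, and the fork()/exec() paths would all have to consult the same field so that a process's page tables are populated consistently.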

What was the second approach, and how does it differ?

The second session presented a method specifically for bringing 64KB base pages to x86 systems, whose hardware page tables offer only 4KB leaf entries plus 2MB and 1GB huge pages. Instead of per-process granularity, this approach would let all of user space use 64KB pages transparently by running the kernel with 4KB pages and user space with 64KB pages at the same time. Since the hardware format has no 64KB entry, the kernel would have to change how it constructs user page tables and carefully handle transitions between kernel and user mode. Unlike the first approach, this one doesn't allow mixed page sizes across processes; every user-space process would automatically see 64KB pages.
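
One conceivable mechanism, sketched under the assumption that 64KB pages are emulated in software since x86 hardware has no 64KB leaf entry (the function and names below are invented, not from a posted patch):

    /* Sketch only: emulate a 64KB user page by installing sixteen
     * physically contiguous 4KB PTEs in a single fault. */
    #include <linux/mm.h>
    #include <linux/pgtable.h>
    #include <linux/sizes.h>

    #define EMULATED_64K_PTES (SZ_64K / PAGE_SIZE)   /* 16 on x86 */

    static void map_emulated_64k(struct vm_area_struct *vma,
                                 unsigned long addr, pte_t *ptep,
                                 struct page *first_page)
    {
            unsigned long i;

            for (i = 0; i < EMULATED_64K_PTES; i++) {
                    pte_t pte = mk_pte(first_page + i, vma->vm_page_prot);

                    set_pte_at(vma->vm_mm, addr + i * PAGE_SIZE,
                               ptep + i, pte);
            }
    }

Whatever the exact mechanism, the kernel would have to fault, unmap, and reclaim such sixteen-entry groups as a unit so that user space never observes anything smaller than 64KB.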

What are the main advantages of each approach?

  • Per-process approach: Offers fine-grained control, so applications that don't benefit from larger pages aren't forced to use them. It avoids global fragmentation and can be adopted incrementally.
  • x86-wide approach: Simpler from the user's perspective: no application changes are needed, since portable programs already query the page size at run time (a probe illustrating this appears below). It could deliver performance gains to all processes immediately, as long as the kernel manages the coexistence of page sizes without breaking system calls or device drivers.

Both methods aim to reduce TLB misses, but they target different trade-offs between flexibility and transparency.
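
The transparency claim rests on the fact that well-behaved programs already ask the kernel for the page size rather than hardcoding 4096. A small runnable probe, using only standard POSIX calls:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* On a 64KB-page user space this would print 65536, and the
         * mapping below would simply come back 64KB-aligned. */
        long page = sysconf(_SC_PAGESIZE);
        printf("base page size: %ld bytes\n", page);

        void *buf = mmap(NULL, 16 * page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;
        memset(buf, 0, 16 * page);      /* touch the whole mapping */
        munmap(buf, 16 * page);
        return 0;
    }

Programs that do hardcode a 4KB page size (some allocators, JITs, and file formats) are the main compatibility risk.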

What challenges were identified for implementation?

Key challenges include:

  • Kernel-side changes to support mixed page sizes in page reclaim, compaction, and memory-pressure handling.
  • Compatibility with shared libraries and kernel modules that expect 4KB pages.
  • TLB-shootdown overhead when switching page sizes across CPUs.
  • For the x86 approach, changes to how the kernel builds user page tables, along with possible interactions with CPU errata.

Both approaches require extensive testing to ensure no regressions on systems that continue to use 4KB pages.

Are there any real-world examples or ongoing projects?

While the summit sessions were forward-looking, similar ideas have already shipped on other architectures: ARM64 supports multiple page sizes (4KB, 16KB, and 64KB), and Apple's M1/M2 chips run with 16KB pages. The x86 community has long discussed adding a mid-level page size (e.g., 64KB) to improve performance for cloud workloads. The two approaches presented could lead to patches for the Linux kernel, though no timeline was announced. The discussion highlights growing interest in flexible page sizes as memory-hungry AI and database applications become more prevalent.

Where can I learn more about the technical details?

You can follow the LWN coverage of the summit for full session summaries, or check the kernel mailing list archives for the memory-management track (MM sessions). The questions above provide a starting point for understanding the key concepts. For deeper reading, see the Linux kernel documentation on page tables and the ARM64 kernel's approach to supporting multiple page sizes.
