**Abstract:**
This document comprehensively explores the paging mechanism in the Linux kernel, covering core concepts including memory models, node/zone partitioning, page allocation strategies, kernel page tables, and page swapping. Practical optimization techniques in high-performance computing environments (exemplified by Hong Kong servers) are analyzed. Structured hierarchically for clarity, this guide aims to deepen understanding of Linux memory management principles.
---
### I. Introduction
Memory management is a core OS function: it ensures processes use memory efficiently while preserving system stability and performance. In high-performance computing (HPC) server environments, where handling massive concurrent requests, running multiple VMs/containers, and supporting dense data centers are critical, Linux's robust memory management is indispensable. The paging mechanism is the foundation of Linux's virtual memory management.
**This document covers:**
1. **Memory Models**: Characteristics and applications of Flat, SMP, and NUMA.
2. **Memory Hierarchy**: Organization of nodes, zones, and pages.
3. **Page Allocation**: Implementation of Buddy System and Slub Allocator.
4. **Kernel Page Tables**: Initialization and virtual-to-physical address mapping.
5. **Page Swapping**: Passive reclamation and active management strategies.
Through these topics, readers will understand the paging mechanism’s real-world value in HPC environments.
---
### II. Memory Models Overview
Linux supports multiple memory models to accommodate diverse hardware architectures and performance needs:
#### A. Flat Memory Model
– **Description**: Uses a contiguous physical address space with linear page-to-address mapping.
– **Advantages**: Simple structure; ideal for legacy hardware/small systems.
– **Limitations**: Poor scalability under high memory/process demands; unsuitable for modern multiprocessor systems.
– **Use Cases**: Resource-constrained embedded systems or early computers.
#### B. Symmetric Multiprocessing (SMP)
– **Description**: Multiple processors share a unified memory space under a single OS scheduler.
– **Characteristics**:
– Optimizes resource usage via load balancing.
– Memory contention may occur due to shared access.
– **Use Cases**: Multi-core servers for concurrent processing (e.g., virtualization, multi-threaded apps).
#### C. Non-Uniform Memory Access (NUMA)
– **Description**: Each CPU has local memory; remote memory access incurs higher latency. CPU + local memory = NUMA node.
– **Characteristics**:
– Optimizes local access for performance.
– Cross-node access increases latency.
– **Use Cases**: Servers running HPC tasks (e.g., databases, distributed computing).
– **Challenge**: Intelligent allocation required when local memory is exhausted.
**Summary**: NUMA is pivotal in servers for minimizing latency and boosting throughput in multiprocessor environments; a minimal NUMA-aware allocation sketch follows.
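To make the node-local idea concrete, here is a minimal user-space sketch using libnuma (link with `-lnuma`); the node number and buffer size are illustrative:
```c
/* Sketch: node-local allocation with libnuma (compile with -lnuma). */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {              /* kernel/hardware without NUMA */
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int node = 0;                            /* illustrative node number */
    size_t sz = 16 * 4096;                   /* illustrative size: 16 pages */
    void *buf = numa_alloc_onnode(sz, node); /* memory backed by that node */
    if (buf == NULL)
        return 1;
    memset(buf, 0, sz);                      /* touch pages to fault them in */
    numa_free(buf, sz);
    return 0;
}
```
A thread pinned to the same node then pays only local-access latency when using `buf`.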
---
### III. Memory Hierarchy
Linux partitions physical memory into three layers for efficient management:
#### A. Node
– **Description**: In NUMA, each node corresponds to a CPU and its local memory.
– **Structure**: Defined by `pglist_data` struct:
```c
/* Abridged from <linux/mmzone.h>; only the fields discussed here. */
typedef struct pglist_data {
    int node_id;                            /* unique node identifier */
    struct zone node_zones[MAX_NR_ZONES];   /* zones within the node */
    struct zonelist node_zonelists[MAX_ZONELISTS]; /* fallback nodes/zones */
    struct page *node_mem_map;              /* array of this node's page frames */
    unsigned long node_start_pfn;           /* starting page frame number */
    unsigned long node_present_pages;       /* physical pages actually present */
    unsigned long node_spanned_pages;       /* total range, including holes */
    wait_queue_head_t kswapd_wait;          /* kswapd sleeps on this queue */
    struct task_struct *kswapd;             /* page-reclaim daemon for this node */
} pg_data_t;
```
– **Server Application**: Optimize inter-node allocation to reduce remote-access latency (e.g., NUMA-aware scheduling for high-concurrency apps); see the node-walk sketch below.
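As a sketch of how kernel code reaches this per-node data, assuming a kernel-module context, `for_each_online_node()` and `NODE_DATA()` are the standard accessors:
```c
/* Sketch: walk online NUMA nodes and print their page ranges (kernel context). */
#include <linux/mmzone.h>
#include <linux/nodemask.h>
#include <linux/printk.h>

static void dump_nodes(void)
{
    int nid;

    for_each_online_node(nid) {
        pg_data_t *pgdat = NODE_DATA(nid);

        pr_info("node %d: start_pfn=%lu present=%lu spanned=%lu\n",
                pgdat->node_id, pgdat->node_start_pfn,
                pgdat->node_present_pages, pgdat->node_spanned_pages);
    }
}
```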
#### B. Zone
– **Description**: Nodes are subdivided into zones (stored in `node_zones`). `node_zonelists` stores fallback zones.
– **Zone Types**:
– `ZONE_DMA`: For hardware DMA transfers.
– `ZONE_NORMAL`: Directly mapped kernel space (≤896MB in 32-bit systems).
– `ZONE_HIGHMEM`: High-memory area beyond the direct map (not needed on 64-bit systems, where all physical memory can be mapped directly).
– `ZONE_MOVABLE`: Reduces fragmentation by grouping movable pages.
– **Structure**: Managed via `zone` struct:
```c
/* Abridged from <linux/mmzone.h>; only the fields discussed here. */
struct zone {
    unsigned long zone_start_pfn;   /* first page frame number in the zone */
    unsigned long spanned_pages;    /* total range, including holes */
    unsigned long present_pages;    /* physical pages actually present */
    unsigned long managed_pages;    /* pages managed by the buddy system */
    struct per_cpu_pageset __percpu *pageset; /* per-CPU hot (cache-resident)
                                                 and cold page lists */
};
```
– **Server Application**: Zone configuration optimizes allocation for critical apps (e.g., DBs, web servers); a zone-walk sketch follows.
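Continuing the previous sketch, the zones of one node can be enumerated through `node_zones`; `populated_zone()` skips entries with no present pages (field names as in recent kernels):
```c
/* Sketch: enumerate the populated zones of one node (kernel context). */
#include <linux/mmzone.h>
#include <linux/printk.h>

static void dump_zones(pg_data_t *pgdat)
{
    int i;

    for (i = 0; i < MAX_NR_ZONES; i++) {
        struct zone *z = &pgdat->node_zones[i];

        if (!populated_zone(z))    /* zone exists but holds no pages */
            continue;
        pr_info("zone %-10s start_pfn=%lu present=%lu\n",
                z->name, z->zone_start_pfn, z->present_pages);
    }
}
```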
#### C. Page
– **Description**: Smallest unit; described by `page` struct.
– **Use Cases**:
– **Full-page usage**:
– *Anonymous pages*: Mapped into a process's address space with no backing file.
– *File-mapped pages*: Cache file contents for filesystem-backed mappings.
– **Slub allocation**: Splits pages for small objects.
– **Compound pages**: Aggregates pages for large allocations.
– **Structure**:
```c
/* Abridged from <linux/mm_types.h>; many fields overlap in unions. */
struct page {
    struct address_space *mapping; /* mapping info (LSB set for anonymous pages) */
    pgoff_t index;                 /* offset within the mapping */
    atomic_t _mapcount;            /* number of page-table references */
    struct list_head lru;          /* list used by page reclaim/swapping */
    void *freelist;                /* SLUB: first free object in the page */
    void *s_mem;                   /* SLAB: address of the first object */
};
```
– **Server Impact**: Page management efficiency affects performance during frequent small allocations (e.g., process creation); converting between a page's descriptor, PFN, and virtual address is sketched below.
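A short kernel-context sketch of moving between the three views of one frame: its `struct page` descriptor, its page frame number (PFN), and, for direct-mapped memory, its kernel virtual address:
```c
/* Sketch: struct page <-> PFN <-> virtual address for direct-mapped memory. */
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/printk.h>

static void page_views(void)
{
    void *buf = kmalloc(PAGE_SIZE, GFP_KERNEL); /* direct-mapped allocation */
    struct page *pg;
    unsigned long pfn;

    if (!buf)
        return;
    pg  = virt_to_page(buf);   /* descriptor of the backing frame */
    pfn = page_to_pfn(pg);     /* its page frame number */
    pr_info("buf=%p pfn=%lu back=%p\n", buf, pfn, page_address(pg));
    kfree(buf);
}
```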
---
### IV. Page Allocation Mechanisms
#### A. Buddy System
– **Description**: Allocates contiguous, page-granular blocks from per-order free lists.
– **Mechanism**:
– Free list *i* holds blocks of 2<sup>i</sup> contiguous pages (order *i*).
– A request for more than 2<sup>i-1</sup> and at most 2<sup>i</sup> pages is served with a 2<sup>i</sup>-page block; if that list is empty, a higher-order block is split and the unused "buddy" halves are returned to lower-order lists.
– **Implementation**: `alloc_pages(gfp_mask, order)`
– `gfp_mask`: Allocation flags, including zone modifiers (e.g., `GFP_KERNEL`, `__GFP_HIGHMEM`).
– `order`: Allocates 2<sup>order</sup> pages.
– **Optimization**: *Memory interleaving* spreads allocations across memory modules so accesses proceed in parallel, improving bandwidth.
– **Server Use**: Large-scale allocations (e.g., VMs, containers); a minimal allocation sketch follows.
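A minimal kernel-context sketch of a buddy allocation; order 2 requests 2<sup>2</sup> = 4 contiguous pages, and the block must be freed with the same order:
```c
/* Sketch: allocate and free an order-2 block (4 contiguous pages). */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/string.h>
#include <linux/errno.h>

static int buddy_demo(void)
{
    unsigned int order = 2;    /* 2^2 = 4 pages */
    struct page *pages = alloc_pages(GFP_KERNEL, order);

    if (!pages)
        return -ENOMEM;
    /* GFP_KERNEL memory is direct-mapped, so page_address() is valid */
    memset(page_address(pages), 0, PAGE_SIZE << order);
    __free_pages(pages, order);   /* must pass the same order back */
    return 0;
}
```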
#### B. Slub Allocator
– **Description**: Allocates small objects from pages obtained via Buddy System.
– **Mechanism**:
– Each `kmem_cache` serves one object type (e.g., `task_struct`).
– Uses per-CPU (`kmem_cache_cpu`) and per-node (`kmem_cache_node`) caches.
– **Allocation Paths**:
1. **Fast path**: Fetch from `kmem_cache_cpu->freelist`.
2. **Slow path**: Fetch from `kmem_cache_node->partial` or request new pages.
– **Functions**: `kmem_cache_alloc_node()`, `kmem_cache_free()`.
– **Server Impact**: Critical for high-concurrency small allocations (e.g., process creation); usage is sketched below.
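A hedged sketch of this API for a hypothetical object type (`struct conn` and the cache name are illustrative, not kernel objects):
```c
/* Sketch: a dedicated slab cache for one object type (kernel context). */
#include <linux/slab.h>
#include <linux/errno.h>

struct conn {                 /* hypothetical example object */
    int fd;
    unsigned long last_seen;
};

static struct kmem_cache *conn_cache;

static int conn_cache_init(void)
{
    conn_cache = kmem_cache_create("conn_cache", sizeof(struct conn),
                                   0, SLAB_HWCACHE_ALIGN, NULL);
    return conn_cache ? 0 : -ENOMEM;
}

static void conn_demo(void)
{
    /* Fast path: served from the per-CPU freelist when possible */
    struct conn *c = kmem_cache_alloc(conn_cache, GFP_KERNEL);

    if (c)
        kmem_cache_free(conn_cache, c);
    kmem_cache_destroy(conn_cache);   /* release the cache when done */
}
```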
---
### V. Kernel Page Tables
Core component for virtual-to-physical address translation.
#### A. Initialization
– **Process**: During boot via `setup_arch()` call chain.
– **Key Steps**:
1. Load `swapper_pg_dir` (top-level page directory).
2. Map virtual/physical addresses via `init_mem_mapping()→kernel_physical_mapping_init()`.
3. Load the physical address of the page directory into CR3; `__va`/`__pa` convert between direct-mapped virtual and physical addresses.
– **Server Impact**: Fast initialization ensures rapid service startup; the `__pa`/`__va` helpers are sketched below.
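A kernel-context sketch of the round trip: `kmalloc()` memory lives in the direct map, so `__pa()` and `__va()` are simple offset conversions:
```c
/* Sketch: __pa()/__va() round trip on direct-mapped kernel memory. */
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/bug.h>

static void va_pa_demo(void)
{
    void *virt = kmalloc(64, GFP_KERNEL);   /* lives in the direct map */
    phys_addr_t phys;

    if (!virt)
        return;
    phys = __pa(virt);                 /* virtual -> physical */
    WARN_ON(__va(phys) != virt);       /* physical -> virtual must round-trip */
    kfree(virt);
}
```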
#### B. Mapping
– **Description**: Multi-level tables (`PGD→PUD→PMD→PTE`) handle translations.
– **Key Structures**:
– `swapper_pg_dir`: Top-level directory.
– `level4_ident_pgt`: Direct-mapped region.
– `level4_kernel_pgt`: Kernel code region.
– **Features**:
– Direct-mapped zone (`ZONE_NORMAL`) uses offset-based fast conversion.
– High-memory zone (`ZONE_HIGHMEM`) uses specialized mapping.
– **Server Impact**: Efficiency affects memory access speed in large-memory/HPC scenarios; a simplified table walk is sketched below.
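A simplified, hedged sketch of a software walk through these levels, assuming a recent kernel (which adds a P4D level between PGD and PUD); huge-page and error handling are omitted:
```c
/* Sketch: resolve the PTE for a kernel virtual address (no huge pages). */
#include <linux/mm.h>
#include <linux/pgtable.h>

static pte_t *walk_kernel_pte(unsigned long addr)
{
    pgd_t *pgd = pgd_offset_k(addr);   /* top level, rooted at swapper_pg_dir */
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;

    if (pgd_none(*pgd))
        return NULL;
    p4d = p4d_offset(pgd, addr);
    if (p4d_none(*p4d))
        return NULL;
    pud = pud_offset(p4d, addr);
    if (pud_none(*pud))
        return NULL;
    pmd = pmd_offset(pud, addr);
    if (pmd_none(*pmd))
        return NULL;
    return pte_offset_kernel(pmd, addr);  /* final level: the PTE itself */
}
```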
---
### VI. Page Swapping
Frees up physical memory when allocated virtual memory exceeds the physical memory available.
#### A. Passive Page Reclaim
– **Trigger**: `get_page_from_freelist()` cannot satisfy an allocation because free physical memory is exhausted.
– **Mechanism**:
– Scan LRU (Least Recently Used) lists for inactive pages.
– *Anonymous pages*: Write to swap space.
– *File-mapped pages*: Sync modifications to disk.
– **Implementation**: `node_reclaim()→__node_reclaim()→shrink_node()`.
– **Server Role**: Last line of defense against OOM (Out-Of-Memory) in high load.
#### B. Active Page Management
– **Description**: Kernel thread `kswapd` scans memory based on watermarks.
– **Watermarks**:
– `pages_min`: Minimum threshold (kernel-only allocations).
– `pages_low`: Reclaim triggered (pages_low = pages_min × 5/4).
– `pages_high`: Low pressure (pages_high = pages_min × 3/2).
– **Implementation**: `balance_pgdat()→kswapd_shrink_node()→shrink_node()`, prioritizing inactive pages via `shrink_node_memcg()`.
– **Server Role**: Maintains memory health for sustained uptime; the watermark ratios are worked through below.
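A worked example of the ratios above, with an illustrative `pages_min` of 1024 pages (4 MiB at 4 KiB per page); real per-zone values appear in `/proc/zoneinfo`:
```c
/* Worked example: the watermark ratios (user space, illustrative values). */
#include <stdio.h>

int main(void)
{
    unsigned long pages_min  = 1024;               /* assumed example value */
    unsigned long pages_low  = pages_min * 5 / 4;  /* 1280: kswapd wakes up */
    unsigned long pages_high = pages_min * 3 / 2;  /* 1536: kswapd sleeps again */

    printf("min=%lu low=%lu high=%lu (pages)\n",
           pages_min, pages_low, pages_high);
    return 0;
}
```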
---
### VII. Conclusion
Linux’s paging mechanism underpins its memory management system, integrating:
– Flexible **memory models** (Flat, SMP, NUMA) for hardware adaptability.
– Hierarchical **node/zone/page** organization for efficiency.
– **Buddy System** and **Slub Allocator** for versatile allocation.
– **Kernel page tables** and **swapping** for virtual memory optimization.
In HPC environments like Hong Kong servers, these mechanisms ensure high memory utilization and reliability for dense data centers. Mastering these principles empowers sysadmins and developers to optimize performance for modern applications.