The Linux filesystem is a cornerstone of the operating system, providing a structured way to store, organize, and access data on disk. This article dives deep into the EXT4 filesystem, its structure, and how the Linux kernel manages file operations, with a focus on technical accuracy and practical insights for IT professionals and developers. Whether you’re optimizing storage or troubleshooting performance, this guide offers a clear and detailed overview.
Key Features of Linux Filesystems
Linux filesystems are designed to balance performance, reliability, and flexibility. The core characteristics include:
- Block-Based Storage: Files are stored in fixed-size blocks (typically 4KB), allowing efficient use of disk space.
- Inode Indexing: Each file or directory is associated with an inode, a data structure that tracks metadata and block locations.
- Caching Layer: Frequently accessed files are cached in memory to reduce disk I/O and improve performance.
- Hierarchical Organization: Files are organized in directories, enabling intuitive management and navigation.
- Process Tracking: The kernel keeps a file descriptor table for each process, along with its open-file structures, so it knows which processes have which files open.
These features ensure that Linux filesystems are robust and scalable for diverse workloads.
EXT4 Filesystem Structure
The EXT4 filesystem, an evolution of EXT2 and EXT3, is widely used in Linux distributions due to its performance and reliability. Below, we explore its key components: inodes, blocks, extents, and bitmaps.
Inodes and Blocks
In EXT4, the disk is divided into fixed-size blocks (default 4KB, configurable during formatting). Files are split into these blocks for storage, allowing flexible allocation without requiring contiguous space. Each file or directory is represented by an inode, a data structure that stores metadata such as:
- File permissions (i_mode)
- Owner and group IDs (i_uid, i_gid)
- File size (i_size_lo, i_size_high)
- Timestamps (i_atime for access, i_mtime for modification, i_ctime for inode changes, i_dtime for deletion)
- Block pointers (i_block)
The inode structure is defined as follows:
struct ext4_inode {
    __le16 i_mode;          /* File mode (type and permissions) */
    __le16 i_uid;           /* Low 16 bits of owner UID */
    __le32 i_size_lo;       /* Low 32 bits of size in bytes */
    __le32 i_atime;         /* Access time */
    __le32 i_ctime;         /* Inode change time */
    __le32 i_mtime;         /* Modification time */
    __le32 i_dtime;         /* Deletion time */
    __le16 i_gid;           /* Low 16 bits of group ID */
    __le16 i_links_count;   /* Links count */
    __le32 i_blocks_lo;     /* Blocks count */
    __le32 i_flags;         /* File flags */
    /* OS-dependent field (osd1) omitted here */
    __le32 i_block[15];     /* Block pointers, or the extent tree root */
    __le32 i_generation;    /* File version (for NFS) */
    __le32 i_file_acl_lo;   /* File ACL */
    __le32 i_size_high;     /* High 32 bits of file size */
    ...
};
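As a minimal sketch of how these fields are interpreted, the snippet below combines the two size halves and decodes i_mode. The struct demo_inode and the demo_* helpers are hypothetical names introduced for illustration, and the fields are assumed to be already converted from on-disk little-endian to host byte order.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down view of an inode: only the fields used here,
 * assumed to be already converted to host byte order. */
struct demo_inode {
    uint16_t i_mode;
    uint32_t i_size_lo;
    uint32_t i_size_high;
};

/* Combine the two 32-bit halves into the full 64-bit file size. */
uint64_t demo_inode_size(const struct demo_inode *ino)
{
    return ((uint64_t)ino->i_size_high << 32) | ino->i_size_lo;
}

/* The top four bits of i_mode hold the file type; 0x4000 means directory. */
int demo_inode_is_dir(const struct demo_inode *ino)
{
    return (ino->i_mode & 0xF000) == 0x4000;
}

/* The low 12 bits hold rwx permissions plus setuid, setgid, and sticky. */
unsigned demo_inode_perms(const struct demo_inode *ino)
{
    return ino->i_mode & 0x0FFFu;
}
```

For example, an i_mode of 0x41ED decodes to a directory with permissions 0755.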
In EXT2/EXT3, the i_block array directly references data blocks for small files (up to 12 blocks). For larger files, it uses indirect blocks:
- Direct Blocks: i_block[0-11] point directly to data blocks.
- Single Indirect: i_block[12] points to a block containing data block addresses.
- Double Indirect: i_block[13] points to a block of indirect block addresses.
- Triple Indirect: i_block[14] extends this further for very large files.
This approach, while flexible, can slow down access for large files due to multiple disk reads.
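To make the cost concrete, the following sketch computes which level of indirection a given logical block falls under. The function name indirection_level is an assumption made for this example, and it assumes 32-bit block addresses, so a 4KB block holds 1024 pointers.

```c
#include <stdint.h>

#define DIRECT_BLOCKS 12

/* Returns 0 for a direct block, 1/2/3 for single/double/triple indirect,
 * or -1 if the logical block lies beyond what the scheme can address. */
int indirection_level(uint64_t logical_block, uint32_t block_size)
{
    uint64_t ptrs = block_size / 4;   /* 32-bit block numbers per block */
    uint64_t limit = DIRECT_BLOCKS;

    if (logical_block < limit)
        return 0;
    limit += ptrs;
    if (logical_block < limit)
        return 1;
    limit += ptrs * ptrs;
    if (logical_block < limit)
        return 2;
    limit += ptrs * ptrs * ptrs;
    if (logical_block < limit)
        return 3;
    return -1;
}
```

The returned level is also the number of extra block reads needed before the data block itself can be fetched, which is why the last blocks of a very large file are the most expensive to reach.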
Extents in EXT4
To address the limitations of indirect blocks, EXT4 introduced extents, which allow contiguous block allocation for improved performance and reduced fragmentation. An extent is a range of contiguous blocks, and EXT4 organizes extents in a tree structure. Every node of the tree begins with an ext4_extent_header:
struct ext4_extent_header {
    __le16 eh_magic;        /* Format identifier */
    __le16 eh_entries;      /* Number of valid entries */
    __le16 eh_max;          /* Maximum entries the node can hold */
    __le16 eh_depth;        /* Tree depth */
    __le32 eh_generation;   /* Tree generation */
};
- Leaf Nodes (eh_depth = 0): Contain ext4_extent structures pointing to contiguous data blocks.
- Index Nodes (eh_depth > 0): Contain ext4_extent_idx structures pointing to lower-level nodes (leaf or index).
struct ext4_extent {
    __le32 ee_block;        /* First logical block */
    __le16 ee_len;          /* Number of blocks */
    __le16 ee_start_hi;     /* High 16 bits of physical block */
    __le32 ee_start_lo;     /* Low 32 bits of physical block */
};
struct ext4_extent_idx {
    __le32 ei_block;        /* Logical block covered */
    __le32 ei_leaf_lo;      /* Pointer to next-level block */
    __le16 ei_leaf_hi;      /* High 16 bits of next-level block */
    __u16 ei_unused;
};
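A leaf node can be searched roughly as follows. This is a simplified sketch, not the kernel's lookup code: struct demo_extent and the demo_* functions are hypothetical, the fields are assumed to be in host byte order, and all extents are assumed to be initialized and sorted by ee_block.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-byte-order copy of an on-disk extent. */
struct demo_extent {
    uint32_t ee_block;      /* first logical block */
    uint16_t ee_len;        /* number of blocks */
    uint16_t ee_start_hi;   /* high 16 bits of physical block */
    uint32_t ee_start_lo;   /* low 32 bits of physical block */
};

/* Join the split 48-bit physical start block. */
uint64_t demo_extent_start(const struct demo_extent *ex)
{
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start_lo;
}

/* Map a logical block to a physical block with a binary search over the
 * sorted extents. Returns 0 (used here as a "hole" sentinel) if no extent
 * covers the block. */
uint64_t demo_map_block(const struct demo_extent *ext, size_t count,
                        uint32_t logical)
{
    size_t lo = 0, hi = count;

    /* Find the first extent that starts after the requested block. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ext[mid].ee_block <= logical)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return 0;

    /* The candidate is the extent just before that position. */
    const struct demo_extent *ex = &ext[lo - 1];
    if (logical - ex->ee_block >= ex->ee_len)
        return 0;
    return demo_extent_start(ex) + (logical - ex->ee_block);
}
```

One lookup resolves any block inside an extent with a single comparison and an addition, whereas the indirect scheme needs one pointer per block.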
For small files, the inode’s 60-byte i_block can store an ext4_extent_header and up to four ext4_extent entries. Because one extent covers at most 32,768 blocks (128MB with 4KB blocks), these four entries map up to 512MB of data. For larger files, the extent tree grows: i_block then holds index entries that point to separate blocks, and a single 4KB leaf block holds up to 340 extents ((4096 - 12) / 12), mapping up to 42.5GB per leaf.
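The figures above can be checked with a few lines of arithmetic. The macro and function names are illustrative only; the sketch assumes 12-byte headers and entries and a maximum of 32,768 blocks per initialized extent.

```c
#include <stdint.h>

#define EXT_HEADER_SIZE 12u      /* bytes in an extent header */
#define EXT_ENTRY_SIZE  12u      /* bytes in one extent entry */
#define EXT_MAX_LEN     32768u   /* max blocks in one initialized extent */

/* Number of extent entries that fit in a node of node_bytes bytes. */
uint32_t extents_per_node(uint32_t node_bytes)
{
    return (node_bytes - EXT_HEADER_SIZE) / EXT_ENTRY_SIZE;
}

/* Maximum bytes a single leaf node of that size can map. */
uint64_t max_mapped_bytes(uint32_t node_bytes, uint32_t block_size)
{
    return (uint64_t)extents_per_node(node_bytes) * EXT_MAX_LEN * block_size;
}
```

With 4KB blocks, extents_per_node(60) gives 4 and extents_per_node(4096) gives 340, and max_mapped_bytes returns 512MB and 42.5GB respectively.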
Inode and Block Bitmaps
EXT4 uses bitmaps to track the allocation status of inodes and blocks:
- Inode Bitmap: A 4KB block where each bit indicates whether an inode is allocated (1) or free (0).
- Block Bitmap: Similarly tracks block allocation.
When creating a file (e.g., using open with O_CREAT), the kernel consults the inode bitmap to find a free inode. When data is later written out, it consults the block bitmap to allocate data blocks.
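The bitmap mechanics can be sketched as below. This shows only the bit scanning and marking; the real allocators also apply locality heuristics that are not modeled here. The demo_* names are hypothetical, and bit n is assumed to live in byte n / 8 at bit position n % 8.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first clear (free) bit, or -1 if all are set. */
long demo_find_first_zero(const uint8_t *bitmap, size_t nbits)
{
    for (size_t i = 0; i < nbits; i++) {
        if (!(bitmap[i / 8] & (1u << (i % 8))))
            return (long)i;
    }
    return -1;
}

/* Allocate one object: find a free bit, mark it used, return its index.
 * Returns -1 when the bitmap is full. */
long demo_bitmap_alloc(uint8_t *bitmap, size_t nbits)
{
    long bit = demo_find_first_zero(bitmap, nbits);

    if (bit >= 0)
        bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
    return bit;
}
```

Freeing an inode or block is the reverse operation: clear the corresponding bit so a later scan can find it again.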
Filesystem Layout
The EXT4 filesystem is organized into block groups to manage large disks efficiently. A block group can contain:
- Superblock: Stores global filesystem metadata (e.g., total inodes, blocks, and per-group counts). The primary copy lives in group 0; backups live in some other groups.
- Block Group Descriptor Table: Lists metadata for each block group, including inode and block bitmap locations.
- Block Bitmap and Inode Bitmap: Track which blocks and inodes in the group are in use.
- Inode Table: Stores inode structures.
- Data Blocks: Hold file data.
The ext4_group_desc structure defines block group metadata:
struct ext4_group_desc {
    __le32 bg_block_bitmap_lo;   /* Block bitmap location */
    __le32 bg_inode_bitmap_lo;   /* Inode bitmap location */
    __le32 bg_inode_table_lo;    /* Inode table location */
    ...
};
To prevent data loss, copies of the superblock and group descriptor table are kept in multiple block groups. The sparse_super feature reduces this redundancy by storing backups only in block groups 0 and 1 and in groups whose number is a power of 3, 5, or 7 (3, 5, 7, 9, 25, 27, 49, and so on).
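The Meta Block Group feature (meta_bg) goes further by clustering block groups into sets whose descriptors fit in a single block (64 groups with 4KB blocks and 64-byte descriptors). Each set stores its own descriptors instead of relying on one table at the start of the filesystem, reducing space usage and improving scalability.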
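The sparse_super placement rule and the size of a block group can be expressed in a short sketch. The function names are hypothetical; the group-size helper assumes the block bitmap occupies exactly one block, so a group spans 8 × block_size blocks.

```c
#include <stdint.h>

/* True if n is base^k for some k >= 1. */
static int is_power_of(uint32_t n, uint32_t base)
{
    uint64_t p = base;

    while (p < n)
        p *= base;
    return p == n;
}

/* With sparse_super, backups live in groups 0 and 1 and in groups whose
 * number is a power of 3, 5, or 7. */
int group_has_super_backup(uint32_t group)
{
    if (group <= 1)
        return 1;
    return is_power_of(group, 3) || is_power_of(group, 5) ||
           is_power_of(group, 7);
}

/* One bitmap block has 8 * block_size bits, one per block in the group. */
uint64_t group_size_bytes(uint32_t block_size)
{
    return (uint64_t)8 * block_size * block_size;
}
```

With 4KB blocks, group_size_bytes returns 128MB, so a 1TB filesystem has about 8192 block groups but only a handful of superblock backups.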
Directory Storage
Directories in EXT4 are treated as files with inodes, but their data blocks store a list of ext4_dir_entry structures, each containing a filename and its corresponding inode number:
- “.”: Current directory.
- “..”: Parent directory.
- Other Entries: Filenames and inodes of files/subdirectories.
For large directories, setting the EXT4_INDEX_FL flag enables a hash-based index, organizing entries into a tree for faster lookups. The hash maps filenames to data blocks, with leaf nodes containing ext4_dir_entry lists.
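For a directory without a hash index, a lookup is a linear walk over the entries in each block. The sketch below assumes the common on-disk entry layout (32-bit inode, 16-bit record length, 8-bit name length, 8-bit file type, then the name), decodes little-endian fields by hand, and ignores the special record-length encoding used with very large block sizes. All demo_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Helper for building a test block: write one entry at byte offset off. */
void demo_dir_put(uint8_t *block, size_t off, uint32_t inode,
                  uint16_t rec_len, const char *name)
{
    uint8_t *de = block + off;
    size_t len = strlen(name);

    de[0] = (uint8_t)inode;
    de[1] = (uint8_t)(inode >> 8);
    de[2] = (uint8_t)(inode >> 16);
    de[3] = (uint8_t)(inode >> 24);
    de[4] = (uint8_t)rec_len;
    de[5] = (uint8_t)(rec_len >> 8);
    de[6] = (uint8_t)len;      /* name length */
    de[7] = 0;                 /* file type: unknown */
    memcpy(de + 8, name, len);
}

/* Walk the entries of one directory block; return the inode number for
 * name, or 0 if the name is not present. */
uint32_t demo_dir_lookup(const uint8_t *block, size_t block_size,
                         const char *name)
{
    size_t want = strlen(name);
    size_t off = 0;

    while (off + 8 <= block_size) {
        const uint8_t *de = block + off;
        uint32_t inode = le32(de);
        uint16_t rec_len = le16(de + 4);
        uint8_t name_len = de[6];

        if (rec_len < 8 || off + rec_len > block_size)
            break;             /* corrupt entry: stop walking */
        if (inode != 0 && name_len == want &&
            8u + name_len <= rec_len && memcmp(de + 8, name, want) == 0)
            return inode;
        off += rec_len;        /* rec_len jumps to the next entry */
    }
    return 0;
}
```

Because each entry records its own length, the last entry in a block can stretch to the end of the block, and deleted entries can be absorbed by enlarging the preceding record. The linear cost of this walk is what the hash index avoids for large directories.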
File Caching in Linux
The Linux kernel optimizes file I/O through caching, implemented in the EXT4 file operations (ext4_file_operations):
const struct file_operations ext4_file_operations = {
    .read_iter = ext4_file_read_iter,
    .write_iter = ext4_file_write_iter,
    ...
};
Cached I/O vs. Direct I/O
Linux supports two I/O modes:
- Cached I/O:
  - Read: Checks the kernel’s page cache. If data is cached, it’s returned directly; otherwise, it’s read from disk and cached.
  - Write: Copies data to the kernel’s page cache, marking pages as dirty. The kernel writes dirty pages to disk later (e.g., via sync or when memory is low).
- Direct I/O:
  - Bypasses the cache, reading/writing directly to disk, reducing overhead for applications that manage their own caching.
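From user space, the two modes differ mainly in the open flags and in buffer alignment. The sketch below is illustrative only: demo_write_block is a hypothetical helper, 4096 bytes is assumed to satisfy the alignment rules of the underlying device, and the O_DIRECT fallback definition exists only so the example still compiles where the headers do not expose the flag.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* exposes O_DIRECT in <fcntl.h> on glibc */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0            /* not exposed here: degrade to cached I/O */
#endif

#define DEMO_IO_SIZE 4096     /* assumed alignment and transfer size */

/* Write one aligned 4KB buffer to path, either through the page cache
 * (followed by fsync) or with O_DIRECT. Returns bytes written or -1. */
ssize_t demo_write_block(const char *path, int use_direct)
{
    void *buf = NULL;
    ssize_t n;
    int fd;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC |
                    (use_direct ? O_DIRECT : 0), 0644);
    if (fd < 0)
        return -1;            /* some filesystems reject O_DIRECT */

    /* Direct I/O needs an aligned buffer; it is harmless for cached I/O. */
    if (posix_memalign(&buf, DEMO_IO_SIZE, DEMO_IO_SIZE) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 'x', DEMO_IO_SIZE);

    n = write(fd, buf, DEMO_IO_SIZE);

    /* A cached write only dirties pages; fsync forces them to disk. */
    if (n >= 0 && !use_direct && fsync(fd) != 0)
        n = -1;

    free(buf);
    close(fd);
    return n;
}
```

Calling demo_write_block(path, 0) returns as soon as the data is in the page cache and flushed by fsync, while demo_write_block(path, 1) may fail on filesystems that do not support direct I/O, so callers should be prepared to fall back.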
Cached Write Operations
The generic_perform_write function handles cached writes:
- Prepare: Calls ext4_write_begin to start a journal transaction and obtain the page-cache page for the write, allocating it if necessary.
- Copy: Uses iov_iter_copy_from_user_atomic to transfer data from user space to the kernel’s page cache.
- Complete: Calls ext4_write_end to close the journal transaction and mark the page as dirty.
- Balance: Invokes balance_dirty_pages_ratelimited to manage dirty pages, triggering writeback if thresholds are exceeded.
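The four steps can be mimicked with a toy user-space model. This is not kernel code: the toy_* types and functions, the eight-page cache, and the dirty limit of two pages are all assumptions chosen to keep the example small, and journaling is left out entirely.

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE   4096
#define TOY_PAGES       8
#define TOY_DIRTY_LIMIT 2     /* writeback starts above this many pages */

struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
    int dirty;
};

struct toy_cache {
    struct toy_page pages[TOY_PAGES];
    int dirty_count;
    int writebacks;           /* how many times writeback was triggered */
};

/* Stand-in for writeback: a real kernel would submit disk I/O here. */
static void toy_writeback(struct toy_cache *c)
{
    for (int i = 0; i < TOY_PAGES; i++)
        c->pages[i].dirty = 0;
    c->dirty_count = 0;
    c->writebacks++;
}

/* Copy len bytes into the cache at file offset pos, one page at a time. */
size_t toy_cached_write(struct toy_cache *c, size_t pos,
                        const void *buf, size_t len)
{
    const unsigned char *src = buf;
    size_t written = 0;

    while (written < len) {
        size_t index = (pos + written) / TOY_PAGE_SIZE;
        size_t offset = (pos + written) % TOY_PAGE_SIZE;
        size_t chunk = TOY_PAGE_SIZE - offset;

        if (chunk > len - written)
            chunk = len - written;
        if (index >= TOY_PAGES)
            break;                                   /* beyond toy file */

        struct toy_page *pg = &c->pages[index];      /* 1. prepare */
        memcpy(pg->data + offset, src + written, chunk); /* 2. copy */
        if (!pg->dirty) {                            /* 3. complete */
            pg->dirty = 1;
            c->dirty_count++;
        }
        written += chunk;
        if (c->dirty_count > TOY_DIRTY_LIMIT)        /* 4. balance */
            toy_writeback(c);
    }
    return written;
}
```

Writing three pages in one call dirties pages 0, 1, and 2; the third page pushes the count over the limit and triggers a single writeback, after which the cache is clean again.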
EXT4 supports three journaling modes for reliability:
- Journal Mode (data=journal): Logs both metadata and data before writing, ensuring maximum safety but lower performance.
- Ordered Mode (data=ordered, the default): Logs metadata only, ensuring data is written to disk before the metadata that refers to it is committed.
- Writeback Mode (data=writeback): Logs metadata only, without guaranteeing data write order, offering higher performance but less safety.
Cached Read Operations
The generic_file_buffered_read function manages cached reads:
- Check Cache: Searches the page cache using find_get_page.
- Read Ahead: If the page is not cached, triggers synchronous read-ahead (page_cache_sync_readahead) to load the data and adjacent blocks.
- Async Read-Ahead: For cached pages marked for readahead, initiates asynchronous preloading (page_cache_async_readahead).
- Copy: Transfers data from the cache to user space using copy_page_to_iter.
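The interplay between the two kinds of read-ahead can be illustrated with a toy model. It is a deliberate simplification: the toy_* names, the 16-page file, and the fixed four-page window are assumptions, whereas the kernel sizes its windows dynamically and performs the asynchronous load in the background.

```c
#include <stddef.h>

#define TOY_FILE_PAGES 16
#define TOY_RA_WINDOW  4      /* fixed window; the kernel adapts its size */

struct toy_ra_cache {
    int cached[TOY_FILE_PAGES];    /* 1 if the page is in the cache */
    int ra_mark[TOY_FILE_PAGES];   /* 1 if hitting it should prefetch */
    int misses;                    /* reads that had to wait for disk */
    int pages_loaded;              /* total pages fetched from disk */
};

/* Load one window starting at start and mark its last page, so that
 * reaching that page later triggers the next prefetch. */
static void toy_load_window(struct toy_ra_cache *c, size_t start)
{
    size_t end = start + TOY_RA_WINDOW;

    if (end > TOY_FILE_PAGES)
        end = TOY_FILE_PAGES;
    for (size_t i = start; i < end; i++) {
        if (!c->cached[i]) {
            c->cached[i] = 1;
            c->pages_loaded++;
        }
    }
    if (end > start && end < TOY_FILE_PAGES)
        c->ra_mark[end - 1] = 1;
}

/* Read one page through the toy cache. */
void toy_read_page(struct toy_ra_cache *c, size_t index)
{
    if (index >= TOY_FILE_PAGES)
        return;
    if (!c->cached[index]) {
        c->misses++;
        toy_load_window(c, index);       /* synchronous read-ahead */
    } else if (c->ra_mark[index]) {
        c->ra_mark[index] = 0;
        toy_load_window(c, index + 1);   /* asynchronous read-ahead */
    }
    /* The copy to user space would happen here. */
}
```

Reading pages 0 through 7 in order misses only once, on page 0. The marked pages 3 and 7 each trigger a prefetch of the next window, so later reads find their data already cached.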
Conclusion
The EXT4 filesystem, with its inode-based structure, extent trees, and robust journaling, provides a reliable and efficient solution for Linux storage. By leveraging caching and optimized I/O operations, the Linux kernel ensures high performance for both small and large files. Understanding these components is crucial for system administrators and developers aiming to optimize storage performance or troubleshoot filesystem issues.