Understanding the ext Family of File Systems

Last Updated: 2023-07-04 09:40:20 Tuesday


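ext2 was the first production-grade file system on Linux. It was later upgraded to ext3 and ext4, which added new features while keeping good compatibility. The ext family is used widely across the Linux world and is Linux's native file system. It supports the standard Unix file attributes and permission model, which is a notable difference from the Windows family of file systems.

Linux initially modeled its file system on minix. The minix file system was built for teaching, and many of its parameters were unsuitable for production use. In 1992 Remy Card designed and implemented the Ext file system (the name stands for an extension of the minix file system) and implemented VFS at the same time; both shipped in version 0.96c. In 1993 Remy Card developed Ext2, the first production-grade Linux file system. It stayed in use for a long time, but it has no journal and so cannot repair the inconsistencies that a system crash can leave behind. In 2001 Stephen Tweedie led the development of Ext3, whose main addition was journaling. Ext4 appeared in 2008 and introduced many new features, such as extents, preallocation, delayed allocation and (in later kernel releases) encryption.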

Partition Layout

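The ext file systems divide the whole partition into blocks of a fixed size, then combine a fixed number of consecutive blocks into equally sized block groups; the last group may contain fewer blocks. The superblock, the GDT and other metadata are stored in the first block group (group 0), with backup copies in a few other specific groups. Each group holds a bitmap covering all blocks of that group, plus an inode bitmap and an inode table. The GDT gives the location of each group's inode table. Block groups are the unit through which the ext file systems manage disk space.

When the block size is 4KiB: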

Block Group 0 | Block Group 1 | Block Group 2 | ...

In Block Group 0:

1024 Bytes  |  SuperBlock  |  Block Group Desc Table  | ... | ... | ...

Super Block

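The superblock roughly corresponds to the boot sector of a FAT file system, and a block corresponds to a cluster.

The superblock sits at a fixed offset of 1024 bytes from the start of the partition. If the block size is 1KiB, the superblock is block 1; if the block size is larger than 1KiB, the superblock lies inside block 0. The superblock structure itself is 1024 bytes long, so with larger blocks it fills only part of block 0. ext2 divides the partition into fixed-size blocks (counted from the very start of the partition) and then combines blocks into block groups.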
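The rule above can be written as a one-line calculation. This is only an illustrative sketch; the function name superblock_location is made up for this example.

```python
SUPERBLOCK_OFFSET = 1024  # fixed byte offset from the start of the partition

def superblock_location(block_size):
    """Return (block number, byte offset inside that block) of the primary superblock."""
    return divmod(SUPERBLOCK_OFFSET, block_size)

for size in (1024, 2048, 4096):
    block, offset = superblock_location(size)
    print(f"block size {size}: superblock in block {block}, offset {offset}")
```

With 1KiB blocks the result is block 1 at offset 0; with 2KiB or 4KiB blocks it is block 0 at offset 1024.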

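Below is the definition of the ext2 superblock in the Linux kernel (fs/ext2/ext2.h):

For the __bitwise annotation, see the separate note on bitwise in the kernel.

Blocks and block groups are both numbered from 0!

// /usr/include/linux/types.h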
typedef __u32 __bitwise __le32;

/*
 * Structure of the super block
 */
struct ext2_super_block {
        __le32  s_inodes_count;         /* Inodes count */
        __le32  s_blocks_count;         /* Blocks count */
        __le32  s_r_blocks_count;       /* Reserved blocks count */
        __le32  s_free_blocks_count;    /* Free blocks count */
        __le32  s_free_inodes_count;    /* Free inodes count */
        __le32  s_first_data_block;     /* First Data Block */
        __le32  s_log_block_size;       /* Block size */
        __le32  s_log_frag_size;        /* Fragment size */
        __le32  s_blocks_per_group;     /* # Blocks per group */
        __le32  s_frags_per_group;      /* # Fragments per group */
        __le32  s_inodes_per_group;     /* # Inodes per group */
        __le32  s_mtime;                /* Mount time */
        __le32  s_wtime;                /* Write time */
        __le16  s_mnt_count;            /* Mount count */
        __le16  s_max_mnt_count;        /* Maximal mount count */
        __le16  s_magic;                /* Magic signature */
        __le16  s_state;                /* File system state */
        __le16  s_errors;               /* Behaviour when detecting errors */
        __le16  s_minor_rev_level;      /* minor revision level */
        __le32  s_lastcheck;            /* time of last check */
        __le32  s_checkinterval;        /* max. time between checks */
        __le32  s_creator_os;           /* OS */
        __le32  s_rev_level;            /* Revision level */
        __le16  s_def_resuid;           /* Default uid for reserved blocks */
        __le16  s_def_resgid;           /* Default gid for reserved blocks */
        /*
         * These fields are for EXT2_DYNAMIC_REV superblocks only.
         *
         * Note: the difference between the compatible feature set and
         * the incompatible feature set is that if there is a bit set
         * in the incompatible feature set that the kernel doesn't
         * know about, it should refuse to mount the filesystem.
         *
         * e2fsck's requirements are more strict; if it doesn't know
         * about a feature in either the compatible or incompatible
         * feature set, it must abort and not try to meddle with
         * things it doesn't understand...
         */
        __le32  s_first_ino;            /* First non-reserved inode */
        __le16   s_inode_size;          /* size of inode structure */
        __le16  s_block_group_nr;       /* block group # of this superblock */
        __le32  s_feature_compat;       /* compatible feature set */
        __le32  s_feature_incompat;     /* incompatible feature set */
        __le32  s_feature_ro_compat;    /* readonly-compatible feature set */
        __u8    s_uuid[16];             /* 128-bit uuid for volume */
        char    s_volume_name[16];      /* volume name */
        char    s_last_mounted[64];     /* directory where last mounted */
        __le32  s_algorithm_usage_bitmap; /* For compression */
        /*
         * Performance hints.  Directory preallocation should only
         * happen if the EXT2_COMPAT_PREALLOC flag is on.
         */
        __u8    s_prealloc_blocks;      /* Nr of blocks to try to preallocate*/
        __u8    s_prealloc_dir_blocks;  /* Nr to preallocate for dirs */
        __u16   s_padding1;
        /*
         * Journaling support valid if EXT3_FEATURE_COMPAT_HAS_JOURNAL set.
         */
        __u8    s_journal_uuid[16];     /* uuid of journal superblock */
        __u32   s_journal_inum;         /* inode number of journal file */
        __u32   s_journal_dev;          /* device number of journal file */
        __u32   s_last_orphan;          /* start of list of inodes to delete */
        __u32   s_hash_seed[4];         /* HTREE hash seed */
        __u8    s_def_hash_version;     /* Default hash version to use */
        __u8    s_reserved_char_pad;
        __u16   s_reserved_word_pad;
        __le32  s_default_mount_opts;
        __le32  s_first_meta_bg;        /* First metablock block group */
        __u32   s_reserved[190];        /* Padding to the end of the block */
};

Below is the superblock of a freshly formatted ext2 partition:

$ sudo xxd -a -u -s1024 -l512 /dev/sdc2
00000400: 20B2 1B00 009A 6E00 B387 0500 11B1 6C00   .....n.......l.
00000410: 15B2 1B00 0000 0000 0200 0000 0200 0000  ................
00000420: 0080 0000 0080 0000 F01F 0000 0000 0000  ................
00000430: B8EA 8F64 0000 FFFF 53EF 0100 0100 0000  ...d....S.......
00000440: 89EA 8F64 0000 0000 0000 0000 0100 0000  ...d............
00000450: 0000 0000 0B00 0000 0001 0000 3800 0000  ............8...
00000460: 0200 0000 0300 0000 2820 B256 5651 47E6  ........( .VVQG.
00000470: 9F9B AEF7 99CD F9E7 0000 0000 0000 0000  ................
00000480: 0000 0000 0000 0000 0000 0000 0000 0000  ................
*
000004c0: 0000 0000 0000 0000 0000 0000 0000 FE03  ................
000004d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000004e0: 0000 0000 0000 0000 0000 0000 C959 D352  .............Y.R
000004f0: 7587 44C7 8C1A 382B C47C BC32 0100 0000  u.D...8+.|.2....
00000500: 0C00 0000 0000 0000 89EA 8F64 0000 0000  ...........d....
00000510: 0000 0000 0000 0000 0000 0000 0000 0000  ................
*
00000550: 0000 0000 0000 0000 0000 0000 2000 2000  ............ . .
00000560: 0100 0000 0000 0000 0000 0000 0000 0000  ................
00000570: 0000 0000 0000 0000 0000 0000 0000 0000  ................
*
000005f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
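As a cross-check for the manual decoding, the first 48 bytes of the dump (offsets 0x400 to 0x42F) can be unpacked with Python's struct module. This is a sketch: the bytes are copied by hand from the dump above, and only the first twelve little-endian 32-bit fields are covered.

```python
import struct

# Bytes 0x400-0x42F of the dump above, copied as hex.
raw = bytes.fromhex(
    "20B2 1B00 009A 6E00 B387 0500 11B1 6C00"
    "15B2 1B00 0000 0000 0200 0000 0200 0000"
    "0080 0000 0080 0000 F01F 0000 0000 0000"
)

names = ("s_inodes_count", "s_blocks_count", "s_r_blocks_count",
         "s_free_blocks_count", "s_free_inodes_count", "s_first_data_block",
         "s_log_block_size", "s_log_frag_size", "s_blocks_per_group",
         "s_frags_per_group", "s_inodes_per_group", "s_mtime")

# "<12I" = twelve little-endian unsigned 32-bit integers (__le32).
fields = dict(zip(names, struct.unpack("<12I", raw)))
block_size = 1024 << fields["s_log_block_size"]

for name, value in fields.items():
    print(f"{name:22} 0x{value:08X} {value}")
print("block size:", block_size)
```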

Decoding it field by field:

__le32 s_inodes_count: total number of inodes, 0x001BB220 = 1815072, which matches the output of df -i.
__le32 s_blocks_count: total number of blocks, 0x006E9A00 = 7248384.
__le32 s_r_blocks_count: number of blocks reserved for the super user, 0x000587B3 = 362419. This is most useful if for some reason a user, maliciously or not, fills the file system to capacity; the super user will have this specified amount of free blocks at his disposal so he can edit and save configuration files. (Under what circumstances are these reserved blocks actually used?)
__le32 s_free_blocks_count: number of free blocks in the whole partition, 0x006CB111 = 7123217.
__le32 s_free_inodes_count: number of free inodes in the whole partition, 0x001BB215 = 1815061, which matches the output of df -i.
__le32 s_first_data_block: the first data block, 0x00000000 = 0, which tells us that the block size of this ext2 partition is larger than 1KiB. With the minimum block size of 1KiB this field would be 1.
__le32 s_log_block_size: block size, 0x00000002, 1024<<2=4096, so the block size is 4KiB. 4KiB is more than enough to hold the whole superblock, with plenty of room for extension.
__le32 s_log_frag_size: fragment size, 0x00000002, 1024<<2=4096, so the fragment size is also 4KiB. (As I understand it, a fragment is a unit smaller than a block, obtained by splitting one block into several fragments; this feature no longer seems to be supported.)
__le32 s_blocks_per_group: blocks per group, 0x00008000 = 32768. Together with the first data block it fixes the boundary of every group; the last group may contain fewer blocks than this. With this example's data each block group is 128MiB. If the block size is 1KiB, block 0 is excluded from group 0.
__le32 s_frags_per_group: fragments per group, 0x00008000, the same as blocks per group, since the fragment and block sizes are kept equal.
__le32 s_inodes_per_group: number of inodes per group, 0x00001FF0 = 8176. Note that you cannot have more than (block size in bytes * 8) inodes per group as the inode bitmap must fit within a single block.

A quick arithmetic check:

>>> import math
>>> math.ceil(7248384/32768)*8176
1815072

__le32 s_mtime: last mount time. All zeros, presumably because the partition had just been formatted and never mounted.
__le32 s_wtime: last write time, 0x648FEAB8 = 1687153336, which is:

>>> import time
>>> time.ctime(1687153336)
'Mon Jun 19 13:42:16 2023'

__le16 s_mnt_count: number of mounts since the last full verification; 0 here.
__le16 s_max_mnt_count: maximum number of mounts allowed before a full check is required; 0xFFFF here.
__le16 s_magic: always 0xEF53.
__le16 s_state: I observed 0 while the partition was mounted and 1 after umount. 1 is EXT2_VALID_FS (cleanly unmounted) and 2 is EXT2_ERROR_FS (errors detected).
__le16 s_errors: 16bit value indicating what the file system driver should do when an error is detected. 1 means continue as if nothing happened, 2 means remount read-only, 3 means cause a kernel panic. The value here is 1.
__le16 s_minor_rev_level: 0x0000. The s_rev_level field further down is the major revision.
__le32 s_lastcheck: Unix time, as defined by POSIX, of the last file system check; 0x648FEA89 here. I ran fsck once, but this time did not change.
__le32 s_checkinterval: Maximum Unix time interval, as defined by POSIX, allowed between file system checks. All zeros here.
__le32 s_creator_os: the OS that created the file system. 0 is Linux, 1 is GNU HURD, 2 is MASIX, 3 is FreeBSD, 4 is Lites. The value here is 0.
__le32 s_rev_level: 0 is EXT2_GOOD_OLD_REV and 1 is EXT2_DYNAMIC_REV (Revision 1 with variable inode sizes, extended attributes, etc.). The value here is 1. (lsblk -f reports ext2 with version 1.0, which matches this data.)
__le16 s_def_resuid: 16bit value used as the default user id for reserved blocks. All zeros, meaning root.
__le16 s_def_resgid: 16bit value used as the default group id for reserved blocks. All zeros, meaning root.
__le32 s_first_ino: the first non-reserved inode, 0x0000000B = 11. In revision 0, the first non-reserved inode is fixed to 11 (EXT2_GOOD_OLD_FIRST_INO). In revision 1 and later this value may be set to any value.
__le16 s_inode_size: 0x0100 = 256. In revision 0, this value is always 128 (EXT2_GOOD_OLD_INODE_SIZE). In revision 1 and later, this value must be a perfect power of 2 and must be smaller or equal to the block size. (Combined with the inodes per group and the block size above, each block group uses 8176*256/4096 = 511 blocks for its inode table.)
__le16 s_block_group_nr: 0x0000. 16bit value used to indicate the block group number hosting this superblock structure. This can be used to rebuild the file system from any superblock backup.
__le32 s_feature_compat: 0x00000038 = 0b00111000. A file system implementation is free to support or ignore these features:

# feature_compatibility
EXT2_FEATURE_COMPAT_DIR_PREALLOC 0x0001 Block pre-allocation for new directories
EXT2_FEATURE_COMPAT_IMAGIC_INODES 0x0002     
EXT3_FEATURE_COMPAT_HAS_JOURNAL 0x0004  An Ext3 journal exists
EXT2_FEATURE_COMPAT_EXT_ATTR 0x0008 Extended inode attributes are present
EXT2_FEATURE_COMPAT_RESIZE_INO 0x0010   Reserved GDT blocks for online resizing (resize inode)
EXT2_FEATURE_COMPAT_DIR_INDEX 0x0020    Directory indexing (HTree)

__le32 s_feature_incompat: 0x00000002. The file system implementation should refuse to mount the file system if any of the indicated features is unsupported. An implementation not supporting these features would be unable to properly use the file system. For example, if compression is being used, an executable file would be unusable after being read from the disk if the system does not know how to uncompress it.

# feature_incompat
EXT2_FEATURE_INCOMPAT_COMPRESSION 0x0001    Disk/File compression is used
EXT2_FEATURE_INCOMPAT_FILETYPE 0x0002    
EXT3_FEATURE_INCOMPAT_RECOVER 0x0004     
EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008     
EXT2_FEATURE_INCOMPAT_META_BG 0x0010

__le32 s_feature_ro_compat: 0x00000003. The file system implementation should mount as read-only if any of the indicated features is unsupported.

EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001  Sparse Superblock
EXT2_FEATURE_RO_COMPAT_LARGE_FILE 0x0002 Large file support, 64-bit file size
EXT2_FEATURE_RO_COMPAT_BTREE_DIR 0x0004 Binary tree sorted directory files
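The three feature words of this partition (0x38, 0x02 and 0x03) can be expanded into flag names with a small lookup. A minimal sketch: the tables simply restate the constants listed above with shortened names, and the helper decode is introduced only for this example.

```python
COMPAT = {0x0001: "DIR_PREALLOC", 0x0002: "IMAGIC_INODES", 0x0004: "HAS_JOURNAL",
          0x0008: "EXT_ATTR", 0x0010: "RESIZE_INO", 0x0020: "DIR_INDEX"}
INCOMPAT = {0x0001: "COMPRESSION", 0x0002: "FILETYPE", 0x0004: "RECOVER",
            0x0008: "JOURNAL_DEV", 0x0010: "META_BG"}
RO_COMPAT = {0x0001: "SPARSE_SUPER", 0x0002: "LARGE_FILE", 0x0004: "BTREE_DIR"}

def decode(value, table):
    """Return the names of all flags from `table` that are set in `value`."""
    return [name for bit, name in sorted(table.items()) if value & bit]

print("compat:   ", decode(0x38, COMPAT))
print("incompat: ", decode(0x02, INCOMPAT))
print("ro_compat:", decode(0x03, RO_COMPAT))
```

So this partition has extended attributes, the resize inode and directory indexing enabled, stores file types in directory entries, and uses sparse superblock backups and large files.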

__u8 s_uuid[16]: UUID of the partition, identical to what lsblk -f shows: 2820b256-5651-47e6-9f9b-aef799cdf9e7
char s_volume_name[16]: ISO-Latin-1 characters, 0 terminated. Rarely used; empty here.
char s_last_mounted[64]: 64 bytes directory path where the file system was last mounted. While not normally used, it could serve for auto-finding the mountpoint when not indicated on the command line. Again the path should be zero terminated for compatibility reasons. Valid path is constructed from ISO-Latin-1 characters.
__le32 s_algorithm_usage_bitmap: 0x00000000. Each bit marks a compression algorithm in use:

EXT2_LZV1_ALG   0   Binary value of 0x00000001
EXT2_LZRW3A_ALG 1   Binary value of 0x00000002
EXT2_GZIP_ALG   2   Binary value of 0x00000004
EXT2_BZIP2_ALG  3   Binary value of 0x00000008
EXT2_LZO_ALG    4   Binary value of 0x00000010

__u8 s_prealloc_blocks: 0x00. 8-bit value representing the number of blocks the implementation should attempt to pre-allocate when creating a new regular file. Linux 2.6.28 will only perform pre-allocation using Ext4 although no problem is expected if any version of Linux encounters a file with more blocks present than required.
__u8 s_prealloc_dir_blocks: 0x00. 8-bit value representing the number of blocks the implementation should attempt to pre-allocate when creating a new directory. Linux 2.6.28 will only perform pre-allocation using Ext4 and only if the EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is present. Since Linux does not de-allocate blocks from directories after they were allocated, it should be safe to perform pre-allocation and maintain compatibility with Linux.
__u16 s_padding1: 0x03FE.

In another project's version of this structure the field is called reserved_gdt_blocks: 0x03FE = 1022. GDT stands for Group Descriptor Table, and this is the number of blocks reserved right after the GDT.

The dump is getting hard to follow by eye, so here is the raw content for the next fields once more:

000004d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000004e0: 0000 0000 0000 0000 0000 0000 C959 D352  .............Y.R
000004f0: 7587 44C7 8C1A 382B C47C BC32 0100 0000  u.D...8+.|.2....

__u8 s_journal_uuid[16]: 16-byte value containing the uuid of the journal superblock. See Ext3 Journaling for more information. All zeros here.
__u32 s_journal_inum: 4 bytes, all zeros.
__u32 s_journal_dev: 4 bytes, all zeros.
__u32 s_last_orphan: 4 bytes, all zeros.
__u32 s_hash_seed[4]: 16 bytes. An array of 4 32bit values containing the seeds used for the hash algorithm for directory indexing. It holds values here, but I have not looked into how they are used.
__u8 s_def_hash_version: 0x01. An 8bit value containing the default hash version used for directory indexing.
__u8 s_reserved_char_pad: 0x00.
__u16 s_reserved_word_pad: 0x0000.

00000500: 0C00 0000 0000 0000 89EA 8F64 0000 0000  ...........d....
00000510: 0000 0000 0000 0000 0000 0000 0000 0000  ................

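__le32 s_default_mount_opts: 0x0000000C, i.e. bits 0x0004 (user extended attributes) and 0x0008 (ACL).
__le32 s_first_meta_bg: 0x00000000.
__u32 s_reserved[190]

In the grub source code the structure has one more field at this position, mkfs_time: 0x648FEA89.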

>>> import time
>>> time.ctime(0x648FEA89)
'Mon Jun 19 13:41:29 2023'

Clearly, the on-disk ext2 superblock already carries fields that belong to the ext3 and ext4 extensions.

Block Group Desc Table (GDT)

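The block group descriptor table starts in the block right after the one that contains the superblock. It may occupy more than one block, depending on the number of groups and the block size.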

/*
 * Structure of a blocks group descriptor
 */
struct ext2_group_desc
{
        __le32  bg_block_bitmap;                /* Blocks bitmap block */
        __le32  bg_inode_bitmap;                /* Inodes bitmap block */
        __le32  bg_inode_table;         /* Inodes table block */
        __le16  bg_free_blocks_count;   /* Free blocks count */
        __le16  bg_free_inodes_count;   /* Free inodes count */
        __le16  bg_used_dirs_count;     /* Directories count */
        __le16  bg_pad;
        __le32  bg_reserved[3];
};

Each GDT entry is 32 bytes, so one 4KiB block holds 128 entries.

__le32 bg_block_bitmap: block id where this block group's block bitmap starts
__le32 bg_inode_bitmap: block id where this block group's inode bitmap starts
__le32 bg_inode_table: block id where this block group's inode table starts
__le16 bg_free_blocks_count: number of free blocks in this block group
__le16 bg_free_inodes_count: number of free inodes in this block group
__le16 bg_used_dirs_count: number of directories in this block group

Block Bitmap

Each bit indicates whether one block of the block group is available.

Each bit represent the current state of a block within that block group, where 1 means “used” and 0 “free/available”. The first block of this block group is represented by bit 0 of byte 0, the second by bit 1 of byte 0. The 8th block is represented by bit 7 (most significant bit) of byte 0 while the 9th block is represented by bit 0 (least significant bit) of byte 1.

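With this example's data each block group has 32768 blocks, so 32768 bits = 4096 bytes are needed, which is exactly the size of one block.

How the number of groups is determined

Because the block bitmap always occupies exactly one block, a group can contain at most (block size in bytes * 8) blocks. Changing the block size therefore changes the number of groups!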
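To make the dependency concrete, here is a small sketch that computes the group count for several block sizes. It assumes that every group uses the maximum of 8 * block size blocks (as this example partition does) and that the partition size stays the same; the function name group_geometry is made up for illustration.

```python
import math

def group_geometry(total_bytes, block_size):
    """Return (blocks per group, number of groups) for a partition of total_bytes."""
    blocks_per_group = 8 * block_size            # one bitmap block = block_size * 8 bits
    total_blocks = total_bytes // block_size
    first_data_block = 1 if block_size == 1024 else 0  # block 0 is outside group 0 for 1KiB blocks
    groups = math.ceil((total_blocks - first_data_block) / blocks_per_group)
    return blocks_per_group, groups

partition_bytes = 7248384 * 4096  # size of the example partition
for size in (1024, 2048, 4096):
    print(size, group_geometry(partition_bytes, size))
```

For the 4KiB case this reproduces the 32768 blocks per group and 222 groups of the example.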

Inode Bitmap

The same idea as the block bitmap, except that inode numbers start at 1, so inode 1 corresponds to bit 0.

GDT example

Below is the first GDT entry:

$ sudo xxd -a -u -s4096 -l32 /dev/sdc2
00001000: 0104 0000 0204 0000 0304 0000 F879 E51F  .............y..
00001010: 0200 0400 0000 0000 0000 0000 0000 0000  ................

The first entry describes the first block group (group 0):

block id of the block bitmap: 0x00000401 = 1025
block id of the inode bitmap: 0x00000402 = 1026
block id of the inode table: 0x00000403 = 1027
free block count: 0x79F8 = 31224
free inode count: 0x1FE5 = 8165
used dir count: 0x0002 = 2
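The same 32 bytes can be unpacked mechanically. A sketch using the struct module, with the bytes copied from the dump above; the format string follows the ext2_group_desc layout shown earlier (three __le32, four __le16, 12 reserved bytes).

```python
import math
import struct

raw = bytes.fromhex(
    "0104 0000 0204 0000 0304 0000 F879 E51F"
    "0200 0400 0000 0000 0000 0000 0000 0000"
)

names = ("bg_block_bitmap", "bg_inode_bitmap", "bg_inode_table",
         "bg_free_blocks_count", "bg_free_inodes_count",
         "bg_used_dirs_count", "bg_pad")

# 3 x __le32, 4 x __le16, then skip the 12 bytes of bg_reserved[3].
desc = dict(zip(names, struct.unpack("<3I4H12x", raw)))
print(desc)

# 222 groups of 32-byte descriptors need this many 4KiB blocks:
gdt_blocks = math.ceil(222 * len(raw) / 4096)
print("GDT blocks:", gdt_blocks)
```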

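The partition has 7248384 blocks in total and each block group contains 32768 blocks, so there are 222 block groups. Right after formatting, before anything else is done, this first block group has only 31224 free blocks left; 1544 blocks are already in use. Let's look at the block bitmap: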

$ sudo xxd -a -u -s$((4096*1025)) -l 512 /dev/sdc2
00401000: FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF  ................
00401010: FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF  ................
00401020: FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF  ................
00401030: FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF  ................
00401040: FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF  ................
00401050: FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF  ................
00401060: FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF  ................
00401070: FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF  ................
00401080: FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF  ................
00401090: FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF  ................
004010a0: FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF  ................
004010b0: FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF  ................
004010c0: FF00 0000 0000 0000 0000 0000 0000 0000  ................
004010d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
*
004011f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

Counting carefully, exactly the first 1544 blocks are marked as used.
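Instead of counting by eye, the bits can be counted in code. In this sketch the bitmap is rebuilt from the dump (193 bytes of 0xFF followed by zeros up to 4096 bytes), and block_in_use is a hypothetical helper that applies the bit order described above (bit 0 of byte 0 is the first block).

```python
# 12 full rows of 16 x 0xFF plus one more 0xFF byte, zero-filled to one 4KiB block.
bitmap = bytes.fromhex("FF" * 193) + bytes(4096 - 193)

def block_in_use(bitmap, n):
    """True if block n of the group is marked used (least significant bit first)."""
    return bool((bitmap[n // 8] >> (n % 8)) & 1)

used = sum(bin(byte).count("1") for byte in bitmap)
free = len(bitmap) * 8 - used
print("used:", used, "free:", free)
print(block_in_use(bitmap, 1543), block_in_use(bitmap, 1544))
```

The free count obtained this way agrees with bg_free_blocks_count in the descriptor.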

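The block bitmap is in block 1025. Block 0 holds the superblock, blocks 1 and 2 hold the GDT, and the 1022 blocks in between are reserved, which corresponds to the reserved_gdt_blocks field. Each inode is 256 bytes, each block group has 8176 inodes and each block is 4KiB, so the inode table fills exactly 511 blocks. 1027+511=1538, which is the block number of the root directory's data block. Now the inode bitmap: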

$ sudo xxd -a -u -s$((4096*1026)) -l 512 /dev/sdc2
00402000: FF07 0000 0000 0000 0000 0000 0000 0000  ................
00402010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
*
004021f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

Exactly the first 11 (8+3) inodes are in use, and 8165+11=8176, which is consistent with the other data.
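The block numbers seen in group 0 can be derived from a handful of superblock values. This is a sketch that assumes the classic contiguous layout found on this partition (superblock, GDT, reserved GDT blocks, block bitmap, inode bitmap, inode table); other layouts, such as ext4 with flex_bg, place things differently.

```python
import math

block_size = 4096
groups = 222
desc_size = 32               # bytes per group descriptor
reserved_gdt_blocks = 1022   # from the superblock
inodes_per_group = 8176
inode_size = 256

gdt_blocks = math.ceil(groups * desc_size / block_size)
block_bitmap = 1 + gdt_blocks + reserved_gdt_blocks   # block 0 holds the superblock
inode_bitmap = block_bitmap + 1
inode_table = inode_bitmap + 1
inode_table_blocks = inodes_per_group * inode_size // block_size
first_block_after_table = inode_table + inode_table_blocks

print(gdt_blocks, block_bitmap, inode_bitmap, inode_table,
      inode_table_blocks, first_block_after_table)
```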

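Why are the bitmaps spread across the groups?

If all bitmaps were stored in a single group, damage to that group's blocks would affect the availability of the whole file system. Spreading them out lowers that risk. Spreading the bitmaps across groups can also help performance.

The block group concept does not prevent the blocks recorded in an inode from belonging to other groups! The allocator tries to keep a file's blocks close to its inode, but a large file simply continues into other groups.

Does spreading the bitmaps across groups really improve performance?

The most direct benefit is locality: each group's bitmaps and inode table sit close to the data blocks they describe, which shortens seeks on a spinning disk. Beyond that, an ordinary Linux user will hardly notice anything, because the partition usually lives on a single physical disk and no parallel IO is possible. If the partition spans several physical disks, IO to different groups may be served in parallel, and then both the safety and the performance argument for spreading the two bitmaps become easier to appreciate. Seen this way, the design of ext2 already took server environments into account.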

Inode Table

The inode structure can be found in the Linux source tree (fs/ext2/ext2.h):

/*
 * Constants relative to the data blocks
 */
#define EXT2_NDIR_BLOCKS                12
#define EXT2_IND_BLOCK                  EXT2_NDIR_BLOCKS
#define EXT2_DIND_BLOCK                 (EXT2_IND_BLOCK + 1)
#define EXT2_TIND_BLOCK                 (EXT2_DIND_BLOCK + 1)
#define EXT2_N_BLOCKS                   (EXT2_TIND_BLOCK + 1)

/*
 * Structure of an inode on the disk
 */
struct ext2_inode {
        __le16  i_mode;         /* File mode */
        __le16  i_uid;          /* Low 16 bits of Owner Uid */
        __le32  i_size;         /* Size in bytes */
        __le32  i_atime;        /* Access time */
        __le32  i_ctime;        /* Creation time */
        __le32  i_mtime;        /* Modification time */
        __le32  i_dtime;        /* Deletion Time */
        __le16  i_gid;          /* Low 16 bits of Group Id */
        __le16  i_links_count;  /* Links count */
        __le32  i_blocks;       /* Blocks count of 512 bytes */
        __le32  i_flags;        /* File flags */
        union {
                struct {
                        __le32  l_i_reserved1;
                } linux1;
                struct {
                        __le32  h_i_translator;
                } hurd1;
                struct {
                        __le32  m_i_reserved1;
                } masix1;
        } osd1;                         /* OS dependent 1 */
        __le32  i_block[EXT2_N_BLOCKS];/* Pointers to blocks */
        __le32  i_generation;   /* File version (for NFS) */
        __le32  i_file_acl;     /* File ACL */
        __le32  i_dir_acl;      /* Directory ACL */
        __le32  i_faddr;        /* Fragment address */
        union {
                struct {
                        __u8    l_i_frag;       /* Fragment number */
                        __u8    l_i_fsize;      /* Fragment size */
                        __u16   i_pad1;
                        __le16  l_i_uid_high;   /* these 2 fields    */
                        __le16  l_i_gid_high;   /* were reserved2[0] */
                        __u32   l_i_reserved2;
                } linux2;
                struct {
                        __u8    h_i_frag;       /* Fragment number */
                        __u8    h_i_fsize;      /* Fragment size */
                        __le16  h_i_mode_high;
                        __le16  h_i_uid_high;
                        __le16  h_i_gid_high;
                        __le32  h_i_author;
                } hurd2;
                struct {
                        __u8    m_i_frag;       /* Fragment number */
                        __u8    m_i_fsize;      /* Fragment size */
                        __u16   m_pad1;
                        __u32   m_i_reserved2[2];
                } masix2;
        } osd2;                         /* OS dependent 2 */
};

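Counting carefully, this structure is 128 bytes, so why does the superblock say 256? ext2 itself only needs 128 bytes; some extended features need more room, and a larger inode size makes them possible.

Create a text file and check it with stat: its inode number is 15. Now look directly at the content of inode 15: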

$ sudo xxd -u -s$((4096*1027+256*14)) -l256 /dev/sdc2
00403e00: A481 0000 0B00 0000 D7B4 9264 D7B4 9264  ...........d...d
00403e10: D7B4 9264 0000 0000 0000 0100 0800 0000  ...d............
00403e20: 0000 0000 0300 0000 0108 0000 0000 0000  ................
00403e30: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00403e40: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00403e50: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00403e60: 0000 0000 219C CE93 0000 0000 0000 0000  ....!...........
00403e70: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00403e80: 2000 0000 343D FB14 309E ED10 309E ED10   ...4=..0...0...
00403e90: D7B4 9264 309E ED10 0000 0000 0000 0000  ...d0...........
00403ea0: 0000 02EA 0706 3400 0000 0000 2500 0000  ......4.....%...
00403eb0: 0000 0000 7365 6C69 6E75 7800 0000 0000  ....selinux.....
00403ec0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00403ed0: 0000 0000 0000 0000 756E 636F 6E66 696E  ........unconfin
00403ee0: 6564 5F75 3A6F 626A 6563 745F 723A 756E  ed_u:object_r:un
00403ef0: 6C61 6265 6C65 645F 743A 7330 0000 0000  labeled_t:s0....
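The offset passed to xxd above follows from a simple rule: inode numbers start at 1, so inode n is entry (n-1) mod inodes_per_group in the inode table of group (n-1) div inodes_per_group. A sketch, with locate_inode as a made-up helper name; the inode table block must be taken from the descriptor of the group the inode lives in.

```python
def locate_inode(ino, inodes_per_group, inode_size, block_size, inode_table_block):
    """Return (group, index in group, byte offset on the partition) for inode `ino`.

    inode_table_block is bg_inode_table of the group that contains the inode.
    """
    group, index = divmod(ino - 1, inodes_per_group)  # inode numbers start at 1
    offset = inode_table_block * block_size + index * inode_size
    return group, index, offset

group, index, offset = locate_inode(15, 8176, 256, 4096, 1027)
print(group, index, hex(offset))
```

For inode 15 this gives group 0, index 14 and offset 0x403e00, the address shown in the first column of the dump.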

i_block[EXT2_N_BLOCKS]

The __le32 i_block[EXT2_N_BLOCKS] field records all data blocks of the file. This test file is tiny, so only the first block id matters: 0x00000801, block 2049, whose content is:

$ sudo xxd -u -s$((4096*2049)) -l32 /dev/sdc2
00801000: 3132 3334 3536 3738 3930 0A00 0000 0000  1234567890......
00801010: 0000 0000 0000 0000 0000 0000 0000 0000  ................

In total there are 15 pointers in the i_block[] array. The meaning of each of these pointers is explained below:

i_block[0..11] point directly to the first 12 data blocks of the file.

i_block[12] points to a single indirect block

i_block[13] points to a double indirect block

i_block[14] points to a triple indirect block

Unlike the FAT family, which uses cluster chains, the ext file systems store block ids directly in an array, with several levels of indirection. The first 12 entries of the array point to 12 direct blocks; the last 3 entries are indirect pointers.

+--------+
| direct | --> data block
+--------+

+-----------------+
| single indirect | --> | direct 0 | --> data block
+-----------------+     | direct 1 | --> data block
                        | ...      | ...
                        | direct n | --> data block

+-----------------+
| double indirect | --> | single indirect 0 | --> ...
+-----------------+     | single indirect 1 | --> ...
                        | ...               | ...
                        | single indirect n | --> ...

+-----------------+
| triple indirect | --> | double indirect 0 | --> ...
+-----------------+     | double indirect 1 | --> ...
                        | ...               | ...
                        | double indirect n | --> ...

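With 4KiB blocks and 4-byte block numbers, one block holds 1024 block ids.
The 12 direct blocks cover 4KiB*12 = 48KiB.
The single indirect block covers 1024*4KiB = 4096KiB = 4MiB.
The double indirect block covers 1024*4096KiB = 4194304KiB = 4GiB.
The triple indirect block covers 1024*4194304KiB = 4294967296KiB = 4TiB.

There is no need to add them up: the last term dominates, at about 4TiB.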
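The same arithmetic for arbitrary block sizes, as a short sketch. It only counts what the 15 pointers in i_block[] can address and ignores every other limit; the function name is made up for this example.

```python
def max_bytes_by_pointers(block_size):
    """Bytes addressable through 12 direct plus single, double and triple indirect pointers."""
    ptrs = block_size // 4                      # 4-byte block ids per block
    blocks = 12 + ptrs + ptrs ** 2 + ptrs ** 3
    return blocks * block_size

for size in (1024, 2048, 4096):
    total = max_bytes_by_pointers(size)
    print(f"block size {size}: {total} bytes = {total / 2**30:.2f} GiB")
```

For 4KiB blocks the sum is 4402345721856 bytes, only slightly above 4TiB, which confirms that the triple indirect term dominates.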

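Maximum file size

The i_size field is only 32 bits wide, which on its own would cap a file at 4GiB, even though i_block[] can address far more. So how does ext2 arrive at a maximum file size of 2TiB? The upper 32 bits of the size are kept in the i_dir_acl field. The 2TiB ceiling itself comes from i_blocks, a 32-bit count of 512-byte units: 2^32 * 512 bytes = 2TiB.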

In revision 0, (signed) 32bit value indicating the size of the file in bytes. In revision 1 and later revisions, and only for regular files, this represents the lower 32-bit of the file size; the upper 32-bit is located in the i_dir_acl.

A regular file keeps its ACL in i_file_acl, so reusing i_dir_acl for regular files causes no conflict.
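A sketch of how the two 32-bit halves combine for a regular file in revision 1, together with the ceiling implied by a 32-bit i_blocks counter of 512-byte units. The function name and the sample values are illustrative only.

```python
def regular_file_size(i_size, i_dir_acl):
    """Combine the low 32 bits (i_size) and high 32 bits (i_dir_acl) of a file size."""
    return (i_dir_acl << 32) | i_size

# A 5GiB file does not fit into 32 bits: the high word becomes 1.
five_gib = 5 * 2**30
low, high = five_gib & 0xFFFFFFFF, five_gib >> 32
print(low, high, regular_file_size(low, high) == five_gib)

# Ceiling implied by i_blocks (32-bit count of 512-byte units).
i_blocks_ceiling = 2**32 * 512
print(i_blocks_ceiling // 2**40, "TiB")
```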

i_mode

16bit value used to indicate the format of the described file and the access rights, which can be combined in various ways.

Constant    Value   Description
-- file format --
EXT2_S_IFSOCK   0xC000  socket
EXT2_S_IFLNK    0xA000  symbolic link
EXT2_S_IFREG    0x8000  regular file
EXT2_S_IFBLK    0x6000  block device
EXT2_S_IFDIR    0x4000  directory
EXT2_S_IFCHR    0x2000  character device
EXT2_S_IFIFO    0x1000  fifo
-- process execution user/group override --
EXT2_S_ISUID    0x0800  Set process User ID
EXT2_S_ISGID    0x0400  Set process Group ID
EXT2_S_ISVTX    0x0200  sticky bit
-- access rights --
EXT2_S_IRUSR    0x0100  user read
EXT2_S_IWUSR    0x0080  user write
EXT2_S_IXUSR    0x0040  user execute
EXT2_S_IRGRP    0x0020  group read
EXT2_S_IWGRP    0x0010  group write
EXT2_S_IXGRP    0x0008  group execute
EXT2_S_IROTH    0x0004  others read
EXT2_S_IWOTH    0x0002  others write
EXT2_S_IXOTH    0x0001  others execute
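These constants have the same values as the S_IF* and permission constants in Python's stat module, so the two i_mode values that appear in this article (0x81A4 for inode 15 and 0x41ED for the root directory) can be decoded with the standard library. A small sketch; describe is a helper introduced here.

```python
import stat

def describe(mode):
    """Return (file kind, permission bits in octal, ls-style string) for an i_mode value."""
    if stat.S_ISDIR(mode):
        kind = "directory"
    elif stat.S_ISREG(mode):
        kind = "regular file"
    else:
        kind = "other"
    return kind, oct(stat.S_IMODE(mode)), stat.filemode(mode)

for mode in (0x81A4, 0x41ED):
    print(hex(mode), describe(mode))
```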

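atime, ctime, mtime, dtime

All four timestamps are 32-bit Unix times with a resolution of one second. So where does the sub-second part shown by stat come from? A likely answer is the second half of the 256-byte inode: the ext4 inode layout keeps an extra 32-bit field per timestamp there, and it carries the nanoseconds.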
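A hedged sketch of that decoding. It assumes that the second half of the 256-byte inode follows the ext4 layout (an assumption; the source only shows the raw bytes), in which each "extra" word uses its low 2 bits to extend the epoch and its upper 30 bits for nanoseconds. The two constants are the little-endian words at offsets 0x84 and 0x88 of the inode 15 dump, taken here to be the ctime and mtime extras.

```python
def decode_extra(extra):
    """Split an ext4-style extra timestamp word into (epoch bits, nanoseconds)."""
    epoch_bits = extra & 0x3     # low 2 bits extend the 32-bit seconds field
    nanoseconds = extra >> 2     # upper 30 bits
    return epoch_bits, nanoseconds

ctime_extra = 0x14FB3D34   # bytes 34 3D FB 14 at offset 0x84
mtime_extra = 0x10ED9E30   # bytes 30 9E ED 10 at offset 0x88

print("ctime:", decode_extra(ctime_extra))
print("mtime:", decode_extra(mtime_extra))
```

The results, 88002381 and 71001996 nanoseconds, equal the fractional parts that stat reports for Change and Modify of the file abc near the end of this article, which supports the assumption.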

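Every visible file and directory in the system has three time attributes: atime, mtime and ctime. This section gives a short introduction to them.

atime is the (last) access time, the time the file was last accessed. It is updated by system calls such as read. (To improve performance and extend the life of SSDs, atime updates are often turned off.)

mtime is the (last) modified time, the time the file content was last changed. This one is easy to understand.

ctime is the (last) changed time. This attribute is platform dependent: on Linux, ctime is the last time the file's metadata changed, whereas on Windows it is the time the file was created.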

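The xtime of directories

A directory's access time (atime) changes when the content of the directory is read. Merely cd-ing into a directory and then running cd .. does not change the atime, but running ls does.

A directory's modified time (mtime) changes only when files are created or deleted inside the directory. Changing the content of a file does not change the directory's mtime; put differently, the mtime is refreshed whenever the result of ls -f changes. Someone may object: "I went into directory dd, edited a file with vi and quit; ls -f shows the same result before and after, yet the directory's mtime changed." Note that vi creates a temporary ".file.swp" file in the same directory while editing and deletes it on exit, and that is what changes the mtime. If you don't believe it, try editing the file with nano.

A directory's change time (ctime) is essentially the same as a file's ctime: it reflects the change time of the inode.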

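relatime

The relatime mount option was added in kernel 2.6.20 and has been the default since 2.6.30. On systems with frequent reads the cost of updating atime is high, so many system administrators mount file systems with noatime to stop atime updates. Some programs, however, make decisions based on atime, so Linux introduced relatime. With this option atime is updated only when the previous atime is not newer than mtime or ctime (and, since 2.6.30, also when the previous atime is more than a day old). The option exists purely to keep atime usable at low cost; it is not a new time attribute. To use it, mount with mount -o relatime /dir.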
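The relatime decision can be sketched as a small predicate. This is a simplified model for illustration, not kernel code; the function name and the plain integer timestamps (seconds) are assumptions of this example.

```python
DAY = 24 * 60 * 60

def relatime_needs_update(atime, mtime, ctime, now):
    """Return True if a read at time `now` should update atime under relatime."""
    if mtime >= atime or ctime >= atime:
        return True               # file changed since it was last accessed
    return now - atime >= DAY     # otherwise refresh at most once a day

print(relatime_needs_update(atime=100, mtime=200, ctime=50, now=300))
print(relatime_needs_update(atime=300, mtime=200, ctime=200, now=400))
print(relatime_needs_update(atime=300, mtime=200, ctime=200, now=300 + DAY))
```

The first and third calls return True (file modified after the last access, and atime older than a day); the second returns False.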

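i_links_count and hard links

This is the number of hard links.

A hard link is a new file name created through the file system's inode link, rather than a new file. Normally each inode number corresponds to one file name, but Linux allows several file names to point to the same inode number, which means the same content can be reached under different names.

$ ln source-file target-file  # hard link

After this command the source and the target have the same inode number and point to the same inode. ls appears to show one more file, but the number of inodes has not grown; there is merely one more visible file name. A hard link increases the link count in the inode by 1. When a file has several hard links, a change to the content is visible through every linked name, while deleting one name does not affect access through another. Deleting a name only decreases the link count in the inode by 1.

Hard links can only be created between files on the same file system, and cannot be created for directories. If the source file of a hard link is deleted, the hard link still exists and keeps the original content, which offers some protection against deleting a file by mistake. Since hard links are simply files with the same inode number and different names, deleting one of them does not affect the others.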
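The behavior described above can be observed from Python with os.link in a temporary directory. A minimal sketch; the file names are arbitrary, and it assumes a file system that supports hard links.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "source-file")
    dst = os.path.join(d, "target-file")
    with open(src, "w") as f:
        f.write("hello\n")

    os.link(src, dst)                                    # same effect as: ln src dst
    same_inode = os.stat(src).st_ino == os.stat(dst).st_ino
    links_after_link = os.stat(src).st_nlink             # link count grew to 2

    os.remove(src)                                       # drop one name
    with open(dst) as f:
        content = f.read()                               # data is still reachable
    links_after_remove = os.stat(dst).st_nlink

print(same_inode, links_after_link, repr(content), links_after_remove)
```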

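Soft links

A soft link is similar to a shortcut on Windows. It is a separate file of its own, and reading it is redirected to the file name it links to.

$ ln -s source-file-dir target-file-dir  # soft link or symbolic link

A soft link creates a new file with a new inode number, but its content depends on the other file it links to.

Soft links are mainly used in two ways:

For convenience, for example linking a file under a complicated path to a simple path so that users can reach it easily;
To work around a lack of space on a file system. For example, when a file system is full but a new directory holding many files must be created on it, a directory from another file system with more free space can be linked into it, which solves the space problem nicely.

Deleting a soft link does not affect the file it points to, but if the target file is deleted, the soft link becomes a dead link. A soft link refers to a pathname, not an inode, so if the target is deleted and then created again, the soft link works again.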
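A matching sketch for soft links, using os.symlink in a temporary directory. It shows that the link has its own inode, stores a pathname, dangles when the target is removed and works again once a file reappears under the same name. The file names are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "target")
    link = os.path.join(d, "link")
    with open(target, "w") as f:
        f.write("v1")

    os.symlink(target, link)                             # same effect as: ln -s target link
    different_inode = os.lstat(link).st_ino != os.stat(target).st_ino
    stored_path = os.readlink(link)                      # the link stores a pathname

    os.remove(target)
    dangling = os.path.lexists(link) and not os.path.exists(link)

    with open(target, "w") as f:                         # recreate a file at the same path
        f.write("v2")
    with open(link) as f:
        revived = f.read()

print(different_inode, stored_path == target, dangling, revived)
```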

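Characteristics of inodes

Because inode numbers are separate from file names, Unix/Linux systems show a few characteristic behaviors:

A file whose name contains special characters may be impossible to delete in the normal way. It can then be deleted by searching for its inode (for example with find -inum);
Moving or renaming a file only changes the file name and does not affect the inode number;
Once a file has been opened, the system identifies it by inode number and no longer cares about the file name.

This makes software updates simple: a program can be updated without shutting it down and without a reboot, because the system identifies a running file by inode number rather than by file name. During the update the new version is written under the same file name with a new inode, which does not disturb the running file. The next time the program is started, the file name refers to the new version, and the inode of the old version is reclaimed.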
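The update scenario can be simulated with an open file handle standing in for the running program. A sketch under that assumption; app.bin is an arbitrary name, and os.replace is used to point the name at the new inode.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "app.bin")
    with open(path, "w") as f:
        f.write("old version")

    running = open(path)                      # stands in for a running program
    old_ino = os.fstat(running.fileno()).st_ino

    new_path = path + ".new"
    with open(new_path, "w") as f:
        f.write("new version")
    os.replace(new_path, path)                # the name now refers to a new inode
    new_ino = os.stat(path).st_ino

    still_old = running.read()                # the open handle keeps the old inode
    running.close()
    with open(path) as f:
        now_new = f.read()                    # a fresh open sees the new version

print(old_ino != new_ino, still_old, now_new)
```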

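There is a failure mode called inode exhaustion: the disk still has plenty of free space, yet the system reports: No space left on device. How to deal with inode exhaustion is a topic of its own.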

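Root Directory

Linux file systems use a hierarchical directory tree, so every file must live in some directory, and every file system needs a root from which all files and subdirectories descend. Where is the root of an ext partition? Inode 2.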

The second entry of the Inode table contains the inode pointing to the data of the root directory; as defined by the EXT2_ROOT_INO constant.

In revision 0 directories could only be stored in a linked list. Revision 1 and later introduced indexed directories. The indexed directory is backward compatible with the linked list directory; this is achieved by inserting empty directory entry records to skip over the hash indexes.

Inode 2 is always in group 0. Use group 0's descriptor to find the block that holds the inode table, then look at the content of inode 2:

$ sudo xxd -a -u -s$((4096*1027+256*1)) -l256 /dev/sdc2
00403100: ED41 0000 0010 0000 0DF6 9864 0BF6 9864  .A.........d...d
00403110: 0BF6 9864 0000 0000 0000 0300 0800 0000  ...d............
00403120: 0000 0000 1000 0000 0206 0000 0000 0000  ................
00403130: 0000 0000 0000 0000 0000 0000 0000 0000  ................
*
00403180: 2000 0000 28D4 A86F 28D4 A86F 20B1 74EA   ...(..o(..o .t.
00403190: B8EA 8F64 0000 0000 0000 0000 0000 0000  ...d............
004031a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
*
004031f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

The first 16 bits are i_mode, whose value says this is a directory. It has a single data block, with id 1538.

$ python -c 'print(bin(0x41ed)[2:].rjust(16,"0"))'
0100000111101101

Look at block 1538:

$ sudo xxd -a -u -s$((4096*1538)) -l256 /dev/sdc2
00602000: 0200 0000 0C00 0102 2E00 0000 0200 0000  ................
00602010: 2000 0202 2E2E 0000 0000 0000 0000 0000   ...............
00602020: 0000 0000 0000 0000 0000 0000 0F00 0000  ................
00602030: 0C00 0301 6162 6300 0C00 0000 1400 0905  ....abc.........
00602040: 6E61 6D65 6470 6970 6500 0000 A1B7 1000  namedpipe.......
00602050: B40F 0602 6961 6D64 6972 0000 0000 0000  ....iamdir......
00602060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
*
006020f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

This is the root directory:

$ ll /mnt/u2
total 12K
drwxr-xr-x. 3 root root 4.0K Jun 26 10:20 .
drwxr-xr-x. 1 root root   20 Jun 12 15:54 ..
-rw-r--r--. 1 root root   11 Jun 21 16:29 abc
drwxr-xr-x. 2 root root 4.0K Jun 26 10:20 iamdir
prw-r--r--. 1 root root    0 Jun 26 10:20 namedpipe

Structure of a directory entry:

/*
 * The new version of the directory entry.  Since EXT2 structures are
 * stored in intel byte order, and the name_len field could never be
 * bigger than 255 chars, it's safe to reclaim the extra byte for the
 * file_type field.
 */
struct ext2_dir_entry_2 {
        __le32  inode;                  /* Inode number */
        __le16  rec_len;                /* Directory entry length */
        __u8    name_len;               /* Name length */
        __u8    file_type;
        char    name[];                 /* File name, up to EXT2_NAME_LEN */
};

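rec_len (record length) is the length of the whole entry. It is a multiple of 4, may be larger than strictly needed but never smaller, and an entry may not span blocks.
file_type takes the following values: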

0   Unknown
1   Regular File
2   Directory
3   Character Device
4   Block Device
5   Named pipe
6   Socket
7   Symbolic Link

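Now parse the root directory dumped above:

0x00000002: inode 2
0x000C: 12 bytes
0x01: name length, 1 byte, for .
0x02: file type, directory
2Eh: ASCII code of .
0x000000: these 3 bytes are unused padding
0x00000002: inode 2
0x0020: 32 bytes
0x02: name length, 2 bytes, for ..
0x02: file type, directory
2E2Eh: ASCII codes of ..
followed by 22 bytes of padding, all zeros

The .. entry is a bit special. This is already the root of the partition, so the entry points back to inode 2. The real parent path belongs to another partition, that is, a different device, and that special handling cannot be seen from the block data alone. Moving on: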

0x0000000F: inode 15
0x000C: 12 bytes
0x03: 3-byte name
0x01: regular file
61626300h: abc plus 1 byte of padding
0x0000000C: inode 12
0x0014: 20 bytes
0x09: 9-byte name
0x05: Named Pipe
6E61 6D65 6470 6970 6500 0000h: name plus padding

Special files such as a named pipe occupy no data block, but they do occupy an entry in the inode table.

0x0010B7A1: inode 1095585
0x0FB4: 4020 bytes
0x06: 6-byte name
0x02: directory
...

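The striking inode number 1095585 shows that Linux does not hand out inodes sequentially on an ext file system; it follows its own allocation logic. Here (1095585-1)/8176 = 134 exactly, so this is the first inode of block group 134: the new directory was placed in a different group from its parent. The unusual 4020 bytes mark this as the last entry of the data block: 12+32+12+20+4020 = 4096, so its rec_len runs to the end of the block. The padding is zero here, but the length of a name is given by name_len, so a name does not depend on a terminating zero.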
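The manual walk through the block can be automated by following rec_len from entry to entry. A sketch: the first 96 bytes are copied from the dump of block 1538 and zero-filled to 4096 bytes, the field layout follows ext2_dir_entry_2, and the short type names are chosen for this example.

```python
import struct

head = bytes.fromhex(
    "0200 0000 0C00 0102 2E00 0000 0200 0000"
    "2000 0202 2E2E 0000 0000 0000 0000 0000"
    "0000 0000 0000 0000 0000 0000 0F00 0000"
    "0C00 0301 6162 6300 0C00 0000 1400 0905"
    "6E61 6D65 6470 6970 6500 0000 A1B7 1000"
    "B40F 0602 6961 6D64 6972 0000 0000 0000"
)
block = head + bytes(4096 - len(head))

FILE_TYPES = {0: "unknown", 1: "regular", 2: "directory", 3: "chardev",
              4: "blockdev", 5: "fifo", 6: "socket", 7: "symlink"}

def parse_dir_block(block):
    """Return a list of (inode, rec_len, type, name) for the entries in one block."""
    entries, pos = [], 0
    while pos < len(block):
        inode, rec_len, name_len, file_type = struct.unpack_from("<IHBB", block, pos)
        if rec_len < 8:          # shorter than the fixed header: stop on corrupt data
            break
        if inode != 0:           # inode 0 marks an unused entry
            name = block[pos + 8:pos + 8 + name_len].decode("latin-1")
            entries.append((inode, rec_len, FILE_TYPES.get(file_type, "?"), name))
        pos += rec_len           # rec_len always leads to the next entry
    return entries

entries = parse_dir_block(block)
for entry in entries:
    print(entry)
```

The record lengths add up to 4096, which is why the last entry carries the large rec_len of 4020.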

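Resolving a path means finding the inode of each path component level by level, then reading that inode's list of data blocks, which yields the content of the file.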

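e2compr

While studying the ext file systems I came across this project, which compresses and decompresses regular files on the fly. Later I saw a reply online saying that it is dead and never made it into mainline. The reply contains some further information: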

As mentioned in the comments already, e2compr unfortunately never made it into the mainline kernel. It touched too many other parts of the kernel to ever really be clean enough to merge in. Also, the way it was written didn't allow it to support journals. This latter issue was really the kiss of death for it, as everything with the ext filesystem was moving towards journaling.

The actual old ext2 filesystem doesn't even exist in the kernel any more. It is all the ext4 code which mounts ext2 filesystems in a sort of compatibility mode. Compression would have to be written in from scratch at this point. Corporate support wasn't there for this kind of rewrite, instead it went to other filesystems.

Your options for on-the-fly compression are ZFS and Btrfs. Both of them are Copy-on-Write filesystems. This means when you copy a file, it doesn't take up any extra space until you write to it and change it from the original. This makes creating snapshots quite easy. It also can lead to file fragmentation, since when you change a portion of a large file, that portion doesn't overwrite the old part, it is written somewhere new and the old part is marked free. Large files which see lots of small changes, like database files, can suffer from bad fragmentation.

The dumpe2fs command

This command lays out all the metadata of an ext file system at a glance; what it shows is exactly what this article has been analysing.

$ sudo dumpe2fs /dev/sdc2 | less

The stat command

The information shown by stat maps fairly well onto the first part of the inode structure.

$ stat <filename|dirname>

$ stat abc
  File: abc
  Size: 11              Blocks: 8          IO Block: 4096   regular file
Device: 8,34    Inode: 15          Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:unlabeled_t:s0
Access: 2023-06-21 16:29:11.071001996 +0800
Modify: 2023-06-21 16:29:11.071001996 +0800
Change: 2023-06-21 16:29:11.088002381 +0800
 Birth: 2023-06-21 16:29:11.071001996 +0800

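Size: file size in bytes.
Blocks: a count in units of 512 bytes, corresponding to the i_blocks field.
IO Block: the block size of this ext2 partition.
Access (the permission line): decoded from the value of i_mode.
Device: 8,34 are the major and minor numbers of this partition at the driver level; ll /dev shows them.
Links: 1, the number of hard links.
Context: related to the content in the second half of the 256-byte inode.
Inode: 15, the inode number.

-f shows information about the file system instead, but a file name is still required to locate the file system.

-t prints the information in terse form.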
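The same fields are available programmatically through os.stat. A sketch that creates a temporary 11-byte file (the same content as the test file abc) and reads the attributes that correspond to the stat output above. The values for blocks, io_block, inode and device depend on the machine and file system.

```python
import os
import stat
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"1234567890\n")
    f.flush()
    st = os.stat(f.name)

info = {
    "size": st.st_size,                    # Size: bytes
    "blocks_512": st.st_blocks,            # Blocks: units of 512 bytes
    "io_block": st.st_blksize,             # IO Block
    "inode": st.st_ino,                    # Inode
    "links": st.st_nlink,                  # Links
    "mode": stat.filemode(st.st_mode),     # Access, decoded from the mode bits
    "device": (os.major(st.st_dev), os.minor(st.st_dev)),  # Device major, minor
}
for key, value in info.items():
    print(f"{key:10} {value}")
```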

Permalink: https://cs.pynote.net/hd/hdisk/202306132/
