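In Linux, "everything is a file." Linux supports many file systems, such as ext2/3/4 and XFS. To support all of them, the kernel adds an abstraction layer, the Virtual File System (VFS), which adapts flexibly to each concrete file system implementation. Its basic architecture is as follows:
As the architecture shows, every VFS operation runs in kernel mode. Access to storage and external devices is extremely complex, and handing it to user programs would make the system very unstable.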
File system types
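- Disk-based file systems
The classic way to store files on non-volatile media. These are the file systems most of us know, such as ext2/3/4 and FAT.
- Virtual file systems
Generated inside the kernel, they give user-space programs a way to exchange information with the kernel. The best known is the proc file system. It needs no storage on any hardware device: all of its information lives in memory, and the entries for a process disappear when that process exits.
- Network file systems
These give access to data stored on other computers. The local kernel still handles the system call, but it forwards the request over the network to the remote machine that holds the data. Some network file systems, such as NFS, are implemented in the kernel, while others are mounted through FUSE.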
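To see which kind of file system sits behind a path, user space can ask the kernel for its magic number. The sketch below assumes Linux and uses statfs(2); the helper name `fs_magic` is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/vfs.h>   /* statfs(2), Linux-specific */

/* Return the file system magic number (f_type) of the file system
 * that contains `path`, or -1 if statfs() fails.
 * `fs_magic` is a hypothetical helper name, not a library function. */
long fs_magic(const char *path)
{
    struct statfs st;

    assert(path != NULL);
    if (statfs(path, &st) != 0)
        return -1;
    return (long)st.f_type;
}
```

On a typical system, `fs_magic("/proc")` should match `PROC_SUPER_MAGIC` (0x9fa0) and an ext2/3/4 mount should report 0xEF53; both constants are defined in `<linux/magic.h>`.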
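Common file model
The VFS defines a set of methods and abstractions, plus a uniform view of the objects (files) in a file system, but the implementations behind them differ greatly. The interface it offers is a generic superset: many of its operations are unnecessary in some subsystems, for example the writepage operation in the proc file system.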
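The "generic superset" idea can be sketched as a table of function pointers in which a file system leaves unneeded slots empty. Everything in this block (`toy_ops`, `toy_vfs_write_page`, and the two tables) is a simplified, hypothetical model and not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified operations table (not the kernel's). */
struct toy_ops {
    int (*read_page)(int page_no);
    int (*write_page)(int page_no);   /* may be NULL: not supported */
};

static int disk_read_page(int page_no)  { return page_no >= 0 ? 0 : -EIO; }
static int disk_write_page(int page_no) { return page_no >= 0 ? 0 : -EIO; }
static int proc_read_page(int page_no)  { return page_no >= 0 ? 0 : -EIO; }

/* A disk-like file system fills in every slot... */
const struct toy_ops toy_disk_ops = { disk_read_page, disk_write_page };
/* ...while a proc-like one leaves write_page empty. */
const struct toy_ops toy_proc_ops = { proc_read_page, NULL };

/* Generic layer: dispatch through the table and reject missing ops. */
int toy_vfs_write_page(const struct toy_ops *ops, int page_no)
{
    assert(ops != NULL);
    if (ops->write_page == NULL)
        return -EINVAL;
    return ops->write_page(page_no);
}
```

The generic layer never needs to know which file system it is talking to; it only checks whether the slot is filled in before calling through it.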
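When working with files, kernel space and user space use different objects. In user space a file is identified by a file descriptor (FD), an integer that is valid only within a single process; two different processes can use the same FD number to refer to different files. The kernel-space structure behind an FD is struct file. One of its main members is address_space, the structure that actually exchanges data with the underlying device. File metadata is managed by a separate structure, the inode, which stores the link count, access times, version, backing device, owning superblock, and so on, but not the file name. The name is stored in struct dentry, because names exist to index and manage inodes, and that is the dentry's job. Dentries in turn are reached through the super_block.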
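The split between an fd and the kernel's struct file can be observed from user space: descriptors created with dup() share one open file (and therefore one offset), while a second open() of the same path gets its own. This is a small sketch under that assumption; `fd_offset_demo` and the temporary file name are made up for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes 5 bytes through one descriptor, then reports the file offset
 * seen by a dup()ed descriptor and by an independently opened one.
 * Returns 0 on success, -1 on any failure. */
int fd_offset_demo(off_t *dup_off, off_t *reopen_off)
{
    char path[] = "/tmp/vfs_fd_demo_XXXXXX";
    int fd, fd_dup, fd_new, rc = -1;

    assert(dup_off != NULL && reopen_off != NULL);
    fd = mkstemp(path);                 /* first open file */
    if (fd < 0)
        return -1;
    fd_dup = dup(fd);                   /* same open file, new fd number */
    fd_new = open(path, O_RDONLY);      /* second, independent open file */
    if (fd_dup >= 0 && fd_new >= 0 && write(fd, "hello", 5) == 5) {
        *dup_off = lseek(fd_dup, 0, SEEK_CUR);
        *reopen_off = lseek(fd_new, 0, SEEK_CUR);
        rc = 0;
    }
    if (fd_new >= 0)
        close(fd_new);
    if (fd_dup >= 0)
        close(fd_dup);
    close(fd);
    unlink(path);
    return rc;
}
```

After the call, the dup()ed descriptor reports offset 5 and the separately opened one reports 0, even though all three descriptors refer to the same inode.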
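Below we look at each structure and the relationships between them, and then walk through what happens in Linux from opening a file to writing to it.
VFS structures
inode
The inode manages a file's metadata, including permissions, access information, link information, and storage device information. Its operations mainly cover links, permissions, and similar tasks.
For more background, see inode. Its data structure is as follows: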
/*
* Keep mostly read-only and often accessed (especially for
* the RCU path lookup and 'stat' data) fields at the beginning
* of the 'struct inode'
*/
struct inode {
...
const struct inode_operations *i_op; // inode operations, specific to the file system
struct super_block *i_sb; // owning superblock
struct address_space *i_mapping; // address space: the part that actually talks to the device
...
/* Stat data, not accessed from path walking */
unsigned long i_ino; // inode number
/*
* Filesystems may only read i_nlink directly. They shall use the
* following functions for modification:
*
* (set|clear|inc|drop)_nlink
* inode_(inc|dec)_link_count
*/
union {
const unsigned int i_nlink;
unsigned int __i_nlink;
};
dev_t i_rdev;
loff_t i_size;
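struct timespec64 i_atime; // last access time
struct timespec64 i_mtime; // last modification time (file data)
struct timespec64 i_ctime; // last status change time (inode metadata), not creation time
spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short i_bytes; // bytes used beyond the full blocks counted in i_blocks
u8 i_blkbits; // log2 of the block size
u8 i_write_hint;
blkcnt_t i_blocks; // number of 512-byte blocks the file occupies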
#ifdef __NEED_I_SIZE_ORDERED
seqcount_t i_size_seqcount;
#endif
/* Misc */
unsigned long i_state;
struct rw_semaphore i_rwsem;
unsigned long dirtied_when; /* jiffies of first dirtying */
unsigned long dirtied_time_when;
struct hlist_node i_hash;
struct list_head i_io_list; /* backing dev IO list */
#ifdef CONFIG_CGROUP_WRITEBACK
struct bdi_writeback *i_wb; /* the associated cgroup wb */
/* foreign inode detection, see wbc_detach_inode() */
int i_wb_frn_winner;
u16 i_wb_frn_avg_time;
u16 i_wb_frn_history;
#endif
struct list_head i_lru; /* inode LRU list */
struct list_head i_sb_list;
struct list_head i_wb_list; /* backing dev writeback list */
union {
struct hlist_head i_dentry; // one inode may be referenced by several dentries (hard links)
struct rcu_head i_rcu;
};
atomic64_t i_version;
atomic_t i_count;
atomic_t i_dio_count;
atomic_t i_writecount;
#ifdef CONFIG_IMA
atomic_t i_readcount; /* struct files open RO */
#endif
const struct file_operations *i_fop; /* former ->i_op->default_file_ops */
struct file_lock_context *i_flctx;
struct address_space i_data;
struct list_head i_devices;
union {
struct pipe_inode_info *i_pipe; // pipe
struct block_device *i_bdev; // block device
struct cdev *i_cdev; // character device
char *i_link; // cached symlink target
unsigned i_dir_seq; // directory sequence counter, used by parallel lookups
};
__u32 i_generation;
#ifdef CONFIG_FSNOTIFY
__u32 i_fsnotify_mask; /* all events this inode cares about */
struct fsnotify_mark_connector __rcu *i_fsnotify_marks;
#endif
#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
struct fscrypt_info *i_crypt_info;
#endif
void *i_private; /* fs or device private pointer */
} __randomize_layout;
struct inode_operations {
struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); // look up the name held in the dentry inside the directory inode
const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *); // return the target of a symbolic link so the VFS can follow it
int (*permission) (struct inode *, int);
struct posix_acl * (*get_acl)(struct inode *, int);
int (*readlink) (struct dentry *, char __user *,int);
int (*create) (struct inode *,struct dentry *, umode_t, bool);
int (*link) (struct dentry *,struct inode *,struct dentry *); // create a hard link
int (*unlink) (struct inode *,struct dentry *); // remove a hard link
int (*symlink) (struct inode *,struct dentry *,const char *); // create a symbolic link
int (*mkdir) (struct inode *,struct dentry *,umode_t); // create a directory from the mode and the name in the dentry, and allocate its inode
int (*rmdir) (struct inode *,struct dentry *); // remove a directory
int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t); // create a special file (device node, FIFO, or socket)
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *, unsigned int); // VFS to move the file specified by old_dentry from the old_dir directory to the directory new_dir, with the filename specified by new_dentry
int (*setattr) (struct dentry *, struct iattr *);
int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
int (*update_time)(struct inode *, struct timespec64 *, int);
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode);
int (*tmpfile) (struct inode *, struct dentry *, umode_t);
int (*set_acl)(struct inode *, struct posix_acl *, int);
} ____cacheline_aligned;
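The statement that an inode carries the metadata but not the name can be checked from user space: two hard links are two names for one inode. The sketch below assumes a writable /tmp that supports hard links; `hard_link_demo` and the file names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a temporary file plus a hard link to it and checks that both
 * names resolve to one inode. Returns the link count seen through the
 * second name (2 if everything worked), or -1 on failure. */
long hard_link_demo(void)
{
    char path[] = "/tmp/vfs_inode_demo_XXXXXX";
    char link_path[64];
    struct stat a, b;
    long nlink = -1;
    int fd;

    assert(sizeof(link_path) > sizeof(path) + 4);  /* room for ".lnk" */
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    close(fd);
    snprintf(link_path, sizeof(link_path), "%s.lnk", path);
    if (link(path, link_path) == 0) {              /* second name, same inode */
        if (stat(path, &a) == 0 && stat(link_path, &b) == 0 &&
            a.st_ino == b.st_ino && a.st_dev == b.st_dev)
            nlink = (long)b.st_nlink;              /* user-space view of i_nlink */
        unlink(link_path);
    }
    unlink(path);
    return nlink;
}
```

Here `st_ino` and `st_nlink` are the user-space views of `i_ino` and `i_nlink`; the two names exist only as directory entries.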
dentry
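A dentry mainly manages a file name and links a directory entry to all of its child entries.
dentry states
A dentry can be in one of three states: used, unused, or negative.
used: associated with a valid inode and currently in use (reference count greater than 0)
unused: associated with a valid inode, but the reference count is 0; it has not actually been freed yet
negative: no inode to associate with, either because the file was deleted or because no such file exists on the storage device
dentry cache
Resolving a path to its dentry by reading from disk every time would be expensive, so the kernel keeps an LRU cache to speed up lookups. For example, to find the directory entry for /usr/bin/java, the kernel first finds the entry for /, then /usr, then /usr/bin, and so on until it reaches the end of the path. Every entry touched along the way is cached for later lookups. The lookup process itself is analyzed in detail in the look_up part below.
For more details, see dentry.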
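The component-by-component walk described above can be modeled with a tiny in-memory tree. This is only a sketch: `toy_dentry`, `toy_lookup_one`, and `toy_path_walk` are hypothetical names, and the real kernel walk also handles hashing, mounts, symlinks, and locking.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CHILDREN 4

/* Hypothetical, heavily simplified stand-in for struct dentry. */
struct toy_dentry {
    const char *name;
    struct toy_dentry *children[TOY_MAX_CHILDREN];
};

/* Find the child of `dir` whose name equals the `len`-byte component. */
static struct toy_dentry *toy_lookup_one(struct toy_dentry *dir,
                                         const char *name, size_t len)
{
    for (int i = 0; i < TOY_MAX_CHILDREN; i++) {
        struct toy_dentry *c = dir->children[i];

        if (c && strlen(c->name) == len && memcmp(c->name, name, len) == 0)
            return c;
    }
    return NULL;
}

/* Walk an absolute path one component at a time, starting at `root`.
 * Returns the final dentry, or NULL if any component is missing. */
struct toy_dentry *toy_path_walk(struct toy_dentry *root, const char *path)
{
    struct toy_dentry *cur = root;

    assert(root != NULL && path != NULL && path[0] == '/');
    while (cur && *path) {
        while (*path == '/')
            path++;                        /* skip separators */
        size_t len = strcspn(path, "/");   /* length of the next component */
        if (len == 0)
            break;                         /* trailing slash or bare "/" */
        cur = toy_lookup_one(cur, path, len);
        path += len;
    }
    return cur;
}
```

With a tree of `/` → `usr` → `bin` → `java`, walking `/usr/bin/java` visits each intermediate entry in turn, which is exactly the set of entries the dentry cache would keep around.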
The dentry data structure is as follows:
struct dentry {
/* RCU lookup touched fields */
unsigned int d_flags; /* protected by d_lock */
seqcount_t d_seq; /* per dentry seqlock */
struct hlist_bl_node d_hash; /* lookup hash list */
struct dentry *d_parent; /* parent directory */
struct qstr d_name;
struct inode *d_inode; /* Where the name belongs to - NULL is
* negative */
unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */
/* Ref lookup also touches following */
struct lockref d_lockref; /* per-dentry lock and refcount */
const struct dentry_operations *d_op;
struct super_block *d_sb; /* The root of the dentry tree */
unsigned long d_time; /* used by d_revalidate */
void *d_fsdata; /* fs-specific data */
union {
struct list_head d_lru; /* LRU list */
wait_queue_head_t *d_wait; /* in-lookup ones only */
};
struct list_head d_child; /* child of parent list */
struct list_head d_subdirs; /* our children */
/*
* d_alias and d_rcu can share memory
*/
union {
struct hlist_node d_alias; /* inode alias list */
struct hlist_bl_node d_in_lookup_hash; /* only for in-lookup ones */
struct rcu_head d_rcu;
} d_u;
} __randomize_layout;
struct dentry_operations {
int (*d_revalidate)(struct dentry *, unsigned int); // check whether the dentry is still valid
int (*d_weak_revalidate)(struct dentry *, unsigned int);
int (*d_hash)(const struct dentry *, struct qstr *); // compute the hash of the dentry name
int (*d_compare)(const struct dentry *, // compare file names
unsigned int, const char *, const struct qstr *);
int (*d_delete)(const struct dentry *);
// called when the last reference is dropped; the return value decides whether the dentry stays cached (unused) or is deleted
int (*d_init)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_prune)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *); // called when the dentry loses its inode, to release that inode
char *(*d_dname)(struct dentry *, char *, int);
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
struct dentry *(*d_real)(struct dentry *, const struct inode *);
} ____cacheline_aligned;
super_block
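The superblock manages the parameters of the actual file system behind a mount point, including the block size, the maximum file size the file system can handle, the file system type, and the backing storage device. (Note: in the overall structure diagram above, the superblock has a files member pointing to all open files, but I could not find it in the structure below. That member used to be consulted during umount to make sure every file had been closed. I do not know how newer kernels handle this; I will add it once I find out.)
Superblock management lives mostly in the mount logic, which will be analyzed in detail in the article on mounting. The superblock's main job is to manage inodes.
For details, see superblock.
Its data structure is as follows: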
struct super_block {
struct list_head s_list; /* Keep this first */
dev_t s_dev; /* search index; _not_ kdev_t */
unsigned char s_blocksize_bits; // log2(block size in bytes)
unsigned long s_blocksize; // block size in bytes
loff_t s_maxbytes; /* Max file size */
struct file_system_type *s_type; // file system type
const struct super_operations *s_op; // superblock operations
const struct dquot_operations *dq_op;
const struct quotactl_ops *s_qcop;
const struct export_operations *s_export_op;
unsigned long s_flags;
unsigned long s_iflags; /* internal SB_I_* flags */
unsigned long s_magic;
struct dentry *s_root; // root dentry; path lookups within this file system start here
struct rw_semaphore s_umount;
int s_count;
atomic_t s_active;
#ifdef CONFIG_SECURITY
void *s_security;
#endif
const struct xattr_handler **s_xattr;
#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
const struct fscrypt_operations *s_cop;
#endif
struct hlist_bl_head s_roots; /* alternate root dentries for NFS */
struct list_head s_mounts; /* list of mounts; _not_ for fs use */
struct block_device *s_bdev;
struct backing_dev_info *s_bdi;
struct mtd_info *s_mtd;
struct hlist_node s_instances;
unsigned int s_quota_types; /* Bitmask of supported quota types */
struct quota_info s_dquot; /* Diskquota specific options */
struct sb_writers s_writers;
/*
* Keep s_fs_info, s_time_gran, s_fsnotify_mask, and
* s_fsnotify_marks together for cache efficiency. They are frequently
* accessed and rarely modified.
*/
void *s_fs_info; /* Filesystem private info */
/* Granularity of c/m/atime in ns (cannot be worse than a second) */
u32 s_time_gran;
#ifdef CONFIG_FSNOTIFY
__u32 s_fsnotify_mask;
struct fsnotify_mark_connector __rcu *s_fsnotify_marks;
#endif
char s_id[32]; /* Informational name */
uuid_t s_uuid; /* UUID */
unsigned int s_max_links;
fmode_t s_mode;
/*
* The next field is for VFS *only*. No filesystems have any business
* even looking at it. You had been warned.
*/
struct mutex s_vfs_rename_mutex; /* Kludge */
/*
* Filesystem subtype. If non-empty the filesystem type field
* in /proc/mounts will be "type.subtype"
*/
char *s_subtype;
const struct dentry_operations *s_d_op; /* default d_op for dentries */
/*
* Saved pool identifier for cleancache (-1 means none)
*/
int cleancache_poolid;
struct shrinker s_shrink; /* per-sb shrinker handle */
/* Number of inodes with nlink == 0 but still referenced */
atomic_long_t s_remove_count;
/* Pending fsnotify inode refs */
atomic_long_t s_fsnotify_inode_refs;
/* Being remounted read-only */
int s_readonly_remount;
/* AIO completions deferred from interrupt context */
struct workqueue_struct *s_dio_done_wq;
struct hlist_head s_pins;
/*
* Owning user namespace and default context in which to
* interpret filesystem uids, gids, quotas, device nodes,
* xattrs and security labels.
*/
struct user_namespace *s_user_ns;
/*
* The list_lru structure is essentially just a pointer to a table
* of per-node lru lists, each of which has its own spinlock.
* There is no need to put them into separate cachelines.
*/
struct list_lru s_dentry_lru; // dentry cache (LRU)
struct list_lru s_inode_lru; // inode cache (LRU)
struct rcu_head rcu;
struct work_struct destroy_work;
struct mutex s_sync_lock; /* sync serialisation lock */
/*
* Indicates how deep in a filesystem stack this SB is
*/
int s_stack_depth;
/* s_inode_list_lock protects s_inodes */
spinlock_t s_inode_list_lock ____cacheline_aligned_in_smp;
struct list_head s_inodes; /* all inodes */
spinlock_t s_inode_wblist_lock;
struct list_head s_inodes_wb; /* writeback inodes */
} __randomize_layout;
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb); // allocate an inode for this sb
void (*destroy_inode)(struct inode *); // free an inode of this sb
void (*dirty_inode) (struct inode *, int flags); // mark the inode dirty
int (*write_inode) (struct inode *, struct writeback_control *wbc);// write the inode back
int (*drop_inode) (struct inode *); // called when the last reference to the inode is dropped
void (*evict_inode) (struct inode *);
void (*put_super) (struct super_block *); // release the sb on unmount
int (*sync_fs)(struct super_block *sb, int wait);
int (*freeze_super) (struct super_block *);
int (*freeze_fs) (struct super_block *);
int (*thaw_super) (struct super_block *);
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *); // query file system statistics
int (*remount_fs) (struct super_block *, int *, char *); // remount
void (*umount_begin) (struct super_block *); // mainly used by NFS
// display and query helpers
int (*show_options)(struct seq_file *, struct dentry *);
int (*show_devname)(struct seq_file *, struct dentry *);
int (*show_path)(struct seq_file *, struct dentry *);
int (*show_stats)(struct seq_file *, struct dentry *);
#ifdef CONFIG_QUOTA
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
struct dquot **(*get_dquots)(struct inode *);
#endif
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
long (*nr_cached_objects)(struct super_block *,
struct shrink_control *);
long (*free_cached_objects)(struct super_block *,
struct shrink_control *);
};
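Some superblock parameters are visible from user space through statvfs(3). The sketch below also shows how a bits field relates to a block size (`s_blocksize_bits` is log2 of `s_blocksize`). The helper names are made up, and `f_bsize` is only assumed to match the superblock block size, which is usual but not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/* log2 of a power-of-two block size, modeling how s_blocksize_bits
 * relates to s_blocksize. Returns -1 if size is not a power of two. */
int blocksize_bits(unsigned long size)
{
    int bits = 0;

    if (size == 0 || (size & (size - 1)) != 0)
        return -1;
    while ((size >>= 1) != 0)
        bits++;
    return bits;
}

/* Block size reported for the file system containing `path`,
 * or 0 if statvfs() fails. */
unsigned long fs_block_size(const char *path)
{
    struct statvfs sv;

    assert(path != NULL);
    if (statvfs(path, &sv) != 0)
        return 0;
    return sv.f_bsize;
}
```

For a file system with 4096-byte blocks, `blocksize_bits(fs_block_size("/"))` would be 12.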
address_space
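As noted earlier, the superblock manages inodes, the dentry manages file names, the name-to-inode mapping, and directories, and the inode manages file metadata. But what actually moves file data to and from storage devices such as disks? How does written data get from memory back to disk, and how is data read from disk into memory? That is the job of address_space: it synchronizes data between memory and the backing device. How it works is covered in detail in the article on the memory cache.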
struct address_space {
struct inode *host; // owning inode, used to reach the file metadata
struct xarray i_pages; // cached pages of the file
gfp_t gfp_mask; // memory allocation flags for the pages
atomic_t i_mmap_writable; // count of VM_SHARED mappings
struct rb_root_cached i_mmap; // tree of private and shared mmap mappings
struct rw_semaphore i_mmap_rwsem;
unsigned long nrpages; // number of pages currently cached
unsigned long nrexceptional;
pgoff_t writeback_index; // writeback starts here
const struct address_space_operations *a_ops; // address space operations
unsigned long flags; // error bits and flags
errseq_t wb_err; // most recent writeback error
spinlock_t private_lock;
struct list_head private_list;
void *private_data;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;
struct address_space_operations {
int (*writepage)(struct page *page, struct writeback_control *wbc); // write one page back
int (*readpage)(struct file *, struct page *); // read one page into memory
/* Write back some dirty pages from this mapping. */
int (*writepages)(struct address_space *, struct writeback_control *); // write dirty pages back
/* Set a page dirty. Return true if this dirtied it */
int (*set_page_dirty)(struct page *page); // mark a page dirty
/*
* Reads in the requested pages. Unlike ->readpage(), this is
* PURELY used for read-ahead!.
*/
int (*readpages)(struct file *filp, struct address_space *mapping,
struct list_head *pages, unsigned nr_pages);
int (*write_begin)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata);
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidatepage) (struct page *, unsigned int, unsigned int);
int (*releasepage) (struct page *, gfp_t);
void (*freepage)(struct page *);
ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
/*
* migrate the contents of a page to the specified target. If
* migrate_mode is MIGRATE_ASYNC, it must not block.
*/
int (*migratepage) (struct address_space *,
struct page *, struct page *, enum migrate_mode);
bool (*isolate_page)(struct page *, isolate_mode_t);
void (*putback_page)(struct page *);
int (*launder_page) (struct page *);
int (*is_partially_uptodate) (struct page *, unsigned long,
unsigned long);
void (*is_dirty_writeback) (struct page *, bool *, bool *);
int (*error_remove_page)(struct address_space *, struct page *);
/* swapfile support */
int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
sector_t *span);
void (*swap_deactivate)(struct file *file);
};
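The role of address_space (cache pages in memory, mark them dirty, write them back later) can be illustrated with a toy model. All names here (`toy_mapping`, `toy_write`, `toy_writeback`) are hypothetical, the page size is shrunk to 8 bytes for readability, and a plain array stands in for the disk.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 8   /* tiny pages keep the example readable */
#define TOY_NR_PAGES  4

/* Hypothetical model: cached pages, a dirty flag per page, and a
 * byte array standing in for the backing device. */
struct toy_mapping {
    char pages[TOY_NR_PAGES][TOY_PAGE_SIZE];
    int  dirty[TOY_NR_PAGES];
    char disk[TOY_NR_PAGES * TOY_PAGE_SIZE];
};

/* Copy `len` bytes into the cache at `pos` and mark the touched pages
 * dirty. Nothing reaches `disk` here.
 * Returns the bytes written, or -1 if the range does not fit. */
int toy_write(struct toy_mapping *m, size_t pos, const char *buf, size_t len)
{
    assert(m != NULL && buf != NULL);
    if (pos > sizeof(m->disk) || len > sizeof(m->disk) - pos)
        return -1;
    for (size_t i = 0; i < len; i++) {
        size_t off = pos + i;

        m->pages[off / TOY_PAGE_SIZE][off % TOY_PAGE_SIZE] = buf[i];
        m->dirty[off / TOY_PAGE_SIZE] = 1;
    }
    return (int)len;
}

/* Flush every dirty page to `disk` and clear its flag.
 * Returns the number of pages written back. */
int toy_writeback(struct toy_mapping *m)
{
    int flushed = 0;

    assert(m != NULL);
    for (int i = 0; i < TOY_NR_PAGES; i++) {
        if (!m->dirty[i])
            continue;
        memcpy(m->disk + (size_t)i * TOY_PAGE_SIZE, m->pages[i], TOY_PAGE_SIZE);
        m->dirty[i] = 0;
        flushed++;
    }
    return flushed;
}
```

In this model `toy_write` plays the part of the write path that stops at the cache, and `toy_writeback` plays the part of writepages being triggered later (or immediately, when the user calls fsync).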
file
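As mentioned earlier, a process in user space sees an integer fd, while the corresponding kernel structure is file. Every user-space operation on an fd is translated by a system call into an operation on the file.
For more details, see file.
The relevant data structures are as follows: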
struct task_struct {
...
/* Filesystem information: */
struct fs_struct *fs; // root & pwd path
/* Open file information: */
struct files_struct *files; // opened files
/* Namespaces: */
struct nsproxy *nsproxy;
...
};
/*
* Open file table structure
*/
struct files_struct {
/*
* read mostly part
*/
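atomic_t count; // reference count (tasks sharing this table)
bool resize_in_progress; // set while the fd table is being resized
wait_queue_head_t resize_wait;
struct fdtable __rcu *fdt; // pointer to the fd table in use
struct fdtable fdtab; // embedded initial fd table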
/*
* written part on a separate cache line in SMP
*/
spinlock_t file_lock ____cacheline_aligned_in_smp;
unsigned int next_fd; // where the search for the next free fd starts
unsigned long close_on_exec_init[1];
unsigned long open_fds_init[1];
unsigned long full_fds_bits_init[1];
struct file __rcu * fd_array[NR_OPEN_DEFAULT]; // initial array of open files
};
struct fdtable {
unsigned int max_fds; // current capacity of the fd array (can grow up to the ulimit -n limit)
struct file __rcu **fd; /* current fd array */
unsigned long *close_on_exec;
unsigned long *open_fds; // bitmap of fds in use
unsigned long *full_fds_bits;
struct rcu_head rcu;
};
struct file {
union {
struct llist_node fu_llist;
struct rcu_head fu_rcuhead;
} f_u;
struct path f_path; // path
struct inode *f_inode; /* cached value */
const struct file_operations *f_op; // file operations
/*
* Protects f_ep_links, f_flags.
* Must not be taken from IRQ context.
*/
spinlock_t f_lock;
enum rw_hint f_write_hint;
atomic_long_t f_count;
unsigned int f_flags;
fmode_t f_mode;
struct mutex f_pos_lock;
loff_t f_pos; // current file position
struct fown_struct f_owner; // owner that receives signals for asynchronous I/O
const struct cred *f_cred;
struct file_ra_state f_ra;
u64 f_version;
#ifdef CONFIG_SECURITY
void *f_security;
#endif
/* needed for tty driver, and maybe others */
void *private_data;
#ifdef CONFIG_EPOLL
/* Used by fs/eventpoll.c to link all the hooks to this file */
struct list_head f_ep_links;
struct list_head f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping; // address space
errseq_t f_wb_err;
} __randomize_layout;
struct file_operations {
struct module *owner;
loff_t (*llseek) (struct file *, loff_t, int); // move the file position
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
int (*iterate) (struct file *, struct dir_context *);
int (*iterate_shared) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *); // map the file into virtual memory
unsigned long mmap_supported_flags;
int (*open) (struct inode *, struct file *); // called when the file is opened
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, loff_t, loff_t, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*flock) (struct file *, int, struct file_lock *); // lock a file
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **, void **);
long (*fallocate)(struct file *file, int mode, loff_t offset,
loff_t len);
void (*show_fdinfo)(struct seq_file *m, struct file *f);
#ifndef CONFIG_MMU
unsigned (*mmap_capabilities)(struct file *);
#endif
ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
loff_t, size_t, unsigned int);
loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags);
int (*fadvise)(struct file *, loff_t, loff_t, int);
} __randomize_layout;
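The fd table above (an array of file pointers plus the `open_fds` bitmap) can be modeled in a few lines. This is a hypothetical sketch with made-up names (`toy_fdtable`, `toy_fd_install`, `toy_fd_lookup`, `toy_fd_close`); it ignores RCU, resizing, and `next_fd`, and simply hands out the lowest free number.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FDS 64

/* Hypothetical model of a per-process fd table: a slot array plus a
 * bitmap of slots in use (compare `fd` and `open_fds` in struct fdtable). */
struct toy_fdtable {
    void *fd[TOY_MAX_FDS];          /* stands in for struct file * */
    unsigned long long open_fds;    /* bit n set => fd n is in use */
};

/* Install `file` in the lowest free slot; returns the fd, or -1 if full. */
int toy_fd_install(struct toy_fdtable *t, void *file)
{
    assert(t != NULL && file != NULL);
    for (int n = 0; n < TOY_MAX_FDS; n++) {
        if (!(t->open_fds & (1ULL << n))) {
            t->open_fds |= 1ULL << n;
            t->fd[n] = file;
            return n;
        }
    }
    return -1;
}

/* Translate an fd back to its file object; NULL if the fd is not open. */
void *toy_fd_lookup(const struct toy_fdtable *t, int n)
{
    assert(t != NULL);
    if (n < 0 || n >= TOY_MAX_FDS || !(t->open_fds & (1ULL << n)))
        return NULL;
    return t->fd[n];
}

/* Release an fd; returns 0 on success, -1 if it was not open. */
int toy_fd_close(struct toy_fdtable *t, int n)
{
    assert(t != NULL);
    if (toy_fd_lookup(t, n) == NULL)
        return -1;
    t->open_fds &= ~(1ULL << n);
    t->fd[n] = NULL;
    return 0;
}
```

Because the table is per process, the same number in two processes indexes two different arrays, which is why an fd is meaningful only inside its own process.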
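VFS in practice
We now have a basic picture of the VFS architecture, but it is still too vague for a deeper understanding. So let us operate on a file in pseudocode to describe how the Linux kernel reads and writes files, walking through the whole flow with a write as the example:
Goal: starting from nothing, write hello world to the file /testmount/testdir/testfile1.txt.
The basic sequence of system calls is: 1. mkdir 2. creat 3. open 4. write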
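The function calls behind mkdir are:
rootInode = sb->s_root->d_inode;
testDirDentry = dentry("testdir");
rootInode->i_op->mkdir(rootInode, testDirDentry, 0777); // returns an error code; the new inode is attached to the dentry
testDirInode = testDirDentry->d_inode;
The function calls behind creat are:
testFileDentry = dentry("testfile1.txt");
testDirInode->i_op->create(testDirInode, testFileDentry, 0777, true);
testFileInode = testFileDentry->d_inode;
The function calls behind the open system call are:
testfile->f_op->open(testFileInode, testfile); // testfile is the newly allocated struct file
The function calls behind the write system call are:
testfile->f_op->write(testfile, "hello world", len, &pos); // pos starts at 0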
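For comparison, the same four steps can be issued from user space with the real system calls. This is a minimal sketch: `write_hello` is a made-up helper, and `base` is a hypothetical existing, writable directory that stands in for /testmount.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates <base>/testdir/testfile1.txt and writes "hello world" to it.
 * Returns the number of bytes written, or -1 on failure. */
ssize_t write_hello(const char *base)
{
    static const char msg[] = "hello world";
    char dir[256], file[300];
    ssize_t n;
    int fd;

    assert(base != NULL);
    if (snprintf(dir, sizeof(dir), "%s/testdir", base) >= (int)sizeof(dir))
        return -1;
    snprintf(file, sizeof(file), "%s/testfile1.txt", dir);

    if (mkdir(dir, 0777) != 0 && errno != EEXIST)   /* 1. mkdir */
        return -1;
    fd = creat(file, 0644);                         /* 2. creat */
    if (fd < 0)
        return -1;
    close(fd);
    fd = open(file, O_WRONLY);                      /* 3. open  */
    if (fd < 0)
        return -1;
    n = write(fd, msg, strlen(msg));                /* 4. write */
    close(fd);
    return n;
}
```

Each numbered call enters the kernel, which resolves the path through dentries and then dispatches to the i_op or f_op method sketched in the pseudocode.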
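Detailed flow:
1. Suppose we have a block device /dev/sda and format it as an ext2 file system. How a block device is formatted will be described in the device management chapter.
2. We mount the disk at /testmount, so the kernel registers the corresponding superblock through the mount module. How mounting works is a story for next time.
3. To write /testmount/testdir/testfile1.txt, the kernel first has to look up the directory entries along the full path, and create the corresponding inode where one does not exist.
3.1 Use the full path to find the superblock of the matching mount point. Here the most precise match is the sb for /testmount.
3.2 From the sb, take its root dentry and the inode behind it, and read from disk through the inode's address_space. For a directory, the stored content is the list of its child entries, from which the children of the root dentry are built. There is no testdir entry, so a "no such directory" error is reported. The user then creates the directory, and the new information is written back to the device behind the inode. In the same way, testfile1.txt is created under /testdir and written back to the device behind the /testdir inode.
4. With the inode found, we open the file with the open system call. The process allocates a file descriptor using next_fd in files_struct, a file object is created, and its open method is called: inode->i_fop->open(inode, file), where the file's f_op is taken from the inode's i_fop. The inode's address_space is passed on to the file, and the user-space fd is bound to the file object.
5. After the open, all subsequent reads and writes go through the fd; inside the kernel they operate on the corresponding file structure. To write hello world, for example, the kernel calls file->f_op->write.
What file->f_op->write really does is copy the bytes into the memory pages of the address_space; the address_space then picks a suitable time to write them back to disk. This is what we usually call the cache. We can also force the data to be synchronized to storage with the fsync system call. The f_op functions carry the __user annotation, which says the data comes from a user-space address. That data ultimately has to be copied into the page memory of the kernel's address_space cache, which is the familiar copy between user and kernel space. Later came the well-known zero-copy sendfile, which copies data directly between two fds: it works only on pages inside the kernel, with no detour through the user address space.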
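The sendfile and fsync calls mentioned above can be combined into a small file copy. This is a sketch under the assumption of a Linux kernel where sendfile(2) accepts a regular file as the output descriptor; `copy_with_sendfile` is a made-up helper name.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy all of `src` into `dst` with sendfile(2), so the data never
 * passes through a user-space buffer, then fsync() the result.
 * Returns the number of bytes copied, or -1 on failure. */
ssize_t copy_with_sendfile(const char *src, const char *dst)
{
    struct stat st;
    off_t off = 0;
    ssize_t total = 0;
    int in, out;

    assert(src != NULL && dst != NULL);
    in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0 || fstat(in, &st) != 0) {
        if (out >= 0)
            close(out);
        close(in);
        return -1;
    }
    while (off < st.st_size) {
        /* The kernel advances `off` by the number of bytes it moved. */
        ssize_t n = sendfile(out, in, &off, (size_t)(st.st_size - off));

        if (n <= 0) {
            total = -1;
            break;
        }
        total += n;
    }
    if (total >= 0 && fsync(out) != 0)   /* force write-back to storage */
        total = -1;
    close(out);
    close(in);
    return total;
}
```

Compared with a read()/write() loop, no `char __user *` buffer is involved: the pages move between the two files inside the kernel.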
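Conclusion
That completes the basic VFS flow. Mounting of the super_block and the concrete read and write operations of address_space will be filled in later. address_space will be covered in detail together with the page cache and the block (buffer) cache, because that area is particularly complex and tied to the concrete file system implementation; it will be presented alongside the ext2 file system.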