Journaling File Systems

Excerpted from http://www.linux-mag.com/2002-10/jfs_01.html; this document contains only notes and translations of the excerpted passages.
Linux now offers four alternatives to Ext2: Ext3, ReiserFS, XFS, and JFS.
All four support journaling, a feature that enterprises certainly demand and that can simplify restarts, reduce fragmentation, and accelerate I/O.
Some file system vernacular:
A “logical block” is the smallest unit of storage, measured in bytes; a single file usually takes several blocks to store.
A “logical volume” can be a physical disk or some subset of the physical disk space.
“Block allocation” is a method of allocating blocks in which the file system allocates one block at a time.
“Internal fragmentation” occurs when a file does not fill a block completely, so the rest of that block is wasted.
“External fragmentation” occurs when the logical blocks that make up a file are scattered all over the disk.
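The two kinds of fragmentation defined above can be put into numbers. The following is a minimal sketch, not taken from the article: the function names and the sample block numbers are made up for illustration. It computes the bytes wasted in a file's last block (internal fragmentation) and counts how many separate contiguous runs a file's blocks form (a rough measure of external fragmentation).

```python
def internal_waste(file_size, block_size):
    """Bytes left unused in the last block of a file (internal fragmentation)."""
    blocks = (file_size + block_size - 1) // block_size  # round up to whole blocks
    return blocks * block_size - file_size


def count_runs(block_numbers):
    """Number of contiguous runs in a file's block list.

    1 means the file is stored contiguously; larger values mean the
    file is scattered (external fragmentation).
    """
    if not block_numbers:
        return 0
    runs = 1
    for prev, cur in zip(block_numbers, block_numbers[1:]):
        if cur != prev + 1:  # a gap or a jump backwards starts a new run
            runs += 1
    return runs


# A 5000-byte file on 4096-byte blocks needs two blocks.
waste = internal_waste(5000, 4096)
# Hypothetical block list: three separate runs on disk.
runs = count_runs([10, 11, 12, 40, 41, 7])
```

With these sample values the file wastes 3192 bytes of its second block and is split into three runs.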
An “extent” is a large number of contiguous blocks, described by a triple (file offset, starting block number, length): file offset is the offset of the extent’s first block from the beginning of the file, starting block number is the first block in the extent, and length is the number of blocks in the extent. For large files, extent allocation is a much more efficient technique than block allocation.
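The extent triple can be illustrated with a short lookup routine. This is only a sketch under stated assumptions: the `Extent` type and `lookup` function are hypothetical names, and the file offset is assumed to be counted in blocks (the excerpt does not state the unit). It maps a block position within a file to a block number on disk.

```python
from typing import NamedTuple, Optional


class Extent(NamedTuple):
    file_offset: int  # position of the extent's first block within the file (in blocks, assumed)
    start_block: int  # first disk block of the extent
    length: int       # number of blocks in the extent


def lookup(extents, file_block) -> Optional[int]:
    """Return the disk block holding the given file block, or None if unmapped."""
    for ext in extents:
        if ext.file_offset <= file_block < ext.file_offset + ext.length:
            return ext.start_block + (file_block - ext.file_offset)
    return None


# A 12-block file described by just two extents instead of 12 block pointers.
extents = [Extent(0, 1000, 8), Extent(8, 5000, 4)]
```

Here `lookup(extents, 9)` falls in the second extent and resolves to disk block 5001. Two triples describe the whole file, which hints at why extents scale better than per-block bookkeeping for large files.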
“Meta-data” is the file system’s internal data structures. It includes date and time stamps, ownership information, file access permissions, other security information such as access control lists (if they exist), the file’s size, and the storage location or locations on disk.
An “inode” stores all of the information about a file except the data itself; it acts as a “bookkeeping” file for a file. An inode contains file permissions, file types, and the number of links to the file. It can also contain some direct pointers to file data blocks; pointers to blocks that contain pointers to file data blocks (so-called indirect pointers); and even double- and triple-indirect pointers. Every inode has a unique inode number.
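To see how far direct and indirect pointers reach, the number of addressable data blocks can be worked out. The numbers below are assumptions chosen for illustration, not values from the article: 12 direct pointers, 4-byte block pointers, and 4096-byte blocks. The result is only what the pointer scheme could address; a real file system may impose other, lower limits.

```python
def max_file_blocks(direct, ptrs_per_block):
    """Data blocks reachable through direct, single-, double-, and triple-indirect pointers."""
    single = ptrs_per_block        # one block full of pointers to data blocks
    double = ptrs_per_block ** 2   # pointers to blocks of pointers
    triple = ptrs_per_block ** 3   # one more level of indirection
    return direct + single + double + triple


BLOCK_SIZE = 4096    # bytes per block (assumed)
POINTER_SIZE = 4     # bytes per block pointer (assumed)
DIRECT = 12          # direct pointers in the inode (assumed)

ptrs_per_block = BLOCK_SIZE // POINTER_SIZE  # 1024 pointers fit in one block
blocks = max_file_blocks(DIRECT, ptrs_per_block)
max_bytes = blocks * BLOCK_SIZE
```

Under these assumptions the triple-indirect level dominates: it alone contributes 1024^3 blocks, while the direct pointers cover only the first 48 KB of a file.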
A “directory” is a special kind of file that simply contains pointers to other files. Specifically, the inode for a directory file simply contains the inode numbers of its contents, plus permissions, etc.
Corruption occurs because the logical operation of writing (or updating) a file is actually a sequence of I/O operations, and the entire operation may not be fully reflected on the media at any given point in time.
The magic of journaling file systems lies in transactions.
A transaction treats a sequence of changes as a single, atomic operation.
A journaling file system tracks changes to file system meta-data and/or user data.
The journal in a journaling file system is simply a list of transactions.
In the event of a system failure, the file system is restored to a consistent state by replaying the journal.
Rather than examining all meta-data, a journaling file system inspects only those portions of the meta-data that have recently changed.
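The idea of replaying a list of transactions can be sketched in a few lines. This is a toy model, not how any of the four file systems is actually implemented: the `Journal` class, the `replay` function, and the key names are all hypothetical, and the "disk" is just a dictionary. Committed transactions are re-applied in full; a transaction that was never committed is ignored, so a half-finished update cannot leave the meta-data inconsistent.

```python
class Journal:
    """A journal is simply a list of transactions."""

    def __init__(self):
        self.records = []

    def begin(self, writes):
        """Log the intended writes before touching the disk."""
        txn = {"writes": list(writes), "committed": False}
        self.records.append(txn)
        return txn

    def commit(self, txn):
        """Mark the transaction as complete in the journal."""
        txn["committed"] = True


def replay(journal, disk):
    """Re-apply every committed transaction; skip uncommitted ones."""
    for txn in journal.records:
        if txn["committed"]:
            for key, value in txn["writes"]:
                disk[key] = value  # re-applying a write is harmless (idempotent)
    return disk


# Simulated crash: the first transaction was committed to the journal,
# but only one of its two writes reached the disk.
disk = {"inode7.size": 0, "block42": "free"}
journal = Journal()
t1 = journal.begin([("inode7.size", 4096), ("block42", "used")])
journal.commit(t1)
disk["inode7.size"] = 4096                   # the crash happens right after this write
t2 = journal.begin([("inode9.size", 100)])   # logged but never committed

replay(journal, disk)
```

After the replay both writes of the first transaction are on the disk and nothing from the second one is, so the update appears atomic. Only the entries named in the journal are touched, which matches the point that recovery looks only at recently changed meta-data.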
Journaling file systems also address another significant problem: scalability.
Features of modern file systems include:
– Faster allocation of free blocks. Extents (as described above) and B+ trees are used individually or together to find and allocate several free blocks quickly, either by size or by location.
– Large (or very large) numbers of files in a directory.
– Large files.
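The first feature in the list, finding free space by size, can be approximated with a sorted index. The sketch below is an assumption-laden stand-in: it uses a sorted Python list searched with `bisect` in place of a real B+ tree, and the `FreeSpace` class is a hypothetical name. It keeps free extents ordered by length so that the smallest extent that is large enough can be located with a binary search.

```python
import bisect


class FreeSpace:
    """Free extents indexed by size (a sorted list stands in for a B+ tree)."""

    def __init__(self, free_extents):
        # free_extents: iterable of (start_block, length)
        self.by_size = sorted((length, start) for start, length in free_extents)

    def allocate(self, want):
        """Allocate `want` contiguous blocks from the smallest extent that fits."""
        i = bisect.bisect_left(self.by_size, (want, 0))
        if i == len(self.by_size):
            return None  # no single free extent is large enough
        length, start = self.by_size.pop(i)
        if length > want:
            # Return the unused remainder of the extent to the index.
            bisect.insort(self.by_size, (length - want, start + want))
        return (start, want)


fs = FreeSpace([(100, 4), (200, 16), (500, 8)])
first = fs.allocate(6)    # best fit is the 8-block extent starting at block 500
second = fs.allocate(32)  # nothing that large is free
```

A real B+ tree would also make the removal and re-insertion logarithmic; the list here only demonstrates the lookup by size.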
Ext3
Ext3 is designed to provide higher availability without impacting the robustness (at least the simplicity and reliability) of Ext2.
Ext3 uses the same disk layout and data structures as Ext2, and it is forward- and backward-compatible with Ext2.
Ext3 shares the limitations that Ext2 has. The fixed internal structures of Ext2 are simply too small (too few bits) to capture large file sizes, extremely large partition sizes, and enormous numbers of files in a single directory. Moreover, the bookkeeping techniques of Ext2, such as its linked-list directory implementation, do not scale well to large file systems (there is an upper limit of 32,768 subdirectories in a single directory, and a “soft” upper limit of 10,000-15,000 files in a single directory).
Switching to Ext3
# tune2fs -j /dev/hdb3
Ext3 provides three data journaling modes that can be set at mount time: data=journal, data=writeback, and data=ordered. The data=journal mode provides both meta-data and data journaling. The data=writeback mode provides only meta-data journaling. The data=ordered mode, which is the default, provides meta-data journaling with increased integrity.
By the way, the 2.4 kernel has a limit of 2048 GB for a single block device, so no file system larger than that can be created at this time (without patching the standard kernel). This restriction could be removed in the 2.5.x development kernel, and there are patches available to remove this limit, but as of 2.5.29, the patches haven’t been officially included yet.
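One plausible reading of the 2048 figure is that a block device is addressed with 32-bit sector numbers and 512-byte sectors. That reading is an assumption; the excerpt does not give the reason for the limit. The arithmetic under that assumption is:

```python
SECTOR_SIZE = 512        # bytes per sector (assumed)
SECTOR_BITS = 32         # width of a sector number (assumed)

max_bytes = (2 ** SECTOR_BITS) * SECTOR_SIZE
max_gib = max_bytes // 2 ** 30   # expressed in units of 2^30 bytes

print(max_gib)  # 2048
```

So 2^32 sectors of 512 bytes come to 2^41 bytes, which is 2048 units of 2^30 bytes.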
ReiserFS
One of the unique advantages of ReiserFS is support for small files: lots and lots of small files.
ReiserFS is about eight to fifteen times faster than Ext2 at handling files smaller than 1K.
ReiserFS can actually store about 6% more data than Ext2 on the same physical file system.
ReiserFS can allocate the exact space that’s needed. A B* tree manages all file system meta-data, and stores and compresses tails, the portions of files smaller than a block.
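The space effect of storing tails compactly can be estimated with simple arithmetic. This sketch is a simplification with made-up file sizes: it ignores meta-data overhead and says nothing about how ReiserFS actually lays tails out in its tree. It compares rounding every file up to whole blocks against packing all the tails together.

```python
def blocks_only(sizes, block_size):
    """Space used when every file is rounded up to whole blocks."""
    return sum(((s + block_size - 1) // block_size) * block_size for s in sizes)


def with_tail_packing(sizes, block_size):
    """Space used when full blocks are stored as-is and tails are packed together."""
    full = sum((s // block_size) * block_size for s in sizes)
    tails = sum(s % block_size for s in sizes)  # the leftover part of each file
    packed = ((tails + block_size - 1) // block_size) * block_size
    return full + packed


sizes = [300, 700, 5000, 100]   # hypothetical file sizes in bytes
plain = blocks_only(sizes, 4096)
packed = with_tail_packing(sizes, 4096)
```

For these four files, whole-block allocation takes 20480 bytes while packing the tails takes 8192 bytes, because the three small files and the 904-byte tail of the large one share a single block.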
ReiserFS also has excellent performance for large files, but it’s especially adept at managing small files.
JFS
JFS uses many advanced techniques to boost performance and provide for very large file systems.
SGI’s XFS (described next) has many similar features. Some of the features of JFS include:
– Extent-based addressing structures. JFS uses extent-based addressing structures, along with aggressive block allocation policies, to produce compact, efficient, and scalable structures for mapping logical offsets within files to physical addresses on disk. This feature yields excellent performance.
– Dynamic inode allocation. JFS dynamically allocates space for disk inodes as required, freeing the space when it is no longer required. Additionally, this feature decouples disk inodes from fixed disk locations.
– Directory organization. Two different directory organizations are provided: one is used for small directories and the other for large directories.
– 64-bit. JFS is a full 64-bit file system. This allows JFS to support large files and partitions.
– Other features, such as allocation groups (which speed file access by maximizing locality) and various block sizes ranging from 512 bytes to 4096 bytes (which can be tuned to avoid internal and external fragmentation).
XFS
A single XFS file system can be 18,000 petabytes (a petabyte is 10^15 bytes) and a single file can be 9,000 petabytes. XFS is also capable of delivering excellent I/O performance.
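The two figures line up with powers of two if one assumes they are rounded values of 2^64 and 2^63 bytes. That correspondence is an assumption made here for illustration; the excerpt only gives the petabyte numbers. A quick check:

```python
PETABYTE = 10 ** 15          # bytes

max_fs_bytes = 2 ** 64       # assumed origin of the "18,000 petabytes" figure
max_file_bytes = 2 ** 63     # assumed origin of the "9,000 petabytes" figure

fs_pb = max_fs_bytes // PETABYTE      # whole petabytes in 2^64 bytes
file_pb = max_file_bytes // PETABYTE  # whole petabytes in 2^63 bytes

print(fs_pb, file_pb)  # 18446 9223
```

Both values round to the quoted 18,000 and 9,000 petabytes, which is consistent with a file system built on 64-bit quantities.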
XFS uses many of the same techniques found in JFS.