File Systems

Part of Data Storage

How operating systems organize raw storage into named files and directories — the software layer that turns a disk full of sectors into a usable archive of documents and programs.

Why This Matters

Without a file system, a disk drive is just a numbered sequence of 512-byte blocks. To use it, you would have to remember that “the water purification program starts at sector 12,334 and is 4,891 sectors long.” This does not scale. File systems solve this problem by maintaining a catalog that maps human-readable names to the physical locations of data blocks.

File systems are also where most data loss originates. A crash during a write can leave the file system in an inconsistent state — the directory says a file exists, but the blocks haven’t been written yet. Power failure can corrupt the catalog, making existing data inaccessible even when it is physically intact on the disk. Understanding how file systems work lets you design them correctly, diagnose corruption, and recover data from failed systems.

Every operating system implements at least one file system, but the concepts are universal. FAT, Unix FS, ext2/3/4, NTFS, ISO 9660 — all solve the same problems differently. Understanding the common concepts lets you work with any of them.

The Three Problems Every File System Solves

1. Name resolution: Given a file name (“water-purification.txt”), find the starting block. This requires a directory structure mapping names to locations.

2. Space allocation: When creating or extending a file, decide which free blocks to use and update the free space tracking accordingly. When deleting a file, mark its blocks as free.

3. Integrity maintenance: After crashes, power failures, or other interruptions, the file system metadata (directory entries, block allocation tables) must be consistent. Either all the writes for an operation happened, or none did — partial writes should not leave the metadata in a corrupted state.

Every file system design involves tradeoffs between the complexity of addressing these three problems and the resulting performance and reliability.

Block Structure: The Physical Foundation

File systems work on top of a block device — a storage medium accessed in fixed-size chunks (blocks or sectors). The block size is typically 512 bytes (one hardware sector) to 4,096 bytes (an 8-sector cluster). Larger blocks reduce fragmentation but waste space for small files; smaller blocks use space efficiently but increase overhead for large files.

Block 0 (Master Boot Record or Boot Block): The very first block of a disk partition contains executable code (the bootstrap loader) and the partition table. The file system proper begins in a subsequent block.

Superblock: Contains file system-wide metadata: total number of blocks, number of free blocks, block size, inode count (Unix), file system type and version identifier, time of last mount, state flags (was the file system properly unmounted last time?). The superblock is critical — a corrupted superblock makes the entire file system inaccessible. Always keep a backup copy of the superblock.

Block group / allocation unit: Many file systems divide the disk into groups of blocks, each with local metadata. This improves performance by keeping file data near its associated metadata.

Free Space Tracking

How does the file system know which blocks are free?

Bitmap: One bit per block in the entire partition. If the bit is 1, the block is in use; if 0, the block is free. A 1 TB disk with 4 KB blocks has 256 million blocks, requiring a 32 MB bitmap. Simple, fast to search, easy to implement. Used in ext2/3/4.

Free list: A linked list of free blocks. The first free block contains a pointer to the next free block, and so on. Simple to implement, but fragmented access when the list is long and scattered.

FAT table: In FAT file systems, each entry in the File Allocation Table describes one cluster (multi-block allocation unit). An entry value of 0 means the cluster is free; a non-zero value is either the next cluster in the file chain or an end-of-file marker. Free space scanning requires reading the entire FAT, but allocation is simple.

Directory Structures

Flat directory: All files in a single list. Each entry contains: file name (padded to fixed length), starting block number, file length in bytes. Simple to implement and search, but limits the number of files and does not support organization into subfolders. Appropriate for small systems (the ISO 9660 CD-ROM can emulate this within its directory structure).

Tree directory: Directories contain either file entries or subdirectory entries. A subdirectory entry contains the name of the subdirectory and the block number of that subdirectory’s block list. This enables arbitrarily deep folder hierarchies. Used in all general-purpose operating systems.

Directory entry formats:

FAT directory entry (32 bytes):

  • Bytes 0–7: file name (padded with spaces, uppercase)
  • Byte 8–10: extension (3 characters)
  • Byte 11: attributes (read-only, hidden, system, directory, archive)
  • Bytes 12–21: reserved
  • Bytes 22–23: time (5-bit hour, 6-bit minute, 5-bit 2-second)
  • Bytes 24–25: date
  • Bytes 26–27: starting cluster number (low 16 bits)
  • Bytes 28–31: file size in bytes

Unix inode (variable but typically 128–256 bytes per inode):

  • File type and permission bits (mode)
  • Owner user ID and group ID
  • File size
  • Timestamps (creation, modification, access)
  • Block pointers: 12 direct (pointing to data blocks), 1 indirect (pointing to a block of block pointers), 1 double-indirect, 1 triple-indirect

The directory in Unix does not contain file metadata — it contains only filename-to-inode-number mappings. All metadata is in the inode. This separation allows hard links (multiple directory entries pointing to the same inode) and makes directory operations independent of file size.

The FAT File System: Implementation Guide

FAT (File Allocation Table) is the simplest practical file system to implement from scratch and the most widely compatible. It is used on floppy disks, USB drives, SD cards, and many embedded systems.

Regions of a FAT-formatted volume (in order):

  1. Boot sector (1 sector): BPB (BIOS Parameter Block) with volume parameters
  2. FAT table (variable): one entry per cluster; FAT12 uses 12-bit entries, FAT16 uses 16-bit, FAT32 uses 32-bit
  3. Optional second FAT (backup copy)
  4. Root directory (fixed-size region for FAT12/16; a regular file for FAT32)
  5. Data region: all remaining clusters, numbered starting at 2

Following a file:

  1. Search root directory (or a subdirectory) for the filename
  2. Get starting cluster number from directory entry
  3. Read data cluster
  4. Look up this cluster number in the FAT table; the entry gives the next cluster (or end-of-chain marker 0xFF8–0xFFF for FAT12)
  5. Repeat until end-of-chain

Creating a file:

  1. Scan the FAT table for a free entry (value 0x000)
  2. Write the first cluster of data to that cluster
  3. Mark the FAT entry as end-of-chain (0xFFF for FAT12)
  4. Create a directory entry with the filename, starting cluster, and size

FAT12 for small volumes (floppy disks, small embedded storage): 12-bit entries, maximum 4,096 clusters × cluster size. A 1.44 MB floppy with 512-byte sectors and 1-sector clusters: 2,880 clusters. FAT table size: 2,880 × 1.5 bytes = 4,320 bytes (rounded to 3 sectors × 2 copies = 6 sectors).

FAT12 is the simplest FAT variant to implement. The 12-bit entry packing is slightly fiddly (entries straddle byte boundaries) but well-documented and implementable in under 50 lines of code.

Journaling: Crash Consistency

A non-journaling file system is vulnerable to corruption on power failure. If the power fails between writing a directory entry and writing the actual file data, the directory points to blocks containing garbage. If the power fails while extending a file, the file length and the actual written blocks may be inconsistent.

Journaling adds a write-ahead log (the journal). Before modifying the file system, write a journal entry describing the intended change. Only after the journal entry is safely written do you make the actual change. If a crash occurs before the actual change, the journal entry is replayed on the next mount to complete the operation. If a crash occurs after, the journal entry is discarded as already complete.

Modern file systems (ext3/4, NTFS, HFS+) all journal metadata at minimum. Journaling data as well (full data journaling) provides the strongest consistency but roughly halves write performance.

For a simple rebuild file system, implementing write-ahead logging for directory and FAT changes (not data) provides the most critical consistency properties with modest complexity.

Recovering from Corruption

When a file system is corrupted (by power failure, hardware error, or software bug), the following tools and techniques apply:

fsck (file system check): Scans all metadata and checks for inconsistencies: unreferenced blocks (allocated in FAT but not in any directory chain), double-allocated blocks (same block referenced by two files), directory entries with impossible sizes or starting clusters, invalid FAT chain entries. Inconsistencies are either repaired automatically or placed in a lost+found directory.

Manual superblock recovery: If the superblock is corrupted, find a backup superblock (most file systems write copies at regular intervals) and use it to mount the volume. The backup superblock location is documented in the file system specification.

Direct block reading: Even if the file system metadata is completely corrupted, data blocks on disk are unchanged. If you know a file’s starting block (from backups of directory listings, old partition tables, or intuition based on allocation patterns), you can directly read those blocks and recover the data.

Priority: Before attempting any recovery, make a complete block-level image of the disk (bitwise copy of every sector to another disk or tape). All recovery operations should work on the image, not the original. This prevents recovery attempts from making corruption worse.