Error Detection and Correction
Part of Data Storage
How redundant data reveals and repairs corruption — from simple parity bits to sophisticated codes that recover from multiple simultaneous errors.
Why This Matters
Every storage and transmission medium introduces errors. Magnetic particles flip. Cosmic rays strike memory cells. Optical discs get scratched. Cable noise corrupts serial data. Even the best-designed systems fail at some rate — and for high data volumes or safety-critical applications, even a very low error rate produces significant numbers of corrupted bits.
Error detection and correction (EDC/ECC) solves this by adding redundant information to stored or transmitted data. The redundancy is calculated from the original data in a way that makes errors detectable and, in more advanced schemes, locatable and correctable.
This is not a luxury feature — it is fundamental to reliable computing. A database without error detection will silently accumulate corrupted records. A program stored on tape without checksums may load incorrectly and produce wrong results. A communications link without error detection will inject random corruption into every transfer.
Understanding the theory lets you choose the right code for each situation — simple and cheap for low-noise environments, powerful and complex for harsh or archival applications — and implement them in minimal code or discrete logic.
The Information-Theoretic Foundation
Why redundancy enables correction: A block of n data bits can be in 2^n possible states. If we add k check bits, we have 2^(n+k) total states, but only 2^n of them are “valid codewords” (the others are invalid). If a codeword is changed by one bit error, it moves to an invalid state. If we can determine from an invalid state which valid codeword was “nearest” (fewest bit differences), we can correct the error.
Hamming distance: The Hamming distance between two bit strings is the number of positions at which they differ. A code with minimum Hamming distance d between any two valid codewords can:
- Detect any error affecting up to d-1 bits
- Correct any error affecting up to floor((d-1)/2) bits
For single-bit error correction and double-bit detection (SECDED), minimum Hamming distance must be 4. This requires adding log2(n) + 1 check bits to an n-bit data word.
Parity: The Simplest Code
Even parity: Add one extra bit to each byte (or word) such that the total number of 1-bits is even. If the byte 0b10110100 has four 1-bits (already even), the parity bit is 0. If the byte is 0b10110110 (five 1-bits), the parity bit is 1.
Detecting an error: When reading, recompute the parity from the data bits and compare to the stored parity bit. If they disagree, an odd number of bits have been corrupted. If they agree, either no bits were corrupted or an even number were corrupted (undetectable).
Limitations: Parity detects single-bit errors but not two-bit errors (which cancel each other). It cannot determine which bit is wrong, so it cannot correct errors — it can only detect them. When parity fails, the entire data unit must be re-requested or discarded.
Applications: Memory parity checking (one extra parity bit per byte in RAM), simple serial protocol error detection, floppy disk sector headers.
Implementation in hardware: A single XOR gate tree over all data bits. The XOR of all bits equals the parity bit for even parity. To check: XOR all data bits plus the parity bit; result should be 0 for correct data.
Implementation in one line of code:
def parity(byte):
return bin(byte).count('1') % 2Checksums: Catch-All Block Verification
A checksum is a value computed from all bytes in a block and appended to the block. Any corruption of the block (one or more bits) will usually change the checksum computation result, and the mismatch between the stored and recomputed checksum reveals the error.
Simple additive checksum: Sum all bytes in the block (discarding overflow), negate the sum (take the two’s complement), and append as one or two checksum bytes. To verify: sum all bytes including the checksum; result should be 0.
def checksum_8bit(data):
return (-sum(data)) & 0xFF # one byte checksumThis detects most single-burst errors but misses rearrangements and some specific patterns (adding 0x01 to one byte and -0x01 to another leaves the sum unchanged).
CRC (Cyclic Redundancy Check): A more powerful checksum based on polynomial division in GF(2) (binary arithmetic with no carries). The data bits are treated as the coefficients of a polynomial; division by a generator polynomial of degree r leaves an r-bit remainder (the CRC). Any burst error shorter than r bits changes the remainder, guaranteeing detection.
Standard CRC polynomials:
- CRC-8 (x^8 + x^2 + x + 1): 1-byte checksum, detects all burst errors ≤ 8 bits
- CRC-16 (x^16 + x^15 + x^2 + 1, IBM): 2-byte checksum, commonly used in serial protocols
- CRC-32 (x^32 + x^26 + x^23 + … + x + 1): 4-byte checksum, used in Ethernet, ZIP files, CDs
CRC-32 is close to optimal for block sizes up to 32 KB — any single burst error, any two-bit error, and the vast majority of multiple-bit errors are detected.
Implementation: CRC can be computed in software with a lookup table or bitwise algorithm:
# CRC-32 using lookup table (standard Python)
import binascii
crc = binascii.crc32(data) & 0xFFFFFFFFIn hardware, a shift register with feedback taps computes CRC in real time as data flows through, requiring only r bits of state (32 flip-flops for CRC-32).
Reed-Solomon Codes: Burst Error Correction
Reed-Solomon (RS) codes are block error correction codes that operate on symbols (typically 8-bit bytes) rather than individual bits. An RS(n,k) code takes k data symbols and appends n-k check symbols, producing an n-symbol codeword that can correct up to floor((n-k)/2) symbol errors and detect up to n-k symbol errors.
RS(255,223): The most common instance. 223 data bytes + 32 check bytes = 255-byte codeword. Can correct up to 16 symbol errors per codeword. This is used in NASA deep-space communications, CD audio (as C2 code), and many storage systems.
Why symbols over bits: Reed-Solomon codes treat a corrupted symbol (byte with any number of bit errors) the same as a single error regardless of how many bits within the byte are wrong. This makes RS codes extremely effective against burst errors — a burst that corrupts 16 consecutive bytes counts as 16 symbol errors, all correctable by RS(255,223).
RS on CDs: The CD uses two layers of RS coding (C1 and C2) with interleaving between them. C1 is RS(32,28) — corrects up to 2 symbol errors per 32-symbol block. After de-interleaving, C2 is RS(28,24) — another 2 symbol error correction capability. Combined, this corrects any burst error of up to 500 symbols (approximately 4,000 bits), corresponding to a 4 mm scratch on the disc surface.
Computing RS codes: RS encoding requires field arithmetic in GF(2^8) — finite field multiplication and addition over 8-bit symbols using a specific polynomial. This sounds exotic but reduces to table-based operations implementable in a few hundred lines of code. Standard library implementations exist for every programming language.
RAID: Distributed Error Correction Across Disks
RAID (Redundant Array of Independent Disks) applies error correction principles at the disk level. Multiple disks are combined so that the loss of one disk does not lose data.
RAID-1 (mirroring): Two identical disks holding the same data. If one disk fails, the other provides full recovery. Simple but uses 50% of total capacity for redundancy.
RAID-5 (distributed parity): Data is striped across n disks, with parity information distributed across all disks. Any single disk failure is recoverable from the remaining data and parity. Uses only 1/n of capacity for redundancy. Requires n ≥ 3 disks.
RAID-6 (double parity): Like RAID-5 but with two independent parity calculations. Any two simultaneous disk failures are recoverable. More complex but provides protection against simultaneous failure of two disks — important for large arrays where a second disk may fail during the rebuild from a first failure.
For a small post-collapse system, RAID-1 (two identical disks, one copy each) is the most practical approach. It requires no complex calculation — simply write every block to both disks and read from whichever responds correctly.
Practical Deployment Guidelines
Match code strength to error rate: Simple additive checksum is fine for data transferred over a reliable wired connection between two computers in the same room. CRC-32 is appropriate for disk storage and network transmission. Reed-Solomon is necessary for optical media, noisy radio links, and archival storage where long-term degradation is expected.
Always verify: Compute and store checksums at write time; verify at read time. Do not wait for a visible error to start checking — many corruption events produce plausibly correct-looking but subtly wrong data that only a checksum catches.
Multiple layers: Layer error correction codes. A disk sector has hardware ECC from the drive firmware. The file system adds CRC to each file. The backup archive adds RS error correction to the entire backup. Each layer catches different failure modes.
Document your codes: Record which error correction code is used for each storage format in your system documentation. Without this information, future operators cannot interpret stored data or understand why integrity verification is failing.