Checksums

Part of Data Storage

Sum-based integrity verification — how a few extra bytes reveal whether a block of data was corrupted during storage or transmission.

Why This Matters

A checksum is the simplest form of block-level data integrity verification: compute a value from all the bytes in a block, store that value with the block, and recompute it later to verify nothing changed. If the recomputed value matches the stored value, the block is (almost certainly) intact. If they differ, the block has been corrupted.

Checksums are so fundamental that they are embedded invisibly in almost every storage and communication system: disk sectors include CRC checksums in their headers; TCP/IP packets include checksums before transmission; ZIP and gzip files include CRC-32 checksums of compressed data; most file transfer protocols include checksums to detect download corruption.

For rebuilders, implementing checksums is straightforward — even simple additive checksums are valuable, and strong CRC-32 checksums require only a lookup table and a few lines of code. There is no excuse for storing data without checksum protection. The cost is 4–32 extra bytes per block; the benefit is confidence that your stored data is actually what you stored.

Simple Additive Checksum

The most basic checksum: sum all bytes in the block, take the result modulo 256 (or modulo 65536 for a 2-byte checksum), and store alongside the data. To verify, recompute the sum and compare.

One-byte additive checksum:

def checksum(data):
    return sum(data) % 256
 
def verify(data, stored_checksum):
    return checksum(data) == stored_checksum

Two’s complement checksum (used in many real protocols): The checksum byte is defined as the value that makes the sum of all bytes (including the checksum) equal to zero modulo 256. This means: checksum = (-sum(data)) & 0xFF. Verification is simply: (sum(all_bytes_including_checksum)) & 0xFF == 0.

def twos_complement_checksum(data):
    return (-sum(data)) & 0xFF
 
def verify(all_bytes_including_checksum):
    return sum(all_bytes_including_checksum) & 0xFF == 0

This form is convenient because no separate comparison is needed — just sum everything and check for zero.
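The zero-check round trip, using the functions above on a made-up three-byte frame:

```python
def twos_complement_checksum(data):
    # Byte that makes the total sum 0 mod 256 (restated from above).
    return (-sum(data)) & 0xFF

frame = bytes([0x10, 0x22, 0x7F])
frame_with_checksum = frame + bytes([twos_complement_checksum(frame)])

# Receiver: sum everything, expect zero.
assert sum(frame_with_checksum) & 0xFF == 0

# A corrupted copy fails the zero check.
corrupted = bytes([0x10, 0x23, 0x7F]) + frame_with_checksum[-1:]
assert sum(corrupted) & 0xFF != 0
```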

Limitations: Simple additive checksums are weak. They cannot detect:

  • Any rearrangement of bytes (the sum is the same regardless of order)
  • Adding byte 0x01 to one position and subtracting 0x01 from another (the sum is unchanged)
  • Insertion or deletion of zero-valued bytes (0x00 contributes nothing to the sum)

For data stored in fixed formats where rearrangements are impossible (writing a fixed-size block to a fixed disk sector), these limitations are less critical. But for any variable-length or variable-format data, a stronger checksum is necessary.
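These blind spots are easy to demonstrate; the three-byte strings below are arbitrary examples:

```python
def additive_checksum(data):
    # Simple one-byte additive checksum, as defined above.
    return sum(data) % 256

original = b"CAB"
transposed = b"ABC"  # same bytes, different order

# The additive checksum cannot tell them apart:
assert additive_checksum(original) == additive_checksum(transposed)

# Compensating errors also slip through: +1 in one byte, -1 in another.
tampered = bytes([original[0] + 1, original[1] - 1, original[2]])
assert additive_checksum(tampered) == additive_checksum(original)
```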

Fletcher Checksum

The Fletcher checksum improves on simple addition by maintaining two running sums: a simple byte sum and a sum of running sums. This makes the checksum position-sensitive — it detects transpositions and rearrangements.

Fletcher-16 (16-bit result):

def fletcher16(data):
    a = 0
    b = 0
    for byte in data:
        a = (a + byte) % 255
        b = (b + a) % 255
    return (b << 8) | a

Sum a is the simple byte sum; sum b is the sum of all partial sums of a, which makes b sensitive to the order of bytes. Swapping two bytes changes b even if their values are identical.

Fletcher-32 (32-bit result): Same structure, but the data is processed as 16-bit words and both sums are taken modulo 65535 instead of 255, providing stronger protection for larger blocks.
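A direct, unoptimized Fletcher-32 might look like the sketch below. Word size, byte order within words, and odd-length padding are conventions that vary between implementations, so this is one reasonable choice rather than a canonical definition:

```python
def fletcher32(data):
    # Pad odd-length input with a zero byte, then process the data as
    # little-endian 16-bit words; both sums are taken modulo 65535.
    if len(data) % 2:
        data = data + b"\x00"
    a = 0
    b = 0
    for i in range(0, len(data), 2):
        word = data[i] | (data[i + 1] << 8)
        a = (a + word) % 65535
        b = (b + a) % 65535
    return (b << 16) | a

# Unlike an additive checksum, the position-sensitive sum catches swaps:
assert fletcher32(b"ABCD") != fletcher32(b"CDAB")
```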

Fletcher checksums are used in real protocols: OSPF, a routing protocol of the TCP/IP suite, carries a Fletcher checksum in each link-state advertisement. The algorithm is simple enough to implement in hand-assembled machine code.

CRC: Cyclic Redundancy Check

CRC is the standard checksum for data storage and network protocols. It provides strong error detection with compact output (32 bits for CRC-32) and hardware-efficient implementation.

Mathematical basis: Treat the data as a polynomial over GF(2) (binary arithmetic with no carries). Divide this polynomial by a fixed generator polynomial and take the remainder. The remainder (r bits long for a degree-r generator) is the CRC.

Why it is powerful: Any burst error (a run of corrupted bits) of length r or less always changes the remainder, provided the generator polynomial has a nonzero constant term — guaranteed detection, not probabilistic. CRC-32 therefore detects all burst errors up to 32 bits, and almost all longer bursts and random multi-bit errors.
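The burst-error guarantee can be spot-checked by brute force. Python's zlib.crc32 implements the same CRC-32 used by ZIP and gzip, so it serves as a stand-in here; the sample message and the position stride of 7 are arbitrary choices to keep the loop fast:

```python
import zlib

message = b"The quick brown fox jumps over the lazy dog"
good = zlib.crc32(message)
total_bits = len(message) * 8

all_detected = True
for length in range(1, 33):                            # burst lengths 1..32 bits
    for start in range(0, total_bits - length + 1, 7): # sampled positions
        corrupted = bytearray(message)
        for bit in range(start, start + length):
            corrupted[bit // 8] ^= 1 << (bit % 8)      # flip each bit in the run
        if zlib.crc32(bytes(corrupted)) == good:
            all_detected = False
assert all_detected
```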

The generator polynomial: Different CRC standards use different generator polynomials. CRC-32 (used in Ethernet, ZIP, gzip, CDs) uses: 0x04C11DB7 (represented as a 32-bit polynomial coefficient bitmask)

The specific polynomial matters — using a non-standard polynomial produces CRC values incompatible with standard software and hardware.

Hardware implementation: A shift register with XOR feedback at specific bit positions computes the CRC one bit per clock cycle; byte-wide parallel versions process a full byte per cycle. The feedback tap positions are determined by the generator polynomial. This is implemented in all Ethernet chips, disk controllers, and optical disc readers.
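A bit-at-a-time software model of that shift register is a direct way to see the polynomial in action — each loop iteration is one clock of the register, and the XOR with 0x04C11DB7 models the feedback taps. Note that this sketch uses the MSB-first bit-ordering convention, so its output differs from the reflected CRC-32 used by ZIP and gzip even though the polynomial is the same:

```python
import zlib

def crc32_msb_first(data):
    # Software model of the hardware shift register: one iteration per
    # bit, XOR feedback taps given by the polynomial 0x04C11DB7.
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc ^ 0xFFFFFFFF

# Same polynomial, different bit-order convention: the result is NOT
# the CRC-32 that ZIP/gzip (and zlib) produce for the same input.
assert crc32_msb_first(b"123456789") != zlib.crc32(b"123456789")
```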

Software implementation using lookup table:

# Pre-compute lookup table for CRC-32
def make_crc32_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320  # reversed polynomial
            else:
                crc >>= 1
        table.append(crc)
    return table
 
CRC32_TABLE = make_crc32_table()
 
def crc32(data, initial=0xFFFFFFFF):
    crc = initial
    for byte in data:
        crc = (crc >> 8) ^ CRC32_TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF

This table-based approach computes CRC-32 at approximately one table lookup per byte — fast enough for real-time verification in most applications.
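A sanity check for the implementation above (restated here so the example is self-contained): the customary test vector for this CRC-32 variant is the ASCII string "123456789", whose checksum is 0xCBF43926, and Python's zlib.crc32 computes the same variant:

```python
import zlib

def make_crc32_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320  # reversed polynomial
            else:
                crc >>= 1
        table.append(crc)
    return table

CRC32_TABLE = make_crc32_table()

def crc32(data, initial=0xFFFFFFFF):
    crc = initial
    for byte in data:
        crc = (crc >> 8) ^ CRC32_TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF

# Standard check value for CRC-32:
assert crc32(b"123456789") == 0xCBF43926
# Agrees with Python's zlib, which implements the same variant:
assert crc32(b"sample data") == zlib.crc32(b"sample data")
```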

Usage pattern:

  1. Compute CRC-32 of the data block
  2. Append the 4-byte CRC as a trailer to the block (least-significant byte first for this reflected variant)
  3. When reading, compute CRC-32 over the data plus the stored CRC bytes
  4. The result should be a known fixed constant if no errors occurred — with the crc32() above (final XOR included), that constant is 0x2144DF1C, the complemented form of the register residue 0xDEBB20E3 — or, equivalently, the stored CRC should match a freshly computed CRC of just the data
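The trailer pattern in code, using zlib.crc32 (the same CRC-32 variant as the table-based code above) and appending the CRC least-significant byte first — the order that makes the fixed-residue check work for this reflected CRC; the helper names are illustrative:

```python
import zlib

RESIDUE = 0x2144DF1C  # register residue 0xDEBB20E3 after the final XOR

def add_trailer(block):
    # Append the CRC little-endian (matching this reflected CRC-32).
    return block + zlib.crc32(block).to_bytes(4, "little")

def check_block(block_with_trailer):
    # Either check data-plus-trailer against the fixed residue...
    return zlib.crc32(block_with_trailer) == RESIDUE

framed = add_trailer(b"important record")
assert check_block(framed)

# ...or recompute over the data alone and compare with the trailer:
data, trailer = framed[:-4], framed[-4:]
assert zlib.crc32(data) == int.from_bytes(trailer, "little")

# Any corruption breaks the residue check:
assert not check_block(b"X" + framed[1:])
```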

Applying Checksums in Practice

File integrity: Store a CRC-32 or MD5/SHA-256 hash of every important file alongside the file. When restoring from backup, verify the checksum before trusting the file. Never delete the original until the backup has been verified.

program.bin: 18432 bytes, CRC-32 = 0xA1B2C3D4
library.dat: 65536 bytes, CRC-32 = 0x33445566

Maintain a checksum manifest file listing every stored file and its checksum. This manifest itself should be checksummed and stored redundantly.
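A manifest along those lines can be generated and checked with a few lines of Python. The layout mirrors the example entries above and is just one workable format; for illustration the "files" are an in-memory dict rather than a real directory:

```python
import zlib

def build_manifest(files):
    # files: {name: bytes}. One line per file, in the same format as the
    # example entries above, plus a checksum of the manifest itself.
    lines = [f"{name}: {len(data)} bytes, CRC-32 = 0x{zlib.crc32(data):08X}"
             for name, data in sorted(files.items())]
    body = "\n".join(lines)
    footer = f"manifest CRC-32 = 0x{zlib.crc32(body.encode()):08X}"
    return body + "\n" + footer

def verify_file(files, manifest_line):
    # Check one manifest line against the current file contents.
    name, rest = manifest_line.split(": ", 1)
    expected = int(rest.split("= ")[1], 16)
    return zlib.crc32(files[name]) == expected

files = {"program.bin": bytes(1024), "library.dat": b"some data"}
manifest = build_manifest(files)
assert all(verify_file(files, line) for line in manifest.splitlines()[:-1])
```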

Protocol framing: Every protocol message (whether stored in memory or transmitted over a link) should end with a checksum. Define clearly: which bytes are included in the checksum calculation (header + data, or data only), what the checksum algorithm is, and what the check value should be on verify.
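A minimal framing sketch with all three definitions made explicit. The 1-byte type / 2-byte length header, the CRC coverage (header plus payload), and the trailer comparison are illustrative choices, not a standard format:

```python
import struct
import zlib

# Conventions, stated explicitly as the text advises:
#  - frame = 1-byte type + 2-byte big-endian length + payload + 4-byte CRC-32
#  - the CRC covers the header AND the payload
#  - on receive, the CRC of everything but the trailer must equal the trailer

def make_frame(msg_type, payload):
    header = struct.pack(">BH", msg_type, len(payload))
    body = header + payload
    return body + struct.pack(">I", zlib.crc32(body))

def parse_frame(frame):
    body, (crc,) = frame[:-4], struct.unpack(">I", frame[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("checksum mismatch")
    msg_type, length = struct.unpack(">BH", body[:3])
    return msg_type, body[3:3 + length]

assert parse_frame(make_frame(7, b"ping")) == (7, b"ping")
```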

Disk sector layout: Each 512-byte sector on a disk should have a 4-byte CRC-32 at the end, reducing the user-data payload to 508 bytes but guaranteeing that any single burst error up to 32 bits is detected. Many file systems implement this as part of the sector format; if your file system does not, add it at the application layer.

Incremental checksums: For large files that are appended to (logs, databases), record checkpoint checksums: after every N bytes appended, record the checksum of that segment. If the file is later found corrupted, you can identify exactly which checkpoint-to-checkpoint segment contains the corruption rather than having to re-examine the entire file.
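A checkpoint scheme might be sketched as follows; the segment size of 4096 bytes and the sample "log file" are arbitrary choices:

```python
import zlib

CHECKPOINT_EVERY = 4096  # bytes per segment; a tunable choice

def checkpoint_checksums(data, n=CHECKPOINT_EVERY):
    # CRC-32 of each n-byte segment of the file.
    return [zlib.crc32(data[i:i + n]) for i in range(0, len(data), n)]

def find_corrupt_segment(data, checkpoints, n=CHECKPOINT_EVERY):
    # Index of the first segment whose CRC no longer matches, else None.
    for k, expected in enumerate(checkpoints):
        if zlib.crc32(data[k * n:(k + 1) * n]) != expected:
            return k
    return None

log = bytes(range(256)) * 64          # 16 KiB sample "log file"
saved = checkpoint_checksums(log)

damaged = bytearray(log)
damaged[9000] ^= 0xFF                 # corrupt one byte in segment 2
assert find_corrupt_segment(bytes(damaged), saved) == 9000 // CHECKPOINT_EVERY
```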

Distinguishing Checksums from Hashes

Checksums (CRC, Fletcher, additive) are designed for detecting accidental corruption. They are fast to compute and small in output, but they are not secure — a malicious actor can modify data and adjust the checksum to match.

Cryptographic hash functions (MD5, SHA-1, SHA-256) are designed to be one-way: it should be computationally infeasible to modify data and produce the same hash. They are slower to compute and produce larger outputs (16–32 bytes), but provide integrity verification against both accidental corruption and intentional tampering. (MD5 and SHA-1 are now broken against deliberately engineered collisions; for anything adversarial, use SHA-256.)

For non-adversarial data integrity (storage, backups, transfer over trusted channels), CRC-32 is sufficient and efficient. For any situation where data may have been deliberately modified (downloaded from an untrusted source, received from an unknown party), use SHA-256 or similar.
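In Python, hashlib provides SHA-256 alongside zlib's CRC-32, so the choice is one line either way; the data and helper name below are illustrative:

```python
import hashlib
import zlib

data = b"backup archive contents"

# Non-adversarial integrity: 4-byte CRC-32, cheap to compute and store.
crc = zlib.crc32(data)

# Adversarial integrity: 32-byte SHA-256 digest, infeasible to forge.
def verify_download(payload, expected_hex):
    return hashlib.sha256(payload).hexdigest() == expected_hex

published = hashlib.sha256(data).hexdigest()
assert verify_download(data, published)
assert not verify_download(data + b"!", published)
```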

In a post-collapse computing environment, adversarial tampering may be less of a concern than accidental corruption. CRC-32 is the right default for storage integrity. If and when cryptographic security matters, SHA-256 is the standard choice — implementations are available for all common platforms.