Merkle Trees and Blockchain Verification

April 17, 2026

Introducing Merkle trees

A common problem in distributed systems is proving that a single item belongs to a dataset without transmitting the entire dataset. This comes up everywhere: a blockchain node checking that a transaction was included in a block, a peer-to-peer client verifying that a downloaded file chunk matches the content the sharer originally claimed to offer, a database replica confirming it has the same data as the primary node.

The data structure that makes such proving efficient is the Merkle tree, invented by Ralph Merkle in 1979 and described in his 1987 paper “A Digital Signature Based on a Conventional Encryption Function”.

To appreciate what Merkle trees enable, it helps to first understand the simpler structure they improve upon: the hash list.

A hash list is straightforward. You take each item in your dataset, hash it individually, and then compute a single root hash by hashing the concatenation of all the individual item hashes: Root = H(H0 ‖ H1 ‖ H2 ‖ … ‖ Hn-1). The root acts as a fingerprint for the entire dataset: if you change any item, the root changes. So far, so good.

An important limitation, however, becomes apparent as soon as you try to verify that a single item is in the dataset. A verifier who knows the trusted root cannot just hash one item and compare, because the root depends on all of the item hashes concatenated together. To recompute it, the verifier needs the hash of every other item too. Proving membership of one item out of n using a hash list therefore requires transmitting all n hashes. The proof size grows linearly with the size of the dataset, which quickly becomes prohibitively inefficient for larger datasets.

Hash list: flat structure where all hashes feed into the root
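A minimal sketch of the hash-list construction, using the standard library's `DefaultHasher` as a dependency-free, non-cryptographic stand-in for SHA-256 (`toy_hash` and `hash_list_root` are illustrative names, not part of any library):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy 64-bit stand-in for SHA-256, just to keep the sketch dependency-free.
fn toy_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

// Root = H(H0 || H1 || ... || Hn-1): hash every item, concatenate the
// digests, hash once more. Recomputing the root to verify any single
// item requires all n digests.
fn hash_list_root(items: &[&[u8]]) -> u64 {
    let concatenated: Vec<u8> = items
        .iter()
        .flat_map(|item| toy_hash(item).to_be_bytes())
        .collect();
    toy_hash(&concatenated)
}

fn main() {
    let items: [&[u8]; 4] = [b"tx0", b"tx1", b"tx2", b"tx3"];
    let root = hash_list_root(&items);
    // Changing any single item changes the root.
    let tampered: [&[u8]; 4] = [b"tx0", b"tx1", b"TX2", b"tx3"];
    assert_ne!(root, hash_list_root(&tampered));
    println!("root = {root:016x}");
}
```

The fingerprint property is there, but note that `hash_list_root` consumes every item hash: there is no way to recompute the root from a single item plus a short proof.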

A Merkle tree solves this by organizing those same hashes into a binary tree rather than a flat list. Instead of hashing all items together in one shot, it hashes them in pairs: H0 with H1, H2 with H3, and so on. Those pair-hashes are then paired again at the next level, and so on, until a single root hash remains. This hierarchical structure is what makes it possible to verify membership using only ⌈log₂(n)⌉ hashes, a number that grows logarithmically with the size of the dataset, instead of all n. This property is so useful that Merkle trees have become one of the foundational primitives in distributed systems, from blockchains to file distribution networks and database replication protocols.

In this post we will look at how Merkle trees work, why they are secure, and how they enable lightweight blockchain clients. Along the way we will examine a Rust implementation in detail: the core data structures, the tree construction algorithm, proof generation, and proof verification. But first, let’s see why Merkle trees are so important in blockchains (Bitcoin in particular), how they enable lightweight clients, and how they make verifying a transaction’s membership in a block practical and efficient - the application that will serve as our running example throughout the post.

Blockchain fundamentals

Before we look at how Merkle trees fit into blockchain systems, it is worth taking a step back to understand how a blockchain itself works. If you are already familiar with Bitcoin’s block structure, nonces, and Proof of Work, feel free to skip ahead to Blockchains and transaction verification.

Transactions: the atomic unit

A transaction is the smallest meaningful event on a blockchain. In Bitcoin, a transaction is essentially a signed statement that says “address A sends 0.5 BTC to address B.” It references previous transactions to prove the sender has the funds (inputs), specifies recipient addresses and amounts (outputs), and includes a cryptographic signature proving the sender authorized the transfer. Transactions are broadcast to the network and sit in a pool of unconfirmed transactions, waiting to be picked up by a miner and included in a new block.

Blocks: pages in the ledger

A block is a batch of transactions that have been validated and permanently recorded. Think of it as a page in an append-only ledger: each page holds many entries, and pages are numbered and linked together so that tearing one out or altering it would be immediately obvious.

Every block has two parts. The block header is a compact summary (just 80 bytes in Bitcoin) that contains metadata about the block. The block body contains the full list of transactions. The header is what matters for the chain’s integrity, so let’s look at what it contains:

Field                 Size      Purpose
Version               4 bytes   Protocol version number
Previous block hash   32 bytes  SHA-256 hash of the previous block’s header, the link that forms the chain
Merkle root           32 bytes  A single hash that commits to every transaction in the block (more on this shortly)
Timestamp             4 bytes   Approximate time the block was mined
Difficulty target     4 bytes   How hard the mining puzzle is
Nonce                 4 bytes   A counter the miner increments while searching for a valid hash

The phrase “commits to” in the Merkle root row is a cryptographic term meaning binds irrevocably to. The Merkle root is mathematically determined by the exact set of transactions in the block. Change, remove, reorder, or add even a single transaction, and the Merkle root changes completely. By including this hash in the block header, the block makes a tamper-evident promise about its contents, a promise that cannot be broken without detection.

The nonce and Proof of Work

The nonce (short for number used once) is central to understanding how new blocks are created and why the chain is secure.

To add a block, a miner must solve a computational puzzle called Proof of Work. The miner assembles a candidate block (filling in the previous block hash, the Merkle root of the gathered transactions, the timestamp, and the difficulty target) and then repeatedly hashes the block header, trying different nonce values (0, 1, 2, 3, …), until the resulting hash satisfies a specific condition: it must be numerically smaller than the difficulty target, which in practice means the hash must start with a certain number of leading zeros.

Because cryptographic hash functions are unpredictable (even a one-bit change in the input produces a completely different output), there is no shortcut. The miner cannot work backwards from a desired hash to find the right nonce. It must try value after value, often billions of them, until one happens to produce a hash with enough leading zeros:

Nonce = 0             → Hash = 8a3f7b2c4e…  ✗ (no leading zeros)
Nonce = 1             → Hash = f19e0d4a71…  ✗
Nonce = 2             → Hash = 3c88a1e7b0…  ✗
…
Nonce = 2,871,403,291 → Hash = 000000000019d6…  ✓ (enough leading zeros!)

Why does such a nonce exist at all? The intuition comes from the uniform distribution of hash outputs. A good cryptographic hash function maps inputs to outputs that are effectively uniformly distributed across the entire output space. For SHA-256, there are 2²⁵⁶ possible outputs. The difficulty target defines a threshold: any hash below the target is valid. If the target requires, say, k leading zero bits, then the fraction of valid hashes is 2²⁵⁶⁻ᵏ / 2²⁵⁶ = 1/2ᵏ. Each nonce value produces what is essentially a random sample from the output space, independent of all previous attempts, so the probability of hitting a valid hash on any single try is 1/2ᵏ. This means the expected number of attempts before finding a valid nonce is 2ᵏ: large, but finite. It is a geometric distribution: there is no guarantee that any specific nonce works, but the probability of finding none after a large number of trials shrinks exponentially. In practice, miners also vary other header fields (the timestamp, the transaction set, an extra nonce in the coinbase transaction) to explore a space far larger than the 2³² values of the 4-byte nonce alone, making it overwhelmingly likely that a valid solution exists and will be found.

This mining process is the core of the consensus protocol known as Nakamoto consensus. In a decentralized network with no central authority, nodes need a way to agree on which transactions are valid and in what order they occurred. Proof of Work provides that agreement mechanism: the chain with the most cumulative work (the longest valid chain) is accepted as the canonical history. Miners compete to extend this chain by solving the puzzle described above. The difficulty target adjusts periodically (every 2,016 blocks in Bitcoin, roughly two weeks) so that blocks are produced at an approximately constant rate (one every 10 minutes on average) regardless of how much total computing power the network has. When two miners find valid blocks at roughly the same time, a temporary fork occurs; the network resolves it by following whichever branch gets extended first with subsequent blocks. This simple rule, always follow the longest valid chain, is enough to achieve eventual consensus across thousands of nodes without any coordination.

When a valid nonce is found, the miner broadcasts the block to the network. Other nodes can verify it instantly: they hash the header once and check whether the result meets the difficulty target. Finding the nonce is enormously expensive; verifying it is trivial. This asymmetry is the essence of Proof of Work.
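The search-and-verify asymmetry can be sketched in a few lines. This is a toy model: `toy_hash` stands in for SHA-256 (real Bitcoin mining double-SHA-256 hashes the 80-byte header), and `mine` is a hypothetical helper, but the loop has the same shape:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy 64-bit stand-in for double SHA-256 over the block header.
fn toy_hash(header: &[u8], nonce: u64) -> u64 {
    let mut h = DefaultHasher::new();
    header.hash(&mut h);
    nonce.hash(&mut h);
    h.finish()
}

// Try nonce = 0, 1, 2, ... until the hash falls below the target,
// i.e. has at least `zero_bits` leading zero bits.
fn mine(header: &[u8], zero_bits: u32) -> u64 {
    let target = u64::MAX >> zero_bits;
    (0u64..).find(|&nonce| toy_hash(header, nonce) <= target).unwrap()
}

fn main() {
    let header = b"prev_hash | merkle_root | timestamp";
    // Expect on the order of 2^16 = 65,536 attempts for 16 zero bits.
    let nonce = mine(header, 16);
    // Verification is one hash and one comparison, no matter how many
    // nonces the miner had to try.
    assert!(toy_hash(header, nonce) <= u64::MAX >> 16);
    println!("found nonce {nonce}");
}
```

Raising `zero_bits` by one doubles the expected number of attempts, which is exactly how the difficulty target tunes the block production rate.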

How blocks chain together

Here is the critical design insight that makes the whole structure secure: each block header contains the hash of the previous block’s header. This means every block is cryptographically linked to the one before it, forming a chain that stretches all the way back to the very first block (the genesis block):

Blocks chained together by cryptographic hashes

Each block’s hash depends on the entire header, which includes the previous block’s hash, the Merkle root (which depends on every transaction), and the nonce. This creates a chain of cryptographic dependencies: altering anything in any block changes that block’s hash, which breaks the link from the next block, which breaks the link from the block after that, and so on all the way to the tip of the chain.
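The dependency cascade can be sketched with a simplified header and a toy hash (both hypothetical: real headers carry more fields and are hashed with double SHA-256):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Simplified header: real Bitcoin headers also carry version, timestamp
// and difficulty target.
#[derive(Hash)]
struct Header {
    prev_hash: u64,
    merkle_root: u64,
    nonce: u64,
}

// Toy stand-in for SHA-256 over the serialized header.
fn header_hash(h: &Header) -> u64 {
    let mut hasher = DefaultHasher::new();
    h.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let genesis = Header { prev_hash: 0, merkle_root: 0xaaaa, nonce: 1 };
    let block1 = Header {
        prev_hash: header_hash(&genesis),
        merkle_root: 0xbbbb,
        nonce: 2,
    };

    // The link holds: block1 records the hash of the genesis header.
    assert_eq!(block1.prev_hash, header_hash(&genesis));

    // Tamper with a transaction in genesis (so its Merkle root changes)...
    let forged = Header { merkle_root: 0xcccc, ..genesis };
    // ...and the stored link in block1 no longer matches.
    assert_ne!(block1.prev_hash, header_hash(&forged));
}
```

Every subsequent block repeats this pattern, which is why a single altered field invalidates the entire suffix of the chain.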

Why forging a block is computationally prohibitive

Suppose an attacker wants to alter a transaction buried in Block 5 of a chain that currently has 100 blocks. Here is what would have to happen:

  1. Changing the transaction changes Block 5’s Merkle root (the root commits to the exact transaction set: change one transaction and the root changes); we will see in detail why this is the case later in the post.
  2. A different Merkle root changes Block 5’s header, which changes Block 5’s hash.
  3. Block 6 records Block 5’s hash in its “previous block hash” field. That field no longer matches, so Block 6 is now invalid. The attacker must re-mine Block 6, i.e. find a new nonce whose hash meets the difficulty target.
  4. But re-mining Block 6 gives it a new hash, which invalidates Block 7. The attacker must re-mine Block 7 too.
  5. This cascade continues all the way to Block 100. The attacker must re-mine 95 blocks.

And here is the decisive point: while the attacker is re-doing all that work, the honest network is still mining new blocks on top of Block 100. The attacker is in a race against the combined computational power of every other miner in the network. Unless the attacker controls more than 50% of the total hash rate (the famous 51% attack), the honest chain will always grow faster, and the attacker’s forged chain will never catch up.

The deeper a block is buried under subsequent blocks, the more cumulative work an attacker would need to redo to alter it. This is why a transaction with six confirmations (six blocks mined on top of the block containing it) is considered practically irreversible in Bitcoin.

There is also a deeper economic argument. Bitcoin rewards miners with newly minted BTC (the block reward) and transaction fees for each valid block they add to the longest chain. An attacker who has amassed enough hash power to outpace the honest network and produce the longest chain has, by definition, the ability to earn those rewards legitimately. Attacking the chain would undermine confidence in Bitcoin, devaluing the very currency the attacker is trying to steal or manipulate. Playing by the rules, on the other hand, yields a steady stream of block rewards that are worth more precisely because the network is trustworthy. The incentive structure therefore ensures that the majority of computational power in the network is economically motivated to extend the existing chain and maintain its integrity rather than subvert it. Nakamoto makes this point explicitly in the Bitcoin whitepaper (section 6): a rational actor with majority hash power “ought to find it more profitable to play by the rules… than to undermine the system and the validity of his own wealth.”

This is the security foundation on which the entire blockchain rests. And the Merkle root, the hash that commits each block to its transactions, is the bridge that connects this chain-level security to individual transaction verification. Let’s see how.

Blockchains and transaction verification

With this foundation in place, let’s connect the concepts of “items” and “datasets” from the Merkle tree discussion to the blockchain domain.

In Bitcoin, a transaction is the individual data item, and a block is the dataset. As we saw above, each block header contains a single hash, the Merkle root, that commits to all the transactions in that block. This Merkle root is what allows Bitcoin to verify that a specific transaction was included in a block without downloading the entire block.

Why does this matter, and why are the Merkle tree and Merkle root needed? In Bitcoin’s consensus protocol, full nodes validate every transaction in every block. But not every participant can afford to be a full node: the full blockchain is over 500 GB and growing. A lightweight client (say, a wallet running on your phone) only downloads block headers (80 bytes each) and trusts that the longest chain of valid headers represents the true ledger. When this wallet needs to confirm that a payment it received actually made it into the blockchain, it must verify that the corresponding transaction is included in some block. This is called Simplified Payment Verification (SPV), described in section 8 of Satoshi Nakamoto’s original Bitcoin whitepaper. The wallet asks a full node for a proof of inclusion, and the full node responds with a small set of hashes that the wallet can use, together with the trusted Merkle root from the block header, to confirm the transaction’s presence.

Now let’s see how the hash list fails here and how the Merkle tree succeeds. Suppose a wallet knows the trusted Merkle root and wants to confirm that a particular transaction (say tx2) is in a block. With a hash list, the wallet cannot just hash tx2 and compare: the root depends on all the hashes concatenated together. The wallet would need the hash of every other transaction in the block too: H0, H1, H3, H4, …, the complete set. Proving a transaction belongs to a block of n transactions requires transmitting all n hashes.

With a Merkle tree, the same verification requires only ⌈log₂(n)⌉ hashes. Consider a typical Bitcoin block: as of 2025, an average block contains roughly 3,000-4,000 transactions. With a hash list, a light node verifying a single transaction would need to download all ~3,500 hashes (about 112 KB of SHA-256 digests) and re-hash the entire list. With a Merkle tree, the same verification requires only ⌈log₂(3500)⌉ = 12 hashes, just 384 bytes. That is a ~300× reduction. For the largest blocks, which can hold over 10,000 transactions, the proof is still only 14 hashes (448 bytes), while the hash list approach would require over 320 KB. This logarithmic scaling is precisely what makes SPV practical: a mobile wallet can confirm a transaction’s inclusion in a block by downloading a handful of hashes rather than the entire block, which is why lightweight Bitcoin clients can operate on devices with limited bandwidth and storage. This efficiency has been essential to Bitcoin’s wide practical adoption.
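The arithmetic behind these numbers is easy to check. Here is a back-of-the-envelope sketch (`merkle_proof_bytes` is a hypothetical helper, not part of the library discussed later):

```rust
// A Merkle inclusion proof carries ceil(log2(n)) sibling hashes of
// 32 bytes each, versus n hashes for a flat hash list.
fn merkle_proof_bytes(n: u64) -> u64 {
    // 64 - leading_zeros(n - 1) computes ceil(log2(n)) for n >= 2.
    let levels = 64 - (n - 1).leading_zeros() as u64;
    levels * 32
}

fn main() {
    assert_eq!(merkle_proof_bytes(3_500), 384); // 12 hashes
    assert_eq!(merkle_proof_bytes(10_000), 448); // 14 hashes
    let hash_list_bytes = 3_500u64 * 32; // 112,000 bytes for the flat list
    println!(
        "3,500 txs: {} B (tree proof) vs {} B (hash list)",
        merkle_proof_bytes(3_500),
        hash_list_bytes
    );
}
```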

From here on, we will use “transaction” and “block” as our running examples for “item” and “dataset”, but keep in mind that Merkle trees are a general-purpose data structure: while extensively used in the blockchain world, they apply equally to any scenario where you need to prove that a piece of data belongs to a committed set.

What is a Merkle tree?

A Merkle tree is a binary tree where every leaf node contains the hash of a data block, and every internal node contains the hash of its two children concatenated together. The single hash at the top, the root hash, acts as a fingerprint for the entire dataset. Change a single byte in any leaf, and the root hash changes. This is the core property of a Merkle tree. Because every internal node is the hash of its children, a modification at any leaf cascades upward through every ancestor, ultimately producing a completely different root. This cascade is what makes the root hash a reliable commitment to the entire dataset. The term “commits to” is standard in cryptography and means binds irrevocably to: once a root hash is published, the data it was computed from is fixed, and it is impossible to alter, insert, or remove any piece of that data without producing a different root, immediately revealing the tampering.

All of these security guarantees rest on the choice of hash function. A Merkle tree is only as strong as the hash function it uses, and the function must satisfy several properties to make the tree trustworthy. Collision resistance means it is computationally infeasible to find two different inputs that produce the same hash; without it, an attacker could swap one leaf for another without changing the root. Preimage resistance means that given a hash, it is infeasible to find any input that produces it; this prevents an attacker from reverse-engineering leaf data from published hashes. The avalanche effect ensures that a small change in the input (even a single flipped bit) changes roughly half the bits of the output; this is what makes the change-detection cascade through tree levels reliable rather than subtle. In practice, SHA-256 is the dominant choice: it provides a 256-bit output with strong collision resistance (~2¹²⁸ operations to find a collision), full preimage resistance (~2²⁵⁶ operations), excellent avalanche behavior, and has been battle-tested for over two decades with no known practical weaknesses. Bitcoin, Certificate Transparency, and IPFS all use it for their Merkle trees. We will discuss the properties of SHA-256 and the rationale for this choice in more detail in The Hash type and helper functions section below.

Here is a Merkle tree built from four items (in the blockchain context we discussed earlier, items are transactions and the dataset is a block, but the structure is entirely generic):

Merkle tree with four items

Each leaf hash (H0 through H3) is computed from the raw item data using a cryptographic hash function (SHA-256 in our implementation). The internal nodes H01 and H23 are computed by hashing the concatenation of their children. The root is computed from those two intermediate hashes.

This data structure has a key property: given the root hash and a small number of intermediate hashes, anyone can verify that a specific leaf belongs to the tree. We will see exactly how that works shortly. But first, let’s look at the Rust implementation to understand how these ideas translate into concrete code.

The Hash type and helper functions

Before we can build a Merkle tree we need a type to represent SHA-256 digests and a function to hash arbitrary data. The implementation starts with a Hash wrapper around a 32-byte array:

/// A SHA-256 digest (32 bytes).
///
/// This is the fundamental unit of data throughout the tree. It wraps a
/// fixed-size byte array, avoiding heap allocations and enabling `Copy`.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Hash([u8; 32]);

impl Hash {
    /// Returns a reference to the underlying 32-byte array.
    pub fn as_bytes(&self) -> &[u8; 32] {
        &self.0
    }
}

Why wrap a plain [u8; 32] in a newtype? There are a few reasons. First, it makes the API self-documenting: a function that takes a Hash clearly expects a SHA-256 digest, not any 32-byte value. Second, we can implement Display to get nice hex formatting and Debug with a truncated representation for readable debug output:

impl fmt::Display for Hash {
    /// Full lowercase hex string (64 characters).
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for byte in &self.0 {
            write!(f, "{byte:02x}")?;
        }
        Ok(())
    }
}

impl fmt::Debug for Hash {
    /// Short hex representation (first 4 bytes / 8 hex chars) for readable debug output.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Hash({}…)", &format!("{self}")[..8])
    }
}

These trait implementations may seem like minor polish, but they make the library much more pleasant to work with. When you print a proof or debug a tree, you see meaningful hex strings instead of arrays of raw bytes.

For plain hashing of arbitrary data there is a simple hash function:

/// Hash arbitrary data with SHA-256.
pub fn hash(data: &[u8]) -> Hash {
    let mut hasher = Sha256::new();
    hasher.update(data);
    Hash(hasher.finalize().into())
}

This is a straightforward wrapper around the sha2 crate. We will use it in our examples when we need to produce hashes that are not tree leaves (for example, when simulating tampered data).

A note on the choice of hash function: a Merkle tree can in principle be built with any cryptographic hash function, but SHA-256 is by far the most common choice. Bitcoin uses it, Certificate Transparency (RFC 6962) mandates it, and IPFS supports it as the default. Its popularity in Merkle tree implementations comes down to a combination of properties. Its 256-bit output is large enough to make collisions computationally infeasible: finding two inputs that hash to the same output would require on the order of 2¹²⁸ operations, not 2²⁵⁶, because of the birthday paradox (when you are looking for any colliding pair rather than a match against a specific target, the probability grows much faster, analogous to how in a room of just 23 people there is a >50% chance that some two share a birthday out of 365 possible). It offers strong resistance to preimage attacks: given a hash, finding any input that produces it requires the full 2²⁵⁶ operations. And it exhibits the avalanche effect: flipping a single input bit changes roughly half the output bits, which is what makes the change-detection cascade through tree levels possible. SHA-256 is also fast enough for practical use while being well-studied and battle-tested over two decades, with no known practical weaknesses. Our implementation uses SHA-256 throughout, but the tree construction and proof logic are hash-agnostic: swapping in a different function (say, BLAKE3 or SHA-3) would only require changing the leaf and pair hashing functions.
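The avalanche effect is easy to observe empirically. This sketch uses the standard library's `DefaultHasher` as a 64-bit stand-in (SHA-256 behaves the same way over a 256-bit output):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy 64-bit stand-in for a cryptographic hash.
fn toy_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

fn main() {
    // 'x' (0x78) and 'y' (0x79) differ in exactly one bit.
    let diff = (toy_hash(b"x") ^ toy_hash(b"y")).count_ones();
    println!("{diff} of 64 output bits flipped");
    // Avalanche effect: roughly half the output bits change; for a good
    // hash the count lands overwhelmingly likely well inside 8..56.
    assert!(diff > 8 && diff < 56);
}
```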

Domain separation: a subtle but important detail

There is a security concern that is easy to overlook. Consider what happens if leaf data and internal node data are hashed the same way. An attacker could potentially craft leaf data that, when hashed, produces the same bytes as a legitimate internal node hash. This is known as a second-preimage attack on the tree structure.

Why is this dangerous? Suppose an attacker constructs a leaf whose raw bytes happen to equal the concatenation of two other hashes, say, H0 ‖ H1. If leaves and internal nodes are hashed identically, this fake leaf would produce the same hash as the internal node H01. The attacker could then present a shorter, fabricated tree, one with fewer real items, that has the same root hash as the original. A verifier comparing roots would see a match and accept the forged tree as authentic, even though it contains completely different data. In a blockchain context, this would mean an attacker could trick a light client into accepting a block with a different set of transactions than the one miners actually committed to.

The solution, standardized in RFC 6962 (Certificate Transparency), is domain separation: prefix leaf hashes with a 0x00 byte and internal node hashes with a 0x01 byte. This ensures that a leaf hash and an internal node hash can never collide, even if the underlying data happens to be identical.

Domain separation: leaf vs internal node hashing

In Rust this looks quite straightforward. Here are the two hashing functions that the tree uses internally:

/// Domain-separated leaf hash: SHA-256(0x00 || data).
///
/// The 0x00 prefix prevents second-preimage attacks where an attacker
/// crafts leaf data that collides with an internal node hash.
///
/// Reference: Certificate Transparency (RFC 6962, §2.1) uses domain
/// separation for exactly this reason.
pub fn hash_leaf(data: &[u8]) -> Hash {
    let mut hasher = Sha256::new();
    hasher.update([0x00]);
    hasher.update(data);
    Hash(hasher.finalize().into())
}

/// Domain-separated internal node hash: SHA-256(0x01 || left || right).
///
/// The 0x01 prefix ensures internal nodes live in a different hash domain
/// than leaves, preventing second-preimage attacks.
fn hash_pair(left: &Hash, right: &Hash) -> Hash {
    let mut hasher = Sha256::new();
    hasher.update([0x01]);
    hasher.update(left.0);
    hasher.update(right.0);
    Hash(hasher.finalize().into())
}

Notice that hash_leaf is public: consumers of the library need it to compute leaf hashes for proof lookup (for example, to verify that downloaded content matches a proof’s leaf hash). In contrast, hash_pair is private: it is only used internally during tree construction and proof verification. There is no reason for external code to pair arbitrary hashes.

A small prefix, but it closes a real attack vector. It is a good reminder that in cryptographic constructions, the details that seem trivial are often the ones that matter most.

We can verify the domain separation works by checking that hashing the same data as a leaf and as a pair of internal nodes produces different results:

#[test]
fn leaf_and_internal_hashes_differ_due_to_domain_prefix() {
    let data = hash(b"x");
    let as_leaf = hash_leaf(&data.as_bytes().repeat(2));
    let as_pair = hash_pair(&data, &data);
    assert_ne!(as_leaf, as_pair);
}

Even though the raw bytes being hashed are the same (two copies of hash(b"x")), the different prefix byte means the two functions produce completely different digests.

Building the tree: bottom-up construction

With the hash functions defined, let’s look at how the tree itself is built. The MerkleTree struct stores all levels of the tree from root (index 0) down to leaves (last index):

/// A SHA-256 Merkle tree built from leaf data.
///
/// Internally stores every level of hashes from root (index 0) down to
/// the leaves (last index). This enables proof generation by walking
/// from the leaf level upward.
#[derive(Clone, Debug)]
pub struct MerkleTree {
    /// Tree levels, root-first. `levels[0]` contains the single root hash
    /// (or is empty for an empty tree), and `levels[len-1]` contains the
    /// leaf hashes.
    levels: Vec<Vec<Hash>>,
    /// Number of original leaves (before odd-leaf duplication).
    leaf_count: usize,
}

Why store all levels? It is a deliberate trade-off. Storing only the root would be sufficient if we only ever needed to verify proofs, but we also need to generate proofs. Proof generation requires walking from a leaf upward through the tree, collecting the sibling hash at each level. Having all levels in memory makes this straightforward. The space overhead is acceptable: for n leaves, the total number of hashes across all levels is roughly 2n (a geometric series that converges to 2n), so we use about twice the space of just the leaves. For a tree with a million SHA-256 leaves, that is about 64 MB, entirely reasonable.
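The geometric-series claim can be checked with a few lines of arithmetic (`total_hashes` is an illustrative helper; it ignores the odd-level duplication described later, which only adds at most one hash per level):

```rust
// Hashes stored per level for n leaves: n, then ceil(n/2), and so on
// down to the single root; the series sums to roughly 2n.
fn total_hashes(n: u64) -> u64 {
    let mut total = 0;
    let mut level = n;
    while level > 1 {
        total += level;
        level = (level + 1) / 2; // next level halves, rounding up
    }
    total + 1 // the root
}

fn main() {
    let n = 1_000_000u64;
    let total = total_hashes(n);
    // About 2n hashes at 32 bytes each: ~64 MB for a million leaves.
    assert!(total > 2 * n - 16 && total < 2 * n + 16);
    println!("{total} hashes, {} MB", total * 32 / 1_000_000);
}
```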

The build method constructs the tree bottom-up:

pub fn build<T: AsRef<[u8]>>(leaves: &[T]) -> MerkleTree {
    if leaves.is_empty() {
        return MerkleTree {
            levels: vec![vec![]],
            leaf_count: 0,
        };
    }

    let leaf_count = leaves.len();

    let mut current_level: Vec<Hash> =
        leaves.iter().map(|l| hash_leaf(l.as_ref())).collect();
    let mut levels: Vec<Vec<Hash>> = Vec::new();

    while current_level.len() > 1 {
        // Duplicate the last node when the level has an odd count, matching
        // Bitcoin's Merkle tree construction which duplicates the trailing
        // hash so every node has a pair.
        if current_level.len() % 2 != 0 {
            if let Some(&last) = current_level.last() {
                current_level.push(last);
            }
        }
        levels.push(current_level.clone());

        current_level = current_level
            .chunks_exact(2)
            .map(|pair| hash_pair(&pair[0], &pair[1]))
            .collect();
    }

    levels.push(current_level);
    levels.reverse();

    MerkleTree { levels, leaf_count }
}

Let’s trace through this for four leaves ["tx0", "tx1", "tx2", "tx3"] to understand exactly what happens at each stage:

Bottom-up construction of a Merkle tree

Step 1. Each leaf is hashed with domain separation: hash_leaf(b"tx0") produces H0, hash_leaf(b"tx1") produces H1, and so on. This gives us the initial current_level = [H0, H1, H2, H3].

Step 2. The level has 4 elements (even), so no duplication is needed. We push this level onto levels, then produce the next level by pairing adjacent hashes: H01 = hash_pair(H0, H1) and H23 = hash_pair(H2, H3). Now current_level = [H01, H23].

Step 3. The level has 2 elements (even again). We push it onto levels, then pair: Root = hash_pair(H01, H23). Now current_level = [Root].

Step 4. current_level.len() == 1, so the while loop exits. We push the root onto levels and reverse the whole vector so that levels[0] contains the root and levels[last] contains the leaves.

The resulting levels vector looks like:

levels[0] = [Root]                 // 1 hash
levels[1] = [H01, H23]            // 2 hashes
levels[2] = [H0, H1, H2, H3]     // 4 hashes

A few implementation details worth noting. The generic type parameter T: AsRef<[u8]> means build accepts any type that can be viewed as a byte slice: &str, String, Vec<u8>, &[u8], etc. This makes the API flexible without sacrificing performance. The chunks_exact(2) method from the standard library gives us an iterator over pairs, which maps cleanly onto the pairing operation.
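The pairing step can be seen in isolation with plain integers standing in for hashes: `chunks_exact(2)` yields non-overlapping length-2 windows and silently drops any trailing remainder, which is safe here only because the duplication step guarantees every level has even length. A small sketch (the combining function is an arbitrary stand-in for `hash_pair`):

```rust
fn main() {
    let level = [10u64, 20, 30, 40];
    // Stand-in for hash_pair: combine each adjacent pair into one value.
    let next: Vec<u64> = level
        .chunks_exact(2)
        .map(|pair| pair[0].wrapping_mul(31).wrapping_add(pair[1]))
        .collect();
    assert_eq!(next.len(), 2); // 4 nodes -> 2 parents
    println!("{next:?}");
}
```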

The time complexity is O(n) where n is the number of leaves: we hash each leaf once, and each subsequent level has half as many nodes, giving us n + n/2 + n/4 + … ≈ 2n hash operations total.

Accessor methods provide read access to the tree’s contents:

/// Returns the root hash, or `None` if the tree was built from empty input.
pub fn get_root_hash(&self) -> Option<&Hash> {
    self.levels.first().and_then(|level| level.first())
}

/// Returns the domain-separated hash of the leaf at the given index,
/// or `None` if the index is out of bounds.
pub fn get_leaf_hash(&self, index: usize) -> Option<&Hash> {
    if index >= self.leaf_count {
        return None;
    }
    self.levels.last().and_then(|level| level.get(index))
}

Note the bounds check in get_leaf_hash: it uses self.leaf_count rather than the leaf level’s length. This is important because the leaf level may contain duplicated entries (when the original number of leaves was odd). We do not want to expose those duplicates to callers: they are an internal implementation detail.

Handling odd numbers of leaves

What happens when the number of leaves is not a power of two? The standard approach, used in Bitcoin, is to duplicate the last leaf so that every level has an even number of nodes. For example, a tree with three transactions [tx0, tx1, tx2] is built as if it had four: [tx0, tx1, tx2, tx2].

Handling odd number of leaves by duplicating the last one

This duplication happens at every level, not just the leaf level. Consider a tree with five leaves: [tx0, tx1, tx2, tx3, tx4]. At the leaf level, tx4 is duplicated to make six. Those six hashes produce three pairs at the next level, and then the last one is duplicated again to make four. This continues until we reach the root.
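The cascading duplication can be sketched as pure arithmetic on level sizes. Note that this computes the padded size each pairing step operates on; whether the tree stores the duplicate at every internal level is an implementation detail this sketch does not assert:

```rust
// Padded size of each level from leaves up to the root, duplicating the
// last node whenever a level's size is odd (Bitcoin-style).
fn level_sizes(mut n: usize) -> Vec<usize> {
    let mut sizes = Vec::new();
    while n > 1 {
        n += n % 2;    // duplicate the last node on odd levels
        sizes.push(n); // size seen by the pairing step, duplicate included
        n /= 2;
    }
    sizes.push(1); // the root
    sizes
}

fn main() {
    // 5 leaves → 6 (dup) → 3, padded to 4 (dup) → 2 → 1.
    assert_eq!(level_sizes(5), vec![6, 4, 2, 1]);
    // 3 leaves → 4 (dup) → 2 → 1.
    assert_eq!(level_sizes(3), vec![4, 2, 1]);
    // Powers of two never need padding.
    assert_eq!(level_sizes(4), vec![4, 2, 1]);
}
```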

The test suite verifies this behavior:

#[test]
fn odd_leaf_count_duplicates_last() -> anyhow::Result<()> {
    // 3 leaves → [a, b, c, c_dup] at leaf level
    let tree = MerkleTree::build(&["a", "b", "c"]);
    assert_eq!(tree.levels.len(), 3);
    // Leaf level has 4 entries (3 + 1 duplicate)
    assert_eq!(tree.levels[2].len(), 4);
    assert_eq!(
        tree.levels[2][2], tree.levels[2][3],
        "last leaf must be duplicated"
    );
    Ok(())
}

#[test]
fn five_leaves_duplicate_at_each_odd_level() {
    let tree = MerkleTree::build(&["a", "b", "c", "d", "e"]);
    // 5 leaves → 6 (dup) → 3 pairs → 4 (dup) → 2 → 1
    assert_eq!(tree.levels.len(), 4);
    assert_eq!(tree.levels[0].len(), 1);
    assert_eq!(tree.levels[3].len(), 6);
}

The first test confirms that a 3-leaf tree produces a leaf level of length 4, with the last two entries being identical (the duplicated leaf). The second test shows the cascading duplication for 5 leaves: the leaf level has 6 entries (5 + 1 duplicate), which produces 3 pairs, which are padded to 4, producing 2, producing 1 (the root). Four levels total.

Looking ahead a little, note that proofs still work correctly for leaves in odd-sized trees, including the duplicated leaf itself:

#[test]
fn odd_leaf_duplication_does_not_break_proof() -> anyhow::Result<()> {
    // In a 3-leaf tree [a, b, c], "c" is duplicated to make [a, b, c, c].
    // Proof for "c" should still verify correctly.
    let tree = MerkleTree::build(&["a", "b", "c"]);
    let root = tree.get_root_hash().context("missing root")?;
    let leaf_c = hash_leaf(b"c");
    let proof = tree.generate_proof(&leaf_c).context("proof not found")?;
    assert!(proof.verify(root));
    Ok(())
}

More on proofs in the next section.

This keeps the construction simple at the cost of a minor space overhead: at most one extra hash per level, which is negligible compared to the level’s size.

Inclusion proofs

The most powerful feature of a Merkle tree is the inclusion proof (also called a Merkle proof or authentication path). Given a leaf, we can prove it belongs to the tree by providing only the sibling hashes along the path from that leaf to the root. The verifier recomputes the root by hashing upward and checks whether the computed value matches the expected root hash.

Before looking at the proof generation algorithm, let’s define the data structures. A proof consists of a leaf hash and a sequence of steps, where each step records a sibling hash and its position:

/// Whether a sibling is to the left or right of the path node in the tree.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Direction {
    Left,
    Right,
}

/// A single step in a Merkle inclusion proof: a sibling hash and its position.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ProofStep {
    /// The sibling hash at this tree level.
    pub hash: Hash,
    /// The position of the sibling relative to the path node.
    pub direction: Direction,
}

/// A Merkle inclusion proof: a leaf hash plus the sibling hashes needed to
/// recompute the root.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Proof {
    /// The leaf hash this proof is for.
    pub leaf: Hash,
    /// Sibling hashes from leaf level up to (but not including) the root.
    pub steps: Vec<ProofStep>,
}

Why record the direction? Because hash_pair(A, B) produces a different result from hash_pair(B, A). When the verifier recomputes the root, it must place the sibling and the current hash in the correct order. If the sibling is on the left, it goes first; if on the right, it goes second.
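The order sensitivity is easy to demonstrate with any hash function; here the standard library's SipHash-based DefaultHasher stands in for the article's hash_pair:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Toy stand-in for `hash_pair`: stream both inputs through one hasher.
// Writing `a` then `b` hashes the concatenation a ‖ b, so order matters.
fn toy_pair(a: &[u8], b: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(a);
    h.write(b);
    h.finish()
}

fn main() {
    // Swapping the operands yields a different hash, which is why the
    // verifier must know which side the sibling is on.
    assert_ne!(toy_pair(b"left", b"right"), toy_pair(b"right", b"left"));
}
```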

Let’s trace through a concrete example. Suppose we want to prove that tx2 is included in our four-transaction tree. The proof consists of the sibling hashes along the path from tx2 to the root:

Inclusion proof for tx2 showing the verification path

The proof for tx2 contains two hashes: H3 (the sibling at the leaf level, on the right) and H01 (the sibling at the next level up, on the left). The verification process is:

  1. Start with the leaf hash: H2 = hash_leaf(tx2)
  2. Combine with sibling H3 (which is on the right): H23 = hash_pair(H2, H3)
  3. Combine with sibling H01 (which is on the left): Root' = hash_pair(H01, H23)
  4. Check: does Root' equal the expected root? If yes, tx2 is in the tree.

The proof size is O(log n): for a tree with n leaves, you need only ⌈log₂(n)⌉ sibling hashes. A tree with a million leaves requires just 20 hashes in the proof, roughly 640 bytes with SHA-256. This is remarkably compact.
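The million-leaf figure is quick to check:

```rust
fn main() {
    // ⌈log₂(1_000_000)⌉ sibling hashes, each 32 bytes with SHA-256.
    let depth = (1_000_000f64).log2().ceil() as usize;
    assert_eq!(depth, 20);
    assert_eq!(depth * 32, 640); // bytes per proof
}
```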

The test suite verifies this logarithmic relationship:

#[test]
fn ten_thousand_leaves_proof_depth_is_log2_n() -> anyhow::Result<()> {
    let n = 10000;
    let leaves: Vec<String> = (0..n).map(|i| format!("leaf_{i}")).collect();
    let refs: Vec<&str> = leaves.iter().map(|s| s.as_str()).collect();
    let tree = MerkleTree::build(&refs);
    let root = tree.get_root_hash().context("missing root")?;

    let target = hash_leaf(b"leaf_500");
    let proof = tree.generate_proof(&target).context("proof not found")?;

    // Proof steps should be ⌈log₂(n)⌉ = 14 for n=10000
    let expected_depth = (n as f64).log2().ceil() as usize;
    assert_eq!(proof.steps.len(), expected_depth);
    assert!(proof.verify(root));
    Ok(())
}

For 10,000 leaves, the proof contains exactly ⌈log₂(10000)⌉ = 14 hashes. That is 14 × 32 = 448 bytes to prove membership in a dataset that could be megabytes or gigabytes in size.

Generating proofs: walking from leaf to root

Now let’s look at the proof generation algorithm. Given a leaf hash, we find it at the bottom level of the tree and walk upward, collecting the sibling hash at each level:

pub fn generate_proof(&self, leaf_hash: &Hash) -> Option<Proof> {
    let leaf_level = self.levels.last()?;
    let mut position = leaf_level.iter().position(|h| h == leaf_hash)?;

    let mut steps = Vec::with_capacity(self.levels.len().saturating_sub(1));

    // Walk from leaf level upward toward the root, collecting each sibling
    // hash. `levels[0]` is the root and `levels[last]` is the leaf level.
    // At each level we find the sibling of our current position, record it,
    // then halve the position to move to the parent level.
    for depth in (1..self.levels.len()).rev() {
        let level = &self.levels[depth];
        // Nodes are paired (0-1, 2-3, …): even positions pair with the next
        // node (right sibling), odd positions pair with the previous one
        // (left sibling).
        let sibling_pos = if position % 2 == 0 {
            position + 1
        } else {
            position - 1
        };
        let sibling_hash = level[sibling_pos];
        let direction = if position % 2 == 0 {
            Direction::Right
        } else {
            Direction::Left
        };
        steps.push(ProofStep {
            hash: sibling_hash,
            direction,
        });
        position = position / 2;
    }

    Some(Proof {
        leaf: *leaf_hash,
        steps,
    })
}

This is the core of the proof generation, so let’s walk through it carefully for our tx2 example in a four-leaf tree.

The tree levels are:

levels[0] = [Root]               // root level
levels[1] = [H01, H23]           // internal level
levels[2] = [H0, H1, H2, H3]     // leaf level

We search for H2 (the hash of tx2) in the leaf level (levels[2]) and find it at position = 2.

Iteration 1 (depth = 2, the leaf level). Position 2 is even, so the sibling is at position 3 (the next node): that is H3. The direction is Right (the sibling is to the right of our node). We record ProofStep { hash: H3, direction: Right }. Then we halve the position: position = 2 / 2 = 1.

Iteration 2 (depth = 1, the internal level). Position 1 is odd, so the sibling is at position 0 (the previous node): that is H01. The direction is Left (the sibling is to the left of our node). We record ProofStep { hash: H01, direction: Left }. Then we halve: position = 1 / 2 = 0.

The loop never visits depth = 0 (the root level): for this three-level tree, the range (1..self.levels.len()).rev() yields depth 2, then depth 1, and stops.

The result is a proof with two steps: [{H3, Right}, {H01, Left}]. This matches exactly what we drew in the diagram above.

The position halving (position = position / 2) is the key insight: it reflects the tree’s structure. At the leaf level, nodes 0 and 1 share parent 0, nodes 2 and 3 share parent 1. At the next level, nodes 0 and 1 share parent 0. Dividing the position by 2 maps a node to its parent’s position in the level above.
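The sibling and parent arithmetic can be stated even more compactly: the sibling of position p is p with its low bit flipped, and the parent is p divided by two. A small sketch, equivalent to the if/else in generate_proof:

```rust
// Sibling of position p: flip the low bit (even ↔ odd within a pair).
fn sibling(p: usize) -> usize {
    p ^ 1
}

// Parent of position p in the level above.
fn parent(p: usize) -> usize {
    p / 2
}

fn main() {
    // Leaf level of four nodes: (0, 1) share parent 0, (2, 3) share parent 1.
    assert_eq!(sibling(2), 3);
    assert_eq!(sibling(3), 2);
    assert_eq!(parent(2), 1);
    assert_eq!(parent(3), 1);
    // tx2's path in the walkthrough: position 2 → 1 → 0.
    assert_eq!(parent(parent(2)), 0);
}
```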

Note that the method returns None if the leaf hash is not found in the tree. This is the correct behavior: you cannot generate a proof for data that is not in the tree.

A special case worth mentioning: a single-leaf tree. If there is only one leaf, the tree has a single level levels[0] = [H0], and generate_proof produces a proof with zero steps. The root is the leaf hash, so no sibling hashes are needed:

#[test]
fn single_leaf_proof_has_no_steps_and_verifies() -> anyhow::Result<()> {
    let tree = MerkleTree::build(&["only"]);
    let root = tree.get_root_hash().context("missing root")?;
    let leaf = hash_leaf(b"only");
    let proof = tree.generate_proof(&leaf).context("proof not found")?;
    assert!(proof.steps.is_empty(), "single-leaf proof needs no steps");
    assert!(proof.verify(root));
    assert_eq!(proof.compute_root(), *root);
    Ok(())
}

Verifying proofs: a fold over the path

Verification is the elegant counterpart to generation. Starting from the leaf hash, we combine it with each sibling in the proof, respecting the recorded direction, until we reach the root:

impl Proof {
    /// Recompute the root hash from this proof's leaf and steps.
    ///
    /// Starting from the leaf hash, each step provides a sibling hash and
    /// its direction. The pair is combined with domain-separated hashing
    /// (preserving left/right order) to produce the parent hash, until the
    /// root is reached.
    pub fn compute_root(&self) -> Hash {
        self.steps.iter().fold(self.leaf, |current, step| {
            match step.direction {
                // Sibling is on the left → sibling comes first.
                Direction::Left => hash_pair(&step.hash, &current),
                // Sibling is on the right → current comes first.
                Direction::Right => hash_pair(&current, &step.hash),
            }
        })
    }

    /// Verify that this proof's leaf belongs to a tree with the given root.
    pub fn verify(&self, expected_root: &Hash) -> bool {
        self.compute_root() == *expected_root
    }
}

This is the entire verification logic: a single fold over the proof steps and one hash comparison at the end. No tree reconstruction, no complex state. The verifier does not even need to know the size of the original tree.

Let’s trace through verification of our tx2 proof:

  1. Start: current = H2 (the leaf hash)
  2. Step 1: sibling H3 is on the Right → current = hash_pair(H2, H3) = H23
  3. Step 2: sibling H01 is on the Left → current = hash_pair(H01, H23) = Root'
  4. Compare: Root' == Root? If yes, the proof is valid.

The fold operation maps perfectly onto the mathematical definition: we are literally “folding” the path from leaf to root, accumulating hash values as we go. Rust’s type system and the fold combinator make this both concise and correct.

The test suite exhaustively verifies that every leaf in a tree produces a valid proof:

#[test]
fn every_leaf_proof_verifies_against_root() -> anyhow::Result<()> {
    let items = &["a", "b", "c", "d"];
    let tree = MerkleTree::build(items);
    let root = tree.get_root_hash().context("missing root")?;

    for item in items {
        let leaf = hash_leaf(item.as_bytes());
        let proof = tree.generate_proof(&leaf).context("proof not found")?;
        assert_eq!(proof.leaf, leaf);
        assert!(proof.verify(root), "proof failed for leaf '{item}'");
    }
    Ok(())
}

And it also verifies the tamper-detection properties. Changing any part of the proof (a sibling hash, the leaf hash, or even the direction of a sibling) causes verification to fail:

#[test]
fn tampered_sibling_hash_invalidates_proof() -> anyhow::Result<()> {
    let tree = MerkleTree::build(&["a", "b", "c", "d"]);
    let root = tree.get_root_hash().context("missing root")?;
    let leaf = hash_leaf(b"a");
    let mut proof = tree.generate_proof(&leaf).context("proof not found")?;
    proof.steps[0].hash = hash(b"tampered");
    assert!(!proof.verify(root));
    Ok(())
}

#[test]
fn tampered_leaf_hash_invalidates_proof() -> anyhow::Result<()> {
    let tree = MerkleTree::build(&["a", "b", "c", "d"]);
    let root = tree.get_root_hash().context("missing root")?;
    let leaf = hash_leaf(b"a");
    let mut proof = tree.generate_proof(&leaf).context("proof not found")?;
    proof.leaf = hash_leaf(b"TAMPERED");
    assert!(!proof.verify(root));
    Ok(())
}

#[test]
fn flipped_sibling_direction_invalidates_proof() -> anyhow::Result<()> {
    let tree = MerkleTree::build(&["a", "b", "c", "d"]);
    let root = tree.get_root_hash().context("missing root")?;
    let leaf = hash_leaf(b"a");
    let mut proof = tree.generate_proof(&leaf).context("proof not found")?;
    proof.steps[0].direction = match proof.steps[0].direction {
        Direction::Left => Direction::Right,
        Direction::Right => Direction::Left,
    };
    assert!(!proof.verify(root));
    Ok(())
}

The direction test is particularly interesting. It confirms that hash_pair is not commutative: swapping the order of arguments produces a completely different hash. This is by design: it prevents an attacker from rearranging siblings in the proof to forge a valid path.

Blockchain verification with SPV

Now that we understand how Merkle trees work at a fundamental level (the data structures, the construction, proof generation, and verification), we can see why they are so valuable in blockchain systems. The logarithmic proof size is exactly what makes Simplified Payment Verification (SPV) possible. SPV was described in Section 7 of the Bitcoin whitepaper and is the reason lightweight wallets can function without storing the entire blockchain.

SPV architecture: full node vs light node

A full node stores every block with all its transactions and the corresponding Merkle tree. A light node stores only block headers, which include the Merkle root of each block’s transaction tree. When the light node wants to verify a transaction, it asks a full node for a Merkle proof, then verifies it locally against the stored root.

The trust model is important here: the light node does not need to trust the full node. If the full node provides a fraudulent proof, it will not match the Merkle root that the light node already has from the block headers. The proof is either mathematically correct or it is not.

Let’s look at how this works in practice using a Rust implementation. First, we need a transaction type with deterministic serialisation:

/// A simplified Bitcoin-like transaction.
///
/// In Bitcoin's UTXO model, each transaction consumes previous unspent
/// outputs (inputs) and creates new outputs.  There is no per-sender
/// nonce: replay protection is inherent because once a UTXO is spent it
/// ceases to exist.  Here we use a simplified representation with `from`,
/// `to`, and `value` fields to keep the focus on Merkle tree verification
/// rather than full UTXO mechanics.
struct Transaction {
    from: String,
    to: String,
    value: u64,
}

impl Transaction {
    /// Deterministic binary serialisation used as Merkle tree leaf data.
    ///
    /// A real implementation would use a canonical binary encoding;
    /// here we concatenate fields with a delimiter that cannot appear in the
    /// address representation to keep things simple.
    fn to_bytes(&self) -> Vec<u8> {
        let mut buf = Vec::new();
        buf.extend_from_slice(self.from.as_bytes());
        buf.push(b'|');
        buf.extend_from_slice(self.to.as_bytes());
        buf.push(b'|');
        buf.extend_from_slice(self.value.to_le_bytes().as_slice());
        buf
    }
}

The serialisation uses a simple delimiter-based format. In a real Bitcoin implementation the transaction would be serialised using Bitcoin’s own binary format, but the key requirement for Merkle tree construction is determinism: the same transaction must always produce the same bytes, and therefore the same leaf hash.

Now we define the participants: blocks, full nodes, and light nodes:

/// A block on the full node: contains the complete transaction list
/// and its Merkle tree.
struct Block {
    id: String,
    transactions: Vec<Transaction>,
    tree: MerkleTree,
}

impl Block {
    fn new(id: impl Into<String>, transactions: Vec<Transaction>) -> Self {
        let id = id.into();
        let leaves: Vec<Vec<u8>> = transactions.iter().map(|tx| tx.to_bytes()).collect();
        let tree = MerkleTree::build(&leaves);
        Self { id, transactions, tree }
    }
}

/// A full node stores complete blocks and can generate inclusion proofs.
struct FullNode {
    blocks: Vec<Block>,
}

impl FullNode {
    /// Generate a Merkle proof for `transaction_index` within `block_id`.
    fn proof_for(&self, block_id: &str, transaction_index: usize) -> Option<Proof> {
        let block = self.blocks.iter().find(|b| b.id == block_id)?;
        let leaf = block.tree.get_leaf_hash(transaction_index)?;
        block.tree.generate_proof(leaf)
    }
}

The Block::new constructor builds the Merkle tree at block creation time. Each transaction is serialised to bytes, and those byte slices become the leaves. The full node can then generate proofs for any transaction by index.

The light node stores only block headers, the minimal metadata needed for verification:

/// The minimal per-block metadata a light client needs: the block
/// identifier and the Merkle root of its transaction tree.
///
/// In a real blockchain (e.g. Bitcoin, Ethereum) a block header carries
/// additional fields (previous-block hash, timestamp, nonce, …); here we
/// keep only what is required for SPV-style verification.
struct BlockHeader {
    id: String,
    merkle_root: merkle_tree::Hash,
}

/// A light node stores only block headers, no full transaction lists.
struct LightNode {
    headers: Vec<BlockHeader>,
}

impl LightNode {
    fn from_full_node(full: &FullNode) -> Result<Self> {
        let headers = full
            .blocks
            .iter()
            .map(|b| {
                let merkle_root = *b
                    .tree
                    .get_root_hash()
                    .with_context(|| format!("block {} has no transactions", b.id))?;
                Ok(BlockHeader {
                    id: b.id.clone(),
                    merkle_root,
                })
            })
            .collect::<Result<Vec<_>>>()?;
        Ok(Self { headers })
    }

    /// Verify that `tx_data` is an authentic transaction included in `block_id`.
    ///
    /// Two checks are needed:
    /// 1. `hash_leaf(tx_data) == proof.leaf`: the transaction bytes match the
    ///    hash claimed by the proof.
    /// 2. `proof.verify(&merkle_root)`: the leaf hash chains up through
    ///    sibling hashes to the trusted Merkle root.
    ///
    /// Without the first check, a malicious full node could supply a valid
    /// proof for a *different* transaction and the light node would accept it.
    fn verify(&self, block_id: &str, tx_data: &[u8], proof: &Proof) -> bool {
        merkle_tree::hash_leaf(tx_data) == proof.leaf
            && self.headers
                .iter()
                .find(|h| h.id == block_id)
                .is_some_and(|h| proof.verify(&h.merkle_root))
    }
}

Notice the two-step verification in the light node, mirroring the pattern we will see in the file integrity example. The first check (hash_leaf(tx_data) == proof.leaf) ensures the transaction the client cares about actually matches the leaf hash claimed by the proof. The second check (proof.verify(&merkle_root)) ensures the leaf hash chains up to the trusted Merkle root. Both are necessary: without the first check, a malicious full node could supply a valid proof for a different transaction, one that genuinely exists in the block, and the light node would accept it, believing the wrong transaction was verified. Without the second check, there would be no chain of trust back to the known-good root.

All the cryptographic heavy lifting happens inside the proof’s compute_root method, which we already examined in detail. This computation does run on the light node itself, but since it only processes O(log n) hashes rather than the entire block, it remains very efficient.

Now let’s set up a scenario with two blocks of transactions and walk through three security scenarios:

fn main() -> Result<()> {
    let block_a = Block::new(
        "block-1",
        vec![
            Transaction { from: "0xAlice".into(), to: "0xBob".into(),   value: 50 },
            Transaction { from: "0xBob".into(),   to: "0xCarol".into(), value: 30 },
            Transaction { from: "0xCarol".into(), to: "0xDave".into(),  value: 10 },
            Transaction { from: "0xDave".into(),  to: "0xAlice".into(), value:  5 },
            Transaction { from: "0xAlice".into(), to: "0xDave".into(),  value: 20 },
        ],
    );
    let block_b = Block::new(
        "block-2",
        vec![
            Transaction { from: "0xEve".into(),   to: "0xAlice".into(), value: 100 },
            Transaction { from: "0xAlice".into(), to: "0xBob".into(),   value:  75 },
        ],
    );

    let full_node = FullNode {
        blocks: vec![block_a, block_b],
    };
    let light_node = LightNode::from_full_node(&full_node)?;
    // ...
}

Scenario 1: Successful verification

A client wants to confirm that the third transaction in block-1 (Carol → Dave, 10 units) was really included. The full node provides a Merkle proof; the light node verifies it:

let proof = full_node
    .proof_for("block-1", 2)
    .expect("transaction exists in block");

// The light node must know the transaction it wants to verify.
let tx_data = Transaction { from: "0xCarol".into(), to: "0xDave".into(), value: 10 }.to_bytes();
let verified = light_node.verify("block-1", &tx_data, &proof);
// verified == true

For a block with 5 transactions, the proof contains just 3 hashes (⌈log₂(5)⌉ = 3), and the count grows only logarithmically as blocks get larger. In Bitcoin, which can have several thousand transactions per block, a proof is still only around 10-12 hashes, about 320-384 bytes.

Scenario 2: Tamper detection

An attacker intercepts the proof and modifies one of the sibling hashes, hoping to trick the light node into accepting a fraudulent transaction:

let mut tampered_proof = proof.clone();
tampered_proof.steps[0].hash = merkle_tree::hash(b"tampered-data");

let tampered_verified = light_node.verify("block-1", &tx_data, &tampered_proof);
// tampered_verified == false

The tampered proof computes a completely different root hash. Even a single bit change in any proof step cascades through the hash chain and produces a root that does not match. This is the avalanche property of cryptographic hash functions working in our favour: a small change to the input produces a completely unpredictable change in the output.

Scenario 3: Cross-block rejection

A proof generated for one block cannot verify against another block’s root. Each block has its own Merkle tree, so proofs are scoped to the block they were generated from:

let block_b_proof = full_node
    .proof_for("block-2", 0)
    .expect("transaction exists in block-2");

let block_b_tx_data = Transaction { from: "0xEve".into(), to: "0xAlice".into(), value: 100 }.to_bytes();
let cross_verified = light_node.verify("block-1", &block_b_tx_data, &block_b_proof);
// cross_verified == false

let own_block_verified = light_node.verify("block-2", &block_b_tx_data, &block_b_proof);
// own_block_verified == true

This ensures that transactions cannot be “moved” between blocks by a malicious full node. A proof is bound to the specific Merkle root, and therefore to the specific block, it was generated from.

Beyond blockchains

While blockchains are perhaps the most well-known application, Merkle trees appear in many other systems where data integrity and efficient verification matter.

File distribution and software updates. Systems like IPFS split files into chunks and organize them into a Merkle DAG (a generalization of Merkle trees). The root hash serves as the content address. A client downloading chunks from untrusted peers can verify each chunk against the root without trusting any single peer. The same principle applies to package managers and firmware update systems: publish a signed root hash, and clients can verify individual files independently.

The pattern is the same one we saw with blockchains. A server holds the full file set and its Merkle tree. A client needs only the trusted root hash. For each file it downloads, it receives a compact proof. Here is a simplified version from the implementation’s file_integrity example:

/// A release: all files and the Merkle tree built from their contents.
struct Release {
    files: Vec<ReleaseFile>,
    tree: MerkleTree,
}

impl Release {
    fn new(files: Vec<ReleaseFile>) -> Self {
        let leaves: Vec<&[u8]> = files.iter().map(|f| f.content.as_slice()).collect();
        let tree = MerkleTree::build(&leaves);
        Self { files, tree }
    }

    fn manifest_root(&self) -> Result<merkle_tree::Hash> {
        self.tree
            .get_root_hash()
            .copied()
            .context("release has no files")
    }
}

/// The client: knows only the trusted manifest root.
struct Client {
    trusted_root: merkle_tree::Hash,
}

impl Client {
    /// Verify that `data` is an authentic file from the release.
    ///
    /// Two checks are needed:
    /// 1. hash_leaf(data) == proof.leaf: the downloaded bytes match the
    ///    hash claimed by the proof.
    /// 2. proof.verify(&trusted_root): the leaf hash chains up through
    ///    sibling hashes to the trusted root.
    fn verify(&self, data: &[u8], proof: &merkle_tree::Proof) -> bool {
        merkle_tree::hash_leaf(data) == proof.leaf && proof.verify(&self.trusted_root)
    }
}

Note the two-step verification in the client. The first check (hash_leaf(data) == proof.leaf) ensures the downloaded bytes match the hash claimed by the proof. The second check (proof.verify(&trusted_root)) ensures the leaf hash chains up to the trusted root. Both are necessary: without the first check, an attacker who controls the download channel could replace both the file and the proof’s leaf hash, and verification would still pass. Without the second check, there would be no chain of trust back to the known-good root.

A constrained device that only needs two out of six files in a release can verify just those two, without downloading the rest. The proof size is the same regardless of how many files are in the release.

Database anti-entropy. Distributed databases like Apache Cassandra and Amazon DynamoDB use Merkle trees to detect inconsistencies between replicas. Each node builds a Merkle tree over its local data. To check whether two replicas are in sync, they compare root hashes. If the roots differ, they walk down the tree together, comparing hashes level by level, to identify exactly which data ranges are out of sync. The implementation’s database_sync example demonstrates this pattern:

struct Replica {
    name: &'static str,
    records: Vec<Record>,
    tree: MerkleTree,
}

impl Replica {
    /// Quick check: do two replicas have identical data?
    fn is_in_sync_with(&self, other: &Replica) -> Result<bool> {
        Ok(self.root()? == other.root()?)
    }

    /// Walk the leaf level to find which record indices differ.
    fn find_divergent_keys(&self, other: &Replica) -> Result<Vec<usize>> {
        let common = self.records.len().min(other.records.len());
        let mut divergent = Vec::new();
        for i in 0..common {
            let hash_a = self.tree.get_leaf_hash(i).context("invalid leaf index")?;
            let hash_b = other.tree.get_leaf_hash(i).context("invalid leaf index")?;
            if hash_a != hash_b {
                divergent.push(i);
            }
        }
        for i in common..self.records.len().max(other.records.len()) {
            divergent.push(i);
        }
        Ok(divergent)
    }

    /// Repair by copying divergent records, then rebuilding the tree.
    fn repair_from(&mut self, source: &Replica, indices: &[usize]) {
        for &idx in indices {
            self.records[idx] = source.records[idx].clone();
        }
        let leaves: Vec<Vec<u8>> = self.records.iter().map(|r| r.to_bytes()).collect();
        self.tree = MerkleTree::build(&leaves);
    }
}

The first step, comparing root hashes, already saves work: if the roots match, the replicas are identical and no further comparison is needed. When they differ, the example above takes a simplified approach and compares leaf hashes directly. This is fine for small datasets, but in a real system with millions of records it would defeat the purpose: you would be comparing every record’s hash even if only a handful differ. Production systems like Cassandra instead walk the tree level by level: start at the roots, compare children, and recurse only into subtrees whose hashes differ, skipping entire subtrees whose roots match. If two replicas with a million records have only 10 divergent records, the traversal touches roughly O(10 × log n) nodes instead of all n. Only those 10 records need to be transferred. This is the same logarithmic principle that makes inclusion proofs efficient, applied to synchronization rather than verification.
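That top-down walk can be sketched with plain integers standing in for hashes and trees stored root-first, as in this article’s levels layout (a real anti-entropy protocol like Cassandra’s exchanges subtree hashes over the network, which this toy in-memory version ignores):

```rust
// Recursively compare two complete trees stored as levels (root at index 0,
// leaves at the last index), descending only into subtrees whose hashes
// differ and collecting the indices of divergent leaves.
fn divergent_leaves(
    a: &[Vec<u64>],
    b: &[Vec<u64>],
    depth: usize,
    pos: usize,
    out: &mut Vec<usize>,
) {
    if a[depth][pos] == b[depth][pos] {
        return; // identical subtree: skip it entirely
    }
    if depth + 1 == a.len() {
        out.push(pos); // a differing leaf
        return;
    }
    divergent_leaves(a, b, depth + 1, 2 * pos, out);
    divergent_leaves(a, b, depth + 1, 2 * pos + 1, out);
}

fn main() {
    // Two 4-leaf trees differing only in leaf 2 (toy numbers as hashes):
    // the left subtree's hashes match, so it is never descended into.
    let a = vec![vec![1], vec![10, 20], vec![100, 101, 102, 103]];
    let b = vec![vec![2], vec![10, 21], vec![100, 101, 999, 103]];
    let mut out = Vec::new();
    divergent_leaves(&a, &b, 0, 0, &mut out);
    assert_eq!(out, vec![2]);
}
```

With few divergent records, most recursive calls terminate at the first comparison, which is where the O(d × log n) traversal cost comes from.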

Certificate Transparency. Google’s Certificate Transparency framework uses Merkle trees to create an append-only log of TLS certificates. Monitors can efficiently verify that a certificate was logged, and auditors can verify that the log is consistent (no entries were removed or altered). This is the same RFC 6962 that standardized domain separation.

Summary

Merkle trees solve a fundamental problem in distributed systems: how to verify that a piece of data belongs to a larger dataset without possessing the entire dataset. The key insights are:

  1. A single root hash fingerprints the entire dataset: changing any item, anywhere, changes the root.
  2. Organizing item hashes into a binary tree shrinks membership proofs from n hashes to ⌈log₂(n)⌉.
  3. Verification is a simple fold from leaf to root: the verifier needs only the trusted root hash, not the full data and not any trust in the prover.

These properties explain why Merkle trees appear everywhere: Bitcoin and Ethereum for transaction verification, IPFS for content addressing, Certificate Transparency for auditing TLS certificates, Cassandra and DynamoDB for replica synchronization, Git for content-addressable storage, and many more.

Ralph Merkle patented the hash tree in 1979. Nearly fifty years later, it remains one of the most practically important ideas in computer science: a simple construction with profound consequences for how we build systems that operate without mutual trust.

The full Rust implementation discussed in this post, including the blockchain verification, file integrity, and database sync examples, is available at github.com/amoilanen/merkle-tree.

A note on production readiness. The implementation in this post is designed for demonstration and educational purposes. It prioritizes clarity over production hardening. While it gets the important things right - domain-separated hashing per RFC 6962, correct proof generation and verification, and a comprehensive test suite - it has limitations you should be aware of before using it in real systems: generate_proof locates a leaf with a linear scan (O(n) per proof), the tree is immutable and must be rebuilt from scratch after any change, and all levels are held in memory.

For production use cases, consider established and audited libraries such as rs_merkle. The implementation here is best suited for learning how Merkle trees work and experimenting with the concepts discussed in this post.