Don’t trust your logs! Implementing a Merkle tree for an Immutable Verifiable Log (in Go)
TLDR; RFC#6962 implementation of merkle tree
Recently I’d been reading on the application of immutable databases for tracking changes in sensitive data, and for auditing purposes. That got me thinking on how they work underneath. This blog is an attempt to understand the data structures/cryptographic algorithms used in building such a system.
Traditional database records are mutable, and therefore there is no way to know for sure if your data has been compromised. Tamper-evident logs solve this problem. Tamper-evident logs means you can cryptographically prove that the data hasn’t been unexpectedly changed. Knowing the provenance, integrity and sequence your data was created means you’ll be able to monitor third-party ecosystems and simplify regulatory compliance. You can trust the data you rely on.
Security Audits
This has a variety of use cases in the industry we work in. Permissioned systems have a flaw, an assumption that they are secure. Many database or log systems run on untrusted servers or are subject to malicious attacks from insiders and, therefore, vulnerable to tampering. Organizations depend on widely-used database systems to store their data and try to limit the kind of operations users can perform on them by using authentication, hardening rules and access control mechanisms.
Permissionless systems (like Blockhain) implement zero trust from grounds up. Suppose we have a continuous sequence of records: access logs, event logs, transaction logs, secrets, configuration files or any sequenceable data. Such sequence can be materialized in one or multiple log files and therefore provide an audit trail to understand the activity of a system. And so, it may be subject to queries by users who want to diagnose problems. The evidence one seeks in these sorts of diagnostic investigations often takes the form of statements of existence and order.
A tamper-evident log stores an accurate, immutable and verifiable history of activity. This is an example of a verifiable system. You could use them to track credits and debits in banking transactions, access logs to sensitive healthcare records, cryptographic hashes of software packages, compliance artifacts of regulated activities, or modifications to a document.
Use cases of Tamper-evident Logs
Discourage Insider Threats
A tamper-evident log makes it impossible for a malicious insider to cover their tracks. A tamper-evident log discourages malicious behaviour by increasing the chance of discovery.
Simplify regulatory compliance
A tamper-evident log keeps you in control of your audit artifacts and gives auditors confidence. Regulated industries require companies to collect and retain many types of compliance records. Auditors need to verify the integrity and nonrepudiation of those records.
By using a tamper-evident to store compliance records, you can keep them in one place and simplify presenting them to an auditor. You can cryptographically prove they haven’t been tampered with.
A tamper-evident log is a more efficient way of presenting compliance records to an auditor who can easily verify their history and integrity.
Monitor third-party ecosystems
A tamper-evident log enables multiple parties to monitor each others’ actions. Where organisations can be required to publish data about their actions, a tamper-evident log can hold a permanent record of this data. Certificate Transparency discourages misbehaviour of Certificate Authorities by making their actions public. Tamper-evident logs can frustrate software supply chain attacks by providing a shared, global view of the content of packages.
Applications of Tamper-evident Logs
The best example of an application of an immutable log is how the Public Go Module Ecosystem. Sigstore is another such example.
If you’ve been in the Golang community for long, in the initial days, there were so many proxy servers available to download modules from, that one would not know which one to trust or not. What if some malicious snippet has been inserted? An attacker able to intercept the connection to the proxy server or github.com
(or an attacker able to break into one of those systems, or a malicious module author) would be able to cause goget
to download different code tomorrow, and goget
would not notice. There was no way to verify until the introduction of the Go checksum database.
Module Authentication with go.sum
The Certificate Transparency project is based on a data structure called a transparent log. The transparent log is hosted on a server and made accessible to clients for random access, but clients are still able to verify that a particular log record really is in the log and also that the server never removes any log record from the log. Separately, third-party auditors can iterate over the log checking that the entries themselves are accurate. These two properties combined mean that a client can use records from the log, confident that those records will remain available in the log for auditors to double-check and report invalid or suspicious entries. Clients and auditors can also compare observations to ensure that the server is showing the same data to everyone involved.
Checksum Database
The Go checksum database runs at https://sum.golang.org/
and serves the following endpoints:
/latest
will serve a signed tree size and hash for the latest log./lookup/M@V
will serve the log record number for the entry about module M version V, followed by the data for the record (that is, thego.sum
lines for module M version V) and a signed tree hash for a tree that contains the record. If the module version is not yet recorded in the log, the notary will try to fetch it before replying. Note that the data should never be used without first authenticating it against the signed tree hash and authenticating the signed tree hash against the client's timeline of signed tree hashes./tile/H/L/K[.p/W]
will serve a log tile. The optional.p/W
suffix indicates a partial log tile with onlyW
hashes. Clients must fall back to fetching the full tile if a partial tile is not found. The record data for the leaf hashes in/tile/H/0/K[.p/W]
are served as/tile/H/data/K[.p/W]
(with a literaldata
path element).
The Go team at Google runs the Go checksum database as a service to the Go ecosystem, similar to running godoc.org
and golang.org
.
Proxying a Checksum Database
A module proxy can also proxy requests to the checksum database. The general proxy URL form is <proxyURL>/sumdb/<databaseURL>
. If GOPROXY=https://proxy.site
then the latest signed tree would be fetched using https://proxy.site/sumdb/sum.golang.org/latest
. Including the full database URL allows a transition to a new database log, such as sum.golang.org/v2
.
Before accessing any checksum database URL using a proxy, the proxy client should first fetch <proxyURL>/sumdb/<sumdb-name>/supported
. If that request returns a successful (HTTP 200) response, then the proxy supports proxying checksum database requests. In that case, the client should use the proxied access method only, never falling back to a direct connection to the database. If the /sumdb/<sumdb-name>/supported
check fails with a “not found” (HTTP 404) or “gone” (HTTP 410) response, the proxy is unwilling to proxy the checksum database, and the client should connect directly to the database. Any other response is treated as the database being unavailable.
A corporate proxy may want to ensure that clients never make any direct database connections. The optional /sumdb/supported
endpoint, along with proxying actual database requests, lets such a proxy ensure that a go
command using the proxy never makes a direct connection to sum.golang.org. But simpler proxies may wish to focus on serving only modules and not checksum data—in particular, module-only proxies can be served from entirely static file systems, with no special infrastructure at all. Such proxies can respond with an HTTP 404 or HTTP 410 to the /sumdb/supported
endpoint, so that clients will connect to the database directly.
go
command client
The go
command verifies the log as it uses it, ensuring that every record it reads is actually in the log and that no observed log ever drops a record from an earlier observed log.
The go
command refers to $GOSUMDB
to find the name and public key of the Go checksum database. That variable defaults to the sum.golang.org
server.
The go
command then caches the latest signed tree size and tree hash in $GOPATH/pkg/sumdb/<sumdb-name>/latest
. It will cache lookup results and tiles in $GOPATH/pkg/mod/download/cache/sumdb/<sumdb-name>/lookup/path@version
and $GOPATH/pkg/mod/download/cache/sumdb/<sumdb-name>/tile/H/L/K[.W]
. This way, go clean -modcache
deletes cached lookup results and tiles but not the latest signed tree hash, which should be preserved for detection of timeline inconsistency. No go
command (only a manual rm -rf $GOPATH/pkg
) will wipe out the memory of the latest observed tree size and hash. If the go
command ever does observe a pair of inconsistent signed tree sizes and hashes, it will complain loudly on standard error and fail the build.
Data structures underneath the hood
Well, that was the point of this article, isn’t it? I was checking out the Certificate Transparency project in detail and then stumbled upon the paper Verifiable data structures. A verifiable data structure is a class of data structure that lets people efficiently agree, with cryptographic certainty, that the data contained within it is correct.I read through the RFC for Certificate Transparency, to check for the data structures used underneath, and Merkle Trees was the answer.Merkle Trees are the most famous of these and have been used for decades because they can enable efficient verification that a particular piece of data is included among many records — as a result they also form the basis of most blockchains.
Merkle tree
A binary tree that stores values at the lowest level of the tree and uses cryptographic hash functions. While leaves compute the hash of their own attributes, parents derive the hash of their children’s hashes concatenated left-to-right. Therefore the hash rooted at a particular subtree is recursively dependent on all its descendants, effectively serving as a succinct summary for that subtree.
Membership proof
A Merkle tree can prove values to be present by constructing efficient membership proofs. Each proof must include a Merkle audit path, and it is verified by recomputing all hashes, bottom up, from the leaf that the proof concerns towards the root. The proof is believed to be valid if the recomputed root hash matches that of the original Merkle tree, but to be convincing it requires a trustworthy root (e.g., signed by a trusted party or published periodically in a newspaper).
Merkle audit path
A Merkle audit path for a leaf is the list of all additional nodes in the Merkle tree required to compute the Merkle Tree Hash for that tree. If the root computed from the audit path matches the true root, then the audit path is proof that the leaf exists in the tree.
RFC Deep Dive
The Tree
A simple array of byte entries
type (
// Path is a list of nodes required for proving inclusion or consistency.
Path [][sha256.Size]byte
// Tree implements a general purpose Merkle tree.
Tree struct {
entries [][]byte
}
)
The Merkle Hash
Logs use a binary Merkle Hash Tree for efficient auditing. The hashing algorithm is SHA-256 [FIPS.180–4] (note that this is fixed for this experiment, but it is anticipated that each log would be able to specify a hash algorithm). The input to the Merkle Tree Hash is a list of data entries; these entries will be hashed to form the leaves of the Merkle Hash Tree. The output is a single 32-byte Merkle Tree Hash. Given an ordered list of n inputs, D[n] = {d(0), d(1), …, d(n-1)}, the Merkle Tree Hash (MTH) is thus defined as follows:
func (t *Tree) hash(D [][]byte) [sha256.Size]byte {
n := uint64(len(D))
/*
The hash of an empty list is the hash of an empty string:
MTH({}) = SHA-256().
*/
if n == 0 {
return sha256.Sum256(nil)
}
/*
The hash of a list with one entry (also known as a leaf hash) is:
MTH({d(0)}) = SHA-256(0x00 || d(0)).
*/
if n == 1 {
c := []byte{LeafPrefix}
c = append(c, D[0]...)
return sha256.Sum256(c)
}
/*
For n > 1, let k be the largest power of two smaller than n (i.e.,
k < n <= 2k). The Merkle Tree Hash of an n-element list D[n] is then
defined recursively as
MTH(D[n]) = SHA-256(0x01 || MTH(D[0:k]) || MTH(D[k:n])),
where || is concatenation and D[k1:k2] denotes the list {d(k1),
d(k1+1),..., d(k2-1)} of length (k2 - k1). (Note that the hash
calculations for leaves and nodes differ. This domain separation is
required to give second preimage resistance.)
Note that we do not require the length of the input list to be a
power of two. The resulting Merkle Tree may thus not be balanced;
however, its shape is uniquely determined by the number of leaves.
(Note: This Merkle Tree is essentially the same as the history tree
[CrosbyWallach] proposal, except our definition handles non-full
trees differently.)
*/
k := largestPowerOf2LessThan(n)
c := []byte{NodePrefix}
x := t.hash(D[0:k])
c = append(c, x[:]...)
x = t.hash(D[k:n])
c = append(c, x[:]...)
return sha256.Sum256(c)
}
Merkle Audit Paths (or Proofs)
A Merkle audit path for a leaf in a Merkle Hash Tree is the shortest list of additional nodes in the Merkle Tree required to compute the Merkle Tree Hash for that tree. Each node in the tree is either a leaf node or is computed from the two nodes immediately below it (i.e., towards the leaves). At each step up the tree (towards the root), a node from the audit path is combined with the node computed so far. In other words, the audit path consists of the list of missing nodes required to compute the nodes leading from a leaf to the root of the tree. If the root computed from the audit path matches the true root, then the audit path is proof that the leaf exists in the tree.
func (t *Tree) path(m uint64, D [][]byte) Path {
/*
The path for the single leaf in a tree with a one-element input list
D[1] = {d(0)} is empty:
PATH(0, {d(0)}) = {}
*/
n := uint64(len(D))
p := make(Path, 0)
if n == 1 && m == 0 {
return p
}
/*
For n > 1, let k be the largest power of two smaller than n. The
path for the (m+1)th element d(m) in a list of n > m elements is then
defined recursively as
PATH(m, D[n]) = PATH(m, D[0:k]) : MTH(D[k:n]) for m < k; and
PATH(m, D[n]) = PATH(m - k, D[k:n]) : MTH(D[0:k]) for m >= k,
where : is concatenation of lists and D[k1:k2] denotes the length
(k2 - k1) list {d(k1), d(k1+1),..., d(k2-1)} as before.
*/
k := largestPowerOf2LessThan(n)
if m < k {
p = append(p, t.path(m, D[0:k])...)
p = append(p, t.hash(D[k:n]))
} else {
p = append(p, t.path(m-k, D[k:n])...)
p = append(p, t.hash(D[0:k]))
}
return p
}
Merkle Consistency Proofs
Merkle consistency proofs prove the append-only property of the tree. A Merkle consistency proof for a Merkle Tree Hash MTH(D[n]) and a previously advertised hash MTH(D[0:m]) of the first m leaves, m <= n, is the list of nodes in the Merkle Tree required to verify that the first m inputs D[0:m] are equal in both trees. Thus, a consistency proof must contain a set of intermediate nodes (i.e., commitments to inputs) sufficient to verify MTH(D[n]), such that (a subset of) the same nodes can be used to verify MTH(D[0:m]). We define an algorithm that outputs the (unique) minimal consistency proof.
func (t *Tree) proof(m uint64, D [][]byte) Path {
/*
Given an ordered list of n inputs to the tree, D[n] = {d(0), ...,
d(n-1)}, the Merkle consistency proof PROOF(m, D[n]) for a previous
Merkle Tree Hash MTH(D[0:m]), 0 < m < n, is defined as:
PROOF(m, D[n]) = SUBPROOF(m, D[n], true)
The subproof for m = n is empty if m is the value for which PROOF was
originally requested (meaning that the subtree Merkle Tree Hash
MTH(D[0:m]) is known):
SUBPROOF(m, D[m], true) = {}
*/
n := uint64(len(D))
if 0 < m && m < n {
return t.subProof(m, D, true)
}
return nil
}
func (t *Tree) subProof(m uint64, D [][]byte, b bool) Path {
/*
The subproof for m = n is the Merkle Tree Hash committing inputs
D[0:m]; otherwise:
SUBPROOF(m, D[m], false) = {MTH(D[m])}
For m < n, let k be the largest power of two smaller than n. The
subproof is then defined recursively.
If m <= k, the right subtree entries D[k:n] only exist in the current
tree. We prove that the left subtree entries D[0:k] are consistent
and add a commitment to D[k:n]:
SUBPROOF(m, D[n], b) = SUBPROOF(m, D[0:k], b) : MTH(D[k:n])
If m > k, the left subtree entries D[0:k] are identical in both
trees. We prove that the right subtree entries D[k:n] are consistent
and add a commitment to D[0:k].
SUBPROOF(m, D[n], b) = SUBPROOF(m - k, D[k:n], false) : MTH(D[0:k])
Here, : is a concatenation of lists, and D[k1:k2] denotes the length
(k2 - k1) list {d(k1), d(k1+1),..., d(k2-1)} as before.
The number of nodes in the resulting proof is bounded above by
ceil(log2(n)) + 1.
*/
path := make(Path, 0)
n := uint64(len(D))
if m == n {
if !b {
path = append(path, t.hash(D))
}
return path
}
if m < n {
k := largestPowerOf2LessThan(n)
if m <= k {
path = append(path, t.subProof(m, D[0:k], b)...)
path = append(path, t.hash(D[k:n]))
} else {
path = append(path, t.subProof(m-k, D[k:n], false)...)
path = append(path, t.hash(D[0:k]))
}
}
return path
}
Example
The binary Merkle Tree with 7 leaves:
hash
/ \
/ \
/ \
/ \
/ \
k l
/ \ / \
/ \ / \
/ \ / \
g h i j
/ \ / \ / \ |
a b c d e f d6
| | | | | |
d0 d1 d2 d3 d4 d5
The audit path for d0 is [b, h, l].
The audit path for d3 is [c, g, l].
The audit path for d4 is [f, j, k].
The audit path for d6 is [i, k].
The same tree, built incrementally in four steps:
hash0 hash1=k
/ \ / \
/ \ / \
/ \ / \
g c g h
/ \ | / \ / \
a b d2 a b c d
| | | | | |
d0 d1 d0 d1 d2 d3
hash2 hash
/ \ / \
/ \ / \
/ \ / \
/ \ / \
/ \ / \
k i k l
/ \ / \ / \ / \
/ \ e f / \ / \
/ \ | | / \ / \
g h d4 d5 g h i j
/ \ / \ / \ / \ / \ |
a b c d a b c d e f d6
| | | | | | | | | |
d0 d1 d2 d3 d0 d1 d2 d3 d4 d5
The consistency proof between hash0 and hash is PROOF(3, D[7]) = [c,
d, g, l]. c, g are used to verify hash0, and d, l are additionally
used to show hash is consistent with hash0.
The consistency proof between hash1 and hash is PROOF(4, D[7]) = [l].
hash can be verified using hash1=k and l.
The consistency proof between hash2 and hash is PROOF(6, D[7]) = [i,
j, k]. k, i are used to verify hash2, and j is additionally used to
show hash is consistent with hash2.
You can find the entire code here
Published originally on my blog
https://aly.arriqaaq.com/merkle-tree-and-verifiable-data-structures/
References
- https://security.googleblog.com/2021/07/verifiable-design-in-modern-systems.html#:~:text=A verifiable data structure is,contained within it is correct.
- https://transparency.dev/verifiable-data-structures/
- https://go.googlesource.com/proposal/+/master/design/25530-sumdb.md
- http://transparency.dev/application/strengthen-discovery-of-encryption-keys/
- https://cs.brown.edu/research/pubs/pdfs/2003/Tamassia-2003-ADS.pdf
- https://www.links.org/files/RevocationTransparency.pdf
- https://github.com/BBVA/qed/blob/dc74bcf1b3035493afb95ec046d1f837972bac02/docs/source/internals/glossary.rst