Merkle hashing

Merkle hashing guarantees data integrity. It is used extensively in numerous high-profile projects: git, BitTorrent, BitCoin and others. Bringing these integrity guarantees to RON introduces some new challenges.

The first one is efficiency. RON ops are very fine grained pieces of data. A RON op is likely to be smaller than its hash. Hence, the overhead of computing hashes becomes a part of the bigger issue of metadata overhead.

The second challenge is the multitude of RON serializations. A hash of a frame serialized with a text mapper would be different from the hash of the same frame in CBOR, etc. Even within one serialization, there are various format options, such as whitespace and compression. For that reason, RON Merkle hashing is defined on the nominal format. It primarily uses uncompressed 128-bit atoms, removing optional elements.

Conveniently, the nominal format roughly matches the in-memory layout of a parsed op in many compiled languages (e.g. C++, go; partially Java). For this convenience to hold true, we assume the nominal format with big-endian 64-bit words here. That likely differs from the de-facto layout.

This spec defines the data which has to be fed to a hash function to produce the op hash. It also defines the reuse of earlier hashes, thus defining a Merkle tree that matches the causal structure of RDT ops.

Key concerns of the algorithm are:

Specifier (op key)

The op’s id and reference form the Merkle DAG structure of the op graph. The hashing function is fed two UUID+hash pairs, as produced by earlier ops via the same function:

This part has a size of (16+HASHSIZE)*2 bytes, e.g. 96 bytes for any 32-byte hash function.


Value atoms are fed in the nominal representation: 128 bits per atom, values parsed, origin ranges cleared, all big-endian. Effectively, the second word (origin) only contains the flags. Supplying the flags is necessary to prevent mocking, e.g. picking an integer that matches some float in its bit layout, etc. Effectively, UUIDs are fed as-is (two big-endian words). Value UUIDs don’t form Merkle links, unless explicitly accompanied by hashes.

The special case is strings, because:

  1. They are arbitrary-length.
  2. Unicode string serialization has some degrees of freedom, esp. the text representation that has escapes.
  3. Mocking.

Thus, strings are fed as 128-bit atoms followed by arbitrary-length UTF-8 strings: