In this post I am going to talk about hash functions, hash tables, Merkle Trees and data structure choices for WaDeD. Feel free to skip any section.

### What’s a hash table?

A hash table is a data structure that provides an associative array: it stores *values* in *buckets*, each bucket being uniquely mapped to an *index*, or *key* of fixed size. It does so by using a hash function (more on that in a minute) to *compute* the key from the original data. Hash tables are usually very efficient in terms of access times, and are one of the most common data structures.

We usually want to be able to insert values into the table, with at most one value linked to a key: the table must not be ambiguous. Since we use an algorithm to compute the key from the original value and that there usually are more possible values than possible keys, it may happen that the algorithm produces the same result for two different values: it will try to map two different values to the same key. This is called a *collision*, and any self-respecting hash table must take these into account and define a policy of dealing with hash collisions. These can be not storing the new value, creating a linked chain from the original bucket, storing the value in a neighbouring bucket, etc., depending on the way the hash table is effectively implemented and what behaviour we expect from it.

### What’s a hash function?

Hash tables use hash functions to compute the keys it will associate to its values. A hash function is any algorithm that takes input from a data set, and maps it to a smaller data set of which all elements have the same size. The operator “Modulo n” for a given integer n, or truncation to the n last bits of data (which is really the modulo 2^{n+1} operation), or the constant function equal to four are simple examples of hash functions.

Cryptographic hash functions have additional properties. The goal is to provide security about the original value. We want it to be highly discontinuous: small changes of an entry should greatly change the output value. We call this property the “avalanche effect”.

In order to be secure, a cryptographic hash function should also be unpredictable (one should not be able to design an entry that will have a given hash), irreversible (one should not be able to retrieve the original value from its hash) and, in practise, should be collision-free. This is, obviously, a theoretical impossibility; however all outdated cryptographic hash functions have been depreciated because ways of designing collisions in practise have been found. It has been shown theoretically that we could design collisions in the used cryptographic hash function, the SHA-1 algorithm, so its days could be over soon whenever somebody figures out how to go from theory to practise.

Because of their good properties, cryptographic hash functions have a lot of applications. They are commonly used to encrypt passwords: storing a password is unsafe and dangerous, but since it cannot be retrieved from its hash, it can be safely stored on a distant server. When trying to connect to the service, the input password is hashed and compared to the stored hash, instead of direct password comparison. Because in practise cryptographic hash functions can be considered as injectives, we can use their hash to check for integrity of a file. This is why Linux distributions always provide hashes of the installation files: after the download, the user can run a hash algorithm on the file. Any corruption, accidental or intentional, will result in a different hash and the file can be discarded. Finally, we can use cryptographic hashes to “fingerprint” a sequence: it is a simple, low cost way of giving a unique identifier to something. Git uses the SHA-1 algorithm to identify commits. If we have a big hash table, using a cryptographic hash function is a good way to minimize collisions.

### What’s a hash tree?

When exchanging files over a network, integrity checks are crucial: we want to protect our computer from installing bad software, and want to be able to ask for data again if it was corrupted during transmission. Therefore there is a need to do quick, efficient integrity checks. We already saw that cryptographic hash functions are good tools to do so, but the way we use them can make a big difference.

A hash tree is a structure that can be used to rapidly check differences between parts of a file. It is a tree whose leaf nodes are data, packets we want to store independently, and whose internal nodes are hashes of the concatenation of their son nodes.

Say I have two copies of the same file, stored as hash trees (the files are split into small chunks which are stored into the leafs) and I want to check whether they are really copies. If the top hash match, then I know that the whole tree does, and therefore the leafs do: the files are the same. If they differ, not only do I know that the files differ, but by going down the tree I can identify exactly which parts of the files are identical and which ones are not.

This is exactly how bittorrent works: when downloading a file with bittorrent, the user first gets the top hash of the file through a secure source. Then she can get the tree through any user, since having a valid top hash ensures the whole tree is valid. Then she can freely communicate with other users on the network who pretend to have parts of the file. As she downloads the packets, she compares the tree that builds up with the copy of the tree she first got, knowing when to discard false or damaged packets. When she has all the packets, the hash tree of her file matches the one she got in the first place, and she knows she successfully downloaded the file.

### Comes in WaDeD

So what do we do with all this?

In WaDeD, modules have to store packets to be transmitted on the network. We want two modules that meet to synchronize all their messages, so that each one can retransmit packets that were held by the other. Therefore we want to be able to quickly know what the symmetrical difference between the two lists of messages is. A hash tree seems like a good idea for this: if two modules already have a lot of packets in common, we can expect them to have huge subtrees in common and minimize the number of comparisons.

This is harder than it first appears. In order for the hash trees to work, two users who have packets in common have to store them at the same place in the tree: otherwise the similarities will not appear when comparing hashes. In a worst case scenario, two modules could have all their messages in common and have different top hash, provided the order of only one stored message differs. This is because the usual application case of hash trees is checking integrity of **fixed files**, not to compute symmetrical differences of two sets whose content and size. If we want to use hash trees, we need two users who carry the same files to store them in the same place of the tree, independently of any other factor.

An acceptable solution to this problem is to consider the leaf layer of the tree as a hash table: the leafs are buckets in which we place packets according to their hash value. This gives us certainty that two almost similar sets of packets will produce almost similar trees, at some costs:

- The number of packets that we can store is limited by the number of buckets (the tree cannot adapt its size). This is not really an added cost, since we already have strong memory constraints (limiting the number of stored packets) and that we want to avoid dynamic memory allocation (forcing us to use buckets anyway).
- Hash tables, especially fixed size ones, always induce a risk of collision. It is inversely proportional to the size of the table.

Is it worth it?

Two modules that meet will always have to compare their packets lists, so that is a common operation. We intend to store at most around a thousand packets, so that is up to a thousand comparisons at every meeting if they want to compare their lists directly. That means a thousand hashes over the radio. Radio communication is slow and the highest energy cost of the whole application, so that is something we absolutely want to avoid.

An amelioration is to have a “top hash” representing the state of a list of messages. This cuts the cost of the 1k radio communications when the lists are the same, but if they differ, we still have to exchange the whole list of hashes, which remains costly.

A hash tree allows us to minimize the amount of data we will exchange on the radio, at the cost of a few more CPU cycles and a 1/1000 collisions chance, which will cost us a few more CPU cycles to deal with, so this seems more than reasonable. A hash tree it will be, then.

In another post I will explain which choices we have to make concerning the effective implementation of the hash tree in memory.

Nice post !

Thanks for sharing 😉

[…] What’s a hash tree? […]