Hello.
A distributed hash table is, as you might think, a hash table that is 'broken up' (distributed) across a number of servers. The primary reasons for this distribution are:
- keeping the individual pieces small enough to, for example, reside in memory for performance,
- spreading the data (and its load) across multiple instances, for performance and resiliency,
- keeping multiple copies of the data on different instances for redundancy.
Breaking up the hash table relies on two properties of a good hash function: the bits of the hash value are effectively random and uniformly distributed across the hash's range, and for any given datum the hash value is reproducible. By choosing some fixed set of n bits in the hash, we can use the binary value of those bits to select one of 2^n locations, each holding a 1/(2^n) fraction of the hash table.
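Here is a minimal sketch of that idea. SHA-256 is used purely for illustration (any good hash with uniformly distributed bits would do), and the helper name `shard_of` is made up for this example:

```python
import hashlib

def shard_of(key: bytes, n_bits: int) -> int:
    """Pick a shard from the top n_bits of a reproducible hash of the key."""
    # SHA-256 produces a 256-bit value whose bits are, for our purposes,
    # uniformly distributed; the hash of a given key never changes.
    h = int.from_bytes(hashlib.sha256(key).digest(), "big")
    return h >> (256 - n_bits)  # an integer in [0, 2**n_bits)
```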
For example, let's use the top 4 bits of the hash, giving us a value between 0 and 15; this is our 'shard' identifier. Now, instead of putting the whole table on one machine, we spread the records out across 16 machines by sending all of the records whose top 4 bits are 0000 (shard 0) to machine 0, 0001 to machine 1, and so on. Each machine gets approximately 1/16th of the table (because, on average, each shard has about the same number of records).
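A quick way to convince yourself of the "approximately 1/16th" claim is to hash a pile of synthetic keys and count how many land in each shard. This is just a sketch, reusing the `shard_of` helper from above:

```python
from collections import Counter

# Count how many of 100,000 made-up keys land in each of the 16 shards,
# using the top 4 bits of the hash as the shard identifier.
counts = Counter(shard_of(f"user-{i}".encode(), 4) for i in range(100_000))
for shard in range(16):
    print(f"shard {shard:2d} -> machine {shard:2d}: {counts[shard]} records")
```

The per-shard counts should all come out close to 100,000 / 16 ≈ 6,250, which is what "evenly distributed" means in practice here.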
The scheme outlined above evenly distributes the data across 2^n machines. If one of those machines goes down, you've lost that fraction of the table. With replication, though, we can avoid this: for example, we can put shards 15, 0, and 1 on machine 0, shards 0, 1, and 2 on machine 1, and so on. This ensures there are 3 copies of each fraction of the table, with each copy on a different machine. If we want shard 3 and machine 3 is down, we can go to machine 2 or 4 to get what we want.
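Here is a small sketch of that replica layout and a failover read. The names `machines_for_shard` and `read_from` are just illustrative; the layout follows the pattern above, where machine m holds its own shard plus its two neighbours:

```python
N_SHARDS = 16

def machines_for_shard(shard: int) -> list[int]:
    """Machines holding a given shard under the layout above: machine m
    stores shards (m-1) % 16, m, and (m+1) % 16, so shard s can be read
    from machines s, (s-1) % 16, and (s+1) % 16 (its own machine first)."""
    return [shard, (shard - 1) % N_SHARDS, (shard + 1) % N_SHARDS]

def read_from(shard: int, down: set[int]) -> int:
    """Return the first reachable machine holding the shard (toy failover)."""
    for m in machines_for_shard(shard):
        if m not in down:
            return m
    raise RuntimeError(f"all replicas of shard {shard} are unavailable")

# Shard 3 normally lives on machine 3 (plus machines 2 and 4); with
# machine 3 down, the read falls back to machine 2.
print(read_from(3, down={3}))   # prints 2
```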