On scaling data structures - psyphi.net blog

http://www.flickr.com/photos/paraflyer/1063341049/sizes/m/

One concept, or data-structure paradigm if you will, which I’ve seen many, many times is the tree. Whatever sort of tree it is – binary or otherwise, it’s a hierarchy of some sort.

Tree structures are very common in every day life. Unfortunately in most programming languages they’re pretty clunky to code up. This aspect, coupled with they’re inherent inflexibility is why I think they don’t belong as a core part of a system which is required to scale horizontally.

The reason I believe this to be the case is because history, experience and current best-practices dictate that to scale well you define your scale-units to be entities which are as distinct from one-another as possible. This separation allows the system to treat them largely as separate units in an embarrassingly-parallel fashion, especially for functions like the popular MapReduce.

Tree structures simply don’t fit in to this paradigm. With any sort of hierarchy you end up binding each entity tightly to its parent, which in turn is bound tightly to each of its children. Think about what needs to happen if your tree is spread across many servers and you want to change a property of the tree, like the name of a node, or worse, delete a node and shuffle some of its children around. The complexity of this sort of manoeuvre is why trees don’t belong as primary keys of scalable systems. It’s also one of the reasons why document-store-databases like CouchDB are linear, key:value stores.