Network distributed databases

Most database designs are tied to a single master server, which might get replicated out to various slaves, but what we’re really after is a database distributed and replicated across the network with no distinguished master server.

OSPF already does this.

OSPF builds routing tables from a link state database. What’s of interest here is how the database gets distributed among the OSPF routers, because none of them is a preferred master server, and because, as a routing protocol, OSPF is specifically designed to handle link failures, and thus a partitioned network where some servers can’t talk to the others but all must continue working.

OSPF does this by first associating with each database item an owner (a host, not a user). Only the owner can modify the item. Obviously, the host creating the item in the first place must be its owner (if creation is seen as a special case of modification) so there’s never any question about who the owner is. Since no other host can modify the item, there’s never any issue of a race condition as such. Any modification to the item takes the form of a completely new copy of the item, and is cleanly marked with a monotonically increasing sequence number. Thus there is never any question of how to apply a set of modifications to an item. The version with the highest sequence number is current, and that’s it.
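The ownership-plus-sequence-number rule can be sketched in a few lines. This is a minimal illustration, not OSPF’s actual data structures; the names `Item` and `apply_update` are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    owner: str    # the host that owns (and alone may modify) this item
    key: str      # what the item describes
    seq: int      # monotonically increasing per (owner, key)
    payload: str

def apply_update(db: dict, item: Item) -> bool:
    """Install an update. The version with the highest sequence
    number is current, and that's it. Returns True if installed."""
    k = (item.owner, item.key)
    current = db.get(k)
    if current is not None and current.seq >= item.seq:
        return False          # stale or duplicate update: ignore it
    db[k] = item
    return True
```

Because only the owner ever increments the sequence number, no merge logic is needed: comparing two integers settles every question of currency.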

When an item is updated, OSPF makes a best effort to copy the update to every other OSPF speaker in the network (or at least in the ‘area’, a technical OSPF term). This is the “flooding” algorithm. Its logic is simple. Upon receiving an OSPF update, check its sequence number to see if you’ve already seen it before. If so, do nothing. Otherwise, relay a copy to all of your OSPF neighbors. Additionally, a mechanism is provided to obtain a complete copy of all current entries known to an OSPF router, for when a new router joins the network. Various timeouts try to ensure consistency.
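The flooding logic itself fits in a few lines. The sketch below models a network of nodes in memory (the `Node` class is invented for the example) and relays by recursive call where a real implementation would retransmit packets:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []
        self.db = {}   # (owner, key) -> (seq, payload)

def flood(node, owner, key, seq, payload, sender=None):
    """On receiving an update: if we've seen it (or something newer),
    do nothing. Otherwise install it and relay to all neighbors."""
    have = node.db.get((owner, key))
    if have is not None and have[0] >= seq:
        return                                   # already seen: drop it
    node.db[(owner, key)] = (seq, payload)
    for nbr in node.neighbors:
        if nbr is not sender:                    # no need to echo to the sender
            flood(nbr, owner, key, seq, payload, sender=node)
```

The duplicate check is what terminates the flood: each node relays a given update at most once, so the update reaches every connected node and then the network goes quiet.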

OSPF’s flooding algorithm isn’t completely general, because it’s targeted at routers. The expectation is that pretty much every router will itself be a destination for the flooding, so there’s no need for anything other than link-layer multicast. A more generic flooding algorithm might use some other kind of multicast scheme, perhaps my Lightweight Multicast (LWM) proposal.

One obvious question remains: what happens when multiple hosts wish to modify a single database entry? Well, the simplest answer is that they can’t. More specifically, the database has to be structured so that this will never happen. In the case of OSPF, the common case is a link between two routers. Either router should be able to label the link “down”. In the OSPF database, there are two different entries for this link, one owned by each router. The “entry” for the link is thus a composite of the two. For the link to be “up”, both entries from the routers on either end of the link have to be labeled “up”. Otherwise the link is considered “down”.
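The composite rule is trivially codified; the point is that no shared entry is ever written, only read:

```python
def link_state(entry_a: str, entry_b: str) -> str:
    """The link's apparent state is computed from the two single-owner
    entries: 'up' only if the serfs from BOTH endpoint routers say 'up'."""
    return "up" if entry_a == "up" and entry_b == "up" else "down"
```

Either router can take the link down unilaterally by rewriting only its own entry, so no write conflict can ever arise.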

This basic model can be extended to build a distributed network database. Entries in the database, which I call “serfs”, are owned by specific hosts, and no other host can modify the entry. A best-effort delivery mechanism is provided using flooding, multicast, or some combination thereof. Conventional database entries, that can be modified by more than one host, are built up out of serfs.

For example, consider a remote filesystem. You have a directory with a list of files in it. Any host wishing to place a file into the directory authors a serf listing that file. The “directory” as seen by the users is the union of all the serfs. What if a host wants to delete a file? It authors a serf specifically listing that file as deleted. When the host that authored the file to begin with receives this serf, it notes that its file has been deleted and updates its own serf, which gets flooded around the network. Thus, barring network failures or permission conflicts, the database filesystem should settle down quickly to reflect the deleted file.
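A sketch of the directory view, under the assumption that each serf is a simple record with an `op` field (`"add"` or `"delete"`) and a file `name`; the representation is invented for illustration:

```python
def directory_view(serfs):
    """The 'directory' seen by users is the union of all 'add' serfs,
    minus any file named in a 'delete' serf."""
    deleted = {s["name"] for s in serfs if s["op"] == "delete"}
    present = {s["name"] for s in serfs if s["op"] == "add"}
    return sorted(present - deleted)
```

Once the original author withdraws its own “add” serf, the “delete” serf can itself be retired, and the steady state again contains one serf per live file.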

But there is a time during which both the original serf and the “delete” serf are present in the network (let’s assume permissions are not an issue; all hosts have permission to delete the file). If there are link failures and the network becomes partitioned, this time may become substantial, as when a user on a disconnected laptop deletes the file. The first thing I’d like to point out is that although we now have a conflict between the two serfs, it would be impossible for _any_ database system to resolve the conflict without additional information! All we know at this point is that a file has been created and a file has been deleted. In what order did these operations occur? Without any further information, we don’t know!

OK, so let’s say we’re running NTP, so we’ve got consistent clocks. All the serfs now include timestamps, so we can at least tell what order they happened in. But that still might not solve our problem. Did the user delete the copy everyone else sees, or did he delete an older copy that was outdated? Again, without more information, we don’t know. So along with the delete we include, in addition to the timestamp of the delete itself, the timestamp of the file being deleted. Now everyone can tell whether the user was deleting the current version of the file (the timestamps match between the delete and the existing file entry), an outdated one (the timestamp of the version being deleted is older than that of the current file entry), or whether the file entry itself appears to be outdated (the timestamp of the version being deleted is newer than the one on the file). OK, better, but we’re still not out of the woods. If the user deleted the current version of the file, then fine, but what if the user deleted an older version? What now? Do we keep the new one or honor the delete?
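The three-way comparison just described can be written down directly. The function name and return labels are invented for the example:

```python
def classify_delete(deleted_version_ts: float, current_file_ts: float) -> str:
    """Compare the timestamp carried inside the delete serf (the version
    the user thought they were deleting) against the file entry we hold."""
    if deleted_version_ts == current_file_ts:
        return "current"        # user deleted the version everyone sees
    if deleted_version_ts < current_file_ts:
        return "stale-delete"   # user deleted an outdated copy
    return "stale-file"         # our own copy of the file is outdated
```

Only the first case can be resolved mechanically with confidence; the other two are exactly the conflicts discussed next.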

Well, think of it this way. The delete is a modification to the file. Anyone who has used a version control system knows that if two people make modifications to the same file, then sometimes a manual conflict resolution is necessary. In the special case of a text file, you can localize the changes according to what line numbers were affected, but no matter what, there are still times when you need to do manual conflict resolution. The bottom line is that there is simply no completely general way to resolve commit conflicts in a database system. The best you can hope for is to specialize your conflict resolution logic for your particular application (like OSPF’s rule that a link is only up when both ends agree that it’s up), accepting that for some applications (English text or even computer code), it’s probably impossible to expect the computer to resolve all conflicts automatically.

What then, about security? A combination of encryption (to protect read access) and signing (to protect write access) should do the trick. To protect any but authorized users from reading a piece of data, you encrypt it using a key that none but authorized users can decrypt. The metadata associated with the serf (timestamps, name of database, fields affected) probably has to remain unencrypted or minimally encrypted so that the database software can process it. To protect write access, you sign the serfs and program the system to ignore unsigned or incorrectly signed serfs.

Perhaps the trickiest aspect of such a database design would be the flooding algorithm itself. With OSPF, because you can assume that pretty much every router is participating in the database, it’s much easier. You just use single-hop, link-layer multicast. But what about a database without such a simple model? One obvious answer is to use multicast, at least for initial distribution, with all participating databases subscribed to the multicast. The Internet’s current difficulties with multicast are properly the subject of another paper, or two, or ten. We need to get multicast working properly first, then use multicast to perform initial distribution of new serfs.

So then, how do we deal with dropped serfs? OSPF uses a link-by-link acknowledgement; each router has to explicitly acknowledge that it has received a link update before its neighbor will stop retransmitting it. In our case, using a single multicast to distribute serfs throughout the network makes the initial distribution easier but complicates handling packet loss. What about the OSPF model? Well, it makes sense when the routers themselves, the same devices that would have to duplicate multicast packets, are also the primary recipients of the data. There’s no extra duplication of data. In a more general database model, with multiple database nodes separated from each other by multiple hops, a link-by-link model doesn’t really work anymore. It’s hard to see how we could avoid unnecessary duplication without using multicast.

So let’s assume that we’re using multicast for the initial distribution. How do we detect packet loss? OSPF uses explicit acknowledgements. Again, this makes sense in a link-by-link model, but now we’re talking about sending acknowledgements all the way back to an origin server. That doesn’t look as good, especially when we ask what happens when the origin server goes down. So it’s probably best to avoid explicit acknowledgements unless we really need to aggressively keep our database up-to-date. This point requires more attention, though.

So a node detects a missing serf, probably by monitoring sequence numbers on the serfs it has received, possibly with some kind of timeout mechanism as well. What does it do? It now needs to get the serf retransmitted from one of the other database servers. Let’s even assume that it knows the IP addresses of all the other database servers and has enough routing information to determine which of them are in some sense ‘close’. Several cases present themselves. First, only one (or two) node(s) might require a serf retransmitted, suggesting a local unicast, as opposed to many nodes requiring the retransmission, which suggests that we should just re-multicast the thing. How do we handle this?
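Gap detection by sequence number is straightforward, assuming each owner numbers its serfs consecutively (the representation below is invented for illustration; a timeout would still be needed to notice a missing *latest* serf):

```python
def missing_serfs(received_seqs):
    """received_seqs maps owner -> set of sequence numbers received.
    Any gap below the highest number seen from an owner means a
    serf from that owner was dropped and needs retransmission."""
    gaps = {}
    for owner, seqs in received_seqs.items():
        highest = max(seqs)
        lost = sorted(set(range(1, highest + 1)) - seqs)
        if lost:
            gaps[owner] = lost
    return gaps
```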

Consider the case of all the nodes on a local Ethernet. It would make sense to multicast retransmission requests to the entire group and let one node with the requested serf retransmit (by multicast) to the entire group. If there were one or two additional nodes connected by slower WAN links, and one of these nodes was missing a serf, then a retransmission request should be multicasted just to the core group, followed by a unicast retransmission to the outlying node.

It’s starting to sound like some kind of ‘anycast’ is required here. A retransmission request should be anycast to the other database nodes. If the node that receives the anycast doesn’t have the serf in question, then it presumably requires it, too, so it now anycasts to the remaining nodes, minus the one(s) requesting the retransmission to begin with.

I don’t have easy answers to these questions, so let’s set a lower bar: we want to achieve retransmission with roughly no more overhead than if the nodes were interconnected by a mesh of TCP sessions. A simple way to approach this would be as follows. Initial announcements are multicast to the entire group. Retransmission requests are anycast to the entire group and may include a list of additional addresses requiring the retransmission. Retransmissions themselves are unicast to the requesting address, unless additional addresses were specified, in which case multicast (just to those addresses) should be used.
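The retransmission side of that scheme amounts to a small dispatch rule; a sketch, with invented names:

```python
def plan_retransmission(requester, extra_addrs):
    """Per the scheme above: unicast when only the requester needs the
    serf; multicast (to just the listed addresses) when the request
    named additional nodes that also need it."""
    targets = [requester] + list(extra_addrs)
    if len(targets) == 1:
        return ("unicast", targets)
    return ("multicast", targets)
```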

This scheme presupposes a more sophisticated menu of multicast and anycast options than currently exist in the Internet. “Lightweight” non-unicast options are required that don’t require the extensive setup, explicit addressing, and router overhead of current multicast and anycast offerings. Rather than explicit multicast or anycast addresses, I suggest a set of IP header extensions listing multiple unicast addresses that combine to form multicast or anycast behavior. This would allow the easy formation of multicast/anycast packets with no more trouble than appending IP headers. Once we’ve got this, then we attempt to build a ‘Serf-Based Protocol’ (SBP) to provide transport services along the lines I’ve outlined above.
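To make the header-extension idea concrete, here is a toy model of such a packet and its forwarding rule. Every name here is invented; no such header exists in any current standard:

```python
from dataclasses import dataclass, field

@dataclass
class LightweightPacket:
    """A unicast packet whose extension header lists further unicast
    addresses, so a single packet yields multicast or anycast behavior
    without any pre-established group address or router state."""
    dest: str
    extra_dests: list = field(default_factory=list)  # extension-header list
    mode: str = "multicast"                          # or "anycast"
    payload: bytes = b""

def delivery_targets(pkt):
    """'multicast' means deliver toward every listed address;
    'anycast' means deliver to any one of them (the first, here)."""
    if pkt.mode == "anycast":
        return [pkt.dest]
    return [pkt.dest] + pkt.extra_dests
```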
