Tuesday, October 28, 2008

Resilient Overlay Networks

In this paper, an application-layer overlay network is built using UDP in order to provide a more resilient routing infrastructure that, because of its smaller size, can respond to disruptions and slowdowns faster and more flexibly than the traditional BGP approach between ASs. The particular implementation of these application-level routers in this paper uses a single-hop intermediary between each pair of nodes in order to route around failures; this gives essentially n-2 alternate paths where before there was a single path (as controlled by information obtained via BGP).
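
To make the path-multiplicity idea concrete, here is a minimal sketch (my own illustration, not the paper's code) of how a single-intermediary overlay turns one direct route into n-2 additional candidate routes:

```python
# Minimal sketch of the single-intermediary idea: besides the direct Internet
# path from src to dst, every other overlay node can act as a one-hop relay,
# yielding n-2 indirect candidate paths in an n-node overlay. Node names are
# invented for illustration.

def candidate_paths(nodes, src, dst):
    """Return the direct path plus every single-intermediary path."""
    paths = [(src, dst)]  # the default path chosen by underlying BGP routing
    for relay in nodes:
        if relay not in (src, dst):
            paths.append((src, relay, dst))  # one overlay hop through `relay`
    return paths

# A 5-node overlay gives 1 direct path plus 3 single-relay alternatives.
print(candidate_paths(["A", "B", "C", "D", "E"], "A", "E"))
```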

With this path multiplicity, applications can choose which paths to use (or under which metrics paths should be chosen) depending on their requirements. In addition to this application-level choice, the routers themselves can be more powerful, supporting much more detailed routing policies, since the network is essentially a private one built on top of the public Internet and since the networks targeted are small relative to the size of ASs. Thus a RON router can spend more time per packet than a gateway router in the Internet can.
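
As a rough illustration of what application-chosen metrics might look like (the path names and measurement fields below are hypothetical, not the paper's API), an overlay node could keep per-path measurements and let the application name the metric it wants minimized:

```python
# Hypothetical sketch: per-path measurements kept by an overlay node, with the
# application choosing which metric drives path selection. Values are made up.

measurements = {
    ("A", "E"):      {"latency_ms": 80,  "loss_rate": 0.050},  # direct path
    ("A", "B", "E"): {"latency_ms": 95,  "loss_rate": 0.010},  # relay via B
    ("A", "C", "E"): {"latency_ms": 120, "loss_rate": 0.002},  # relay via C
}

def best_path(measurements, metric):
    """Pick the candidate path that minimizes the application's chosen metric."""
    return min(measurements, key=lambda path: measurements[path][metric])

print(best_path(measurements, "latency_ms"))  # a latency-sensitive application
print(best_path(measurements, "loss_rate"))   # a loss-sensitive application
```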

One thing that jumps out to me is that RON (unlike the next paper) is geared towards a network whose nodes can be implicitly trusted. For example, the use of policy tags to bypass some of the per-packet computation depends on trusting the policy tags you receive; there is nothing to prevent a malicious user from rewriting policy tags to ensure they get the highest level of service while others do not.

The evaluation results are very interesting. For almost every metric, RON improves things *in most cases*, but there are always some measurements under which RON does substantially worse. For example, RON improves path reliability in the face of packet loss most of the time, but a small percentage of samples show worse behavior during outages: that is, some packets that would get through using just the underlying connectivity do not get through with RON. Similarly, RON can make loss rates worse (although the authors believe this is due to using bidirectional instead of unidirectional loss information; it remains to be seen whether that would make a difference).

1 comment:

Randy H. Katz said...

I found it interesting that the metrics of interest were selected to make RON look good. I am not really sure what a 30% packet loss rate over a 30-minute period really means for TCP throughput. Loss rates of even a few percent can significantly degrade TCP throughput. However, numbers such as those chosen do indeed indicate a serious, persistent problem with the connection -- which you would hope would be detected and fixed by the underlying routing algorithms.
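
To put the point about loss and TCP throughput in perspective, here is a back-of-the-envelope calculation using the well-known Mathis et al. approximation, throughput ~ MSS / (RTT * sqrt(p)); the MSS and RTT values are assumptions for illustration, not numbers from the paper:

```python
# Rough illustration: even a few percent loss caps TCP throughput well below
# what the link could carry. Uses the Mathis et al. approximation
# throughput ~= MSS / (RTT * sqrt(loss)). MSS and RTT are assumed values.

from math import sqrt

MSS_BYTES = 1460   # typical Ethernet-sized segment
RTT_SEC = 0.07     # assumed 70 ms round-trip time

for loss in (0.001, 0.01, 0.03, 0.30):
    throughput_bps = (MSS_BYTES * 8) / (RTT_SEC * sqrt(loss))
    print(f"loss {loss:>5.1%}: ~{throughput_bps / 1e6:.2f} Mbit/s")
```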