BMe Research Grant
Nowadays, providing high availability on the Internet is still an open question, although several real-time applications used day-by-day would require such reliable communication network. Failures throughout the network appear frequently due to various reasons, which have to be discovered as fast as possible, and traffic must immediately be rerouted to an alternative path. My research objective is to deeply analyze the only possible method available in today’s commercial routers and propose new network optimization techniques in order to attain higher failure case coverage.
I carried out my research work at BME, Department of Telecommunications and Media Informatics, in particular, at High Speed Network laboratory. In the Lab with Ericsson arm-in-arm, most of the research areas are related to today's and future Internet, where protection against network failures has always been a hot topic. The quality of their work is reflected by the large number of high ranked international publications and journal articles. In 2012, Dr. János Tapolcai has won in the Hungarian Momentum programme, and formed his MTA-BME Future Internet Research Group at BME in which I am a valued member from the beginnings.
Current Internet has reached the level of reliability, where Internet and cloud services are widely spreading among users. This gives an increasing push on the service providers to operate the Internet without any interruption and proceed to win the trust of most potential users. Nowadays, not just Internet Service Providers (ISPs) and end-users are concerned, but many other multimedia suppliers started to gain a foothold in this field and broadcast digital content over the IP (Internet Protocol). Such applications are IP based telephony (Voice over IP), Video on Demand (VoD), online-gaming, IPTV, etc. I expect that the reliability of IP networks will further improve in the future, and Internet will become a key infrastructure for the society. Reliability practically means that connection is ensured throughout the network at any given time, and a failure is handled so fast that virtually no packet loss occurs at the end-users. Necessarily, the Internet has to keep up with these real-time applications, which require continuous and reliable connections. Consequently, if a link or node fails in the network, it does not only cause routing instability, but a whacking portion of the traffic will be lost.
Availability in the Internet is severely plagued by component failures that occur frequently due to various reasons, such as physical interruptions, flapping interfaces, etc. Earlier, these failures were handled by Interior Gateway Protocols (IGPs), like the Open Shortest Path First (OSPF) or the Intermediate System-to-Intermediate System (ISIS), which adopt a restoration-based resilience approach. After a failure, the related information was distributed throughout the network so that every router can recalculate the shortest paths with the failed component removed. This process is called re-convergence, and, depending on the network size and routers' shortest path calculation efficiencies, it can take between 150 ms and a couple of seconds. It is obvious that this is beyond what real-time applications can afford, even if a small delay can be tolerated via buffering.
To overcome these issues, the IETF (Internet Engineering Task Force) defined a framework, called IP Fast ReRoute (IPFRR), for native IP protection in order to reduce failure reaction time to tens of milliseconds, in particular to sub-50 ms convergence time. It tries to avoid the global re-convergence with local rerouting and pre-computed detours, which means that the router adjacent to the failure tries to solve the problem individually by means of pre-computed alternate routing tables, which are installed long before any failure occurs.
As one of the first approaches, a basic specification for IPFRR was defined by the IETF, called Loop-free Alternates (LFA) , which is simple, standardized and is already available in today's routers. In LFA, when a failure occurs, the adjacent router tries to pass the packet to an alternate neighbor, who still has a functioning path to the destination. However, such neighbor does not always exist, evading LFA for providing 100% failure case coverage in every network. Therefore, in the past few years many IPFRR proposals have appeared [2-7], but the majority of them require additional management burden, complexity and non-standard functionality in IP forwarding and routing protocols making LFA the only deployable option for ISPs.
Recently, in order to improve the level of fast protection provided by LFA, the IETF has published a generalization called Remote LFA (rLFA) . This method provides additional backup connectivity when none can be provided by the basic mechanism. However, even though rLFA can provide higher reliability, it still inherits topology dependence from pure LFA, and thus providing 100% failure case coverage with pure IP remains an open question.
Many operators are still considering to enable LFA or rLFA, and trying to measure the expected benefits against the additional costs. In this research, I seek ways to assist in making this important decision. I give new graph theoretical tools for analyzing the different types of LFAs failure case coverage in operational networks. Similar protectability analyses are already available for some non-standardized IPFRR mechanisms.
For LFA, only simulation-based reports have been available this far, and mathematical analysis is confined to the link-protection case.
For rLFA, as far as I know, there is no information available about how it performs in different network topologies, and what are the fundamental lower and upper bounds on failure case coverage, or how this parameter can be improved.
Below, I extend existing results on mathematical LFA coverage analysis with new tools for studying both the link- and node-protection cases [S2]. I also provide a theoretical rLFA coverage analysis and establish a sufficient and necessary condition for a network to have 100 % rLFA failure coverage. I study which networks are the “bad cases”' for rLFA where the coverage is particularly low. Furthermore, I reveal the connections between LFA and rLFA and show the conclusions that can be drawn regarding one if information about the other one exists [S5][S6].
Initial deployments confirmed that in many operational networks, LFA indeed does not guarantee protection for each failure scenarios. This calls for developing network optimization tools to tune the network topology so that the number of failure cases protectable by LFA increases. There are various approaches to reach this end. One way is LFA network design, which aims to design LFA-friendly network topologies right from the outset. Another approach is LFA graph extension , where the task is to alter the network topology to boost LFA coverage. I have developed a third approach called LFA cost optimization [S2][S4], which requires constructing IGP link costs in a way as to maximize the number of possible failure cases protectable by LFA. I have proven that this problem is NP-complete, therefore I formalized a linear program (ILP) and several approximating algorithms to reach the best results through many fine tunable parameters. I have shown that LFA cost optimization can attain close to 100% LFA failure case coverage in many operational networks. Since the previous two methods may not be afforded by network operators, I have developed a combined LFA network optimization task [S1], which requires both adding new links and optimizing link costs to the same end. The results have shown that the number of new links has to be added can be significantly (by 50%) reduced.
improving IP resilience is a recurring theme in the literature, for the
specific case of LFA, only the joint optimization of network performance
and resilience has been investigated so far. Thus, at the moment
very little understanding is available as to how much LFA-based IP Fast ReRoute is suitable to protect an IP network and to what extent this can be improved. My Remote LFA analysis also
shows that many practically important topologies do no allow 100%
failure coverage even if it is protected by rLFA [S5]. As a way of an
improvement, the aforementioned LFA graph extension method was adopted
in order to improve rLFA coverage, i.e., the rLFA graph extension problem
was studied. Moreover, in the analyses and ways of improvement we also considered the
case of node failures [S6]. Note that node failure may involve many failures of links attached to that
node, therefore on average a network may need 1-2 additional
links in order to attain 100% failure coverage. A summary of results can be
found in the table below, where the column on the far left side shows
the real-world network topologies, whilst η(G) denotes the initial LFA coverage and μ(G) marks the initial remote LFA coverage. η(G)_GE indicates
the number of new links to be added to attain 100% LFA failure case
coverage, while the column of the corresponding results of the case of
rLFA is noted by μ(G)_GE. Last but not least, η(G)_CO denotes that how high LFA coverage can be reached by merely optimizing the link costs.
During my research, I worked with several researchers from the industry, including András Császár and Gábor Enyedi. The collaboration proved fruitful in that, too, that we have won a Best Paper Award in 2011 at the most prestigious conference of this field. My further research presented at another prestigious conference was also nominated for the Best Paper Award.
Our results and algorithms were also implemented in an industrial software, which enables network operators to easily analyze and optimize their network from a reliability perspective.
Furthermore, we are in touch with the member of the IETF Working Group standardizing remote LFA. We share our ideas and results in order to improve the quality of this draft.
In the future, I definitely wish to extend my remote LFA research and adapt the well performing LFA cost optimization to rLFA. Moreover, the appealing field of Software Defined Networks (SDN) will allow me studying other approaches in real-world network topologies.
Related own publications
Levente Csikor, Máté Nagy, Gábor Rétvári, Network Optimization Techniques for Improving Fast IP-level Resilience with Loop-Free Alternates. INFOCOMMUNICATIONS JOURNAL 3:(4) pp. 2-10. (2011)
Levente Csikor, János Tapolcai, Gábor Rétvári, Optimizing IGP Link Costs for Improving IP-level Resilience with Loop-Free Alternates. COMPUTER COMMUNICATIONS 36:(6) pp. 645-655. (2013)
Levente Csikor, Gábor Rétvári, János Tapolcai, High Availability in the Future Internet. In Future Internet Architecture Book, Future Internet Assembly 2013: Validated Results and New Horizons, Lecture Notes in Computer Science, Springer, pp. 64-76., ISBN: 978-3-642-38081-5 (2013)
Gábor Rétvári, Levente Csikor, János Tapolcai, Gábor Enyedi, András Császár, Optimizing IGP Link Costs for Improving IP-level Resilience. In Proc. of 8th International Workshop on the Design of Reliable Communication Networks (DRCN), Cracow, Poland, pp. 62-69. ISBN: 978-1-61284-123-6 (2011)
Levente Csikor, Gábor Rétvári, IP Fast Reroute with Remote Loop-Free Alternates: the Unit Link Cost Case. In Proc. of RNDM 2012 4th International Workshop on Reliable Networks Design and Modelling. Saint Petersburg, Russia, pp. 16-22. ISBN: 978-1-4673-2178-5 (2012)
Levente Csikor, Gábor Rétvári, On Providing Fast Protection with Remote Loop-Free Alternates: Analyzing and Optimizing Unit Cost Networks. submitted to Telecommunication Systems Journal, (2013)
List of references
A. Atlas, A. Zinin, Basic specification for IP fast reroute: Loop-Free Alternates. RFC 5286, 2008.
G. Schollmeier, J. Charzinski, A. Kirstädter, C. Reichert, K. Schrodi, Y. Glickman, C. Winkler, Improving the resilience in IP Networks. in. Proc. HPSR, 2003.
A. Kvalbein A. F. Hansen, T. Čičić, S. Gjessing, O. Lysne, Fast IP network recovery using multiple routing configurations. in Proc. IEEE INFOCOM, 2006.
S. Bryant, M. Shand, S. Previdi, IP fast reroute using Not-via addresses. Internet Draft, 2008.
S. Lee, Y. Yu, S. Nelakuditi, Z.-L. Zhang, C.-N. Chuah, Proactive vs. Reactive Approaches to Failure Resilient Routing. in Proc. IEEE INFOCOM, Hong Kong, 2004.
Z. Zhong, S. Nelakuditi, Y. Yu, S. Lee, J. Wang, C.-N. Chuah, Failure Inferencing based Fast Rerouting for Handling Transient Link and Node failures. in IEEE INFOCOM, 2005.
K. Lakshminarayanan, M. Caesar, M. Rangan, T. Anderson, S. Shenker and I. Stoica, Achieving convergence-free routing using failure-carrying packets. in. Proc. SIGCOMM, 2007.
S. Bryant, C. Filsfils, S. Previdi, M. Shand, N. So, Remote LFA FRR. Internet Draft, 2013.
G. Rétvári, J. Tapolcai, G. Enyedi, and A. Császár, IP Fast ReRoute: Loop Free Alternates revisited. in INFOCOM, 2011, pp. 2948–2956.