Names: Konstantinos Dovrolis and Parameswaran Ramanathan
Title: FAULT MANAGEMENT IN NEXT GENERATION INTERNET
Affiliation: University of Wisconsin, Madison
Postal Address: Professor Parmesh Ramanathan
                Department of Electrical and Computer Engineering
                1415 Engineering Drive
                Madison, WI 53706-1691
E-mail Address: parmesh@ece.wisc.edu
Phone Number: (608) 263-0557
Fax Number: (608) 265-4623

Fault Management In Next Generation Internet
--------------------------------------------

Advancements in technology have opened the way for many new network applications such as teleconferencing, remote seminars, virtual reality, online information gathering, visualization, and remote medical support. These applications will radically affect the way people communicate in their everyday life. They will also provide new methods for scientific research, education, health care, and several other activities. The Next Generation Internet (NGI) can be the first widely used operational network to provide different Quality-of-Service (QoS) classes to successfully support such applications.

This paper focuses on the fault management mechanisms that NGI has to support. Most of the proposed network architectures for providing QoS work only if there are no failures of network components. If a component (i.e., a link or a router/switch) fails, the network has to either permanently or temporarily disrupt the flows whose QoS can no longer be met. This lack of reliability can be intolerable for several of the emerging applications, for the following reasons.

First, these applications are expected to use the network for a much longer duration than present-day applications, significantly increasing the likelihood of a flow encountering a component failure. For example, in a teleconferencing or a remote lecture application, the duration of a flow is expected to average on the order of an hour, as compared to seconds in typical current applications like WWW or FTP.
Second, in many of these applications the consequences of losing the QoS guarantees can be very severe. For example, disrupting the flow used by a remote medical operation can be life threatening. Similarly, if the flow from a multiparty teleconference to a certain user is disrupted, the entire teleconference may have to be terminated.

Third, as networks expand, the likelihood of a failure occurring somewhere in the network becomes substantial, thereby increasing the chances that some flows will be dropped or temporarily suspended.

In telephone networks, where reliability is also very important, there are many methods that minimize the chances of dropping calls in case of failures. Next generation networks have to provide the same (or an even higher) level of reliability, since the applications involved will gradually become as important as the telephone service, or even more so.

In general, there are two distinct approaches, reactive and proactive, for recovering from network component failures. The reactive approach deals with a failure only after it occurs. In a connection-oriented network, a typical action of a reactive method is to reallocate the unreserved fault-free network resources among the connections that were affected by a failure. Since the amount of unreserved resources may not be adequate for total recovery, some connections may be dropped.

In a connectionless network, such as the current Internet, failures are also handled in a reactive way. When a failure occurs, the routing tables of the neighboring nodes are dynamically updated and the failure-affected flows are then forwarded through new routes. If this rerouting procedure is too slow, the application may consider the flow to have been dropped. A drawback of the reactive approach is that the recovery procedure can take up to several seconds, during which the affected connections or flows experience a service disruption period.
Unexpected termination or service disruption can be unacceptable for applications like online stock market trading or remote medical operations, and frustrating for applications like teleconferencing. On the other hand, reactive fault recovery methods are relatively simple and economical from the network's perspective.

The proactive approach is an alternative way of dealing with network component failures. Some network resources, in the form of spare links, bandwidth, or buffer space, are reserved a priori solely for the purpose of facilitating recovery from possible failures. The key advantage of this approach is that, for up to a certain number of failures, none of the flows will be terminated and the service disruption period can be acceptably small. The drawback of proactive schemes is that the resources reserved for recovery from failures cannot be reserved for other flows. Consequently, given the same amount of network resources, proactive schemes can serve fewer flows than reactive schemes.

In general, both reactive and proactive fault recovery techniques can coexist in a network. The applications that have higher reliability requirements can use proactive mechanisms, while the rest can use reactive mechanisms, possibly at a lower cost.

At the University of Wisconsin-Madison, we have developed a proactive technique called RAFT (Resource Aggregation for Fault Tolerance). The main objective of our approach is to reduce, to the maximum possible extent, the amount of extra resources reserved for fault recovery, by allowing multiple flows to share them. Specifically, resources are reserved for each flow along two disjoint routes. The first of these routes, called the primary path, is used in the fault-free case. The flow is switched to the second route, called the secondary path, when a component in its primary path fails.
The basic idea is to aggregate the resource reservations along the secondary path of a flow with the secondary path reservations of other flows, if it is certain that the shared resources will never be used simultaneously by these flows. We have shown that RAFT leads to significant resource utilization gains for the network when a large number of flows use proactive fault recovery. The effective cost of providing fault recovery using RAFT, as well as the effects of the network topology and of the routing scheme used, have also been studied.

There are several important issues that have to be investigated regarding fault management in a network such as NGI. A major question is whether reactive or proactive schemes should be used, and what the exact techniques for each type of recovery should be. If both types are used, how are they going to be supported simultaneously? Another open issue is the effect of implementing these protocols on the rest of the network architecture. It is important for the fault management protocols to be integrated with the existing network protocols (such as the routing or the resource reservation protocols) without requiring significant modifications to them. For example, we are currently studying ways to integrate RAFT with the Resource Reservation Protocol (RSVP), as well as with various emerging multicast routing protocols.

In conclusion, it is certain that 21st century networks will be used by applications that require higher levels of reliability. The chances of a flow being affected by a network component failure are likely to increase. NGI can be the ideal network where different fault management/recovery mechanisms will be implemented and tested in a realistic environment.
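
Illustrative note on secondary-path aggregation. The sharing idea behind RAFT described earlier can be made concrete with a small sketch. It is not the RAFT protocol itself. It assumes a single-link-failure model (at most one link is down at any time), which this paper does not specify, and it treats bandwidth as the only reserved resource. Under that assumption, two flows whose primary paths share no link can never need their backup capacity at the same moment, so a link on their secondary paths only has to reserve the worst case over all single failures rather than the sum over all flows. The function name, the link labels, and the example flows are all hypothetical.

```python
from collections import defaultdict

def backup_reservations(flows):
    """Compare dedicated and aggregated backup reservations per link.

    flows: list of (bandwidth, primary_links, secondary_links) tuples,
           where each path is a list of link labels.
    Assumption (not from the paper): at most one link fails at a time.
    Returns (dedicated, aggregated), each mapping link -> reserved bandwidth.
    """
    dedicated = defaultdict(float)
    # demand[link][failed] = bandwidth rerouted onto `link` if `failed` goes down
    demand = defaultdict(lambda: defaultdict(float))
    for bandwidth, primary, secondary in flows:
        if set(primary) & set(secondary):
            raise ValueError("primary and secondary paths must be link-disjoint")
        for link in secondary:
            # Without sharing, every flow gets its own backup reservation.
            dedicated[link] += bandwidth
            for failed in primary:
                demand[link][failed] += bandwidth
    # With sharing, a link reserves only the worst single-failure demand.
    aggregated = {link: max(by_failure.values())
                  for link, by_failure in demand.items()}
    return dict(dedicated), aggregated

# Hypothetical example: three flows whose secondary paths all cross link "C-D".
example_flows = [
    (10, ["A-B", "B-D"], ["A-C", "C-D"]),
    (5,  ["E-F", "F-D"], ["E-C", "C-D"]),
    (4,  ["A-B", "B-D"], ["A-C", "C-D"]),
]
dedicated, aggregated = backup_reservations(example_flows)
print("dedicated :", dedicated)
print("aggregated:", aggregated)
```

In this example the first and third flows share primary links and therefore cannot share backup capacity with each other, while the second flow, whose primary path is disjoint from theirs, can share with both. A model that tolerates more simultaneous failures would need a different worst-case computation.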