System-Level Performance of Wide Area Distributed Systems

Gary J. Minden
The University of Kansas
March 26, 1997

1.0 Introduction

This White Paper concerns issues of networking system performance across wide-area networked systems. Experience in the MAGIC Gigabit Testbed shows that measuring performance at multiple levels of the networking system, i.e. the application level, the protocol processing level, and the hardware interface level, is crucial to tuning the distributed system towards its maximum capability.

The MAGIC(1) group has, since 1993, been concerned with the performance of systems distributed over a wide area (approximately 1000 km). In order to understand the performance of these systems, the group has developed techniques to synchronize the recording of significant events, to transmit those events, and to correlate them across multiple systems. Using these techniques, the MAGIC group has identified and eliminated many "bottlenecks" in wide-area networking systems. Below, we briefly describe the systems we have implemented and the importance of implementing these types of systems in the National Information Infrastructure (NII).

1.1 Network Event Tracking

In order to track network events across a wide-area network, one needs three things. First, and perhaps most important, is a reliable and synchronized time source. Without knowing the "time" at each source/sink on the network, it is impossible to determine what happened when, or what the cause of missing data or low transmission performance might be. Determining network behavior requires timing accuracy on the order of microseconds. Accurate network time is crucial to understanding NII behavior. Within the MAGIC network, GPS receivers are used at several nodes, and the NTP system is used to synchronize adjacent systems.

This White Paper proposes that the Next Generation Internet support a common time synchronization protocol.
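To illustrate why a synchronized time source matters, the following minimal sketch shows the standard four-timestamp calculation that NTP-style protocols use to estimate the offset between two clocks and the round-trip delay between them. It is not the MAGIC implementation; the names Exchange, offset_and_delay, and best_offset are hypothetical, and choosing the sample with the smallest delay is a common heuristic assumed here for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Exchange:
    """One request/reply exchange between a client and a time server (hypothetical type)."""
    t1: float  # request sent, read from the client clock
    t2: float  # request received, read from the server clock
    t3: float  # reply sent, read from the server clock
    t4: float  # reply received, read from the client clock


def offset_and_delay(x: Exchange) -> Tuple[float, float]:
    """Return (offset, delay) in seconds.

    offset estimates how far the server clock is ahead of the client clock,
    assuming the outbound and return paths take equal time.
    delay is the round-trip network time, excluding server processing.
    """
    offset = ((x.t2 - x.t1) + (x.t3 - x.t4)) / 2.0
    delay = (x.t4 - x.t1) - (x.t3 - x.t2)
    return offset, delay


def best_offset(samples: Iterable[Exchange]) -> float:
    """Pick the offset from the sample with the smallest round-trip delay.

    Assumption: low-delay samples are least affected by queueing, so their
    offset estimates are the most trustworthy.
    """
    results = [offset_and_delay(s) for s in samples]
    offset, _delay = min(results, key=lambda r: r[1])
    return offset
```

With the offset in hand, timestamps recorded on one host can be translated onto another host's time base before events are compared. The equal-path assumption limits accuracy on asymmetric routes, which is one reason to place GPS receivers at several nodes.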
1.2 Common Logging Format

Second, a common format for logging events is necessary. Lawrence Berkeley National Laboratory introduced a "NetLogger" format for tracking events across the wide-area network. By recording events such as the disk blocks requested, the time requested, the time obtained from disk, the time transmitted, and the time received, the MAGIC group was able to understand the behavior of the distributed system and tune that system to maximum performance. The MAGIC group is extending the NetLogger capability from the application level into the typical UNIX kernel level. It is crucial that applications and underlying systems log significant events so that those events can be analyzed for maximum system performance.

This White Paper proposes that the Next Generation Internet support a common distributed network event logging protocol for measuring and evaluating the performance of distributed network applications.

1.3 Analysis of Detailed Network Events

Finally, events across a wide area need to be collected, correlated, and analyzed for performance impediments. The MAGIC group has demonstrated initial capabilities for correlating distributed events, for understanding where and when performance impediments intervene, and for iterating on methods for alleviating bottlenecks in the distributed system.

This White Paper proposes development of a common set of basic network performance tools for determining and understanding the performance of distributed, wide-area applications.

Our goal is maximum application performance across the wide-area network. To reach that goal, we need to accurately measure the time of each event, we need to accurately record the behavior of the event, and we need to be able to analyze the behavior of the distributed system. The MAGIC group has developed initial techniques in all these areas. These initial techniques are offered as a starting point for further development within the National Information Infrastructure.
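The logging and analysis steps described above can be sketched as follows. This is an illustration only: the key=value record layout, the stage names, and the functions format_event, parse_event, stage_latencies, and bottleneck are assumptions made for this example and are not the actual NetLogger format or the MAGIC tools. The sketch also assumes that all timestamps have already been placed on a common time base.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event names, in the order a disk block moves through the system.
STAGES = ["requested", "read_from_disk", "transmitted", "received"]


def format_event(ts: float, host: str, event: str, block: int) -> str:
    """Render one event as a single line of key=value fields (assumed layout)."""
    return f"ts={ts:.6f} host={host} event={event} block={block}"


def parse_event(line: str) -> Tuple[float, str, str, int]:
    """Parse a line produced by format_event back into its fields."""
    fields = dict(item.split("=", 1) for item in line.split())
    return float(fields["ts"]), fields["host"], fields["event"], int(fields["block"])


def stage_latencies(lines: Iterable[str]) -> Dict[Tuple[str, str], float]:
    """Correlate events by block number and return the mean time per stage."""
    by_block: Dict[int, Dict[str, float]] = defaultdict(dict)
    for line in lines:
        ts, _host, event, block = parse_event(line)
        by_block[block][event] = ts

    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for times in by_block.values():
        # Only consecutive stages that were both logged for this block count.
        for start, end in zip(STAGES, STAGES[1:]):
            if start in times and end in times:
                totals[(start, end)] += times[end] - times[start]
                counts[(start, end)] += 1
    return {stage: totals[stage] / counts[stage] for stage in totals}


def bottleneck(lines: Iterable[str]) -> Tuple[str, str]:
    """Return the pair of consecutive stages with the largest mean latency."""
    means = stage_latencies(lines)
    return max(means, key=means.get)
```

Because every host writes the same record layout, logs gathered from the server, the network, and the client can simply be concatenated before analysis; the block number serves as the correlation key that ties events from different machines together.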
1.4 Proposal for NII Performance Measurement and Analysis

Given our experience in the MAGIC testbed, we suggest that the Next Generation Internet adopt common protocols for the following:

(1) Network Standard Timing
(2) Common Event Logging Format
(3) Analysis and Display Capabilities for Wide-Area Network Performance

1.5 Actions for the NII

We believe that accurate measurement of top-level application performance is crucial to the development of the NII. Only by developing the necessary tools for measuring performance at all levels, from the application through the network interface, can we be assured of maximum NII performance. This White Paper proposes a detailed discussion of NII performance issues at the CRA Workshop.

Gary J. Minden
The University of Kansas

(1) MAGIC is a DARPA-sponsored project dedicated to high-performance distributed computational systems. Principal members of the group are: The University of Kansas, Sprint Inc., SRI International, Lawrence Berkeley National Laboratory, The USGS EROS Data Center, and The Minnesota Supercomputing Center.

Gary J. Minden, Ph.D., is associated with the Information and Telecommunication Technology Center (ITTC) at The University of Kansas. At ITTC he is a principal investigator in the DARPA-sponsored MAGIC gigabit testbed and the ACTS/ATM/Interconnect (AAI) testbed. He also leads the design and implementation of the DARPA-sponsored Rapidly Deployable Radio Network. Between June 1994 and December 1996 he served as Program Manager for Networking Systems at the Defense Advanced Research Projects Agency. While at DARPA he formulated the Active Networking Program and participated in the initial formulation of the Large Scale Networking component of the National Science and Technology Council's Committee on Computing, Information, and Communications Research and Development Subcommittee.