Achieving High-Performance Computing on the Next Generation Internet
White Paper

Rich Wolski
Francine Berman
Department of Computer Science and Engineering
University of California, San Diego
email: rich@cs.ucsd.edu, berman@cs.ucsd.edu

The Next Generation Internet provides a unique opportunity to leverage advances in networking technology for high-performance applications. Experience with the I-Way [2], DOCT [3], and other metacomputing projects demonstrates that large-scale, shared, networked systems can aggregate and deliver performance to distributed applications. However, achieving that performance can be difficult. High-level software systems like those provided by the Globus system [5] or the Legion system [4] can provide interoperability, but without a performance-oriented network software infrastructure, wide-scale high-performance computing using the Internet will be infeasible. In particular, the Next Generation Internet must provide facilities for monitoring and reporting performance measurements that reflect application-achievable performance levels. Distributed resource management mechanisms require this information to make effective application scheduling decisions in the presence of contention and distributed administrative control.

A goal of the NGI initiative is to provide an efficient computational environment for a wide spectrum of high-performance distributed applications including telemedicine, digital libraries, collaboratories, etc. In order for these applications to achieve the performance potential of the NGI, considerable software infrastructure is required to manage resources, guarantee security, ensure efficient communication, and promote efficient application scheduling. Without support from the underlying and ubiquitous internet software infrastructure, the NGI, from the perspective of an application, may be no more efficient than the current Internet in spite of its performance potential.

Of particular importance to the achievement of application performance is the development of adaptive application scheduling mechanisms. Because the NGI will be shared, the performance it can deliver to an individual application will vary over time as a function of contention. An application scheduling mechanism must respond to the load and availability of NGI resources in order to determine a schedule that best leverages those resources at any given time. Moreover, it is not feasible to rely on a centralized global scheduling mechanism to achieve performance for each application using the Internet. NGI resources will be administered and controlled by their individual owners, making an overarching scheduling policy infeasible.

An approach we have been investigating is to augment each application with its own customized scheduler that evaluates all potentially useful resources strictly in terms of their utility to the application. The function of the scheduler is to quantify and compare how well the available resources meet the resource requirements of the application, and to derive a schedule that maximizes application performance using the "best" resource set. Our experience with such Application Level Schedulers as part of the AppLeS project at the University of California, San Diego indicates that they are able to achieve performance for distributed, parallel applications in production network computing environments [1,6]. To do so, application schedulers must be able to predict deliverable resource performance for the time frame in which the application will execute. AppLeS schedulers use a facility called the Network Weather Service (NWS) [7] to operate a distributed set of performance sensors and make dynamic statistical forecasts of resource load and availability.
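
The scheduling approach described above can be illustrated with a minimal sketch in Python. It is not taken from AppLeS or the NWS: the Resource type, the sliding-window mean used as a forecaster, the per-resource startup cost, and the cost model itself are all assumptions made for illustration. The sketch forecasts each resource's availability from recent measurements, predicts the run time of every candidate resource set, and selects the set with the lowest predicted time.

```python
from dataclasses import dataclass
from itertools import combinations
from statistics import mean


@dataclass(frozen=True)
class Resource:
    """A candidate machine (hypothetical model, not an AppLeS or NWS type)."""
    name: str
    peak_rate: float             # work units per second when unloaded
    startup_cost: float          # seconds to ship data to this machine (assumed)
    availability_history: tuple  # recent measured availability, each in 0.0-1.0


def forecast_availability(history, window=3):
    """Predict near-term availability as the mean of the last `window` samples.

    A deliberately simple stand-in for a statistical forecaster.
    """
    return mean(history[-window:])


def predicted_time(resources, total_work):
    """Predict run time for a resource set under an assumed cost model.

    Startup costs are paid one after another; the work is then divided in
    proportion to each machine's forecast deliverable rate.
    """
    startup = sum(r.startup_cost for r in resources)
    rate = sum(r.peak_rate * forecast_availability(r.availability_history)
               for r in resources)
    if rate <= 0:
        return float("inf")
    return startup + total_work / rate


def best_schedule(candidates, total_work):
    """Exhaustively compare every non-empty subset and keep the fastest.

    Exhaustive search is exponential in the number of candidates and is
    only reasonable for the handful of machines used in this sketch.
    """
    best_set, best_time = None, float("inf")
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            t = predicted_time(subset, total_work)
            if t < best_time:
                best_set, best_time = subset, t
    return best_set, best_time


if __name__ == "__main__":
    # Made-up measurements for three machines.
    machines = [
        Resource("fast", 100.0, 1.0, (0.9, 0.9, 0.9)),
        Resource("busy", 100.0, 5.0, (0.1, 0.1, 0.1)),
        Resource("mid", 50.0, 1.0, (0.8, 0.8, 0.8)),
    ]
    chosen, seconds = best_schedule(machines, total_work=1000.0)
    print([r.name for r in chosen], round(seconds, 2))
```

In this sketch a heavily loaded machine can be left out of the schedule even though its peak rate is high, because its forecast deliverable rate does not repay the cost of using it. That is the sense in which resources are evaluated strictly in terms of their utility to the application.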

Based on our work with AppLeS and the NWS, we have several observations to make concerning the software infrastructure that the Next Generation Internet must support in order to facilitate high-performance distributed computation.

-- Resource performance information should be accessible by all interested applications. Current and future internet technology supports performance monitoring in a variety of forms. Frequently, however, this information can only be accessed by privileged processes using ad hoc APIs. We advocate a standard interface that is open to processes executing with "standard" user privileges. Not only should this interface support user tools, but it must also support automatic programs, schedulers, and agents efficiently.

-- A protocol for managing and distributing resource performance information should be included in the infrastructure of the NGI. One of the great successes of the current internet infrastructure has been the Domain Name System (DNS) protocol and implementation. Effective distributed naming has made the proliferation of systems such as the World Wide Web possible. We believe that an analogous infrastructure is required for the management of resource performance information. Important to this new protocol is the ability for individual services (such as the Network Weather Service) to contribute and advertise information. Resource performance measurements and forecasts change dynamically, so an "active" database of information must be supported.

-- Network topology and routing information must be ubiquitously available. To a distributed application concerned with performance, the network is a resource to be scheduled. Knowledge of its architecture and topology as well as the state of its routing decision making system must be accessible.
Indeed, schedulers may wish to use source routing to control the way in which their client applications use the network, although the stability of the network as a whole must be carefully studied before such capability is provided.

-- Application-level performance monitoring must be supported. In addition to system-level monitoring, intrinsic performance monitoring and measurement facilities must reflect the performance deliverable to the application. Current facilities that use the ICMP ping protocol, for example, do not accurately reflect application-level network performance, particularly with respect to reliable streaming protocols.

For the Next Generation Internet to deliver next generation networking performance to the application, an NGI software infrastructure must be developed which permits applications to leverage the performance potential of the network hardware. Without this software infrastructure, applications may be able to execute no more efficiently than they can today, despite considerable hardware performance advances. If the potential performance of the NGI is to be realized, the focus must be on both the development of advanced networking technology and the design and development of a software infrastructure which delivers the performance potential of the hardware to the user.

[1] F. Berman, R. Wolski, S. Figueira, J. Schopf, and G. Shao, "Application Level Scheduling on Distributed Heterogeneous Networks", Proceedings of Supercomputing 1996, Nov. 1996.
[2] T. DeFanti, I. Foster, M. Papka, R. Stevens, and T. Kuhfuss, "Overview of the I-WAY: Wide Area Visual Supercomputing", to appear in the International Journal of Supercomputer Applications.
[3] "Distributed Object Computation Testbed", http://www.sdsc.edu/DOCT/QuadPage.html.
[4] A. Grimshaw, W. Wulf, J. French, A. Weaver, and P. Reynolds, "Legion: The Next Logical Step Toward a Nationwide Virtual Computer", University of Virginia tech. report CS-94-21, 1994.
[5] I. Foster and C. Kesselman, "Globus: A Metacomputing Infrastructure Toolkit", available at ftp://ftp.mcs.anl.gov/pub/nexus/reports/globus.ps.Z.
[6] G. Shao, R. Wolski, and F. Berman, "Modeling the Cost of Redistribution in Scheduling", Proceedings of SIAM conference on Parallel Processing, March 1997.
[7] R. Wolski, "Dynamically Forecasting Network Performance Using the Network Weather Service", U.C. San Diego tech. report TR-CS96-494, October 1996.