Cover Page:

Information Based Computing Networks

Workshop on Research Directions for the Next Generation Internet
May 13-14, 1997, Vienna, VA

Reagan W. Moore
Director, Enabling Technologies
San Diego Supercomputer Center
P.O. Box 85608
San Diego, CA 92186-5608
moore@sdsc.edu
Telephone: (619) 534-5073
Fax: (619) 534-5152

-----------------------------------------------------------------

Paper:

Information Based Computing Networks

Workshop on Research Directions for the Next Generation Internet
May 13-14, 1997, Vienna, VA

Reagan W. Moore, Chaitanya Baru, Philip Bourne, Mark H. Ellisman*,
Sid Karin*, Arcot Rajasekar, Stephen J. Young*

San Diego Supercomputer Center
*University of California, San Diego
San Diego, CA

Introduction

The emergence of ubiquitous access to information is revolutionizing the conduct of science. Researchers are publishing scientific results on the Web, and providing Web-based access mechanisms for querying data repositories and applying analysis algorithms. The opportunity now exists to develop scalable information discovery systems that generalize these capabilities and enable analysis of terabyte-sized data collections.

Information Based Computing is the concept that researchers should be able to incorporate information discovery within their applications. Codes running on supercomputers should be able to access all available information sources, including scientific data generated by simulations, observational data, standard reference test case results, published scientific algorithms for analyzing data, published research literature, data collections, and domain-specific databases.

Rapid progress can be made by building upon existing trends. Supercomputer centers are evolving from supporting predominantly numerically intensive computing to also supporting data-intensive applications. Systems that can manage the movement of terabytes of data per day are in development.
The data handling environments are also evolving from syntactic access systems to semantic access systems. Instead of accessing data by a Unix path name or URL, information discovery systems are being developed that support access by attributes. At the same time, digital library technology is evolving to include the capability to analyze data in associated workspaces through application of published algorithms. Finally, the user interfaces to these systems are evolving into dynamic collaboration environments in which researchers simultaneously view and interact with data.

Architecture

The basic software infrastructure that needs to be integrated to build an Information Based Computing Environment includes:

* Metacomputer or persistent object computation environment that supports run-time execution of supercomputer applications.
* Information Discovery System that supports intelligent integration of information, metadata mining, semantic interoperability, shared ontologies, and improved data annotation.
* Digital library that supports publication, cataloging, and curation of scientific data sets.
* Data management system that provides the system-level metadata catalog for supporting interoperation between objects and resources within the metacomputer.
* Storage resource broker that provides a uniform access mechanism to heterogeneous data sources.
* Database repositories that support domain-specific data collections.
* Archival storage systems that provide permanent data repositories.
* High-speed network linking the data handling environment.

We have pursued this approach in an ongoing DARPA-funded project called MDAS (Massive Data Analysis System) at the San Diego Supercomputer Center. The central theme in MDAS is to provide a metadata-enabled computational framework that can interoperate with diverse resources, applications, methods, and datasets.
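
The combination of attribute-based access, a system-level metadata catalog, and a broker offering one access call over heterogeneous sources can be illustrated with a small sketch. This is not the MDAS or SRB interface: every class, method, attribute name, and data value below (CatalogEntry, MetadataCatalog, StorageBroker, the "modality" and "species" attributes, the fake backends) is a hypothetical stand-in chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CatalogEntry:
    """Hypothetical catalog record: where an object lives plus its attributes."""
    object_id: str
    resource: str   # name of the storage resource holding the data
    location: str   # resource-specific locator (path, archive key, ...)
    attributes: Dict[str, str] = field(default_factory=dict)


class MetadataCatalog:
    """Toy in-memory stand-in for a system-level metadata catalog."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.object_id] = entry

    def find(self, **wanted: str) -> List[CatalogEntry]:
        """Return entries whose attributes match every requested key/value."""
        return [
            e for e in self._entries.values()
            if all(e.attributes.get(k) == v for k, v in wanted.items())
        ]


class StorageBroker:
    """Toy broker: one retrieve() call, dispatched to per-resource readers."""

    def __init__(self, catalog: MetadataCatalog) -> None:
        self._catalog = catalog
        self._drivers: Dict[str, Callable[[str], bytes]] = {}

    def add_driver(self, resource: str, reader: Callable[[str], bytes]) -> None:
        self._drivers[resource] = reader

    def retrieve(self, **wanted: str) -> Dict[str, bytes]:
        """Fetch all objects matching the attributes, whatever their backend."""
        results: Dict[str, bytes] = {}
        for entry in self._catalog.find(**wanted):
            reader = self._drivers[entry.resource]
            results[entry.object_id] = reader(entry.location)
        return results


# Two fake backends standing in for a file system and an archival store.
fake_fs = {"/data/brain01.img": b"fs-bytes"}
fake_archive = {"tape:0042": b"archive-bytes", "tape:0043": b"pet-bytes"}

catalog = MetadataCatalog()
catalog.register(CatalogEntry("brain01", "filesystem", "/data/brain01.img",
                              {"modality": "MRI", "species": "mouse"}))
catalog.register(CatalogEntry("brain02", "archive", "tape:0042",
                              {"modality": "MRI", "species": "human"}))
catalog.register(CatalogEntry("brain03", "archive", "tape:0043",
                              {"modality": "PET", "species": "human"}))

broker = StorageBroker(catalog)
broker.add_driver("filesystem", fake_fs.__getitem__)
broker.add_driver("archive", fake_archive.__getitem__)

# The caller names attributes, not paths; the broker resolves the rest.
mri_objects = broker.retrieve(modality="MRI")
```

In this sketch the caller never sees a path name or the type of storage resource. The catalog resolves attributes to locations, and the broker selects the reader for each resource, which is the division of labor the architecture list assigns to the data management system and the storage resource broker.
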
MDAS has a comprehensive ontology for storing core system-level metadata that supports interoperation between resources and objects. Moreover, it provides the means to seamlessly integrate application-specific and resource-specific metadata, which can be used to deal with the peculiarities of individual systems.

Data management is facilitated by a Storage Resource Broker (SRB), which provides a uniform access mechanism to diverse and distributed data sources. The SRB provides the protocol conversion required to interface to heterogeneous data sources, and it interfaces with the MDAS metadata catalog to access metadata about individual files and objects. So far we have implemented stream I/O and object store/retrieve interfaces on file systems, archival systems, and object-relational databases.

Digital libraries provide the publication capability of the Information Based Computing Environment. SDSC has established collaborations for implementing local clones of the Alexandria Digital Library from UCSB and the UC Berkeley digital library. As part of this effort, we are integrating the digital library software with MDAS services to provide transparent, seamless access to databases, file systems, and archival systems.

Applications

The requirements for the Information Based Computing Environment can be further refined by examining the current and projected needs of the information-rich neurobiology and molecular biology communities. Each has characteristics that make it a suitable testbed for the emerging technology. Neurobiology is prototypical in that it has very large 3-D datasets with associated metadata that need to be visualized and processed.
Molecular biology, and the Protein Kinase Resource (PKR) in particular, are prototypical in that they represent a "digital continuum" of information, from primary experimental data, through derived data, to final publication data found in traditional text form and in a variety of structured databases.

Neuroscientists are accumulating detailed and diverse image data and functional maps of human and animal brains. The complexity represented by the myriad of brain interconnections, and the vast dimensional ranges involved, from single molecules to whole brains, represent a daunting challenge to the effective analysis of this information. 3-D data is now available from CAT, MRI, and PET imaging of large brain regions, the confocal light microscope, and electron microscope tomography. The aggregate size across data repositories at UCSD, UCLA, and WU is now 150 GB and is expected to reach multiple TB within 5 years.

New biological databases integrate access to research literature and scientific data sets. The Protein Kinase Resource (PKR) is such a database: it brings together data from diverse sources into a single representation with a well-defined data model and metadata description. Resources like the PKR prompt the development of true collaboratories that will be customizable for specific applications. These collaboratories will be the user interface to the underlying databases and compute engines. MICE is an example of a specific collaboratory for molecular structure (http://www.sdsc.edu/pb/vis/molexp.htm). Its information comes from a molecular gallery that defines molecular scenes so that they can be queried interactively, both with respect to the known function of the macromolecule and the details of the scene.

Acknowledgment

Many of the motivations for the concepts in this paper were contributed by colleagues, including Richard Frost, Michael Wan, John Helly, and Peter Arzberger. The research activities have been sponsored by NSF, DARPA, NIH, DOD, USPTO, and IBM.