Making Vast Amounts of Data Available On-Line: The Use of Widely Deployed, High Performance Networks for Cataloguing, Caching, and Processing Application-Specific Data

William E. Johnston
Information and Computing Sciences Division
Lawrence Berkeley National Laboratory

The Motivation
--------------

Widely deployed high-speed networks can make (as we are starting to see now) large amounts of organized data - the precursor to information and knowledge - available on-line. Routine analysis of terabytes of data per day by widely dispersed data user communities is thus becoming increasingly urgent in order to realize the potential benefits of the data, and to make the most effective use of on-line instrument systems in almost every area of science, health care, engineering, the environment, etc.

Useful access to vast stores of data will require:

o Cataloguing in ways that make information about data (e.g. "coverage", currency, access methods, etc.) easy to locate and interpret;
o Network-based data caches and processing systems that can be configured on-demand to provide on-line access to data by distributed analysis systems;
o Generalized, distributed, transparent, and strong security mechanisms that allow easily specified and enforced data owner-imposed use conditions;
o New network capabilities that provide reliable and efficient distribution of data to many locations simultaneously, reserved network bandwidth, protection from denial of service attacks, etc.

The Issues
----------

Our experience in several high-speed metropolitan and wide-area testbeds (e.g., [MAGIC], [NTONC], and [BAGNet]) has involved all aspects of dealing with very large data stores, and high-speed, on-line data sources. This includes on-line collection, cataloguing, and high-performance application access to high-speed medical and scientific instrument data, as well as making large archival (near-line and off-line) data repositories available for on-line applications. (See, e.g., [Johnston97L].)
From this work a set of issues, and some approaches for dealing with these on-line data-intensive systems, has emerged:

o General-purpose, multi-disciplinary and multi-modal analysis applications require data from multiple sources to accomplish useful tasks. This data must be processed into application-specific forms and made available on-line.
o High quality data are maintained by specialists in a discipline. These data curators use their own specialized techniques and formats, and will provide or define access methods. For some applications, reformatting discipline-specific data suffices; other cases require complex transformations and refinement of the data.
o Discipline-specific data (e.g., medical and scientific imaging, environmental monitoring data, etc.) is collected from many sources, frequently continuously and over long periods of time. The resulting large volumes are typically kept in off-line, archival repositories at the responsible sites. A relatively small portion is kept on-line to satisfy specific requests.
o No combination of discipline-specific repositories will bring on-line all of the data needed by multi-datatype applications like MAGIC's terrain and environment visualization and navigation application [TerraVision], and on-line storage will never be adequate for all of the data that is needed by such an application.
o On-line catalogues must provide searchable, up-to-date information about discipline-specific data, and there must be some uniformity in how this metadata is made available on the network.
o The techniques used to bring large amounts of diverse discipline-specific data on-line (e.g. cataloguing and application access) cannot have a significant impact on the programs and curators that generate and manage the data: relatively non-intrusive mechanisms must make the data visible and available on the network.
o Network-based caches that can be configured on demand to provide on-line access to data by distributed analysis systems will be required.
o High-performance distributed applications for large-scale data analysis must be constructible from composable, high-performance modules, rather than being built and tuned from the bottom up every time a new variation is required.
o Performance monitoring at every level of the application, at every location of distributed processing and data handling components, and in the network itself, will be an essential aspect of widely distributed systems that can routinely deal with terabytes of data. (See [Tierney96].)
o Access control for datasets must enforce the use-conditions of the data owners. The access control must be easily available, administered, used, and enforced, and as distributed as the data, the data owners, and the applications.

Approaches
----------

Application use of large and interesting datasets will typically be the result of locating, transforming, and aggregating multiple discipline-specific data into application-ready datasets. Applications need to request data in their terms, have the underlying discipline-specific data located and processed (to produce application-specific forms), and then cached in the network on an on-demand basis so that all of the required application inputs are available simultaneously and/or so the application can support real-time human interaction.

If the conversion from discipline-specific data is expensive, or if application datasets are used frequently, the application-specific data may itself be catalogued and archived. Curating this application data archive may involve monitoring the underlying data and updating the affected application datasets when changes are detected.

The network can provide a dynamic environment that continually monitors network-based data, processing, and storage resources that can be used to produce application-specific datasets.
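
The on-demand caching and update behaviour described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the class names (Precursor, CachedDataset, ApplicationDatasetCache), the integer version field used to detect changes in the underlying data, and the in-memory dictionary standing in for a curator's repository are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Precursor:
    """A discipline-specific data element held by its curator (hypothetical)."""
    name: str
    version: int   # assumed change marker; a real system might use timestamps
    payload: str


@dataclass
class CachedDataset:
    """An application-ready dataset and the precursor versions it was built from."""
    data: str
    built_from: Dict[str, int]


class ApplicationDatasetCache:
    """Builds application datasets on demand and rebuilds them when sources change."""

    def __init__(self, repository: Dict[str, Precursor],
                 transform: Callable[[List[Precursor]], str]) -> None:
        self.repository = repository   # stands in for the curators' repositories
        self.transform = transform     # discipline-specific -> application-specific
        self.cache: Dict[Tuple[str, ...], CachedDataset] = {}
        self.builds = 0                # counts how often the transform was run

    def _is_current(self, entry: CachedDataset) -> bool:
        # The cached copy is valid only if every precursor is unchanged.
        return all(self.repository[name].version == version
                   for name, version in entry.built_from.items())

    def request(self, precursor_names: List[str]) -> str:
        key = tuple(sorted(precursor_names))
        entry = self.cache.get(key)
        if entry is None or not self._is_current(entry):
            parts = [self.repository[name] for name in key]
            entry = CachedDataset(self.transform(parts),
                                  {p.name: p.version for p in parts})
            self.cache[key] = entry
            self.builds += 1
        return entry.data


if __name__ == "__main__":
    repo = {
        "elevation": Precursor("elevation", 1, "dem-v1"),
        "imagery": Precursor("imagery", 1, "ortho-v1"),
    }
    cache = ApplicationDatasetCache(
        repo, lambda parts: "+".join(p.payload for p in parts))
    print(cache.request(["imagery", "elevation"]))   # built on demand
    print(cache.request(["elevation", "imagery"]))   # served from the cache
    repo["imagery"] = Precursor("imagery", 2, "ortho-v2")  # curator updates data
    print(cache.request(["imagery", "elevation"]))   # rebuilt after the change
```

In this sketch the check for changed precursors happens lazily at request time; a system that monitors the underlying data continuously, as the text suggests, could instead push updates into the cache when a change is detected.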
Cataloguing can proceed from a data management architecture that uses agents to continuously maintain information about the state of data in repositories (e.g. which repositories have what data elements and what data coverage, such as geographic areas). Brokers acting on behalf of users can locate the precursor data elements of an application dataset, and manage their processing into application datasets.

The network can also provide for on-demand construction of the required caching and processing systems, with agents maintaining information about the location and state of the available network-based processing and storage elements. Brokers acting on behalf of users negotiate for on-line storage and processing elements, and assemble these elements into distributed storage and processing systems. The same set of agents and brokers monitor the internal state of the distributed systems to provide the information needed to adapt to bottlenecks in the networks and in the systems. These dynamically assembled systems are then used to rapidly convert dataset precursors into application-ready datasets that are deposited in the network-based cache storage in order to make the datasets available to applications like TerraVision.

A cryptographic certificate-based use-condition approach, similar to that being developed for global financial enterprises, can provide the required distributed security. (E.g., [Johnston96S].)
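
As a rough illustration of the agent and broker roles in the cataloguing architecture described above, the following sketch has agents report which repository holds which data element and over what geographic coverage, and a broker query that catalogue for the precursors of an application dataset. All names here (CoverageRecord, Catalogue, Broker, the repository identifiers, and the bounding-box convention) are hypothetical; negotiation for storage and processing elements, performance monitoring, and security are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed coverage convention: (west, south, east, north) in degrees.
BBox = Tuple[float, float, float, float]


def overlaps(a: BBox, b: BBox) -> bool:
    """True if two bounding boxes share some interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


@dataclass(frozen=True)
class CoverageRecord:
    """What an agent reports about one data element in one repository."""
    repository: str
    data_element: str
    coverage: BBox


class Catalogue:
    """Searchable metadata kept up to date by repository agents."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], CoverageRecord] = {}

    def report(self, record: CoverageRecord) -> None:
        # Called by an agent; a newer report replaces the older one.
        self._records[(record.repository, record.data_element)] = record

    def find(self, data_element: str, area: BBox) -> List[CoverageRecord]:
        return [r for r in self._records.values()
                if r.data_element == data_element and overlaps(r.coverage, area)]


class Broker:
    """Acts for a user: locates the precursors needed for an application dataset."""

    def __init__(self, catalogue: Catalogue) -> None:
        self.catalogue = catalogue

    def locate_precursors(self, needed: List[str], area: BBox) -> Dict[str, List[str]]:
        return {element: sorted(r.repository
                                for r in self.catalogue.find(element, area))
                for element in needed}


if __name__ == "__main__":
    catalogue = Catalogue()
    catalogue.report(CoverageRecord("repo-a", "elevation", (-123.0, 37.0, -121.0, 39.0)))
    catalogue.report(CoverageRecord("repo-b", "imagery", (-122.5, 37.5, -122.0, 38.0)))
    catalogue.report(CoverageRecord("repo-c", "imagery", (-80.0, 25.0, -79.0, 26.0)))
    area_of_interest = (-122.4, 37.6, -122.1, 37.9)
    print(Broker(catalogue).locate_precursors(["elevation", "imagery"], area_of_interest))
```

Keeping the catalogue separate from the repositories reflects the requirement stated earlier that making data visible on the network should be relatively non-intrusive for the curators: in this sketch a repository only has to tolerate an agent that reports metadata about it.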
Various aspects of this approach are being explored in the MAGIC-II project [MAGIC] (where terrain imaging and environmental data are being organized as described above), in the LBNL-Kaiser on-line health care imaging system project [Thompson] (where 10s of Gigabytes/day are collected from an on-line medical imaging system, automatically catalogued, and made available via Web-based catalogues to the primary care physicians), and in the RHIC-STAR project [Greiman97H] (where 20-40 Megabyte/sec data streams from high energy and nuclear physics particle accelerators and detectors must be analyzed and catalogued by collaborators all over the country).

These testbed environments are providing the experience that generates the list of issues and capabilities noted above. All of these issues will have to be addressed in the Next Generation Internet in order to make vast quantities of useful data available on-line to increase the effectiveness of many scientific and societal activities.

References
----------

BAGNet
    "Bay Area Gigabit Testbed". See http://www-itg.lbl.gov/BAGNet.html

DPSS
    "The Distributed-Parallel Storage System (DPSS) Home Page". http://www-itg.lbl.gov/DPSS

Greiman97H
    "High-Speed Distributed Data Handling for HENP". W. Greiman, W. E. Johnston, C. McParland, D. Olson, B. Tierney, C. Tull, International Conference on Computing in High Energy Physics, Berlin, Germany, April, 1997. Also available at http://www-itg.lbl.gov/STAR

Johnston97L
    "Real-Time Digital Libraries based on Widely Distributed, High Performance Management of Large-Data-Objects". W. E. Johnston, et al. Draft for an invited submission to International Journal of Digital Libraries Special Issue on "Digital Libraries in Medicine". Available at http://www-itg.lbl.gov/WALDO

Johnston96S
    "Security Architectures for Large-Scale Remote Collaboratory Environments: A Use-Condition Centered Approach to Authenticated Global Capabilities". W. Johnston and C. Larsen. Draft at http://www-itg.lbl.gov/~johnston/Security.Arch.Global.Cap.html

MAGIC
    MAGIC (Multidimensional Applications and Gigabit Internetwork Consortium) is a high-speed, wide area, ATM network testbed. See http://www.magic.net. Also see "The MAGIC Project: From Vision to Reality". B. Fuller and I. Richer, IEEE Network, May, 1996, Vol. 10, no. 3.

NTONC
    The National Transparent Optical Network Consortium (NTONC) is a program of collaborative research, deployment and demonstration of an all-optical open testbed communications network in the San Francisco Bay Area. See http://www.ntonc.org

TerraVision
    "TerraVision: A Terrain Visualization System". Y. Leclerc and S. Lau, Jr. SRI International, Technical Note #540, Menlo Park, CA, 1994. Also see http://www.ai.sri.com/~magic/terravision.html

Thompson
    "Distributed Health Care Imaging Information Systems". M. Thompson, W. Johnston, G. Jin, J. Lee, B. Tierney, Lawrence Berkeley National Laboratory, and J. Terdiman, Kaiser Permanente, Division of Research. PACS Design and Evaluation: Engineering and Clinical Issues, SPIE Medical Imaging, 1997. Available at http://www-itg.lbl.gov/Kaiser.IMG

Tierney96
    "Performance Analysis in High-Speed Wide-Area ATM Networks: Top-to-Bottom End-to-End Monitoring". B. Tierney, W. Johnston, G. Hoo, J. Lee, IEEE Network, May, 1996, Vol. 10, no. 3. Also see http://www-itg.lbl.gov/DPSS/papers.html