On the Importance of a Component Architecture of Distributed Database Services for Internet Applications

Ratko Orlandic
Department of Math and Computer Science
Suffolk University
41 Temple Street
Boston, MA 02114
phone: 617-573-8016
fax: 617-573-8591
e-mail: ratko@zeus.clas.suffolk.edu

Abstract: If appropriate measures are not taken, distributed database technology can easily become a tool for some kind of "Internet cleansing", leading to a monopolization of an important segment of the Internet. To prevent this outcome, small independent software vendors of Internet services will need a reference model in the form of a new distributed database architecture that will facilitate integration of database technology in their applications.

Keywords: Internet applications, database systems, distributed systems.

Even a superficial examination of contemporary Internet services will reveal that at the core of many of their problems lies inadequate database support. One can observe inadequate security and recovery to protect the data, a low degree of concurrency control, inadequate data-distribution support, lack of replication and client caching, problems with fault tolerance and scalability with respect to the data-set size, etc. Fortunately, the problems are neither new nor intractable. They have been studied by the database community for quite some time, and the appropriate solutions have been (or are being) incorporated in commercial database management systems (DBMSs). As the community of Internet users matures, it will undoubtedly demand the same from the vendors of Internet services.

But what will the numerous small independent software vendors (ISVs) that develop Internet applications be able to offer in response? Considering the present state of affairs, the ISVs have a choice between: (1) developing or improving their own database engines, or (2) adopting a foreign DBMS which already provides the required features.
The first option is desirable, but raises immediate concerns. In particular, the complexity of building robust, distributed database technology has reached such proportions that fewer and fewer vendors will be willing to attempt the task. The second option is not ideal either. Today's distributed client/server DBMSs are huge, complex, and expensive. If an Internet product adopts one, the resulting increase in its price/performance ratio may significantly diminish its ability to compete. Compounding the problem, contemporary DBMSs are essentially monolithic systems (except perhaps for some add-on features, e.g. replication) that give applications a choice between taking all or nothing. This is hardly useful for Internet applications, whose database needs are often radically different: some require only a fast and reliable navigational database engine, others require higher degrees of automation, but nobody wants to pay for things that are not used.

In either case, as soon as the problems of inadequate database support become widely recognized and the demands for appropriate solutions more vocal, small ISVs of Internet services that require database support may find themselves in a no-win situation. Under this scenario, many of them will be driven out of the Internet and new players will have difficulty entering the arena. The lucky few that survive will dominate the enormous potential of the Internet in one of its important segments. The reason? The same reason that, today, a large portion of the wealth of the database market is being harvested by a few database powerhouses: the complexity of developing distributed database technology is overwhelming, and it is only going to get worse.

Can anything be done to prevent this gloomy scenario? In search of an answer, observe first that the area of networking, whose problems are perhaps just as complex as those of database management, does not have its markets dominated by a few giants.
While there may be several reasons for this, one is perhaps the most important. For quite some time, the networking community has been blessed with the availability of standard component architectures (e.g., ISO OSI and TCP/IP) which have successfully decomposed the enormous problem of networking into a set of smaller subproblems. As a result, the development of networking infrastructure has traditionally been a distributed effort of many contributors, large and small alike. In contrast, the database standards (e.g., SQL and ODMG) are not really architectures, much less componentized ones, but rather languages for interaction with database systems. Their unfortunate feature is that they bundle the entire database functionality under a single level of data abstraction, giving no clue as to how the development of database technology can be decomposed.

Now, imagine that the ISVs of Internet applications have an agreed-upon hierarchical and modular architecture they could use as a reference model for building distributed database technology. Then, they could cooperatively fill in individual "pieces of the puzzle", without tackling the entire problem of distributed database management. The decomposition of the problem into a number of more manageable subproblems would allow the vendors to concentrate on optimal implementation of individual modules. Alternatively (or in addition), they could choose, from the components offered by other vendors, only those they actually need. Consequently, they could effectively satisfy their database needs with a mix of relatively inexpensive measures. Another side effect is that small ISVs could again make more significant inroads into the database market itself.

Unfortunately, although many areas of software development have embraced the general idea of published component architectures (e.g., the OMG OMA and OSF DCE standards for distributed computing), it has not yet been widely accepted by the database community. The few proposals that address the need for DBMS componentization and the related problem of database extensibility do not go nearly as far as necessary.

The design principles behind such an architecture must necessarily resemble those applied in the successful communication architectures ISO OSI and TCP/IP. It must be an open, hierarchical, multi-layered architecture in which each level provides a well-defined set of related functions and consists of potentially many independent modules. Lower levels should provide services to higher ones, and the concepts and mechanisms defined at higher levels should be implemented using the lower-level ones. Each level should provide a distinct way of viewing and manipulating data (effectively, a different data model). To provide such an architecture, one must: define individual levels of data abstraction; define the functions and interfaces of the layers; provide a programmable object-oriented interface for the entire system; and prescribe a way of integrating individual system components produced by different vendors.

Unfortunately, to do that, one must also address a number of unique problems. The bulk of the design issues concern the subdivision of tasks between the individual levels in order to increase sharing, reduce redundancies across the system, and satisfy a number of other objectives. Since multiple data models must co-exist under the same umbrella, how can we make sure that the architecture appears as a unified, coherent system?
Another likely consequence of the modeling polymorphism of the layered architecture is that the process of database design could be much more complex than in homogeneous modeling environments (e.g., those that employ purely relational or purely object-oriented systems). Thus, providing appropriate guidelines for logical database design must also be a subject of future research.

Many system-integration concerns will also demand some degree of specificity in the architecture itself. Otherwise, the implementors of individual modules could make assumptions that could easily result in incompatibility of modules produced by different vendors. For example, the assumed operational architecture (i.e., whether the server is envisioned to operate as multiple single-threaded processes, as a single multi-threaded process, or as many multi-threaded processes) can influence the layout of structures, access to shared objects in main memory, forms of communication and synchronization between different operational entities of the system, maintenance of and access to the user's context, etc. But how can the inter-dependencies and the information flow between different levels and components be reduced? Other open issues concern object identity, data-dictionary organization, query languages appropriate at each level, a common API, etc. All of these and related questions are topics for serious investigation that demand appropriate answers.
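
To make the layering principle concrete, the following is a minimal sketch in Python, not a proposed standard. It assumes three hypothetical levels (an untyped key/value storage level, a navigational record level, and a simple declarative query level); all class and method names are illustrative assumptions and do not come from any existing architecture or product. Each level is implemented only in terms of the level directly below it, and each exposes a different view of the same data.

```python
import json
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Optional


class StorageLayer(ABC):
    """Lowest level (hypothetical): an untyped key -> bytes store."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None:
        """Store a value under a key."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]:
        """Return the value stored under a key, or None if absent."""

    @abstractmethod
    def keys(self, prefix: str) -> Iterator[str]:
        """Iterate over the stored keys that start with the prefix."""


class InMemoryStorage(StorageLayer):
    """One interchangeable module; another vendor could supply a disk-based one."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def keys(self, prefix: str) -> Iterator[str]:
        return iter(sorted(k for k in self._data if k.startswith(prefix)))


class RecordLayer:
    """Middle level: navigational access to records, built only on StorageLayer."""

    def __init__(self, storage: StorageLayer) -> None:
        self._storage = storage

    def insert(self, collection: str, record_id: str, fields: Dict[str, str]) -> None:
        payload = json.dumps(fields).encode("utf-8")
        self._storage.put(f"{collection}/{record_id}", payload)

    def fetch(self, collection: str, record_id: str) -> Optional[Dict[str, str]]:
        raw = self._storage.get(f"{collection}/{record_id}")
        return None if raw is None else json.loads(raw.decode("utf-8"))

    def scan(self, collection: str) -> Iterator[Dict[str, str]]:
        for key in self._storage.keys(f"{collection}/"):
            raw = self._storage.get(key)
            if raw is not None:
                yield json.loads(raw.decode("utf-8"))


class QueryLayer:
    """Top level: declarative selection, built only on RecordLayer."""

    def __init__(self, records: RecordLayer) -> None:
        self._records = records

    def select(self, collection: str, **conditions: str) -> List[Dict[str, str]]:
        # Keep the records whose fields match every given condition.
        return [
            record
            for record in self._records.scan(collection)
            if all(record.get(f) == v for f, v in conditions.items())
        ]
```

In this sketch, an application that needs only a navigational engine would use `RecordLayer` directly and never load `QueryLayer`, and any module implementing `StorageLayer` could be substituted without changing the levels above it. The sketch deliberately ignores the open issues raised above, such as concurrency, distribution, object identity, and the data dictionary.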