Project: Case-Based Reasoning: Rough Sets Versus Bayesian Networks
Student Researchers: Jennifer Allen, Christine Mick
Advisors: Ronald S. King, Kay Pleasant
Institution: University of Texas at Tyler
Goals and Purpose:
A comparative analysis will be conducted of rough sets against at least one other data mining approach, and preferably more than one, in case-based reasoning. Rough sets allow for data mining involving inconsistent and noisy data sets. Experiments will be carried out employing Bayesian networks and rough sets. Many rules produced by the rough set approach are expected to involve overfit, which appears to grow with the size of the training set.
Processes:
Two methods for data mining, suitable for utilization in case-based reasoning, were comparatively evaluated: Bayesian Networks and Rough Sets. Experiments were carried out on a Bayesian classifier using AutoClass and on a Rough Set tool for producing logic rules using Rosetta. Several benchmarks were employed, which were included in the original documentation for Rosetta and AutoClass. The Rough Set method produced many rules, and the researchers determined that overfit increases with the size of the training set. Considerable attention to human interpretation of the results had to be exercised when utilizing Bayesian classification. Rules generated by Rosetta were found useful for background knowledge in Case-Based Reasoning, and in making predictions for class membership. Criteria for Data Mining techniques included: noise level, consistency, prior knowledge, output, retractability, robustness, overfit/underfit, and whether or not the method evaluated was dynamic.
Several experiments were performed, each with a different data set. The data sets employed were obtained from the original documentation for Rosetta and AutoClass, which included the KDD standard IRIS database. Attributes were chosen that had some possible functional dependencies. Attributes were also chosen that would partition the values in a given attribute, in order to predict the failure type given values from the other attributes.
Jennifer and Christine conducted a background search on case-based reasoning. Then, while Jennifer researched rough sets, Christine researched Bayesian networks. Various resources were used in the research, including books, Internet sites, and articles. Findings include a core knowledge of case-based reasoning, rough sets, and Bayesian networks, and of various ways they are used in real-world applications.
After looking into the background of each topic, we downloaded software from the Internet that employed these techniques. Jennifer downloaded ROSETTA to work with rough sets, and Christine downloaded AutoClass C to work with Bayesian networks. After learning how to run each program and what it could do, we teamed up to discover how the two would work together on the same set of data.
Conclusions and Results:
Noise and Retractability
Rough Set Method: This method is capable of handling most types of noise. If the input data are missing an attribute that is dispensable, then the classification process is not affected. If the input data are missing important attributes, then the Rough Set method cannot classify the object, and the object is discarded.
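The notion of a dispensable attribute can be illustrated with a minimal Python sketch. It is not Rosetta's implementation: the toy table, the attribute names (temp, noise, colour, failure), and the helper functions are all hypothetical, and the test used is the textbook one, in which an attribute is dispensable if removing it leaves the set of consistently classifiable objects (the positive region) unchanged.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into indiscernibility classes over the given attributes."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def positive_region(rows, attrs, decision):
    """Indices of rows whose indiscernibility class maps to one decision value."""
    pos = set()
    for block in partition(rows, attrs):
        if len({rows[i][decision] for i in block}) == 1:
            pos.update(block)
    return pos

def is_dispensable(rows, attrs, attr, decision):
    """True if dropping `attr` leaves the positive region unchanged."""
    reduced = [a for a in attrs if a != attr]
    return positive_region(rows, reduced, decision) == positive_region(rows, attrs, decision)

# Hypothetical toy decision table (not one of the benchmark data sets).
rows = [
    {"temp": "high", "noise": "loud",  "colour": "red",  "failure": "bearing"},
    {"temp": "high", "noise": "quiet", "colour": "red",  "failure": "seal"},
    {"temp": "low",  "noise": "loud",  "colour": "blue", "failure": "none"},
    {"temp": "low",  "noise": "quiet", "colour": "red",  "failure": "none"},
]
attrs = ["temp", "noise", "colour"]

colour_dispensable = is_dispensable(rows, attrs, "colour", "failure")
noise_dispensable = is_dispensable(rows, attrs, "noise", "failure")
print(colour_dispensable, noise_dispensable)
```

In this toy table, dropping colour still separates every failure type, so an object with no colour value could still be classified. Dropping noise merges the first two rows, which have different failure types, so an object missing noise could not be classified reliably.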
Bayesian Method: This method is able to handle noise, but only as a normal attribute value. Like the Rough Set Method, it appears unable to handle values that are "not applicable." When using AutoClass, the user must be cautious in filling in values for missing and "not applicable" attribute values; otherwise unreliable results will be produced.
Rough Set Method / Bayesian Method: Since both methods are based upon logical and statistical foundations, it is possible to find out how the tools discovered the rules or classes. This demonstrates that both methods are retractable.
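The caution about missing values can be illustrated with a small Python sketch. It does not use AutoClass, which performs unsupervised Bayesian classification. As a stand-in, it uses a simple supervised naive Bayes classifier with add-one smoothing, in which a missing value (the hypothetical marker "?") is counted like any other attribute value. The data and all names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

MISSING = "?"  # hypothetical marker for a missing value

def fit(rows, labels):
    """Count classes and per-class attribute values; MISSING is counted like any value."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    domains = defaultdict(set)           # attribute index -> values seen in training
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
            domains[j].add(value)
    return class_counts, value_counts, domains

def predict(model, row):
    """Return the class with the highest smoothed log-posterior score."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        for j, value in enumerate(row):
            k = len(domains[j]) + 1  # one extra slot reserved for unseen values
            score += math.log((value_counts[(label, j)][value] + 1) / (n + k))
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical training data: (temperature, noise) -> class
train_rows = [
    ("high", "loud"), ("high", MISSING), ("low", MISSING),
    ("low", "quiet"), ("low", "quiet"), ("high", "quiet"),
]
train_labels = ["fault", "fault", "fault", "ok", "ok", "ok"]
model = fit(train_rows, train_labels)

with_missing = predict(model, ("low", MISSING))
with_value = predict(model, ("low", "quiet"))
print(with_missing, with_value)
```

Because the missing marker happens to occur only in the "fault" rows, the classifier treats it as evidence for "fault", even though a low temperature on its own favors "ok". This is the kind of unreliable result that careless filling-in of missing or "not applicable" values can produce.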
Output:
Rough Set Method: The output from Rosetta is rules used for: