Data mining often involves the production of accurate yet simple rules that describe records of interest within a database. For example, in a motor insurance database, the records of interest may be those motorists who have claimed on their insurance.

The rules produced are often conjunctions of simple clauses, where each clause applies to one field within the database. For example:

if age < 25 and car type = sports then claim = yes


is such a rule.
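As a rough illustration, the sketch below evaluates this rule against a handful of made-up records. The `Motorist` type, the sample data and the helper names are hypothetical. Confidence is taken here to be the fraction of matched records that belong to the class of interest, and coverage the fraction of the class of interest that the rule matches; these definitions are assumed from common usage in partial classification rather than stated on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Motorist:
    """Hypothetical record layout for the motor insurance example."""
    age: int
    car_type: str
    claim: bool


def rule_matches(m: Motorist) -> bool:
    """Antecedent of the example rule: age < 25 and car type = sports."""
    return m.age < 25 and m.car_type == "sports"


def confidence_and_coverage(records, antecedent, is_target):
    """Return (confidence, coverage) under the assumed definitions above."""
    matched = [r for r in records if antecedent(r)]
    targets = [r for r in records if is_target(r)]
    correct = [r for r in matched if is_target(r)]
    confidence = len(correct) / len(matched) if matched else 0.0
    coverage = len(correct) / len(targets) if targets else 0.0
    return confidence, coverage


# Invented sample data, for illustration only.
RECORDS = [
    Motorist(22, "sports", True),
    Motorist(23, "sports", True),
    Motorist(24, "sports", False),
    Motorist(40, "saloon", True),
    Motorist(35, "sports", True),
    Motorist(50, "saloon", False),
]

conf, cov = confidence_and_coverage(RECORDS, rule_matches, lambda m: m.claim)
print(f"confidence = {conf:.2f}, coverage = {cov:.2f}")
```

In this sample, two of the three motorists matched by the rule have claimed, so the confidence is 2/3. Two of the four claimants are matched, so the coverage is 1/2.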

Although this approach produces useful rules, these rules may describe only small subsets of the records of interest. To describe larger subsets accurately, collections of such rules are required.

de la Iglesia et al. [1] have investigated the application of multi-objective genetic algorithms to data mining, treating confidence and coverage as two separate objectives. This approach produces a collection of Pareto-optimal rules.
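To make the idea of Pareto-optimality concrete, the sketch below filters a small set of rules down to those not dominated on confidence and coverage. It is not the genetic algorithm of [1], only a brute-force filter showing what the resulting collection looks like. The rule names and scores are invented.

```python
def dominates(a, b):
    """True if rule a is at least as good as b on both objectives
    and strictly better on at least one."""
    _, conf_a, cov_a = a
    _, conf_b, cov_b = b
    return (conf_a >= conf_b and cov_a >= cov_b
            and (conf_a > conf_b or cov_a > cov_b))


def pareto_front(rules):
    """Keep the rules that no other rule dominates.

    rules: list of (name, confidence, coverage) tuples.
    """
    return [r for r in rules if not any(dominates(o, r) for o in rules)]


# Invented scores, for illustration only.
RULES = [
    ("r1", 0.90, 0.10),
    ("r2", 0.75, 0.30),
    ("r3", 0.60, 0.55),
    ("r4", 0.70, 0.25),  # dominated by r2
    ("r5", 0.55, 0.50),  # dominated by r3
]

front = pareto_front(RULES)
print([name for name, _, _ in front])
```

The rules that remain trade confidence against coverage: none of them can be improved on one objective without giving ground on the other.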

Richards and Rayward-Smith [5] have created software that produces all the rules that apply to the data and exceed given confidence and coverage thresholds. This all-rules algorithm can produce many thousands of rules. For example, on the 'adult' database, seeking rules with greater than 15% coverage and greater than 60% confidence yields 4785 rules. Examining these rules by eye immediately reveals that many are very similar. Others, while appearing different, may in fact match similar sets of records. To determine which rules match similar sets of records, we applied two clustering algorithms to the rules produced by the all-rules algorithm: k-medoids and Partitioning Around Medoids (PAM).
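The sketch below shows one way rules might be grouped by the records they match. It makes two assumptions that are not taken from this page: the dissimilarity between two rules is the Jaccard distance between their sets of matched record identifiers, and the clustering is the simple alternating form of k-medoids (assign each rule to its nearest medoid, then re-pick each cluster's medoid), not PAM's swap-based search. The rule names and record sets are invented.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def k_medoids(items, k, distance, max_iter=100):
    """Alternating k-medoids over a dict of name -> matched-record set.

    Returns a dict mapping each medoid name to the names in its cluster.
    Initialisation is deliberately naive: the first k names in sorted order.
    """
    names = sorted(items)
    medoids = names[:k]
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: each rule joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for n in names:
            if n in clusters:  # a medoid always stays in its own cluster
                clusters[n].append(n)
                continue
            nearest = min(medoids, key=lambda m: distance(items[n], items[m]))
            clusters[nearest].append(n)
        # Update step: the new medoid minimises total distance within its cluster.
        new_medoids = [
            min(members,
                key=lambda c: sum(distance(items[c], items[o]) for o in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return clusters


# Invented rules and the record ids each one matches.
MATCHED = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {2, 3, 4},
    "D": {10, 11, 12},
    "E": {10, 11, 13},
    "F": {11, 12},
}

clusters = k_medoids(MATCHED, k=2, distance=jaccard_distance)
for medoid, members in clusters.items():
    print(medoid, members)
```

With this data the rules separate into two groups, one matching records 1 to 5 and the other records 10 to 13, even though the initial medoids both come from the first group. Each medoid can then serve as a representative for the rules in its cluster.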

References

  1. de la Iglesia, B., Richards, G., Philpott, M. S. and Rayward-Smith, V. J. (2006) The application and effectiveness of a multi-objective metaheuristic algorithm for partial classification. European Journal of Operational Research, 169 (3). pp. 898-917. ISSN 0377-2217
  2. Reynolds, A.P. and Richards, G. and de la Iglesia, B., Clustering rules: a comparison of partitioning and hierarchical clustering algorithms, Journal of Mathematical Modelling and Algorithms, Vol 5, no 4, pp 475-504 (2006)
  3. Reynolds, A.P. and de la Iglesia, B., Rule Induction Using Multi-Objective Metaheuristics:, 2006 IEEE World Congress on Computational Intelligence and, pp. 6375-6382, (2006)
  4. Reynolds, A.P. and Richards, G. and Rayward-Smith, V.J., The application of k-medoids and PAM to the clustering of, IDEAL 2004 - Intelligent Data Engineering and Automated, volume LNCS 3177, Springer-Verlag, Exeter, pp. 173-178, (2004)
  5. Richards, G. and Rayward-Smith, V. J. (2005) The Discovery of Association Rules from Tabular Databases Comprising Nominal and Ordinal Attributes. Journal of Intelligent Data Analysis, 9 (3). pp. 289-307. ISSN 1088-467X

Research Team

Prof. Vic Rayward-Smith, Dr. Beatriz de la Iglesia, Dr. Graeme Richards, Dr. Alan Reynolds