Machine learning ensemble methodology

In machine learning, an ensemble can be broadly defined as a system (as shown in the picture) built from a set of individual models that work in parallel and whose outputs are combined by a decision fusion strategy to produce a single answer for a given problem.
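
As a minimal sketch of this definition, the example below combines the outputs of several member models with majority voting, which is only one possible decision fusion strategy. The three threshold classifiers and the name `majority_vote` are hypothetical and serve purely as illustration.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

# A member model is any callable that maps an input vector to a label.
Model = Callable[[Sequence[float]], Hashable]


def majority_vote(models: Sequence[Model], x: Sequence[float]) -> Hashable:
    """Run every member model on x and return the most frequent label."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical member models (simple threshold rules, not trained).
members = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.0),
]

# The members vote 1, 0 and 1, so the fused answer is 1.
prediction = majority_vote(members, [0.9, 0.2])
print(prediction)
```

Other fusion strategies, such as weighted voting or averaging numeric outputs, would replace only the body of `majority_vote`; the member models stay unchanged.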

The models can be classifiers, predictors or filters, depending on the task (classification, prediction, regression or clustering) that the ensemble is designed to perform. The rationale behind the ensemble approach is the simple fact that no individual model can be developed perfectly for solving non-trivial real-world problems.

It is now common to use inductive machine learning algorithms to induce models, e.g. decision trees, neural nets or other models, from data quickly and at relatively low cost, and then to build ensembles from them. Depending on the mechanisms of its construction and operation, the performance of an ensemble can be evaluated in terms of complexity, reliability and accuracy.

The key assumption of the ensemble approach is that the models used to build an ensemble should be independent of each other, so that they do not make the same errors. However, studies have shown that this is not always the case: even models developed 'independently' of each other are still likely to fail dependently. For an ensemble to be more accurate, its member models, apart from having a certain level of accuracy, must be diverse enough to avoid failing on the same cases simultaneously. Nevertheless, a high level of diversity does not come easily. It is not enough to manipulate a few modelling parameters when generating candidate member models and hope the result will be beneficial, without really understanding how an ensemble's performance relates to the diversity and accuracy of its individual models.
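
The effect of diversity can be illustrated with a small hand-made example. The sketch below assumes binary labels, majority voting, and mean pairwise disagreement as the diversity measure; that measure is one common choice and not necessarily the one used in the study. All predictions are made up for illustration.

```python
from itertools import combinations


def accuracy(preds, truth):
    """Fraction of samples predicted correctly."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


def disagreement(preds_a, preds_b):
    """Fraction of samples on which two models give different outputs."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def mean_pairwise_disagreement(all_preds):
    """Average disagreement over all pairs of member models."""
    pairs = list(combinations(all_preds, 2))
    return sum(disagreement(a, b) for a, b in pairs) / len(pairs)


def vote(all_preds):
    """Majority vote for binary labels with an odd number of members."""
    return [int(2 * sum(col) > len(col)) for col in zip(*all_preds)]


truth = [1, 1, 1, 1, 0, 0, 0, 0]

# Each member is 75% accurate but wrong on different samples.
diverse = [
    [0, 0, 1, 1, 0, 0, 0, 0],  # wrong on samples 0 and 1
    [1, 1, 0, 0, 0, 0, 0, 0],  # wrong on samples 2 and 3
    [1, 1, 1, 1, 1, 1, 0, 0],  # wrong on samples 4 and 5
]

# Each member is 75% accurate but all are wrong on the same samples.
correlated = [[0, 0, 1, 1, 0, 0, 0, 0] for _ in range(3)]

for name, preds in [("diverse", diverse), ("correlated", correlated)]:
    print(name,
          "diversity:", mean_pairwise_disagreement(preds),
          "ensemble accuracy:", accuracy(vote(preds), truth))
```

In this toy case the diverse members reach an ensemble accuracy of 1.0, while the correlated members stay at the 0.75 accuracy of a single model because they fail together.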

This study attempts to address some fundamental issues in the ensemble approach, identify the most relevant factors associated with the construction and operation of ensembles, and establish possible relationships between the accuracy of an ensemble and the identified factors, including the accuracy of individual models, their diversity and the decision fusion strategy.

Ensemble methods for feature selection

Identifying and quantifying the relevance of input features is particularly useful in data mining when dealing with real-world problems defined by high-dimensional data.

Conventional methods, such as statistics and correlation analysis, appear to be less effective because the data for such problems usually contains a high level of noise and the actual distributions of the attributes are unknown. This research aims to develop machine learning based methods to identify relevant input features and quantify their general and specific relevance, and then to select the relevant features for further modelling analyses, including classification, regression, prediction and clustering. We have so far developed two novel methods, neural-net clamping and decision tree path scoring, and applied them to some real-world problems, including identifying the risk factors for osteoporosis (see picture), achieving better results than the conventional methods.
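
The general idea behind a clamping-style relevance measure can be sketched as follows. This is an illustration under stated assumptions, not the published neural-net clamping algorithm: it assumes that a feature is "clamped" by fixing it to its mean over the data set, and that its relevance is the resulting drop in accuracy of an already trained model. The `predict` function is a hypothetical stand-in for a trained network, and the data are made up.

```python
from statistics import mean


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def clamping_relevance(predict, rows, labels):
    """Score each feature by the accuracy lost when it is fixed to its mean."""
    baseline = accuracy(predict, rows, labels)
    scores = []
    for j in range(len(rows[0])):
        clamp_value = mean(r[j] for r in rows)
        # Copy every row with feature j replaced by the clamp value.
        clamped = [r[:j] + [clamp_value] + r[j + 1:] for r in rows]
        scores.append(baseline - accuracy(predict, clamped, labels))
    return scores


# Hypothetical stand-in for a trained model: it relies on feature 0 only.
def predict(row):
    return int(row[0] > 0.5)


rows = [[0.1, 0.9], [0.2, 0.1], [0.8, 0.7], [0.9, 0.3]]
labels = [0, 0, 1, 1]

# Feature 0 loses accuracy when clamped; feature 1 is ignored by the model.
scores = clamping_relevance(predict, rows, labels)
print(scores)
```

A larger score suggests that the model depends more heavily on that feature, so features can be ranked by score and the lowest-ranked ones considered for removal.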

References

  1. Wang, W. Some fundamental issues in ensemble methods. Proceedings of the IEEE World Congress on Computational Intelligence (WCCI), International Joint Conference on Neural Networks (IJCNN08), pp. 2244-2251, Hong Kong, June 1-6, 2008.
  2. Harrison, R., Birchall, R., Mann, D., and Wang, W. Novel consensus approaches to the reliable ranking of features for seabed imagery classification. Int. J. Neural Syst. 22(6), Dec. 2012.
  3. Richards, G., and Wang, W. What influences the accuracy of decision tree ensembles? Journal of Intelligent Information Systems. Springer, 39: 627-650, 2012.
  4. Guile, G., and Wang, W. Factors affecting boosting ensemble performance. IEEE World Congress on Computational Intelligence (WCCI), International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain, July 18-23, 2010.

Research Team

Dr. Wenjia Wang, Geoffrey R. Guile, Richard Harrison, Alex Mace, Majed Marrash, Ghadah Aldehim, Tahani Alqurashi