DNA microarrays, shown in the right-hand image, measure the expression levels of several thousand genes on a single DNA chip. They can be used to classify cancer types, predict disease outcome and identify relevant genes. However, the high dimensionality and complexity of the resulting data (up to 20,000 genes but usually fewer than one hundred samples) overwhelm conventional data analysis methods, so machine learning and data mining techniques are needed for more sophisticated analysis.

Ensemble methods such as bagging and boosting, which produce a committee of classification models, can be more accurate than methods that produce a single model, and have been successfully applied to DNA microarray data.

Boosting algorithms iteratively apply a base learning algorithm to generate a series of models. The base learner can be any algorithm normally used for classification or prediction, provided it allows samples to be weighted. Boosting algorithms differ in how the sample weights are adjusted after each round and in how the votes of the committee members are set. These differences may lead to different overall performance on a given dataset.
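The boosting loop described above can be sketched in a few lines. The example below is a minimal illustration only, assuming the standard AdaBoost weight and vote rules with one-level decision trees (stumps) as the base learner; the function names and the toy data are hypothetical and are not taken from this project.

```python
import math

def stump_predict(stump, row):
    """A stump votes `polarity` when the feature value is <= threshold."""
    feature, threshold, polarity = stump
    return polarity if row[feature] <= threshold else -polarity

def fit_stump(X, y, w):
    """Base learner: return (weighted error, stump) with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(stump, row) != yi)
                if best is None or err < best[0]:
                    best = (err, stump)
    return best

def adaboost(X, y, rounds):
    """Build a committee of (vote, stump) pairs; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n                           # start with uniform sample weights
    committee = []
    for _ in range(rounds):
        err, stump = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err) # vote of this committee member
        committee.append((alpha, stump))
        # Raise the weight of misclassified samples, lower the rest, renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return committee

def predict(committee, row):
    """Weighted vote of all committee members."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in committee)
    return 1 if score >= 0 else -1

# Hypothetical toy data: one feature, labels that no single threshold can separate.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

single = adaboost(X, y, rounds=1)
committee = adaboost(X, y, rounds=3)
single_hits = sum(predict(single, row) == yi for row, yi in zip(X, y))
committee_hits = sum(predict(committee, row) == yi for row, yi in zip(X, y))
```

On this toy data a single stump misclassifies three of the ten samples, while the three-member committee fits all ten. Other boosting variants would change only the two marked rules: how `w` is updated and how `alpha` is set.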

In this study we investigate the strategies of feature non-replacement and variable depths of decision trees for enhancing boosting techniques when applied to the classification of microarray datasets. The left-hand image shows the framework of our proposed boosting ensemble system.  
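As a rough illustration of what feature non-replacement could look like, the sketch below assumes one plausible reading: once a feature has been used by a committee member, it is removed from the pool offered to later boosting rounds. This is an assumption for illustration, not a description of the system in the image. It uses AdaBoost-style weights with depth-1 trees only, so it does not cover variable tree depth, and all names and data are hypothetical.

```python
import math

def stump_predict(stump, row):
    """A stump votes `polarity` when the feature value is <= threshold."""
    feature, threshold, polarity = stump
    return polarity if row[feature] <= threshold else -polarity

def best_stump(X, y, w, pool):
    """Lowest weighted-error stump, restricted to features still in `pool`."""
    best = None
    for feature in sorted(pool):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(stump, row) != yi)
                if best is None or err < best[0]:
                    best = (err, stump)
    return best

def boost_without_replacement(X, y, rounds):
    """AdaBoost-style loop in which each feature can be used at most once."""
    n = len(X)
    w = [1.0 / n] * n
    pool = set(range(len(X[0])))                # features still available
    committee = []
    for _ in range(rounds):
        if not pool:                            # every feature has been used
            break
        err, stump = best_stump(X, y, w, pool)
        pool.discard(stump[0])                  # assumption: no replacement of used features
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        committee.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return committee

# Hypothetical toy data: six samples, three "genes", labels +1 / -1.
X = [[1, 5, 2], [2, 4, 1], [3, 6, 9], [4, 1, 8], [5, 2, 7], [6, 3, 3]]
y = [1, 1, 1, -1, -1, -1]

committee = boost_without_replacement(X, y, rounds=5)
used_features = [stump[0] for _, stump in committee]
```

Because each round must choose a feature no earlier round has used, the committee members are forced to draw on different genes, and the committee can never be larger than the number of features.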

Research Team

Dr. Wenjia Wang, Geoffrey R. Guile, Jamil Al Shaqsi, Richard Harrison