Cancer classification based on microarray data is one of the classic applications of machine learning in computational biology. The aim is to identify biomarker genes, the expression of which is diagnostic of a particular form of cancer. In this project, we extended the sparse logistic regression algorithm of Shevade and Keerthi [1] to provide Bayesian regularisation, where the regularisation parameter controlling sparsity is integrated out analytically, using an uninformative Jeffreys prior (cf. [2]). This not only obviates the need for costly cross-validation based model selection procedures but also entirely eliminates the possibility of selection bias, a common pitfall in this application [3]. The results obtained using this approach are competitive with those obtained via cross-validation based regularisation and with the Relevance Vector Machine (RVM) [4,5] on the well-known colon cancer [7] and leukaemia [6] benchmark datasets. This page contains supplementary information for [8].
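The marginalisation mentioned above can be sketched as follows. This is an outline under stated assumptions, not a substitute for the derivation in the paper: the notation is chosen here for illustration, with $E_D$ the logistic data misfit, $E_\alpha = \sum_i |\alpha_i|$ the L1 penalty over the $N$ non-zero model coefficients $\alpha_i$, and $\lambda$ the regularisation parameter. It also assumes the usual interpretation of the L1 penalty as a Laplace prior over the coefficients.

```latex
\begin{align}
  M &= E_D + \lambda E_\alpha
    && \text{(penalised criterion of sparse logistic regression)} \\
  p(\alpha \mid \lambda) &= \left(\frac{\lambda}{2}\right)^{N} \exp\{-\lambda E_\alpha\}
    && \text{(Laplace prior corresponding to the L1 penalty)} \\
  p(\lambda) &\propto \frac{1}{\lambda}
    && \text{(improper Jeffreys prior)} \\
  p(\alpha) &= \int_0^\infty p(\alpha \mid \lambda)\, p(\lambda)\, d\lambda
    \;\propto\; \frac{1}{2^{N}} \int_0^\infty \lambda^{N-1} \exp\{-\lambda E_\alpha\}\, d\lambda
    = \frac{\Gamma(N)}{2^{N} E_\alpha^{N}} \\
  -\log p(\alpha) &= N \log E_\alpha + \text{const}
    \quad\Longrightarrow\quad
    Q = E_D + N \log E_\alpha
\end{align}
```

Under these assumptions the resulting criterion $Q$ no longer contains $\lambda$, which is why no cross-validation over the regularisation parameter is required.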
A MATLAB implementation of the BLogReg algorithm is made available for research purposes under the GNU General Public License (GPL). For efficiency, it is implemented as a C-language MEX file. If you download both blogreg.c and blogreg.m then, provided you have configured mex correctly, blogreg should automatically (and transparently) compile itself the first time it is executed.
- Bayesian logistic regression routine (blogreg.c)
- Online help file (blogreg.m)
- Minimal demonstration using Leukaemia dataset (demo.m)
- MAT file containing the Leukaemia dataset (leukaemia.mat)
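If the automatic compilation does not work, the MEX file can usually be built by hand. The following is a minimal sketch, assuming that blogreg.c and blogreg.m have been saved in the current MATLAB folder and that a supported C compiler is installed. It uses only the standard MATLAB `mex` and `help` commands and makes no assumption about the calling convention of blogreg itself, which is described in the online help.

```matlab
% Assumption: blogreg.c and blogreg.m are in the current folder.

% One-off step: select and verify the C compiler used by mex.
mex -setup

% Build the MEX file from the C source.
mex blogreg.c

% Display the online help, taken from the comments in blogreg.m.
help blogreg
```

Once the build succeeds, running demo.m with leukaemia.mat in the same folder is a quick way to check that the installation works.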
This work was supported by grants from the U.K. Biotechnology and Biological Sciences Research Council (BBSRC) under the Exploiting Genomics initiative (grant numbers 83/EGM16126 and 83/EGM16128, "Computational approaches to identifying gene regulatory systems in Arabidopsis"). N.B. the BLogReg algorithm was originally developed for use in the detection of transcription factor binding sites, see e.g. [9]; this work is currently in progress.
1. Shevade, S. K. and Keerthi, S. S., "A simple and efficient algorithm for gene selection using sparse logistic regression", Bioinformatics, vol. 19, no. 17, pp. 2246-2253, 2003.
2. Buntine, W. L. and Weigend, A. S., "Bayesian back propagation", Complex Systems, vol. 5, pp. 603-643, 1991.
3. Ambroise, C. and McLachlan, G. J., "Selection bias in gene extraction on the basis of microarray gene-expression data", PNAS, vol. 99, pp. 6562-6566, 2002.
4. Tipping, M. E., "Sparse Bayesian learning and the Relevance Vector Machine", Journal of Machine Learning Research, vol. 1, pp. 211-244, June 2001.
5. Tipping, M. E. and Faul, A. C., "Fast marginal likelihood maximisation for sparse Bayesian models", in C. M. Bishop and B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, Jan 3-6 2003.
6. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S., "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring", Science, vol. 286, no. 5439, pp. 531-537, 15 October 1999.
7. Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and Levine, A. J., "Broad patterns of gene expression revealed by clustering analysis of tumour and normal colon tissues probed by oligonucleotide arrays", PNAS, vol. 96, no. 12, pp. 6745-6750, June 1999.
8. Cawley, G. C. and Talbot, N. L. C., "Gene selection in cancer classification using Sparse Logistic Regression with Bayesian Regularisation", Bioinformatics, vol. 22, no. 19, pp. 2348-2355, October 2006.
9. Li, Y., Lee, K. K., Walsh, S., Smith, C., Hadingham, S., Sorefan, K., Cawley, G. C. and Bevan, M. W., "Establishing glucose- and ABA-regulated transcription networks in Arabidopsis by microarray analysis and promoter classification using a Relevance Vector Machine", Genome Research, vol. 16, no. 3, pp. 414-427, March 2006.