The data mining group within CMP has been involved in several case studies involving health-based data, which are detailed below. They are at various stages of development: some have been largely completed, some are currently being researched, some form the basis of bids for grant funding, and some are still at the developmental stage.


Case Study 1: Data Mining a Database of Clinical Records of Diabetic patients

It is estimated that 150 million people have diabetes worldwide, and that this number may double by the year 2025. In developed countries most people with diabetes will be aged 65 years or more; in developing countries most will be in the 45-64 year age bracket and affected in their most productive years. Estimates suggest that the number of people with diabetes in the UK will have doubled between 1995 and 2010, making it one of the fastest-growing threats to health in the UK today. Diabetes increases the risk of heart disease, stroke, amputation, kidney failure, blindness and early mortality.

There is no cure for diabetes. However, the condition can be managed, and early treatment can minimise the complications described. A key factor in providing early treatment is to identify those most at risk of complications at an early stage. The data mining group at UEA has been working in this area for some time on a collaborative project with St. Thomas' Hospital, London. The work to date has concerned identifying those patients most at risk of early mortality. Computerised clinical records on all diabetic patients referred to St. Thomas' Hospital, London since 1973 are stored in the Diabeta 3 clinical information system. At the time of this study there were data on over 21,000 patients collected over 27 years. Conventional hypothesis testing methods can be used to analyse large clinical databases such as Diabeta 3, but it is likely that such databases contain a wealth of 'hidden' information that may not be found using traditional techniques. It was therefore proposed that automated, non-hypothesis-driven data mining methods be used to search for patterns in the data. We used the DataLamp KDD software (developed by UEA) for rule discovery.

In this study we wished to identify factors that were associated with early mortality, i.e. we wanted rules with the conclusion "died young". The rules extracted showed clearly that those with peripheral neuropathy were most at risk of premature death. This result was unexpected and original: current research and teaching on outcome in people with diabetes identify cardiac risk factors as the most likely indicators of early mortality. The data mining study took place in parallel with an independent analysis of a cohort of 1,000 patients with diabetes re-examined after ten years. That analysis also identified peripheral neuropathy as the most important risk factor for premature death. The data mining study was limited to a small portion of the available data and identified only associations with early mortality. There is huge scope for further valuable research in this area, both by widening the objectives of the analysis and by using more of the data. Further databases on diabetes are available locally, but developing an integrated approach across the data sets may require a diabetes ontology.
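As a rough illustration of what a rule with the conclusion "died young" means in practice, the sketch below scores a single candidate rule by its support, confidence and lift. These are standard rule-quality measures; the source does not say which measures DataLamp uses, so this is an assumption. The field names (`neuropathy`, `died_young`) and the ten records are invented for the example and are not drawn from Diabeta 3.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    records is a list of dicts; antecedent and consequent are dicts of
    field -> required value. All names here are hypothetical.
    """
    def matches(record, conditions):
        return all(record.get(field) == value for field, value in conditions.items())

    total = len(records)
    covered = [r for r in records if matches(r, antecedent)]   # rule fires
    correct = [r for r in covered if matches(r, consequent)]   # rule fires and is right
    base = [r for r in records if matches(r, consequent)]      # conclusion holds at all

    support = len(correct) / total if total else 0.0
    confidence = len(correct) / len(covered) if covered else 0.0
    base_rate = len(base) / total if total else 0.0
    lift = confidence / base_rate if base_rate else 0.0
    return support, confidence, lift


# Invented toy data: 10 patients, not real clinical records.
records = (
    [{"neuropathy": True, "died_young": True}] * 3
    + [{"neuropathy": True, "died_young": False}] * 1
    + [{"neuropathy": False, "died_young": True}] * 1
    + [{"neuropathy": False, "died_young": False}] * 5
)

support, confidence, lift = rule_stats(
    records, {"neuropathy": True}, {"died_young": True}
)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 indicates that the antecedent is associated with the conclusion more often than the base rate alone would suggest. As in the study itself, this is an association and not evidence of causation.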

Case Study 2: Osteoporosis

Osteoporosis is a disease that causes reduction of bone density and quality, leading to weakness of the skeleton and increased risk of fracture, particularly of the spine, wrist, hip, pelvis and upper arm. Osteoporosis and associated fractures represent an important cause of mortality and morbidity. Bone loss is gradual and shows no obvious symptoms or warning signs until the disease has advanced to its late stage.

Osteoporosis is a global problem: 1 in 3 women and at least 1 in 12 men will develop it during their lifetime. For these reasons, osteoporosis is often referred to as the "silent epidemic". The WHO has identified it as a priority health issue (along with other major non-communicable diseases). The costs to national healthcare systems from osteoporosis-related hospitalisation are staggering. In the UK, according to estimates made by the National Osteoporosis Society (see,

  • there are an estimated 3 million people in the UK suffering from osteoporosis
  • osteoporosis is responsible for nearly 200,000 fractures per year
  • osteoporosis costs the NHS and government over £1.5 billion each year.

Although some treatments exist, there is currently no cure for osteoporosis. It could, though, be effectively prevented. Early detection of bone loss is key to preventing suffering and the escalation of health care costs. However, screening facilities and qualified technical personnel remain inadequate in most countries. The UK has only about two DXA bone mass densitometers per million of population, and fewer than 10% of patients receive treatment. We have been conducting research on osteoporosis since 1997, with the aim of developing a methodology to identify the associated risk factors and to predict the likelihood of developing osteoporosis. The research has produced encouraging initial findings, which have been published in journals and at major conferences in both the medical and computing fields. Building on this track record, a new project, funded by EPSRC, began in October 2002 to continue the investigation. The aim of the project is to develop the methodology further so that it can be applied in a computer-aided artificial intelligence system that assists GPs and consultants in detecting osteoporosis as early as possible. In addition, the research will assess the performance of QUS and DXA in diagnosing osteoporosis, and variations between consultants in rheumatology.

Case Study 3: Gastrointestinal Endoscopy

Since clinical gastroenterological endoscopy began to develop in the late 1960s, there has been a very large increase in the number of procedures performed each year. It is estimated that in the UK over 1% of the adult population undergoes an upper gastrointestinal endoscopic procedure every year. Other commonly performed gastrointestinal procedures include flexible sigmoidoscopy, colonoscopy and ERCP (endoscopic retrograde cholangio-pancreatography). In total, probably about 1 million gastrointestinal endoscopic procedures are currently conducted in the UK each year. Once some form of colorectal cancer screening programme is introduced, the demand for endoscopic procedures is likely to increase even more dramatically.

The endoscopy unit at the Norfolk and Norwich University Hospital (NNUH) is the largest and busiest in East Anglia. Last year it conducted over 9250 procedures (6091 gastroscopies, 1693 colonoscopies, 1144 flexible sigmoidoscopies and 361 ERCPs). National prospective audits with 30-day morbidity and mortality figures for both upper and lower GI endoscopy have suggested that over 50% of the deaths and serious complications a) are cardiopulmonary and b) relate to the dose of sedation used. The evidence also suggests that elderly patients require only a fraction of the dosages of sedative and analgesic drugs that fit younger patients need, and that the effects of combining benzodiazepines and opioids are synergistic rather than additive. The most vulnerable group is sick elderly patients, for whom much evidence suggests that sedative dosage is dangerously high in many units. An initial data mining study focussed on the use of analgesics and discovered wide variation between clinicians.

Consensus documents have suggested that not all endoscopic procedures are appropriate or justified on clinical grounds. There is enormous potential in developing data mining techniques to explore existing endoscopy databases. The School of Computing Sciences at UEA has begun to work with clinicians on the NNUH Endoscopy Unit on ways of data mining the existing 'Endoscribe' system. A number of specific projects have been identified and others will be developed.

One of the key projects has involved text mining. A large amount of crucial clinical information is currently held as unstructured text reports, which are attached to other patient data in a number of legacy systems. Most of that information is at present unused in clinical research because textual data are difficult to analyse and tools for doing so are lacking. Some simple but important clinical questions are not being answered because it is not possible, without considerable effort, to query such data or present it in a meaningful way. Colonoscopy reports, for example, typically include the medications used and their dosages, findings (e.g. presence of polyps, diverticula etc.), a description of any difficulties in carrying out the procedure (e.g. looping in the colon), the patient's level of comfort, the disposition (e.g. follow-up colonoscopy), etc. Such data could hide some interesting information: relationships between patient age and presence of polyps, the influence of medications and dosages on procedure success and safety, findings on follow-up colonoscopy after polyp detection, etc.
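To make the idea of querying free-text reports concrete, the following minimal sketch pulls medication doses and a polyp mention out of a report using regular expressions. It is an illustration only: the report text, the assumed "name amount unit" dose format and the function names are all invented here, and real reports would need far more robust handling (negation such as "no polyps seen", abbreviations, misspellings).

```python
import re

# Assumed dose format: "<drug name> <number> <unit>", e.g. "midazolam 2 mg".
DOSE_PATTERN = re.compile(
    r"\b([A-Za-z]+)\s+(\d+(?:\.\d+)?)\s*(mg|mcg)\b", re.IGNORECASE
)
POLYP_PATTERN = re.compile(r"\bpolyps?\b", re.IGNORECASE)


def extract_medications(report):
    """Return a list of (drug, amount, unit) tuples found in the report text."""
    return [
        (name.lower(), float(amount), unit.lower())
        for name, amount, unit in DOSE_PATTERN.findall(report)
    ]


def mentions_polyp(report):
    """Crude check for a polyp mention; it does not handle negation."""
    return POLYP_PATTERN.search(report) is not None


# Invented example report, not taken from any clinical system.
report = (
    "Sedation: midazolam 2 mg and pethidine 25 mg given. "
    "Scope passed to caecum. Two small polyps seen in sigmoid colon. "
    "Some looping in the colon."
)

print(extract_medications(report))
print(mentions_polyp(report))
```

Once fields like these are extracted into a structured form, questions such as how dosage varies with patient age can be asked with ordinary database queries.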

Research on the classification of colonoscopy reports has led to a number of innovative algorithms that use clustering as a pre-processing step before document classification, producing highly accurate results. We have also tested the combination of clustering and classification in other domains, where it significantly increased classification accuracy.
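One simple way to combine clustering with classification is "cluster then label": cluster all reports without using labels, then name each cluster from a handful of labelled reports, and classify a new report by its nearest cluster. The sketch below illustrates that general idea only. It is not the project's algorithm, and the toy reports, seed choice and function names are assumptions made for the example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a document as word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Average a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def nearest(vector, centroids):
    """Index of the most similar centroid (first one wins ties)."""
    return max(range(len(centroids)), key=lambda k: cosine(vector, centroids[k]))


def kmeans(vectors, seeds, iterations=10):
    """Basic k-means with cosine similarity; seeds are indices of initial centroids."""
    centroids = [vectors[i] for i in seeds]
    assignment = []
    for _ in range(iterations):
        assignment = [nearest(v, centroids) for v in vectors]
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = mean_vector(members)
    return centroids, assignment


def label_clusters(assignment, labelled):
    """Give each cluster the majority label of its few labelled documents."""
    votes = {}
    for index, label in labelled.items():
        votes.setdefault(assignment[index], Counter())[label] += 1
    return {k: counts.most_common(1)[0][0] for k, counts in votes.items()}


def classify(text, centroids, cluster_labels):
    """Label a new document by its nearest cluster (None if that cluster is unlabelled)."""
    return cluster_labels.get(nearest(bag_of_words(text), centroids))


# Invented toy reports; only two of the six carry a label.
reports = [
    "sessile polyp removed from sigmoid colon",
    "small polyp removed from ascending colon",
    "two polyps snared and removed",
    "normal mucosa throughout no abnormality",
    "normal mucosa seen to caecum",
    "no abnormality seen normal examination",
]
labelled = {0: "polyp", 3: "normal"}

vectors = [bag_of_words(r) for r in reports]
centroids, assignment = kmeans(vectors, seeds=[0, 3])
cluster_labels = label_clusters(assignment, labelled)

print(classify("polyp removed from transverse colon", centroids, cluster_labels))
print(classify("normal mucosa no abnormality", centroids, cluster_labels))
```

The appeal of this design is that clustering uses every report, labelled or not, so only a few reports need to be labelled by hand. Its weakness is that it relies on the clusters lining up with the classes of interest.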

Efforts are now underway to secure funding to continue working on this very promising area.

People involved will be: Fathi H Saad, G. D. Bell, Alan Reynolds, Beatriz de la Iglesia


Publications from this project:

F. H. Saad, B. de la Iglesia, and G. D. Bell, Effect of Document Representation on the Performance of Medical Document Classification, Proceedings of the 2006 International Conference on Data Mining (DMIN-06), Las Vegas, USA, 2006.

F. H. Saad, B. de la Iglesia, and G. D. Bell, A Comparison of Two Document Clustering Approaches for Clustering Medical Documents, Proceedings of the 2006 International Conference on Data Mining (DMIN-06), Las Vegas, USA, 2006.

F. Saad, B. de la Iglesia and G. D. Bell. Comparison of Document Classification Techniques to Classify Medical Reports, W. K. Ng, M. Kitsuregawa and J. Li (Eds.): PAKDD 2006, Lecture Notes in Computer Science 3918, pp. 285-291.

F. H. Saad, G. D. Bell and B. de la Iglesia. Classification Techniques with Minimal Labelling Effort and Application to Medical Reports. Int. Journal of Data Mining and Bioinformatics, 2:3, 2008 (in press).

Case Study 4: Hip and Knee Replacement

Each year in the UK, many thousands of people undergo operations to have artificial joints fitted. Hip replacement is one of the most successful operations carried out in the National Health Service (up to 35,000 primary total hip replacement operations are carried out annually by the NHS in England). There is notable variation across the UK in the types of implants used, surgical techniques, postoperative surveillance and longer-term outcomes. However, there is a shortfall of auditable standards for the operation and associated care. Standards can only be set by the widespread collection of uniform data, centred on NHS Trusts and made available for regional and national audit – see

The lack of data was identified as a problem during the late 1990s as a result of an enquiry by the Royal College of Surgeons of England into poor-quality hip implants that had been given to hundreds of patients during the previous years – see An implant called the 3M Capital Hip was used in at least 4,700 patients in the UK between 1991 and 1997. After some hospitals noticed higher than expected failure rates with this implant, an investigation was ordered. The investigation was hindered by the fact that many hospitals did not have accurate data to identify patients who had received a Capital Hip, or information on what type of Capital Hip they had received, etc. The report produced a number of recommendations to avoid similar situations in the future. Most of the recommendations referred to the need to keep detailed information about the implants, any design changes introduced, the operations, and the progress of the patients after the operation.

As a result of this report, on 2 July 2001 Health Minister Lord Phillip Hunt announced that the government would establish a national hip registry to enable audit of performance at a national level. The establishment of a national hip register requires the collection and assembly of hip replacement operation data from different hospitals across the country, data from the manufacturers of implants, data concerning patient records, patient follow-up details, and other data sources such as microbiological data generated as a result of postoperative infections, etc.

These varied data sources contain images (x-rays form an important part of patient assessment), free text data and more structured data. In addition, this type of data is characterised by uncertainty in the form of missing, inaccurate or unreliable data. The language, codes and concepts used to describe hip replacement data may also vary from hospital to hospital, from practitioners to manufacturers, etc.

In order to analyse such varied data effectively we are currently developing a large-scale ontology for this area. This should aid development of tools for semantic directed knowledge discovery, which can complement more traditional data mining approaches. The tools should be effective in the presence of large amounts of uncertainty.

A system of this kind will deliver the following benefits:

  • The collection and analysis of data on hip and knee operations will conform to the recommendations established by the Royal College of Surgeons and will be in line with new government policy on a national hip register.
  • Analysis of this type of data should translate into reduced risk for patients and reduced costs for the NHS. The performance of implants, including any design modifications, can also be closely monitored. At present, published results for many hip implants offer little help to the surgeon wishing to make an informed choice, and most outcome research is short term, non-comparative and does not take into account case-mix or variations in the technique of the operating surgeon.
  • A study of this type will contribute to the development of robust data mining techniques for mixed data types (images, text data, etc).

Dr. B. de la Iglesia and Mr. K. Tucker (MB, FRCS) have been exploring ways of funding this activity.

Case Study 5: Applications in Oncology

The School of Medicine, Health Policy and Practice is at present funded by the European Union through the European Society for Therapeutic Radiotherapy and Oncology to study late effects of cancer treatment. Large databases for patients who have been treated in randomised clinical trials in Paris, UK and Sweden are available for analysis of late effects. At present there is no internationally agreed way of documenting late effects. The National Cancer Institute is proposing a new classification which will contain 537 items. This is not considered appropriate for everyday clinical use, and they are attempting to derive Level 1 and 2 measures of toxicity to be used routinely, although it is recognised that the Level 3 definitions being developed may still be appropriate for specific clinical trial situations. Using the databases available to us, we would like to see whether one or two particular measures of late toxicity might act as surrogates for the overall toxicity of the treatment. Other work on the same data sets is concerned with the determination of optimum follow-up times. These will be derived by studying the patterns of incidence and the timing of onset of different endpoints such as local disease control, development of metastases and development of complications.

The Department of Oncology at the Norfolk and Norwich University Hospital has a computerised system of note-keeping in which notes are predominantly stored as free text. In January the department will convert to a more structured database for collecting information. The existing computer records are available for analysis and for comparison with the more structured data collection that will be introduced. A new structure raises problems about the use of coding systems such as ICD 10, SNOMED and Read codes, since these do not give the level of clinical detail that a free-text description provides and that is needed in this area. Questions to be addressed by analysing the existing stored information would include determining the effect of co-morbidities such as diabetes and ischaemic heart disease on outcomes of cancer treatment. Treatments given in the new department will be computer-generated, controlled and verified. For the last ten years there has been a programme of incident registration, carried out on a manual basis. Work is needed to compare incident recording in a system that is more open to inspection by staff with the data generated in the new automated system.

Case Study 6: Preliminary Data Mining of the Norwegian (HUNT) Biobank

A number of projects are underway in the School of Information Systems relating to preliminary investigation of the Norwegian (HUNT) Biobank data sets. Biobanks are repositories of large samples of genetic and medical information and several countries have already established such repositories, with well-organised associated research facilities. The UK is about to establish a major Biobank, sampled from around 500,000 people across the country. The plan is to collect medical, genetic and lifestyle information from men and women aged 45-60. The health of these people will then be monitored over the next 10-20 years.

The HUNT Biobank in Norway consists of two major studies carried out in the region of Trondheim. HUNT1 was carried out in 1984-86 to establish the lifestyle and health history of 75,000 people. HUNT2 was undertaken in 1995-97 and collected lifestyle and medical information as well as blood samples from 65,000 people, including 45,000 from HUNT1. Although a range of statistical tests has been carried out on the data by other scientists, no data mining has yet been applied. Currently, four Masters students in the School have applied a range of preliminary data mining techniques to various samples from the HUNT1 and HUNT2 data sets. One intended outcome of these studies is the development of a more comprehensive data mining project based on the preliminary results. The School is thus establishing and growing links with international centres of medical data repositories. [It is to be noted that the visit to UEA by the scientists from HUNT was an outcome of the EDA Norway initiative in 2002. The recognition of the School as an international centre of excellence in data mining prompted the scientists to enquire whether data mining techniques could be applied to their data.]


  1. Reynolds, A.P., de la Iglesia, B. and Bell, G.D., To be or not to be sedated? The effect of age and gender, Gut: An International Journal of Gastroenterology and Hepatology, volume 54, pp. A1-A117, abstract 235, 2005
  2. Reynolds, A.P., de la Iglesia, B. and Bell, G.D., Monitoring colonoscopy success rates and detecting changes, Gut: An International Journal of Gastroenterology and Hepatology, volume 54, pp. A1-A117, abstract 235, 2005
  3. Sheikh, K., Reynolds, A.P. and de la Iglesia, B., Data mining techniques can be used to rapidly interrogate, Gut: An International Journal of Gastroenterology and Hepatology, volume 54, pp. A1-A117, abstract 287, 2005

Research Team

Prof Vic Rayward-Smith