Data Mining (CMPSMC24-B-SEM2)
- Unit Code: CMPSMC24
- School: School of Computing Sciences
- Credit Value: 20
- Tutor(s): Dr Beatriz De La Iglesia
Module specific:
- To obtain an overall view of the complex process of Knowledge Discovery (KDD) and Data Mining and understand the need for a methodical approach to KDD
- To explore and review tools and algorithms available to each stage of the KDD process
- To gain experience of using KDD software tools in medium to large databases
- To learn to evaluate the suitability of software tools in the context of different data analysis tasks
- To learn to combine data manipulation and analysis approaches in order to improve the quality of input data
- To present knowledge induced in a format suitable for the target audience and for a particular application
- To perform cost/benefit analysis of any discovered knowledge so that the outcome of a KDD project can be "sold" successfully
Transferable skills:
- To present findings to technical and non-technical audiences using appropriate methods in each case
- To gain further experience in IT skills by the use of different packages running on different operating systems and platforms
- To learn how to search for relevant reference material using all available sources of information, and particularly the Internet
- To practise problem solving using a methodical approach
On completion of this module students should have achieved the following skills:
Module specific:
- Understanding of the complex process of KDD and Data Mining and the need for a methodical approach to KDD.
- Critical evaluation of tools and algorithms available to each stage of the KDD process.
- Competence in using KDD software tools in medium to large databases.
- Competence in applying relevant techniques at each stage of the KDD process.
- Ability to evaluate the suitability of software tools in the context of different data analysis tasks.
- Competence in combining data manipulation and analysis approaches in order to improve the quality of input data.
- Ability to recognise problems in input data, such as outliers, missing data, unreliable data and differences in granularity, and to identify an adequate strategy for dealing with the problem data.
- Presentation of knowledge induced in a format suitable for the target audience and for the particular application.
- Ability to perform cost/benefit analysis of any discovered knowledge so that the outcome of a KDD project can be "sold" successfully.
Course notes will be distributed during lectures and also made available on the Blackboard site for this module.
Students will have to use the packages installed in the CMP labs for their practical work. Some of these packages (not all) are available for installation in students' own machines. Laboratory work will take place during timetabled laboratory periods. The coursework may require students to spend time working in the School's laboratories outside the timetabled hours.
Recommended reading:
Dunham, M.H. (2003) Data Mining: Introductory and Advanced Topics, Prentice Hall.
Other relevant textbooks:
- Pyle, D. (1999) Data Preparation for Data Mining, Morgan Kaufmann Publishers Inc., San Francisco
- Witten, I.H. and Frank, E. (2000) Data Mining, Morgan Kaufmann Publishers Inc., San Francisco
- Han, J. and Kamber, M. (2001) Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers Inc., San Francisco
- Tan, P.N., Steinbach, M. and Kumar, V. (2006) Introduction to Data Mining, Addison Wesley, Boston
The proceedings of the International Conference on Knowledge Discovery in Databases (1995-200) also contain a wealth of relevant papers.
Web-based material:
The kdnuggets.com page is probably one of the most informative and up-to-date pages for KDD.
This unit is delivered as a programme of lectures (22 hours) and laboratory classes (18 hours).
Total hours: 40
Lectures: 22 hours (with provisional weekly schedule)
- Introduction to KDD: concepts, definitions and applications
- The KDD Roadmap dissected
- Initial stages: data warehousing, data marts, OLAP
- Data cleansing: missing data, outlier handling, balancing, sampling
- Data pre-processing: feature subset selection, feature construction, discretisation, principal component analysis
- Data mining: clustering
- Data mining: classification using decision trees
- Data mining: classification using Neural Nets
- Data mining: partial classification and association rules
- Text mining
- Case studies.
Workshops: 0 hours
Laboratory classes: 18 hours (with provisional weekly schedule)
- Clementine tutorial
- Using Clementine - basic features
- Using Clementine - advanced features
- KnowledgeSeeker tutorial
- Using KnowledgeSeeker - basic features
- Using KnowledgeSeeker - advanced features
- DataLamp tutorial
- Using DataLamp: simple features
- Using DataLamp: advanced features.
Submission
Written coursework should be submitted by following the standard CMP practice. Students are advised to refer to the Guidelines and Hints on Written Work in CMP.
Deadlines
Coursework should be submitted before 23:59 on the deadline day. Paper copies can be submitted via the drop boxes in the LTS Hub up to 22:00, and there will be a 'late box' in the Library for submissions between 22:00 and midnight.
If coursework is handed in after the deadline day or an agreed extension:
| Work submitted | Marks deducted |
| On the day following the due date | 10 marks |
| On either the 2nd or 3rd day after the due date | 20 marks |
| On the 4th day after the due date and before the 20th day after the due date | All the marks the work merits if submitted on time (i.e. no marks awarded) |
| After 20 working days | Work will not be marked and a mark of zero will be entered |
All extension requests will be managed through the LTS Hub. A request for an extension to a deadline for the submission of work for assessment should be submitted by the student to the appropriate Learning and Teaching Service Hub, prior to the deadline, on a University Extension Request Form accompanied by appropriate evidence. Extension requests will be considered by the appropriate Learning and Teaching Service Manager in those instances where (a) acceptable extenuating circumstances exist and (b) the request is submitted before the deadline. All other cases will be considered by a Coursework Coordinator in CMP.
Plagiarism
Plagiarism is the copying or close paraphrasing of published or unpublished work, including the work of another student, without due acknowledgement. Plagiarism is regarded as a serious offence by the University, and all cases will be investigated. Possible consequences of plagiarism include deduction of marks and disciplinary action, as detailed by UEA's Policy on Plagiarism and Collusion.

