Introduction to Data Mining for Large and Complex Data Sets

This course is designed as an applied introduction to the field of data mining and will cover such topics as pattern recognition, data linkage, variable reduction, clustering, anomaly detection and visualization.

Level 3 - runs over 5 days

Dr Mark Griffin is the Director of ResearchStats, which is a Division of Insight Research Services Associated. ResearchStats provides training and consulting in statistics for academic audiences. Mark is also an Industry Fellow with the School of Business, University of Queensland, and has established and written training materials for several of their courses in Business Analytics. Mark serves on the Executive Committee for the Statistical Society of Australia, and is Founding Chair of their Section for Business Analytics. Mark is also the Founding Chair of the Business Analytics Special Interest Group within the International Institute of Business Analysis. To date he has presented over 100 two-day and 30 five-day workshops in statistics around Australia.

Course dates: Monday 1 February 2016 - Friday 5 February 2016
Course status: Course completed (no new applicants)
Week 2
About this course: 

Data mining covers a wide range of techniques useful for anyone wanting to explore within and between large, complex datasets. Data mining is a multi-disciplinary field involving methods from artificial intelligence, machine learning, statistics, and database systems. Within this course we will discuss pattern recognition techniques such as regression (for describing the relationship between variables), data linkage, variable reduction (merging similar variables into combined scores), clustering (grouping together observations with similar characteristics), anomaly detection (where the research question is to identify those observations different to the norm) and visualization. The course introduces participants to this range of techniques and assumes only a basic prior knowledge of statistics.


Detailed notes with worked examples and references will be provided as a basis for both the lecture and hands-on computing aspect of the course.


The target audience for this course is researchers who work with large, complex datasets and are asking initial questions about what to do with them.

Course syllabus: 

Day 1 - Overall themes and connections with other ACSPRI courses
On this first day we shall provide a foundation covering the breadth of data mining. We will explore some questions that motivate the use of data mining, describe the links between data mining and other courses taught within the ACSPRI program, and introduce a number of techniques described later in this workshop. Specific questions that we will explore include:

  • Where do these datasets come from – the advantages of administrative (existing) data versus primary data?
  • What is “big data” and does bigger always mean better?
  • What is the difference between statistics and machine learning?
  • What is the role of survey sampling when it comes to large datasets?
  • Pattern recognition and regression
  • Confidentiality and providing data to other users


Day 2 - Data linkage
A number of research projects explore the relationship between different datasets, sometimes from different sources. For example, is there a relationship between a person’s workplace behaviour (as described in data from the Department of Employment) and their health status (using data from the Department of Health)? In simple cases this data linkage can be performed using deterministic techniques (e.g. each dataset contains a unique ID number for each person, giving a one-to-one matching between datasets). In many more complex, real-world scenarios the matching is not as trivial (e.g. where it is performed on a person’s name and more than one person can have the same name). In these cases probabilistic techniques are used that describe the probability that a record in one dataset should be matched with a given record in another dataset. During this day we shall discuss the techniques for deterministic and probabilistic data linkage.
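
The contrast between the two kinds of linkage can be sketched in a few lines. The example below is an illustration only: it is written in Python rather than Stata, the records and field names are invented, and the agreement probabilities are assumed values rather than estimates. The weighting follows the general style of the Fellegi-Sunter approach, which the course description does not itself name.

```python
import math

# Hypothetical toy records; the field names are illustrative only.
employment = [
    {"id": 1, "name": "Ann Lee", "birth_year": 1980},
    {"id": 2, "name": "Bob Ray", "birth_year": 1975},
]
health = [
    {"id": 1, "name": "Ann Lee", "birth_year": 1980},
    {"id": 3, "name": "Bob Ray", "birth_year": 1990},
]


def deterministic_link(left, right, key="id"):
    """Pair records whose unique key is identical in both datasets."""
    index = {rec[key]: rec for rec in right}
    return [(rec, index[rec[key]]) for rec in left if rec[key] in index]


# Assumed probabilities per field (not estimated from data):
# m = P(field agrees | same person), u = P(field agrees | different people).
FIELD_PROBS = {"name": (0.95, 0.05), "birth_year": (0.98, 0.02)}


def match_weight(a, b, probs=FIELD_PROBS):
    """Sum of log2 likelihood ratios over the compared fields."""
    weight = 0.0
    for field, (m, u) in probs.items():
        if a[field] == b[field]:
            weight += math.log2(m / u)              # agreement adds evidence
        else:
            weight += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return weight


exact_pairs = deterministic_link(employment, health)
same_person = match_weight(employment[0], health[0])   # name and year agree
name_only = match_weight(employment[1], health[1])     # only the name agrees
```

Here the two "Bob Ray" records share a name but not a birth year, so their weight is negative, whereas the two "Ann Lee" records agree on both fields and receive a large positive weight. In practice a threshold on the weight decides which pairs are accepted as matches, which are rejected, and which are sent for clerical review.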


Day 3 - Dimension reduction and clustering
Dimension reduction techniques are useful when more than one measured variable is an indicator of a given underlying, latent trait. Consider for example a survey containing a number of questions relating to a person’s IQ, where we want to combine these into some measure of a person’s intelligence. Dimension reduction techniques are useful for obtaining these latent variables for subsequent data analysis stages, for data visualization where it is too complex to visualize each original, measured variable, and indeed for assessing whether a group of variables are all indicators of the same latent variable or whether there are a number of different latent variables giving rise to these measurements. Within this workshop we shall discuss dimension reduction techniques such as factor analysis.
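
As a rough illustration of merging several measured variables into one combined score, the sketch below uses the first principal component, a simpler relative of factor analysis that is swapped in here to keep the example short. It is written in Python rather than Stata, the item responses are invented, and the function names are hypothetical.

```python
import math


def standardise(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]


def first_component(columns, iterations=200):
    """Weights of the first principal component, found by power iteration."""
    z = [standardise(col) for col in columns]
    n, p = len(z[0]), len(z)
    # Correlation matrix of the standardised variables.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1)
             for j in range(p)] for i in range(p)]
    weights = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        new = [sum(corr[i][j] * weights[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in new))
        weights = [v / norm for v in new]
    return z, weights


def combined_scores(columns):
    """One score per observation: the weighted sum of its standardised items."""
    z, weights = first_component(columns)
    n = len(z[0])
    return [sum(weights[j] * z[j][i] for j in range(len(z))) for i in range(n)]


# Hypothetical answers of five respondents to three related test items.
items = [
    [2, 4, 6, 8, 10],
    [1, 3, 5, 8, 9],
    [3, 4, 7, 7, 11],
]
scores = combined_scores(items)
```

Each respondent ends up with a single score in place of three item responses. Whether one score is enough, or whether the items reflect several latent variables, is the kind of question factor analysis is designed to address.
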
Clustering techniques are useful for sorting observations into groups with similar characteristics. Consider a group of patients with different health characteristics, and the task of separating them into one set with similar symptoms, labelled “measles”, and a second set with similar symptoms, labelled “mumps”. Within this workshop we shall discuss clustering techniques such as k Nearest Neighbours and mixture modelling.
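
To give a flavour of mixture modelling, the following is a minimal sketch of a two-component Gaussian mixture fitted to a single variable with the EM algorithm. It is written in Python rather than Stata, the severity scores are invented, and the starting values (the smallest and largest observations) are an assumption made to keep the example deterministic.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_component_mixture(data, iterations=100, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture by EM and label each point."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    means = [min(data), max(data)]          # assumed starting values
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: probability that each observation belongs to each group.
        resp = []
        for x in data:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k])
                    for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each group's mean, variance and share.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, min_var)  # floor avoids a collapsed group
            weights[k] = nk / n
    # Assign each observation to its more probable group (last E-step).
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return means, variances, weights, labels


# Hypothetical symptom severity scores for ten patients.
severity = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, labels = fit_two_component_mixture(severity)
```

Unlike a hard assignment rule, the mixture model keeps a membership probability for every patient, which is useful when the groups overlap.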


Day 4 - Anomaly detection
In almost all statistical analyses it is useful at an early stage to identify whether there are any outliers in the dataset (observations that are different to the norm) and to utilize more complex statistical procedures if such outliers are discovered. Other research questions are specifically designed to seek out such observations and to treat these observations as the key findings in a study. Consider for example studies where we want to seek out and report participants with fraudulent behaviour or hospitals that are under-performing. During this day we shall describe the techniques for anomaly detection.
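
One simple way to screen for observations that differ from the norm is a robust z-score built from the median and the median absolute deviation (MAD). The sketch below is an illustration in Python rather than Stata; the waiting times are invented, and the constant 0.6745 and the cut-off of 3.5 are a common rule of thumb rather than anything prescribed by the course.

```python
import statistics


def robust_z_scores(values):
    """Modified z-scores based on the median and the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the robust z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]


def flag_anomalies(values, threshold=3.5):
    """Return the positions of observations whose |score| exceeds the threshold."""
    return [i for i, z in enumerate(robust_z_scores(values)) if abs(z) > threshold]


# Hypothetical average waiting times (minutes) at eight hospitals.
waiting_times = [42, 45, 39, 44, 41, 43, 95, 40]
flagged = flag_anomalies(waiting_times)
```

Because the median and MAD are barely affected by extreme values, the hospital with a waiting time of 95 minutes stands out clearly, whereas a mean and standard deviation would be pulled towards it.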


Day 5 - Visualization
In every study it is vital to visualize the data. This is useful for conveying the patterns found within a dataset to your readers. However, it is even more useful for researchers to visualize their data for their own benefit, in order to assess the validity of statistical analysis results before communicating those results to their readers. As datasets increase in complexity it becomes more challenging and yet also more important to be able to visualize them. On the last day of this workshop we shall describe techniques for visualization of these datasets.

Course format: 

This course will take place in a computer lab. All equipment will be supplied. You are encouraged to bring a data set and/or research problem with you.

Recommended Background: 

Participants must have completed an introductory course in statistics or have equivalent experience. While this workshop will be taught using Stata, it is not essential that participants have had prior exposure to Stata.

Recommended Texts: 

The instructor's bound, book-length course notes will serve as the course texts.

Other references include:

  • Data Mining: Practical Machine Learning Tools and Techniques, Third Edition (2011), Ian H. Witten, Eibe Frank, Mark A. Hall
  • Data Mining: Concepts and Techniques, Third Edition (2011), Jiawei Han, Micheline Kamber, Jian Pei


Course fees
Non Member: 
Full time student Member: 
Summer Program 2016

Supported by: 

Stata is distributed in Australia and New Zealand by Survey Design and Analysis Services.