Please note that this course will not be held in a computer lab. Participants are required to bring their own laptop.
Data mining is a broad term for statistical and computational approaches to tackling new data analysis challenges. The main emphasis of this course will be on predictive modeling using modern regression and classification methods, with attention to datasets where the number of explanatory variables and/or the number of observations may be large. A further focus will be exploratory analysis of multivariate data.
Statistical learning is a name for "modern regression" type methodologies that fit curves and surfaces with automatic choice of smoothing parameter, or that fit a classification model with automatic choice of model complexity. Where the conditions for their use are close enough to being satisfied, they can be highly effective. The potential for over-fitting, as can happen with time series and other correlated data, will be noted.
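As a rough illustration of "automatic choice of smoothing parameter", the following minimal sketch uses the gam() function from the mgcv package. The data are simulated purely for this example, and the names simdat and fit are hypothetical; none of this is taken from the course notes.

```r
library(mgcv)  # provides gam() and the s() smooth constructor

# Simulated data (an assumption for illustration only):
# a sine curve observed with independent normal noise.
set.seed(42)
n <- 200
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
simdat <- data.frame(x = x, y = y)

# s(x) requests a penalized smooth of x. With method = "REML",
# gam() estimates the smoothing parameter from the data
# instead of requiring the user to fix it in advance.
fit <- gam(y ~ s(x), data = simdat, method = "REML")

fit$sp             # the estimated smoothing parameter
summary(fit)$edf   # effective degrees of freedom of the fitted smooth
plot(fit, residuals = TRUE)  # fitted curve with partial residuals
```

The simulated errors here are independent. With time series or other correlated data, the same automatic choice may settle on too wiggly a curve, which is the over-fitting risk noted above.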
The new methodologies supplement, rather than replace, tools that are in a more conventional tradition of data analysis. The course will aim to give a sense of where the new tools fit in this larger context.
What is R?
The R system is a free software environment for scientific and statistical computing and graphics that runs on all common computing platforms. An active and highly skilled developer community works on development and improvement. It has become an environment of choice for the implementation of new methodology. It is at the same time attracting wide attention from statistical application area specialists.
The course will exploit R’s powerful and innovative graphics abilities. These include the provision of well-designed publication-quality plots that can include mathematical symbols and formulae. While the user has full control when required, careful default graphical design choices reduce the need for user intervention to a minimum.
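A small sketch of a plot that includes mathematical symbols, using only base R graphics and its plotmath expression syntax, might look as follows. The choice of the standard normal density and the variable names are assumptions made for this example only.

```r
# Grid of x values and the standard normal density at each point.
x <- seq(-3, 3, length.out = 200)
dens <- dnorm(x)

# plotmath expressions: frac(), sqrt() and ^ are typeset as
# mathematics when the plot is drawn; == renders as an equals sign.
main_lab <- expression(paste("Standard normal density: ",
    phi(x) == frac(1, sqrt(2 * pi)) * e^{-x^2 / 2}))
y_lab <- expression(phi(x))

# Default design choices (axes, margins, tick marks) are left untouched.
plot(x, dens, type = "l", xlab = expression(x), ylab = y_lab,
     main = main_lab)
```

Running demo(plotmath) in an R session displays the full range of symbols that such expressions support.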
The first day will aim to provide an overview of R, while giving a broad introduction to the course content for the remaining days. There will be some limited use of the graphical user interface provided by the rattle package for R. Most use of R will, however, be from the command line.
The R packages that will be important for this course include mgcv, randomForest, MASS, rattle, and tm (for text mining). Extensive notes will be provided, with worked examples. Further details can be found on the web page http://www.maths.anu.edu.au/~johnm/courses/dm-acspri.html
Participants should be comfortable using the command line and comfortable finding their way around the file structures of a PC environment. MacOS X users who bring their own laptops should have the corresponding skills on that platform.
For further details of the course, and suggestions for preparatory reading and exercises, go to http://www.maths.anu.edu.au/~johnm/courses/dm-acspri.html. The tutor will provide limited pre-course advice on computer setup, should this be necessary.
Following a first in Mathematics at Auckland University and a variety of teaching and lecturing positions, John Maindonald settled down to working with other researchers as a quantitative problem solver. Until his move from New Zealand to Australia in 1996, much of his work was in plant, fruit, and insect and other pest research, with industrial consulting as a sideline. He took up a position at The Australian National University (ANU) in 1998. At ANU he has relished the stimulus of working with biologists (including molecular biologists), ecologists, epidemiologists, public health researchers, demographers, computer scientists, numerical analysts, machine learners, an economic historian, forensic linguists, and a lively group of statisticians. He is the author of a book on Statistical Computation. He is the senior author of "Data Analysis and Graphics Using R". This example-based exposition of practical approaches to data analysis, now into its third edition, has sold more than 10,000 copies. Now in semi-retirement, he does occasional consulting, and fronts workshops on the use of the open source R system for scientific and statistical applications and for graphics.
Prior completion of "Fundamentals of Statistics" or an equivalent tertiary course is absolutely necessary. Prior completion of an intermediate quantitative course such as "Fundamentals of Multiple Regression" is also strongly recommended. Some familiarity with multivariate methods will be helpful.
The following, with relevant sections substantially supplemented by course notes, will be used as a text.
Maindonald, J.H. and Braun, W.J. (2010). Data Analysis and Graphics Using R. An Example-Based Approach. 3rd edn, Cambridge University Press.
The following are primarily for background reading and/or reference. The course will not make any direct use of them:
Ayres, I. (2007). Super Crunchers: Why Thinking-By-Numbers is the New Way to be Smart. Bantam. [This gives a useful overview of the new demands on data analysts, in part because of the new opportunities and challenges of the internet.]
Wood, S.N. (2006). Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC. [This is, effectively, a manual for the mgcv package for R.]
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. [This is a comprehensive account of statistical learning approaches, albeit staying strictly within an independent observations theoretical framework.]