Modern Regression, Classification and Multivariate Exploration using R

This course formerly had the title 'Statistical Learning and Data Mining with R'.
 
Participants must bring their own laptop to this course.
 
Following a review of “traditional” regression and classification methods, the focus will move to predictive modeling using more recently developed regression and classification methods.  There will be attention to datasets where the number of explanatory variables and/or number of observations may be large.  A further focus will be exploratory analysis of multivariate data.  The course will use a “learning by doing” style of teaching.
 
“Modern regression” is a name for methodologies that fit curves and surfaces with automatic choice of smoothing parameter, or that fit a classification model with automatic choice of model complexity.  “Statistical Learning” is another commonly used name for such methodologies.  Where the conditions for their use are close enough to being satisfied, they can be highly effective.  The potential for over-fitting, as can happen with time series and other correlated data, will be noted.
 
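As a rough illustration of "automatic choice of smoothing parameter", the following minimal sketch fits a smooth curve with the mgcv package, one of the packages named for this course. The simulated data, the variable names (sim, grid) and the choice of method = "REML" are assumptions made for this example only; they are not taken from the course notes.

```r
# Minimal sketch: a smooth curve whose smoothing parameter is chosen from the data.
library(mgcv)  # provides gam() and s()

set.seed(1)    # make the simulated data reproducible
n <- 200
sim <- data.frame(x = sort(runif(n, 0, 10)))
# Hypothetical data: a smooth signal plus independent noise
sim$y <- sin(sim$x) + rnorm(n, sd = 0.3)

# s(x) requests a penalized regression spline; method = "REML" asks gam()
# to estimate the smoothing parameter rather than have the user supply it
fit <- gam(y ~ s(x), data = sim, method = "REML")

fit$sp        # the estimated smoothing parameter
sum(fit$edf)  # total effective degrees of freedom of the fitted model

# Fitted curve on an evenly spaced grid, e.g. for plotting
grid <- data.frame(x = seq(0, 10, length.out = 101))
grid$fitted <- predict(fit, newdata = grid)
```

The noise here is simulated as independent, which is the setting in which this kind of automatic choice is designed to work. With time series or other correlated data, the same call may select too wiggly a curve, which is the over-fitting risk noted above.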
Modern regression and related methodologies enhance and supplement, rather than replace, tools that are in a more conventional tradition of data analysis.  The course will aim to give a sense of where the new tools fit in this larger context.
 
This course is in part motivated by technological and methodological changes and demands that have gathered pace in the past decade.  These include:
• Synergies between huge increases in computational power and in computer storage, and advances in statistical and algorithmic methodology.  The R system is a product of such synergies.
• New data collection tools and new types of data, arising from advances in instrumentation, from the internet and from widespread deployment of databases.
 
What is R?
The R system is a free software environment for scientific and statistical computing and graphics that runs on all common computing platforms. An active and highly skilled developer community works on its development and improvement. It has become an environment of choice for the implementation of new methodology. At the same time, it is attracting wide attention from specialists in statistical application areas.  
 
The course will exploit R’s powerful and innovative graphics abilities. These include the provision of well-designed publication-quality plots that can include mathematical symbols and formulae. While the user has full control when required, careful default graphical design choices reduce the need for user intervention to a minimum.  
The first day will provide an overview of R, while giving a broad introduction to the course content for the remaining days.  There will be some limited use of the graphical user interface provided by the rattle package for R.  Most use of R will, however, be from the command line.
The R packages that will be important for this course include mgcv, randomForest, MASS, rattle, and possibly tm (for text mining). Extensive notes will be provided, with worked examples.  Further details can be found on the web page http://www.maths.anu.edu.au/~johnm/courses/dm-acspri.html
 

 
Level 4 - runs over 5 days
Instructor: 

Following a first in Mathematics at Auckland University and a variety of teaching and lecturing positions, John Maindonald settled down to working with other researchers as a quantitative problem solver. Until his move from New Zealand to Australia in 1996, much of his work was in plant, fruit, insect and other pest research, with industrial consulting as a sideline. He took up a position at The Australian National University (ANU) in 1998.  At ANU he has relished the stimulus of working with biologists (including molecular biologists), ecologists, epidemiologists, public health researchers, demographers, computer scientists, numerical analysts, machine learners, an economic historian, forensic linguists, and a lively group of statisticians. He is the author of a book on Statistical Computation.  He is the senior author of "Data Analysis and Graphics Using R". This example-based exposition of practical approaches to data analysis, now in its third edition, has sold more than 10,000 copies.  Now in semi-retirement, he does occasional consulting, and fronts workshops on the use of the open source R system for scientific and statistical applications and for graphics.

Course dates: Monday 26 September 2011 - Friday 30 September 2011
Course status: Course completed (no new applicants)
Week: 
Week 1
Recommended Background: 

Prior completion of “Fundamentals of Multiple Regression” or competence at an equivalent level is required.  Some familiarity with multivariate methods will be helpful. Participants should be comfortable using the command line and comfortable finding their way around the file structures of a PC environment.  MacOS X users who bring their own laptops should have the corresponding skills on that platform.
 
For further details of the course, and suggestions for preparatory reading and exercises, go to http://www.maths.anu.edu.au/~johnm/courses/dm-acspri.html.  The tutor will provide limited pre-course advice on computer setup, should this be necessary.
 

Recommended Texts: 

The following, with relevant sections substantially supplemented by course notes, will be used as the course text. Copies will be available at a discounted price from the Cambridge University Press Melbourne office. Contact the tutor for details.
 
Maindonald, J.H. and Braun, W.J. (2010).  Data Analysis and Graphics Using R.  An Example-Based Approach.  3rd edn, Cambridge University Press.  
 
The following are primarily for background reading and/or reference.  The course will not make direct use of them:
 
Ayres, I. (2007).  Super Crunchers.  Why Thinking-By-Numbers is the New Way to be Smart.  Bantam.  [This gives a useful overview of the new demands on data analysts, in part because of the new opportunities and challenges of the internet.]
 
Wood, S.N. (2006).  Generalized Additive Models: An Introduction with R.  Chapman and Hall/CRC.  [This is, effectively, a manual for the mgcv package for R.]
 
Hastie, T., Tibshirani, R. and Friedman, J. (2009).  The Elements of Statistical Learning.  Data Mining, Inference, and Prediction.  2nd edn, Springer.  [This is a comprehensive account of statistical learning approaches, albeit one that stays strictly within a theoretical framework of independent observations.]
 

Course fees
Member: 
$1,590
Non Member: 
$2,850
Full time student Member: 
$1,590