Classification, Modern Regression and Multivariate Exploration Using R

This course formerly had the title 'Statistical Learning and Data Mining with R'.

Laptops are required for this course. We recommend bringing your own to the course. If this is not possible, please contact ACSPRI at least two weeks prior to the course.

Modern computing and communications technology is dramatically reshaping large aspects of social, business, government and research activity.  Organizations now commonly have vast new data resources that can be used to inform their decisions.  This course focuses on the practical use of some of the more important statistical and graphical tools that are available for analysis. 

The R system that will be used for this course has become the leading tool for statistics, data analysis, machine learning and statistical graphics. It is supported by an active community of thousands of developers and contributors, and more than 2 million users. It has become the environment of choice for the implementation of new techniques, with over 2000 modules (packages) – with more added every day – covering the methods of every discipline from anthropology to zoology. 

The course will use a “learning by doing” style of teaching.   Wherever possible, graphical displays will be used to help in the interpretation of results.  The focus will be on a small number of methods that have proved highly effective, and on issues that are important irrespective of the method used.

The course will start by reviewing “traditional” regression and classification methods, including relevant parts of the level 3 course “Data Analysis, Graphics and Visualization Using R”. The focus will then move to predictive modeling using more recently developed regression and classification methods. A further focus will be exploratory analysis of multivariate data. Techniques will be demonstrated for gaining insightful two- or three-dimensional views of data where there may be many variables. Graphs will be used extensively to display and give insight into results. R's powerful and innovative graphics include the provision of well-designed publication-quality plots that can include mathematical symbols and formulae.

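As a rough illustration of a two-dimensional view of data with several variables, the sketch below projects four measurements onto their first two principal components. It is not taken from the course notes: the built-in iris data and the choice of principal components analysis are assumptions made for the example, and the course may use other techniques.

```r
# Illustrative only: the built-in iris data stand in for course data.
measurements <- iris[, 1:4]                  # four numeric variables

# Principal components of the standardised variables
pca <- prcomp(measurements, scale. = TRUE)
scores <- pca$x[, 1:2]                       # coordinates on the first two components

# Proportion of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Two-dimensional view, with points coloured by species
plot(scores, col = as.integer(iris$Species), pch = 16,
     xlab = "First principal component",
     ylab = "Second principal component")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 16)
```

Inspecting var_explained indicates how much of the variation the two plotted dimensions retain.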

“Modern regression” is a name for methodologies that allow the automated fitting of curves and surfaces, or of a classification model. “Statistical Learning” is another commonly used name. Where the conditions for their use are close enough to being satisfied, these methods can be highly effective. Implications of failure of the independence assumption, as with time series and other correlated data, will be noted.

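A minimal sketch of automated curve fitting, using a smoothing spline from base R on simulated data. The data, the seed and the choice of smooth.spline() are assumptions made for illustration; the observations are generated independently, which is the condition the paragraph above warns may fail for time series.

```r
# Simulated data (an assumption for illustration): a sine curve plus
# independent noise.
set.seed(42)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)

# Smoothing spline; with default settings the amount of smoothing is
# chosen automatically by generalised cross-validation.
fit <- smooth.spline(x, y)
fitted_curve <- predict(fit, x)$y

plot(x, y, pch = 16, col = "grey50")
lines(fit, col = "blue", lwd = 2)            # fitted curve
lines(x, sin(x), lty = 2)                    # true curve, known only because the data are simulated
```

The analyst does not specify a functional form here; the degree of smoothness is estimated from the data.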

The methodology is also modern in its use of re-sampling methods – cross-validation, repeated simulation, and bootstrap sampling. It enhances and supplements, rather than replaces, methodology from a more conventional tradition of data analysis. The course will aim to give a sense of where the new tools fit in this larger context. The new tools have an important role in avoiding over-fitting and over-optimistic assessment of accuracy as a result of variable and/or model selection.

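The following sketch suggests how cross-validation can expose over-optimistic accuracy that results from variable selection. Everything in it is an assumption made for illustration: the data are pure noise, and the helper functions as_df() and fit_selected() are hypothetical names, not part of any package. Five predictors are chosen from 200 by their correlation with the response; the error measured on the data used for selection is compared with a 10-fold cross-validated error in which the selection is repeated within each fold.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility
n <- 60; p <- 200
x <- matrix(rnorm(n * p), nrow = n, ncol = p)  # pure-noise predictors
y <- rnorm(n)                                  # response unrelated to any predictor

# Hypothetical helper: matrix -> data frame with fixed column names
as_df <- function(m) {
  d <- as.data.frame(m)
  names(d) <- paste0("v", seq_len(ncol(d)))
  d
}

# Hypothetical helper: keep the k columns most correlated with y,
# then fit a linear model to those columns
fit_selected <- function(x, y, k = 5) {
  keep <- order(abs(cor(x, y)), decreasing = TRUE)[seq_len(k)]
  d <- as_df(x[, keep, drop = FALSE])
  d$y <- y
  list(keep = keep, model = lm(y ~ ., data = d))
}

# Error measured on the same data used for selection and fitting
full <- fit_selected(x, y)
mse_resub <- mean(residuals(full$model)^2)

# 10-fold cross-validation, with selection redone inside every fold
folds <- sample(rep(1:10, length.out = n))
sq_err <- numeric(n)
for (f in 1:10) {
  test <- folds == f
  fit <- fit_selected(x[!test, , drop = FALSE], y[!test])
  pred <- predict(fit$model,
                  newdata = as_df(x[test, fit$keep, drop = FALSE]))
  sq_err[test] <- (y[test] - pred)^2
}
mse_cv <- mean(sq_err)

c(resubstitution = mse_resub, cross_validated = mse_cv)
```

Because the predictors carry no information about the response, the cross-validated error should be the more honest of the two figures; the resubstitution error looks better only because the selected variables were picked to fit these particular data.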

Intending participants are encouraged to work through the introductory notes on the R system that are noted below. There will be some limited use of the graphical user interface provided by the R Commander package for R. Most use of R will, however, be from the command line, using the attractive RStudio integrated development environment to manage, organize and record work.

Notes will be provided.  The Maindonald and Braun text noted below covers a substantial part of the course content, and will be useful for supplementary reading.  Arrangements will be made for course participants to purchase this text at a discounted cost.

Data will be provided. Participants who can supply their own data in advance of the course will, if the data are suitable for the methods covered in the course, have the opportunity to analyse those data and discuss the output.

For information on relevant components of the R system, and on preparation for this course, go to:

http://maths-people.anu.edu.au/~johnm/r-courseprep.html

On the connection to data mining and “big data”, see:

http://maths-people.anu.edu.au/~johnm/dm-in-context.html

Level 4 - runs over 5 days
Instructor: 

Following a first in Mathematics at Auckland University and a variety of teaching and lecturing positions, John Maindonald settled down to working with other researchers as a quantitative problem solver. Until his move from New Zealand to Australia in 1996, much of his work was in plant and fruit research and in insect and other pest research, with industrial consulting as a sideline. He took up a position at The Australian National University (ANU) in 1998. At ANU he has relished the stimulus of working with biologists (including molecular biologists), ecologists, epidemiologists, public health researchers, demographers, computer scientists, numerical analysts, machine learners, an economic historian, forensic linguists, and a lively group of statisticians. He is the author of a book on Statistical Computation. He is the senior author of "Data Analysis and Graphics Using R". This example-based exposition of practical approaches to data analysis, now in its third edition, has sold more than 10,000 copies. Now in semi-retirement, he does occasional consulting, and fronts workshops on the use of the open source R system for scientific and statistical applications and for graphics.

Course dates: Monday 9 February 2015 - Friday 13 February 2015
Course status: Course completed (no new applicants)
Week: 
Week 3
Recommended Background: 

Knowledge and practical experience of regression methods, to the level of the course “Data Analysis, Graphics and Visualization Using R”, or of the course “Applied Multiple Regression”. Previous experience with R or with another computer language is very desirable. Participants who are in doubt regarding their level of preparation should contact the tutor.

Recommended Texts: 

Maindonald, J.H. and Braun, W.J. Data Analysis and Graphics Using R – An Example-Based Approach.  Cambridge University Press 2010.

Ayres, I. Super Crunchers: Why Thinking-By-Numbers is the New Way to be Smart. Bantam 2007.

Course fees
Member: 
$1,800
Non Member: 
$3,230
Full time student Member: 
$1,800
Program: 
Summer Program 2015