Multivariate statistics gives researchers the ability to analyse complex data sets. It allows them to plot large sets of data, reduce the number of variables, predict and identify groups of inter-related variables, and detect natural groups of observations. The aim of the course is to give participants an understanding of multivariate analysis sufficient to determine the appropriate technique for a given problem, format data as required for analysis, run the analysis using the Stata statistical program, and interpret the results.

This course will cover the following multivariate techniques:

1) Multiple Regression: Multiple regression analysis is often used to model the relationship between a single interval-level dependent variable and several independent variables of varying types. This technique is often used in economics for prediction and forecasting (e.g. national economy), and in social research for evaluating what determines an effective program (e.g. the best predictors of success in high school), or determining which personality variable best predicts a social trait.

2) Logistic Regression: Logistic regression is used when there is a binary dependent variable and several independent variables of varying types. Logit analysis is used to predict the probability that the event recorded in the dependent variable occurs. This analysis is used widely in health research where the dependent variable is the outcome of a disease or health condition (e.g. lung cancer) or in social research where the outcome is a certain event (e.g. employment status).

3) Canonical correlation: Canonical correlation is used to investigate the relationship between two sets of variables. One set contains two or more dependent variables and the other set contains two or more independent variables. For example, it has been used in social research to investigate the relationship between a number of risk factors and a group of symptoms.

4) Discriminant analysis: Discriminant analysis is used to study the differences between two or more groups with respect to several variables simultaneously. It can be used to understand differences between groups so as to predict the likelihood that an individual belongs to a certain group. For example, it can be used to investigate which background variables discriminate between patients likely to recover fully, partially, or not at all.

5) Principal components analysis: Principal components analysis is an exploratory technique used to produce a smaller number of artificial variables (called principal components) that account for most of the variance in the original observed variables. It is also often used to uncover unknown trends in data. The principal components may then be used as predictor or criterion variables in subsequent analyses. For example, a large number of highly correlated measures of job satisfaction can be transformed into a smaller set of uncorrelated principal components that are then used for subsequent analysis (e.g. regression analysis).

6) Exploratory factor analysis: Exploratory factor analysis is used to obtain distinct new variables, or factors. Factor analysis looks at the interrelationships among a large number of variables and explains them in terms of their underlying factors or dimensions. This technique is often used in social science to measure a trait that cannot be measured directly (e.g. self-esteem).

7) Cluster analysis: Cluster analysis is an exploratory technique that uses a number of different algorithms and methods to combine observations into previously unknown mutually exclusive natural groups or clusters based on specific similarities. For example, social researchers have used this technique to produce unique groups based on socio-economic profiles.

8) Multidimensional scaling: Multidimensional scaling for two-way data is a dimension-reduction and visualization technique that looks at dissimilarities between observations based on certain characteristics. Measures of similarity and dissimilarity are used to produce graphs of relative positioning. For example, researchers have examined how close American universities are to each other, comparing private and public universities.

9) Correspondence analysis: Simple correspondence analysis provides graphical representations of two-way frequency tables to improve the researcher’s understanding of any similarities and associations between the row and column variables. It is therefore especially useful for the analysis of large contingency tables. For example, it could be used to investigate how the frequency of various crimes differs across states.

10) Survival analysis: Survival analysis deals with data in which the outcome is the waiting time until the occurrence of a well-defined event. Observations may be censored, in the sense that for some units the event of interest has not occurred at the time the data are analysed. Explanatory variables are used to model their effect on the waiting time. The point of survival analysis is to follow subjects over time and observe at which point in time they experience the event of interest (e.g. cancer). Survival analysis is often referred to as time-to-event analysis and has mainly been used in the biomedical sciences, where the interest is in observing time to death. However, over the past few years this analysis has been extended to other areas of research such as the social sciences (e.g. forensic analysis, employment analysis, marriage) and even engineering sciences (e.g. failure time analysis).
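
As a rough orientation, several of the techniques listed above correspond to built-in Stata commands. The do-file sketch below is illustrative only and is not taken from the course notes: it assumes the example datasets that ship with Stata (auto and cancer), and the variable choices, the number of clusters, and the names pc1, pc2 and grp are arbitrary.

```stata
* Illustrative sketch only: uses example datasets shipped with Stata.
* Variable choices are arbitrary and not taken from the course notes.
sysuse auto, clear

* 1) Multiple regression: interval outcome, predictors of mixed types
regress price mpg weight i.foreign

* 2) Logistic regression: binary outcome (foreign vs domestic car)
logit foreign mpg weight

* 5) Principal components, then store the first two component scores
pca price mpg weight length displacement
predict pc1 pc2, score

* 6) Exploratory factor analysis, retaining at most two factors
factor price mpg weight length displacement, factors(2)

* 7) Cluster analysis: k-means with three clusters, saved in variable grp
cluster kmeans mpg weight length, k(3) name(grp)

* 10) Survival analysis: declare time-to-event data, then fit a Cox model
sysuse cancer, clear
stset studytime, failure(died)
stcox age i.drug
```

Each command prints its own results table. With real data, the choice of variables and options would depend on the research question.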

Sample datasets will be provided, but participants are encouraged to bring some of their own Stata data for analysis. Teaching and practice will be closely integrated, and individual assistance will be provided as needed.

Participants should have completed an intermediate statistics course covering at least some of the syllabus of “Data Analysis Using Stata”. Stata will be available, and experience with Stata will be assumed (e.g. use of Stata’s do-files).

Course notes will be supplied.