Introduction to Data Management and Information Systems

This course is designed for researchers and professional staff who may be well-versed in their chosen research areas but need assistance in developing strategies for managing their corresponding datasets. The course is aimed at people without significant IT skills, and covers both the skills needed to adequately describe the required information system and the tools needed to implement such a system.

Level 3 - runs over 5 days

Dr Mark Griffin is the Director of Insight Research Services Associated, and holds Adjunct appointments within the School of Public Health, University of Queensland and the Sydney Medical School, University of Sydney. Mark serves on the Executive Committee for the Statistical Society of Australia, and is Chair of their Section for Business Analytics. Mark also serves as the Asia-Pacific Regional Director for the International Institute of Business Analysis (IIBA), is Chair of their Business Analytics Special Interest Group, and is an IIBA Endorsed Education Provider. He is currently doing research with the Queensland Ambulance Service (QAS) analyzing their incident reports; the QAS attends approximately 700,000 incidents per year. To date he has presented over 90 two-day and 10 five-day workshops in statistics around Australia.

About this course: 

Advanced skills in data management and information systems go beyond simply an ability to conduct statistical analyses or write computer software. These skills require staff to be able to adequately describe the data system that best suits their current needs, which calls for an ability to think about the meaning of information, an ability that is not commonly taught (even within the scope of core courses in IT). During this workshop, participants will begin with an introduction to the design and conduct of surveys (the data generation process most commonly employed in social research). They will then learn a number of graphical approaches for describing how information is collected and transferred within an organization, and will apply these to simulated work scenarios (such as how to describe an information system for managing workshop enrolments). Participants will then learn how to implement this system using the MySQL database language, and will spend time implementing the database design for their chosen scenario. Finally, there will be supporting sessions in data linkage, regular expressions, data visualization, and accounting for missing data.


On the one hand, this course is specifically designed for participants with no previous experience in the tools and techniques of data management and information systems. On the other hand, some previous exposure will allow a participant to better grasp the more advanced topics within this workshop. Detailed notes with worked examples and references will be provided as a basis for both the lecture and hands-on computing aspects of the course. The computer sessions will involve the use of MySQL and R, and no prior exposure to either program is needed.


Course syllabus: 

Day 1 Introduction to Survey Design
In many cases the person responsible for the data management and information systems within a project will not be responsible for designing the original survey instrument or going into the field to collect data. However, it is important that the data manager understands these processes, both to understand the needs and requirements of the survey manager and to understand the data in the form in which it first reaches the data manager. On this day we will explore a number of elements including:

  • Study design and reporting (including setting goals and objectives, inclusion and exclusion criteria for participant selection, data management and data linkage, reporting styles including the CONSORT statement, and ethics including privacy and confidentiality)
  • Designing the research questions
  • Evaluating the chosen research questions
  • Methods of data collection (including online methods)
  • Who do we ideally want in our survey and who can we have in practice
  • Choosing a representative subset from a list of possible survey participants (survey sampling)
  • Processing the data prior to statistical analysis


Day 2 Describing a required database or information system
As described above, it is quite an art to capture all of the information needed to adequately describe a required database or information system. This can include information gathering techniques such as focus groups, interviews, or observational periods where a researcher can study how users may wish to interact with an information system during their day-to-day duties. Specific graphical tools that we will discuss include:

  • State space diagrams (for understanding how a system moves from one state to another)
  • Sequence diagrams (for understanding the sequence in which information moves from one stakeholder to another)
  • Attribute diagrams (for understanding how information can be divided into elements and attributes)

As well as learning the theory of these descriptive methods, participants will be divided into small workgroups. Each workgroup will be given a particular workplace scenario (such as designing an information system to capture enrolments in the ACSPRI workshop program). Each workgroup will have time to develop diagrams describing their particular context, and will then present those diagrams to the workshop class.


Day 3 Implementing a database using MySQL
In this session we will describe the theory of relational databases, and how to implement such a database using MySQL. Again, no prior experience with MySQL will be required for this workshop. After learning the theory of database design and MySQL, participants will return to review the database they designed on the previous day. They will then have the opportunity to work individually on implementing such a database using MySQL. This will involve both the implementation of the database and the population of the database using simulated data. During this day we will also briefly describe the role of HTTP, PHP and Java in implementing a web interface for such a database.
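As a rough illustration of what implementing and populating a relational database involves, here is a minimal sketch for the workshop enrolment scenario. It is not the course's own material: SQLite, driven from Python's standard library, stands in for MySQL so that the sketch runs without a database server, and every table name, column name and data row is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL here. All table and column names,
# and the simulated rows, are hypothetical and chosen for illustration only.
def build_enrolment_db():
    """Create three related tables in memory and fill them with simulated data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES constraints
    conn.executescript("""
        CREATE TABLE participant (
            participant_id INTEGER PRIMARY KEY,
            full_name      TEXT NOT NULL
        );
        CREATE TABLE workshop (
            workshop_id INTEGER PRIMARY KEY,
            title       TEXT NOT NULL
        );
        -- One row per (participant, workshop) pair: a many-to-many relationship.
        CREATE TABLE enrolment (
            participant_id INTEGER NOT NULL REFERENCES participant(participant_id),
            workshop_id    INTEGER NOT NULL REFERENCES workshop(workshop_id),
            PRIMARY KEY (participant_id, workshop_id)
        );
    """)
    conn.executemany("INSERT INTO participant VALUES (?, ?)",
                     [(1, "Alex Lee"), (2, "Sam Ng")])
    conn.executemany("INSERT INTO workshop VALUES (?, ?)",
                     [(10, "Data Management"), (11, "Survey Design")])
    conn.executemany("INSERT INTO enrolment VALUES (?, ?)",
                     [(1, 10), (2, 10), (2, 11)])
    conn.commit()
    return conn

def enrolments_per_workshop(conn):
    """Join workshops to enrolments and count participants in each workshop."""
    return conn.execute("""
        SELECT w.title, COUNT(e.participant_id)
        FROM workshop AS w
        LEFT JOIN enrolment AS e ON e.workshop_id = w.workshop_id
        GROUP BY w.workshop_id, w.title
        ORDER BY w.workshop_id
    """).fetchall()
```

Keeping enrolments in their own table, rather than repeating workshop details against each participant, is the kind of design decision that relational theory motivates: each fact is stored once and the join reassembles it on demand.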


Day 4 Data linkage and regular expressions in R
A number of research projects explore the relationship between different datasets, sometimes from different sources. For example, is there a relationship between a person’s workplace behaviour (as described in data from the Department of Employment) and their health status (using data from the Department of Health)? In simple cases this data linkage can be performed using deterministic techniques (e.g. where each dataset contains a unique ID number for each person, with a one-to-one matching between datasets). In many more complex, real-world scenarios this matching is not as trivial (e.g. where the matching is performed on a person’s name and more than one person can have the same name). In these cases probabilistic techniques are used that describe the probability that a record in one dataset should be matched with a given record in another dataset. During this day we shall discuss the techniques for deterministic and probabilistic data linkage.
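The difference between the two approaches can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the method taught in the course: it is written in Python instead of R, the field names (person_id, name, birth_year) are hypothetical, and the agreement probabilities m and u are invented values. The probabilistic part follows the general idea of Fellegi-Sunter style scoring, treating fields as independent.

```python
import math

def deterministic_link(left, right, key="person_id"):
    """Pair records that carry exactly the same unique ID (one-to-one matching)."""
    right_by_id = {rec[key]: rec for rec in right}
    return [(rec, right_by_id[rec[key]]) for rec in left if rec[key] in right_by_id]

# Hypothetical agreement probabilities for each compared field:
#   m = P(field agrees | the two records are the same person)
#   u = P(field agrees | the two records are different people)
M_U = {"name": (0.95, 0.01), "birth_year": (0.98, 0.05)}

def match_probability(a, b, prior=0.01):
    """Combine field agreements into a probability that a and b are a match.

    Starts from the prior odds of a match and multiplies in one likelihood
    ratio per field (added on the log scale), assuming fields are independent.
    """
    log_odds = math.log(prior / (1 - prior))
    for field, (m, u) in M_U.items():
        if a[field] == b[field]:
            log_odds += math.log(m / u)              # agreement raises the odds
        else:
            log_odds += math.log((1 - m) / (1 - u))  # disagreement lowers them
    return 1 / (1 + math.exp(-log_odds))
```

In practice a pair would be accepted as a link when its probability exceeds a chosen threshold, with borderline pairs set aside for manual review; the threshold itself is a study-specific choice.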
Regular expressions can be useful when the dataset to be explored consists of unstructured text. They are used for identifying, searching, and modifying patterns within this text.   
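As a small illustration of identifying, searching and modifying a pattern, the sketch below finds dates written as day/month/year in free text and rewrites them in year-month-day form. It is written in Python rather than R, and both the date pattern and the example note are hypothetical; the same idea carries over to R's regular expression functions.

```python
import re

# Hypothetical pattern: dates written as day/month/year, such as 3/11/2014.
DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def find_dates(text):
    """Search: return every date-like substring found in the text."""
    return [match.group(0) for match in DATE.finditer(text)]

def to_iso(text):
    """Modify: rewrite each d/m/yyyy date as yyyy-mm-dd, leaving other text alone."""
    def reorder(match):
        day, month, year = match.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    return DATE.sub(reorder, text)
```

For a note such as "Seen on 3/11/2014, reviewed 12/1/2015.", find_dates picks out the two dates and to_iso returns the same sentence with both dates rewritten. The pattern only checks the shape of a date, so a value such as 45/13/2014 would also be accepted; validating the numbers would need a further step.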


Day 5 Data visualization and missing data in R
In every study it is vital that the data are visualized. Visualization is useful for conveying the patterns found within a dataset to your readers. However, it is even more useful for researchers to visualize the data for their own benefit, in order to assess the validity of statistical analysis results before communicating those results to readers. As datasets increase in complexity it becomes more challenging, and yet also more important, to be able to visualize them. On the last day of this workshop we shall describe techniques for the visualization of such datasets.
On this day we will also describe the potential impact that missing data can have on a research study. We will also discuss the method of multiple imputation for addressing the presence of missing data.


Course format: 

This course may run in a computer lab, or you may be advised to bring your own laptop with specified software.

We will let you know in advance.


You are encouraged to bring a data set and/or research problem with you.

Recommended Background: 

This course is specifically designed for participants with no previous experience in the tools and techniques of data management and information systems, and no previous exposure to MySQL or R is assumed.

That said, some prior exposure to these tools and techniques may help a participant to better grasp the more advanced topics within this workshop.

Recommended Texts: 

The instructor's bound, book-length course notes will serve as the course text.

Other references include:

  • Survey Methodology, Second Edition (2009). Robert M. Groves and Floyd J. Fowler Jr.
  • A Web-Based Introduction to Programming: Essential Algorithms, Syntax, and Control Structures Using PHP, HTML, and MySQL, Third Edition (2014). Mike O'Kane
  • Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection (2012). Peter Christen