Big Data Analysis for Social Scientists

This course introduces you to the collection and analysis of socially-generated 'big data' using the R statistical software and Gephi network visualisation software. The focus is on programmatic approaches for collecting and analysing big data from social media and the WWW. The course will also provide an opportunity for you to learn how these data and techniques are already being used in social science research.

Level 3 - runs over 5 days
About this course: 

Big data involves data on:

(1) people (social web) e.g. online social networks (e.g. Facebook), microblogs (e.g. Twitter);

(2) information (WWW) e.g. web pages, clickstreams;

(3) things (sensor web) e.g. phones, temperature sensors, and

(4) places (geospatial web) e.g. geology, land use maps.

This course is focused on collecting and analysing data from the social web and the WWW.


In this course, you will learn how to:

  • Collect data from Twitter, Reddit, YouTube and Web 1.0 websites. Who are the actors, and what actor attributes are available for them?
  • Construct, analyse and visualise networks of people and organisations (social networks) and terms (semantic networks). How can we find connections between actors, and how can we use social network analysis to understand the social scientific meaning of such connections?
  • Extract and analyse text data. What text can be attributed to these actors, and what does analysis of this text tell us about the actors and society as a whole?
  • Conduct temporal analysis. How can we study behaviour over time, identifying significant events or trends?
  • Identify and engage with advanced techniques for dealing with very large datasets, including software optimisation and sampling techniques.
  • Utilise social science to engage with and reason about the challenges and opportunities of big data, including the interpretation of findings and methodological considerations.
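
As a rough illustration of the first three steps above (collecting records, building a network, and counting terms), the following is a minimal sketch in Python using only the standard library. The course itself works in R with vosonSML and igraph, so this is not the course's own code. The records, field names and helper functions are hypothetical and were made up for illustration; they are not output from any real platform or API.

```python
from collections import Counter
import re

# Hypothetical, hand-made records standing in for collected social media posts.
# Each record has an author, the user being replied to (or None), and the text.
posts = [
    {"author": "ana", "reply_to": None, "text": "Big data needs careful sampling"},
    {"author": "ben", "reply_to": "ana", "text": "Sampling matters for big networks"},
    {"author": "cho", "reply_to": "ana", "text": "Networks of people and networks of terms"},
    {"author": "ana", "reply_to": "ben", "text": "Agreed, sampling matters"},
]


def build_edges(records):
    """Return directed (source, target) edges for replies between actors."""
    return [(r["author"], r["reply_to"]) for r in records if r["reply_to"]]


def in_degree(edges):
    """Count how many replies each actor receives (a basic node-level metric)."""
    return Counter(target for _, target in edges)


def term_frequencies(records):
    """Lower-case the text, split it into words, and count each term."""
    words = []
    for r in records:
        words.extend(re.findall(r"[a-z]+", r["text"].lower()))
    return Counter(words)


edges = build_edges(posts)       # social network: who replies to whom
received = in_degree(edges)      # replies received per actor
terms = term_frequencies(posts)  # raw term counts across all posts

print(edges)
print(received.most_common(1))
print(terms.most_common(3))
```

In this toy data, `ana` receives the most replies, which is the kind of node-level observation the network sessions build on with real collections.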

The main software used in the course is R, but we also cover the use of Gephi for advanced visualisation. Data collection will mainly be via the vosonSML R package for collecting and processing social media data (created at the VOSON Lab specifically for use in this course) and the VOSON software (for collecting WWW hyperlinks and text content). We also provide R scripts covering other important packages for network and text analysis such as: igraph (network analysis and visualisation), quanteda (quantitative analysis of textual data), tidytext and tm (text mining), RTextTools (supervised machine learning for text classification), wordcloud (text word clouds and term frequency visualisation), dplyr (data manipulation), and topicmodels (topic modelling of textual data).


The target audience for this course is people with a fairly strong technical background. The course will be particularly appealing to social scientists who want to become more computationally literate, and to those from technical disciplines (e.g. computer science, engineering) who want to become more familiar with social science approaches to big data research.

Course syllabus: 

The following is an indicative list of topics covered during the course. Before the course runs, we will ascertain student interest in particular topics and focus the course accordingly.


Day 1

  • R and RStudio refresher
  • Introduction to vosonSML, VOSON Dashboard
  • Social network analysis (SNA) using VOSON Dashboard - 1 (network plots, basic node-/network-level metrics)
  • SNA in R/igraph - 1 (network plots, basic node-/network-level metrics)
  • Collecting Twitter data using VOSON Dashboard


Day 2

  • Collecting Twitter data with vosonSML
  • SNA using VOSON Dashboard - 2 (clusters, creating subnetworks)
  • SNA in R/igraph - 2 (clusters, creating subnetworks)
  • Text analysis using VOSON Dashboard (frequency counts, wordclouds)
  • Text analysis in R - 1 (text preparation, frequency counts and wordclouds)
  • Collecting YouTube data with VOSON Dashboard and vosonSML


Day 3

  • Gephi for network visualisation - 1 (network maps, node and network statistics)
  • Collecting hyperlink networks and website text content using VOSON
  • Collecting Reddit data with VOSON Dashboard & vosonSML
  • Assortativity and homophily – VOSON Dashboard and R
  • Text analysis in R - 2 (sentiment analysis, semantic networks)


Day 4

  • Dynamic network analysis in R (analysing networks over time, identifying changes in behaviour of individual nodes, clusters or entire networks)
  • Text analysis in R - 3 (topic models)
  • Gephi for network visualisation - 2 (filtering and creating subnetworks)
  • Writing with data: Producing reports in R using Rmarkdown and knitr
  • Individual assistance/working on class projects


Day 5

  • Publishing interactive network analysis on the web (R/Shiny)
  • Advanced topics
  • Individual assistance/working on class projects


Course format: 

This course may run in a computer lab, or you may be advised to bring your own laptop with specified software.

We will let you know in advance.


If the course is run with participants using their own laptops, you will need to have installed:

     (1) R statistical software plus a set of packages which will be specified before the course starts,

     (2) Gephi. 

Both R and Gephi run on macOS, Windows and Linux.

Recommended Background: 

It is advisable that you have taken at least one of the following ACSPRI courses, or have had some equivalent exposure to social network analysis:


It is also advisable that you have some experience with the R programming language (or a similar language), for example via the following ACSPRI courses:

Recommended Texts: 






Q: Do I have to have done an ACSPRI R course before attempting this course?

A: Not necessarily. However, it is advisable that you have some experience with either social network analysis or the R programming language (or a similar programming language).

Participant feedback: 

Mostly hands on computer programming demo and tutorials, which I expected to be, are materials in the workshop. Excellent staff. (Summer 2019)


It provided a broad overview of approaches, enough depth to gauge the approach but not too much to get lost. Great that we get updated notes & R code. (Summer 2019)


Rob is a very excellent instructor who is well prepared and cares for his students (Summer 2017)


excellent! (Summer 2017)


I was interested to see where this area of work was headed, in this the course was very useful. (Winter 2015)


First time it has been run - so things will improve as far as organisation. So naturally more structure will come that said Rob & Tim were extremely adaptable & open to co-creating content with their students – EXCELLENT!! (Winter 2015)


The instructor's bound, book-length course notes will serve as the course texts.