This course introduces you to the collection and analysis of socially-generated 'big data' using the R statistical software and Gephi network visualisation software. The focus is on programmatic approaches for collecting and analysing big data from social media and the WWW. The course will also provide an opportunity for you to learn how these data and techniques are already being used in social science research.
Big data involves data on:
(1) people (social web) e.g. online social networks (e.g. Facebook), microblogs (e.g. Twitter);
(2) information (WWW) e.g. web pages, clickstreams;
(3) things (sensor web) e.g. phones, temperature sensors, and
(4) places (geospatial web) e.g. geology, land use maps.
This course is focused on collecting and analysing data from the social web and the WWW.
In this course, you will learn how to:
- Collect data from Twitter, Facebook, YouTube and Web 1.0 websites. Who are the actors, and what actor attributes are available for them?
- Construct, analyse and visualise networks of people and organisations (social networks) and terms (semantic networks). How can we find connections between actors, and how can we use social network analysis to understand the social scientific meaning of such connections?
- Extract and analyse text data. What text can be attributed to these actors, and what does analysis of this text tell us about the actors and society as a whole?
- Conduct temporal analysis. How can we study behaviour over time, identifying significant events or trends?
- Identify and engage with advanced techniques for dealing with very large datasets, including software optimisation and sampling techniques.
- Utilise social theory to engage with and reason about the challenges and opportunities of big data, including the interpretation of findings and methodological considerations.
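To give a feel for the network construction step listed above, here is a minimal, language-neutral sketch. It is written in Python with only the standard library so that it can be run anywhere; the course itself works in R with igraph, and nothing below is taken from the course materials. The reply records, the actor names and the helper functions `build_network` and `in_degree` are all hypothetical, invented purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical interaction records (author, replied_to), invented for illustration.
replies = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "alice"),
    ("dave", "alice"),
    ("carol", "alice"),
    ("alice", "bob"),  # a repeated interaction increases the edge weight
]


def build_network(pairs):
    """Return a weighted, directed edge list as {(source, target): weight}."""
    return Counter(pairs)


def in_degree(edges):
    """Count how many distinct actors direct at least one reply to each actor."""
    sources = defaultdict(set)
    for source, target in edges:
        sources[target].add(source)
    return {actor: len(who) for actor, who in sources.items()}


edges = build_network(replies)
degrees = in_degree(edges)

# List actors from most to least replied-to (ties broken alphabetically).
for actor, degree in sorted(degrees.items(), key=lambda item: (-item[1], item[0])):
    print(actor, degree)
```

In-degree is only one of many node-level metrics; it is used here because it is the simplest way to show how raw interaction records become a network that can be analysed.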
The main software used in the course is R, but we also cover the use of Gephi for advanced visualisation. Data collection will mainly be via an R package for collecting and processing social media data (created at the VOSON Lab specifically for use in this course) and the VOSON software (for collecting WWW hyperlinks and text content). We also provide R scripts covering other important packages for data analysis such as: igraph (network analysis and visualisation), tm (text mining), RTextTools (supervised machine learning for text classification), wordcloud (text word clouds and term frequency visualisation), plyr and stringr (text sentiment analysis), and topicmodels (topic modelling of textual data).
The target audience for this course is people with a fairly strong technical background. The course will be particularly appealing to social scientists who want to become more computationally literate, and those from technical disciplines (e.g. computer science, engineering) who want to become more socially literate.
The following is an indicative list of topics covered during the course. Before the course runs, we will ascertain student interest in particular topics and focus the course accordingly.
- R and RStudio refresher
- Introduction to SocialMediaLab; Collecting YouTube video comment data with SocialMediaLab
- Social network analysis in R – 1 (graph visualisation, core node- and network-level metrics)
- Collecting Facebook data with SocialMediaLab
- Text analysis in R – 1 (building a corpus, descriptive analysis, wordclouds)
- Social network analysis in R – 2 (clustering, bimodal networks)
- Collecting WWW hyperlink and website text content with vosonR
- Collecting Twitter data with SocialMediaLab
- Introduction to Gephi
- Text analysis in R – 2 (supervised machine learning [e.g. support vector machines], unsupervised machine learning [topic modelling], sentiment analysis, gender analysis)
- Dynamic network analysis in R (analysing networks over time, identifying significant changes in behaviour of individual nodes, clusters or entire networks)
- Dynamic network analysis in Gephi
- R Markdown and other useful tools for ‘working smarter’ in R
- Optimising R to handle big datasets
- Publishing interactive networks on the web (Shiny)
- Advanced topics
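As an illustration of the descriptive text analysis topic in the list above (building a corpus and counting terms, the step that sits behind a word cloud), the following is a minimal sketch. It uses Python's standard library rather than R and the tm package used in the course, so it should be read as a language-neutral outline and not as course code. The example comments, the stop-word list and the function names `tokenize` and `term_frequencies` are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical comments and a tiny stop-word list, invented for illustration.
comments = [
    "Great video, great explanation of network analysis!",
    "The network graphs are hard to read.",
    "Analysis of the comments was great.",
]
STOP_WORDS = {"the", "of", "to", "are", "was", "a", "and"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def term_frequencies(documents):
    """Count how often each remaining term occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


frequencies = term_frequencies(comments)

# Show the three most frequent terms with their counts.
print(frequencies.most_common(3))
```

A real analysis would use a fuller stop-word list and usually stemming or lemmatisation; those steps are omitted here to keep the sketch short.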
You will be advised in advance whether this course will be run in a computer lab or whether you will have to bring your own laptop.
If the course is run with participants using their own laptops, you will need to have installed:
(1) the R statistical software, plus a set of packages that will be specified before the course starts, and
(2) Gephi.
Both R and Gephi run on Mac, Windows and Linux.
You are advised to have taken at least one of the following ACSPRI courses or have had some equivalent exposure to social network analysis and quantitative text analysis:
- Social Media Analysis
- Introduction to Social Network Research and Network Analysis
- Advanced Network Analysis for Social Research
You must also have some experience with the R programming language. You don't need to be an R expert, but you must have some familiarity with programming in R (or a similar language), for example via the following ACSPRI courses:
- Learning R: Open Source (Free) Stats Package
- Using R for Practical Research and Data Visualisation
- Data Analysis, Graphics and Visualisation Using R
Q: Do I have to have done an ACSPRI R course before attempting this course?
A: Not necessarily, as long as you have some experience with the R programming language.
I was interested to see where this area of work was headed, in this the course was very useful. (Winter 2015)
First time it has been run - so things will improve as far as organisation. So naturally more structure will come that said Rob & Tim were extremely adaptable & open to co-creating content with their students – EXCELLENT!! (Winter 2015)
The instructor's bound, book-length course notes will serve as the course text.