General Overview

This course provides an overview of the key concepts for creating an effective data-driven project and introduces tools and techniques for data wrangling, statistical modelling, visualisation and reproducible reporting using R, a free, open-source language for data analysis. The R language provides a rich and flexible environment for working with data, especially data to be used for statistical modelling or graphics.

The R system has an extensive library of packages that offer state-of-the-art capabilities. Many of the analyses they provide are not available in standard statistical software packages. R enables you to escape from the restrictive environments and sterile analyses offered by commonly used statistical software. It makes experimentation and exploration easy, which improves data analysis. Knowledge gained from data analysis becomes useful only when it is shared, and R enables modern data analyses to be reported in a reproducible manner. This makes an analysis more useful to others, because the data and code that actually produced it can be made available and easily shared. As such, R has become a lingua franca of quantitative research. Accordingly, this course will emphasise packages that help you with data analysis, visualisation and communication with a wider audience.

The course will start by introducing the fundamental concepts of R: basic use of the R console through the RStudio IDE, inputting and importing data, record keeping and general good practice in an R project workflow. It will then progress to basic statistical concepts and statistical modelling techniques. Basic statistical concepts, which may be perceived as complex in theory, can be communicated more effectively through visualisation. Hence, the formal, abstract nature of Statistics can be demystified by visualising its application context. This is why the focus is on building appropriate visualisations for a given data analysis problem and on intelligent, reproducible reporting of the analysis using RMarkdown. Using real data and real examples, we will introduce you to fundamental statistical concepts to set the stage for key statistical modelling techniques. We will finish the course by introducing you to the key Machine Learning (ML) algorithms, providing you with insight into how ML adapts and modifies its assumptions through its three-step process (data -> model -> action) and by reacting to errors.
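As a rough illustration of the data -> model -> action cycle and of reacting to errors, the sketch below uses only base R and the built-in mtcars data set. It is not taken from the course material: the choice of data set, the variables and the helper function rmse() are assumptions made purely for illustration.

```r
# Illustrative sketch only: built-in mtcars data and base R functions.
set.seed(1)                                  # make the random split repeatable

# Data: hold back ten cars so that errors are measured on unseen cases
test_rows <- sample(nrow(mtcars), 10)
train <- mtcars[-test_rows, ]
test  <- mtcars[test_rows, ]

# Model: predict fuel economy (mpg) from weight (wt)
fit1 <- lm(mpg ~ wt, data = train)

# Action: predict for the held-out cars and record the prediction error
# rmse() is a hypothetical helper defined here, not a base R function
rmse <- function(model, newdata) {
  sqrt(mean((newdata$mpg - predict(model, newdata))^2))
}
rmse1 <- rmse(fit1, test)

# React to errors: revise the model (here, add horsepower) and compare
fit2 <- lm(mpg ~ wt + hp, data = train)
rmse2 <- rmse(fit2, test)

c(weight_only = rmse1, weight_and_hp = rmse2)
```

Whichever model gives the smaller held-out error would be carried forward into the next round of the cycle; the point of the sketch is the loop itself rather than these particular models.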

Version control has become an essential tool for keeping track of changes and for collaborating on Data Science projects. RStudio supports working with Git, an open-source distributed version control system, which is easy to use when combined with GitHub, a web-based Git repository hosting service. Throughout the course you will be introduced to GitHub and you’ll become acquainted with good practice when incorporating the use of Git into your R project workflow.

Objectives:

  • To become familiar with R/RStudio’s data handling facilities, which will expand the range of Data Science problems that can be effectively analysed
  • To provide a framework for developing analytical skills for handling a range of data sets and the appropriate analytical methodologies
  • To introduce the basic principles behind effective data visualisation
  • To provide the tools and technical skills to enable a range of statistical analyses to be undertaken
  • To enable intelligent reproducible reporting of the results of statistical analysis to target audiences with diverse levels of numerical/statistical understanding
  • To provide a sufficient base for pursuing more complex statistical analysis

How the course works

Teaching and Learning Strategy

The material is structured within three weekly modules. Each module covers various related topics through appropriate case studies, presentations, readings and discussion forums. Essential data handling and statistical modelling techniques are introduced during the teaching sessions. Students are then expected to use their own time to deepen their understanding of the data models presented in the session. The conceptual models come to life during the hands-on taught sessions, where they are put into practice through the application of R. Students are also expected to use their own time to practise and hone the data handling expertise acquired during the taught sessions. Students are given the opportunity to test their knowledge, both conceptual and practical, on a weekly basis through interactive student/teacher workshops.

Students are expected to participate fully in all of these delivery modes, but in particular are expected to have attempted any pre-set work and come fully prepared to discuss any problems encountered and debate the ideas and any issues raised.

We recommend you complete each of the following before the end of each week:

  • Readings and hand-outs/exercises
  • Participation in the discussion forums
  • Quizzes covering concepts from tutorials and/or readings

Who can enrol

This course is for people from varying backgrounds and diverse profiles. It is designed for people who recognise the paramount importance of data and its use.

This course will benefit anyone who has the curiosity and desire to enter the realm of data science. We will make sense of the world of data and learn effective and attractive ways to visually analyse and communicate related information. With the knowledge gained on this course, you will be ready to undertake your very own data analysis for the first time.

Data Literacy is not simply fashionable jargon, but rather a set of tools that empower data-enriched living, so whatever industry you’re in, this is relevant to you!


© 2020 Tatjana Kecojevic