Introduction to Data Analysis with R - lecture materials by Ágoston Reguly (CEU) with Gábor Békés (CEU, KRTK, CEPR)
This course material is a supplement to Data Analysis for Business, Economics, and Policy by Gábor Békés (CEU) and Gábor Kézdi (U. Michigan), Cambridge University Press, 2021
Textbook information: see the textbook's website gabors-data-analysis.com or visit Cambridge University Press
To get a copy: Inspection copy for instructors or Order online
This is version 0.2 (2022-07-11).
Comments are very welcome via email or as a GitHub issue.
The course serves as an introduction to the R programming language and software environment for data exploration, data munging, data visualization, reporting, and modeling.
Lectures 1 to 11 complement Part I: Data Exploration (Chapters 1-6) and focus on basic programming principles, data structures, data cleaning, and data exploration with descriptive statistics and graphs, as well as simple hypothesis testing. This is an intro package for learning R and using it for exploration and some basic analysis.
Lectures 12 to 20 complement Part II: Regression Analysis (Chapters 7-12) and focus on statistical methods such as nonparametric regression, single and multiple linear regression on cross-sections, binary models, and simple time-series analysis, while adding a more advanced toolkit for visualization and reporting. This is a regression-focused package with advanced features for analysis, including RMarkdown.
Lectures 21 to 27 complement Part III: Prediction (Chapters 13-18). These lectures are not intended to be part of an introductory R course, but rather a more advanced seminar to support Data Analysis with machine learning tools for prediction. In this seminar-style course, students cover topics such as model selection with cross-validation; LASSO, RIDGE, or Elastic Net regularization; regression trees with CART; random forest; and boosting. These methods are applied to cross-sectional data, mainly with continuous outcomes, but also with binary outcomes to model probabilities and handle classification problems. Long-run and short-run time-series modeling via ARIMA and VAR models is also covered. To properly understand this material, the prerequisite is to complete coding lectures 1 to 19.
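For a small, hypothetical taste of the cross-validation workflow these seminars cover -- using R's built-in mtcars data and the caret package rather than a course dataset or the course's own scripts:

```r
library(caret)

# 5-fold cross-validation of a simple linear model on built-in data
set.seed(1234)
cv_fit <- train(
  mpg ~ wt + hp,
  data      = mtcars,
  method    = "lm",
  trControl = trainControl(method = "cv", number = 5)
)
cv_fit$results   # cross-validated RMSE and R-squared
```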
We believe students learn R by writing scripts and solving problems on their own. We provide and demonstrate good practices for carrying out such tasks, but extensive practice is needed.
This is not a hardcore coding course, but a course to supplement data analysis. The material focuses on the specific issues of this topic and balances between a higher level of coding such as tidyverse
-- which is more intuitive and easier to learn, but less flexible -- and a lower level in the form of basic coding principles -- which allows greater complexity and deeper understanding, but requires much more practice and has a steeper learning curve (a short sketch of this trade-off follows below).
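As a minimal illustration of this trade-off -- on R's built-in mtcars data, not a course dataset -- the same summary can be written in both styles:

```r
library(dplyr)

# Higher-level, tidyverse style: reads almost like a sentence
mtcars %>%
  filter(cyl == 4) %>%
  summarise(avg_mpg = mean(mpg))

# Lower-level, base R style: more flexible building blocks,
# but more details to keep in your head
mean(mtcars[mtcars$cyl == 4, "mpg"])
```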
The material structure reflects these principles. The majority of the lectures have pre-written codes, which include in-class tasks to practice on and problems to face, along with regular homework. This enables the instructor to show a greater variety of codes, good coding examples, and many more commands and functions than live coding would, while still providing room for practice. For this type of lecture, homework is essential, as it helps students deepen their coding skills. There are also a few live-coding lectures, which require flexibility and more preparation from the teacher (the material provides detailed instructions). These lectures focus on basic coding principles such as the introduction to coding, functions, loops, conditionals, etc., and show students possible paths towards hardcore coding, while showing alternative methods as well. Exceptions are lectures 21-27, as they are intended to be used as seminar material to support theory and assume a good level of coding. They have no homework or in-class tasks.
It is always a good question whether solutions for the tasks or homework should be made available to students. We believe showing students the in-class solutions is beneficial and does not distort motivation, as slower learners may want to revise and compare the reference solution to their own. Hence, for each lecture, we provide the solutions for these tasks. However, this is not the case for the homework. We found that showing homework solutions tends to depress students' motivation and creativity, therefore there are no solutions for the homework. (Note that there are (infinitely) many good solutions for a homework, thus we usually encourage students to try out different paths as well.)
This course material may be used as the basis for a course on learning to code in R for the purpose of analyzing data. It is developed to be taught alongside the textbook but may be used independently. It is rather comprehensive and thus may be used to prepare a course without any textbook.
We have not reinvented the coding wheel. Instead, we tried to adopt best practices and combine them with real-life case studies from the textbook.
There are no slides, but the codes are heavily commented, so they should be easy to follow. In some cases, it is beneficial -- but not necessary -- to read the related case study and/or chapter to fully appreciate the codes and comments.
Within each lecture, there is an estimated time that the lecture would need, with suggestions on how to shorten it if it would be too long. The lectures -- on purpose -- contain more material than what a classical 100-minute class per week for 12 weeks would allow. It is always easier to cut material than to add to it, and the taste of each instructor and/or class may differ. We highly encourage you to use each lecture as a starting point and modify it accordingly. Later, we propose an example schedule for such a 100-minute class per week for a semester (12 weeks).
The material is based on multiple years of teaching coding courses at Central European University, as well as advice from many great resources such as
- Hadley Wickham and Garrett Grolemund: R for Data Science
- Jae Yeon Kim: R Fundamentals for Public Policy, Course material
- Winston Chang: R Graphics Cookbook
- Andrew Heiss: Data Visualization with R
- Grant McDermott: Data Science for Economists
and many others, listed in the lectures' READMEs.
The following table gives a brief summary of the lectures: the type of each lecture, the expected learning outcomes, and how it relates to the textbook's case studies and datasets. (A minimal code sketch of the Part I workflow follows the table.)
Lecture | Lecture Type | Learning outcomes | Case-study | Dataset |
---|---|---|---|---|
PART I. | ||||
lecture00-intro | live coding or pre-written | Setting up R and RStudio. Introduction to the RStudio interface. Packages and a tryout of tidyverse and knitting a pre-written RMarkdown | - | - |
lecture01-coding-basics | live coding | Introduction to coding with R: R-objects, basic operations, functions, vectors, lists | - | - |
lecture02-data-imp-n-exp | pre-written | How to import and export data with readr and APIs | - | hotels-vienna, football** |
lecture03-tibbles | pre-written | Introduces tibble-s as the data format. Selecting, adding, or removing rows (observations) and columns (variables). Converting to wide and long format. Merging two tibbles in multiple ways. | Ch02C: Football Managers | football |
lecture04-data-munging | pre-written | Intro to data munging with dplyr: add, remove, separate, and convert variables, filter observations, etc. | Ch02A: Hotels prep* | hotels-europe |
lecture05-data-exploration | pre-written | Intro to data exploration: modelsummary for descriptive stats in various ways, ggplot2 to plot one-variable distributions (histogram, density) and two-variable associations (scatter, bin-scatter), t.test for simple hypothesis testing. | Core: Ch06A: Online vs offline prices. Related: Ch03A: Hotels: exploration, Ch04A: Management & firm size | billion-prices, wms-management-survey** |
lecture06-rmarkdown101 | pre-written | Intro to RMarkdown: knitting pdf and Html. Structure of RMarkdown, formatting text, plots and tables. | Ch06A: Online vs offline prices* | billion-prices, hotels-europe** |
lecture07-ggplot-indepth | pre-written | Tools to customize ggplot2 graphs. Write your own theme. Bar charts, box and violin plots. theme_bg() and source() from file and url. | Ch03B: Hotels: Vienna vs London | hotels-europe |
lecture08-conditionals | live coding | Conditional programming: if-else statements, logical operations with vectors, creating new variables with conditionals. | - | wms-management |
lecture09-loops | live coding | Imperative programming with for and while loops. Exercise to calculate yearly sp500 returns. | Ch05A: Loss on a stock portfolio | sp500 |
lecture10-random-numbers | live coding | Introduction to random number generators and random sampling. | Ch03D: Height and income, Ch05A: Loss on a stock portfolio* | height-income-distributions, sp500 |
lecture11-functions | live coding | Writing functions: control for input(s) and output(s), error handling. User-written confidence intervals, sampling distribution for t-statistics, bootstrapping. | Ch05A: Loss on a stock portfolio*, Good-to-know: Ch06A: Online vs offline prices and Ch06B: Testing loss on a stock portfolio | wms-management, sp500 |
PART II. | ||||
lecture12-intro-to-regression | pre-written | Intro to regressions: binary means, binscatters, non-parametric regression via lowess, simple linear regression. Predicted values and residuals. | Ch07A: Hotels with simple regression | hotels-vienna |
lecture13-feature-engineering | pre-written | Intro to feature engineering. Covering variable transformations/manipulations which are used in the book/case-studies/this R course. Can be skipped, but good overview. | Ch01C: Data collection, Ch04A: Management & firm size* , Ch08C: Measurement error as HW, Ch17A: Predicting firm exit* | wms-management-survey, bisnode-firms, hotels-vienna** |
lecture14-simple-regression | live coding | Level-level, log-level, level-log, log-log, polynomial and linear spline transformations for simple regressions. Weighted OLS. Graphical representation of these models. Model comparison, theory and statistical based decision for model choice. | Ch08B: Life expectancy, Ch08A: Hotels with non-linear as HW | worldbank-lifeexpectancy, hotels-vienna** |
lecture15-advanced-linear-regression | pre-written | Introduces to multiple variable regression. Model evaluation: R2, prediction and error analysis with graphs. Confidence and prediction intervals. Robustness tests: checking parameter stability across time/location/type of obs. | Ch09B: Hotel stability, Ch10B: Hotels with multiple regression | hotels-europe |
lecture16-binary-models | pre-written | Introduction to binary outcome models: saturated models, linear probability models, logit and probit models. Estimating average marginal effects for non-linear models via marginaleffects and summarizing with modelsummary. Evaluating models by R2, Pseudo-R2, Brier score, and log-loss. Comparing predicted probabilities for certain groups and their distributions across models. Bias of the model and the calibration curve. | Ch11A: Smoking health risk | share-health |
lecture17-dates-n-times | pre-written | Introduction to basic date and time variable manipulations with lubridate: rounding and differencing. Dataset aggregation, differenced and lagged variables, unit root tests. Visualizing time series. | Ch12A: Returns: company vs market** | stocks-sp500 |
lecture18-timeseries-regression | pre-written | Introduction to time series analysis. Time-series data manipulations, simple visualizations and (partial) autocorrelation graph. Differencing, lags of outcome and explanatory variables and deterministic seasonality. Using Newey-West standard errors. Model comparison and estimating cumulative effects with valid SEs. | Ch12B: Electricity and temperature | arizona-electricity, case-shiller-la** |
lecture19-advaced-rmarkdown | pre-written | RMarkdown formatting for a data analysis report. Chunks, general and local set-options, formatting figures, descriptive tables, and model comparison tables. Equations, Greek letters, and hypothesis testing. Organizing the appendix. | Ch10A: Gender wage gap | cps-earnings |
lecture20-basic-spatial-vizz | pre-written | Introduction to spatial visualization via maps (package-based maps) and rgdal (user-supplied maps). How to create a world map and show life expectancy, or color the average hotel prices for London boroughs or Vienna districts. Handling maps via geom_polygon and setting the scaling, colors, etc. | Ch08B: Life expectancy*, Ch03B: Compare hotel prices Vienna vs London* | worldbank-lifeexpectancy, hotels-europe |
PART III. | ||||
lecture21-cross-validation | seminar | Model comparison introduced by BIC and RMSE. Limitations of these comparisons. Cross-validation: using different samples to tackle overfitting. The caret package. | Ch13A Predicting used car value with linear regressions and Ch14A Predicting used car value: log prices | used-cars |
lecture22-lasso | seminar | Feature engineering for LASSO: interactions and polynomials. Cross-validation in detail. LASSO (and RIDGE, Elastic Net) via glmnet. Training-test samples and the holdout sample to evaluate predictions. LASSO diagnostics. | Ch14B Predicting AirBnB apartment prices: selecting a regression model | airbnb |
lecture23-regression-tree | seminar | Estimating regression trees via rpart. Understanding regression trees and comparing them to linear regressions. Tuning and setup of CART. Tree and variable importance plots. | CH15A Predicting used car value with regression trees | used-cars |
lecture24-random-forest | seminar | Data cleaning and feature engineering specifics for random forest (RF). Estimate RFs via ranger. Examine the results of RFs with variable importance plots and partial dependence plots, and check the quality of predictions in (important) subgroups. Gradient Boosting Method (GBM) via the gbm package. Prediction comparisons (prediction horse-race) for OLS, LASSO, CART, RF, and GBM. | Ch16A Predicting apartment prices with random forest | airbnb |
lecture25-classification-wML | seminar | Predicting probabilities and classification with machine learning tools. Cross validated logit models. LASSO with logit, CART, and Random Forest (bonus: why not use Classification Forest). Classification of probabilities, ROC curve, and AUC. Confusion Matrix. Model comparison via RMSE or AUC. User-defined loss function to weight false-positive and false-negative rate. Optimizing threshold value for classification to get best loss function value. | CH17A Predicting firm exit: probability and classification | bisnode-firms |
lecture26-long-term-time-series-wML | seminar | Forecasting time series data on the long run. Feature engineering with time series, deciding transformations for stationarity. Cross-validation options with time series. Modeling with deterministic trend, seasonality, and other dummy variables for a long-term horizon. Evaluation of model and forecast precision. prophet as a machine learning tool for time series data. | Ch18A Forecasting daily ticket sales for a swimming pool | swim-transactions |
lecture27-short-term-time-series-ARIMA-VAR | seminar | Forecasting time series data on the short run. Feature engineering with time series, deciding transformations for stationarity. Cross-validation options with time series. ARIMA and VAR models for short term forecasting. Evaluation of forecasts on short run: performance on hold out set, fan-chart to assess risks and stability of forecasting performance on an extended time period. | CH18B Forecasting a house price index | case-shiller-la |
*case study was the base for the material, but coding material is modified
**only used in homework
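To give a flavour of the Part I workflow sketched in the table -- import with readr, munge with dplyr, explore with ggplot2 -- here is a minimal, hypothetical sketch; the file path is a placeholder and the variable names (price, accommodation_type) are assumptions based on the textbook's hotels-vienna dataset:

```r
library(readr)
library(dplyr)
library(ggplot2)

# Import (placeholder path -- point it to your copy of the data)
hotels <- read_csv("data/hotels-vienna.csv")

# Munge: keep hotels only, add log price
hotels_clean <- hotels %>%
  filter(accommodation_type == "Hotel") %>%
  mutate(ln_price = log(price))

# Explore: one-variable distribution
ggplot(hotels_clean, aes(x = price)) +
  geom_histogram(binwidth = 20) +
  labs(x = "Price (US dollars)", y = "Frequency")
```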
Within each lecture there is the following folder structure:
- `raw_codes`: includes codes that are ready to use during the course but require some live coding in class.
- `complete_codes`: includes codes with suggested solutions to the codes in `raw_codes`.
- `data`: in some cases there is a data folder, which includes data files (typically in `.csv`). We have found it crucial during live-coding classes to make sure everybody has the same data.
- if there are no folders, then either:
  - the lecture has a notebook format, which implies a complete live-coding class (mostly introduction or technical ''hard-core coding'' lectures), or
  - the lecture has a complete R-script. In this case, the lecturer should pay attention to the interpretation of the material itself rather than to coding. Typically this is the case for the more advanced case studies (Chapters 13-18), where there is no new coding technique, but interpreting the results might be challenging.
Probably the largest difference compared to the book is that data handling is the most challenging and most time-consuming part of coding, while it is a relatively small (but just as important!) part of the book. It is always a challenge to keep the material in sync if the two courses (Data Analysis and Coding) run in parallel. Experience shows that lecture05-data-exploration in this course is the first truly common point with the book, and lecture06-rmarkdown101 enables students to submit data analysis material as pdf or HTML. This coding material was developed to catch up with the book as quickly as possible, showing the truly essential tools for handling data in an easy way. The result is that after 6 lectures from both courses (teaching Part I of the book) there is room for a common assignment in the form of a descriptive analysis: e.g. carry out a data-collection exercise, clean the data, and do an exploratory analysis. The 'cost' is that, apart from some references or homework, there is no true connection between the two courses before lecture05-data-exploration in coding, and data handling skills can be improved even further. Therefore, do not expect students to be able to solve (all of) the data exercises from the book (although there were some positive surprises over the years).
In contrast, Part II of the book deals with regressions of various forms. This is fairly simple from the coding perspective, which allows the lecturer to
- deepen students' knowledge of basic coding principles;
- add further data handling practices to students' toolkit; and
- provide more RMarkdown skills, while following the material of the book.
If the material is properly taught, then -- for Part III of the book -- there is no need for an extra coding course, only a simple seminar-type supplement, which puts the emphasis on interpretation and practice of machine learning methods. This material is provided in the folder part-III-case-studies. In principle, after these materials, students should be able to code by themselves and to understand and work with the case study materials related to Part IV.
Or one can relate each case study from the book to specific lectures.
*partial match: the case study is only used as a starting point for the lecture.
**students can understand and replicate material based on that lecture
As an example of a coding course that takes one 100-minute class per week for a semester (12 weeks), we have taught the following:
Class | Lecture(s) | Comments |
---|---|---|
Class 01 | lecture00-intro, lecture01-coding-basics | Installation of R, RStudio, and the tidyverse package, along with knitting an RMarkdown, should be done before the class. From coding basics, some materials (e.g. numeric vs integer vs double, indexing, or lists) are left out if we run out of time. |
Class 02 | lecture02-data-imp-n-exp, lecture03-tibbles | Sometimes lecture03-tibbles is finished in the next class. |
Class 03 | lecture04-data-munging, start: lecture05-data-exploration | Ask about RMarkdown knitting. |
Class 04 | Finish: lecture05-data-exploration, lecture06-rmarkdown101 | At this point, assess whether students understand the basics of coding and make sure nobody is struggling. From this class on, they should be able to prepare for submitting a project for the 6th week's assessment, which is 2 weeks from this point. |
Class 05 | lecture07-ggplot-indepth, lecture08-conditionals | This class provides some room for repetition or clarifying concepts. |
Class 06 | lecture09-loops, lecture10-random-numbers and lecture11-functions | Should be a more relaxed class, as during these days there are many (other) assessments for students; concentrate more on the joy of programming. Many students may already know this material, so try to come up with some entertaining tasks for them as well. |
Class 07 | lecture12-intro-to-regression, lecture13-feature-engineering | Feature engineering is new material, but it fits here quite well. Class 07 should come after the first class of Part II, which discusses Chapter 7. |
Class 08 | lecture14-simple-regression | Great opportunity for in-class (team) work for students with live coding. |
Class 09 | lecture15-advanced-linear-regression | Make sure students covered Chapter 10 from the book. If not, spatial data visualization is a great substitute here. |
Class 10 | lecture16-binary-models | In some cases this material is covered as a seminar in the course that discusses Part II. This provides an opportunity to fill any gaps or to make class 12 less dense by jumping ahead to the next class's material. |
Class 11 | lecture17-dates-n-times, lecture18-timeseries-regression | If short in time, skip lecture17-dates-n-times |
Class 12 | lecture19-advaced-rmarkdown, lecture20-basic-spatial-vizz | Two paths: discuss lecture19-advaced-rmarkdown in detail, with the whys as well, but then there is no time for lecture20-basic-spatial-vizz. Or stick with the technical details in both lectures, which gives a higher probability of finishing both. |
Class * | lecture20-basic-spatial-vizz | This lecture seldom fits into the timeframe of the class, especially if this coding class runs along with theory classes for Parts I and II and serves as a supplement both in coding and in understanding the material. However, if there is a mismatch, this class can be used flexibly as a substitute (e.g. if the theory class is lagging behind). |
- Tidyverse and not data.table. Some friends love data.table. But it seems tidyverse has become the more popular choice, especially at a starter level.
- Starting with `rm(list = ls())`. Yes, we know. There is a strong view suggesting a project-based workflow: "If the first line of your R script is rm(list = ls()) I will come into your office and SET YOUR COMPUTER ON FIRE". We have been warned directly, too. At the same time, for beginners, this seems a good start. So we kept it for lectures 01-20, not beyond. Feel free to use a version without it.
- Descriptive tables are done with `datasummary` -- it takes a bit of time to get used to, but it is nice.
- All regressions (except when we start) are done with `fixest`. We think it is the future regression command for all uses (see the short sketch below).
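As a minimal, hedged sketch of the last two choices -- again on R's built-in mtcars data rather than a course dataset:

```r
library(modelsummary)
library(fixest)

# Descriptive table with datasummary
datasummary(mpg + wt + hp ~ Mean + SD + Min + Max, data = mtcars)

# Simple regression with fixest, heteroskedasticity-robust SEs
model <- feols(mpg ~ wt + hp, data = mtcars, vcov = "hetero")
etable(model)
```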
Thanks to all folks who contributed to the codebase for the course, especially Gábor Kézdi, co-author of the book. But also thanks to Zsuzsa Holler, Kinga Ritter, Ádám Víg, Jenő Pál, János Divényi, Marc Kaufmann, Gábors' and Ágoston's many students. Big thanks to Laurent Bergé, Grant McDermott and Vincent Arel-Bundock for awesome packages and all the help on coding over several years.
We know there are errors and bugs, or just much better ways to do a procedure. To make a suggestion, please open a GitHub issue here with a title containing the case study name. You may also contact us directly.