Sebastian Koehler
February 5, 2026
Materials presented here build on the resources for instructors designed by Elena Llaudet and Kosuke Imai in Data Analysis for Social Science: A Friendly and Practical Introduction (Princeton University Press).
This week, we will cover the following topics:
Download and save india.csv in the data folder you created last week.
Start by creating an R script to keep track of your code. In RStudio, you can open a new script by clicking File > New File > R Script.
Save your script by clicking File > Save As and saving it in your POL272 folder with the name seminar2.R.
Clear your environment to avoid operating with objects from previous work by mistake. You can do this by clicking on the broom icon in the Environment tab.
Set the working directory to your POL272 folder as you did last week so R can access and save files there:

    setwd("~/Desktop/POL272")
We will need extra tools to make working with data easier in R, so we will have to install some packages. A package in R is like an app on your phone. It adds new tools and features that don’t come with R by default.
One of the packages we will use is tidyverse. It includes several tools that make working with data easier, especially for tasks like cleaning, transforming, and visualizing data.
We install packages from the Comprehensive R Archive Network (CRAN), which is like an app store for R packages.
To install a package from CRAN, we use the install.packages() function, making sure to put the package name in quotes:
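For tidyverse, the call would look like the sketch below. It needs an internet connection, and because installation is a one-off step, one option is to run it in the console rather than keeping it in your script.

```r
# Download and install tidyverse from CRAN (only needed once per computer).
# The package name goes in quotes.
install.packages("tidyverse")
```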
R will download the package and print a lot of text in the console as it installs. When it’s done, you’ll see a message like: The downloaded binary packages are in... followed by a long directory name.
You only need to install a package once on your computer.
Even though we installed tidyverse, we still need to load it to tell R we want to use it in this session.
To load a package in R, use the library() function (without quotes now):
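A minimal example, assuming tidyverse has already been installed on your computer:

```r
# Load tidyverse for the current R session (no quotes needed here)
library(tidyverse)
```

Running this prints a start-up message listing the attached packages and any naming conflicts.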
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Important:
install.packages() happens once.
library() happens every time you start a new R session.
We will revisit the use of tidyverse functions at the end of the seminar with a few examples.
We will analyse data from an experiment conducted in India, where villages were randomly assigned to have a female council head. This is based on Raghabendra Chattopadhyay and Esther Duflo. 2004. Women as Policy Makers: Evidence from a Randomized Policy Experiment in India, Econometrica, 72(5): 1409–43.
Table 1 shows the names and descriptions of the variables in this dataset, where the unit of observation is villages.
Table 1: Variables in “india.csv”
| Variable | Description |
|---|---|
| village | village identifier (“Gram Panchayat number_village number”) |
| female | whether village was assigned a female politician: 1=yes, 0=no |
| water | number of new (or repaired) drinking water facilities in the village since random assignment |
| irrigation | number of new (or repaired) irrigation facilities in the village since random assignment |
To load this dataset, use the read.csv() function as shown below:
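The call is sketched below. So that the example runs anywhere, it first writes a six-row stand-in for india.csv (the rows printed in this handout) into a data folder inside a temporary directory. The name demo_dir and the stand-in file are assumptions made for illustration only. In your own script, with the working directory set to POL272, you only need the read.csv() line, followed by head() if you want to print the first rows.

```r
# Illustration only: build a temporary project folder containing data/india.csv
demo_dir <- tempfile("POL272_demo_")
dir.create(file.path(demo_dir, "data"), recursive = TRUE)
writeLines(
  c("village,female,water,irrigation",
    "GP1_village2,1,10,0", "GP1_village1,1,0,5", "GP2_village2,1,2,2",
    "GP2_village1,1,31,4", "GP3_village2,0,0,0", "GP3_village1,0,0,0"),
  file.path(demo_dir, "data", "india.csv")
)
old_wd <- setwd(demo_dir)  # stands in for setwd("~/Desktop/POL272")

# The line you need in your own script: read the file and store it as india
india <- read.csv("data/india.csv")
head(india)  # print the first rows to check that the data loaded

setwd(old_wd)  # restore the previous working directory
```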
           village female water irrigation
    1 GP1_village2      1    10          0
    2 GP1_village1      1     0          5
    3 GP2_village2      1     2          2
    4 GP2_village1      1    31          4
    5 GP3_village2      0     0          0
    6 GP3_village1      0     0          0
This function loads the data stored in “india.csv” into R and assigns it to a new object called india using the assignment operator <-, which you learned about last week. After running this line, you should see the india object appear in the Environment.
Note: data/ tells R to look inside a folder called data, which is located in your working directory. If the file were in the main folder POL272 (not in a sub-folder like data), we would simply run:
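A minimal sketch of that case. As before, a temporary folder (the hypothetical demo_dir) and a two-row stand-in file are created only so the example can run on its own; in your own script the read.csv() line is all you would need.

```r
# Illustration only: put a stand-in india.csv directly in a temporary folder
demo_dir <- tempfile("POL272_demo_")
dir.create(demo_dir)
writeLines(
  c("village,female,water,irrigation",
    "GP1_village2,1,10,0", "GP1_village1,1,0,5"),
  file.path(demo_dir, "india.csv")
)
old_wd <- setwd(demo_dir)  # stands in for your POL272 folder

# No folder prefix: R looks for the file in the working directory itself
india <- read.csv("india.csv")

setwd(old_wd)  # restore the previous working directory
```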
To explore the contents of the dataset, you can type its name in the R script and run it. However, this displays the entire dataset in the R console, which can be overwhelming for larger datasets.
A better option is to click the dataset name in the Environment tab to open a spreadsheet-style viewer or use the View() function:
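A short sketch, using a six-row stand-in for india built from the rows printed in this handout. The interactive() check is an addition so the example does not fail when run outside RStudio; in RStudio you would simply run View(india).

```r
# Stand-in for india.csv: the six rows printed in this handout
india <- read.csv(text = "village,female,water,irrigation
GP1_village2,1,10,0
GP1_village1,1,0,5
GP2_village2,1,2,2
GP2_village1,1,31,4
GP3_village2,0,0,0
GP3_village1,0,0,0")

# Open the spreadsheet-style viewer (only available in an interactive session)
if (interactive()) View(india)
```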
Remember, R is case-sensitive, and this function starts with an uppercase V.
If you only need a quick look at the first few rows, use the head() function:
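For example, with india rebuilt here from the six rows printed in this handout so the snippet runs on its own (in your script, india is already loaded from the CSV file):

```r
# Stand-in for india.csv: the six rows printed in this handout
india <- read.csv(text = "village,female,water,irrigation
GP1_village2,1,10,0
GP1_village1,1,0,5
GP2_village2,1,2,2
GP2_village1,1,31,4
GP3_village2,0,0,0
GP3_village1,0,0,0")

# Show the first rows of the dataset
head(india)
```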
           village female water irrigation
    1 GP1_village2      1    10          0
    2 GP1_village1      1     0          5
    3 GP2_village2      1     2          2
    4 GP2_village1      1    31          4
    5 GP3_village2      0     0          0
    6 GP3_village1      0     0          0
By default, head() shows the first six rows. To customize this, add the n argument for the number of rows you want to see. For example, to display the first three rows:
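A minimal sketch, again using a six-row stand-in for india; the object name first_three is only for illustration.

```r
# Stand-in for india.csv: the six rows printed in this handout
india <- read.csv(text = "village,female,water,irrigation
GP1_village2,1,10,0
GP1_village1,1,0,5
GP2_village2,1,2,2
GP2_village1,1,31,4
GP3_village2,0,0,0
GP3_village1,0,0,0")

# Show only the first three rows
first_three <- head(india, n = 3)
first_three
```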
What is the unit of observation in the india dataset?
The unit of observation represents the individuals or entities about which information is recorded.
To determine the unit of observation, ask: What does each row represent? Is it an individual, a household, a country, an event, or something else?
The unit of observation in india is villages. Hence, every row of data in the india dataframe represents a different village in the study.
How many observations are in the india dataset? In other words, how many villages were part of this experiment?
To identify the number of observations (rows) in the dataset, you can use the nrow() function by running nrow(india).
There are 322 observations in the dataset. In other words, 322 villages were part of this study.
A variable is a characteristic or piece of information that can vary across observations in a dataset. For example:
In a dataset about people, a variable could be age, gender, or height.
In a dataset about countries, a variable could be population size, GDP, or continent.
Variables are represented as columns in a dataframe.
How many variables are in the india dataset?
To identify the number of variables (columns) in the dataset, you can use the ncol() function by running ncol(india).
There are 4 variables in the dataset.
To identify both the number of observations (rows) and variables (columns) in a dataset, we could also use the dim() function, which stands for “dimensions”:
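The sketch below shows how dim() relates to nrow() and ncol(). It uses a six-row stand-in for india, so the row count is 6 here; with the full india.csv the result is 322 rows and 4 columns.

```r
# Stand-in for india.csv: the six rows printed in this handout
india <- read.csv(text = "village,female,water,irrigation
GP1_village2,1,10,0
GP1_village1,1,0,5
GP2_village2,1,2,2
GP2_village1,1,31,4
GP3_village2,0,0,0
GP3_village1,0,0,0")

nrow(india)  # number of observations (rows)
ncol(india)  # number of variables (columns)
dim(india)   # both at once: rows first, then columns
```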
Let’s start off by calculating the total number of new water facilities using the sum() function that we learned last week. The only required argument is the code identifying the variable (column) in the dataset:
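A minimal sketch with a six-row stand-in for india. The total for these six villages is 43; the figure reported for the full dataset comes from all 322 villages.

```r
# Stand-in for india.csv: the six rows printed in this handout
india <- read.csv(text = "village,female,water,irrigation
GP1_village2,1,10,0
GP1_village1,1,0,5
GP2_village2,1,2,2
GP2_village1,1,31,4
GP3_village2,0,0,0
GP3_village1,0,0,0")

# Add up all values in the water column of india
sum(india$water)
```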
The total number of new or repaired drinking water facilities since random assignment is 5,745.
The dollar sign ($) is used to access a specific column inside the india dataframe. In this case, we are selecting the water column from india and adding up all the values in that column.
Now, let’s use the mean() function to calculate the average number of new water facilities in a similar way, by running mean(india$water). This returns 17.84161.
Let’s use another function we learned last week to round this mean to a whole number:
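The sketch below computes the mean and then nests it inside round(). It uses a six-row stand-in for india, so the values differ from those of the full dataset.

```r
# Stand-in for india.csv: the six rows printed in this handout
india <- read.csv(text = "village,female,water,irrigation
GP1_village2,1,10,0
GP1_village1,1,0,5
GP2_village2,1,2,2
GP2_village1,1,31,4
GP3_village2,0,0,0
GP3_village1,0,0,0")

# Average of the water column (43 / 6 for the stand-in data)
mean(india$water)

# Nest mean() inside round() to get a whole number
round(mean(india$water))
```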
The average number of new or repaired drinking water facilities since random assignment is 18 per village.
This approach of nesting functions inside each other might be a bit hard to read. Instead, we can use the pipe operator (%>%) from tidyverse to make the code easier to follow.
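One way to write the same calculation with the pipe is sketched below. The exact pipeline is an assumption, chosen so that the mean is rounded to one decimal place; it uses a six-row stand-in for india, and the object name rounded_mean is only for illustration.

```r
library(tidyverse)

# Stand-in for india.csv: the six rows printed in this handout
india <- read.csv(text = "village,female,water,irrigation
GP1_village2,1,10,0
GP1_village1,1,0,5
GP2_village2,1,2,2
GP2_village1,1,31,4
GP3_village2,0,0,0
GP3_village1,0,0,0")

# Take the water column, then compute its mean, then round to one decimal
rounded_mean <- india$water %>%
  mean() %>%
  round(1)
rounded_mean
```

Each line ends with the pipe, so the steps read from top to bottom instead of from the inside out.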
[1] 17.8
Think of the pipe (%>%) as saying “then”. Instead of writing everything inside one big function, the pipe lets us chain steps together one by one, in an order that’s easy to understand. Here’s what we’re telling R to do:
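Start with the water column in india.
Then calculate its mean.
Then round the result to one decimal place.
The tidyverse summarise() function
Now that we’ve seen how to calculate simple statistics like sum and mean, let’s take it a step further.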
Instead of using $ to extract columns and calling mean() or sum() separately, we can use summarise() from tidyverse to calculate both in one step:
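A minimal sketch of that call, using a six-row stand-in for india. The object name unnamed_summary is only for illustration.

```r
library(tidyverse)

# Stand-in for india.csv: the six rows printed in this handout
india <- read.csv(text = "village,female,water,irrigation
GP1_village2,1,10,0
GP1_village1,1,0,5
GP2_village2,1,2,2
GP2_village1,1,31,4
GP3_village2,0,0,0
GP3_village1,0,0,0")

# Calculate the total and the average of water in one step
unnamed_summary <- india %>%
  summarise(sum(water), mean(water))
unnamed_summary
```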
What’s happening here?
india %>%: Passes the india dataset to the summarise() function, telling R to perform the calculations on this dataset.
sum(water): Calculates the total number of new water facilities.
mean(water): Calculates the average number of new water facilities.
The result is a clean dataframe that combines your summary statistics.
Let’s make it even cleaner. R automatically names the columns something like sum(water) and mean(water), which can be hard to read.
To make the output clearer, we can assign custom names to our calculations:
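A sketch with the column names total_water and average_water. It uses a six-row stand-in for india, so the numbers are smaller than those from the full dataset; named_summary is a name chosen for illustration.

```r
library(tidyverse)

# Stand-in for india.csv: the six rows printed in this handout
india <- read.csv(text = "village,female,water,irrigation
GP1_village2,1,10,0
GP1_village1,1,0,5
GP2_village2,1,2,2
GP2_village1,1,31,4
GP3_village2,0,0,0
GP3_village1,0,0,0")

# Name each calculation: new_column_name = calculation
named_summary <- india %>%
  summarise(
    total_water = sum(water),
    average_water = mean(water)
  )
named_summary
```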
      total_water average_water
    1        5745      17.84161
Now, the result is easier to understand!
Use summarise() to calculate the average number of new or repaired irrigation facilities. Interpret the result.
Use summarise() to calculate the average of the variable female. Interpret the result.
Use summarise() to calculate the minimum and maximum number of new water facilities in the dataset. Hint: Open the help file for summarise() and look under “Useful functions” to find functions for range (minimum and maximum).
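One possible way to write the first two exercises is sketched below. It runs on a six-row stand-in for india, so its results (about 1.83 and 0.67) differ from those you get with the full dataset; the column names avg_irrigation and avg_female are a naming choice.

```r
library(tidyverse)

# Stand-in for india.csv: the six rows printed in this handout
india <- read.csv(text = "village,female,water,irrigation
GP1_village2,1,10,0
GP1_village1,1,0,5
GP2_village2,1,2,2
GP2_village1,1,31,4
GP3_village2,0,0,0
GP3_village1,0,0,0")

# 1. Average number of new or repaired irrigation facilities
india %>%
  summarise(avg_irrigation = mean(irrigation))

# 2. Average of female: the share of villages assigned a female politician
india %>%
  summarise(avg_female = mean(female))
```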
Output for exercise 1:

      avg_irrigation
    1       3.263975

Output for exercise 2:

      avg_female
    1  0.3354037

About 34% of the villages in the experiment were randomly assigned to have a female politician. Because female only takes the values 0 and 1, its mean is the proportion of villages where female equals 1. Multiplying by 100 expresses that proportion as a percentage: 0.3354 × 100 = 33.54%, which rounds to 34%.

Exercise 3: calculate the minimum and maximum number of new water facilities.

    india %>%
      summarise(
        min_water = min(water),
        max_water = max(water)
      )

      min_water max_water
    1         0       340
R comes with several built-in datasets that you can explore without needing to download any files. One of these is mtcars, which contains information about different car models and their characteristics.
Since mtcars is built-in, you don’t need to download anything—just run this command to load it:
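The command is:

```r
# Load the built-in mtcars dataset into the environment
data(mtcars)
```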
Get an overview of the dataset using the help file.
View the first five rows of the dataset.
What is the unit of observation in mtcars?
How many observations (rows) and variables (columns) are in the dataset?
Calculate the average miles per gallon (mpg) across all cars using summarise().
Find the minimum and maximum horsepower (hp) in the dataset using summarise().
What type of variable is am?
Calculate the mean of am using summarise(). Interpret the result.
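One possible set of commands for these exercises is sketched below. Because mtcars ships with R, the code runs as written. The column names avg_mpg, min_hp, max_hp and mean_am are a naming choice, and the interactive() check is an addition so the help file only opens when you work in RStudio or the console.

```r
library(tidyverse)
data(mtcars)

# 1. Open the help file (the same as typing ?mtcars)
if (interactive()) help("mtcars")

# 2. First five rows
head(mtcars, n = 5)

# 3. Each row is one car model, so the unit of observation is cars.

# 4. Number of rows and columns
dim(mtcars)

# 5. Average miles per gallon
mtcars %>% summarise(avg_mpg = mean(mpg))

# 6. Minimum and maximum horsepower
mtcars %>% summarise(min_hp = min(hp), max_hp = max(hp))

# 7. am is binary: the help file codes it as 0 = automatic, 1 = manual.

# 8. Mean of am: the proportion of cars with a manual transmission
mtcars %>% summarise(mean_am = mean(am))
```

As with female in the india data, the mean of a 0/1 variable is a proportion, so a mean of 0.40625 means that about 41% of the cars have a manual transmission.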
Output for exercise 2 (first five rows):

                       mpg cyl disp  hp drat    wt  qsec vs am gear carb
    Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

Output for exercise 4 (rows, then columns):

    [1] 32 11

Output for exercise 5:

       avg_mpg
    1 20.09062

Output for exercise 6:

      min_hp max_hp
    1     52    335

Output for exercise 8:

      mean_am
    1 0.40625