Seminar 3

Author

Sebastian Koehler

Published

February 6, 2026

Materials presented here build on the resources for instructors designed by Elena Llaudet and Kosuke Imai in Data Analysis for Social Science: A Friendly and Practical Introduction (Princeton University Press).

Seminar Objectives

This week, we will cover the following topics:

  • Subsetting
  • Relational operators
  • Creating new variables using ifelse()
  • Computing the difference-in-means estimator

Getting started

  1. Start by creating an R script to keep track of your code. In RStudio, you can open a new script by clicking File > New File > R Script.

  2. Save your script by clicking File > Save As and saving it in your POL272 folder with the name seminar3.R.

  3. Clear your environment to avoid operating with objects from previous work by mistake. You can do this by clicking on the broom icon in the Environment tab.

  4. Set the working directory by clicking on Session > Set Working Directory > Choose Directory. Navigate to your POL272 folder and click Open.

When you select the folder, RStudio will print the setwd(...) command in the Console. Copy and paste the command into your script so that R automatically sets the correct directory every time you run your script.

  5. Load the “india.csv” dataset that you downloaded last week using the read.csv() function as shown below:
Code
india <- read.csv("data/india.csv")

Note: data/ tells R to look inside a folder called data, which is located in your working directory. If the file were in the main folder POL272 (not in a sub-folder like data), we would simply run:

Code
india <- read.csv("india.csv")
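If read.csv() reports that it cannot find the file, it usually helps to check which folder R is currently working in and whether the file is where you expect. The sketch below illustrates this with a temporary folder standing in for POL272; the folder name POL272_demo and the two-row file it writes are assumptions made so that the example runs anywhere, not the real india.csv.

```r
# Sketch only: a temporary folder stands in for your POL272 folder,
# and the tiny file written here is a stand-in for the real india.csv.
course_dir <- file.path(tempdir(), "POL272_demo")   # hypothetical folder
dir.create(file.path(course_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(village = c("GP1_village2", "GP1_village1"),
                     water   = c(10, 0)),
          file.path(course_dir, "data", "india.csv"),
          row.names = FALSE)

old_dir <- setwd(course_dir)   # same effect as Session > Set Working Directory
getwd()                        # prints the folder R is currently working in

file_found <- file.exists("data/india.csv")  # TRUE when the relative path is right
demo <- read.csv("data/india.csv")           # path is resolved from the working directory

setwd(old_dir)                 # go back to the previous working directory
```

In your own script, running getwd() and file.exists("data/india.csv") before read.csv() is a quick way to diagnose a wrong working directory or a misplaced file.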

Subsetting

Last week, we learned about the $ operator, which allows us to extract a single column from a dataset.

For example, we used $ to extract the water column from the india dataset:

Code
india$water

What if we wanted to select both rows and columns together?

→ To do this, we use square brackets [ ].

The basic structure of [ ] is:

Code
dataframe[row, column]

Let’s start by selecting one column, just like we did with $, but using [ ]:

Code
india[ , "water"]

This returns the same values as india$water; only the syntax differs.

Now let’s select the first row (observation):

Code
india[1, ]
       village female water irrigation
1 GP1_village2      1    10          0

This returns all columns for the first row.

To select the first five rows, run:

Code
india[1:5, ]
       village female water irrigation
1 GP1_village2      1    10          0
2 GP1_village1      1     0          5
3 GP2_village2      1     2          2
4 GP2_village1      1    31          4
5 GP3_village2      0     0          0

We can combine row and column selection. For example, we select the water value for the first row by running:

Code
india[1, "water"]
[1] 10

This tells us that the first village “GP1_village2” has 10 new (or repaired) water facilities after random assignment.

We can achieve the same thing by running:

Code
india$water[1]
[1] 10

Here, R already knows that we are working with the water column because $water explicitly extracts that column from the dataset.
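Square brackets can also select several rows and several columns at once. The sketch below uses a small stand-in data frame, india_small (a hypothetical name), built from the first five rows printed earlier in this seminar, so that it runs without the CSV file.

```r
# Stand-in data: the first five rows of india as printed in this seminar
india_small <- data.frame(
  village    = c("GP1_village2", "GP1_village1", "GP2_village2",
                 "GP2_village1", "GP3_village2"),
  female     = c(1, 1, 1, 1, 0),
  water      = c(10, 0, 2, 31, 0),
  irrigation = c(0, 5, 2, 4, 0)
)

# Rows 1 to 3, and two columns selected by name with c()
first_three <- india_small[1:3, c("village", "water")]
first_three

# A single cell: row 4 of the water column, written two ways
water_row4_brackets <- india_small[4, "water"]
water_row4_dollar   <- india_small$water[4]
```

Both single-cell expressions return 31, the water value for the fourth village.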

Relational Operators

Relational operators allow us to test logical conditions. For example:

  • == tests if two values are equal.

If we run:

Code
3 == 3
[1] TRUE

R lets us know that it is TRUE that 3 equals 3.

If we instead run:

Code
3 == 4
[1] FALSE

R returns a FALSE, indicating that 3 is not equal to 4.

We can use relational operators to check values in a dataset. For instance, if we wanted to determine which villages had exactly 10 new water facilities, we would run:

Code
india$water == 10

R checks every value in the water column and returns TRUE if the value is 10 and FALSE if it is not 10.
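R has other relational operators besides ==, and they are applied to a column in the same element-by-element way. The sketch below uses a short vector, water_first5 (a hypothetical name), holding the water values of the first five villages printed earlier. Because R treats TRUE as 1 and FALSE as 0, wrapping a comparison in sum() counts how many values satisfy it.

```r
# Water values of the first five villages, as printed earlier
water_first5 <- c(10, 0, 2, 31, 0)

water_first5 == 10   # equal to:                 TRUE FALSE FALSE FALSE FALSE
water_first5 != 0    # not equal to:             TRUE FALSE  TRUE  TRUE FALSE
water_first5 > 2     # greater than:             TRUE FALSE FALSE  TRUE FALSE
water_first5 >= 2    # greater than or equal to: TRUE FALSE  TRUE  TRUE FALSE
water_first5 < 10    # less than:               FALSE  TRUE  TRUE FALSE  TRUE
water_first5 <= 10   # less than or equal to:    TRUE  TRUE  TRUE FALSE  TRUE

# Counting TRUEs: how many of these villages have more than 2 facilities?
n_more_than_two <- sum(water_first5 > 2)
n_more_than_two
```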

Now let’s combine subsetting with a relational operator to select all rows where female == 1, that is, villages with a female council head:

Code
india[india$female == 1, ]

What if we only wanted to see the number of new water facilities in villages with female leaders? We would run:

Code
india[india$female == 1, "water"]

We can achieve the same thing by running:

Code
india$water[india$female == 1]

Here, india$water extracts only the water column and [india$female == 1] applies the logical condition directly to this column.
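To convince yourself that the two forms are interchangeable, you can compare their results with identical(). The sketch below does this on a stand-in data frame, india_small (a hypothetical name), built from the first five rows printed earlier in this seminar.

```r
# Stand-in data: the first five rows of india as printed in this seminar
india_small <- data.frame(
  village    = c("GP1_village2", "GP1_village1", "GP2_village2",
                 "GP2_village1", "GP3_village2"),
  female     = c(1, 1, 1, 1, 0),
  water      = c(10, 0, 2, 31, 0),
  irrigation = c(0, 5, 2, 4, 0)
)

# Form 1: filter the rows of the data frame, then pick the column
water_form1 <- india_small[india_small$female == 1, "water"]

# Form 2: pick the column first, then filter its values
water_form2 <- india_small$water[india_small$female == 1]

same_result <- identical(water_form1, water_form2)
same_result

# Number of rows that satisfy the condition
n_female_led <- nrow(india_small[india_small$female == 1, ])
n_female_led
```

In this five-row stand-in, four villages have female == 1, so both forms return the four values 10, 0, 2 and 31.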

Creating new variables using ifelse()

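The ifelse() function allows us to create new variables based on a logical test. It works like a simple “if-else” statement: for each observation, it returns the first value if the condition is TRUE and the second value if it is FALSE: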

Code
ifelse(condition, value_if_true, value_if_false)

Let’s create a new variable called ten_water that categorizes villages as either:

  • 1 if they have exactly 10 new water facilities.
  • 0 if they have any other number of new water facilities.

To do this, we run:

Code
india$ten_water <- ifelse(india$water == 10, 1, 0)

Let’s break this down:

  • india$ten_water <-: Create a new column called ten_water inside the india dataframe and store the values there.
  • india$water == 10: Check if the number of new water facilities (water) is equal to 10 for each village.
  • 1: Assign 1 to ten_water if the condition india$water==10 is TRUE.
  • 0: Assign 0 to ten_water if the condition india$water==10 is FALSE.

Take a look at the first few observations of the dataframe using head() to ensure that the new binary variable was created correctly:

Code
head(india)
       village female water irrigation ten_water
1 GP1_village2      1    10          0         1
2 GP1_village1      1     0          5         0
3 GP2_village2      1     2          2         0
4 GP2_village1      1    31          4         0
5 GP3_village2      0     0          0         0
6 GP3_village1      0     0          0         0
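Besides looking at head(), a new binary variable can be checked programmatically. The sketch below recreates ten_water on a stand-in data frame, india_small (a hypothetical name), built from the first five rows printed earlier, and then confirms that the variable equals 1 exactly where water is 10.

```r
# Stand-in data: the first five rows of india as printed in this seminar
india_small <- data.frame(
  village    = c("GP1_village2", "GP1_village1", "GP2_village2",
                 "GP2_village1", "GP3_village2"),
  female     = c(1, 1, 1, 1, 0),
  water      = c(10, 0, 2, 31, 0),
  irrigation = c(0, 5, 2, 4, 0)
)

# ifelse() evaluates the condition once per row
india_small$ten_water <- ifelse(india_small$water == 10, 1, 0)

# Check 1: how many villages fall into each category?
table(india_small$ten_water)

# Check 2: ten_water is 1 exactly when water equals 10
coding_ok <- all((india_small$ten_water == 1) == (india_small$water == 10))
coding_ok
```

The same two checks can be run on the full india data frame after creating india$ten_water.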

Computing the Difference-in-Means Estimator

The Difference-in-Means Estimator estimates the average effect of a treatment by comparing the average outcome in the treatment group with the average outcome in the control group.

For our dataset:

  • Treatment group: Villages with a female council head (female == 1).

  • Control group: Villages without a female council head (female == 0).

  • Outcome variable: Number of new or repaired water facilities.

The formula is:

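\[ \text{difference-in-means estimator} = \overline{Y}_{\text{treatment group}} - \overline{Y}_{\text{control group}} \]

where:

  • \( \overline{Y}_{\text{treatment group}} \) = mean outcome for treatment group

  • \( \overline{Y}_{\text{control group}} \) = mean outcome for control group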

Considering that the dataset comes from a randomized experiment, the treatment and control groups should, on average, be identical in both observed and unobserved pre-treatment characteristics due to random assignment.

Therefore, we can use the Difference-in-Means Estimator to estimate the average causal effect of having a female politician on the number of new (or repaired) water facilities.
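The estimator can be written as a small helper function. This is only a sketch: the function name difference_in_means and the stand-in data frame india_small are hypothetical, and the helper assumes the treatment variable is coded 1 for treated and 0 for control. Because india_small contains just the first five rows printed earlier, the number it returns illustrates the calculation and is not an estimate for the full dataset.

```r
# Hypothetical helper: mean outcome of treated units minus mean outcome of controls.
# Assumes treatment is coded 1 (treated) and 0 (control).
difference_in_means <- function(outcome, treatment) {
  mean(outcome[treatment == 1]) - mean(outcome[treatment == 0])
}

# Stand-in data: the first five rows of india as printed in this seminar
india_small <- data.frame(
  village    = c("GP1_village2", "GP1_village1", "GP2_village2",
                 "GP2_village1", "GP3_village2"),
  female     = c(1, 1, 1, 1, 0),
  water      = c(10, 0, 2, 31, 0),
  irrigation = c(0, 5, 2, 4, 0)
)

# Treated mean: (10 + 0 + 2 + 31) / 4 = 10.75; control mean: 0
toy_estimate <- difference_in_means(india_small$water, india_small$female)
toy_estimate
```

With the full data, the call would be difference_in_means(india$water, india$female), which performs the same two steps that are carried out one at a time in the rest of this section.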

1. Compute the mean for each group

First, let’s compute the means separately for each group:

  • Calculate the average number of new or repaired water facilities in villages with a female politician:
Code
mean_water_female <- mean(india$water[india$female == 1])

mean_water_female #Print the result
[1] 23.99074

Interpretation: the average number of new (or repaired) water facilities in villages with a female politician (female == 1) is approximately 24 facilities.

  • Calculate the average number of new or repaired water facilities in villages with a male politician:
Code
mean_water_male <- mean(india$water[india$female == 0])

mean_water_male #Print the result
[1] 14.73832

Interpretation: the average number of new (or repaired) water facilities in villages with a male politician (female == 0) is approximately 15 facilities.

  • Tidyverse Alternative:

Instead of computing means separately, we can make the process more efficient using summarise() along with group_by(), a function that allows us to compute statistics by group:

Code
library(tidyverse) #Load the tidyverse package
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
india %>%
  group_by(female) %>%
  summarise(mean_water = mean(water)) 
# A tibble: 2 × 2
  female mean_water
   <int>      <dbl>
1      0       14.7
2      1       24.0
  • group_by(female): Groups the data into villages with female politicians (female == 1) and villages without female politicians (female == 0).
  • summarise(mean_water = mean(water)): Computes the average number of new (or repaired) water facilities for each group.
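Grouped means can also be computed in base R, without loading any package, using tapply(). The sketch below is an alternative shown for comparison, not the approach used in this seminar. It runs on a stand-in data frame, india_small (a hypothetical name), built from the first five rows printed earlier, so its numbers differ from the full-data results above.

```r
# Stand-in data: the first five rows of india as printed in this seminar
india_small <- data.frame(
  village    = c("GP1_village2", "GP1_village1", "GP2_village2",
                 "GP2_village1", "GP3_village2"),
  female     = c(1, 1, 1, 1, 0),
  water      = c(10, 0, 2, 31, 0),
  irrigation = c(0, 5, 2, 4, 0)
)

# tapply(values, groups, function): apply mean() to water within each value of female
group_means <- tapply(india_small$water, india_small$female, mean)
group_means   # one mean per group, labelled "0" and "1"

# The group labels can be used to pull out each mean by name
toy_difference <- unname(group_means["1"] - group_means["0"])
toy_difference
```

The result is a named vector with one entry per group, which plays the same role as the two-row tibble returned by group_by() and summarise().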

2. Compute the difference in means

The difference-in-means estimator is calculated by subtracting the average outcome for the control group from the average outcome for the treatment group:

Code
diff_in_means <- mean_water_female - mean_water_male
diff_in_means #Print the result
[1] 9.252423

Interpretation: having a female politician increases the number of new (or repaired) water facilities by about 9 facilities, on average.

Exercises

  1. Create a new variable called high_water that equals:

    • 1 if a village has more than the mean number of new (or repaired) water facilities.
    • 0 otherwise.
  2. How many villages have more than the average number of new (or repaired) water facilities?

  3. Find the proportion of villages where high_water == 1 for female-led vs. male-led villages. Hint: Use group_by() and summarise().

  4. Calculate the average causal effect of having a female politician on the number of new (or repaired) irrigation facilities. Specify the following:

    a. What is the assumption we are making when estimating the average causal effect?
    b. Why is this a reasonable assumption?
    c. What is the treatment?
    d. What is the outcome?
    e. What is the direction, size, and unit of measurement of the average causal effect?
Solution
Code
# 1. Create high_water variable
india$high_water <- ifelse(india$water > mean(india$water), 1, 0)

# 2. Count villages with high_water == 1
sum(india$high_water)
[1] 90
Code
# 90 villages have more than the average number of drinking water facilities.

# 3. Compute the proportion of high_water villages by leader gender
india %>%
  group_by(female) %>%
  summarise(proportion_high_water = mean(high_water))
# A tibble: 2 × 2
  female proportion_high_water
   <int>                 <dbl>
1      0                 0.280
2      1                 0.278
Code
# 28.0% of male-led villages have above-average drinking water facilities.
# 27.8% of female-led villages have above-average drinking water facilities.

# 4. Compute the Difference-in-Means Estimator for irrigation facilities
mean_irrigation_female <- mean(india$irrigation[india$female == 1])
mean_irrigation_male <- mean(india$irrigation[india$female == 0])

diff_in_means_irrigation <- mean_irrigation_female - mean_irrigation_male
diff_in_means_irrigation
[1] -0.3693319
Code
#a. Female-led villages are comparable to male-led villages (no confounders).

#b. The data come from a randomized experiment, so random assignment makes the two groups comparable, on average, before treatment.

#c. Having a female politician in the village

#d. The number of new (or repaired) irrigation facilities

#e. Having a female politician decreases the number of new or repaired irrigation facilities by 0.37 facilities, on average.