Sebastian Koehler
February 6, 2026
Materials presented here build on the resources for instructors designed by Elena Llaudet and Kosuke Imai in Data Analysis for Social Science: A Friendly and Practical Introduction (Princeton University Press).
This week, we will cover the following topics:
ifelse()
Start by creating an R script to keep track of your code. In RStudio, you can open a new script by clicking File > New File > R Script.
Save your script by clicking File > Save As and saving it in your POL272 folder with the name seminar3.R.
Clear your environment to avoid operating with objects from previous work by mistake. You can do this by clicking on the broom icon in the Environment tab.
Set the working directory by clicking on Session > Set Working Directory > Choose Directory. Navigate to your POL272 folder and click Open.
When you select the folder, RStudio will print the setwd(...) command in the Console. Copy and paste the command into your script so that R automatically sets the correct directory every time you run your script.
Load the dataset using the read.csv() function: india <- read.csv("data/india.csv")
Note: data/ tells R to look inside a folder called data, which is located in your working directory. If the file were in the main folder POL272 (not in a sub-folder like data), we would simply run india <- read.csv("india.csv").
Last week, we learned about the $ operator, which allows us to extract a single column from a dataset.
For example, we used india$water to extract the water column from the india dataset.
What if we wanted to select both rows and columns together?
→ To do this, we use square brackets [ ].
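The basic structure of [ ] is dataset[rows, columns]: what goes before the comma selects rows, and what goes after it selects columns. Leaving a position empty selects everything along that dimension.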
Let’s start by selecting one column, just like we did with $, but using [ ]:
This gives the same result as india$water but it’s written differently.
Now let’s select the first row (observation):
This returns all columns for the first row.
To select the first five rows, run:
       village female water irrigation
1 GP1_village2      1    10          0
2 GP1_village1      1     0          5
3 GP2_village2      1     2          2
4 GP2_village1      1    31          4
5 GP3_village2      0     0          0
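The bracket selections described above can be sketched as follows. This is an illustrative reconstruction, not necessarily the exact code used in the seminar. The object india_toy is a hypothetical stand-in built from the five rows printed above, so the example runs without data/india.csv; with the real data you would use india instead.

```r
# Hypothetical stand-in for the india data: the five rows printed above
india_toy <- data.frame(
  village = c("GP1_village2", "GP1_village1", "GP2_village2",
              "GP2_village1", "GP3_village2"),
  female = c(1, 1, 1, 1, 0),
  water = c(10, 0, 2, 31, 0),
  irrigation = c(0, 5, 2, 4, 0)
)

# One column: leave the row position (before the comma) empty
water_col <- india_toy[, "water"]
identical(water_col, india_toy$water)  # compares the result with the $ version

# One row: leave the column position (after the comma) empty
first_row <- india_toy[1, ]

# A range of rows: 1:5 is shorthand for c(1, 2, 3, 4, 5)
first_five <- india_toy[1:5, ]
first_five
```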
We can combine row and column selection. For example, we select the water value for the first row by running:
This tells us that the first village “GP1_village2” has 10 new (or repaired) water facilities after random assignment.
We can achieve the same thing by running:
Here, R already knows that we are working with the water column because $water explicitly extracts that column from the dataset.
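As a minimal sketch of the two equivalent approaches, the code below selects the water value of the first village both ways. It assumes a hypothetical stand-in, india_toy, holding the village and water values of the five rows printed earlier; selecting the column by name ("water") is one possible way to write the bracket version.

```r
# Hypothetical stand-in: village and water values of the first five rows
india_toy <- data.frame(
  village = c("GP1_village2", "GP1_village1", "GP2_village2",
              "GP2_village1", "GP3_village2"),
  water = c(10, 0, 2, 31, 0)
)

# Row 1, column "water": row and column are both given inside [ ]
by_brackets <- india_toy[1, "water"]

# Extract the column first with $, then take its first element
by_dollar <- india_toy$water[1]

by_brackets  # 10
by_dollar    # 10
```

Note that after $water there is no comma inside the brackets: a single column is a vector, so it has only one dimension to index.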
Relational operators allow us to test logical conditions. For example:
== tests if two values are equal.
If we run 3 == 3, R returns TRUE, because 3 equals 3.
If we instead run 3 == 4, R returns FALSE, because 3 is not equal to 4.
We can use relational operators to check values in a dataset. For instance, if we wanted to determine which villages had exactly 10 new water facilities, we would run:
R checks every value in the water column and returns TRUE if the value is 10 and FALSE if it is not 10.
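The comparisons described above can be tried out with a short sketch. The vector water below is a hypothetical stand-in containing the water values of the five rows printed earlier; with the real data, the same test would be written on india$water.

```r
# Comparing two single values
3 == 3  # TRUE
3 == 4  # FALSE

# Comparing every element of a vector against one value
water <- c(10, 0, 2, 31, 0)  # water values of the five rows printed earlier
is_ten <- water == 10
is_ten  # TRUE FALSE FALSE FALSE FALSE

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
n_ten <- sum(is_ten)
n_ten   # 1
```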
Now let’s combine subsetting and relational operators to select all rows where female == 1:
What if we only wanted to see the number of new water facilities in villages with female leaders? We would run:
We can achieve the same thing by running:
Here, india$water extracts only the water column and [india$female == 1] applies the logical condition directly to this column.
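A minimal sketch of these three selections is shown below. It uses a hypothetical stand-in, india_toy, built from the five rows printed earlier; the exact code used in the seminar may have been written differently.

```r
# Hypothetical stand-in for the india data (first five rows only)
india_toy <- data.frame(
  village = c("GP1_village2", "GP1_village1", "GP2_village2",
              "GP2_village1", "GP3_village2"),
  female = c(1, 1, 1, 1, 0),
  water = c(10, 0, 2, 31, 0)
)

# All columns, but only the rows where the condition is TRUE
female_rows <- india_toy[india_toy$female == 1, ]

# Same rows, but only the water column
water_brackets <- india_toy[india_toy$female == 1, "water"]

# Extract the column first, then apply the condition to it
water_dollar <- india_toy$water[india_toy$female == 1]

water_brackets  # 10 0 2 31
water_dollar    # 10 0 2 31
```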
ifelse()
The ifelse() function allows us to create new variables based on a logical test. It works like a simple “if-then” statement: ifelse(test, value if TRUE, value if FALSE).
Let’s create a new variable called ten_water that categorizes villages as either:
1 if they have exactly 10 new water facilities.
0 if they have any other number of new water facilities.
To do this, we run:
Let’s break this down:
india$ten_water <-: Create a new column called ten_water inside the india dataframe and store the values there.
india$water == 10: Check whether the number of new water facilities (water) is equal to 10 for each village.
1: Assign 1 to ten_water if the condition india$water == 10 is TRUE.
0: Assign 0 to ten_water if the condition india$water == 10 is FALSE.
Take a look at the first few observations of the dataframe using head() to ensure that the new binary variable was created correctly:
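Putting the pieces of this breakdown together, the call could look like the sketch below. To keep the example runnable without data/india.csv, it uses a hypothetical stand-in, india_toy, that holds only the water values of the five rows printed earlier.

```r
# Hypothetical stand-in: water values of the first five rows
india_toy <- data.frame(water = c(10, 0, 2, 31, 0))

# 1 where water equals 10, 0 everywhere else
india_toy$ten_water <- ifelse(india_toy$water == 10, 1, 0)

# Inspect the first rows to check the new column against water
head(india_toy)
```

In this toy data only the first village has exactly 10 facilities, so ten_water is 1 in the first row and 0 in the others.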
The Difference-in-Means Estimator measures the impact of a treatment by comparing the average outcomes between the treatment group and the control group.
For our dataset:
Treatment group: Villages with a female council head (female == 1).
Control group: Villages without a female council head (female == 0).
Outcome variable: Number of new or repaired water facilities.
The formula is:
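\[ \text{Difference-in-Means Estimator} = \bar{Y}_{\text{treatment group}} - \bar{Y}_{\text{control group}} \]
where:
\(\bar{Y}_{\text{treatment group}}\) = mean outcome for treatment group
\(\bar{Y}_{\text{control group}}\) = mean outcome for control group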
Considering that the dataset comes from a randomized experiment, the treatment and control groups should, on average, be identical in both observed and unobserved pre-treatment characteristics due to random assignment.
Therefore, we can use the Difference-in-Means Estimator to estimate the average causal effect of having a female politician on the number of new (or repaired) water facilities.
First, let’s compute the means separately for each group:
[1] 23.99074
Interpretation: the average number of new (or repaired) water facilities in villages with a female politician (female == 1) is about 24 facilities.
[1] 14.73832
Interpretation: the average number of new (or repaired) water facilities in villages with a male politician (female == 0) is about 15 facilities.
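The two group means can be computed by combining mean() with the logical subsetting introduced earlier. The sketch below runs on a hypothetical five-row stand-in, india_toy, so its results differ from the full-data means reported above. The object names mean_water_female and mean_water_male are assumptions chosen for illustration.

```r
# Hypothetical stand-in: female and water values of the first five rows
india_toy <- data.frame(
  female = c(1, 1, 1, 1, 0),
  water = c(10, 0, 2, 31, 0)
)

# Mean outcome in the treatment group (female == 1)
mean_water_female <- mean(india_toy$water[india_toy$female == 1])

# Mean outcome in the control group (female == 0)
mean_water_male <- mean(india_toy$water[india_toy$female == 0])

mean_water_female  # 10.75 in the toy data
mean_water_male    # 0 in the toy data
```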
Tidyverse Alternative:
Instead of computing the means separately, we can make the process more efficient using summarise() along with group_by(), a function that allows us to compute statistics by group:
# A tibble: 2 × 2
female mean_water
<int> <dbl>
1 0 14.7
2 1 24.0
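A grouped summary like the one above could be produced with a pipeline along these lines. This is a sketch under assumptions: it loads only dplyr (one of the tidyverse packages) rather than the whole tidyverse, and it runs on a hypothetical five-row stand-in, india_toy, so the means differ from the full-data table.

```r
library(dplyr)

# Hypothetical stand-in: female and water values of the first five rows
india_toy <- data.frame(
  female = c(1, 1, 1, 1, 0),
  water = c(10, 0, 2, 31, 0)
)

# Split the rows by female, then compute one mean per group
group_means <- india_toy %>%
  group_by(female) %>%
  summarise(mean_water = mean(water))

group_means
```

The result has one row per value of female, with the group mean stored in the mean_water column.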
group_by(female): Groups the data into villages with female politicians (female == 1) and villages without female politicians (female == 0).
summarise(mean_water = mean(water)): Computes the average number of water facilities for each group.
The difference-in-means estimator is calculated by subtracting the average outcome for the control group from the average outcome for the treatment group:
[1] 9.252423
Interpretation: having a female politician increases the number of new or repaired water facilities by about 9 facilities, on average.
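Because the same subtraction is needed for every outcome, it can be convenient to wrap it in a small function. The helper diff_in_means() below is hypothetical (it is not part of the seminar materials or of any package) and assumes the treatment variable is coded 1 for treated and 0 for control units. It runs on a five-row stand-in, india_toy, so the results differ from the full-data estimate reported above.

```r
# Hypothetical helper: treatment-group mean minus control-group mean
diff_in_means <- function(outcome, treatment) {
  mean(outcome[treatment == 1]) - mean(outcome[treatment == 0])
}

# Hypothetical stand-in: first five rows of the data
india_toy <- data.frame(
  female = c(1, 1, 1, 1, 0),
  water = c(10, 0, 2, 31, 0),
  irrigation = c(0, 5, 2, 4, 0)
)

effect_water <- diff_in_means(india_toy$water, india_toy$female)
effect_irrigation <- diff_in_means(india_toy$irrigation, india_toy$female)

effect_water       # 10.75 in the toy data
effect_irrigation  # 2.75 in the toy data
```

Five rows with a single control village are far too few to estimate anything; the toy data only shows the mechanics of the calculation.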
Create a new variable called high_water that equals:
1 if a village has more than the mean number of new (or repaired) water facilities.
0 otherwise.
How many villages have more than the average number of new (or repaired) water facilities?
Find the proportion of villages where high_water == 1 for female-led vs. male-led villages. Hint: Use group_by() and summarise().
Calculate the average causal effect of having a female politician on the number of new (or repaired) irrigation facilities. Specify the following:
[1] 90
# A tibble: 2 × 2
female proportion_high_water
<int> <dbl>
1 0 0.280
2 1 0.278
# 28.0% of male-led villages have above-average drinking water facilities.
# 27.8% of female-led villages have above-average drinking water facilities.
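One possible way to carry out the first three steps is sketched below. It runs on a hypothetical five-row stand-in, india_toy, so its numbers differ from the answers shown above. It also swaps in base R's tapply() for the group_by() and summarise() approach suggested in the hint. Since high_water is coded 0/1, its mean within a group equals the proportion of villages with high_water == 1.

```r
# Hypothetical stand-in: female and water values of the first five rows
india_toy <- data.frame(
  female = c(1, 1, 1, 1, 0),
  water = c(10, 0, 2, 31, 0)
)

# 1. Flag villages above the overall mean of water (8.6 in the toy data)
india_toy$high_water <- ifelse(india_toy$water > mean(india_toy$water), 1, 0)

# 2. Count the flagged villages
n_high <- sum(india_toy$high_water)
n_high  # 2 in the toy data

# 3. Proportion of flagged villages within each value of female
prop_high <- tapply(india_toy$high_water, india_toy$female, mean)
prop_high
```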
# 4. Compute the Difference-in-Means Estimator for irrigation facilities
mean_irrigation_female <- mean(india$irrigation[india$female == 1])
mean_irrigation_male <- mean(india$irrigation[india$female == 0])
diff_in_means_irrigation <- mean_irrigation_female - mean_irrigation_male
diff_in_means_irrigation
[1] -0.3693319
# a. Female-led villages are comparable to male-led villages (no confounders).
# b. Data comes from a randomised experiment.
# c. Having a female politician in the village.
# d. The number of new (or repaired) irrigation facilities.
# e. Having a female politician decreases the number of new or repaired irrigation facilities by 0.37 facilities, on average.