Seminar 2

Author

Sebastian Koehler

Published

February 5, 2026

Materials presented here build on the resources for instructors designed by Elena Llaudet and Kosuke Imai in Data Analysis for Social Science: A Friendly and Practical Introduction (Princeton University Press).

Seminar Objectives

This week, we will cover the following topics:

  • Packages
  • Dataframes
  • Observations
  • Variables
  • Summary statistics

Getting started

  1. Download and save india.csv in the data folder you created last week.

  2. Start by creating an R script to keep track of your code. In RStudio, you can open a new script by clicking File > New File > R Script.

  3. Save your script by clicking File > Save As and saving it in your POL272 folder with the name seminar2.R.

  4. Clear your environment to avoid operating with objects from previous work by mistake. You can do this by clicking on the broom icon in the Environment tab.

  5. Set the working directory to your POL272 folder as you did last week so R can access and save files there:

Code
setwd("~/Desktop/POL272")
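
To confirm that the working directory is set as intended, a quick check along the following lines can help. This is only a minimal sketch using base R functions; the folder it reports depends on your own computer, and the variable names current_dir and has_data_folder are chosen here for illustration.

```r
# Print the folder R is currently working in
current_dir <- getwd()
print(current_dir)

# Show only the last part of the path; after the setwd() call above,
# this should be "POL272" if that folder exists at that location
print(basename(current_dir))

# Check whether a sub-folder called "data" exists in the working directory
has_data_folder <- dir.exists("data")
print(has_data_folder)
```

If the last line prints FALSE, R is probably looking in a different folder from the one that contains your data folder.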

Packages

We will need extra tools to make working with data easier in R, so we will have to install some packages. A package in R is like an app on your phone. It adds new tools and features that don’t come with R by default.

One of the packages we will use is tidyverse. It includes several tools that make working with data easier, especially for tasks like cleaning, transforming, and visualizing data.

Installing packages

We install packages from the Comprehensive R Archive Network (CRAN), which is like an app store for R packages.

To install a package from CRAN, we use the install.packages() function, making sure to put the package name in quotes:

Code
install.packages("tidyverse")

R will download the package and print a lot of text in the console as it installs. When it’s done, you’ll see a message like: The downloaded binary packages are in... followed by a long directory name.

You only need to install a package once on your computer.
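
If you are unsure whether a package is already installed, one way to check before installing is sketched below. The helper needs_install() is a hypothetical name defined here for illustration, not a built-in R function; it relies on requireNamespace(), which is part of base R.

```r
# Hypothetical helper (not part of R): returns TRUE if a package
# is NOT yet installed on this computer
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Report whether tidyverse still needs to be installed
if (needs_install("tidyverse")) {
  message("tidyverse is not installed yet; run install.packages(\"tidyverse\")")
} else {
  message("tidyverse is already installed; no need to install it again")
}
```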

Loading packages

Even though we installed tidyverse, we still need to load it to tell R we want to use it in this session.

To load a package in R, use the library() function (without quotes now):

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Important:

  • Installing packages using install.packages() happens once.
  • Loading using library() happens every time you start a new R session.

We will revisit the use of tidyverse functions at the end of the seminar with a few examples.

Loading data

We will analyse data from an experiment conducted in India, where villages were randomly assigned to have a female council head. The data come from Raghabendra Chattopadhyay and Esther Duflo (2004), “Women as Policy Makers: Evidence from a Randomized Policy Experiment in India,” Econometrica, 72(5): 1409–43.

Table 1 shows the names and descriptions of the variables in this dataset, where the unit of observation is villages.

Table 1: Variables in “india.csv”

Variable     Description
village      village identifier (“Gram Panchayat number_village number”)
female       whether village was assigned a female politician: 1=yes, 0=no
water        number of new (or repaired) drinking water facilities in the village since random assignment
irrigation   number of new (or repaired) irrigation facilities in the village since random assignment

To load this dataset, use the read.csv() function as shown below:

Code
india <- read.csv("data/india.csv")

This function loads the data stored in “india.csv” into R and assigns it to a new object called india using the assignment operator <-, which you learned about last week. After running this line, you should see the india object appear in the Environment.

Note: data/ tells R to look inside a folder called data, which is located in your working directory. If the file were in the main folder POL272 (not in a sub-folder like data), we would simply run:

Code
india <- read.csv("india.csv")
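
If read.csv() reports that it cannot open the file, the path is usually the problem. The sketch below checks whether the file can be found before trying to load it. It assumes the same folder layout as above (a data folder inside the working directory); csv_path is a name chosen here for illustration.

```r
# Build the path relative to the working directory;
# file.path("data", "india.csv") gives "data/india.csv"
csv_path <- file.path("data", "india.csv")

if (file.exists(csv_path)) {
  # The file is where we expect it, so load it
  india <- read.csv(csv_path)
} else {
  # Otherwise, report where R was looking
  message("Cannot find ", csv_path, " from ", getwd())
}
```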

To explore the contents of the dataset, you can type its name in the R script and run it. However, this displays the entire dataset in the R console, which can be overwhelming for larger datasets.

A better option is to click the dataset name in the Environment tab to open a spreadsheet-style viewer or use the View() function:

Code
View(india)

Remember, R is case-sensitive, and this function starts with an uppercase V.

If you only need a quick look at the first few rows, use the head() function:

Code
head(india)
       village female water irrigation
1 GP1_village2      1    10          0
2 GP1_village1      1     0          5
3 GP2_village2      1     2          2
4 GP2_village1      1    31          4
5 GP3_village2      0     0          0
6 GP3_village1      0     0          0

By default, head() shows the first six rows. To customize this, add the n argument for the number of rows you want to see. For example, to display the first three rows:

Code
head(india, n=3)
       village female water irrigation
1 GP1_village2      1    10          0
2 GP1_village1      1     0          5
3 GP2_village2      1     2          2

Observations

  • What is an observation?
Show Solution

An observation is the information collected from a particular individual or entity in the study. In a dataframe, each row represents an observation.
  • What is the unit of observation in the india dataset?
Show Solution

The unit of observation represents the individuals or entities about which information is recorded.

To determine the unit of observation, ask: What does each row represent? Is it an individual, a household, a country, an event, or something else?

The unit of observation in india is villages. Hence, every row of data in the india dataframe represents a different village in the study.
  • How many observations are in the india dataset? In other words, how many villages were part of this experiment?

To identify the number of observations (rows) in the dataset, you can use the nrow() function:

Code
nrow(india)
[1] 322

There are 322 observations in the dataset. In other words, 322 villages were part of this study.

Variables

  • What is a variable?
Show Solution

A variable is a characteristic or piece of information that can vary across observations in a dataset. For example:

  • In a dataset about people, a variable could be age, gender, or height.

  • In a dataset about countries, a variable could be population size, GDP, or continent.

Variables are represented as columns in a dataframe.

  • How many variables are in the india dataset?

To identify the number of variables (columns) in the dataset, you can use the ncol() function:

Code
ncol(india)
[1] 4

There are 4 variables in the dataset.

To identify both the number of observations (rows) and variables (columns) in a dataset, we could also use the dim() function, which stands for “dimensions”:

Code
dim(india)
[1] 322   4
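
To see how nrow(), ncol() and dim() relate to each other, here is a small sketch on a toy dataframe. The object india_head is an assumption made for illustration: it contains only the first six rows of india, copied from the head() output above, so its row count differs from the full dataset.

```r
# Toy dataframe: the first six rows of india, copied from the head() output
india_head <- data.frame(
  village    = c("GP1_village2", "GP1_village1", "GP2_village2",
                 "GP2_village1", "GP3_village2", "GP3_village1"),
  female     = c(1, 1, 1, 1, 0, 0),
  water      = c(10, 0, 2, 31, 0, 0),
  irrigation = c(0, 5, 2, 4, 0, 0)
)

n_rows <- nrow(india_head)   # number of observations (6 in this toy version)
n_cols <- ncol(india_head)   # number of variables (4)
dims   <- dim(india_head)    # rows first, then columns

print(dims)
print(names(india_head))     # the variable (column) names
```

The output of dim() always lists the number of rows first and the number of columns second.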

Calculating Summary Statistics

Let’s start off by calculating the total number of new water facilities using the sum() function that we learned last week. The only required argument is the code identifying the variable (column) in the dataset:

Code
sum(india$water)
[1] 5745

The total number of new or repaired drinking water facilities since random assignment is 5,745.

The dollar sign ($) is used to access a specific column inside the india dataframe. In this case, we are selecting the water column from india and adding up all the values in that column.
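
As a small illustration of what $ returns, the sketch below uses a toy dataframe with the water values from the first six rows shown by head() earlier. The name india_head is an assumption made for this example, and the double-bracket form is shown only as an equivalent alternative in base R.

```r
# Toy dataframe with the first six villages from the head() output
india_head <- data.frame(
  village = c("GP1_village2", "GP1_village1", "GP2_village2",
              "GP2_village1", "GP3_village2", "GP3_village1"),
  water   = c(10, 0, 2, 31, 0, 0)
)

# $ pulls one column out of the dataframe as a vector of values
water_dollar <- india_head$water
print(water_dollar)

# Double brackets with the column name in quotes do the same thing
water_bracket <- india_head[["water"]]
print(identical(water_dollar, water_bracket))

# Functions such as sum() then work on that vector
toy_total <- sum(water_dollar)   # 10 + 0 + 2 + 31 + 0 + 0 = 43
print(toy_total)
```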

Now, let’s use the mean() function to calculate the average number of new water facilities in a similar way:

Code
mean(india$water)
[1] 17.84161

Let’s use another function we learned last week to round this mean to a whole number:

Code
round(mean(india$water), digits = 0) 
[1] 18

The average number of new or repaired drinking water facilities since random assignment is 18 per village.

This approach of nesting functions inside each other might be a bit hard to read. Instead, we can use the pipe operator (%>%) from tidyverse to make the code easier to follow.

Code
india$water %>%       #get the water column from india.
  mean() %>%          #then calculate its mean.
  round(digits = 1)   #then round the result to 1 decimal place.
[1] 17.8

Think of the pipe (%>%) as saying “then”. Instead of writing everything inside one big function, the pipe lets us chain steps together one by one, in an order that’s easy to understand. Here’s what we’re telling R to do:

  • Get the water variable from india.
  • Then calculate its mean.
  • Then round the result to 1 decimal place.
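
To check that the piped version and the nested version really do the same thing, we can compare them directly. The sketch below uses a toy vector, water_six, holding the water values of the first six villages from the head() output; it is an assumption made for illustration, so the result differs from the full dataset.

```r
library(dplyr)  # part of tidyverse; makes the %>% pipe available

# Water values for the first six villages shown by head()
water_six <- c(10, 0, 2, 31, 0, 0)

# Nested version: read from the inside out
nested <- round(mean(water_six), digits = 1)

# Piped version: read from top to bottom
piped <- water_six %>%
  mean() %>%
  round(digits = 1)

print(nested)
print(piped)
print(identical(nested, piped))  # both approaches give the same number
```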

The tidyverse summarise() function

Now that we’ve seen how to calculate simple statistics like sum and mean, let’s take it a step further.

Instead of using $ to extract columns and calling mean() or sum() separately, we can use summarise() from tidyverse to calculate both in one step:

Code
india %>%
  summarise(sum(water), mean(water))
  sum(water) mean(water)
1       5745    17.84161

What’s happening here?

  • india %>%: Passes the india dataset to the summarise() function, telling R to perform the calculations on this dataset.
  • sum(water): Calculates the total number of new water facilities.
  • mean(water): Calculates the average number of new water facilities.

The result is a clean dataframe that combines your summary statistics.

Let’s make it even cleaner. By default, summarise() names the columns after the code that produced them, such as sum(water) and mean(water), which can be hard to read.

To make the output clearer, we can assign custom names to our calculations:

Code
india %>%
  summarise(
    total_water = sum(water),   #Name the sum as "total_water"
    average_water = mean(water) #Name the mean as "average_water"
  )
  total_water average_water
1        5745      17.84161

Now, the result is easier to understand!
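
The same pattern extends to as many named statistics as we need. As a hedged sketch, the example below runs summarise() on a toy dataframe, india_head, built from the first six rows shown by head() (an assumption for illustration, so the numbers differ from the full dataset). It also uses n(), a dplyr function not covered above, which counts the rows.

```r
library(dplyr)  # part of tidyverse; provides summarise(), n() and %>%

# Toy dataframe: first six rows of india, copied from the head() output
india_head <- data.frame(
  female = c(1, 1, 1, 1, 0, 0),
  water  = c(10, 0, 2, 31, 0, 0)
)

toy_summary <- india_head %>%
  summarise(
    n_villages   = n(),          # number of rows (villages)
    total_water  = sum(water),   # total water facilities
    share_female = mean(female)  # proportion of villages with female = 1
  )

print(toy_summary)
```

Each named argument becomes one column in the resulting one-row dataframe.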

Exercises

  1. Use summarise() to calculate the average number of new or repaired irrigation facilities. Interpret the result.

  2. Use summarise() to calculate the average of the variable female. Interpret the result.

  3. Use summarise() to calculate the minimum and maximum number of new water facilities in the dataset. Hint: Open the help file for summarise() and look under “Useful functions” to find functions for range (minimum and maximum).

Show Solution
Code
#1. Calculate the average number of new irrigation facilities.

india %>%
  summarise(avg_irrigation = mean(irrigation))  
  avg_irrigation
1       3.263975
Code
#The average number of new (or repaired) irrigation facilities per village is 3.26.

#2. Calculate the average of the variable female.

india %>%
  summarise(avg_female = mean(female))  
  avg_female
1  0.3354037
Code
#About 34% of the villages in the experiment were randomly assigned to have a female politician.
#Because female is binary (1=yes, 0=no), its mean is the proportion of villages coded 1.
#Multiplying that proportion by 100 gives a percentage (0.3354*100=33.54%), which rounds to 34%.

#3. Calculate the minimum and maximum number of new water facilities.

india %>%
  summarise(
    min_water = min(water),  
    max_water = max(water)   
  )
  min_water max_water
1         0       340
Code
#Some villages (min_water = 0) had no new or repaired water facilities. 
#The village with the highest number of new or repaired water facilities had 340.

Homework (Optional)

R comes with several built-in datasets that you can explore without needing to download any files. One of these is mtcars, which contains information about different car models and their characteristics.

Since mtcars is built-in, you don’t need to download anything—just run this command to load it:

Code
data(mtcars)

  1. Get an overview of the dataset using the help file.

  2. View the first five rows of the dataset.

  3. What is the unit of observation in mtcars?

  4. How many observations (rows) and variables (columns) are in the dataset?

  5. Calculate the average miles per gallon (mpg) across all cars using summarise().

  6. Find the minimum and maximum horsepower (hp) in the dataset using summarise().

  7. What type of variable is am?

  8. Calculate the mean of am using summarise(). Interpret the result.

Show Solution
Code
#1. Get an overview of the dataset using the help file

?mtcars 

#2. View the first five rows of the dataset

head(mtcars, n = 5)  
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Code
#3. What is the unit of observation in mtcars?

#Each row represents a different car model

#4. How many observations (rows) and variables (columns) are in the mtcars dataset?

dim(mtcars)  # 32 observations and 11 variables
[1] 32 11
Code
#5. Calculate the average miles per gallon (mpg) across all cars.

mtcars %>%
  summarise(avg_mpg = mean(mpg))  
   avg_mpg
1 20.09062
Code
#6. Find the minimum and maximum horsepower (hp) in the dataset.

mtcars %>%
  summarise(
    min_hp = min(hp),  # Finds the minimum horsepower
    max_hp = max(hp)   # Finds the maximum horsepower
  )
  min_hp max_hp
1     52    335
Code
#7. What type of variable is `am`?

#`am` is binary (0 = Automatic, 1 = Manual)

#8. Calculate the mean of `am` and interpret the result.

mtcars %>%
  summarise(mean_am = mean(am))  
  mean_am
1 0.40625
Code
#This gives the proportion of cars with manual transmission (am = 1). 
#40.6% of cars in the dataset have manual transmission.
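
To see why the mean of a binary variable can be read as a proportion, we can compute the proportion by hand and compare. This is a minimal sketch using the built-in mtcars data; the object names n_manual, n_cars and prop_manual are chosen here for illustration.

```r
# Count the cars with manual transmission (am = 1)
n_manual <- sum(mtcars$am == 1)

# Count all cars in the dataset
n_cars <- nrow(mtcars)

# Proportion of manual cars, computed by hand
prop_manual <- n_manual / n_cars
print(prop_manual)

# The mean of the 0/1 variable gives the same number
print(isTRUE(all.equal(prop_manual, mean(mtcars$am))))
```

Adding up a column of 0s and 1s counts the 1s, so dividing that sum by the number of rows is exactly what mean() does.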