Solutions Mock Exam

Author

Sebastian Koehler

Published

May 19, 2026

Working with the British Election Study

In this exercise you will work with data from the British Election Study. The data are from the 2024 post-election survey, in which randomly selected voters were asked questions about their opinions and behaviour. The dataset is called bes2024.csv and can be downloaded here.

Table 1: Variable description

Variable Name	Question	Values and Labels
B01	Talking with people about the general election on 4th July 2024, we have found that a lot of people didn’t manage to vote. How about you, did you manage to vote in the general election?	1 = Yes, voted, 0 = No, did not vote
A03	How interested would you say you are in politics? Would you say you are…	1 = Very interested, 2 = Fairly interested, 3 = Not very interested, 4 = Not at all interested
B07	Did you care which party won the recent general election?	1 = Cared a good deal, 0 = Didn’t care very much
C01	How interested were you in the general election that was held on 4th July 2024?	1 = Very interested, 2 = Somewhat interested, 3 = Not very interested, 4 = Not at all interested
C02_1	Going to vote is a lot of effort	1 = Strongly disagree, 2 = Disagree, 3 = Neither agree nor disagree, 4 = Agree, 5 = Strongly agree
C02_2	I feel a sense of satisfaction when I vote	1 = Strongly disagree, 2 = Disagree, 3 = Neither agree nor disagree, 4 = Agree, 5 = Strongly agree
C02_3	It is every citizen’s duty to vote in an election	1 = Strongly disagree, 2 = Disagree, 3 = Neither agree nor disagree, 4 = Agree, 5 = Strongly agree
D01	Generally speaking, do you think of yourself as Labour, Conservative, Liberal Democrat, (Scottish National/Plaid Cymru) [in Scotland/Wales] or what?	0 = None/No, 1 = Labour, 2 = Conservative, 3 = Liberal Democrat, 4 = Scottish National Party (SNP), 5 = Plaid Cymru, 6 = Green Party, 8 = Reform UK, 9 = Other

As always, we’ll start by loading and looking at the data (remembering to set our working directory first!):

Code

library(tidyverse)
besdata <- read.csv("bes2024.csv") # reads and stores data as object called besdata
head(besdata) # looking at first few rows of dataset

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

  b01 a03 b07 c01 c02_1 c02_2 c02_3 d01
1   1   2   1   1     2     4     4   3
2   1   4   0   4     1     3     4   0
3   1   2   1   1     1     4     1   3
4   1   2   1   1     1     5     5  NA
5   1   3   1   2     2     4     4  NA
6   1   3   0   1     2     1     4   0

Question 1

You want to explain what makes somebody go out and vote in the 2024 general election. You use the data from the British Election Study to better understand the factors which determine individual turnout.

a) Which variable do you use as a dependent variable to analyse this question?

The appropriate dependent variable is B01, which asks explicitly whether the respondent voted.

b) What kind of variable is it?

It is a binary variable.

Question 2

You think that people who care about the outcome of the election will be more likely to vote. To test this you use variable B07.

a) Calculate the difference-in-means estimator using the lm function.

Code

lm(b01 ~ b07, data = besdata) %>% 
  summary()


Call:
lm(formula = b01 ~ b07, data = besdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8629 -0.4494  0.1371  0.1371  0.5506 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.44940    0.01266   35.49   <2e-16 ***
b07          0.41351    0.01552   26.64   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.402 on 3012 degrees of freedom
  (75 observations deleted due to missingness)
Multiple R-squared:  0.1907,    Adjusted R-squared:  0.1905 
F-statistic: 709.9 on 1 and 3012 DF,  p-value: < 2.2e-16

b) Write down the regression model you used to estimate the difference-in-means.

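\(B01_i = \alpha + \beta \, B07_i + \epsilon_i\)

The estimated version is \(\widehat{B01}_i = \hat{\alpha} + \hat{\beta} \, B07_i\), with \(\hat{\alpha} = 0.449\) and \(\hat{\beta} = 0.414\).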

c) Interpret the estimate. Can you interpret the difference-in-means estimate causally? Explain.

We are estimating a linear probability model. The explanatory variable is a binary variable (respondent cared / did not care about the election result). We are therefore looking at a group difference, comparing respondents who cared with those who did not care.

It is the nature of the linear probability model that we estimate a probability. That is, we are comparing the probability of voting across the two groups. The intercept of 0.45 is the estimated probability of voting for respondents who did not care about the election result. The coefficient for variable b07 is 0.41. That means that for a voter who cares about the election result, the probability of voting is 41 percentage points higher than for a voter who does not care about the election result (0.45 + 0.41 = 0.86).

This effect cannot be interpreted causally. How much a voter cares is not assigned randomly, so it is highly likely that the two groups of voters are systematically different along a range of dimensions, such as education, income, occupation, race, etc.
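
To see why the slope from lm is the difference-in-means estimator, the following minimal sketch compares the two on a small made-up data frame. The object toy and its values are hypothetical and are not taken from bes2024.csv; voted plays the role of b01 and cared the role of b07.

```r
# Hypothetical toy data: 'voted' stands in for b01, 'cared' for b07
toy <- data.frame(
  voted = c(1, 1, 1, 0, 1, 0, 0, 1, 0, 0),
  cared = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

# Difference in means computed by hand
mean_cared     <- mean(toy$voted[toy$cared == 1])
mean_not_cared <- mean(toy$voted[toy$cared == 0])
diff_in_means  <- mean_cared - mean_not_cared

# The same quantities from a linear probability model
fit       <- lm(voted ~ cared, data = toy)
intercept <- unname(coef(fit)["(Intercept)"])
slope     <- unname(coef(fit)["cared"])

# The slope matches the hand-computed difference in means
print(c(by_hand = diff_in_means, lm_slope = slope))
```

The intercept equals the mean of the outcome in the group coded 0, and the slope equals the gap between the two group means. This is the same logic that gives 0.449 and 0.414 in the output above.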

d) Is the difference statistically significant at the 5% level? Explain how you come to the conclusion.

The difference is statistically significant. The easiest way to see this is to look at the p-value. The p-value tells us the probability of observing a value of the test statistic at least as extreme as the one we get, if the null hypothesis is true.

The null hypothesis is: \(H_0: \beta = 0\) The alternative hypothesis is: \(H_a: \beta \neq 0\)

The output reports the p-value as <2e-16, that is, smaller than \(2 \times 10^{-16}\). Written out, this is 0.0000000000000002, with fifteen zeros after the decimal point before the 2.

With a significance level \(\alpha = 0.05\) we clearly see that the p-value is dramatically smaller than \(\alpha\). We therefore reject the null hypothesis.
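
As a rough check, the p-value can be recomputed from the estimate and standard error reported in the Question 2 output. This sketch hard-codes those numbers rather than re-estimating the model, and the object names are only illustrative.

```r
# Values copied from the summary() output in Question 2
estimate  <- 0.41351
std_error <- 0.01552
df_resid  <- 3012

# t statistic for H0: beta = 0
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with 3012 degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# Decision at the 5% significance level
alpha       <- 0.05
reject_null <- p_value < alpha

print(t_value)
print(reject_null)
```

The recomputed t statistic agrees with the t value of 26.64 shown in the output, and the resulting p-value is far below the 2e-16 threshold that R prints.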

Question 3

Add variables a03 and c01 to the model. Interpret the coefficient for the variable b07 in this updated model.

Code

lm(b01 ~ b07 + a03 + c01, data = besdata) %>% 
  summary()


Call:
lm(formula = b01 ~ b07 + a03 + c01, data = besdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.98492 -0.25126  0.05013  0.22322  0.74874 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.943649   0.030329   31.11  < 2e-16 ***
b07          0.214365   0.018242   11.75  < 2e-16 ***
a03         -0.035042   0.009789   -3.58  0.00035 ***
c01         -0.138055   0.010450  -13.21  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3808 on 3005 degrees of freedom
  (80 observations deleted due to missingness)
Multiple R-squared:  0.2735,    Adjusted R-squared:  0.2728 
F-statistic: 377.1 on 3 and 3005 DF,  p-value: < 2.2e-16

The main difference is that we are now estimating a multiple regression model, so the interpretation changes slightly.

It is still a linear probability model. That is, we are still comparing the probability of voting across the two groups. However, we are now controlling for two additional factors: how interested the respondent is in politics and how interested the respondent was in the election.

The coefficient for variable b07 is now 0.21. That means that for a voter who cares about the election result, the probability of voting is 21 percentage points higher than for a voter who does not care about the election result, holding the other variables constant.
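
What "holding the other variables constant" means can be illustrated with the coefficients from the output above. The sketch below hard-codes them and compares two hypothetical respondents who have the same values on a03 and c01 and differ only in b07. The helper predict_vote is an illustrative name, not part of the original solution.

```r
# Coefficients copied from the summary() output in Question 3
coefs <- c(intercept = 0.943649, b07 = 0.214365,
           a03 = -0.035042, c01 = -0.138055)

# Hypothetical helper: predicted probability of voting for one respondent
predict_vote <- function(b07, a03, c01) {
  unname(coefs["intercept"] + coefs["b07"] * b07 +
           coefs["a03"] * a03 + coefs["c01"] * c01)
}

# Two respondents with identical interest (a03 = 2, c01 = 2), differing only in b07
p_cares     <- predict_vote(b07 = 1, a03 = 2, c01 = 2)
p_not_cares <- predict_vote(b07 = 0, a03 = 2, c01 = 2)

# The gap equals the b07 coefficient, whatever values a03 and c01 are fixed at
print(p_cares - p_not_cares)
```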

Question 4

Now add all other variables except d01 to the model. Interpret the coefficient for variable c02_1.

Code

lm(b01 ~ b07 + a03 + c01 + c02_1 + c02_2 + c02_3, data = besdata) %>% 
  summary()


Call:
lm(formula = b01 ~ b07 + a03 + c01 + c02_1 + c02_2 + c02_3, data = besdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.10338 -0.10338  0.06674  0.22140  0.93145 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.591109   0.045477  12.998  < 2e-16 ***
b07          0.162946   0.017939   9.083  < 2e-16 ***
a03         -0.013604   0.009543  -1.426    0.154    
c01         -0.093319   0.010440  -8.939  < 2e-16 ***
c02_1       -0.054388   0.006665  -8.161 4.90e-16 ***
c02_2        0.037470   0.006928   5.408 6.88e-08 ***
c02_3        0.064658   0.006417  10.076  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3626 on 2918 degrees of freedom
  (164 observations deleted due to missingness)
Multiple R-squared:  0.3131,    Adjusted R-squared:  0.3117 
F-statistic: 221.7 on 6 and 2918 DF,  p-value: < 2.2e-16

Variable c02_1 measures agreement with the statement “Going to vote is a lot of effort” on a five-point scale from 1 (strongly disagree) to 5 (strongly agree). The coefficient is -0.054. That means that a one-point increase in agreement is associated with a decrease in the probability of voting of about 5.4 percentage points, holding the other variables constant. The coefficient is statistically significant at the 5% level (p-value of 4.90e-16).

Question 5

Create a new variable called reform, which is 1 if the voter identifies with Reform UK and 0 otherwise.

There are two ways to do this. It does not matter which one you choose. What matters is the correct outcome.

Using base R:

Code

besdata$reform <- ifelse(besdata$d01 == 8, 1, 0)

Using tidyverse:

Code

besdata <- besdata %>%
  mutate(reform = case_when(
    d01 == 8   ~ 1,   # identifies with Reform UK
    is.na(d01) ~ NA,  # keep missing values missing
    TRUE       ~ 0    # everyone else
  ))
# Note that without the line: is.na(d01) ~ NA, all missing values would be recoded to 0. So you need to add this line!
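
The difference between the two approaches in how they treat missing values can be checked on a short made-up vector. The values below are hypothetical and only mimic the coding of d01. The sketch uses NA_real_ instead of NA to make the type of the missing value explicit.

```r
library(dplyr)

# Hypothetical party-identification codes, including one missing value
d01 <- c(8, 1, NA, 0, 8)

# Base R: ifelse() propagates NA automatically
reform_base <- ifelse(d01 == 8, 1, 0)

# tidyverse: case_when() needs an explicit rule for missing values
reform_tidy <- case_when(
  d01 == 8   ~ 1,
  is.na(d01) ~ NA_real_,
  TRUE       ~ 0
)

# Without the is.na() rule the missing value is silently recoded to 0
reform_wrong <- case_when(
  d01 == 8 ~ 1,
  TRUE     ~ 0
)

print(reform_base)
print(reform_tidy)
print(reform_wrong)
```

The first two vectors keep the third element missing, while the last one turns it into 0, which would wrongly count a respondent with unknown party identification as a non-supporter.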

Question 6

Add the variable reform to the regression and interpret the coefficient. Are Reform supporters more likely to vote than supporters of other parties?

Code

lm(b01 ~ reform + b07 + a03 + c01 + c02_1 + c02_2 + c02_3, data = besdata) %>% 
  summary()


Call:
lm(formula = b01 ~ reform + b07 + a03 + c01 + c02_1 + c02_2 + 
    c02_3, data = besdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.10147 -0.10147  0.06854  0.21686  0.93017 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.5786937  0.0471792  12.266  < 2e-16 ***
reform       0.0001552  0.0292556   0.005    0.996    
b07          0.1625079  0.0188647   8.614  < 2e-16 ***
a03         -0.0117443  0.0099150  -1.184    0.236    
c01         -0.0931726  0.0108511  -8.586  < 2e-16 ***
c02_1       -0.0544179  0.0068865  -7.902 3.98e-15 ***
c02_2        0.0394077  0.0071632   5.501 4.13e-08 ***
c02_3        0.0645122  0.0066630   9.682  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3591 on 2656 degrees of freedom
  (425 observations deleted due to missingness)
Multiple R-squared:  0.3145,    Adjusted R-squared:  0.3127 
F-statistic: 174.1 on 7 and 2656 DF,  p-value: < 2.2e-16

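As we can see, the coefficient of the variable reform is 0.0002, which is practically zero, and it is not statistically significant (p-value of 0.996). In other words, we cannot reject the null hypothesis that the true parameter is 0.

There is therefore no statistical evidence that respondents who identify with Reform UK are more likely to vote than other respondents, holding the other variables constant. Note that the comparison group contains everyone with reform = 0, that is, identifiers with other parties as well as respondents with no party identification.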
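
The same conclusion can be reached with a 95% confidence interval. This sketch hard-codes the estimate and standard error from the Question 6 output and builds the interval by hand. The object names are illustrative only.

```r
# Values copied from the summary() output in Question 6
estimate  <- 0.0001552
std_error <- 0.0292556
df_resid  <- 2656

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Interval: estimate plus/minus critical value times standard error
ci_lower <- estimate - t_crit * std_error
ci_upper <- estimate + t_crit * std_error

print(c(lower = ci_lower, upper = ci_upper))
```

The interval comfortably includes 0, which matches the large p-value in the regression output.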

Question 7

The variable a03 has a statistically significant coefficient in some model specifications but not in others. Can you think of a reason for why that may be the case?

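Variable a03 measures interest in politics. Its coefficient is statistically significant in the model from Question 3 (p-value of 0.00035) but not once the c02 variables are added in Questions 4 and 6 (p-values of 0.154 and 0.236). It is likely that interest in politics directly affects both the likelihood of voting and the values of some of the other independent variables, particularly interest in the 2024 general election. In other words, it is a confounder. When variables that are closely related to a03 are included in the model, they absorb part of the association that was previously attributed to a03, so its coefficient shrinks and may no longer be distinguishable from zero.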
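
The mechanism can be illustrated with simulated data. The sketch below is not based on the BES data: all variable names and effect sizes are assumptions chosen for illustration, and the outcome is continuous rather than binary to keep the example short.

```r
set.seed(1)
n <- 2000

# Simulated stand-ins: general interest drives election interest
interest_politics <- rnorm(n)
interest_election <- 0.8 * interest_politics + rnorm(n)

# Outcome depends weakly on general interest and strongly on election interest
outcome <- 0.1 * interest_politics + 0.5 * interest_election + rnorm(n)

# Short model omits election interest, long model includes it
short_fit <- lm(outcome ~ interest_politics)
long_fit  <- lm(outcome ~ interest_politics + interest_election)

coef_short <- unname(coef(short_fit)["interest_politics"])
coef_long  <- unname(coef(long_fit)["interest_politics"])

# The coefficient on general interest shrinks once election interest is added
print(c(short = coef_short, long = coef_long))
```

In the short model the coefficient on general interest also picks up the part of the effect that runs through election interest. Once the related variable is included, only the smaller direct association remains, and with a noisier outcome or a smaller sample it could easily fail to reach statistical significance.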