POL272 Quantitative Methods for Social Science Research
Assumptions
Recall the linear model:
\[ Y_i = \beta_0 + \beta_1X_i + u_i \]
For the OLS estimator of the parameters \(\beta_0\) and \(\beta_1\) to be appropriate, four key assumptions have to be satisfied:
1. Conditional mean independence: \(E(u_i|X_i) = 0\)
2. \((X_i, Y_i),\ i = 1, \dots, n\) are i.i.d.
3. Large outliers are unlikely.
4. There is no perfect multicollinearity.

The conditional mean independence assumption means that the conditional distribution of \(u_i\) given \(X_i\) has mean zero.
It asserts that all other possible factors, which are contained in \(u_i\), are unrelated to \(X_i\).
This is written \(E(u_i|X_i) = 0\), which implies \(corr(X_i, u_i) = 0\).
i.e., for a given value of \(X\), the expected value of the error term is zero.
This is essentially a restatement of omitted variable bias \(\rightarrow\) we appealed to this type of motivation when we first introduced multiple regression
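To see what a violation looks like, here is a minimal simulation sketch in R (all names hypothetical): a confounder `z` drives both `x` and `y`; omitting it leaves `z` in the error term, correlated with `x`, and the slope estimate drifts away from the true value of 2.

```r
# Hypothetical simulation of omitted variable bias
set.seed(272)
n <- 1000
z <- rnorm(n)                      # confounder
x <- 0.8 * z + rnorm(n)            # x is related to z
y <- 1 + 2 * x + 3 * z + rnorm(n)  # true effect of x on y is 2

coef(lm(y ~ x))      # biased: z sits in the error term, so E(u|x) != 0
coef(lm(y ~ x + z))  # controlling for z recovers a slope close to 2
```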



\(Y_i\) seems randomly distributed around the line for all values of X
\(\rightarrow\) in expectation, \(u_i = 0\) for any value of \(X_i\)




\(Y_i\) is not randomly distributed around the line for all values of X
\(\rightarrow\) the expected value of \(u_i\) is not zero for all \(X_i\) values
\(\rightarrow\) it seems we have omitted an important variable from our model
In a randomized controlled experiment, subjects are randomly assigned to the treatment group (\(X=1\)) or to the control group (\(X=0\))
If random assignment is correctly implemented, it will be done independently of all personal characteristics of the subjects
Random assignment makes \(X\) independent of all possible confounders (Z)
In observational data, \(X\) is not randomly assigned
We try to assess whether \(X\) is as good as randomly assigned conditional on other variables
How convincing is a causal claim based on observational data?
We have to make this judgement for each given empirical application with observational data
- Independent: \(Y_1\) gives no information about the value of \(Y_2\)
- Identically distributed: the distribution of \(Y_i\) is the same for all \(i\)

A good example of i.i.d. data is survey data drawn from a random subset of the population.
Often this is not the case with observational data:
- Time-series data: observations of the same unit over time
- Clustered data: observations grouped within higher-level units
We will not explore time-series and spatial dependence in data in this module, but it is one of the advanced topics that you will encounter regularly when working with data
Perfect multicollinearity \(\rightarrow\) when one X variable is a perfect linear function of another X variable
Imagine you would like to know the association between height and shoe size (\(Y\)), and you measure height both in inches (\(X_1\)) and in centimetres (\(X_2\)).
You cannot estimate the model \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2\):
\(X_1\) and \(X_2\) are perfectly multicollinear: \(X_2 = 2.54 \times X_1\)
\(\rightarrow\) it is impossible to compute the OLS estimator, we have to drop one of these variables from the model
Intuition: in multiple regression the coefficient of one of the regressors is the effect of a change in that regressor, holding the other regressors constant.
When shoe size is regressed on \(\text{height(in)}\) and \(\text{height(cm)}\), the coefficient on \(\text{height(in)}\) would be the effect of height in inches holding height in centimetres constant.
This is not logical, and the coefficients cannot be estimated using OLS.
A general solution is to modify the list of X variables
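A quick R illustration of the height example (simulated data, hypothetical names): `lm()` detects the exact linear dependence and reports `NA` for the redundant regressor, i.e. it drops it for us.

```r
# Hypothetical data: height in inches and its exact conversion to centimetres
set.seed(1)
height_in <- rnorm(100, mean = 70, sd = 3)
height_cm <- 2.54 * height_in                            # perfect linear function
shoe_size <- 5 + 0.1 * height_in + rnorm(100, sd = 0.5)

# Both coefficients cannot be estimated; height_cm comes back as NA
coef(lm(shoe_size ~ height_in + height_cm))
```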
Imperfect multicollinearity \(\rightarrow\) a “high” degree of correlation between two X variables, but not perfect linear combinations
Imagine you want to know the association between education and vote choice, and between intelligence and vote choice
Does not prevent estimation of \(\beta\) with OLS, but the coefficients on the correlated variables will be imprecisely estimated (large standard errors), as the sketch below illustrates.
There is no general solution
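A sketch of the consequence, under assumed names: with two highly correlated regressors, OLS still runs, but the standard errors on both coefficients are inflated.

```r
# Hypothetical simulation: education and intelligence are highly correlated
set.seed(2)
n <- 500
intelligence <- rnorm(n)
education <- 0.95 * intelligence + rnorm(n, sd = 0.2)  # corr approx. 0.98
vote <- 1 + 0.5 * education + 0.5 * intelligence + rnorm(n)

# Coefficients on the correlated regressors have large standard errors;
# dropping one would shrink the SE but reintroduce omitted variable bias
summary(lm(vote ~ education + intelligence))
```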
When these assumptions hold, in large samples the OLS estimators have normal sampling distributions
This allows us to develop methods for hypothesis testing and confidence intervals
Violations of assumptions are very important \(\rightarrow\) call into question statistical inference using linear regression.
What makes a study that uses multiple regression reliable or unreliable?
We can assess the validity of an empirical study by focussing on two classes of problems:
- internal validity
- external validity

This relates to published work using linear regression, the final coursework for this module, your dissertations, etc.
Internal validity: the statistical inferences about causal effects are valid for the population being studied.

External validity: the statistical inferences can be generalized from the population and setting studied to other populations and settings.
Internal validity consists of two components:
1. Estimators of parameters should be unbiased and consistent.
   - Unbiased: in expectation, \(\hat{\beta}\) equals the 'true' value
   - Consistent: as \(n\) increases, \(\hat{\beta}\) converges to the 'true' value (see the sketch below)
2. Hypothesis tests and confidence intervals should have the desired significance level.
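A small illustrative sketch of consistency (simulated data): re-estimate the slope at growing sample sizes and watch \(\hat{\beta}_1\) settle on the true value of 2.

```r
# Illustrative check: the OLS slope converges to the true value as n grows
set.seed(3)
beta_hat <- sapply(c(50, 500, 5000, 50000), function(n) {
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n)
  coef(lm(y ~ x))[2]
})
round(beta_hat, 3)  # estimates move closer to 2 as n increases
```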
Threats to internal validity lead to failures of the least squares assumptions.
\(\hat{\beta}\) may be biased (even in large samples) due to:
1. omitted variable bias
2. misspecification of the functional form
3. measurement error in the regressors
4. sample selection bias

All four reasons involve a correlation between the error term and the regressor – a violation of Assumption 1.
Review: OVB is present when we omit from our model a variable that determines \(Y\) and is related to one or more of the \(X\) variables.
If the omitted variables can be measured, we can address OVB by adding them to the model and controlling for them.
If controls are unavailable, we cannot add them to the model.
- Randomized experiment: assignment of \(X\) is controlled by the researcher and will be unrelated to other variables by construction.
- Instrumental variables: variables that are correlated with \(X\) but uncorrelated with the error term \(u_i\).

If the true population regression function is nonlinear but the estimated regression is linear, then this functional form misspecification makes the OLS estimator biased.
We can deal with this by modelling nonlinear relationships, for example by including polynomial terms such as \(X^2\).
Does ideology determine 'rebelliousness' in parliamentary votes? Theory: governing-party MPs' left-right positions should predict votes against government policy. Context: votes of 400 Labour MPs in the House of Commons in 2005.
- Dependent variable (\(Y\)): percentage of votes in which the MP voted against the government line
- Independent variable (\(X_1\)): MP's self-placement on a left-right scale (0 = most left, 10 = most right)
- Control variable (\(X_2\)): years the MP has been in parliament

\[ \text{Rebel Votes}_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + u_i \]
| | Percentage of rebel votes |
|---|---|
| Left-Right Placement | 0.393 |
| | (0.460) |
| Experience | 1.427\(^{***}\) |
| | (0.044) |
| Constant | 2.435\(^{***}\) |
| | (0.131) |
| Observations | 400 |
| \(R^{2}\) | 0.734 |
| Note: | \(^{*}p<0.1\); \(^{**}p<0.05\); \(^{***}p<0.01\) |
\(\rightarrow\) the coefficient on Left-Right Placement is not significant: no evidence of a relationship between ideology and rebellion.
Is the assumption of a linear relationship reasonable?
We can use the residuals to detect non-linearity:
\[ \hat{u}_i = Y_i - \hat{Y}_i \]
Implication: in our sample, the residuals \(\hat{u}_i\) should be distributed around zero for each value of \(X\), as in the sketch below.
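In R this check is a single plot; a minimal sketch, assuming `model1` is the fitted linear model and `mps` a data frame with the MP variables (hypothetical names):

```r
# Plot residuals against left-right placement; a systematic pattern
# (e.g. a U-shape) signals non-linearity
plot(mps$left_right, residuals(model1),
     xlab = "Left-Right Placement", ylab = "Residuals")
abline(h = 0, lty = 2)  # residuals should straddle this line for all X
```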
We can allow for the non-linearity by adding a quadratic term:
\[ \text{Rebel Votes}_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{1i}^2 + \beta_3X_{2i} + u_i \]
| | Percentage of rebel votes | |
|---|---|---|
| | (1) | (2) |
| Left-Right Placement | 0.393 | -3.172\(^{***}\) |
| | (0.460) | (0.114) |
| Left-Right Placement\(^2\) | | 0.458\(^{***}\) |
| | | (0.018) |
| Experience | 1.427\(^{***}\) | 1.416\(^{***}\) |
| | (0.044) | (0.027) |
| Constant | 2.435\(^{***}\) | 6.206\(^{***}\) |
| | (0.131) | (0.170) |
| Observations | 400 | 400 |
| \(R^{2}\) | 0.734 | 0.898 |
| Note: | \(^{*}p<0.1\); \(^{**}p<0.05\); \(^{***}p<0.01\) | |
Accounting for non-linearity reveals the potential danger of misspecifying the functional form
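A sketch of how model 2 can be fitted in R, again assuming the hypothetical `mps` data frame:

```r
# I() protects the arithmetic so lm() treats left_right^2 as a regressor
model2 <- lm(rebel_votes ~ left_right + I(left_right^2) + experience,
             data = mps)
summary(model2)
```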
Let’s plot the residuals from model 2 against the L-R variable:

What is the relationship between an individual’s use of social media and their level of political knowledge? We ask survey respondents how many minutes they spend on social media each day, and then test their political knowledge on a series of questions.
- Dependent variable (\(Y\)): political knowledge (−50 to 50)
- Independent variable (\(X_1\)): reported daily use of social media (minutes)

Let's assume (naively) that respondents make random errors in reporting the amount of time they spend on social media:
\[ \tilde{X}_i = X_i + w_i \]
If \(w_i\) is truly random, surely this won’t be a problem, right?
Wrong.

As measurement error increases, \(\hat{\beta}\) \(\rightarrow\) 0

As measurement error increases, \(\hat{\beta}\) \(\rightarrow\) 0

As measurement error increases, \(\hat{\beta}\) \(\rightarrow\) 0
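A minimal simulation sketch of this attenuation (classical measurement error, hypothetical names): the true slope is 2, and adding pure noise to \(X\) drags the estimate towards zero.

```r
# Noisier measures of x attenuate the estimated slope towards 0
set.seed(4)
n <- 2000
x <- rnorm(n)
y <- 2 * x + rnorm(n)

for (noise_sd in c(0, 0.5, 1, 2)) {
  x_tilde <- x + rnorm(n, sd = noise_sd)  # x measured with random error
  cat("noise sd =", noise_sd, "-> slope =",
      round(coef(lm(y ~ x_tilde))[2], 2), "\n")
}
```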
More likely, respondents make systematic errors in stating the amount of time they spend on social media, e.g. everyone reports only 60% of their actual time:
\[ \tilde{X}_i = 0.6 \times X_i \]

If the error is non-random, \(\hat{\beta}\) will also be biased
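A companion sketch for the systematic case: if everyone reports 60% of their true time, the slope is rescaled by \(1/0.6\) rather than attenuated.

```r
# Systematic under-reporting rescales the slope (true value is 2)
set.seed(5)
x <- rnorm(2000)
y <- 2 * x + rnorm(2000)
x_tilde <- 0.6 * x        # everyone reports 60% of their actual time
coef(lm(y ~ x_tilde))[2]  # roughly 2 / 0.6 = 3.33, biased upward
```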
Solutions to measurement error in \(X\), in order of frequency of use: obtain a more accurate measure of \(X\); use instrumental variables; model the measurement error and adjust the estimates.
Missing data are a common feature in any data analysis task.
When data are missing in a way that is related to the value of the dependent variable, we face sample selection bias.
Let's assume that people with low levels of political knowledge are on average less likely to respond to political surveys.

If sample selection is based on Y, \(\hat{\beta}\) will also be biased
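A stark simulation sketch of the mechanism (hypothetical names, with the bottom 30% of the knowledge distribution never responding): selecting on \(Y\) flattens the estimated slope.

```r
# Selection on Y: low-knowledge respondents do not appear in the sample
set.seed(6)
n <- 2000
social_media <- rnorm(n, mean = 60, sd = 20)
knowledge <- -0.3 * social_media + rnorm(n, sd = 15)
observed <- knowledge > quantile(knowledge, 0.3)

coef(lm(knowledge ~ social_media))                      # full sample: ~ -0.3
coef(lm(knowledge[observed] ~ social_media[observed]))  # attenuated slope
```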
There are methods for dealing with these problems:

- multiple imputation methods for missing data
- methods for truncated and censored data
- selection models to deal with sample selection bias

All of these approaches are beyond the scope of our course.
Inconsistent standard errors are problematic: even if \(\hat{\beta}\) is consistent and the sample size is large, hypothesis tests and confidence intervals built on the wrong standard errors will be misleading.
Inconsistent standard errors are usually due to heteroskedasticity or to correlation of the error term across observations (e.g. in time-series or clustered data).
Another assumption: The variance of the error term (\(\sigma^2\)) is the same for all units i, i.e. it does not depend on \(X_i\)
Homoskedasticity
\(Var(u_i | X_i)=Var(u)=\) constant \(\rightarrow\) does not depend on X
Income and expenditure on meals
We would like to know whether richer people spend more money on food than poorer people. We collect data on 1000 meals purchased by individuals with different income levels.
- Dependent variable (\(Y\)): cost of meal (£)
- Independent variable (\(X_1\)): individual income (£ per year)
Homoskedastic: \(Var(u_i | X_i=x)=\) const.

Heteroskedastic: \(Var(u_i | X_i=x)\neq\) const.
We can detect heteroskedasticity visually by again plotting the residuals against our X variables:
\(\rightarrow\) when the distribution of \(u_i\) follows this “funnel” shape, that suggests the errors are heteroskedastic
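A sketch of what that looks like, using simulated data standing in for the meal-cost example (hypothetical names; the error spread grows with income):

```r
# Simulate heteroskedastic errors and plot the residuals
set.seed(7)
income <- runif(1000, min = 10, max = 100)
meal_cost <- 1 + 0.14 * income + rnorm(1000, sd = 0.2 * income)

model1 <- lm(meal_cost ~ income)
plot(income, residuals(model1),
     xlab = "Income", ylab = "Residuals")  # funnel: wider at high income
abline(h = 0, lty = 2)
```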
The good news: whether the errors are homoskedastic or heteroskedastic, the OLS estimator \(\hat{\beta}\) remains unbiased and consistent.
The bad news: if the homoskedasticity assumption is violated, the usual formula for the standard errors is wrong, so hypothesis tests and confidence intervals become unreliable.
Heteroskedasticity can lead to standard errors that are too small or too large. We generally worry less about overestimated standard errors: the real danger is underestimated standard errors, which lead us to overstate statistical significance.
Test for heteroskedasticity using the Breusch-Pagan test.
Intuition: regress the squared residuals on the \(X\) variables; if the \(X\)s help to predict the squared residuals, the error variance depends on \(X\) and the errors are heteroskedastic.
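In R the test is one line, e.g. `bptest()` from the lmtest package, applied here to the `model1` regression from the sketch above:

```r
library(lmtest)
# H0: homoskedastic errors; a small p-value is evidence of heteroskedasticity
bptest(model1)
```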
If heteroskedasticity is present, calculate heteroskedasticity-robust standard errors
\(\rightarrow\) there is evidence of heteroskedasticity in the meal cost data
> library(lmtest)
> library(sandwich)
> library(texreg)
# Use coeftest() from lmtest with a vcovHC() covariance matrix from
# sandwich to calculate heteroskedasticity-robust SEs ("HC3")
> model2 <- coeftest(model1, vcov = vcovHC(model1, type = "HC3"))
> screenreg(list(model1, model2))
==================================
Model 1 Model 2
----------------------------------
(Intercept) 0.89 0.89
(1.61) (1.61)
income 0.14 *** 0.14
(0.04) (0.13)
----------------------------------
R^2 0.25
Adj. R^2 0.24
Num. obs. 200
RMSE 17.22
==================================
*** p < 0.001, ** p < 0.01, * p < 0.05

\(\rightarrow\) when adjusting for heteroskedasticity, \(\hat{\beta}_\text{income}\) is no longer statistically significant
Potential threats to external validity arise from differences between the population and setting studied and the population and setting of interest.
