What Is Omitted Variable Bias? | Definition & Example

Omitted variable bias occurs when a statistical model fails to include one or more relevant variables. In other words, it means that you left out an important factor in your analysis.

Example: Omitted variable bias
Let’s say you want to investigate the effect of education on people’s salaries. In order to correctly analyse this effect, you should also include ability in your model. Ability makes a student more successful than their peers in school, which may lead to a better job and a better salary after graduation.

If you don’t have a trustworthy measure of ability, you may have to exclude it from your model despite knowing that it’s an important variable.

In this case, excluding ability causes omitted variable bias. This may lead to an overestimation or underestimation of the effect of your other variables.

As a result, the model mistakenly attributes the effect of the missing variable to the included variables. Exclusion of important variables can limit the validity of your study findings.

What is an omitted variable?

An omitted variable is a confounding variable related to both the supposed cause and the supposed effect of a study. In other words, it is related to both the independent and dependent variable.

Example: Omitted variable
Let’s revisit the example of the effect of education on salaries.

Here, the independent variable is education. However, salary is also likely to be related to ability, which you previously decided to exclude. In turn, ability is also likely related to the level of education a person attains, as those with greater ability are likely to pursue higher education.

The omitted variable (ability) is related to both education (the independent variable) and salary (the dependent variable), so leaving it out distorts your analysis of the relationship between the two.

While a variable can be omitted because you are not aware that it exists, it’s also possible to omit variables that you can’t measure, even though you are aware of their existence.

What is omitted variable bias?

Omitted variable bias occurs in linear regression analysis when one or more relevant independent variables are not included in your regression model.

A regression model describes the relationship between one or more independent variables (also called predictors, covariates, or explanatory variables) and a dependent variable (often called a response or target variable).

Because the omitted variable is hidden or unobserved, it’s not factored into your analysis, affecting your results.

This biases your coefficients only if the omitted variable is correlated with both:

  1. The dependent variable
  2. One or more other independent variables

Example: Biased coefficients
Let’s consider the simple linear regression formula for the effect of education on salaries:

Salary = β0 + β1 ∗ Educ + ε

Where:

  • Salary is the wage in dollars (dependent variable)
  • Educ is the years of education completed (independent variable)
  • β0 is the intercept, or the predicted value of Salary when Educ is 0
  • β1 is the regression coefficient, or how much we expect Salary to change for each additional year of education.
  • ε is the error term, which captures the variation in Salary that is not explained by Educ.

As we saw, ability is the omitted variable in this model: it’s absent, but it shouldn’t be. Ability is correlated with both salary and education. Since it is not included in our regression model, we conclude that it’s “hiding” somewhere. But where?
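One way to make the question concrete is to compare the error terms of the two models. This sketch assumes the complete model also contains ability with its own coefficient (written here as β2), and uses u as a hypothetical symbol for the remaining noise:

```latex
\begin{aligned}
\text{Complete model:}\quad & \text{Salary} = \beta_0 + \beta_1\,\text{Educ} + \beta_2\,\text{Abil} + u \\
\text{Estimated model:}\quad & \text{Salary} = \beta_0 + \beta_1\,\text{Educ} + \varepsilon,
\qquad \varepsilon = \beta_2\,\text{Abil} + u
\end{aligned}
```

Under that assumption, whatever ability contributes to salary ends up inside ε.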

Why is omitted variable bias a problem?

An omitted variable is a source of endogeneity. Endogeneity occurs when a variable in the error term is also correlated with an independent variable.

When this happens, the causal effect from the omitted variable becomes tangled up in the coefficient on the variable with which it is correlated. This, in turn, undermines our ability to infer causality and severely impacts our results.

Example: Endogeneity
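Going back to our example, ability ends up in the error term because it is left out of the model. It is also correlated with the independent variable, as people with high ability tend to achieve a higher level of education. This correlation between the error term and education is what makes education endogenous.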

Since ability is not in the regression model, our estimate of β1 will absorb some of the effect of ability.

The estimate is now biased, so we can no longer make a causal claim about education.

Omitting a variable might lead to an overestimation (upward bias) or underestimation (downward bias) of the coefficient of your independent variable(s). Since the coefficient becomes unreliable, the regression model also becomes unreliable.
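The upward bias can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not an analysis of real salaries: the coefficients (2,000 per year of education, 5,000 per unit of ability), the noise levels, and the helper names `simulate` and `ols` are all assumptions chosen for illustration.

```python
import numpy as np

def simulate(n=20_000, beta1=2_000.0, beta2=5_000.0, seed=0):
    """Simulate salaries where ability raises both education and salary."""
    rng = np.random.default_rng(seed)
    abil = rng.normal(0.0, 1.0, n)                        # unobserved ability
    educ = 12.0 + 2.0 * abil + rng.normal(0.0, 2.0, n)    # ability raises education
    noise = rng.normal(0.0, 5_000.0, n)
    salary = 20_000.0 + beta1 * educ + beta2 * abil + noise
    return educ, abil, salary

def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, ...] via least squares."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

educ, abil, salary = simulate()
short = ols(salary, educ)         # ability omitted
long_ = ols(salary, educ, abil)   # ability included

print(f"beta1 with ability omitted:  {short[1]:.0f}")
print(f"beta1 with ability included: {long_[1]:.0f}")
```

With these assumed parameters, the model that omits ability should report a coefficient on education well above the true value of 2,000, while the model that includes ability should recover something close to it.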

How to deal with omitted variable bias

Regression models rarely predict the value of the dependent variable perfectly, so in practice almost every regression model leaves out at least one relevant variable. While omitted variable bias can’t be avoided altogether, there are steps you can take to mitigate it.

  • If the required data are not available, like in the case of ability, you can use control variables. Taking the example of salaries, controls are variables that in theory affect salary, such as years of work experience.
  • If you don’t have the data, use proxies for the omitted variables. These are variables that are similar enough to the omitted variable to give you an idea about its value, but that you are able to measure. For example, you might use an IQ test as a proxy for an individual’s ability.
  • If you are not able to resolve the research bias, try to make a prediction about which direction your estimates are biased. This is called ‘signing’ the bias. You can sign it as either positive or negative, and this helps you estimate the omitted variable bias.
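The proxy approach can be sketched on synthetic data. Everything here is an assumption made for illustration: the data-generating process, the hypothetical `iq` variable (ability plus measurement noise), and the helper `educ_coef`. The point is only that a noisy proxy tends to shrink the bias without removing it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

abil = rng.normal(0.0, 1.0, n)                        # true ability (unobserved in practice)
educ = 12.0 + 2.0 * abil + rng.normal(0.0, 2.0, n)    # education depends on ability
salary = (20_000.0 + 2_000.0 * educ + 5_000.0 * abil
          + rng.normal(0.0, 5_000.0, n))
iq = 100.0 + 15.0 * abil + rng.normal(0.0, 10.0, n)   # noisy proxy for ability

def educ_coef(*controls):
    """Coefficient on education when regressing salary on education plus controls."""
    X = np.column_stack([np.ones(n), educ, *controls])
    coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
    return coef[1]

b_omitted = educ_coef()       # ability left out entirely
b_proxy = educ_coef(iq)       # proxy stands in for ability
b_full = educ_coef(abil)      # infeasible benchmark: ability observed

print(f"omitted: {b_omitted:.0f}  proxy: {b_proxy:.0f}  full: {b_full:.0f}")
```

In this setup the proxy estimate should land between the biased estimate and the benchmark: closer to the true effect of 2,000 than the model without any ability measure, but still above it because the proxy measures ability with error.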

Estimating omitted variable bias

Without getting too far into advanced algebra, we can use logical thinking to predict the direction of the omitted variable bias. In this way, we can establish whether we have overestimated or underestimated the effect of the variable we included in our regression model.

The table below summarises the direction of the omitted variable bias. The sign of the bias is based on the sign of the relationships between the omitted variables and the variables in the model.

Let’s assume:

Y is the dependent variable
A is an independent variable
B is another independent variable, the omitted variable.

                               | A and B are positively correlated | A and B are negatively correlated
B has a positive effect on Y   | Positive bias                     | Negative bias
B has a negative effect on Y   | Negative bias                     | Positive bias

Note that with positive bias, we tend to overestimate, while with negative bias, we tend to underestimate.
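The table boils down to multiplying two signs. As a minimal sketch, the hypothetical helper below (not part of any library) encodes each relationship as +1 or -1 and returns the direction of the bias:

```python
def sign_omitted_variable_bias(effect_of_b_on_y: int, corr_a_b: int) -> str:
    """Return the direction of the bias on A's coefficient when B is omitted.

    Both arguments are signs: +1 for a positive relationship, -1 for a negative one.
    """
    if effect_of_b_on_y not in (-1, 1) or corr_a_b not in (-1, 1):
        raise ValueError("signs must be +1 or -1")
    # Same signs multiply to a positive product, opposite signs to a negative one.
    return "positive bias" if effect_of_b_on_y * corr_a_b > 0 else "negative bias"

# Reproduce the four cells of the table.
for effect in (1, -1):
    for corr in (1, -1):
        print(f"effect={effect:+d}, corr={corr:+d}: "
              f"{sign_omitted_variable_bias(effect, corr)}")
```

If either relationship is zero, there is no bias to sign, which is why the helper rejects any input other than +1 or -1.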

Example: Estimating omitted variable bias
We can now make a logical conjecture about how ability affects education, as well as how ability affects salary.

As a reminder, our regression as it stands now is:

Salary = β0 + β1 ∗ Educ + ε

While it should be:

Salary = β0 + β1 ∗ Educ + β2 ∗ Abil + ε
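For readers who want the algebra behind the sign rules, the standard textbook derivation can be sketched as follows. The symbols δ0, δ1 and v are not part of this article; they are introduced here to describe how ability varies with education, with v being the part of ability unrelated to education. The sketch also assumes ε in the complete model is unrelated to both education and ability.

```latex
\begin{aligned}
\text{Abil} &= \delta_0 + \delta_1\,\text{Educ} + v \\
\text{Salary} &= \beta_0 + \beta_1\,\text{Educ}
  + \beta_2\,(\delta_0 + \delta_1\,\text{Educ} + v) + \varepsilon \\
&= (\beta_0 + \beta_2\delta_0)
  + (\beta_1 + \beta_2\delta_1)\,\text{Educ}
  + (\beta_2 v + \varepsilon)
\end{aligned}
```

Under these assumptions, a regression of salary on education alone estimates β1 + β2δ1 rather than β1. The bias is the product β2δ1, so its sign depends on the sign of ability's effect on salary (β2) and on the sign of the relationship between ability and education (δ1).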

We would expect that the higher the education, the higher the salary. So, we can expect β1 to have a positive sign, i.e., β1 > 0.

We would also expect the higher the ability, the higher the salary. So, we can expect β2 to also have a positive sign, i.e., β2 > 0.

At the same time, the higher the ability, the higher the education level completed. Therefore, we can conclude that:

  1. Salary and ability are positively correlated
  2. Education and ability are positively correlated

What does this imply for our regression analysis? We know that education is likely to lead to higher salary. At the same time, someone with a higher level of education likely has a higher level of ability.

When omitting the ability variable, we see that the education variable may actually also be accounting for the effects of ability, and not just education.

Thus, β1 suffers from bias. More specifically, it suffers from upward bias because ability has a positive effect on salary and is positively correlated with education. Leaving out ability lets the coefficient of education pick up parts of the positive effects of ability.

Since ability is likely to be positively correlated with both salary and education, we can conclude that the effect of education on salary is overestimated in our analysis.
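This conclusion can be checked numerically on synthetic data, where ability is observable by construction. The sketch below assumes a made-up data-generating process and a hypothetical helper `ols`. It relies on a standard property of least squares: within a sample, the education coefficient from the model without ability equals the coefficient from the model with ability plus the ability coefficient times the slope from regressing ability on education.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

abil = rng.normal(size=n)
educ = 12.0 + 2.0 * abil + rng.normal(scale=2.0, size=n)
salary = (20_000.0 + 2_000.0 * educ + 5_000.0 * abil
          + rng.normal(scale=5_000.0, size=n))

def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, ...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_short = ols(salary, educ)[1]                  # ability omitted
_, b1_long, b2_long = ols(salary, educ, abil)   # ability included
delta1 = ols(abil, educ)[1]                     # how ability moves with education

bias = b2_long * delta1
print(f"estimated bias on the education coefficient: {bias:.0f}")
print("decomposition holds:", bool(np.isclose(b_short, b1_long + bias)))
```

Because both `b2_long` and `delta1` are positive in this setup, the estimated bias is positive, matching the "positive effect, positive correlation" cell of the table.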

Frequently asked questions

How do I prevent omitted variable bias from interfering with research?

Omitted variable bias is common in linear regression as it’s usually not possible to include all relevant variables in the model. You can mitigate the effects of omitted variable bias by:

  • Using control variables
  • Using proxies for the omitted variables
  • Using logic to predict whether you have overestimated or underestimated the effect of the variable(s) included in your regression model

What are the two requirements that must be fulfilled for omitted variable bias to occur?

Omitted variable bias occurs when two requirements are fulfilled:

  1. The omitted variable relates to the dependent variable.
  2. The omitted variable relates to one or more other independent variables.

Why does omitted variable bias matter?

Omitted variable bias matters because it can lead researchers to draw false conclusions by attributing the effects of a missing variable to those that are included in a statistical model.

Sources for this article

This Scribbr article

Nikolopoulou, K. (2023, March 16). What Is Omitted Variable Bias? | Definition & Example. Scribbr. Retrieved 9 December 2024, from https://www.scribbr.co.uk/bias-in-research/omitted-variable-bias-explained/

Kassiani Nikolopoulou

Kassiani has an academic background in Communication, Bioeconomy and Circular Economy. As a former journalist she enjoys turning complex scientific information into easily accessible articles to help students. She specialises in writing about research methods and research bias.