Missing Data | Types, Explanation, & Imputation

Missing data, or missing values, occur when you don’t have data stored for certain variables or participants. Data can go missing due to incomplete data entry, equipment malfunctions, lost files, and many other reasons.

In any dataset, there are usually some missing data. In quantitative research, missing values appear as blank cells in your spreadsheet.
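
As a minimal sketch of how blank cells look once a spreadsheet is loaded for analysis, the Python snippet below uses pandas, which marks missing cells as NaN. The table and its column names (`age`, `spending`) are invented for illustration and do not come from a real survey.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; np.nan stands in for a blank spreadsheet cell.
survey = pd.DataFrame({
    "age": [23, 35, np.nan, 51, 19],
    "spending": [120.0, np.nan, 300.0, np.nan, 80.0],
})

# Number of missing values in each column.
missing_counts = survey.isna().sum()

# Share of missing values in each column (between 0.0 and 1.0).
missing_share = survey.isna().mean()

print(missing_counts)
print(missing_share)
```

Counting missing values per variable is usually a sensible first step before deciding how to handle them.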

Types of missing data

Missing data are errors because your data don’t represent the true values of what you set out to measure.

The reason for the missing data is important to consider, because it helps you determine the type of missing data and what you need to do about it.

There are three main types of missing data.

Type | Definition
Missing completely at random (MCAR) | Missing data are randomly distributed across the variable and unrelated to other variables.
Missing at random (MAR) | Missing data are not randomly distributed, but they are accounted for by other observed variables.
Missing not at random (MNAR) | Missing data systematically differ from the observed values.

Example: Research project
You collect data on end-of-year holiday spending patterns. You survey adults on how much they spend annually on gifts for family and friends in dollar amounts.

Missing completely at random

When data are missing completely at random (MCAR), the probability of any particular value being missing from your dataset is unrelated to anything else.

The missing values are randomly distributed, so they can come from anywhere in the whole distribution of your values. MCAR data are also unrelated to any other variables, whether observed or unobserved.

Example: MCAR data
You note that there are a few missing values in your holiday spending dataset. Some people started answering your survey but dropped out or skipped a question.

However, you note that you have data points from a wide distribution, ranging from low to high values.

Therefore, you conclude that the missing values aren’t related to any specific holiday spending amount range.

Data are often considered MCAR if they seem unrelated to specific values or other variables. In practice, it’s hard to meet this assumption because ‘true randomness’ is rare.

When data are missing due to equipment malfunctions or lost samples, they are considered MCAR.

Missing at random

Data missing at random (MAR) are not actually missing at random; this term is a bit of a misnomer.

This type of missing data systematically differs from the data you’ve collected, but it can be fully accounted for by other observed variables.

The likelihood of a data point being missing is related to another observed variable but not to the specific value of that data point itself.

Example: MAR data
You repeat your data collection with a new group. You notice that there are more missing values for adults aged 18–25 than for other age groups.

But looking at the observed data for adults aged 18–25, you notice that the values are widely spread. It’s unlikely that the missing data are missing because of the specific values themselves.

Instead, some younger adults may be less inclined to reveal their holiday spending amounts for unrelated reasons (e.g., more protective of their privacy).
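
The kind of check described in this example could be sketched as follows in Python with pandas. The data and the column names (`age_group`, `spending`) are hypothetical. A higher missing rate in one group is consistent with MAR, but it cannot prove it, because the observed data alone can't rule out MNAR.

```python
import numpy as np
import pandas as pd

# Hypothetical data: age group is always observed, spending is sometimes missing.
survey = pd.DataFrame({
    "age_group": ["18-25", "18-25", "18-25", "18-25",
                  "26-40", "26-40", "26-40", "26-40"],
    "spending": [np.nan, 50.0, np.nan, 400.0, 120.0, 300.0, np.nan, 90.0],
})

# Proportion of missing spending values within each age group.
missing_by_group = survey["spending"].isna().groupby(survey["age_group"]).mean()

# Spread of the observed spending values within each age group.
observed_range = survey.groupby("age_group")["spending"].agg(["min", "max"])

print(missing_by_group)
print(observed_range)
```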

Missing not at random

Data missing not at random (MNAR) are missing for reasons related to the values themselves.

Example: MNAR data
In the new dataset, you also notice that there are fewer low values. Some participants with low incomes avoid reporting their holiday spending amounts because those amounts are low.

This type of missing data is important to look for because you may lack data from key subgroups within your sample. Your sample may not end up being representative of your population.

Attrition bias

In longitudinal studies, attrition bias can be a form of MNAR data. Attrition bias means that some participants are more likely to drop out than others.

For example, in long-term medical studies, some participants may drop out because they become more and more unwell as the study continues. Their data are MNAR because their health outcomes are worse, so your final dataset may only include healthy individuals, and you miss out on important data.

Are missing data problematic?

Missing data are problematic because, depending on the type, they can sometimes bias your results. This means your results may not be generalisable outside of your study because your data come from an unrepresentative sample.

In practice, you can often consider two types of missing data ignorable because whether a value is missing doesn’t depend on the missing value itself:

  • MCAR data
  • MAR data

For these two data types, the likelihood of a data point being missing has nothing to do with the value itself once any related observed variables are taken into account. So it’s unlikely that your missing values are significantly different from the observed values of comparable cases.

On the flip side, you have a biased dataset if the missing data systematically differ from your observed data. Data that are MNAR are called non-ignorable for this reason.
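
The contrast between ignorable and non-ignorable missingness can be illustrated with a small simulation. The sketch below uses only Python's standard library. The spending distribution, the 400 cut-off, and the missingness rates are all assumptions chosen for illustration, and MAR is left out to keep the example short.

```python
import random
import statistics

random.seed(42)

# Hypothetical "true" spending values for 10,000 respondents.
true_values = [random.gauss(500, 150) for _ in range(10_000)]

# MCAR: every value has the same 30% chance of being missing.
mcar_observed = [v for v in true_values if random.random() > 0.30]

# MNAR: values below 400 go missing 60% of the time, all others 10% of the time.
mnar_observed = [
    v for v in true_values
    if random.random() > (0.60 if v < 400 else 0.10)
]

true_mean = statistics.mean(true_values)
mcar_mean = statistics.mean(mcar_observed)
mnar_mean = statistics.mean(mnar_observed)

print(round(true_mean), round(mcar_mean), round(mnar_mean))
```

Under these assumptions, the MCAR mean stays close to the true mean, while the MNAR mean is pulled upwards because low spenders are under-represented.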

How to prevent missing data

Missing data often come from attrition, non-response, or poorly designed research protocols. When designing your study, it’s good practice to make it easy for your participants to provide data.

Here are some tips to help you minimise missing data:

  • Limit the number of follow-ups
  • Minimise the amount of data collected
  • Make data collection forms user friendly
  • Use data validation techniques
  • Offer incentives

After you’ve collected data, it’s important to store them carefully, with multiple backups.

How to deal with missing values

To tidy up your data, your options usually include accepting, removing, or recreating the missing data.

You should consider how to deal with each case of missing data based on your assessment of why the data are missing.

  • Are these data missing for random or non-random reasons?
  • Are the data missing because they represent zero or null values?
  • Was the question or measure poorly designed?

Your data can be accepted, or left as is, if they’re MCAR or MAR. However, MNAR data may need more complex treatment.

Acceptance

The most conservative option involves accepting your missing data: you simply leave these cells blank.

It’s best to do this when you believe you’re dealing with MCAR or MAR values. When you have a small sample, you’ll want to conserve as much data as possible because any data removal can affect your statistical power.

You might also recode all missing values with labels of ‘N/A’ (short for ‘not applicable’) to make them consistent throughout your dataset.

These actions help you retain data from as many research subjects as possible with few or no changes.

Deletion

You can remove missing data from analyses using listwise or pairwise deletion.

Listwise deletion

Listwise deletion means deleting data from all cases (participants) who have data missing for any variable in your dataset. You’ll have a dataset that’s complete for all participants included in it.

A downside of this technique is that you may end up with a much smaller and/or a biased sample to work with. If significant amounts of data are missing from some variables or measures in particular, the participants who provide those data might significantly differ from those who don’t.

Your sample could be biased because it doesn’t adequately represent the population.

Example: Listwise deletion
You decide to remove all participants with missing data from your survey dataset. This reduces your sample from 114 to 77 participants.

You notice that most of the participants with missing data left a specific question about their opinions unanswered. Many of those participants were also women, so your sample now mainly consists of men.
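
In Python with pandas, listwise deletion could be sketched as below. The five-row table is invented for illustration. `dropna()` with its default settings removes every row that contains at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses; None and np.nan both mark missing answers.
survey = pd.DataFrame({
    "gender": ["F", "M", None, "F", "M"],
    "age": [23, 35, 41, np.nan, 19],
    "opinion": [4, np.nan, 3, 5, 2],
})

# Listwise deletion: keep only rows with no missing values at all.
complete_cases = survey.dropna()

print(len(survey), "->", len(complete_cases))
```

Comparing the make-up of the full and reduced samples, for example with `value_counts()` on a demographic column, is one way to spot the kind of imbalance described in the example.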

Pairwise deletion

Pairwise deletion lets you keep more of your data by leaving a case out only of the analyses that involve a variable it’s missing. It conserves more of your data because all available data from cases are included.

It also means that you have an uneven sample size for each of your variables. But it’s helpful when you have a small sample or a large proportion of missing values for some variables.

When you perform analyses with multiple variables, such as a correlation, only cases (participants) with complete data for each variable are included.

Example: Pairwise deletion
You decide to only remove missing values, while retaining the other data points for these participants. This does not reduce your overall sample size.

  • 12 people didn’t answer a question about their gender, reducing the sample size from 114 to 102 participants for the variable ‘gender’.
  • 3 people didn’t answer a question about their age, reducing the sample size from 114 to 111 participants for the variable ‘age’.

You are able to retain more values this way, but the sample size now differs across variables.
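
A minimal pandas sketch of the same idea follows. The data and the column names are hypothetical. `count()` reports the number of observed values per variable, and `DataFrame.corr()` excludes missing values pair by pair by default.

```python
import numpy as np
import pandas as pd

# Hypothetical data with different values missing in each variable.
survey = pd.DataFrame({
    "age": [23, 35, np.nan, 51, 19, 44],
    "spending": [120.0, 260.0, 300.0, np.nan, 80.0, 310.0],
    "rating": [4, np.nan, 3, 5, 2, 4],
})

# Each variable keeps all of its own observed values.
n_per_variable = survey.count()

# A two-variable analysis only uses rows observed on both variables.
n_age_spending = survey[["age", "spending"]].dropna().shape[0]

# Correlations are computed on the pairwise-complete rows for each pair.
correlations = survey.corr()

print(n_per_variable)
print(n_age_spending)
print(correlations)
```

Because each pair of variables may rest on a different subset of participants, it helps to report the sample size alongside each statistic.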

Imputation

Imputation means replacing a missing value with another value based on a reasonable estimate. You use other data to recreate the missing value for a more complete dataset.

You can choose from several imputation methods.

The easiest method of imputation involves replacing missing values with the mean or median value for that variable.
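
As a rough sketch, mean and median imputation might look like this in Python with pandas. The spending values are made up, with one deliberately large value to show how the two choices can differ.

```python
import numpy as np
import pandas as pd

# Hypothetical spending values with two blanks and one large outlier.
spending = pd.Series([100.0, 200.0, np.nan, 300.0, 1000.0, np.nan])

# mean() and median() skip missing values by default.
mean_imputed = spending.fillna(spending.mean())      # fills blanks with 400.0
median_imputed = spending.fillna(spending.median())  # fills blanks with 250.0

print(mean_imputed.tolist())
print(median_imputed.tolist())
```

The median is less affected by extreme values, so it is often preferred for skewed variables.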

Hot-deck imputation

In hot-deck imputation, you replace each missing value with an existing value from a similar case or participant within your dataset. For each case with missing values, the missing value is replaced by a value from a so-called ‘donor’ that’s similar to that case based on data for other variables.

Example: Hot-deck imputation
In a survey, you ask participants to answer questions about how they rate a new shopping app from 1 to 5. You notice that two participants skipped Question 3, so these cells are empty.

You sort the data based on other variables and search for participants who responded similarly to other questions compared to your participants with missing values.

You take the answer to Question 3 from a donor and use it to fill in the blank cell for each missing value.
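
One simple way to sketch this procedure in plain Python is a nearest-donor search. The ratings, the field names (`q1`, `q2`, `q3`), and the distance measure are assumptions made for illustration. Real hot-deck procedures often pick donors at random from a pool of similar cases instead.

```python
# Hypothetical app ratings (1 to 5); None marks a skipped Question 3.
responses = [
    {"id": 1, "q1": 5, "q2": 4, "q3": 5},
    {"id": 2, "q1": 2, "q2": 1, "q3": 2},
    {"id": 3, "q1": 5, "q2": 5, "q3": None},
    {"id": 4, "q1": 1, "q2": 2, "q3": None},
    {"id": 5, "q1": 3, "q2": 3, "q3": 3},
]


def distance(a, b, keys=("q1", "q2")):
    """Sum of absolute differences on the fully answered questions."""
    return sum(abs(a[k] - b[k]) for k in keys)


def hot_deck(rows, target="q3"):
    """Fill each missing target value with the value of the closest donor."""
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the original data stay untouched
        if new_row[target] is None:
            # On a tie, min() keeps the first donor it encounters.
            donor = min(donors, key=lambda d: distance(new_row, d))
            new_row[target] = donor[target]
        filled.append(new_row)
    return filled


imputed = hot_deck(responses)
print([row["q3"] for row in imputed])
```

Here participant 3 receives the answer of participant 1, and participant 4 receives the answer of participant 2, because those donors answered the other questions most similarly.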

Cold-deck imputation

Alternatively, in cold-deck imputation, you replace missing values with existing values from similar cases from other datasets. The new values come from an unrelated sample.

Example: Cold-deck imputation
Instead of replacing the missing values with answers from participants from the same sample, you open a different dataset from a coworker. They conducted a similar survey but used a different sample.

You search for participants who responded similarly to other questions compared to your participants with missing values.

You take the answer to Question 3 from the other dataset and use it to fill in the blank cell for each missing value.

Use imputation carefully

Imputation is a complicated task because you have to weigh the pros and cons.

Although you retain all of your data, this method can create bias and lead to inaccurate results. You can never know for sure whether the replaced value accurately reflects what would have been observed or answered. That’s why it’s best to apply imputation with caution.
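
One concrete way imputation can distort results is by shrinking the apparent spread of a variable. The sketch below shows this for mean imputation only, using Python's standard library and invented numbers.

```python
import statistics

observed = [1.0, 2.0, 3.0, 4.0, 5.0]  # values actually collected
n_missing = 5                          # hypothetical number of blank cells

# Mean imputation: every blank cell receives the mean of the observed values.
mean_value = statistics.mean(observed)
imputed = observed + [mean_value] * n_missing

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)

print(round(sd_observed, 2), round(sd_imputed, 2))
```

The mean is unchanged, but the standard deviation of the imputed data is smaller than that of the observed data, which can make estimates look more precise than they are.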

Frequently asked questions

What are missing data?

Missing data, or missing values, occur when you don’t have data stored for certain variables or participants.

In any dataset, there are usually some missing data. In quantitative research, missing values appear as blank cells in your spreadsheet.

Why are missing data important?

Missing data are important because, depending on the type, they can sometimes bias your results. This means your results may not be generalisable outside of your study because your data come from an unrepresentative sample.

How do I deal with missing data?

To tidy up your missing data, your options usually include accepting, removing, or recreating the missing data.

  • Acceptance: You leave your data as is
  • Listwise or pairwise deletion: You remove either all cases (participants) with missing data or only the missing data points from analyses
  • Imputation: You use other data to fill in the missing data

What are the types of missing data?

There are three main types of missing data.

Missing completely at random (MCAR) data are randomly distributed across the variable and unrelated to other variables.

Missing at random (MAR) data are not randomly distributed but they are accounted for by other observed variables.

Missing not at random (MNAR) data systematically differ from the observed values.

Cite this Scribbr article

Bhandari, P. (2022, October 04). Missing Data | Types, Explanation, & Imputation. Scribbr. Retrieved 11 November 2024, from https://www.scribbr.co.uk/stats/missing-values/

Pritha Bhandari

Pritha has an academic background in English, psychology and cognitive neuroscience. As an interdisciplinary researcher, she enjoys writing articles explaining tricky research concepts for students and academics.