Data Cleaning | A Guide with Examples & Steps

Data cleaning involves spotting and resolving potential data inconsistencies or errors to improve your data quality. An error is any value (e.g., recorded weight) that doesn’t reflect the true value (e.g., actual weight) of whatever is being measured.

In this process, you review, analyse, detect, modify, or remove ‘dirty’ data to make your dataset ‘clean’. Data cleaning is also called data cleansing or data scrubbing.

Why does data cleaning matter?

In quantitative research, you collect data and use statistical analyses to answer a research question. Using hypothesis testing, you find out whether your data demonstrate support for your research predictions.

Example: Quantitative research
You investigate whether a new drug reduces the effects of fatigue.

You survey participants before and at the end of the drug treatment. Using closed-ended questions, you ask Likert-scale questions about participants’ experiences and symptoms on a 1-to-7 scale

Errors are often inevitable, but cleaning your data helps you minimise them. If you don’t remove or resolve these errors, you could end up with a false or invalid study conclusion.

Example: Data errors
Most of the questions are framed positively, but some questions have negative frames to engage the participants.

Question: Please rate the extent to which you agree or disagree with these statements from 1 to 7.

  • Positive frame: I feel well rested when I wake up in the morning.
  • Negative frame: I do not feel energetic after getting 8 hours of sleep at night.

Both questions measure the same thing: how respondents feel after waking up in the morning. But the answers to negatively worded questions need to be reverse-coded before analysis so that all answers are consistently in the same direction.

Reverse coding means flipping the number scale in the opposite direction so that an extreme value (e.g., 1 or 7) means the same thing for each question.

If you forget to reverse-code these answers before analysis, you may end up with an invalid conclusion because of data errors.

With inaccurate or invalid data, you might make a Type I or II error in your conclusion. These types of erroneous conclusions can be practically significant with important consequences, because they lead to misplaced investments or missed opportunities.

Example: Type I error
Based on the results, you make a Type I error. You conclude that the drug is effective when it’s not.

Your organisation decides to invest in this new drug and people are prescribed the drug instead of effective therapies.

Dirty vs clean data

Dirty data include inconsistencies and errors. These data can come from any part of the research process, including poor research design, inappropriate measurement materials, or flawed data entry.

Clean data meet some requirements for high quality while dirty data are flawed in one or more ways. Let’s compare dirty with clean data.

Dirty data Clean data
Invalid Valid
Inaccurate Accurate
Incomplete Complete
Inconsistent Consistent
Duplicate entries Unique
Incorrectly formatted Uniform

Valid data

Valid data conform to certain requirements for specific types of information (e.g., whole numbers, text, dates). Invalid data don’t match up with the possible values accepted for that observation.

Example: Data validation
A date of birth on a form may only be recognised if it’s formatted a certain way, for example, as dd-mm-yyyy, if you use data validation techniques.

The day field will allow numbers up to 31, the month field up to 12, and the year field up to 2021. If any numbers exceed those values, the form won’t be submitted.

Without valid data, your data analysis procedures may not make sense. It’s best to use data validation techniques to make sure your data are in the right formats before you analyse them.

Accurate data

In measurement, accuracy refers to how close your observed value is to the true value. While data validity is about the form of an observation, data accuracy is about the actual content.

Example: Inaccurate data
You ask survey respondents the following question:

How often do you go grocery shopping in person?

  • Every day
  • Once a week
  • Biweekly
  • Once a month
  • Less than once a month
  • Never

Some of the respondents select ‘biweekly’ as their answer. But this word can mean either twice a week or once every two weeks, and these are fairly different frequencies.

You have no idea how each person interpreted this word, so your data are inaccurate because of inadequate response items.

Complete data

Complete data are measured and recorded thoroughly. Incomplete data are statements or records with missing information.

Example: Incomplete data
In an online survey, a participant starts entering a response to an open-ended question. But they get distracted and do something else before returning to the survey. They move on to the next question without filling in a complete answer.

Reconstructing missing data isn’t easy to do. Sometimes, you might be able to contact a participant and ask them to redo a survey or an interview, but you might not get the answer that you would have otherwise.

Consistent data

Clean data are consistent across a dataset. For each member of your sample, the data for different variables should line up to make sense logically.

Example: Inconsistent data
In your survey, you collect information about demographic variables, including age, ethnicity, education level, and socioeconomic status. One participant enters ’13’ for their age and PhD-level education as their highest attained degree.

These data are inconsistent because it’s highly unlikely for a 13-year-old to hold a doctorate degree in your specific sample. It’s more likely that an incorrect age was entered.

Unique data

In data collection, you may accidentally record data from the same participant twice.

Example: Duplicate entries
In an online survey, a participant fills in the questionnaire and hits enter twice to submit it. The data gets reported twice on your end.

It’s important to review your data for identical entries and remove any duplicate entries in data cleaning. Otherwise, your data might be skewed.

Uniform data

Uniform data are reported using the same units of measure. If data aren’t all in the same units, they need to be converted to a standard measure.

Example: Nonuniform data
In a survey, you ask participants to enter their gross salary in pounds.

Some participants respond with their monthly salary, while others report their annual salary.

Unless you provide a time unit, participants may answer this question using different time frames. You won’t know for sure whether they’re reporting their monthly or annual salary.

How do you clean data?

Every dataset requires different techniques to clean dirty data, but you need to address these issues in a systematic way. You’ll want to conserve as much of your data as possible while also ensuring that you end up with a clean dataset.

Data cleaning is a difficult process because errors are hard to pinpoint once the data are collected. You’ll often have no way of knowing if a data point reflects the actual value of something accurately and precisely.

In practice, you may focus instead on finding and resolving data points that don’t agree or fit with the rest of your dataset in more obvious ways. These data might be missing values, outliers, incorrectly formatted, or irrelevant.

You can choose a few techniques for cleaning data based on what’s appropriate. What you want to end up with is a valid, consistent, unique, and uniform dataset that’s as complete as possible.

Data cleaning workflow

Generally, you start data cleaning by scanning your data at a broad level. You review and diagnose issues systematically and then modify individual items based on standardised procedures. Your workflow might look like this:

  1. Apply data validation techniques to prevent dirty data entry.
  2. Screen your dataset for errors or inconsistencies.
  3. Diagnose your data entries.
  4. Develop codes for mapping your data into valid values.
  5. Transform or remove your data based on standardised procedures.

Not all of these steps will be relevant to every dataset. You can carefully apply data cleaning techniques where necessary, with clear documentation of your processes for transparency.

By documenting your workflow, you ensure that other people can review and replicate your procedures.

Data validation

Data validation involves applying constraints to make sure you have valid and consistent data. It’s usually applied even before you collect data, when designing questionnaires or other measurement materials requiring manual data entry.

Different data validation constraints help you minimise the amount of data cleaning you’ll need to do.

Data-type constraints: Values can only be accepted if they are of a certain type, such as numbers or text.

Example: Data-type constraint
If a date is entered with both text and numbers (e.g., 20 March 2021), instead of just numbers (e.g., 20-03-2021), it will not be accepted.

Range constraints: Values must fall within a certain range to be valid.

Example: Range constraint
You design a questionnaire for a target population with ages ranging from 18 to 45. When reporting age, participants can only enter a value between 18 and 45 to proceed with the form.

Mandatory constraints: A value must be entered.

Example: Mandatory constraint
Participants filling in a form must select a button that says ‘I consent’ to begin.

Data screening

Once you’ve collected your data, it’s best to create a backup of your original dataset and store it safely. If you make any mistakes in your workflow, you can always start afresh by duplicating the backup and working from the new copy of your dataset.

Data screening involves reviewing your dataset for inconsistent, invalid, missing, or outlier data. You can do this manually or with statistical methods.

Step 1: Straighten up your dataset

These actions will help you keep your data organised and easy to understand.

  • Turn each variable (measure) into a column and each case (participant) into a row.
  • Give your columns unique and logical names.
  • Remove any empty rows from your dataset.

Step 2: Visually scan your data for possible discrepancies

Go through your dataset and answer these questions:

  • Are there formatting irregularities for dates, or textual or numerical data?
  • Do some columns have a lot of missing data?
  • Are any rows duplicate entries?
  • Do specific values in some columns appear to be extreme outliers?

Make note of these issues and consider how you’ll address them in your data cleaning procedure.

Step 3: Use statistical techniques and tables/graphs to explore data

By gathering descriptive statistics and visualisations, you can identify how your data are distributed and identify outliers or skewness.

  1. Explore your data visually with boxplots, scatterplots, or histograms
  2. Check whether your data are normally distributed
  3. Create summary (descriptive) statistics for each variable
  4. Summarise your quantitative data in frequency tables

You can get a rough idea of how your quantitative variable data are distributed by visualising them. Boxplots and scatterplots can show how your data are distributed and whether you have any extreme values. It’s important to check whether your variables are normally distributed so that you can select appropriate statistical tests for your research.

If your mean, median, and mode all differ from each other by a lot, there may be outliers in the dataset that you should look into.

Data diagnosing

After a general overview, you can start getting into the nitty-gritty of your dataset. You’ll need to create a standard procedure for detecting and treating different types of data.

Without proper planning, you might end up cherry-picking only some data points to clean, leading to a biased dataset.

Here we’ll focus on ways to deal with common problems in dirty data:

  • Duplicate data
  • Invalid data
  • Missing values
  • Outliers

De-duplication

De-duplication means detecting and removing any identical copies of data, leaving only unique cases or participants in your dataset.

Example: De-duplication
You compile your data in a spreadsheet where the columns are the questions and the rows are the participants. Each row contains one participant’s data.

You sort the data by a column and review the data row by row to check whether there are any identical rows. You remove identical copies of a row.

If duplicate data are left in the dataset, they will bias your results. Some participants’ data will be weighted more heavily than others’.

Invalid data

Using data standardisation, you can identify and convert data from varying formats into a uniform format.

Unlike data validation, you can apply standardisation techniques to your data after you’ve collected it. This involves developing codes to convert your dirty data into consistent and valid formats.

Data standardisation is helpful if you don’t have data constraints at data entry or if your data have inconsistent formats.

Example: Invalid data
Using an open-ended question, you ask participants to report their age. Your responses contain a mix of numbers and text, with some typos.

These are some of the responses:

  • 23
  • twenty
  • 19
  • eihgteen
  • 22

String-matching methods

To standardise inconsistent data, you can use strict or fuzzy string-matching methods to identify exact or close matches between your data and valid values.

A string is a sequence of characters. You compare your data strings to the valid values you expect to obtain and then remove or transform the strings that don’t match.

Strict string-matching: Any strings that don’t match the valid values exactly are considered invalid.

Example: Strict string-matching
Your valid values include numbers between 18 and 45 and any correctly spelled words denoting numbers with the first letter capitalised.

In this case, only 3 out of 5 values will be accepted with strict matching.

  • 23
  • twenty
  • 19
  • eihgteen
  • 22

Fuzzy string-matching: Strings that closely or approximately match valid values are recognised and corrected.

Example: Fuzzy string-matching
Your valid values include numbers between 18 and 45 and any words denoting numbers. You use a computer program to allow any values that closely match these valid values in your dataset.

For closely matching strings, your program checks how many edits are needed to change the string into a valid value, and if the number of edits is small enough, it makes those changes.

All five values will be accepted with fuzzy string-matching.

  • 23
  • twenty
  • 19
  • eihgteen
  • 22

After matching, you can transform your text data into numbers so that all values are consistently formatted.

Fuzzy string-matching is generally preferable to strict string-matching because more data are retained.

Missing data

In any dataset, there’s usually some missing data. These cells appear blank in your spreadsheet.

Missing data can come from random or systematic causes.

  • Random missing data include data entry errors, inattention errors, or misreading of measures.
  • Non-random missing data result from confusing, badly designed, or inappropriate measurements or questions.

Dealing with missing data

Your options for tackling missing data usually include:

  • Accepting the data as they are
  • Removing the case from analyses
  • Recreating the missing data

Random missing data are usually left alone, while non-random missing data may need removal or replacement.

With deletion, you remove participants with missing data from your analyses. But your sample may become smaller than intended, so you might lose statistical power.

Example: Missing data removal
You decide to remove all participants with missing data from your survey dataset. This reduces your sample from 114 participants to 77.

Alternatively, you can use imputation to replace a missing value with another value based on a reasonable estimate. You use other data to replace the missing value for a more complete dataset.

It’s important to apply imputation with caution, because there’s a risk of bias or inaccuracy.

Outliers

Outliers are extreme values that differ from most other data points in a dataset. Outliers can be true values or errors.

True outliers should always be retained, because these just represent natural variations in your sample. For example, athletes training for a 100-metre Olympic sprint have much higher speeds than most people in the population. Their sprint speeds are natural outliers.

Outliers can also result from measurement errors, data entry errors, or unrepresentative sampling. For example, an extremely low sprint time could be recorded if you misread the timer.

Detecting outliers

Outliers are always at the extreme ends of any variable dataset.

You can use several methods to detect outliers:

  • Sorting your values from low to high and checking minimum and maximum values
  • Visualising your data in a boxplot and searching for outliers
  • Using statistical procedures to identify extreme values

Dealing with outliers

Once you’ve identified outliers, you’ll decide what to do with them in your dataset. Your main options are retaining or removing them.

In general, you should try to accept outliers as much as possible unless it’s clear that they represent errors or bad data.

It’s important to document each outlier you remove and the reasons so that other researchers can follow your procedures.

Frequently asked questions about data cleaning

What is the definition of data cleaning?

Data cleaning involves spotting and resolving potential data inconsistencies or errors to improve your data quality. An error is any value (e.g., recorded weight) that doesn’t reflect the true value (e.g., actual weight) of something that’s being measured.

In this process, you review, analyse, detect, modify, or remove ‘dirty’ data to make your dataset ‘clean’. Data cleaning is also called data cleansing or data scrubbing.

Why does data cleaning matter?

Data cleaning is necessary for valid and appropriate analyses. Dirty data contain inconsistencies or errors, but cleaning your data helps you minimise or resolve these.

Without data cleaning, you could end up with a Type I or II error in your conclusion. These types of erroneous conclusions can be practically significant with important consequences, because they lead to misplaced investments or missed opportunities.

When do you clean data?

Data cleaning takes place between data collection and data analyses. But you can use some methods even before collecting data.

For clean data, you should start by designing measures that collect valid data. Data validation at the time of data entry or collection helps you minimize the amount of data cleaning you’ll need to do.

After data collection, you can use data standardisation and data transformation to clean your data. You’ll also deal with any missing values, outliers, and duplicate values.

What is the difference between clean and dirty data?

Clean data are valid, accurate, complete, consistent, unique, and uniform. Dirty data include inconsistencies and errors.

Dirty data can come from any part of the research process, including poor research design, inappropriate measurement materials, or flawed data entry.

Is this article helpful?
Pritha Bhandari

Pritha has an academic background in English, psychology and cognitive neuroscience. As an interdisciplinary researcher, she enjoys writing articles explaining tricky research concepts for students and academics.

Still have questions?

Please click the checkbox on the left to verify that you are a not a bot.