Data Management.com

data cleansing (data cleaning, data scrubbing)

By Craig Stedman

What is data cleansing?

Data cleansing, also referred to as data cleaning or data scrubbing, is the process of fixing incorrect, incomplete, duplicate or otherwise erroneous data in a data set. It involves identifying data errors and then changing, updating or removing data to correct them. Data cleansing improves data quality and helps provide more accurate, consistent and reliable information for decision-making in an organization.

Data cleansing is a key part of the overall data management process and one of the core components of data preparation work that readies data sets for use in business intelligence (BI) and data science applications. It's typically done by data quality analysts and engineers or other data management professionals. But data scientists, BI analysts and business users may also clean data or take part in the data cleansing process for their own applications.

Data cleansing vs. data cleaning vs. data scrubbing

Data cleansing, data cleaning and data scrubbing are often used interchangeably. For the most part, they're considered to be the same thing. In some cases, though, data scrubbing is viewed as an element of data cleansing that specifically involves removing duplicate, bad, unneeded or old data from data sets.

Data scrubbing also has a different meaning in connection with data storage. In that context, it's an automated function that checks disk drives and storage systems to make sure the data they contain can be read and to identify any bad sectors or blocks.

Why is clean data important?

Business operations and decision-making are increasingly data-driven, as organizations look to use data analytics to help improve business performance and gain competitive advantages over rivals. As a result, clean data is a must for BI and data science teams, business executives, marketing managers, sales reps and operational workers. That's particularly true in retail, financial services and other data-intensive industries, but it applies to organizations across the board, both large and small.

If data isn't properly cleansed, customer records and other business data may not be accurate and analytics applications may provide faulty information. That can lead to flawed business decisions, misguided strategies, missed opportunities and operational problems, which ultimately may increase costs and reduce revenue and profits. IBM estimated that data quality issues cost organizations in the U.S. a total of $3.1 trillion in 2016, a figure that's still widely cited.

What kind of data errors does data cleansing fix?

Data cleansing addresses a range of errors and issues in data sets, including inaccurate, invalid, incompatible and corrupt data. Some of those problems are caused by human error during the data entry process, while others result from the use of different data structures, formats and terminology in separate systems throughout an organization.
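As a rough illustration of the second cause, the Python sketch below reconciles records from two systems that use different date formats and terminology for the same values. The record layouts, the state mapping and the helper names (`standardize_date`, `standardize_row`) are all hypothetical and chosen only for this example.

```python
from datetime import datetime

# Hypothetical records from two systems that describe the same customer
# with different date formats and terminology.
CRM_ROWS = [{"customer": "Ann Lee", "state": "Calif.", "signup": "03/15/2021"}]
BILLING_ROWS = [{"customer": "Ann Lee", "state": "CA", "signup": "2021-03-15"}]

# Assumed mapping of state spellings to one standard code.
STATE_CODES = {"calif.": "CA", "california": "CA", "ca": "CA"}

# Date layouts this sketch knows how to read.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def standardize_date(value):
    """Return the date as ISO 8601 text, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def standardize_row(row):
    """Map one source row onto the shared representation."""
    return {
        "customer": row["customer"].strip(),
        # Unknown spellings become None so they can be flagged for review.
        "state": STATE_CODES.get(row["state"].strip().lower()),
        "signup": standardize_date(row["signup"]),
    }


standardized = [standardize_row(r) for r in CRM_ROWS + BILLING_ROWS]
```

Values that match no known format come back as `None` rather than being guessed at, so they can be reviewed by a person instead of being silently changed.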

The types of issues that are commonly fixed as part of data cleansing projects include the following:

What are the steps in the data cleansing process?

The scope of data cleansing work varies depending on the data set and analytics requirements. For example, a data scientist doing fraud detection analysis on credit card transaction data may want to retain outlier values because they could be a sign of fraudulent purchases. Even so, the data cleansing process typically includes the following actions:

  1. Inspection and profiling. First, data is inspected and audited to assess its quality level and identify issues that need to be fixed. This step usually involves data profiling, which documents relationships between data elements, checks data quality and gathers statistics on data sets to help find errors, discrepancies and other problems.
  2. Cleaning. This is the heart of the cleansing process, when data errors are corrected and inconsistent, duplicate and redundant data is addressed.
  3. Verification. After the cleaning step is completed, the person or team that did the work should inspect the data again to verify its cleanliness and make sure it conforms to internal data quality rules and standards.
  4. Reporting. The results of the data cleansing work should then be reported to IT and business executives to highlight data quality trends and progress. The report could include the number of issues found and corrected, plus updated metrics on the data's quality levels.

The cleansed data can then be moved into the remaining stages of data preparation, starting with data structuring and data transformation, to continue readying it for analytics uses.
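The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production workflow: the sample records, the field names and the quality rules (a non-empty email, an age between 0 and 120, unique ids) are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical customer records; field names and rules are assumptions.
RAW = [
    {"id": 1, "email": "ann@example.com", "age": 34},
    {"id": 2, "email": "BOB@EXAMPLE.COM ", "age": 41},
    {"id": 2, "email": "bob@example.com", "age": 41},   # duplicate id
    {"id": 3, "email": "", "age": 29},                  # missing email
    {"id": 4, "email": "dee@example.com", "age": -5},   # invalid age
]


def find_issues(rows):
    """Profile the rows and count issues by type."""
    issues = Counter()
    seen = set()
    for row in rows:
        if row["id"] in seen:
            issues["duplicate_id"] += 1
        seen.add(row["id"])
        if not row["email"].strip():
            issues["missing_email"] += 1
        if not 0 <= row["age"] <= 120:
            issues["invalid_age"] += 1
    return issues


def clean(rows):
    """Standardize emails, drop repeated ids and drop rows that fail a rule."""
    cleaned, seen = [], set()
    for row in rows:
        email = row["email"].strip().lower()
        if row["id"] in seen or not email or not 0 <= row["age"] <= 120:
            continue
        seen.add(row["id"])
        cleaned.append({**row, "email": email})
    return cleaned


before = find_issues(RAW)          # 1. inspection and profiling
cleaned = clean(RAW)               # 2. cleaning
after = find_issues(cleaned)       # 3. verification against the same rules
report = {                         # 4. reporting
    "rows_in": len(RAW),
    "rows_out": len(cleaned),
    "issues_found": sum(before.values()),
    "issues_remaining": sum(after.values()),
}
print(report)
```

Dropping every failing row is the simplest possible policy and is used here only to keep the sketch short. Depending on the analytics requirements, a team might instead fill in missing values, route records for manual review or deliberately keep outliers, as in the fraud detection example.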

Characteristics of clean data

Various data characteristics and attributes are used to measure the cleanliness and overall quality of data sets, including the following:

Data management teams create data quality metrics to track those characteristics, as well as things like error rates and the overall number of errors in data sets. Many also try to calculate the business impact of data quality problems and the potential business value of fixing them, partly through surveys and interviews with business executives.
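One way such metrics might be computed is shown in the Python sketch below, which counts rule violations and derives an error rate. The records, the validity rules and the `quality_metrics` function are hypothetical. Real metrics depend on the quality rules an organization defines for its own data.

```python
# Hypothetical records and validity rules; both are assumptions for illustration.
ROWS = [
    {"sku": "A-100", "price": 19.99, "qty": 3},
    {"sku": "", "price": 5.00, "qty": 1},           # missing sku
    {"sku": "B-200", "price": -1.00, "qty": 2},     # negative price
    {"sku": "C-300", "price": 7.50, "qty": None},   # missing qty
]

RULES = {
    "sku": lambda v: isinstance(v, str) and v.strip() != "",
    "price": lambda v: isinstance(v, (int, float)) and v >= 0,
    "qty": lambda v: isinstance(v, int) and v >= 0,
}


def quality_metrics(rows, rules):
    """Return total errors, the share of failing values and per-field counts."""
    per_field = {field: 0 for field in rules}
    for row in rows:
        for field, is_valid in rules.items():
            if not is_valid(row.get(field)):
                per_field[field] += 1
    total_checks = len(rows) * len(rules)
    total_errors = sum(per_field.values())
    return {
        "total_errors": total_errors,
        "error_rate": total_errors / total_checks if total_checks else 0.0,
        "errors_by_field": per_field,
    }


metrics = quality_metrics(ROWS, RULES)
```

Here the error rate is defined as failing values divided by all values checked. That is only one possible definition, and a team could just as reasonably track the share of records with at least one error.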

The benefits of effective data cleansing

Done well, data cleansing provides the following business and data management benefits:

Data cleansing and other data quality methods are also a key part of data governance programs, which aim to ensure that the data in enterprise systems is consistent and gets used properly. Clean data is one of the hallmarks of a successful data governance initiative.

Data cleansing challenges

Data cleansing doesn't lack for challenges. One of the biggest is that it's often time-consuming, due to the number of issues that need to be addressed in many data sets and the difficulty of pinpointing the causes of some errors. Other common challenges include the following:

Data cleansing tools and vendors

Numerous tools can be used to automate data cleansing tasks, including both commercial software and open source technologies. Typically, the tools include a variety of functions for correcting data errors and issues, such as adding missing values, replacing null ones, fixing punctuation, standardizing fields and combining duplicate records. Many also do data matching to find duplicate or related records.
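To give a sense of what a few of those functions do, the Python sketch below standardizes fields, strips punctuation, matches likely duplicates and combines them while filling null values. It uses only the standard library and does not reflect how any particular product works. The sample records, the helper names and the 0.85 similarity threshold are assumptions.

```python
import re
from difflib import SequenceMatcher

# Hypothetical contact records; fields and the 0.85 threshold are assumptions.
RECORDS = [
    {"name": "Acme Corp.", "phone": "(555) 010-2000", "city": None},
    {"name": "ACME Corp", "phone": "555.010.2000", "city": "Austin"},
    {"name": "Globex Inc", "phone": "555-010-3000", "city": "Boston"},
]


def standardize(record):
    """Strip punctuation and case differences so equivalent values compare equal."""
    name = re.sub(r"[^\w\s]", "", record["name"]).strip().lower()
    phone = re.sub(r"\D", "", record["phone"])
    return {"name": name, "phone": phone, "city": record["city"]}


def is_match(a, b, threshold=0.85):
    """Treat records as the same entity if phones agree or names are very similar."""
    if a["phone"] and a["phone"] == b["phone"]:
        return True
    return SequenceMatcher(None, a["name"], b["name"]).ratio() >= threshold


def merge(a, b):
    """Combine two matched records, filling null fields in a from b."""
    return {key: a[key] if a[key] is not None else b[key] for key in a}


def deduplicate(records):
    """Standardize each record, then fold it into the first record it matches."""
    merged = []
    for record in map(standardize, records):
        for i, existing in enumerate(merged):
            if is_match(existing, record):
                merged[i] = merge(existing, record)
                break
        else:
            merged.append(record)
    return merged


deduped = deduplicate(RECORDS)
```

In this sample, the two Acme records collapse into one, and the missing city is filled from the second record. Comparing every record against every other record scales poorly, so this pairwise approach suits only small data sets.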

Tools that help cleanse data are available in a variety of products and platforms, including the following:


28 Jan 2022

All Rights Reserved, Copyright 2005 - 2024, TechTarget