This post describes an opinionated method for achieving data quality within a small organization. There are some pre-qualifiers and assumptions made about the data system architecture and the skill set of the team.
One of the biggest struggles for organizations that rely on data is achieving an acceptable level of data quality. Steps are taken by everyone from front end developers to ETL engineers to capture what is considered “clean data,” but exceptions always seem to creep into the data that can throw off reporting, damage organizational reputation, or waste the time of valuable staff as they track down the source of poor information.
The problem of poor data quality has been addressed by many disciplined approaches. If you have poor data quality and you would like to see some improvement to your data, there is hope. Here are just a few of the strategies and techniques organizations use to improve their data quality.
- Acquisition of new data
- Standardization or normalization of new data
- Linking records
- Data and Schema Integration
- Trusted sources
- Process control
- Process redesign
- Error localization and correction
Acquisition of new data – the overall quality of data can be improved by importing standard data sets that meet a high quality metric. For instance, common lookup values like regional names, postal codes, or weights and measures can be standardized by replacing existing data that does not meet the organization’s quality standards.
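The idea above can be sketched as a lookup-table refresh, where rows that have drifted from the standard are overwritten by an imported canonical set. The region values here are illustrative, not a real dataset:

```python
# Sketch of replacing a low-quality lookup table with an imported standard
# set. The canonical region list below is a made-up example.
canonical_regions = {"N": "North", "S": "South", "E": "East", "W": "West"}

# Existing lookup rows whose values have drifted from the standard.
region_lookup = {"N": "Nrth", "S": "South", "E": "east"}

def refresh_lookup(existing: dict, canonical: dict) -> dict:
    """Overwrite every entry with the canonical value, adding missing keys."""
    existing.update(canonical)
    return existing

refresh_lookup(region_lookup, canonical_regions)
print(region_lookup["N"])  # North
```

The same pattern applies to postal codes or units of measure: the imported set wins wherever it overlaps the existing data.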
Standardization or normalization of new data – data standards are created by an organizational governing body. Those standards are enforced by replacing non-standard naming with the standard. For example, all instances of ‘Street’, ‘Str’, or ‘st’ are replaced with ST, and ‘Texas’ or ‘Tex’ are replaced with TX.
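A minimal sketch of that kind of standardization, using regular-expression rules (the rule tables are assumptions based on the examples above, not an organizational standard):

```python
import re

# Hypothetical standardization rules: map non-standard street and state
# spellings to canonical abbreviations, matched case-insensitively.
STREET_RULES = {r"\b(street|str|st)\b\.?": "ST"}
STATE_RULES = {r"\b(texas|tex)\b\.?": "TX"}

def standardize(value: str, rules: dict) -> str:
    """Apply each regex rule to the value and return the result."""
    for pattern, replacement in rules.items():
        value = re.sub(pattern, replacement, value, flags=re.IGNORECASE)
    return value

print(standardize("123 Main Street", STREET_RULES))  # 123 Main ST
print(standardize("Austin, Texas", STATE_RULES))     # Austin, TX
```

In practice the rule tables would come from the governing body's published standard rather than being hard-coded.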
Linking records – this is a little like aggregating similar data into a more robust object, and then using quality data from those objects to inform the quality of other similar objects. This works well when working with an inventory of similar items, like a database of used cars, single-family homes, or SEO data.
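One simple way to sketch record linking: group records that share identifying fields, then fill gaps in one record from its linked peers. The used-car records and field names here are invented for illustration:

```python
from collections import defaultdict

# Hypothetical inventory of used-car records; one is missing a field.
cars = [
    {"vin": "A1", "make": "Honda", "model": "Civic", "year": 2018, "doors": 4},
    {"vin": "B2", "make": "Honda", "model": "Civic", "year": 2018, "doors": None},
]

def link_and_fill(records, key_fields=("make", "model", "year")):
    """Group similar records, then fill missing values from linked records."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in key_fields)].append(rec)
    for group in groups.values():
        for rec in group:
            for field, value in rec.items():
                if value is None:
                    # Borrow the value from any linked record that has it.
                    for other in group:
                        if other[field] is not None:
                            rec[field] = other[field]
                            break
    return records

linked = link_and_fill(cars)
print(linked[1]["doors"])  # 4
```

Real record linkage usually needs fuzzy matching on the key fields; exact-match grouping is the simplest possible version of the idea.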
Data and Schema Integration – this is probably the most common approach: primitive data types are used as evaluation criteria for data inserted into records. Integer fields, for example, reject anything that is not a whole number. Some data systems will even allow you to enforce certain data formats to meet organizational policy or business rules.
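The type-enforcement idea can be sketched as a schema check applied before a record is accepted. The "orders" field layout below is a hypothetical example, not a real schema:

```python
# A minimal schema check, assuming a made-up "orders" record layout.
SCHEMA = {
    "order_id": int,
    "quantity": int,
    "ship_state": str,
}

def validate(record: dict, schema: dict) -> list:
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors

print(validate({"order_id": 7, "quantity": "two", "ship_state": "TX"}, SCHEMA))
# ['quantity: expected int']
```

A relational database does the same work with column types and CHECK constraints; the point is that the schema, not the application, is the gatekeeper.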
Trusted Sources – when dealing with multiple sources of data, it’s often difficult to know which one best represents “good data,” so a single source of truth is developed that informs all other data sources of what constitutes quality data. A record from the trusted source is trusted over what should be the same record in another data source.
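A trusted-source merge can be as simple as letting the trusted system's records overwrite conflicting records from other sources. The customer records and keys below are illustrative assumptions:

```python
# Sketch of a trusted-source merge: where the trusted system and a
# secondary system disagree on the same key, the trusted record wins.
trusted = {"cust-001": {"name": "Acme Corp", "state": "TX"}}
secondary = {
    "cust-001": {"name": "ACME Corporation", "state": "Texas"},
    "cust-002": {"name": "Globex", "state": "CA"},
}

def merge(trusted: dict, secondary: dict) -> dict:
    """Start from the secondary source, then overwrite with trusted records."""
    merged = dict(secondary)
    merged.update(trusted)  # trusted records replace conflicting ones
    return merged

result = merge(trusted, secondary)
print(result["cust-001"]["state"])  # TX
```

Records that exist only in the secondary source survive the merge; only conflicts are resolved in the trusted source's favor.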
Process control – this is how most organizations practice maintaining data quality. A gatekeeper is in charge of checking the data quality, such as a web form with input validation or an API with a strictly enforced schema. This works as long as data continues to flow only through the gateways with these protections.
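A gatekeeper of that kind boils down to rejecting a submission unless every field passes its rule. The field names and patterns below are illustrative, the sort of thing a web form or API schema would enforce:

```python
import re

# Hypothetical per-field validation rules, as a form or API might define them.
RULES = {
    "email": r"[^@\s]+@[^@\s]+\.[^@\s]+",
    "zip": r"\d{5}",
}

def gate(submission: dict) -> bool:
    """Accept the submission only if every field fully matches its rule."""
    return all(
        re.fullmatch(pattern, submission.get(field, ""))
        for field, pattern in RULES.items()
    )

print(gate({"email": "a@b.com", "zip": "78701"}))       # True
print(gate({"email": "not-an-email", "zip": "787"}))    # False
```

The weakness the post notes is visible here: any data path that bypasses `gate` bypasses the protection entirely.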
Process redesign – many organizations face such poor data quality that their only recourse is to refactor their existing architecture and data capture systems to prevent poor data from being captured in the first place. This might mean replacing legacy systems that cannot enforce process control, or building a new type of data system that allows the organization to cleanse the data, which leads us to the system I believe many organizations need to consider.
Error localization and correction – this system is focused on analyzing the existing data and checking it for quality. The goal is to identify and eliminate data quality issues by first detecting records that do not meet the organizational standard, and then cleansing that data with scripts or other data sources. This is the system I think most enterprises would benefit from, for the following reasons:
- Organizations do not always control the data capture applications that input data into their systems
- Organizations often inherit data from other organizations, either through acquisitions or through list purchases (think direct mail companies or organizations that make use of public data sources)
- Correcting the data incrementally allows an organization to trust their data more over time. Few data quality fixes happen overnight. It usually takes a team and a process, and the error localization and correction approach fits well with teams and processes.
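The detect-then-cleanse loop described above can be sketched in two passes: localize the records that fail the standard, then correct the ones a cleanup map can fix. The rows, the two-letter state standard, and the cleanup map are all illustrative assumptions:

```python
import re

# A minimal error-localization pass over hypothetical customer rows:
# detect records whose state field fails the two-letter standard, then
# correct the ones the cleanup map can fix.
CLEANUP = {"texas": "TX", "tex": "TX", "calif": "CA"}

rows = [
    {"id": 1, "state": "TX"},
    {"id": 2, "state": "Texas"},
    {"id": 3, "state": "calif"},
]

def localize_errors(rows):
    """Return the rows that fail the two-letter uppercase state standard."""
    return [r for r in rows if not re.fullmatch(r"[A-Z]{2}", r["state"])]

def correct(rows):
    """Fix what the cleanup map covers; leave the rest for manual review."""
    for row in localize_errors(rows):
        fixed = CLEANUP.get(row["state"].lower())
        if fixed:
            row["state"] = fixed
    return rows

print([r["id"] for r in localize_errors(rows)])  # [2, 3]
correct(rows)
print([r["state"] for r in rows])  # ['TX', 'TX', 'CA']
```

Because detection and correction are separate steps, the detection pass doubles as a data quality metric: the count of localized errors can be tracked over time as the team works the backlog down.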
Knowing the many ways to achieve data quality can help an organization begin the journey toward cleaner, more reliable data sources within their organization.
In my next post, I’ll dive into the many ways that an organization can analyze their existing data and apply metrics that will help establish data quality goals.