
I recently completed a guide for a client related to creating data quality metrics. It’s an interesting problem to tackle, because you absolutely need to understand what goals you need to achieve as an organization. Without these goals defined, you can make the mistake of either blanketing your data with far too many metrics or too few. To give you an idea of how many possible data metrics are available, here is a list of possibilities:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Currency
• Volatility
• Uniqueness
• Appropriate Amount of Data
• Accessibility
• Credibility
• Interpretability
• Usability
• Derivation of Integrity
• Conciseness
• Maintainability
• Applicability
• Convenience
• Speed
• Comprehensiveness
• Clarity
• Traceability
• Security
• Correctness
• Objectivity
• Relevancy
• Ease of Operation
• Interactivity
The following rather extensive list was taken from https://www.researchgate.net/publication/220565749_Methodologies_for_Data_Quality_Assessment_and_Improvement
Attempting to gather metrics on all these possible points would be an exhausting exercise, but from this list you can see how many possible choices are available. What’s important now is selecting a few an putting those metrics into practice.
I’m going to cheat, and just focus on those that the DeeQu library supports. This means we’re going to look for metrics related to the following areas:
• Uniqueness
• Completeness
• Compliance
• Distinctness
These areas of data quality can be clearly defined and tied to a reportable metric or in some cases, more than one metric. Metrics mean dashboards, which makes it easier to bring data owners and contributors into the picture.
Even though there are several Data Quality tools on the market (Great Expectations is one I’m reviewing now), I’ll focus on using Amazon’s DeeQu. Primarily because it’s an Open Source library and it integrates well with modern data tools like Apache Spark. It’s easy to implement and it gives us reports out-of-the-box that meet our needs for metrics.
I won’t go into details on how to set up and use the tool. That’s covered in the library’s example code. However, I will pull a portion of their reference out and use it to help us plan the first step of our data quality process.
How do these fit into the process, though? Well, we want to begin our process by performing an initial run of data analysis. Our first phase should be an attempt to narrow from general data quality to specifics. Data quality can and often does reach from the quality of the data schema (the shape and the primitive data types that make up the general data container,) as well as the data values – this is the intention of the data. In unstructured or semi-structured data, the very shape of a collection of data points can be a data value. Keep these points in mind when considering data quality for your individual systems. However, to keep the scope of this set of articles a little more focused, I’m going to stick with data quality applied to the data values, and not the various possible data schema.
To narrow down from our abstract concept of general data quality to measurable attributes that will help us test our data and build meaningful reports, we need to section off what it is that we’re expecting from our data.
For the initial phase of analysis, I can see a data engineer generating reports and sitting down with an analyst to discuss what metrics are truly important for either the specific dataset or the experiment. I can also see this as a loop of meetings until the data begins to meet with the organization’s ideal data quality objective—and I can see this as a process of learning. Many organizations won’t initially know what type of data quality they want to aspire to, because they haven’t invested the time to analyze their data. This will likely be an ongoing, ever evolving process of learning for both the organization and the data engineering team. Building a process around this learning is valuable.
Assessing the Data
Completeness Metric – This does dip a little into the world of Intelligence Quality. It’s about the completeness of a record. For instance, if you have a record with a vehicle, that record should include the make, the model, the year it was built, and the vehicle identification number. If one of these elements is missing, the record lacks completeness.
How do we gather metrics on completeness?
DeeQu includes a metric for completeness. It reports on the fraction of non-null values in a column. To measure the completeness of an object represented in the database, an analyst would need to identify the columns needed to make that data set “complete”. This is a case where shaping the data schema to match the DQ rules would probably be worth the extra effort.
Data transformed from the ingested data to match a dataset defined by the business domain could easily be loaded in one or more tables and analyzed for completeness. Datasets that do not meet the completeness strategy could be sent to a bad records table or marked with meta-data related to the failure. It’s important to understand that completeness as a concept is not necessarily limited to the actual “completeness,” function within the library.
Other functions from within the library can be combined to establish a dataset’s completeness. All the data could exist for a dataset, but if one of the values is incorrect, it could mean that the dataset is lacking completeness. This should be considered while establishing metrics for completeness.
Uniqueness Metric – Data that represents a specific dataset only appears once in the data system. If there is a record that represents Joe Smith, that records for Joe Smith only appears once. In today’s data systems with multiple stages, like bronze, silver, gold, and platinum, that might not be possible, but it should be something that is aimed for the in the most trusted layer of the data system – for instance, it might be something that’s tested for in the lower stages, but isn’t acceptable in the higher stages that are considered a trusted reporting layer.
Compliance Metric – in a structured database with clearly defined datatypes it’s fairly easy to be certain that an integer will be an integer and that a CHAR will be a CHAR, whoever if you have constraints such as, “All values within this column must be between .001234 and .002345,” you’ll need to either write custom functions within your data system to check for these constraints or you’ll need to have a way to check those values for compliance. How many records within the dataset meet the compliance requirements, how many do not?
Distinctness Metric – how many values are discrete? DeeQu measures this the fraction of distinct values of a column over the number of all values of a column. If you are looking for an entry that should be generally unique, for instance national identification numbers, but you run across values that are not unique, that could be a reason to trigger a warning.
My code for something like this looks like the following:
val verificationResult: VerificationResult = { VerificationSuite()
// data to run the verification on
.onData(dataset)
// define a data quality check
.addCheck(
Check(CheckLevel.Error, "Review Check")
.hasSize(_ >= 3000000) // at least 3 million rows
.hasMin("star_rating", _ == 1.0) // min is 1.0
.hasMax("star_rating", _ == 5.0) // max is 5.0
.isComplete("review_id") // should never be NULL
.isUnique("review_id") // should not contain duplicates
.isComplete("marketplace") // should never be NULL
// contains only the listed values
.isContainedIn("marketplace", Array("US", "UK", "DE", "JP", "FR"))
.isNonNegative("year")) // should not contain negative values
// compute metrics and verify check conditions
.run()
}
My report looks like this:
name | instance | value |
Completeness | review_id | 1 |
Completeness | marketplace | 1 |
Compliance | marketplace contained in US,UK,DE,JP,FR | 1 |
Compliance | year is non-negative | 1 |
Maximum | star_rating | 5 |
Minimum | star_rating | 1 |
Size | * | 3120938 |
Uniqueness | review_id | 0.99266 |
Based on what we learn from these reports, we can now go back to our data and make decisions related to how we will start the process for testing our dataset for our new data quality goals.
[…] far in this series on data quality I’ve covered Data Quality Metrics, Data Quality or Intelligence Quality, and The Many Paths to Data Quality. I’m wrapping this […]