So far in this series on data quality I’ve covered Data Quality Metrics, Data Quality or Intelligence Quality, and The Many Paths to Data Quality. I’m wrapping up the series with an overall approach to Data Quality Management that scales for small and large teams. This is more an outline than a complete guide, because I have some good news: I’ll be working with a team to develop these data quality approaches further. I’ve had a wonderful opportunity to see some of these ideas implemented in an organization I’ve worked with over the past year and a half. Like any implementation, though, I discovered plenty of room for improvement, along with opportunities to shift focus from the ideas I thought were important to the ideas the client thought were important for their customers. This is the way it should work.
What follows is a high-level overview of a playbook that a team could use to implement data quality. This is more of a first draft, and I’ll be honest, I stole a lot of these concepts from the data mining world.
This playbook was designed for a particular type of architecture: a batch Lambda data system similar to most big data architectures. It makes use of a data ingestion layer, a processing layer, and a serving layer. A data lake would be ideal, but we were able to accomplish some of our goals using SQL databases, so the approach is fairly flexible. I would still encourage separating data processing concerns. Here’s a good example of our approach to data quality.
The following is another view of the data pipeline with some of the decision logic in place:
I won’t go into a lot of detail about the diagram, but it covers the basic concepts: ingesting from a raw source that stays as similar to the source data as possible, staging that data for processing, processing it with a Databricks job, and then moving it on to the Refined data layer for additional business-rules processing.
Why separate Data Quality from Business Quality? Wouldn’t it be easier to process them both in the same place?
Perhaps, but it depends on the situation. For instance, the Refined layer in this particular solution is where the business domain entities start to take shape. We could perform that shaping in Databricks, but our team’s consensus was to apply the business rules at the Refined stage. This is one of the joys of working with a distributed data system: you have some flexibility in where you perform your data processing work. I would still recommend sticking with this basic conceptual structure:
These layers allow for the separation of data processing concerns. If you are familiar with functional data engineering, this is the type of architecture that makes it possible. You can follow this conceptual shape using a fully distributed Modern Data Warehouse approach, or you can do the same thing in a SQL Server; it depends on your needs and the size of your data. I would recommend consulting a Cloud Solutions Architect to help you right-size your solution. There are trade-offs in whatever architecture you choose, so there’s no such thing as a one-size-fits-all architecture.
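As a minimal sketch of the layer separation described above (the table names and rules are illustrative, not from the project), each layer can be expressed as a pure function of the previous one, which is what makes a functional data engineering approach possible:

```python
# Illustrative sketch of the Raw -> Refined -> Trusted layer separation.
# Each layer is a pure function of the previous one, so any layer can be
# rebuilt from Raw at any time (the core idea of functional data engineering).

def ingest_raw(source_rows):
    """Raw layer: keep the data as close to the source as possible."""
    return [dict(row) for row in source_rows]

def refine(raw_rows):
    """Refined layer: apply data quality rules (types, required fields)."""
    refined = []
    for row in raw_rows:
        if row.get("customer_id") is None:
            continue  # data quality rule: drop rows missing a key
        row["amount"] = float(row["amount"])
        refined.append(row)
    return refined

def to_trusted(refined_rows):
    """Trusted layer: apply business rules and shape domain entities."""
    return [
        {"customer_id": r["customer_id"], "amount_usd": round(r["amount"], 2)}
        for r in refined_rows
        if r["amount"] > 0  # business rule: only positive transactions
    ]

source = [
    {"customer_id": 1, "amount": "19.99"},
    {"customer_id": None, "amount": "5.00"},
    {"customer_id": 2, "amount": "-3.00"},
]
trusted = to_trusted(refine(ingest_raw(source)))
print(trusted)  # [{'customer_id': 1, 'amount_usd': 19.99}]
```

Whether these functions run as Databricks jobs over a data lake or as stored procedures in SQL Server, keeping each layer a deterministic transformation of the one before it preserves the separation of concerns.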
Loops and Checklists
How we get things done. Many years ago I had the fortune of working with a rock star architect. It was probably a turning point in my career. It was the moment I went from being a below average enterprise developer to someone who gave a damn about my career and the type of solutions I helped produce for my clients. One of the first concepts he taught me was the idea of The Pit of Success.
The first DevOps book I read:
There were a few great concepts I was able to walk away from that particular engagement understanding:
- Checklists are your friend
- The whiteboard is a great place to discuss architecture and make decisions
- A Schema is a contract – be sure you know what you’re doing when you create one
- Every developer is his or her own project manager
- Being a good developer is way more than writing code
There were of course more things that I learned on this journey, but those were the highlights.
What does that have to do with Data Quality? Checklists, guardrails, and the pit of success are all part of modern process engineering. Ideally, systems inform our decisions and our actions. Reinventing the way we work for every new data ingestion is incredibly inefficient.
By introducing systems and processes to the workflow of moving data from initial ingestion to the final trusted layer, we free ourselves of the need to decide what to do for each new data source. Following the prescribed process also prevents us from missing a critical step. There will be situations where some steps are not required, but if the step is shortened or ignored it is done so for a documented reason.
In short, we follow the same path as often as possible. We only deviate from the path if there is a clear reason to do so. We practice continuous improvement and adjust our processes in pursuit of better quality. Quality is the key here.
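One way to make that discipline concrete (a sketch with hypothetical step names, not the playbook’s actual checklist) is to treat the checklist itself as data, so a step can only be skipped when a reason is recorded:

```python
# Sketch: a data-ingestion checklist as data. A step may be skipped,
# but only with a documented reason; otherwise it must be completed.

CHECKLIST = [
    "profile raw data",
    "define quality metrics",
    "build constraint tests",
    "document schema contract",
]

def review(completed, skipped):
    """Return the steps that are neither done nor skipped-with-reason."""
    missing = []
    for step in CHECKLIST:
        if step in completed:
            continue
        if step in skipped and skipped[step].strip():
            continue  # deviation is allowed because it is documented
        missing.append(step)
    return missing

missing = review(
    completed={"profile raw data", "define quality metrics"},
    skipped={"document schema contract": "schema unchanged since last load"},
)
print(missing)  # ['build constraint tests']
```

An observer can then tell at a glance whether a line of data processing followed the path, and why it deviated where it did.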
Life Cycles and Process Loops
The system that is proposed here is a repeatable process loop that moves from initial data ingestion to the final staging of data in a trusted layer or data warehouse. The process is made up of four distinct phases with their own milestones and deliverables – in other words, any observer will know when a phase is done or not done based on deliverables, goals met, and quality metrics satisfied.
Any participant in the process will know decisively where they are in the data pipeline’s journey. Each phase has a clear set of goals, methods of achieving those goals through meetings, tasks, and decisions, and deliverables that are tangible to an observer.
The following section goes over the necessary aspects of every phase – even if that phase is very small, each step in the checklist should be considered.
Our overall life cycle breaks down as follows:
- Business Understanding
- Data Quality Management
- Data Warehousing
- Data Governance
Along with the overall life cycle, there are loops within the life cycle for each stage to help build out data quality in a more iterative development process.
Business Understanding
This section outlines the goals, tasks, meetings, and deliverables associated with the Business Understanding phase of the playbook.
This is the first stage in the recommended life cycle for achieving Data Quality Management for the target organization, and the first stage for any new line of data processing, whether that line of data comes from within the target organization or from a partnering organization.
a) What we do
- Determine the key data points and where they will serve in the abstract model
- Determine the key metrics for data quality success
- Identify the relevant data sources, how to access them, and how they fit within the organization’s Data Lake model (structured, unstructured, and semi-structured data)
b) How we do it
- Work directly with partners or line of data processing owners to understand and identify the key metrics. What does quality data look like?
- Review the abstract model and determine if the data will fit this ideal or if it will require its own modeling.
- Discuss timelines, expectations of data quality, and the shape of the data
  - Data shape on ingress and egress
  - What are the success metrics for this data?
  - Can we run our standard data quality metrics, or do we need to define new quality metrics?
  - Are the metrics Specific, Measurable, Achievable, Relevant, and Time-bound?
c) Deliverables
- Minimal charter – updated by the Product Success Manager and Solution Architect, though contributions are welcomed from the whole team. This line of work is also given a feature in the backlog, an identifier, and staff assignments (who to go to for what), and is featured in the common Teams Wiki
- Raw Data Sources – team has access to the raw data sources for data provisioning work and data acquisition
- Data Dictionaries – metadata report on the data, its schema, and its recommended Data Quality Metrics, generated by tools used for Data Quality Management (Amazon Deequ or Agile Lab’s DQ)
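Tools like Deequ generate this kind of profile automatically; as a hand-rolled illustration of what a data dictionary entry might capture (the column names, sample rows, and metric choices are my assumptions), per-column completeness and distinctness can be computed like this:

```python
# Sketch of the kind of per-column profile a DQM tool (e.g. Deequ) reports:
# completeness (share of non-null values) and distinctness (share of
# distinct values). Columns and sample data are illustrative.

def profile(rows, columns):
    report = {}
    n = len(rows)
    for col in columns:
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v is not None]
        report[col] = {
            "completeness": len(non_null) / n if n else 0.0,
            "distinctness": len(set(non_null)) / n if n else 0.0,
        }
    return report

rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": None},
    {"id": 3, "country": "US"},
    {"id": 4, "country": "CA"},
]
report = profile(rows, ["id", "country"])
print(report["id"]["completeness"])       # 1.0
print(report["country"]["completeness"])  # 0.75
```

A profile like this, attached to the data dictionary, gives the team a baseline against which later refreshes can be scored.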
Data Quality Management
This section outlines the phase of DQM related to acquiring and understanding the line of data processing’s data. Use this phase to determine how best to achieve data ingestion, data quality, and data delivery.
a) What we do
- Ingest the data and use DQM tools to measure and determine its raw-state data quality
- Build tests against the assumed metrics determined during the Business Understanding phase
- Produce a set of Data Quality Constraints
- Set up a data pipeline to score new or regularly refreshed data
b) How we do it
- Ingest the data – stage it in the RAW section of the architecture. Initial data processing elements might also need to be developed to support lookup tables, reporting metadata, or housing for “good” and “bad” data.
- Explore the data and generate DQM reports using the DQM tools
- Data Architects, Data Analysts, Data Quality Analysts, and Development Engineers meet to determine the best technical approaches to the data delivery challenges – add tasks to the existing user stories to achieve this quality
- Set up the data pipeline – move data from RAW to Refined by applying the knowledge gained from the data exploration
c) Deliverables
- A Data Quality report built from the raw data, with clear metrics noted
- Data Quality cleansing solution – the cleansing process documented after the data cleansing exercise
- Solution architecture – Any revisions or additions to the standard Batch Lambda Architecture are recorded and documented on the Batch Lambda Team Wiki
- Milestone Decision Document – Before the data pipeline is turned on and in regular use, the whole team meets to discuss what has been done, what the data looks like, and what it will take to achieve some of the data quality objectives – at this point it’s fair to discuss if it is valuable to the owner of the line of data processing to continue or abandon the pipeline
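A minimal sketch of the scoring step in such a pipeline (the constraint names and fields are hypothetical): score refreshed rows against a set of Data Quality Constraints and route them to the “good” and “bad” destinations mentioned earlier:

```python
# Sketch: score incoming rows against data quality constraints and split
# them into "good" and "bad" sets, recording which constraint each bad
# row violated. Constraint names and fields are illustrative.

CONSTRAINTS = {
    "id_not_null": lambda r: r.get("id") is not None,
    "amount_non_negative": lambda r: r.get("amount", 0) >= 0,
}

def score(rows):
    good, bad = [], []
    for row in rows:
        violated = [name for name, check in CONSTRAINTS.items() if not check(row)]
        if violated:
            bad.append({"row": row, "violations": violated})
        else:
            good.append(row)
    return good, bad

rows = [{"id": 1, "amount": 10}, {"id": None, "amount": -5}]
good, bad = score(rows)
print(len(good), len(bad))   # 1 1
print(bad[0]["violations"])  # ['id_not_null', 'amount_non_negative']
```

Keeping the violation reasons with each rejected row makes the Data Quality report and the milestone discussion concrete: the team can see exactly which constraints the source data is failing, and how often.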
Data Warehousing
a) What we do
- Determine how refined data fits within the data warehouse abstract model
- Create the transformation to fit the data warehouse, along with any metadata related to changes from source systems
- Exercise the data warehouse model
b) How we do it
- Model the mapping from the refined to the trusted using mapping tools. Discuss and revise target system as required by the feature’s stated metrics and goals.
- Transform from the refined layer to the trusted layer using data pipeline built for transformation. This transformation should account for changes in schema by first testing the target schema against the transformation for differences.
- Regression test the data warehouse, reports, and systems that depend on the data warehouse
c) Deliverables
- Working refined data pipeline with mapping and transformations tested and working
- Solution Architecture updated to match the new state of the data warehouse if there were changes
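The schema check described above can be sketched as a simple diff between the columns the transformation produces and the target warehouse schema (the schemas here are illustrative):

```python
# Sketch: before running the Refined -> Trusted transformation, compare
# the columns the transformation produces against the target warehouse
# schema and surface any drift. Schemas below are illustrative.

def schema_diff(produced, target):
    """Return columns missing from the output and unexpected extras."""
    missing = sorted(set(target) - set(produced))
    extra = sorted(set(produced) - set(target))
    return {"missing_in_output": missing, "unexpected_in_output": extra}

target_schema = {"customer_id": "int", "amount_usd": "float", "loaded_at": "timestamp"}
produced_columns = ["customer_id", "amount_usd", "region"]

diff = schema_diff(produced_columns, target_schema)
print(diff["missing_in_output"])     # ['loaded_at']
print(diff["unexpected_in_output"])  # ['region']
```

Running this diff as a gate before the transformation turns silent schema drift into an explicit failure the team can act on before the warehouse, reports, and downstream systems are affected.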
Data Governance Acceptance
a) What we do
- Finalize project deliverables
b) How we do it
- System validation – meet with the Data Operations Team to discuss handoff and to review all changes to the system. Ideally, this should be an ongoing process, because many members of the Data Operations Team can also be members of the DQM and Data Warehouse Team.
- Project hand-off – turn responsibility of production operations to the Data Operations Team
c) Deliverables
- Solution Document
- Configuration settings
- Passcode and secret vault access
Roles with Accountability
One of the newer ways of thinking in IT concerns how we approach teams and team members. Team members take on and fulfill roles. Roles have areas of responsibility, common tasks, and goals related to achieving data quality. The following is a breakdown of some of those possible roles and how they fit into the operation of the playbook:
- Client or a member of the organization who meets with the DQM team to discuss data metrics and meta data – can be someone already serving as a Business Analyst
- Analyze raw data quality reports to help construct data quality requirements
- Update and revise Data Dictionaries
- Review refined data
- Communicate with business leaders and team leads on the state of organization Data Quality
- Build a common understanding between the organization and its partners of the importance of Data Quality
- Contributing to Data Dictionaries
- Communicating non-technical requirements with business and team members associated with data lines of processing
An IT Business Analyst or Quality Assurance person who helps shape data quality from the technical perspective
- Review Raw Data DQM reports to help define data quality metrics based on the line of data processing goals
- Help define data quality metrics
- Help define the high data quality standards of organization and communicate that to the organization and data teams
- Write tests to meet organization data quality metrics
- Write tests for data constraints
- Data quality metrics and expectations are in alignment with the goals of the business and the expectations set during the business analysis phase
- Data pipelines adhere to the data quality management expectations set for the specific data line of processing
- Pipeline tests are in alignment with data quality expectations
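For example (the metric names and thresholds are assumptions, not the organization’s actual standards), a constraint test can assert that a measured metric meets the agreed expectation, so the pipeline fails loudly when quality drops:

```python
# Sketch: express agreed data quality expectations as checks so the
# pipeline fails when a metric falls below its threshold. The metric
# names, values, and thresholds below are illustrative.

EXPECTATIONS = {
    "order_id.completeness": 1.0,   # every order must have an id
    "email.completeness": 0.95,     # a small amount of missing email tolerated
}

def check_expectations(measured):
    """Return a list of (metric, measured, required) failures."""
    failures = []
    for metric, required in EXPECTATIONS.items():
        value = measured.get(metric, 0.0)
        if value < required:
            failures.append((metric, value, required))
    return failures

measured = {"order_id.completeness": 1.0, "email.completeness": 0.90}
failures = check_expectations(measured)
print(failures)  # [('email.completeness', 0.9, 0.95)]
```

Tests like these tie the metrics agreed during the Business Understanding phase directly to the pipeline, which is what keeps the Data Quality Analyst’s accountabilities verifiable rather than aspirational.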
System Development Engineers
- SQL, ETL, and application developer responsible for turning business data requirements into reusable code that fits within the Batch-Lambda Data Architecture and the general Data Pipeline process
- Uses modern DevOps practices to check code into and out of source control
- Instructs architects and data analysts on proven practices for achieving data goals and metrics through technology, in alignment with the Batch-Lambda Architecture
- Writing code required to meet data pipeline objectives
- Testing that code and writing automated tests for the code for the DevOps pipeline
- Writing clean testable code that adheres to the functional nature of the Batch-Lambda Architecture – is it safe, is it repeatable from raw, and are system decisions singular and centralized as much as possible (Modern Functional Data Engineering – https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a)
- Meeting with architects and the lead engineer to work out the “how to” of technical tasks
- Working tasks from the backlog and checking in changes to the repository related to those tasks
- Code review sessions with architects and engineering leads
- Delivery of functional code to meet business objectives using the Batch-Lambda Architecture – this could be SQL, Python, or some other language commonly used in modern data pipeline development.
- Checking in code and requesting code reviews
Responsible for developing infrastructure as code for the Batch-Lambda Architecture, provisioning Azure resources as needed, overseeing Azure resource budgets, and ensuring environments meet Service Provider standards and security requirements. Mentors architects and developers on proven practices for delivering quality data systems built on Azure, like the Batch-Lambda Architecture. Helps teams adhere to DevOps practices and use the Azure DevOps tools.
- Provisioning Azure Resources
- Assigning roles to users
- Scripting infrastructure as code and testing the deployment of that code in the CI/CD pipeline
- Analyzing the technical feasibility of proposed data and solution architecture
- Revising documentation to include “how” certain tasks were done
- Creating automation and repeatability wherever possible
- Working with the tech team to meet delivery objectives
- Code reviews
- DevOps practice related to the Batch-Lambda Architecture
- Batch-Lambda Architecture security
- Batch-Lambda Architecture budgeting
- Azure DevOps CI/CD pipelines and code repositories
SQL, ETL, and application developer focused on translating business needs into the Batch-Lambda Architecture in the Azure Cloud environment
- Work with business, solution architect, engineers, and developers to define the best architecture to fit the needs of the line of data processing.
- Reviews solution architecture functional proposals, diagrams, and data requirements to develop a technical solution to meet the business needs.
- Instructs Lead Engineer and System Development engineers on the best methods to achieve the technical vision.
- Submits minimal documentation and diagrams to solution architect for inclusion in the overall system documentation.
- Attends meetings where technical decisions are required
- Mentors and instructs team members on the reasoning behind architectural choices.
- Quality checks on built infrastructure
- Code reviews
- Creating diagrams
- Creating proof-of-concept code
- Creating working code and scripts to support engineering efforts
Product Success Manager
A product/project manager who helps guide the team to remain within scope of the defined iteration, remain on task, and deliver based on the Program Manager’s stated line of data processing goals and metrics
- Arrange and attend meetings, take notes, guide team to remain within scope of meeting goals
- Reporting meeting notes with emphasis on meeting goals, the decided means to achieve those goals, and expected deliverables – as well as team assignments
- Generate reports showing state of overall project status and line of data processing status for Product Manager
- Align with team on proven practices for using Azure DevOps, Microsoft Project, and other project management tools and methodologies for meeting delivery
- Delivering meeting notes
- Product backlog
Works with the business to align technical teams, processes, systems architecture, and tasks to the business objectives
- Attend most meetings to deeply understand the business problem
- Clearly communicate and document the functional business solution with the data teams
- Help create a technical architecture with data teams
- Help manage the scope of iterations so that teams are on task and able to deliver
- Reviews architecture
- Reviews code
- Solution Architecture Documentation
- Code reviews
- Decision Documents
- Revisions to the Batch-Lambda Architecture Playbook
Directs data teams to achieve business goals utilizing the Batch-Lambda architecture
- Set line of data processing iteration priority with the team and clearly communicate the overall goals and objectives of each line of data processing – so solutions aren’t over or under engineered to meet delivery expectations
- Set clear delivery expectations – “Success will look like this when we’ve achieved all of the following goals…”
- Review and approve overall data quality metrics associated with a line of data processing iteration
- Set the vision of team excellence and encourage adherence to the line of data processing iteration process
- Attend milestone meetings – daily stand up, sprint planning review, backlog grooming, and approve Service Provider estimated User Stories
- Product Backlog Priority
- Success of each line of data processing iteration
- Overall data quality for each line of data processing iteration
- Overall morale and health of the data teams and Batch-Lambda Architecture
Applying this Process at Your Organization
This is definitely a journey worth taking. Issues with data quality account for many failed projects, customer support issues, and the loss of trust from customers. Data quality issues can build a dam in your value streams.
If you would like help or guidance with this journey, please connect with me on LinkedIn and we can arrange a meeting. We have a skilled team of data systems experts, certified Azure developers and architects, as well as the project delivery experts to help begin and support your data practices.
Additional Data Quality Resources
I recently read another great overview of the data quality process on the site Toptal. Toptal describes itself as, “… an exclusive network of the top freelance software developers, designers, finance experts, product managers, and project managers in the world. Top companies hire Toptal freelancers for their most important projects.”