As a new engineer, understanding the concept of continuous integration and continuous deployment (CI/CD) is important for several reasons.
First, CI/CD is a key component of cloud-native application development. It is the process of automatically building, testing, and deploying code changes as soon as they are committed to the code repository. By understanding how CI/CD works, you will be able to build, deploy, and manage cloud-native applications more effectively.
Second, CI/CD allows for faster development and deployment. By automating the build, test, and deployment process, it enables developers to make changes to the code and have them deployed to production faster. This facilitates faster innovation and time-to-market for new features.
Third, CI/CD promotes better collaboration and DevOps culture. By automating the process of building, testing, and deploying code changes, it becomes easier for different teams and developers to work together on the same application.
Fourth, CI/CD allows for better quality and reliability of the software. By automating the testing process, it ensures that code changes are tested as soon as they are made, which helps to catch any bugs or errors early in the development cycle.
In summary, as a new engineer, understanding the concept of CI/CD is important because it is a key component of cloud-native application development, enables faster development and deployment, promotes better collaboration and a DevOps culture, and improves the quality and reliability of the software. It is a powerful tool for building and deploying applications in a cloud environment and is essential for any engineer working in the field today.
Learning Materials
Here’s a list to get you started learning about CI/CD. Note that some of these links may not be free and may require a subscription or payment. I receive no affiliate payments for these links.
Continuous Delivery is a revolutionary approach to software development that focuses on efficient feedback, strong engineering discipline, reducing the amount of work, and shortening cycle times. Automation is essential to reducing cycle time, and the deployment pipeline is used to prove that changes are fit for production.
Possible Learning Path
Hands-on experience: Start by setting up a simple Kubernetes cluster and experimenting with different CI/CD tools such as Jenkins, Travis CI, and GitLab CI/CD. This can be done by following tutorials and guides and deploying these tools on a cloud platform like AWS, Azure, or GCP.
Theoretical learning: Once you have a basic understanding of CI/CD, you can begin to explore the underlying concepts and technologies such as pipeline management, version control, and testing. This can be done through online resources such as tutorials, courses, and documentation provided by CI/CD tools, as well as books and blogs on the topic.
Understanding the principles and best practices: CI/CD is a core aspect of modern software development, so it's important to understand its key principles and best practices, such as automation, testing, and deployment.
Joining a community: Joining a community of CI/CD enthusiasts will help you connect with other people who are learning and working with CI/CD for Kubernetes. This can be done through online forums, meetups, and social media groups.
Practice, practice, practice: As with any new technology, the best way to learn is by doing. The more you practice deploying and using CI/CD tools in a Kubernetes cluster, the more comfortable and proficient you will become with the technology.
A Note from the Architect
I’m about to tell you how I used to publish my websites way back in the day. You see, back then, many of us actually kept a really powerful server under our desks. I had such a server. This server had a network card with two different IP addresses bound to it. One was a Public IP address. Yes, I had a public IP address bound directly onto the network card of the server sitting under my desk. I also had a private IP address so that I could share the website’s folder on the network for the whole company.
Ok, stop laughing, you’re making me feel bad. I haven’t even mentioned the fact that I edited my Active Server Pages directly from that folder. That’s right, changes went directly from development to the internet with no steps in between. I had lots of files with names like “about-us.asp.old,” “about-us.asp.older,” and “about-us.asp.donotuse.”
Well, luckily those days are behind us. Today we use CI/CD (Continuous Integration and Continuous Deployment).
CI/CD is a software development practice that involves automatically building, testing, and deploying code changes. The basic idea behind CI/CD is to catch and fix errors as early as possible, which helps to prevent bugs and other issues from making it into production. This is accomplished by integrating the code changes into a central repository and then running a series of automated tests and builds.
It’s more popular now because it helps to improve software quality, increase collaboration among developers, and reduce time to market for new features. It also helps to ensure that software is always in a releasable state, which makes it easier to roll out new features and bug fixes.
One of the key enablers of CI/CD is version control. Version control is a necessity for modern development because it allows developers to track changes to the codebase over time, collaborate with other developers, and roll back to a previous version if something goes wrong. There’s no longer a need to append extra extensions onto files you might want to keep as a backup.
Pipelines help to ensure quality by automating the process of building, testing, and deploying code changes. This makes it easier to catch and fix errors early in the development process, which helps to prevent bugs and other issues from making it into production. It can also help with security by establishing a traceable software supply chain. When code changes are automatically built, tested, and deployed, it’s easier to track, identify, and isolate any malicious code that may have been introduced into the codebase.
Basically, CI/CD is a software development practice that involves automatically building, testing, and deploying code changes. It helps to improve software quality, increase collaboration among developers, and reduce time to market for new features. Version control is a necessity for modern development, and pipelines help to ensure quality and can help with security by establishing a traceable software supply chain.
Connect with Shawn
Connect with me on LinkedIn. It’s where I’m most active, and it’s the easiest way to connect with me.
So far in this series on data quality I’ve covered Data Quality Metrics, Data Quality or Intelligence Quality, and The Many Paths to Data Quality. I’m wrapping this series up now with an overall approach to Data Quality Management that scales for a small or large team. This is more an outline than a complete guide, because I have some great news: I’ll be working with a team to develop these data quality approaches further. I have had a wonderful opportunity to see some of these ideas practically implemented in an organization I’ve been working with over the past year and a half. However, like any implementation, I discovered a lot of room for improvement and some opportunities to shift focus from the ideas I thought were important to the ideas the client thought were important for their customers–this is the way it should work.
What follows is a high-level overview of a playbook that a team could use to implement data quality. This is more of a first draft, and I’ll be honest, I stole a lot of these concepts from the data mining world.
The Architecture
This playbook was designed for a particular type of architecture. The system it was designed for is a batch lambda data system similar to most big data architectures. It makes use of a data ingestion layer, processing layer, and serving layer. A data lake would be ideal, but we were able to accomplish some of our goals using SQL databases. So it’s fairly flexible, but I would encourage separating data processing concerns. Here’s a good example of our approach to data quality.
Our Basic Data Quality Process for Separating “Good” from “Bad” data.
The following is another view of the data pipeline with some of the decision logic in place:
I won’t go into a lot of detail about the diagram, but it covers the basic concepts of ingestion from a raw source that is as similar to the source data as possible, staging that data for processing, processing it using a Databricks job, and then moving it on to the Refined data layer for additional business rules processing.
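To make that flow a little more concrete, here is a minimal sketch of the “good”/“bad” separation step as a Spark job. The paths, column names, and the single completeness rule are hypothetical placeholders, not the actual pipeline described above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

val spark = SparkSession.builder().appName("raw-to-refined").getOrCreate()

// Hypothetical locations for the RAW, Refined, and quarantine ("bad") areas
val rawPath = "/mnt/lake/raw/vehicles"
val refinedPath = "/mnt/lake/refined/vehicles"
val quarantinePath = "/mnt/lake/quarantine/vehicles"

// Ingest the staged RAW data as-is
val raw = spark.read.parquet(rawPath)

// A single illustrative quality rule: key columns must be present
val isGood = col("vin").isNotNull && col("make").isNotNull && col("model").isNotNull

// "Good" records move on to the Refined layer for business rules processing
raw.filter(isGood).write.mode("append").parquet(refinedPath)

// "Bad" records are quarantined with a reason so they can be reviewed or repaired later
raw.filter(!isGood)
  .withColumn("dq_failure_reason", lit("missing key columns"))
  .write.mode("append").parquet(quarantinePath)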
Why separate Data Quality from Business Quality? Wouldn’t it be easier to process them both in the same place?
Perhaps, but it depends on the situation. For instance, our Refined layer for this particular solution is where the business domain entities start to take shape. We could perform that shaping in Databricks, but for our team the consensus was to apply the business rules at the Refined stage. This is one of the joys of working with a distributed data system. You have some flexibility when it comes to where you want to perform your data processing work, but I would recommend sticking with this basic conceptual structure:
Basic Batch Data Processing Layers
These layers allow for the separation of data processing concerns. If you are familiar with functional data engineering, this is the type of architecture that makes that possible. You can follow this conceptual shape using a fully distributed Modern Data Warehouse approach or you can do the same thing in SQL Server–it depends on your needs and the size of your data. I would recommend that you consult with a Cloud Solutions Architect to help you determine how to right-size your solution. There are trade-offs to whatever architecture you choose, so there’s no such thing as a one-size-fits-all architecture.
Loops and Checklists
How we get things done. Many years ago I had the good fortune of working with a rock star architect. It was probably a turning point in my career. It was the moment I went from being a below-average enterprise developer to someone who gave a damn about my career and the type of solutions I helped produce for my clients. One of the first concepts he taught me was the idea of The Pit of Success.
Falling into the Pit of Success
The first DevOps book I read:
The Book on Falling into the Pit of Success
There were a few great concepts I walked away from that particular engagement understanding:
Checklists are your friend
The whiteboard is a great place to discuss architecture and make decisions
A Schema is a contract – be sure you know what you’re doing when you create one
Every developer is his or her own project manager
Being a good developer is way more than writing code
There were of course more things that I learned on this journey, but those were the highlights.
What does that have to do with Data Quality? Checklists, guardrails, and the pit of success are all a part of modern process engineering. Ideally, systems inform our decisions and our actions. Reinventing the ways in which we work for every new data ingestion is incredibly inefficient.
By introducing systems and processes to the workflow of moving data from initial ingestion to the final trusted layer, we free ourselves of the need to decide what to do for each new data source. Following the prescribed process also prevents us from missing a critical step. There will be situations where some steps are not required, but if the step is shortened or ignored it is done so for a documented reason.
In short, we follow the same path as often as possible. We only deviate from the path if there is a clear reason to do so. We practice continuous improvement and adjust our processes to achieve better quality–quality is the key here.
Life Cycles and Process Loops
The system that is proposed here is a repeatable process loop that moves from initial data ingestion to the final staging of data in a trusted layer or data warehouse. The process is made up of four distinct phases with their own milestones and deliverables – in other words, any observer will know when a phase is done or not done based on deliverables, goals met, and quality metrics satisfied.
Any participant in the process will know decisively where they are in the data pipeline’s journey. Each phase has a clear set of goals, methods of achieving those goals through meetings, tasks, and decisions, and deliverables that are tangible to an observer.
The following section goes over the necessary aspects of every phase – even if that phase is very small, each step in the checklist should be considered.
Our overall life cycle breaks down as follows:
Business Understanding
Data Quality Management
Data Warehousing
Data Governance
Along with the overall life cycle, there are loops within the life cycle for each stage to help build out data quality in a more iterative development process.
Business Understanding
This section outlines the goals, tasks, meetings, and deliverables associated with the business understanding phase of the playbook.
This is the first stage in any new line of data processing, whether that is a line of data from within the target organization or from a partnering organization.
This is the first stage in the recommended life cycle for achieving Data Quality Management for the target organization.
a) Goals
Determine the key data points and where they fit in the abstract model
Determine the key metrics for data quality success
Identify the relevant data sources, how to access them, and how they fit within the organization’s Data Lake model (structured, unstructured, semi-structured data)
b) How we do it
Work directly with partners or line of data processing owners to understand and identify the key metrics. What does quality data look like?
Review the abstract model and determine if the data will fit this ideal or if it will require its own modeling.
Discuss timelines, expectations of data quality, and shape of data
Data shape on ingress and egress
What are successful metrics for this data?
Can we run our standard data quality metrics, or do we need to define new quality metrics?
Are the metrics Specific, Measurable, Achievable, Relevant, and Time-bound?
c) Deliverables
Minimal charter – this is to be updated by the Product Success Manager and Solution Architect, but contributions are welcome from the whole team. This line of work is also given a feature in the backlog, an identifier, and staff assignments – who to go to for what – and is featured in the common Teams Wiki
Raw Data Sources – the team has access to the raw data sources for data provisioning work and data acquisition
Data Dictionaries – metadata report on the data, its schema, and its recommended Data Quality Metrics generated by tools used for Data Quality Management (Amazon Deequ or Agile Lab’s DQ); a short profiling sketch follows below
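As a sketch of how such a data dictionary can be seeded, the following uses Deequ’s column profiler to report completeness, approximate distinct counts, and inferred types per column. The DataFrame name is a placeholder, and the printed fields are only a subset of what the profiler produces:

import com.amazon.deequ.profiles.{ColumnProfilerRunner, ColumnProfiles}

// Profile every column of the ingested data (rawData is a placeholder DataFrame)
val profiles: ColumnProfiles = ColumnProfilerRunner()
  .onData(rawData)
  .run()

// Emit the pieces of the profile that typically seed a data dictionary entry
profiles.profiles.foreach { case (columnName, profile) =>
  println(s"$columnName: completeness=${profile.completeness}, " +
    s"approxDistinct=${profile.approximateNumDistinctValues}, dataType=${profile.dataType}")
}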
Data Quality Management
This section outlines the phase of DQM related to the acquisition and understanding of the line of data processing’s data. Use this phase to determine how to best achieve data ingestion, data quality, and data delivery.
a) Goals
Ingest the data and use DQM tools to measure and determine its raw state data quality
Build tests against the assumed metrics determined during the Business Understanding phase
Produce a set of Data Quality Constraints
Set up a data pipeline to score new or regularly refreshed data
b) How we do it
Ingest the data – stage the data in its RAW section of the Architecture. Initial data processing elements might also need to be developed to support lookup tables, reporting metadata, or housing for “good” and “bad” data.
Explore the data and generate DQM reports using the DQM tools (see the constraint-suggestion sketch after this list)
Data Architects, Data Analysts, Data Quality Analysts, and Development Engineers meet to determine the best technical ways to meet the data delivery challenges – add tasks to the existing user stories to achieve this quality
Set up the data pipeline – move data from RAW to Refined by applying the knowledge gained from the data exploration
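As a hedged sketch of how that exploration can feed the tests and constraints listed in the goals above, Deequ can propose candidate constraints from the raw data itself. The DataFrame name is a placeholder, and the suggestions still need to be reviewed against the metrics agreed during Business Understanding:

import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

// Ask Deequ to profile the raw data and suggest candidate constraints (rawData is a placeholder)
val suggestionResult = ConstraintSuggestionRunner()
  .onData(rawData)
  .addConstraintRules(Rules.DEFAULT)
  .run()

// Review each suggestion; accepted ones become tests in the data pipeline
suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
  suggestions.foreach { suggestion =>
    println(s"Column $column: ${suggestion.description}")
    println(s"  as code: ${suggestion.codeForConstraint}")
  }
}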
c) Deliverables
A Data Quality report built off the raw data with clear metrics noted
Data Quality Cleansing solution – the cleansing process that remains in place after the data cleansing exercise
Solution architecture – Any revisions or additions to the standard Batch Lambda Architecture are recorded and documented on the Batch Lambda Team Wiki
Milestone Decision Document – Before the data pipeline is turned on and in regular use, the whole team meets to discuss what has been done, what the data looks like, and what it will take to achieve some of the data quality objectives – at this point it’s fair to discuss if it is valuable to the owner of the line of data processing to continue or abandon the pipeline
Data Warehousing
a) Goals
Determine how refined data fits within the data warehouse abstract model
Create the transformation to fit the data warehouse and any metadata related to changes from source systems
Exercise the data warehouse model
b) How we do it
Model the mapping from the refined layer to the trusted layer using mapping tools. Discuss and revise the target system as required by the feature’s stated metrics and goals.
Transform from the refined layer to the trusted layer using a data pipeline built for transformation. This transformation should account for changes in schema by first testing the target schema against the transformation for differences (a small sketch of such a check follows this list).
Regression test the data warehouse, reports, and systems that depend on the data warehouse
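One possible way to implement that schema test, assuming both the transformed output and the trusted target are available as Spark DataFrames (the function name is my own, not part of any library):

import org.apache.spark.sql.DataFrame

// Fail fast before loading the trusted layer if the transformation's output
// no longer matches the target schema.
def assertSchemaMatches(transformed: DataFrame, trusted: DataFrame): Unit = {
  val produced = transformed.schema.fields.map(f => (f.name, f.dataType)).toSet
  val expected = trusted.schema.fields.map(f => (f.name, f.dataType)).toSet

  val missing = expected.diff(produced)    // columns the trusted layer expects but did not receive
  val unexpected = produced.diff(expected) // columns the transformation introduced unexpectedly

  require(missing.isEmpty && unexpected.isEmpty,
    s"Schema drift detected – missing: $missing, unexpected: $unexpected")
}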
c) Deliverables
Working refined data pipeline with mapping and transformations tested and working
Solution Architecture updated to match the new state of the data warehouse if there were changes
Data Governance Acceptance
a) Goals
Finalize project deliverables
b) How we do it
System validation – meet with the Data Operations Team to discuss handoff and to review all changes to the system. Ideally, this should be an ongoing process, because many members of the Data Operations Team can also be members of the DQM and Data Warehouse Team.
Project hand-off – turn responsibility of production operations to the Data Operations Team
c) Deliverables
Solution Document
Configuration settings
Passcode and secret vault access
Roles with Accountability
One of the new ways of thinking in IT relates to the approach to teams and team members. Team members take on and fulfill Roles. Roles have areas of responsibility, common tasks, and goals related to achieving data quality. The following is a breakdown of some of those possible roles and how they fit into the operation of the playbook:
Data Analyst
Role
Client or a member of the organization who meets with the DQM team to discuss data metrics and metadata – can be someone already serving as a Business Analyst
Common Tasks
Analyze raw data quality reports to help construct data quality requirements
Update and revise Data Dictionaries
Review refined data
Communicate with business leaders and team leads on the state of organization Data Quality
Build a common understanding between the organization and its partners of the importance of Data Quality
Responsible for
Contributing to Data Dictionaries
Communicating non-technical requirements with business and team members associated with data lines of processing
Quality Analysts
Role
An IT business analyst or Quality Assurance person who will help shape data quality from the technical perspective
Common Tasks
Review Raw Data DQM reports to help define data quality metrics based on the line of data processing goals
Help define data quality metrics
Help define the organization’s high data quality standards and communicate them to the organization and data teams
Write tests to meet organization data quality metrics
Write tests for data constraints
Responsible for
Ensuring data quality metrics and expectations are in alignment with the goals of the business and the expectations set during the business analysis phase
Ensuring data pipelines adhere to the data quality management expectations set for the specific data line of processing
Ensuring pipeline tests are in alignment with data quality expectations
System Development Engineers
Role
SQL, ETL, and application developer responsible for turning business data requirements into reusable code that fits within the Batch-Lambda Data Architecture and the general Data Pipeline process
Uses modern DevOps practices to check code into and out of source control
Instructs architects and data analysts on proven practices to achieve data goals and metrics through technology in alignment with the Batch-Lambda Architecture
Common Tasks
Writing code required to meet data pipeline objectives
Testing that code and writing automated tests for the code for the DevOps pipeline
Meeting with architects and the lead engineer to work out the “how to” of technical tasks
Working tasks from the backlog and checking in changes to the repository related to those tasks
Code review sessions with architects and engineering leads
Responsible for
Delivery of functional code to meet business objectives using the Batch-Lambda Architecture – this could be SQL, Python, or some other language commonly used in modern data pipeline development.
Checking in code and requesting code reviews
Lead Engineer
Role
Responsible for developing infrastructure as code as it relates to the Batch-Lambda Architecture, provisioning Azure resources as needed, overseeing Azure resource budgets, and ensuring environments meet Service Provider standards and security requirements. Mentors architects and developers on proven practices to deliver quality data systems built on Azure, like the Batch-Lambda Architecture. Helps teams adhere to DevOps practices and use the Azure DevOps tools.
Common Tasks
Provisioning Azure Resources
Assigning roles to users
Scripting infrastructure as code and testing the deployment of that code in the CI/CD pipeline
Analyzing the technical feasibility of proposed data and solution architecture
Revising documentation to include “how” certain tasks were done
Creating automation and repeatability wherever possible
Working with the tech team to meet delivery objectives
Code reviews
Responsible for
DevOps practice related to the Batch-Lambda Architecture
Batch-Lambda Architecture security
Batch-Lambda Architecture budgeting
Azure DevOps CI/CD pipelines and code repositories
Data Architects
Role
SQL, ETL and application developer focused on translating business needs into Batch-Lambda Architecture in the Azure Cloud environment
Common Tasks
Work with business, solution architect, engineers, and developers to define the best architecture to fit the needs of the line of data processing.
Reviews solution architecture functional proposals, diagrams, and data requirements to develop a technical solution to meet the business needs.
Instructs Lead Engineer and System Development engineers on the best methods to achieve the technical vision.
Submits minimal documentation and diagrams to solution architect for inclusion in the overall system documentation.
Attends meetings where technical decisions are required
Mentors and instructs team members on the reasoning behind architectural choices.
Responsible for
Quality checks on built infrastructure
Code reviews
Creating diagrams
Creating proof-of-concept code
Creating working code and scripts to support engineering efforts
Product Success Manager
Role
A product/project manager who helps guide the team to remain within scope of the defined iteration, remain on task, and deliver based on the Program Manager’s stated line of data processing goals and metrics
Common Tasks
Arrange and attend meetings, take notes, guide team to remain within scope of meeting goals
Reporting meeting notes with emphasis on meeting goals, the decided means to achieve those goals, and expected deliverables – as well as team assignments
Generate reports showing overall project status and line of data processing status for the Product Manager
Align with team on proven practices for using Azure DevOps, Microsoft Project, and other project management tools and methodologies for meeting delivery
Responsible for
Delivering meeting notes
Product backlog
Solution Architect
Role
Works with the business to align technical teams, processes, systems architecture, and tasks to the business objectives
Common Tasks
Attend most meetings to deeply understand the business problem
Clearly communicate and document the functional business solution with the data teams
Help create a technical architecture with data teams
Help manage the scope of iterations so that teams are on task and able to deliver
Reviews architecture
Reviews code
Responsible for
Solution Architecture Documentation
Code reviews
Decision Documents
Revisions to the Batch-Lambda Architecture Playbook
Program Manager
Role
Directs data teams to achieve business goals utilizing the Batch-Lambda architecture
Common Tasks
Set line of data processing iteration priority with the team and clearly communicate the overall goals and objectives of each line of data processing – so solutions aren’t over- or under-engineered to meet delivery expectations
Set clear delivery expectations – “Success will look like this when we’ve achieved all of the following goals…”
Review and approve overall data quality metrics associated with a line of data processing iteration
Set the vision of team excellence and encourage adherence to the line of data processing iteration process
Attend milestone meetings – daily stand-up, sprint planning review, and backlog grooming – and approve Service Provider–estimated User Stories
Responsible for
Product Backlog Priority
Success of each line of data processing iteration
Overall data quality for each line of data processing iteration
Overall morale and health of the data teams and Batch-Lambda Architecture
Applying this Process at Your Organization
This is definitely a journey worth taking. Issues with data quality account for many failed projects, customer support issues, and the loss of trust from customers. Data quality issues can build a dam in your value streams.
If you would like help or guidance with this journey, please connect with me on LinkedIn and we can arrange a meeting. We have a skilled team of data systems experts, certified Azure developers and architects, as well as the project delivery experts to help begin and support your data practices.
Additional Data Quality Resources
I recently read another great overview of the data quality process on the site Toptal. Toptal describes itself as, “… an exclusive network of the top freelance software developers, designers, finance experts, product managers, and project managers in the world. Top companies hire Toptal freelancers for their most important projects.”
I recently completed a guide for a client related to creating data quality metrics. It’s an interesting problem to tackle, because you absolutely need to understand what goals you need to achieve as an organization. Without these goals defined, you can make the mistake of either blanketing your data with far too many metrics or too few. There are a great many possible data metrics available.
I’m going to cheat and just focus on those that the Deequ library supports. This means we’re going to look for metrics related to areas such as completeness, uniqueness, compliance, and distinctness.
These areas of data quality can be clearly defined and tied to a reportable metric or in some cases, more than one metric. Metrics mean dashboards, which makes it easier to bring data owners and contributors into the picture.
Even though there are several Data Quality tools on the market (Great Expectations is one I’m reviewing now), I’ll focus on using Amazon’s Deequ, primarily because it’s an open source library and it integrates well with modern data tools like Apache Spark. It’s easy to implement, and it gives us reports out-of-the-box that meet our needs for metrics.
I won’t go into details on how to set up and use the tool. That’s covered in the library’s example code. However, I will pull a portion of their reference out and use it to help us plan the first step of our data quality process.
How do these fit into the process, though? Well, we want to begin our process by performing an initial run of data analysis. Our first phase should be an attempt to narrow from general data quality to specifics. Data quality can and often does reach from the quality of the data schema (the shape and the primitive data types that make up the general data container) to the data values – the intention of the data. In unstructured or semi-structured data, the very shape of a collection of data points can be a data value. Keep these points in mind when considering data quality for your individual systems. However, to keep the scope of this set of articles a little more focused, I’m going to stick with data quality applied to the data values, not the various possible data schemas.
To narrow down from our abstract concept of general data quality to measurable attributes that will help us test our data and build meaningful reports, we need to section off what it is that we’re expecting from our data.
For the initial phase of analysis, I can see a data engineer generating reports and sitting down with an analyst to discuss what metrics are truly important for either the specific dataset or the experiment. I can also see this as a loop of meetings until the data begins to meet with the organization’s ideal data quality objective—and I can see this as a process of learning. Many organizations won’t initially know what type of data quality they want to aspire to, because they haven’t invested the time to analyze their data. This will likely be an ongoing, ever evolving process of learning for both the organization and the data engineering team. Building a process around this learning is valuable.
Assessing the Data
Completeness Metric – This does dip a little into the world of Intelligence Quality. It’s about the completeness of a record. For instance, if you have a record with a vehicle, that record should include the make, the model, the year it was built, and the vehicle identification number. If one of these elements is missing, the record lacks completeness.
How do we gather metrics on completeness?
Deequ includes a metric for completeness. It reports the fraction of non-null values in a column. To measure the completeness of an object represented in the database, an analyst would need to identify the columns needed to make that dataset “complete.” This is a case where shaping the data schema to match the DQ rules would probably be worth the extra effort.
Data transformed from the ingested data to match a dataset defined by the business domain could easily be loaded into one or more tables and analyzed for completeness. Datasets that do not meet the completeness strategy could be sent to a bad-records table or marked with metadata related to the failure. It’s important to understand that completeness as a concept is not necessarily limited to the actual “completeness” function within the library.
Other functions from within the library can be combined to establish a dataset’s completeness. All the data could exist for a dataset, but if one of the values is incorrect, it could mean that the dataset is lacking completeness. This should be considered while establishing metrics for completeness.
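A minimal sketch of measuring completeness over the columns that define a “complete” vehicle record, using Deequ’s analyzers; the DataFrame and column names are placeholders for whatever the analyst identifies:

import com.amazon.deequ.analyzers.Completeness
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}

// Measure the fraction of non-null values for each column that makes a vehicle record "complete"
val analysisResult: AnalyzerContext = { AnalysisRunner
  .onData(vehicles) // placeholder DataFrame of vehicle records
  .addAnalyzer(Completeness("make"))
  .addAnalyzer(Completeness("model"))
  .addAnalyzer(Completeness("year"))
  .addAnalyzer(Completeness("vin"))
  .run()
}

// Turn the metrics into a DataFrame that can feed a data quality report or dashboard
// (spark is the active SparkSession)
AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult).show()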
Uniqueness Metric – Data that represents a specific dataset appears only once in the data system. If there is a record that represents Joe Smith, that record for Joe Smith appears only once. In today’s data systems with multiple stages, like bronze, silver, gold, and platinum, that might not be possible, but it should be something that is aimed for in the most trusted layer of the data system – for instance, it might be something that’s tested for in the lower stages but isn’t acceptable in the higher stages that are considered a trusted reporting layer.
Compliance Metric – in a structured database with clearly defined datatypes it’s fairly easy to be certain that an integer will be an integer and that a CHAR will be a CHAR. However, if you have constraints such as, “All values within this column must be between .001234 and .002345,” you’ll need to either write custom functions within your data system to check for these constraints or have another way to check those values for compliance. How many records within the dataset meet the compliance requirements, and how many do not?
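Deequ can express that kind of range rule directly as a check. A small sketch follows, where the DataFrame, column name, and bounds are placeholders for whatever the business constraint actually is:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}

// Verify a range constraint as a SQL-style predicate; the default assertion
// requires that every record satisfy the condition.
val complianceResult = VerificationSuite()
  .onData(measurements) // placeholder DataFrame
  .addCheck(
    Check(CheckLevel.Warning, "Compliance checks")
      .satisfies("reading BETWEEN 0.001234 AND 0.002345", "reading within allowed range"))
  .run()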
Distinctness Metric – how many values are discrete? Deequ measures this as the fraction of distinct values of a column over the number of all values of the column. If you are looking for an entry that should generally be unique, for instance national identification numbers, but you run across values that are not unique, that could be a reason to trigger a warning.
My code for something like this looks like the following:
import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel}

val verificationResult: VerificationResult = { VerificationSuite()
  // data to run the verification on
  .onData(dataset)
  // define a data quality check
  .addCheck(
    Check(CheckLevel.Error, "Review Check")
      .hasSize(_ >= 3000000) // at least 3 million rows
      .hasMin("star_rating", _ == 1.0) // min is 1.0
      .hasMax("star_rating", _ == 5.0) // max is 5.0
      .isComplete("review_id") // should never be NULL
      .isUnique("review_id") // should not contain duplicates
      .isComplete("marketplace") // should never be NULL
      // contains only the listed values
      .isContainedIn("marketplace", Array("US", "UK", "DE", "JP", "FR"))
      .isNonNegative("year")) // should not contain negative values
  // compute metrics and verify check conditions
  .run()
}
My report looks like this:
name          instance                                  value
Completeness  review_id                                 1
Completeness  marketplace                               1
Compliance    marketplace contained in US,UK,DE,JP,FR   1
Compliance    year is non-negative                      1
Maximum       star_rating                               5
Minimum       star_rating                               1
Size          *                                         3120938
Uniqueness    review_id                                 0.99266
Based on what we learn from these reports, we can now go back to our data and make decisions related to how we will start the process for testing our dataset for our new data quality goals.
How does an organization go about determining data quality? What are the first steps in an overall process to achieve the organizational goals related to data quality? First, before we can answer any of these questions, I think it’s important to agree upon a standard definition of data quality.
Wikipedia gives the following definition – “Data quality refers to the state of qualitative or quantitative pieces of information. There are many definitions of data quality, but data is generally considered high quality if it is “fit for [its] intended uses in operations, decision making and planning”. Moreover, data is deemed of high quality if it correctly represents the real-world construct to which it refers.”
This is where I want to touch on a different term, one that is often used alongside data quality even though most people don’t separate the two: Information Quality. Information Quality is defined as, “Information quality (IQ) is the quality of the content of information systems. It is often pragmatically defined as: ‘The fitness for use of the information provided.’” IQ frameworks also provide a tangible approach to assess and measure DQ/IQ in a robust and rigorous manner.
Information Quality is the value of your data in service to your organization’s goals and objectives. Before we start attempting to build a set of processes around improving overall data quality, it’s important to solidify what the expectations are for Information Quality. For our purposes, we’ll stick with the following understanding of Information Quality.
Information Quality should be one of the primary means of determining the performance of business initiatives. Information and Data Quality managed correctly within the goals and guidance of business objectives will primarily be a trailing indicator of where the business is on the journey toward success.
This means that the Information System must be capable of delivering the business the information it needs, when it needs it, and where it needs it, and to the people who need it.
Therefore, it seems that the first step in determining where to focus is to understand the following:
What portions of my existing data does the business need to see in relation to the stated goals and objectives? This will likely require business analysis to uncover the portions of the system that can lend information to the objective.
What is the tempo of information delivery? What data ranges are under consideration? When is this information timely to the objectives of the business? For instance, if I need to see property related to the latest sales on homes in a particular area, because my business wants to build new homes in the area, data from three years ago is likely to be of less value than data from three weeks ago. However, if I’m looking at various flood cycle changes to soil on valley farmland, I’ll likely want the last five years data, and not the last five weeks.
Where do decision makers and stakeholders need to go to view their data – and to a certain extent, where exactly within the view of that data do they need to look? Handing over a massive multi-page report when the stakeholder needs something that could fit in a short message on a mobile text isn’t helpful. So be sure to clarify where the data needs to reside to help the decision maker have what they need when they need it.
Who needs to see it? This is usually covered in Data Governance and Security, but it’s important to understand who needs to see this data and especially who does not need to see this data.
Does this mean that we don’t need to practice data quality standards across our entire system? Not necessarily. There should still be processes in place to ensure that minimal data quality is met. That data needs to be available to pull from at any time—however, the degree to which you enforce rigor should be measured against the specific business objectives.
For instance, let’s say that our data team’s investment house has just acquired a smaller investment firm. Our data team is now responsible for importing the newly acquired data into our data system. The acquired business used an Oracle database and traditional ETL processes. Our company uses an AWS Data Lake architecture. Does it make sense to import the entire data set from Oracle into our DynamoDB? Probably not. We likely just need the elements from the Oracle database that fit our business goals and objectives. Now, does it make sense to store that historical data in a way that we can easily access it – for instance, as flat files in the RAW areas of our data lake – in case it could be of value for future business objectives? Probably.
When a data team has a large collection of data and no guidance from the business on how that data provides value to business decisions, it’s understandable that data teams aren’t clear on where they should focus their efforts. Data teams should work closely with the business to understand the business goals and objectives—likewise, the business must set aside time to clarify with the data team where the value resides within the Information System. This will allow those teams to better understand where they should focus their data and intelligence quality efforts.
This post describes an opinionated method for achieving data quality within a small organization. There are some pre-qualifiers and assumptions made about the data system architecture and the skill set of the team.
One of the biggest struggles for organizations that rely on data is to achieve an acceptable level of data quality. Steps are taken by everyone from front-end developers to ETL engineers to capture what is considered “clean data,” but there are always exceptions that seem to creep into the data that can throw off reporting, damage organizational reputation, or waste the time of valuable resources as they track down the source of poor information.
The problem of poor data quality has been addressed by many disciplined approaches. If you have poor data quality and you would like to see some improvement to your data, there is hope. Here are just a few of the strategies and techniques organizations use to improve their data quality.
Acquisition of new data
Standardization or normalization of new data
Linking records
Data and Schema Integration
Trusted sources
Process control
Process redesign
Error localization and correction
Acquisition of new data – the overall quality of data can be improved by importing standard data sets that meet a high quality metric. For instance, common lookup values like regional names, postal codes, or weights and measures can be standardized by replacing existing data that does not meet the organization’s quality standards.
Standardization or normalization of new data – data standards are created by an organizational governing body. Those standards are enforced by replacing non-standard naming with a standard. For example, all instances of ‘Street’, ‘Str’, or ‘st’ are replaced with ST, and ‘Texas’ or ‘Tex’ are replaced with TX.
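A small Spark sketch of that kind of replacement, assuming hypothetical column names and covering only the variants mentioned in the example:

import org.apache.spark.sql.functions.{col, regexp_replace, upper}

// Normalize street-type and state spellings to the organizational standard
val standardized = addresses // placeholder DataFrame
  .withColumn("street_type",
    regexp_replace(upper(col("street_type")), "^(STREET|STR|ST)\\.?$", "ST"))
  .withColumn("state",
    regexp_replace(upper(col("state")), "^(TEXAS|TEX|TX)\\.?$", "TX"))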
Linked records – this is a little like aggregating similar data into a more robust object, and then using quality data from those objects to inform the quality of other similar objects. This works well when working with an inventory of similar items, like a database of used cars, single-family homes, or SEO data.
Data and Schema Integration – this is probably the most common, where primitive data types are used as evaluation criteria for data inserted into records. Integer values are enforced and must be positive or negative numbers. Some data systems will even allow you to enforce certain data formats to meet organizational policy or business rules.
Trusted Sources – when dealing with multiple sources of data, it’s often difficult to know which one best represents “good data,” so a single source of truth is developed that informs all other data sources of what constitutes quality data. A record from the trusted source is trusted over what should be the same record in another data source.
Process control – this is the way most organizations maintain data quality. A gatekeeper is in charge of checking the data quality, such as an internet web form with input validation or an API with a strictly enforced schema. This often works as long as data continues to flow only through the gateways with these protections.
Process redesign – many organizations face such poor data quality that their only recourse to achieve data quality is to refactor their existing architecture and data capture systems in order to prevent the poor data quality. This might mean replacing legacy systems that cannot enforce process control or building a new type of data system that allows the organization to cleanse the data, which leads us to the system I believe is one many organizations need to consider.
Error localization and correction – this system is focused on analyzing the existing data and checking it for quality. The goal is to identify and eliminate data quality issues by first detecting records that do not meet the organizational standard, and then cleansing that data with scripts or other data sources. This is the system I think most enterprises would benefit most from for the following reasons:
Organizations do not always control the data capture applications that input data into their systems
Organizations often inherit data from other organizations, either through acquisitions or through list purchases–think direct mail companies or organizations that make use of public data sources
Correcting the data incrementally allows an organization to trust their data more over time. There are few data quality fixes that happen overnight. It usually takes a team and a process, and the error localization and correction process fits well with teams and processes.
Knowing the many ways to achieve data quality can help an organization begin the journey toward cleaner, more reliable data sources within their organization.
In my next post, I’ll dive into the many ways that an organization can analyze their existing data and apply metrics that will help establish data quality goals.