Data Quality Control

A Requirement on Data for Data Science and Analytics Users

Data Quality
Quality Control

Data quality is important for modern analytics and data science teams, yet almost no one treats it that way. Here, I enumerate some ways to think about and monitor data quality.

Published

February 19, 2023

In manufacturing settings, a fall in quality of input materials and component parts results in scrapped final product or product that fails at too high a rate. For a long time, practitioners of quality have worked to address these issues, and through the application of statistical methods and monitoring regimens, they have attained great results. Why then, for data scientists and analysts, do we have almost no quality control over the raw material of our trades: the data which we use to build models and gather insights?

Physics puts some bounds on the ways that materials and component parts fail, but data rarely has such constraints. Things can go wrong in exciting and interesting ways! In both settings, errors can be very hard to catch. Industrial quality control advocates are careful to instrument and monitor processes and materials to discover problems. Data engineers sometimes take a completely orthogonal approach, trying to have only a single way to get a value, which minimizes the number of places where something could go wrong, but also minimizes the number of places where we might spot that an error has occurred.

First, note that data quality really only makes sense when there are questions and constraints. If we have no questions to answer with data, then there’s no sense in which one set of data is of higher quality than another. Similarly, if there are no constraints, then anything can mean anything, and no questions can be answered. If we look at our task as one of manufacturing answers to questions, then we see that data science and analytics are very much industrial applications amenable to quality control techniques.

The Importance of Questions

The questions that we can answer with our data belong to some domain of questions that is often poorly defined. This can happen when those designing the data systems are not the same people that are asking the data questions. Complex relationships can leave holes in our understanding, and this can lead to an inability to know which questions we can answer and which questions we cannot.

Noy and McGuinness (2001) discuss the importance of Competency Questions, questions that the system should be able to answer, in constructing a formal vocabulary to describe our knowledge. Generating a set of questions and definitions for what important terms in those questions mean is the best way to ensure that a data system is appropriate for such questions.

In the long run, it’s natural for the data source where we try to answer questions to take on a shape that matches the vocabulary we use to ask those questions, and the faster we get to that point, the less suffering we’ll undergo. As such, having a definite and controlled vocabulary, like an ontology, is hugely beneficial. All of this falls under a field called knowledge organization and modeling.

Check it out at your local library!

Constraints

Data quality cannot exist without constraints. Constraints can be abstract requirements such as:

  • Each event in real life should correspond to exactly one row in our event table, or
  • The sum of the values of items in the basket must be equal to the value of the basket before taxes.

But constraints can also be definitions, for example:

  • A new customer is one that hasn’t purchased a product in the past 120 days.

This would imply a constraint that the count of new customers should not include customers who have made purchases in the past 120 days.
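Constraints like these only become useful once they are written down somewhere a machine can check them. As a minimal sketch, assuming hypothetical pandas DataFrames named baskets, basket_items, report, and orders with the column names used below, two of the constraints above might look like:

```python
# A minimal sketch of the constraints above as executable checks. The table and
# column names (baskets, basket_items, orders, pretax_value, item_value,
# order_date, is_new_customer) are assumptions for illustration.
import pandas as pd


def baskets_violating_total(baskets: pd.DataFrame, basket_items: pd.DataFrame) -> pd.DataFrame:
    """Baskets whose pre-tax value does not equal the sum of their items."""
    item_totals = basket_items.groupby("basket_id")["item_value"].sum().rename("items_total")
    merged = baskets.set_index("basket_id").join(item_totals)
    return merged[(merged["pretax_value"] - merged["items_total"]).abs() > 0.005]


def misflagged_new_customers(report: pd.DataFrame, orders: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Customers flagged as new despite a purchase in the past 120 days."""
    cutoff = pd.Timestamp(as_of) - pd.Timedelta(days=120)
    recent = set(orders.loc[orders["order_date"] >= cutoff, "customer_id"])
    return report[report["is_new_customer"] & report["customer_id"].isin(recent)]
```

Any rows these functions return are constraint violations, which is exactly the kind of signal the rest of this post is about.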

No matter the form, constraints are the key to us discovering problems in data quality. Let’s look at two examples where things go awry, one in terms of definition, another in terms of lack of declared constraints.

A Cursed Customer Renewal Report

Let’s say we want a report from our reporting database and we want to make sure that our renewed customer count matches across the whole report. We’ll need to pull data from three separate systems:

  1. The store system handles orders and clearing those orders with the payment processor.
  2. The activation system handles making sure customers have a valid license when they renew.
  3. The operations system keeps track of customers’ subscriptions and helps us notify them when they need to renew their subscription and when their renewal is active.

Here’s where things get dark. We go to a person at each group and ask how to get their renewal numbers. They each give us instructions but we end up with none of our renewal numbers matching. So we go back and talk more to each person and discover they each have two different ways of defining a renewal.

  1. The store system records both the number of renewals submitted to a payment processor that day and the number of renewals the transaction processor cleared that day.
  2. The activation system records the number of renewals the transaction processor cleared that day and the number of products that activated their renewal subscription that day.
  3. The operations system records the number of renewals submitted to the payment processor and the number of renewed activations.

Because “renewed customer” is defined so vaguely, we find ourselves in a situation where we can build a report that matches any two of the three systems but never all three. The lack of a definitional constraint means that we can’t meet our other constraint: a matching count of renewed customers. We have a data quality problem because our data is not suitable to answer our questions.

A Cursed Tracking of Web Activity

Let’s say we have a third party providing click tracking for us (let’s call them Qdobe) and we use that data to analyze experiments we run on our web site. We want to determine whether Qdobe is providing us with good data. But the Qdobe setup is a decade old, and few if any performance constraints remain from when it was first set up. We suspect something is wrong with our Qdobe setup, so we investigate.

Each experiment we run takes place on a specific page, but when we look in the database, we find many different pages appearing as the first page where the experimental treatment was assigned. This feels wrong, but the vendor could be randomizing users into experiments when they hit the site and then only showing them the treatment on the experiment’s page. Since we have no constraint saying that users should be randomized into experiments on the page where the experiment takes place, we have no proof that something is wrong.

Several of our experiments showed a sample ratio different from what we targeted, a sample ratio mismatch. This suggests a problem with randomization or with missing data, but since we don’t know which, acting on this information is difficult. We have strong evidence that something is wrong, yet we can’t take action.
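Detecting the mismatch itself is not the hard part. A minimal sketch of the check, using a chi-squared goodness-of-fit test with made-up counts and an assumed 50/50 target split:

```python
# A minimal sketch of a sample ratio mismatch check. The observed counts are
# made up, and the 50/50 target split is an assumption for illustration.
from scipy.stats import chisquare

observed = [10_421, 10_977]                  # users assigned to control / treatment
target_split = [0.5, 0.5]
expected = [p * sum(observed) for p in target_split]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:                          # strict threshold, since the check runs often
    print(f"Possible sample ratio mismatch (p = {p_value:.2e})")
```

The check tells us that the split is off, but not why, which is exactly the bind described above.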

Because of this, we write a more aggressive test. We send out 30 hits of our own from an automated script to our website, carefully logging when the hits occurred and any Qdobe metadata we received during each interaction. We repeat this process while using a VPN to test from many different locations. After the test, we observe that 8% of the hits we created are missing from our data source. Might the vendor have dropped some traffic thinking it was from bots? It was from bots, after all. Would we expect some proportion of traffic to drop anyway and be acceptable to the business? Is there some proportion missing that is unacceptable? No constraints, so it’s unclear that anything is wrong with Qdobe.

Finally, desperate to validate that our data is good, we compare the counts of unique identifiers in the vendor’s dashboard to local counts of identifiers in our copy of their data. Here we have an implicit constraint that their summary of their data should match our summary of our copy of their data. And here we find a difference!

Consider what we really found. We found several pieces of evidence for data missing in numerous ways, and possibly evidence of poor randomization, but almost none of it violated any particular constraint. Missing data and poor randomization cause severe data quality issues, and yet nothing could be done about any of it except where two dashboards gave different results. Our lack of constraints has made us dumb. I looked up the definition of “dumb” just now and it says “to make dumb or unheard; silence,” so I think it applies here.

Back to Constraints

If these hypothetical organizations had an understanding of their data’s meaning and constraints on its performance, then they would have been in much better hypothetical shape. One way to start improving an organization’s data quality is to define dimensions across which we can analyze our questions and constraints and search for data quality issues.

A lot of people have put together dimensions of data quality to give a vocabulary for such a task, and their lists tend to be remarkably similar. Next, let’s look at my version of this vocabulary, cribbed from others such as the one from icedq.com, as we go further down this rabbit hole.

Dimensions of Data Quality

The list below has nine dimensions of data quality, along with questions that give a sense of what each dimension describes. These are not orthogonal dimensions (you’ll note some crossover between them), but they are the result of a fair amount of impatient Googling in the area.

  • Accuracy: To what extent does the data agree with the real world?
  • Completeness: Is any of the data unexpectedly missing? When a value is missing, is it deliberate and meaningful?
  • Consistency: Does the data agree with itself?
  • Uniqueness: Can we rely on the count of a thing to represent a true fact, or might things accidentally get recorded multiple times?
  • Validity: Are business rules and formulas correctly represented in the data? Are categorical values within their allowed set of values? Is there testing data in the production data sources?
  • Timeliness and Currency: How long does it take for data to get processed? Are there temporal dependencies that can become race conditions later on? How current is the most recent data? If the data is meant to be near real-time, does it represent the current state or is it stale and unactionable?
  • Meaningfulness: Do we understand what a value means? For datetimes, do we know the timezone? For categorical data, do we know the meaning of each level? If a value is nullable, what does null mean?
  • Auditability: Can we derive the state of things at a point in time in the past? When looking at validity, are the business rules and formulas from a given period correctly represented in the data for that period?
  • Conformity: All data has a type; is the same type (including timestamps), scale, and precision used everywhere the data is stored? Is naming consistent?

I think this serves to flesh out a lot of what we need to account for with data quality. I suspect that you, dear reader, went through the list thinking about how historical data from production does or does not measure up against these questions. But there are more types of data involved in data quality than just the data that is at rest.

Metadata is data that describes our data. This can contain definitions, constraints, data types, and notes about historical changes, among other things. Also, the process by which data is moved into the final location (where it is at rest) both has its own metadata and produces its own data. This transit data is also important in understanding our data quality.

Some dimensions, such as Auditability and Meaningfulness, are about the questions we intend to ask and are determined when the plan to bring the data together is constructed. As such, these apply more to the metadata than to the data at rest. Some dimensions, like Accuracy, are concerns for both the data at rest and the metadata. Other dimensions, like Timeliness, are really concerned with how the data is transmitted.

On Data On Data

Our metadata is an important way for people to discover new data and understand how data has changed over time, and it sets important constraints on the data. If data is worth recording, then it’s worth documenting in the metadata. Yet metadata is often neglected.

  • Accuracy requires our definitions and data types to be correct.
  • Auditability requires that if these definitions change at any point in time, the change be recorded.
  • Completeness, Consistency, and Uniqueness of metadata require that when two things are the same they have the same definition, and when two things are different they have different definitions.
  • Meaningfulness requires that our definitions describe what the data represents and how it does so. That means indicating the meaning of each level when something is categorical. A sketch of what one such metadata entry might look like follows this list.
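As a minimal sketch, a metadata entry for a single column might record its definition, type, levels, null semantics, and change history. The column and all of the field names here are hypothetical:

```python
# A minimal sketch of a metadata entry for one column; the column and the
# field names are hypothetical, chosen only to illustrate what belongs here.
column_metadata = {
    "name": "renewal_status",
    "definition": "Outcome of the most recent renewal attempt for a subscription.",
    "type": "categorical",
    "levels": {
        "submitted": "Renewal sent to the payment processor, not yet cleared.",
        "cleared": "The payment processor confirmed the charge.",
        "activated": "The customer activated the renewed license.",
    },
    "nullable": True,
    "null_means": "No renewal has ever been attempted for this subscription.",
    "changes": [
        {"date": "2021-06-01", "note": "Added the 'activated' level when activation data was joined in."},
    ],
}
```

Whether this lives in a data catalog, a YAML file, or a wiki matters less than it existing somewhere versioned and shared.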

When transmitting data from a source of truth to our repository of data at rest, we have many opportunities to define constraints and prevent difficult-to-detect data quality issues from cropping up. We usually have two kinds of data here:

  1. Metadata about the start and end time of the process, the number of rows/entities transferred, and other characteristics (a minimal sketch of this follows the list).
  2. The data being transmitted itself.
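Here is a minimal sketch of that first kind, the metadata a transfer job records about its own run, along with one internal constraint on it. The field names are assumptions:

```python
# A minimal sketch of the metadata a transfer job might record about itself,
# plus one internal constraint on that metadata; the field names are assumptions.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TransferRun:
    job_name: str
    started_at: datetime
    finished_at: datetime
    rows_read: int
    rows_written: int


def check_row_counts(run: TransferRun) -> None:
    """Every row read from the source should land in the target."""
    if run.rows_read != run.rows_written:
        raise ValueError(
            f"{run.job_name}: read {run.rows_read} rows but wrote {run.rows_written}"
        )
```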

Having both kinds of data is interesting because certain dimensions take on dual meanings. For example, Accuracy can be both:

  • Correctly recording what happened, and
  • Maintaining the fidelity of the data being transmitted (e.g., taking care not to mix encodings, change the precision of numeric representations, or use anything that silently ignores errors). A sketch of one such fidelity check follows this list.
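One way to guard that second meaning is to fingerprint a column at the source and again at the destination and compare the two. A minimal sketch, assuming both copies can be pulled as pandas Series:

```python
# A minimal sketch of a transmission-fidelity check: hash a canonical text
# rendering of a column at the source and at the target and compare them.
import hashlib

import pandas as pd


def column_fingerprint(values: pd.Series) -> str:
    """Hash a canonical rendering of a column so two copies can be compared."""
    canonical = "\n".join(values.astype("string").fillna("<NULL>"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


source_prices = pd.Series([19.99, 5.0, None])  # as read from the source system
target_prices = pd.Series([19.99, 5.0, None])  # as read back from the warehouse

if column_fingerprint(source_prices) != column_fingerprint(target_prices):
    raise ValueError("Column changed in transit (encoding, precision, or dropped rows)")
```

This only catches changes that show up in the rendered values, but it is cheap and easy to automate.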

Jobs that transmit data are vulnerable in ways that the metadata and the data at rest are not. Metadata is usually small and easily controlled. The data at rest usually resides somewhere well engineered to grow in a controlled way as the amount of data increases. But jobs that transmit data can drift: individual jobs may take longer to run as the data they transmit grows, and that growth may not be the same for every job. This can change the order in which data arrives over time, which can undermine the Timeliness and Currency of the data.

Transmission data also needs:

  • Auditability so we can see how jobs behaved over time and assess if they interacted in any way.
  • Meaningfulness beyond that of the recorded transmission, as we should be able to inspect the job to determine where it collected data from, what operations occurred in transit, and where it deposited the information.

Testing Data Quality

Many of the things we touched on above would probably not be tested. Most organizations would just assume it was put in place correctly by some team long ago that did things infallibly. In reality, the team putting things together probably just checked that things matched once, twice, or even three times, and then assumed it was fine for all time. How can we address this oversight?

The problem we run into is that testing data quality is hard. There are only a few ways we can test for data quality.

Internal Tests: These are tests where a constraint is expressed, and we look at the data to ensure that the constraint is met. These tests take place entirely among the data sources and so should be easy to automate. Examples include: the number of rows read from a source database and inserted into a target database should be the same, or the value of an order should equal the sum of its line items (both similar to the checks sketched earlier).

Stochastic Tests: These are statistical tests based on the distribution of things in a data set. Such tests are not (usually) suitable for validating data initially, but they may be able to detect changes in the data that indicate an emergent data quality problem. For example, we might treat a drastic change in the distribution of null values as evidence that a data quality problem has emerged. One could even use compressibility as a measure to monitor how certain collections of data relate to each other. Because these tests involve random error, they are susceptible to false alarms and so require the same care used in statistical quality control applications.
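As a minimal sketch, here is a control-chart-style check on a column’s daily null rate, using an assumed historical baseline and rough 3-sigma limits for a proportion:

```python
# A minimal sketch of a stochastic check on a column's daily null rate, using
# rough 3-sigma control limits around a historical baseline; the baseline and
# the counts in the example call are assumptions for illustration.
import math

BASELINE_NULL_RATE = 0.02  # long-run fraction of null values for this column


def null_rate_alarm(null_count: int, row_count: int) -> bool:
    """Flag a day whose null rate falls outside 3-sigma limits for a proportion."""
    rate = null_count / row_count
    sigma = math.sqrt(BASELINE_NULL_RATE * (1 - BASELINE_NULL_RATE) / row_count)
    return abs(rate - BASELINE_NULL_RATE) > 3 * sigma


print(null_rate_alarm(null_count=412, row_count=10_000))  # 4.1% nulls -> True
```

Like any control chart, the baseline and limits need occasional review, or the false alarms mentioned above will erode trust in the check.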

External Tests: Here we take some external action and trace that the activity is represented in the data. This is often one of the only ways we have to test overall accuracy. For example, we could submit 30 visits to a web site by automated means, and check that those specific visits were represented in the data, and that the represented data matches our logging of the original visits (just like in the example above). These can require manual intervention and can be difficult to automate.
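A minimal sketch of such a probe, assuming a hypothetical landing page URL, a hypothetical qa_probe query parameter that the click-tracking vendor passes through, and a clicks DataFrame holding our local copy of the vendor’s data:

```python
# A minimal sketch of an external test: send tagged hits at our own site, then
# check later that each tag appears in our copy of the click data. The URL, the
# qa_probe parameter, and the clicks DataFrame are assumptions for illustration.
import uuid

import pandas as pd
import requests


def send_probe_hits(n: int = 30) -> list[str]:
    """Send n tagged hits and return the tags so they can be looked up later."""
    tags = []
    for _ in range(n):
        tag = str(uuid.uuid4())
        requests.get("https://example.com/landing", params={"qa_probe": tag}, timeout=10)
        tags.append(tag)
    return tags


def missing_probes(tags: list[str], clicks: pd.DataFrame) -> set[str]:
    """Return the probe tags that never appeared in the click data."""
    seen = set(clicks["qa_probe"].dropna())
    return set(tags) - seen
```

The awkward part is the waiting and the plumbing around it, which is why these tests so often stay manual.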

Competency Questions and Documentation: Documentation is not usually thought of as a test, but thorough and exhaustive documentation, such as an ontology, produces definitions that everyone can share and helps us make our domain assumptions explicit. This serves as a baseline for Accuracy and Meaningfulness, and it may be the first place where we realize certain competency questions are unanswerable. Further, if a new question is ambiguous given the documentation, then the question or the business concepts should be refined until the question is no longer ambiguous. This may require painful changes to the data at rest.

Other: I can’t think of another kind of test we could do. Every test I can think of falls into one (or maybe several) of the categories above. I suppose you could say “eyeballing” the data to see if it looks wrong is an option, but that’s a terrible test. So I think we’ve covered all of the non-terrible ways to test data for data quality (at a high level).

Conclusions

Data quality is foundational to any data science or analytics effort, yet very little is done to analyze and guarantee it. Here we discussed several ways to look at the problem of data quality and applied them to our data, the metadata that describes our data, and the way in which we transport data to where it rests. We also discussed some types of tests one could do to monitor data quality; however, most aspects of data quality still cannot be easily tested.

Acknowledgements

I’d like to thank Juliana B. for discussion of and substantive contributions to this post.

References

Noy, Natalya F, and Deborah L McGuinness. 2001. “Ontology Development 101: A Guide to Creating Your First Ontology.” Stanford knowledge systems laboratory technical report KSL-01-05. https://protege.stanford.edu/publications/ontology_development/ontology101.pdf.