Not All Are Created Equal: Evaluating Data Quality
Article Highlights:
We can evaluate data quality by examining characteristics such as relevance, accuracy, completeness, timeliness, and consistency.
Data can be flawed; however, it is still possible to make informed decisions based on data as long as we understand the limitations and work to remedy them.
In the last post in this series, we discussed how guiding our intuition with data can mitigate biases, with the caveat that simply incorporating data does not mean we are bias-free. First and foremost, as we have suggested in earlier posts, it is impossible to fully eliminate biases in our decision making. Secondly, even when we do our due diligence in critiquing our assumptions and understanding the scope of the problem at hand, the data we use to guide our intuition may lack the quality needed to draw meaningful conclusions. To help us understand this better, this post will highlight some characteristics of data sources we should consider.
At this point in the process, we have analyzed and critiqued our research question[1]. We have a sense of what we know, what we are assuming, and maybe what data we would need to confirm (and disconfirm) our intuition. Perhaps through this process we were able to ground our intuition in data, or perhaps that was not possible and we now need to collect data to test our intuition.
When working to answer a research question or add support to our intuition, it is necessary to assess the quality of our data. Poor data can often be worse than no data, leading to flawed interpretations and misinformed decisions. An important disclaimer here is that most data has flaws and thus limitations. The key is to consider the data in the context of our question (which is why critiquing our assumptions and understanding the full picture of the broader problem is so important) while also weighing the flaws. By doing this, we are better able to anticipate the utility and limitations of the conclusions we draw from the data.
In order to evaluate data quality, consider the following characteristics:
Relevance: This pertains to whether the data is appropriate for answering your research question. An obvious example of a mismatch would be using education policy data from California for a research question about education policy in New York. Of course, there may be instances when such data is your only option, but it is important to try to anticipate these limitations prior to analysis.
Accuracy: This refers to the extent to which the data is free from errors. For example, some data still needs to be manually entered or cleaned, which opens the door to potential errors. Even when data is tracked and captured automatically, its format may be compromised when, for example, it is imported into your analysis tool. The goal is to verify the data and catch any errors before analysis.
Completeness: Partial data records can obscure your results, possibly leading to inaccurate or misleading interpretations. It is important to ask yourself, how much of my data is missing? It is also critical to examine any systematic patterns of missing data before analysis. For instance, is data missing for some groups more than for others? Why might this be?
Timeliness: In most cases, you will want data to represent real-time events. So, it is important to ask yourself, how old is this data? Will the insights be representative of the current situation/problem?
Consistency: Somewhat related to accuracy, it is important to consider whether the data you have access to is corroborated by other trusted sources of data. For example, imagine you are interested in how much money is being spent on infrastructure upgrades in your county, but the amounts reported by the local government contradict the amounts in the data provided by the state government. A discrepancy like this is a signal to find out why the sources differ before relying on either one.
These characteristics provide a helpful framework for assessing the quality of data. As mentioned earlier, most data has flaws in relation to the research question being asked, so it is unlikely that each of these characteristics will align perfectly with your research question. The goal is to proactively identify the flaws or limitations and, if possible, determine ways to remedy them (e.g., collect additional, complementary data). In cases where this is not possible, it is necessary to consider how these flaws affect your ability to make decisions based on the data.
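For readers who work with data directly, a few of these checks can be made concrete. The following is only a minimal sketch in Python, not a prescribed method: the records, field names, and figures are all hypothetical, and the helper functions are illustrative names we made up for this example. It shows one way to ask how much data is missing per group (completeness), how old the newest record is (timeliness), and how far apart two sources' figures are (consistency).

```python
from collections import defaultdict
from datetime import date

# Hypothetical survey records; None marks a missing value.
records = [
    {"group": "urban", "income": 52000, "collected": date(2023, 5, 1)},
    {"group": "urban", "income": 48000, "collected": date(2023, 5, 3)},
    {"group": "urban", "income": None, "collected": date(2023, 5, 4)},
    {"group": "rural", "income": None, "collected": date(2021, 2, 10)},
    {"group": "rural", "income": None, "collected": date(2021, 2, 11)},
    {"group": "rural", "income": 39000, "collected": date(2021, 2, 12)},
]


def missing_rate_by_group(rows, group_key, value_key):
    """Completeness: share of rows with a missing value, per group."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        total[row[group_key]] += 1
        if row[value_key] is None:
            missing[row[group_key]] += 1
    return {group: missing[group] / total[group] for group in total}


def age_in_days(rows, date_key, as_of):
    """Timeliness: days between the newest record and a reference date."""
    newest = max(row[date_key] for row in rows)
    return (as_of - newest).days


def relative_gap(a, b):
    """Consistency: gap between two reported figures, relative to the larger."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 0.0  # both sources report zero, so they agree
    return abs(a - b) / largest


rates = missing_rate_by_group(records, "group", "income")
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} of income values missing")

print("Newest record age (days):", age_in_days(records, "collected", date(2023, 6, 3)))

# Hypothetical county vs. state figures for the same spending line.
print(f"Gap between sources: {relative_gap(1_200_000, 1_500_000):.0%}")
```

In this made-up data, income is missing far more often for the rural group than for the urban group, which is exactly the kind of systematic pattern worth investigating before analysis. What counts as "too old" or "too far apart" depends on the research question, so no thresholds are hard-coded here.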
In the next post, we’ll discuss how to make meaning from this process when we have data in hand and a decision to make.
[1] In this blog series, we use two similar but distinct terms: 'research problem' and 'research question'. The research problem is the broader challenge, for example, homelessness. The research question, on the other hand, carves out a specific part of the research problem and frames it for exploration: Does an underperforming economy lead to homelessness?
This is one of several forthcoming pieces in Hawai‘i Data Collaborative’s Data for Good Decisions Series. The purpose of this series is to showcase how to elevate data into important policy and social change decisions to solve challenging problems.