
The 6 Attributes of High-Quality Data

Today’s digital technology gives us the means not only to measure large quantities of data but also to store it securely. However, the volume of data, growing exponentially by the day, requires new methods to manage the quality of that information. To detect, prevent, or repair data issues, the concept of data quality has evolved. Data quality is denoted by a number of factors such as the accuracy, completeness, relevancy, validity, timeliness, and consistency of the data set at hand. With high-quality data come valuable insights, while low-quality data introduces more opportunities for errors and inaccuracies in future analysis.

What makes data “high quality,” and how do you measure it? In this article, we will define high-quality data and consider real-life scenarios in which data quality should not only be taken into serious consideration but should become the baseline for data management.

1. Data Accuracy

Accuracy is the measure of how well a data set models the reality of the event being analyzed. An example of inaccurate data is when your thermometer displays that it is 50 degrees Fahrenheit outside, but it is actually 85 degrees. In this situation, the data does not model the real-world temperature. Such inaccurate data could lead to a person thinking it was cold enough to wear a jacket when none is necessary. Although a trivial example, this illustrates how inaccurate data can lead to bad decisions, and bad decisions are how organizations fail.  

As one step toward maintaining accurate data, consider the following questions:  

  • Does this sample truly reflect real-world events being described? 
  • Are there data points that represent incorrect measures that need to be fixed?  
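The questions above can be turned into a simple automated check. The sketch below, with illustrative values based on the thermometer example, flags readings that deviate from a trusted reference measurement beyond a tolerance; the field names and threshold are assumptions, not part of any particular tool.

```python
# Sketch: flag readings that deviate from a trusted reference beyond a
# tolerance. The tolerance of 2 degrees is an illustrative assumption.

def flag_inaccurate(readings, reference, tolerance=2.0):
    """Return indices of readings that differ from the reference
    measurement by more than `tolerance`."""
    return [
        i for i, (measured, actual) in enumerate(zip(readings, reference))
        if abs(measured - actual) > tolerance
    ]

# A thermometer logging 50°F while a trusted source reports 85°F:
sensor = [84.9, 50.0, 85.3]
trusted = [85.0, 85.0, 85.0]
print(flag_inaccurate(sensor, trusted))  # → [1]
```

In practice the “reference” might be a second sensor, a manual audit sample, or an authoritative external source.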

2. Data Completeness

Completeness is the measure of how well data fulfills the expectations of what is being measured. In other words, complete data has no gaps in it and is a full, comprehensive measure. Incomplete data could be as simple as a customer not filling in their gender on a survey. While they may have provided other information, without their gender, the other information is no longer useful as the full picture of the participant is unclear. For instance, if one was looking to understand voter turnout based on gender, without a participant’s gender, their voting record is essentially useless. The completeness of a data set is crucial to comprehensive analysis.  

To assess and measure the completeness of your data, ask questions such as: Are all the fields complete? Was all relevant information provided? Are there any missing factors? Do the available answers fulfill expectations of what is comprehensive? Missing data impacts the quality of the entire data set, so it is of the utmost importance that there is as much completeness as possible.
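A basic completeness score can be computed as the share of required fields that are actually filled in. In this sketch the field names and survey records are illustrative, echoing the voter-turnout example above:

```python
# Sketch: completeness as the fraction of non-empty required fields.
# The required-field list and records below are illustrative assumptions.

REQUIRED = ["name", "gender", "voted"]

def completeness(record, required=REQUIRED):
    """Return the fraction of required fields that are filled in."""
    filled = sum(1 for f in required if record.get(f) not in (None, ""))
    return filled / len(required)

survey = [
    {"name": "Ana", "gender": "F", "voted": "yes"},
    {"name": "Ben", "gender": "", "voted": "no"},   # gender left blank
]
print([round(completeness(r), 2) for r in survey])  # → [1.0, 0.67]
```

Records scoring below 1.0 can then be routed for follow-up or excluded from analyses that depend on the missing field.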

3. Consistency

Consistency is another measure that looks at how well existing data represents reality. Essentially, does the information in your data truly reflect the same information potentially stored in another place? When looking at consistency, it is important to assess not only the content, but also the format of the data. For example, when the value of money deposited into an account does not match the value recorded by the bank, there is an inconsistency in the content of the data; i.e., the dollar amount between the bank’s records and your own is inconsistent. Inconsistency can lead to large discrepancies in what is assumed to be true. While you might assume you have $1000 in your account, the bank account only has $500. If you try to buy something over $500, the transaction will fail.

To measure and understand the consistency of your data, it is important to compare a data set to multiple other data sets. Theoretically, if numerous data sets are measuring the same event, the values should also be the same. If not, there may be an error in your data collection method and/or output. Catching data inconsistencies early helps prevent illogical analysis in the future.
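Comparing two sources of the same measurement is straightforward to automate. This sketch mirrors the bank-balance example; the account IDs and amounts are illustrative:

```python
# Sketch: compare the same balances recorded in two systems and report
# the accounts that disagree. Keys and amounts are illustrative.

def find_inconsistencies(ledger_a, ledger_b):
    """Return account IDs whose balances disagree between two sources."""
    return sorted(
        acct for acct in ledger_a.keys() & ledger_b.keys()
        if ledger_a[acct] != ledger_b[acct]
    )

your_records = {"acct-1": 1000, "acct-2": 250}
bank_records = {"acct-1": 500, "acct-2": 250}
print(find_inconsistencies(your_records, bank_records))  # → ['acct-1']
```

Note this only compares accounts present in both sources; accounts missing from one side are a completeness problem rather than a consistency one.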

4. Timeliness

Timeliness is another important data quality metric that quantifies the availability of certain data to a user when it is needed. Unlike the other metrics, timeliness is largely related to a user’s experience. Timeliness can also refer to the time at which a data event was recorded. In this situation, time can affect the quality of the data itself rather than its availability to the user. For instance, data collected about the stock market revolves heavily around the timeliness of activities and calculations in the market. Because the markets are so mercurial, stock data is only relevant for a short period after collection and quickly loses its value.

When looking at timeliness, ask yourself: When was the data collected? What could have possibly changed since this data was collected? If too much time has passed between collection and use, the data may no longer be relevant, thereby degrading its quality.
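A timeliness check can be as simple as comparing a record’s collection timestamp against a maximum acceptable age. The 15-minute cutoff below is an illustrative assumption in the spirit of fast-moving stock data:

```python
# Sketch: treat data older than a maximum age as stale.
# The 15-minute cutoff is an illustrative assumption.
from datetime import datetime, timedelta, timezone

def is_timely(collected_at, max_age=timedelta(minutes=15)):
    """Return True if the record is younger than `max_age`."""
    return datetime.now(timezone.utc) - collected_at <= max_age

quote_time = datetime.now(timezone.utc) - timedelta(hours=2)
print(is_timely(quote_time))  # → False: a two-hour-old quote is stale
```

The right `max_age` depends entirely on the domain: minutes for market data, perhaps months for census-style demographics.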

5. Data Validity

The validity of data often has to do with the data’s adherence to a specific format and/or rule. Invalid data typically has been input in a format that is not understandable by the program and/or person analyzing such a set. Validity is often a consequence of how the data is collected. For instance, say a specific survey asks for a participant’s social security number. SSNs are normally in the format XXX-XX-XXXX (a nine-digit number with two dashes). A program inputting such SSNs into a company database can only recognize numbers input in the correct nine-digit format. Any SSNs input without dashes, with misplaced dashes, or with fewer or more than nine digits are deemed invalid. This connects to the dimension of completeness because without the SSN, the participant’s other answers are no longer relevant. Invalid data has far-reaching consequences.

To assess the validity of your data, review the formatting and adherence to rules of certain well-known attributes such as birthdays, monetary amounts, SSNs, ages, time, etc. Any missing or incorrectly formatted data contributes to lingering issues for the quality of data.
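Format rules like the SSN pattern described above are a natural fit for regular expressions. This sketch checks format only; it says nothing about whether the number actually belongs to anyone:

```python
# Sketch: validate the XXX-XX-XXXX format with a regular expression.
# This is a format check only, not a check of the SSN's existence.
import re

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def is_valid_ssn(value):
    """Return True if `value` matches the XXX-XX-XXXX format."""
    return bool(SSN_PATTERN.match(value))

print(is_valid_ssn("123-45-6789"))   # → True
print(is_valid_ssn("123456789"))     # → False (missing dashes)
print(is_valid_ssn("12-345-6789"))   # → False (misplaced dashes)
```

The same approach extends to dates, phone numbers, and monetary amounts, each with its own pattern or range rule.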

6. Relevancy (Uniqueness)

The last important data quality metric is relevancy. Relevant data is data that is useful to the analysis at hand. If a specific study is looking at how geography affects healthcare, it is irrelevant to ask which food the subject likes to eat. Ultimately, the data point of a favorite food will not help solve healthcare disparities based on geography. Furthermore, when looking at relevancy, it is also important to assess the uniqueness of the data. Data points that are repetitive or represent the same factor as another sample can be irrelevant and introduce redundant data. For instance, asking about a person’s socioeconomic status and their annual income is repetitive.

In deeming the relevancy of a specific factor, look at the usefulness of that factor within the larger analysis and/or conclusion. It is also important to ask yourself if a data point represents the same event as another data point. If yes, your data is not unique and thus the quality of the data will decrease.
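On the uniqueness side, duplicate records can be detected by comparing each record’s identifying fields against those already seen. The key field and contact records below are illustrative assumptions:

```python
# Sketch: detect records that duplicate an earlier one on the fields
# that identify an entity. The key field here is an assumption.

def find_duplicates(records, key_fields=("email",)):
    """Return indices of records whose key fields repeat an earlier record."""
    seen, dupes = set(), []
    for i, rec in enumerate(records):
        key = tuple(rec.get(f) for f in key_fields)
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes

contacts = [
    {"name": "Ana", "email": "ana@example.com"},
    {"name": "Ben", "email": "ben@example.com"},
    {"name": "Ana B.", "email": "ana@example.com"},  # same person, re-entered
]
print(find_duplicates(contacts))  # → [2]
```

Real deduplication usually needs fuzzier matching than exact key equality (typos, name variants), but exact-match detection is a useful first pass.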

Building Value through High-Quality Data

Now that we have covered the characteristics of data quality, we can look at some benefits of high-quality data. Perhaps the most important factors have to do with the user’s perception. In the age of misinformation, high-quality data certification goes a long way. Users want to feel comfortable knowing that the data they are accessing is not only factually correct, but also a true depiction of the events they are trying to analyze. High-quality data confirms that the data is true to the events being described and contains nothing that could contribute to misanalysis.

Data quality is relevant to all companies, not just technology-based users. For instance, marketing companies are constantly assessing data points to understand how they interact with their target audience. Marketers arguably have the most to gain from reliable, high-quality data. High data quality in conjunction with AI systems and machine learning helps marketers feel comfortable and confident that the patterns they are identifying in their target client base are true and honest to what is happening in reality.

High data quality across all platforms will ultimately contribute to a new standard in data collection and analysis. Companies can be assured that their streamlining efforts to combine data with AI provide accurate and trustworthy results. As shown in this article, high data quality really can be boiled down to six easy checks. Yet although seemingly simple, these measures contribute to an increased feeling of trust in our data.  

Start Taking Action Right Away

The low-quality data issues you are dealing with will not go away by themselves and the sooner you start addressing this problem, the better. A good place to start is by getting an overview of the health of your data using DataGroomr’s Trimmr and Brushr modules.  

Trimmr provides an instant analysis of how many duplicates you have in your Salesforce data and then allows you to eliminate them individually or en masse. Brushr analyzes your data and verifies attributes such as phone numbers, emails, and addresses. Moreover, Brushr has powerful features to transform, update, and/or delete the information you no longer need.

Try DataGroomr today with our free 14-day trial.


About Steven Pogrebivsky

Steve Pogrebivsky has founded multiple successful startups and is an expert in data and content management systems with over 25 years of experience. Previously, he co-founded and was the CEO of MetaVis Technologies, which built tools for Microsoft Office 365, Salesforce and other cloud-based information systems. MetaVis was acquired by Metalogix in 2015. Before MetaVis, Steve founded several other technology companies, including Stelex Corporation which provided compliance and technical solutions to FDA-regulated organizations.