Part 2: Conducting a Data Quality Assessment
In last week’s blog post, we went into detail on the value of defining a purpose and establishing specific goals and objectives (if you missed that article, you can read it here). This article looks at how to measure success by establishing key quality metrics. The first question to ask is whether the goals are quantifiable or qualitative (i.e., they can only be described with words). The quality of your data can and should be quantified in some form. In this week’s post, we will discuss the key metrics you should be tracking as you progress through your data quality assessment.
Measuring the Completeness of Your Data
Completeness defines whether all of the necessary data is present in a given dataset. We can break completeness down into two categories: critical fields and optional fields. While you should place more emphasis on the critical fields, you can never be sure which information will become useful once the data is put to use. For example, a birthdate field may not seem relevant at first, but if a personal message is constructed around this information, it may be very appealing to your prospect.
The most obvious metric in this area is measuring how many null and blank values exist in your datasets. Also, consider data that is just a placeholder instead of the actual value. There are lots of tools that will do the counting for you (we will cover this in more detail in the next article). If certain critical fields are missing, the entire record may be deemed unusable, which is another metric you need to be tracking.
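As a minimal sketch of this counting, the snippet below computes a per-field completeness rate that treats both nulls and common placeholder strings as missing. The field names, sample values, and placeholder list are hypothetical; pandas is assumed as the tooling.

```python
import pandas as pd

# Hypothetical contact records; "" , "N/A", and "TBD" stand in for placeholders
df = pd.DataFrame({
    "email": ["a@example.com", None, "", "TBD"],
    "birthdate": ["1990-01-01", "N/A", None, "1985-06-15"],
})

PLACEHOLDERS = {"", "N/A", "TBD", "Unknown"}

def completeness(series: pd.Series) -> float:
    """Share of values that are neither null nor a placeholder."""
    filled = series.dropna().astype(str).str.strip()
    valid = (~filled.isin(PLACEHOLDERS)).sum()
    return valid / len(series)

for col in df.columns:
    print(f"{col}: {completeness(df[col]):.0%}")
```

The same loop could also flag records where any critical field fails the check, giving you the "unusable record" count mentioned above.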
Measuring the Accuracy of Your Data
Accuracy measures the degree of correctness for a given dataset. There will be situations where such measurements are fairly straightforward, such as someone’s birthdate: it is either correct or incorrect and therefore easy to measure. The accuracy of other data may not be so trivial. Consider that many foreign names have alternate spellings, which will not be as simple to establish and verify. The bigger question is whether the accuracy is even relevant.
Using the example above: you send a letter to a prospect with an alternate spelling of their name, but the mail carrier recognizes the variation and delivers it anyway. The goal is achieved despite the error. On the other hand, the recipient may be displeased by the mistake and decide against doing business with you. Ultimately, accuracy depends on the use case for the data, which may or may not be known at the time of the assessment.
Here is another example. I am sure at one time or another we have all received coupons by mail. In many cases, the name is simply replaced with a generic identifier like “Current Resident.” Since the purpose is just to deliver coupons, it probably makes no sense to track the actual name of the current resident. But if you are selling expensive products or services, clients expect a more personal touch and will likely disregard your offering.
I suggest that data accuracy measurement frameworks should focus on four key aspects:
- Where the data is captured – Look into where the data was captured along the customer journey and what alterations (if any) may occur while the data travels to the end user.
- What data is included – Records usually consist of many different attributes/fields but some of the fields are given higher importance. Be sure to measure the accuracy of each key attribute or field based on the most important use cases for that data.
- The measurement method or device – This refers to any technology or methodology used to ensure the correctness of the data. Consider which methods were used for data collection. Was it inspected by experts? Was it compared to existing data or to some gold standard? Were business rules and validation in place to prevent errors?
- Scales at which the results are reported – At the field or attribute level, accuracy can be defined as the number of fields determined to be correct divided by the total number of fields tested. This can also be done on the record level and would follow a similar process with the number of records judged as “completely correct” over the number of records tested.
As we mentioned earlier, some data errors are costlier than others. Therefore, in certain circumstances, it may be necessary to report accuracy in terms of the costs associated with the error.
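The field-level and record-level formulas described above can be sketched in a few lines. The verification results here are hypothetical (True means an expert or rule judged the field correct); only the two ratios are the point.

```python
# Hypothetical verification results: True means the field was judged correct
records = [
    {"name": True,  "email": True,  "phone": False},
    {"name": True,  "email": False, "phone": True},
    {"name": True,  "email": True,  "phone": True},
]

# Field-level accuracy: correct fields / total fields tested
total_fields = sum(len(r) for r in records)
correct_fields = sum(v for r in records for v in r.values())
field_accuracy = correct_fields / total_fields

# Record-level accuracy: fully correct records / records tested
record_accuracy = sum(all(r.values()) for r in records) / len(records)

print(f"field-level:  {field_accuracy:.0%}")
print(f"record-level: {record_accuracy:.0%}")
```

A cost-weighted variant would replace the booleans with the dollar impact of each error, per the note above.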
Measuring Data Consistency
This metric is focused on measuring the consistency of the “meaning” of your data. Let’s illustrate with an example: you need to determine how well your sales team is generating leads over time, so you start tracking the lead creation date as your primary indicator. However, the record creation date can mean different things to different people. Some will interpret it as the date when the client’s contact information was entered into the dataset, while others may assume it indicates the first time contact was made with the client. Therefore, to measure data consistency, you need to make sure that the same terms mean the same thing for everyone. How do we do this?
A very effective option is to calculate the number of unique (or similar) values you have for a given data field. Compare the number of unique values in a set of records to the total number of records. If all values for the given field are unique, then the total count and the unique count would be equal. For example, we know that there are only 50 U.S. states, so a column containing the state should not have more than 50 unique values. If you do get more than 50, there are inconsistencies in the way the states are entered (e.g., “PA” instead of “Pennsylvania”).
In another case, I came across a CRM system with a very high unique-value rate for the job title field, even though titles were fairly standardized in that industry. The same job title was clearly being entered in many different ways, and the solution was to replace the free-form text field with a drop-down list.
Measuring Data Validity
Data validation is performed to determine how well your data conforms to the required attributes. This can be something as simple as making sure the dates follow the same format (e.g., mm/dd/yyyy or dd/mm/yyyy). Validating data structures is important because it confirms that the data model you are using is compatible with the applications that will consume this information.
There are a variety of ways to assess the validity of your data. Most data repositories will allow users to develop scripts that compare existing data values and structure against defined rules. This will ensure that the information meets the required quality parameters. In some situations where the volume of data is very large, this method can be very time-consuming.
Another option is to use an out-of-the-box software program to perform the data validation for you. Such programs include a database of pre-configured scenarios that define rules and formats for a variety of applications. This should reduce the amount of set-up work required on your part. If this approach is a good fit for you, look for a tool that allows you to build validation into every step of your workflow. We will cover some of the tools used for data validation and other stages in the next part of the series.
Measuring the Timeliness of Your Data
Data timeliness refers to measuring when the information required by your organization is available and how long it remains accessible. We can measure timeliness as the duration between when information is expected and when it is readily available for use. It is often important to study this latency (or data lag) and understand its root causes.
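This latency measurement can be sketched as follows. The timestamps and the five-minute target are hypothetical; the point is simply to compute the lag between when a lead is submitted and when it becomes available to the team, and the share of leads landing within the target.

```python
from datetime import datetime

# Hypothetical lead events: (submitted, available-to-sales) timestamps
events = [
    (datetime(2024, 1, 5, 9, 0),  datetime(2024, 1, 5, 9, 3)),
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 12)),
]

# Lag in minutes between expectation (submission) and availability
lags = [(avail - sub).total_seconds() / 60 for sub, avail in events]
avg_lag = sum(lags) / len(lags)

SLA_MINUTES = 5  # assumed response-time target
within_sla = sum(lag <= SLA_MINUTES for lag in lags) / len(lags)

print(f"average lag: {avg_lag:.1f} min; within target: {within_sla:.0%}")
```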
To illustrate the value of data timeliness, consider the following example. In 2011, the Harvard Business Review published a study which determined that you effectively have one hour to qualify a lead from the time it is submitted before that lead goes cold. Fast forward almost ten years, and expectations have only accelerated: you now have roughly 5 minutes to reach out to prospects before your chances of garnering a response drop off sharply. This is in addition to all of the lead nurturing activities that need to happen after the initial contact is established. An organization must constantly monitor signals from prospects and act on them in a timely manner. Be sure to check back next week for Part III, where we will take a look at some of the tools you should be using to help you measure the state of data quality.
DataGroomr Offers Valuable Insights Into the Quality of Your Data
Try DataGroomr for yourself today with our free 14-day trial. As part of the free trial, we offer an instant data assessment that tells you how many duplicates you have. There is no customization or setup required, so you will be able to get started right away. DataGroomr will help you clean up your data, make it easier for your sales team to work with the data in Salesforce, and reduce the workload of your Salesforce admins as they won’t need to waste time creating rules to catch every possible fuzzy duplicate!