Part 1: Conducting a Data Quality Assessment
In last week’s blog post, we briefly summarized some of the steps and processes involved in a data quality assessment. If you missed that blog post, you can read it on our website. This week, we are beginning a five-part series where we will take a more in-depth look at each of the core stages that we discussed in the overview. Our aim is to provide a broader understanding of the value and core elements for each one. The first stage and the topic of this article will be determining the purpose and outlining the goals of performing a data quality assessment.
How Should You Approach Data Quality?
If we were to survey C-suite executives about the state of their data, they would likely respond with a single word such as “good,” “bad,” “high,” or “low.” This one-size-fits-all approach is generally too vague because each of your processes may have its own acceptable level of data quality. For example, if 5% of the records in your CRM are duplicates, this may be acceptable or even fantastic. However, if you are a highly regulated Clinical Research Organization that is performing a clinical study, there is virtually no room for error. Therefore, data quality should be viewed through the lens of intended use rather than as a single overall number.
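To make the idea of use-specific thresholds concrete, here is a minimal Python sketch. The use-case names and tolerance values are illustrative assumptions, not recommendations, and an exact match on email stands in for whatever duplicate definition your organization actually uses.

```python
# Hypothetical tolerances for duplicate records, one per intended use.
# The numbers are illustrative assumptions only.
MAX_DUPLICATE_RATE = {
    "crm_marketing": 0.05,   # 5% duplicates may be tolerable in a marketing CRM
    "clinical_study": 0.0,   # regulated study data: effectively no tolerance
}

def duplicate_rate(keys):
    """Share of records whose key repeats the key of another record."""
    keys = list(keys)
    if not keys:
        return 0.0
    return (len(keys) - len(set(keys))) / len(keys)

def fit_for_use(keys, use_case):
    """True when the measured duplicate rate is within the use case's tolerance."""
    return duplicate_rate(keys) <= MAX_DUPLICATE_RATE[use_case]

# 40 sample records, one of which is a repeat.
emails = ["a@example.com", "b@example.com", "a@example.com"]
emails += [f"user{i}@example.com" for i in range(37)]

print(fit_for_use(emails, "crm_marketing"))   # same data, judged per use case
print(fit_for_use(emails, "clinical_study"))
```

The same data set passes one check and fails the other, which is the point: the threshold belongs to the use, not to the data.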
Broadly speaking, the purpose of a data quality assessment is to offer insight into the health of your data by applying processes and technologies to increasingly large and complex data sets. However, this overall goal is made up of several smaller ones, and each of them is equally important to accomplishing the main objective. Let’s take a look at the first one.
The Validity of Your Data
In a business intelligence context, validity is a measurement of how accurately your data represents your customers and prospects and the effectiveness of your processes. This aspect is so important because validity is what gives your sales teams (and anyone else using the data in your CRM) confidence that what they are seeing corresponds to reality. Bad data leads to employees doubting the information and searching for ways to confirm it or, worse, ignoring it completely. As you can imagine, this results in a lot of wasted time and lost revenue, with your sales and marketing teams spending less time corresponding with customers and more time validating data.
There are a variety of tools and sources available to help you validate your data and we will provide some examples in a later installment of our series. Let’s begin by reviewing the activities that often occur during this stage:
- Data Verification – Making sure the contact emails, phone numbers and addresses are valid. This will also help you when capturing contact data in the future.
- Data Maintenance – Removing and preventing duplicates, and standardizing and normalizing your data by converting disparate data sets into a common data format.
- Data Analytics – Reviewing the health of your data through analytics helps keep small problems from snowballing into bigger ones. Once problems get big enough, they will quickly impact sales and marketing activities, such as pipeline management, campaign performance, and client retention.
- Data Organization – When data is sourced from multiple repositories, it is often complex and time-consuming to access and organize into a useful and presentable format. So, collating this information into a single actionable view will result in improved data management productivity and operational efficiency.
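As a rough illustration of the verification and maintenance activities above, the following Python sketch checks whether an email is plausibly formatted and converts phone numbers into one common format. The function names are hypothetical, the 10-digit phone rule assumes North American numbers, and a syntax check like this does not confirm that a mailbox or phone line actually exists.

```python
import re

# Deliberately simple pattern: something@something.something, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email(value):
    """Syntax check only; it cannot tell whether the address is deliverable."""
    return bool(EMAIL_RE.match(value.strip()))

def normalize_phone(value):
    """Reduce a phone number to 10 digits (assumes North American numbers)."""
    digits = re.sub(r"\D", "", value)          # drop everything except digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                    # strip a leading country code
    return digits if len(digits) == 10 else None

def normalize_record(record):
    """Convert one contact record into a common format."""
    return {
        "email": record["email"].strip().lower(),
        "phone": normalize_phone(record["phone"]),
    }

print(normalize_record({"email": " Jane@Example.com ", "phone": "215.555.0100"}))
```

Normalizing before comparing matters because "Jane@Example.com" and "jane@example.com" would otherwise look like two different contacts.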
The Integrity of Your Data
Data integrity refers to safeguards intended to reduce the risk of data entry errors or any means of data manipulation. By tending to the integrity of your data, you are ensuring that it remains complete and accurate regardless of where it is stored, who accessed it, how often it was accessed or any other actions performed with the data. During this phase, your goals should be to examine the following aspects of data custody:
- Logical integrity – Does the data remain unchanged as it is being accessed from its repository (e.g., relational database)? Are there any precautionary measures in place that prevent human- or computer-generated errors?
- Entity integrity – Are there primary keys and validation rules that uniquely identify each record and ensure that no duplicates are created? In a relational system, a primary key must be unique and cannot be null, so the same record cannot be stored twice.
- Referential integrity – Is related data stored and used in a uniform way? Do your rules include constraints, such as foreign keys, that ensure every reference between tables points to a record that actually exists?
- User-defined integrity – Did your users create rules and constraints to fit their individual needs? If so, are these rules and constraints in accordance with your existing policies?
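The questions above can be explored with a small experiment. The sketch below uses Python's built-in sqlite3 module with a hypothetical account and contact schema (not taken from any particular CRM) to show how a database can refuse duplicate, null, and orphaned records on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema for illustration.
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE contact (
        id INTEGER PRIMARY KEY,                              -- entity integrity
        email TEXT NOT NULL UNIQUE,                          -- no nulls, no duplicates
        account_id INTEGER NOT NULL REFERENCES account(id)   -- referential integrity
    )
""")
conn.execute("INSERT INTO account (id, name) VALUES (1, 'Acme')")
conn.execute("INSERT INTO contact (email, account_id) VALUES ('jane@example.com', 1)")

def try_insert(email, account_id):
    """Attempt an insert and report whether the constraints allowed it."""
    try:
        conn.execute(
            "INSERT INTO contact (email, account_id) VALUES (?, ?)",
            (email, account_id),
        )
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

print(try_insert("jane@example.com", 1))   # duplicate email
print(try_insert("bob@example.com", 99))   # account 99 does not exist
print(try_insert(None, 1))                 # missing critical value
print(try_insert("bob@example.com", 1))    # valid record
```

Only the last insert succeeds. Constraints like these protect the structure of the data, but they cannot tell whether a well-formed email address belongs to the right person.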
Now that we have established what is encompassed by data integrity, we also need to understand what is not. Even though data integrity is concerned with keeping your information intact and accurate, it is not a substitute for the quality of the data itself. As the saying goes, “garbage in, garbage out.” If the quality of your data is poor, maintaining data integrity is not going to make that data useful.
The Precision of Your Data
In today’s business environment, every organization strives to be data-driven and to make decisions based on accurate and timely information. However, does the data you currently have contain enough detail to enable you to do this? A report from NewVantage Partners shows that while 98.6% of executives responded that they aspire to a data-driven culture, only 32.4% report having success. While your ultimate goal should be to become data-driven, you should also set smaller goals and objectives, for example:
- Improve data quality at the point of capture
- Do not accept null values for critical data
- Define an acceptable ratio of valid data to overall data
For our purposes let’s take a look at duplicate data, which is a problem that many businesses are struggling with. It is a good idea to check for duplicates before any data is imported into Salesforce (or other repositories) because introducing more duplicates will just exacerbate the problems. In much the same way, preventing imports with missing critical data will make your data more actionable. Finally, if you are working to improve your data hygiene, it is vital not to accidentally eliminate critical data. This often occurs when systems become overwhelmed with invalid data and “the baby gets thrown out with the bathwater.”
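A pre-import check along these lines might look like the following Python sketch. It is an assumption-laden illustration: the list of critical fields, the exact-match rule on email, and the `screen_import` function are all hypothetical and are not part of Salesforce or any other product.

```python
# Assumed list of fields that must never be empty on import.
CRITICAL_FIELDS = ("email", "last_name")

def screen_import(rows, existing_emails):
    """Split incoming rows into accepted rows and (row, reason) rejections."""
    accepted, rejected = [], []
    seen = {e.strip().lower() for e in existing_emails}
    for row in rows:
        missing = [f for f in CRITICAL_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            rejected.append((row, "missing: " + ", ".join(missing)))
            continue
        key = row["email"].strip().lower()
        if key in seen:  # matches an existing record or an earlier row in this batch
            rejected.append((row, "duplicate email"))
            continue
        seen.add(key)
        accepted.append(row)
    return accepted, rejected

incoming = [
    {"email": "jane@example.com", "last_name": "Doe"},    # already in the system
    {"email": "bob@example.com", "last_name": ""},        # missing critical field
    {"email": "ann@example.com", "last_name": "Lee"},     # clean
    {"email": "ANN@example.com", "last_name": "Lee"},     # duplicate within batch
]
accepted, rejected = screen_import(incoming, existing_emails=["jane@example.com"])
print(len(accepted), len(rejected))
```

Rejected rows are returned with a reason rather than silently dropped, so someone can review them before anything is discarded.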
The Reliability of Your Data
When setting objectives on the reliability of your data, you should look at the following characteristics:
- Completeness – It’s hard to rely on data that contains gaps in information. Identify the number of fields that have been left blank and the ones that contain junk data, such as fake email addresses or (000)000-0000 for phone numbers.
- Uniqueness – Duplicate records by their very nature cannot be considered reliable since they can contain contradicting, outdated, or incomplete information. Be sure to measure the level of duplication and set a goal to eliminate duplicates from your data repositories.
- Timeliness – Determine the degree to which your data is up-to-date and available within a reasonable amount of time. While it is important to capture complete and accurate information, it must also be available in a timely manner. Data such as emails, street addresses, and phone numbers change frequently, especially these days when more and more people are working remotely.
- Consistency – This represents the absence of any discrepancies between the data items representing the same object. To determine consistency, you can compare your data against other data sets with similar specifications.
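Several of these characteristics can be expressed as simple ratios. The Python sketch below is one possible way to measure completeness, uniqueness, and timeliness. The sample records, the `last_verified` field, the junk-value list, and the one-year freshness window are all assumptions made for illustration.

```python
from datetime import date

# Assumed placeholder values that should not count as real data.
JUNK_VALUES = {"", "(000)000-0000", "test@test.com"}

def completeness(records, field):
    """Share of records with a real (non-blank, non-junk) value in the field."""
    filled = sum(1 for r in records if (r.get(field) or "").strip() not in JUNK_VALUES)
    return filled / len(records)

def uniqueness(records, field):
    """Share of distinct values among populated values (case-insensitive)."""
    values = [r[field].strip().lower() for r in records if r.get(field)]
    return len(set(values)) / len(values)

def timeliness(records, as_of, max_age_days=365):
    """Share of records verified within the last max_age_days."""
    fresh = sum(1 for r in records if (as_of - r["last_verified"]).days <= max_age_days)
    return fresh / len(records)

records = [
    {"email": "jane@example.com", "phone": "(215)555-0100", "last_verified": date(2024, 1, 10)},
    {"email": "JANE@example.com", "phone": "(000)000-0000", "last_verified": date(2021, 6, 1)},
    {"email": "bob@example.com", "phone": "", "last_verified": date(2023, 11, 5)},
    {"email": "ann@example.com", "phone": "(215)555-0101", "last_verified": date(2024, 2, 20)},
]

print(completeness(records, "phone"))            # 0.5
print(uniqueness(records, "email"))              # 0.75
print(timeliness(records, date(2024, 3, 1)))     # 0.75
```

Tracking ratios like these over time gives you a baseline against which to set the goals described above.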
The Timeliness of Your Data
We touched on the value of timeliness as part of the reliability of your data. This aspect of the data assessment activity deserves a lot more attention since timing is everything when you are capturing, interpreting, and acting on real-time data (or what is perceived as timely data). The focus should be on setting goals for how quickly data needs to be provided in order for it to be actionable. If you are in the B2C segment, there is a small window of time during which consumers will be interested in a particular product. The duration can be estimated by observing consumer activities on social media, search queries, and click-through events. In the B2B segment, you also have a limited amount of time to reach stakeholders before they move on to another product or service provider.
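One way to turn a timeliness goal into something measurable is to compare the moment data was captured with the moment it became available to the people who act on it. The sketch below assumes a one-hour target and hypothetical `captured_at` and `available_at` timestamps; your own target will depend on your segment and processes.

```python
from datetime import datetime, timedelta

# Assumed goal: data should be usable within one hour of capture.
TARGET_LAG = timedelta(hours=1)

def late_share(events, target=TARGET_LAG):
    """Share of events that became available later than the target lag."""
    late = [e for e in events if e["available_at"] - e["captured_at"] > target]
    return len(late) / len(events)

events = [
    {"captured_at": datetime(2024, 3, 1, 9, 0), "available_at": datetime(2024, 3, 1, 9, 5)},
    {"captured_at": datetime(2024, 3, 1, 9, 0), "available_at": datetime(2024, 3, 1, 11, 0)},
    {"captured_at": datetime(2024, 3, 1, 9, 0), "available_at": datetime(2024, 3, 1, 10, 0)},
]

# One of the three events missed the one-hour target.
print(f"{late_share(events):.0%} of events arrived late")
```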
This concludes Part I of our series examining the goals and objectives of a data quality assessment. Be sure to check back next week for Part II, where we will examine setting the metrics used to measure data quality.
DataGroomr’s Free Trial Includes an Instant Data Assessment
Since checking for duplicates is an important aspect of the data quality assessment, you need a comprehensive tool that will allow you to better manage the health of your data. By leveraging machine learning, DataGroomr is more accurate than the rule-based deduping approach that other providers use. You can just log in and start cleaning your duplicates. Over time, the algorithm will self-learn and customize itself for the specific needs of your organization without any human involvement.
Try DataGroomr for yourself today with our free 14-day trial. There is no customization or setup required so you will be able to get started right away. DataGroomr will help you clean up your data, make it easier for your sales team to work with the data in Salesforce, and reduce the workload of your Salesforce admins as they will no longer need to waste time creating rules to catch every possible fuzzy duplicate!
Read How to Conduct a Data Quality Assessment and be sure to check back next week for Part II of this series.