
How to Conduct a Data Quality Assessment

October 21, 2020

Poor data quality is a widespread problem across all industries, and we have all heard or read the horror stories about its consequences at different companies. Data issues come in all shapes and sizes: it's common to see things like (000)000-0000 for a customer contact number, 5/55/55 as the date of purchase, or shipping addresses that are missing state information. It's also clear that the longer you wait to fix data quality problems, the costlier it will be to remediate the impact (see our previous blog, Risks of Poor Data Quality in Salesforce, to read more about the 1-10-100 rule). The first step toward cleaning up your data is to conduct a data quality assessment to identify the source and severity of the issues. In this article, we outline a step-by-step process for assessing your data quality. In future articles, we plan to drill down further into the specific steps and recommend processes and remediations that will improve your data.

The Causes of Poor Data

In order to fix the problems that cause bad data, we need to understand their root causes. The most important thing to realize is that data is never static. It is constantly transmitted from one point to another; new data is created; existing data is adjusted, stored, and sometimes destroyed. Each action that occurs over the course of the data's lifetime presents a potential threat to its quality. It should therefore come as no surprise when data that was once regarded as correct degrades into a state that no longer offers any business value. All of the processes that manipulate data fall under the umbrella term of active data use. There are also processes that can be characterized as passive data use: these do not change the data itself and include shifts in the way data is used, new collection methods, system upgrades, and many others.

Both active and passive data use can erode data quality and the effects may not be evident right away. The goal should be to gain an accurate picture of the state of your data and also shed light on the costs and consequences of bad data quality. The process itself will reveal a path to cleanse existing data and prevent new issues in the future. The following steps describe how to assess data quality. 

Phase 1: Determining the Purpose 

In most cases, an organization that decides to undertake a data quality assessment is already experiencing certain issues and trying to get a sense of the magnitude of the problem. Besides pointing to potential solutions for the discovered issues, the process should document the extent to which your data meets or does not meet the following five data quality standards:

  1. Validity – Does the data clearly and accurately represent the intended result? 
  2. Integrity – Are there safeguards in place that reduce the risk of data entry errors corrupting the data? 
  3. Precision – Does the data contain enough detail to serve as a basis for business decisions? 
  4. Reliability – Does the data reflect consistent data collection processes and analysis methods over the course of its lifetime?
  5. Timeliness – How soon can the data be made available? Is the data current? Can it support real-time decision making? 

Using data that does not meet these standards can cause the confidence of your team members to erode, which in turn is likely to lead to a significant loss of time and productivity. 
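To make the first few standards concrete, here is a minimal sketch in Python of validity and integrity checks for the kinds of issues mentioned earlier: placeholder phone numbers, impossible dates, and missing state fields. The field names (`phone`, `purchase_date`, `state`) are hypothetical, not a prescribed schema.

```python
import re
from datetime import datetime

def validate_record(record):
    """Return a list of validity problems found in a single record."""
    problems = []

    # Placeholder numbers like (000)000-0000 have the right shape
    # but are clearly not real contact information.
    digits = re.sub(r"\D", "", record.get("phone", ""))
    if len(digits) != 10 or digits == "0" * 10:
        problems.append("invalid or placeholder phone number")

    # Dates like 5/55/55 fail to parse as real calendar dates.
    try:
        datetime.strptime(record.get("purchase_date", ""), "%m/%d/%Y")
    except ValueError:
        problems.append("unparseable purchase date")

    # A shipping address missing its state field.
    if not record.get("state", "").strip():
        problems.append("missing state")

    return problems

record = {"phone": "(000)000-0000", "purchase_date": "5/55/55", "state": ""}
print(validate_record(record))
```

Checks like these can run at the point of data entry (preventing bad records) or in batch over existing records as part of the assessment.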

Phase 2: Setting Metrics to Measure the Data Quality 

Now that we know what we are trying to measure, we need to identify some metrics that we will use to measure the data quality. We recommend using the following set of metrics: 

  • Error-to-Data Ratio 
  • The Number of Empty Values
  • Data Transformation Error Rate
  • Unusable Data
  • Additional industry specific metrics

While there are many other metrics that you can use, deciding which one is appropriate will depend on the needs of each individual company. The same holds true for identifying which tools to use. Let’s take a look at this process next. 
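As a rough illustration, two of the metrics above, the number of empty values and the error-to-data ratio, can be computed with a few lines of Python. The field list and sample records here are hypothetical.

```python
# Fields we consider part of the quality assessment (an assumption).
FIELDS = ["name", "email", "phone"]

def empty_value_count(records):
    """Count fields that are missing or blank across all records."""
    return sum(
        1 for r in records for f in FIELDS if not str(r.get(f, "")).strip()
    )

def error_to_data_ratio(error_count, records):
    """Known errors divided by the total number of data values inspected."""
    total_values = len(records) * len(FIELDS)
    return error_count / total_values if total_values else 0.0

records = [
    {"name": "Acme Corp", "email": "", "phone": "(000)000-0000"},
    {"name": "", "email": "info@acme.com", "phone": "555-123-4567"},
]

print(empty_value_count(records))       # blank fields across all records
print(error_to_data_ratio(3, records))  # known errors / values inspected
```

Tracking these numbers over time, rather than as a one-off snapshot, is what turns a metric into a useful quality signal.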

Phase 3: Determining the Data Quality Assessment Tools and Methods

First of all, we need to understand the limitations of data quality management tools. They will not provide a quick fix for data that is incomplete, missing, or outdated beyond a usable state, and the same can be said for outdated legacy systems, spreadsheets, or documents. Any gaps or shortcomings in your current data collection and management processes will have to be reexamined within the context of the entire data framework. Then, and only then, should the focus turn to understanding the strengths and weaknesses of each data cleansing tool available to you. For example, some tools are only designed for specific apps like Salesforce, SharePoint, or SAP, while others are intended to address a specific need, such as spotting errors in physical mailing addresses or phone numbers. 

While there are products that claim to provide comprehensive data cleansing, the variety of issues is such that no single application can effectively address all of them. Instead, once you have identified and ranked specific issues, it is a good idea to choose a tool that targets the most common ones. For example, many organizations struggle with duplicate data, and there are tools specifically designed to address it; working through duplicates often surfaces other, smaller issues along the way. In some cases, duplicates are hard to detect because entries are misspelled or take different forms, such as acronyms. A deduplication tool must account for these nuances while still saving you time and money.
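As a toy illustration of why fuzzy duplicates are hard, the following sketch uses Python's standard-library `SequenceMatcher` to flag near-identical names. Real deduplication tools use far richer matching (acronyms, phonetics, field weighting); the 0.8 threshold here is an arbitrary assumption.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicates(names, threshold=0.8):
    """Return pairs of names whose similarity meets the threshold."""
    return [
        (a, b) for a, b in combinations(names, 2)
        if similarity(a, b) >= threshold
    ]

names = ["DataGroomr Inc.", "Datagroomr Inc", "Acme Corporation"]
print(find_duplicates(names))
```

Note that even this simple approach compares every pair of records, which does not scale; production tools use blocking, indexing, or machine learning to avoid the quadratic comparison.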

Phase 4: Assigning Roles and Responsibilities

The actual size of the team will vary depending on the size of the company but usually it will consist of both technical and business specialists. Some of the usual roles include: 

  • Data Owner – Data Owners are senior stakeholders within your organization who are accountable for the quality of one or more data sets. This covers activities such as making sure there are definitions in place, action is taken on data quality issues, and Data Quality Reporting is in place.
  • Data Users – These will be representatives of the people who use the data every day. This could be sales reps, marketing managers, and anyone else who relies on the data to communicate with customers. 
  • Data Producers – These individuals are responsible for capturing the data and making sure it complies with the quality standards of the Data Users. If your organization uses Salesforce, this will most likely be your Salesforce admins since they are responsible for detecting duplicates and other issues with the data. 
  • Data Steward – The Data Steward helps the Data Owner identify the appropriate remedial actions. They also ensure that employees are following the documented rules and procedures for data and metadata generation, access, and use. 
  • Data Analyst – The roles and responsibilities of the data analyst will vary. They may include exploring, assessing, and summarizing the findings into reports for the stakeholders, but can also include fulfilling some of the duties of the other team members mentioned above. 

Now that everybody knows who’s responsible for what, it is important to keep track of the findings. This is where reporting comes into play. 

Phase 5: Create a Data Quality Assessment Report

Reporting is a key aspect of the data quality assessment framework. The report should answer the questions that initiated the effort in the first place; provide actionable insights into the problems you are experiencing; and, finally, describe how to detect and locate the problem data. In fact, it is a good idea to run the assessment on a regular basis so that problems are identified early. Based on this report, a number of actions could follow: 

  • The stakeholders may decide to extend the scope of the data quality analysis if certain critical objects fell outside the initial scope. 
  • Systematic flaws should prompt further investigation into how the data governance process can be improved and whether any technical documentation is missing. 
  • If you are migrating the data to a new environment, such as a CRM migration, the data should be cleansed based on the findings of the analysis before the migration begins.
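If the report is generated programmatically, it helps to settle on a simple, shareable structure. The sketch below is purely hypothetical; every field name is an assumption rather than a prescribed format.

```python
import json
from datetime import date

def build_report(total_records, findings):
    """Summarize assessment findings into a simple, shareable structure."""
    issue_count = sum(f["count"] for f in findings)
    return {
        "report_date": date.today().isoformat(),
        "total_records": total_records,
        "issue_count": issue_count,
        "issues_per_record": issue_count / total_records,
        "findings": findings,
    }

findings = [
    {"issue": "duplicate records", "count": 120},
    {"issue": "missing state", "count": 45},
]
print(json.dumps(build_report(10000, findings), indent=2))
```

A structure like this makes it easy to diff successive assessment runs and show stakeholders whether the trend is improving.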

Consider Using DataGroomr to Cleanse Your Data

As you are going about your data analysis, you will need to find out how many duplicates are in your data. DataGroomr can perform an audit to determine the percentage of duplicates and help you merge them later on. DataGroomr uses a pre-trained machine-learning algorithm to identify duplicates, so there is no need to set up any complicated rules or matching criteria. You can just log in and start cleaning your duplicates. Over time, the algorithm will learn and adapt to the specific needs of your organization without any human involvement.

Try DataGroomr for yourself today with our free 14-day trial. There is no customization or setup required so you will be able to get started right away. DataGroomr will help you clean up your data, make it easier for your sales team to work with the data in Salesforce, and reduce the workload of your Salesforce admins as they will no longer need to waste time creating rules to catch every possible fuzzy duplicate!

Steven Pogrebivsky

About Steven Pogrebivsky

Steve Pogrebivsky has founded multiple successful startups and is an expert in data and content management systems with over 25 years of experience. Previously, he co-founded and was the CEO of MetaVis Technologies, which built tools for Microsoft Office 365, Salesforce and other cloud-based information systems. MetaVis was acquired by Metalogix in 2015. Before MetaVis, Steve founded several other technology companies, including Stelex Corporation which provided compliance and technical solutions to FDA-regulated organizations.