
Harnessing the Power of the Deduplication Algorithm

October 14, 2020

In certain respects, the data collected by a business is a raw resource, similar to oil retrieved from a well. Once the oil is pumped out of the ground, it must still be refined, transported, and stored before being put to use. Similarly, raw data must be cleansed, transformed into the correct format, and stored in a convenient place before business users can consume it. This is why companies hire data engineers and devote significant IT resources to these activities. A business that can convert raw data into business intelligence has a clear advantage over its competition, and a business that understands the value of a deduplication algorithm pulls even further ahead.

Just like the manufacturing sector, data analytics relies on advanced tools and methods. These tools help prepare data by automating large-scale, repetitive tasks. In the case of data cleansing and duplicate detection, machine learning technology has the potential to significantly reduce the time and money spent on these activities. Deduplication tools have existed for decades, relying largely on user-created data matching rules combined with ever more powerful computing hardware. Today, we will delve deeper into various deduplication approaches, including those used by Salesforce and third-party apps, and contrast them with the evolving use of machine learning algorithms to perform these tasks.

What is Inside Salesforce’s Dedupe Algorithm?    

Salesforce’s dedupe algorithm includes three components. 

  • Matching Equation—This determines which fields must match for two records to be considered duplicates. For example, for Contacts, this could be First Name AND Last Name AND Company Name. More detailed information about the matching equation can be found here.
  • Matching Criteria—Salesforce uses this to determine which algorithm will be used for each field to identify duplicates and how blank fields should be treated. For example, the field MailingStreet is broken down into sections (street number, street name, suffix, etc.), and each section is compared separately with its own matching method and corresponding score. Complete information on this can also be found in the document referenced above.
  • Matching Algorithm—A set of matching algorithms is applied to these fields to find exact and fuzzy matches. Note that more than one matching algorithm can be used to compare a single field, and each matching algorithm is scored differently based on how closely it matches the field. Complete information on the matching algorithms can be found here.
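To make the idea of a matching equation concrete, here is a minimal sketch of how per-field matching methods combine into a duplicate decision. This is not Salesforce's actual implementation; the field names and the simple exact-match method are assumptions for illustration only.

```python
# Illustrative sketch of a matching equation -- NOT Salesforce's internal
# code. Field names and the exact-match method are assumed for the example.

def exact(a, b):
    """A simple matching method: case-insensitive, whitespace-trimmed equality."""
    return a.strip().lower() == b.strip().lower()

def is_duplicate(rec_a, rec_b):
    """Matching equation: First Name AND Last Name AND Company Name."""
    return (exact(rec_a["first_name"], rec_b["first_name"])
            and exact(rec_a["last_name"], rec_b["last_name"])
            and exact(rec_a["company"], rec_b["company"]))

a = {"first_name": "Ann", "last_name": "Lee", "company": "Acme"}
b = {"first_name": "ann", "last_name": "LEE", "company": "Acme "}
print(is_duplicate(a, b))  # True
```

In a real system, `exact` would be swapped for fuzzy matching methods with per-field scores, but the AND-structure of the equation stays the same.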

Even though Salesforce has built-in deduplication algorithms, they have certain limitations, especially for organizations working with large volumes of records. We have written extensively about this in the past, and you can learn more about the limitations of Salesforce’s deduplication functions in our previous blog posts. Since Salesforce alone falls short of adequately cleansing data, many companies turn to third-party apps. These apps do provide richer and more user-friendly functionality, but as we outline below, they offer only marginal improvement on Salesforce’s built-in feature set.

The Deduplication Algorithm Used by Third-Party Apps

When working initially with third-party apps, many companies simply assume that similar records are likely to be duplicates. What is less clear is what “similar” actually means. A common solution to this problem is to introduce metrics, or scores: if two strings are similar, a high score is returned; conversely, a low score is assigned when strings are deemed dissimilar. For example, one popular string metric is the Hamming distance, which counts the number of positions at which two equal-length strings differ, i.e., the number of substitutions needed to turn one string into the other. There are many other string metrics, but they all share the trait of returning a numeric result.
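The Hamming distance described above can be written in a few lines; the conversion into a 0-to-1 similarity score is one common convention, shown here as an assumption rather than any particular vendor's formula.

```python
def hamming_distance(s1, s2):
    """Count positions at which corresponding characters differ.
    The Hamming distance is defined only for equal-length strings."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def similarity(s1, s2):
    """One common way to turn the distance into a score in [0, 1]."""
    return 1 - hamming_distance(s1, s2) / len(s1)

print(hamming_distance("jonathan", "jonathon"))  # 1
print(similarity("jonathan", "jonathon"))        # 0.875
```

Metrics like Levenshtein distance relax the equal-length requirement by also allowing insertions and deletions, which is why they are often preferred for free-form name fields.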

However, the availability of so many different solutions presents an even greater challenge: how do you pick the right one for your specific needs? After all, each company is unique, and applying a generic string metric may not yield the desired results. In practice, duplicates fall through the cracks until someone finds the cause of the flaw and brings it to the right person’s attention. A new duplicate detection rule is then added to address that specific flaw, and the process repeats whenever new duplicates are uncovered. This takes up a lot of time and resources and is expensive to sustain in the long term. It is virtually impossible to create a rule to catch every possible type of fuzzy duplicate, so your only alternative is to accept a certain level of duplication within your data.

Using Machine Learning for Duplicate Detection 

We mentioned that there is a wide range of possible approaches to handling duplicates, including the creation of matching rules. A more recent approach is to apply machine learning to automate and significantly accelerate the detection process. A self-learning, self-correcting algorithm can adjust its approach based on feedback from users and improve detection with little, if any, human intervention.

One approach for machine learning-based duplicate detection is to use chunking algorithms, which divide data into “chunks” and assign a unique hash identifier to each one. The hash is then used to compare new chunks with previously stored ones. If the system detects identical chunks in two or more records, it asks the human user to confirm that these records are indeed duplicates. Initially, the system will ask a user to specify how different scenarios should be handled. As time goes on and the user confirms or declines identified duplicates, the system learns from the patterns and begins to identify duplicates without human intervention. This process is called active learning: the system continuously asks you to label record pairs that it suspects are duplicates, learning more and more about your data set.
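The chunk-and-hash comparison step can be sketched as follows. The chunk size, normalization, and Jaccard-style overlap score are all assumptions chosen for illustration; a production system would tune each of these.

```python
import hashlib

def chunk_hashes(text, size=4):
    """Normalize a string, split it into fixed-size chunks,
    and hash each chunk so comparisons work on identifiers."""
    s = text.lower().replace(" ", "")
    return {hashlib.sha1(s[i:i + size].encode()).hexdigest()
            for i in range(0, len(s), size)}

def overlap(a, b):
    """Fraction of chunk hashes shared between two values (Jaccard index)."""
    ha, hb = chunk_hashes(a), chunk_hashes(b)
    return len(ha & hb) / max(len(ha | hb), 1)

# A high overlap flags the pair as a candidate for the user to
# confirm or reject -- the active-learning signal described above.
print(overlap("Acme Corporation", "ACME Corporation"))  # 1.0
```

Each confirmed or rejected pair becomes a labeled training example, which is what lets the model eventually decide such pairs on its own.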

Many third-party apps (and Salesforce itself) assign more weight to certain fields within a record. For example, email addresses tend to be unique and are therefore given more weight than the mailing address, which could be shared by multiple organizations and is thus rated lower. While apps usually allow you to adjust the weights inside the customer deduplication algorithm, the process is highly reliant on human decision-making. A machine learning system, by contrast, simply observes your actions: as you make more and more decisions, the system automatically determines and applies the appropriate weights for data fields without direct input from users.
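Field weighting reduces to a simple weighted sum. The field names and weight values below are hypothetical; in a rule-based app a human sets them, while a learning system infers them from the pairs you confirm or decline.

```python
# Hypothetical weights summing to 100 -- email dominates because it is
# usually unique, the mailing address counts least because it can be shared.
WEIGHTS = {"email": 60, "name": 30, "mailing_address": 10}

def weighted_score(rec_a, rec_b):
    """Sum the weights of fields that match exactly (score out of 100)."""
    return sum(w for field, w in WEIGHTS.items()
               if rec_a.get(field, "").lower() == rec_b.get(field, "").lower())

a = {"email": "j@x.com", "name": "Jo Day", "mailing_address": "1 Main St"}
b = {"email": "j@x.com", "name": "Jo Day", "mailing_address": "2 Oak Ave"}
print(weighted_score(a, b))  # 90
```

A pair scoring above some threshold (say, 80) would be flagged as a likely duplicate; a machine learning system effectively learns both the weights and the threshold from your decisions.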

Machine learning is scalable because it does not compare all the records at once. It uses blocking to compare only records that have something in common. For example, this could be the first three characters of a name field: a block of records that share those characters would be created and compared. In all likelihood, each block would be several orders of magnitude smaller than the entire data set, which makes the comparison far more manageable. Comparing every possible pair of records is difficult because the number of pairs grows with the square of the number of records: 1,000 records produce roughly 500,000 pairs, which could probably be handled, but 1,000,000 records produce roughly 500 billion pairs, which would be much more challenging and definitely very expensive.
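The blocking idea can be sketched in a few lines. The blocking key here (the first three characters of a name field) follows the example in the text; real systems often combine several keys so that typos in one field do not hide a duplicate.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, key=lambda r: r["name"][:3].lower()):
    """Group records by a blocking key, then pair records only within
    each block instead of across the whole data set."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    for block in blocks.values():
        yield from combinations(block, 2)

records = [{"name": "Smith, A"}, {"name": "Smith, B"},
           {"name": "Jones, C"}, {"name": "Smyth, D"}]
pairs = list(candidate_pairs(records))
# Only 1 candidate pair (the two "Smi..." records) instead of
# the 6 pairs a full all-against-all comparison would require.
print(len(pairs))  # 1
```

The expensive fuzzy comparison then runs only on these candidate pairs, which is what keeps the approach tractable at Salesforce-org scale.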

Start Leveraging Machine Learning to Dedupe Your Data

No deduplication tool is perfect, especially the first time you use it. The good news is that many deduplication products allow for a wide range of customization, especially with regard to the algorithms. This includes the criteria used to scan records for duplicates, the weight given to each field within a record, and many other aspects. When you are shopping around and comparing different products, be sure to find out how customizable the algorithms are for each product and how much time and effort that customization would involve.

One of the biggest reasons DataGroomr is so effective is that it uses a pre-trained machine learning algorithm to identify duplicates. There is no need to set up any complicated rules or matching criteria; you can just log in and start cleaning your duplicates. Over time, the algorithm will self-learn and customize itself for the specific needs of your organization without any human involvement.


Try DataGroomr for yourself today with our free 14-day trial. There is no customization or setup required so you will be able to get started right away. DataGroomr will help you clean up your data, make it easier for your sales team to work with the data in Salesforce, and reduce the workload of your Salesforce admins as they will no longer need to waste time creating rules to catch every possible fuzzy duplicate!


About Steven Pogrebivsky

Steve Pogrebivsky has founded multiple successful startups and is an expert in data and content management systems with over 25 years of experience. Previously, he co-founded and was the CEO of MetaVis Technologies, which built tools for Microsoft Office 365, Salesforce and other cloud-based information systems. MetaVis was acquired by Metalogix in 2015. Before MetaVis, Steve founded several other technology companies, including Stelex Corporation which provided compliance and technical solutions to FDA-regulated organizations.