
How Machine Learning Algorithms Get Duplicates in Salesforce

December 9, 2020 (updated December 28, 2020)

When we think of machine learning, we tend to think about robotic process automation, virtual assistants, and self-driving cars. While these common applications garner the headlines, machine learning can simplify many other activities. In our case, we use this technology to identify duplicates in your Salesforce environment. Just as with autonomous vehicles, the algorithms that power these products need to be trained to produce the desired outcome. In this article, we will discuss how machine learning algorithms are trained to dedupe not only Salesforce, but any unstructured data.

How Does Machine Learning Match Two Records? 

If we take a look at the two records shown below, it is pretty clear that these are duplicates: 

First Name | Last Name | Address
Michael | Bolton | 123 Lockwood Drive
Mike | bolton | 123 Lockwood Dr

However, a machine doesn’t have the experience or background to make the same determination. In fact, the task is much harder than it might seem. We might start by pointing out all of the similarities and, since there are so many of them, conclude that these are duplicates. While this may be a good first step, we would then need to stipulate exactly what we mean by the word “similar.” Is there a scale that runs from not similar at all to very similar? How would a machine go about identifying these similarities?

One of the ways researchers “teach” similarities to machines is through string metrics. A string metric takes two strings and returns a number that is low if the strings are similar and high if they are dissimilar. There are many string metrics out there, and one of the best known is the Hamming distance. This method counts the number of substitutions required to turn one string into another string of the same length. For example, if you consider the Last Name from the example above, the Hamming distance is 1, since you need to change only one letter to convert “Bolton” to “bolton.” Because the Hamming distance is defined only for strings of equal length, pairs such as “Michael” and “Mike” call for a more general edit distance that also allows insertions and deletions.
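As a minimal sketch of the Hamming distance described above, the function below counts the positions at which two equal-length strings differ. The function name is our own choice for illustration, and the comparison is deliberately case-sensitive so that it reproduces the “Bolton” versus “bolton” example.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # The Hamming distance is not defined for strings of different lengths.
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


# Case-sensitive comparison: only the first letter differs.
print(hamming_distance("Bolton", "bolton"))  # 1
```

In practice, lowercasing both values before comparing would reduce this particular distance to 0, which is one reason normalization usually happens before any metric is applied.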

A variation on this is the learnable distance metric, which takes into account that different edit operations have varying significance in different domains. For example, substituting a digit makes a huge difference in a street address, since it effectively changes the entire address, but a single letter substitution may not be that significant because it is more likely to be caused by a typo or an abbreviation. Therefore, adapting string edit distance to a particular domain requires assigning different weights to different edit operations. We will drill down into these concepts at a later point in this article. For now, let’s take a look at how all of these metrics are used to dedupe Salesforce.
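To make the idea of domain-specific edit costs concrete, here is a rough sketch of an edit distance in which any operation that touches a digit costs more than one that touches a letter. The cost values (5.0 and 1.0) and the function names are assumptions chosen for illustration; a truly learnable metric would fit such costs from labeled examples rather than hard-coding them.

```python
def char_cost(c: str) -> float:
    """Hypothetical cost of editing one character: digits are expensive."""
    return 5.0 if c.isdigit() else 1.0


def weighted_edit_distance(s: str, t: str) -> float:
    """Edit distance with character-dependent costs (dynamic programming)."""
    # prev[j] holds the cost of turning the processed prefix of s into t[:j].
    prev = [0.0]
    for ct in t:
        prev.append(prev[-1] + char_cost(ct))  # build t[:j] by insertions only
    for cs in s:
        curr = [prev[0] + char_cost(cs)]  # reach the empty string by deletions
        for j, ct in enumerate(t, start=1):
            substitute = 0.0 if cs == ct else max(char_cost(cs), char_cost(ct))
            curr.append(min(
                prev[j] + char_cost(cs),      # delete cs
                curr[j - 1] + char_cost(ct),  # insert ct
                prev[j - 1] + substitute,     # keep or substitute
            ))
        prev = curr
    return prev[-1]


# A changed house number is penalized more than a one-letter typo.
print(weighted_edit_distance("123 Lockwood Dr", "128 Lockwood Dr"))  # 5.0
print(weighted_edit_distance("123 Lockwood Dr", "123 Lockwold Dr"))  # 1.0
```

Both pairs differ by a single character, yet the address with the changed digit ends up five times farther away, which is the behavior the paragraph above describes.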

Deduping Salesforce With Machine Learning Algorithms 

There are a couple of ways we can look at a Salesforce record. Let’s start by assuming it is a single block of text (as shown below): 

Record 1 | Record 2
Michael Bolton 123 Lockwood Drive | Mike bolton 123 Lockwood Dr

Another option is to compare each field individually: 

Field | Record 1 | Record 2
First Name | Michael | Mike
Last Name | Bolton | bolton
Address | 123 Lockwood Drive | 123 Lockwood Dr

For the “single block” approach, every field is treated equally. This makes it less convenient if you want to place emphasis on a specific field, such as Last Name. The “field by field” approach allows you to do this by assigning a specific weight to each field, with the most important fields receiving the highest weights. Salesforce deduping tools that use this type of technology will allow you to set the weights for each field and then create a model so that the approach is codified and leveraged in any comparison.
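The “field by field” approach can be sketched as a weighted average of per-field similarity scores. This is an illustration under stated assumptions, not how any particular product works: the weights are hypothetical, the field names are our own, and Python’s standard `difflib.SequenceMatcher` stands in for whichever string metric a real tool would use.

```python
from difflib import SequenceMatcher

# Hypothetical weights: Last Name counts the most, First Name the least.
FIELD_WEIGHTS = {"first_name": 0.2, "last_name": 0.5, "address": 0.3}


def field_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict = FIELD_WEIGHTS) -> float:
    """Weighted average of the per-field similarities."""
    total_weight = sum(weights.values())
    weighted_sum = sum(
        weight * field_similarity(r1[field], r2[field])
        for field, weight in weights.items()
    )
    return weighted_sum / total_weight


record_1 = {"first_name": "Michael", "last_name": "Bolton",
            "address": "123 Lockwood Drive"}
record_2 = {"first_name": "Mike", "last_name": "bolton",
            "address": "123 Lockwood Dr"}

print(round(record_similarity(record_1, record_2), 2))
```

Note that this sketch returns a similarity (higher means more alike) rather than a distance. Either convention works as long as it is applied consistently across fields.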

What is the Advantage of Using Machine Learning to Dedupe Salesforce? 

Every company’s dataset is unique and has its own challenges when it comes to deduplication. Whenever a human determines that a set of records are duplicates (or not), the system will “learn” from these actions and tweak the algorithm with the goal of identifying future duplicates without human interaction. This process, known as “active learning,” will continue to modify the weights assigned to each field based on user interaction and consequently improve duplicate detection.
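One common way to decide which record pairs to show a human during active learning is uncertainty sampling: ask about the pairs the current model is least sure of. The sketch below assumes that strategy, which the article does not specify, and uses hypothetical pair identifiers and made-up probabilities purely for illustration.

```python
def pick_pairs_to_review(scored_pairs, k=2):
    """Return the k pair IDs whose duplicate probability is closest to 0.5.

    scored_pairs: list of (pair_id, probability_of_duplicate) tuples.
    """
    by_uncertainty = sorted(scored_pairs, key=lambda item: abs(item[1] - 0.5))
    return [pair_id for pair_id, _ in by_uncertainty[:k]]


# Illustrative scores from a hypothetical model, not real results.
scored = [("A", 0.97), ("B", 0.52), ("C", 0.08), ("D", 0.41)]
print(pick_pairs_to_review(scored))  # ['B', 'D']
```

Pairs A and C are already confidently classified, so a human verdict on B and D teaches the model the most. Each answer becomes a new labeled example for the next round of training.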

It is important to point out that setting accurate weights for each field has its own challenges. For example, is the Last Name field twice as important as the First Name, or 1.5 times as important? It would be very difficult for any individual to make this type of determination, since we simply cannot process that much data by hand. Computers using machine learning, on the other hand, can crunch very large amounts of data quickly and efficiently, limited only by the available computing power. These algorithms can calculate accurate weights for each field in your dataset using a technique known as regularized logistic regression.
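The following is a minimal, dependency-free sketch of how regularized logistic regression could learn field weights from labeled pairs. Everything specific here is an assumption for illustration: the tiny training set is invented, each row holds hypothetical similarity scores for First Name, Last Name, and Address, and plain gradient descent with an L2 penalty stands in for whatever optimizer a production system would use.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Probability that a pair with field similarities x is a duplicate."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def train(X, y, l2=0.01, lr=0.5, epochs=2000):
    """Fit weights by gradient descent on log loss plus an L2 penalty."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(xi, w, b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        # The l2 * wj term shrinks the weights; the bias is not penalized.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Invented examples: [first_name_sim, last_name_sim, address_sim]
X = [
    [0.55, 1.00, 0.90], [1.00, 1.00, 1.00], [0.30, 0.95, 0.85],  # duplicates
    [1.00, 0.20, 0.10], [0.90, 0.10, 0.30], [0.20, 0.30, 0.15],  # distinct
]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train(X, y)
for name, weight in zip(["first_name", "last_name", "address"], weights):
    print(f"{name}: {weight:.2f}")
```

On this toy data the model should assign Last Name a larger weight than First Name, because a shared first name alone never indicates a duplicate in the examples. The regularization term keeps the weights from growing without bound on such a small, cleanly separated dataset.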

Added Value of Deduping With Machine Learning 

If we take a look at some of the popular deduping tools available on the AppExchange, we notice that they are all rule-based. This means that every time a new kind of duplicate is identified, your Salesforce admin needs to create an additional rule to prevent it from recurring. Not only does this take up a lot of time, but it is nearly impossible to account for every possible “fuzzy” duplicate. You can try to set the weighting for each field yourself or use other metrics to catch the duplicates, but in the end this is very time-consuming and still fails to catch every issue. Machine learning does all of this for you, saving you a lot of time and hassle.

There are many other advantages to using this type of artificial intelligence (AI). The algorithm is fully customizable and there is no need for a complicated set-up process. Remember, if you are using a tool that relies on complex rules, someone needs to set up the rules and then maintain them. A machine learning tool eliminates this effort completely, allowing you to simply download the product and start using it right away. 

Start Realizing the Benefits of Machine Learning with DataGroomr 

DataGroomr continues to innovate its application through Machine Learning. Recently, we added a Field Values Rule feature to the Supervisr module which enables users to manipulate which data will be preserved during merges. When a rule is set up, it can be designated as the default rule in your organization. Any rule that is set as the default will be applied automatically in the Trimmr review window for manual merges. For mass merges, users will be able to select any of the existing rules from a drop-down list. 

In addition to these features, there is no complicated setup process since our algorithms do all of the work for you. Machine learning acts as a vigilant monitoring system that learns as you use it. It enables every user not only to collect data but also to integrate the findings into immediate business decisions, such as which contacts on a list are qualified leads and which are redundant or irrelevant. In seconds, a powerful algorithm can assess a database, remove duplicates, and incorporate rules from previous searches. DataGroomr is also customizable, so you can use it to address any specific issues you are experiencing. Discover more about these features on the Salesforce AppExchange.

Try DataGroomr for yourself totally free for 14 days

Steven Pogrebivsky

Steve Pogrebivsky has founded multiple successful startups and is an expert in data and content management systems with over 25 years of experience. Previously, he co-founded and was the CEO of MetaVis Technologies, which built tools for Microsoft Office 365, Salesforce and other cloud-based information systems. MetaVis was acquired by Metalogix in 2015. Before MetaVis, Steve founded several other technology companies, including Stelex Corporation which provided compliance and technical solutions to FDA-regulated organizations.