When two Salesforce records are displayed side by side, a person can fairly easily spot similarities and determine whether the records are duplicates. However, when you extrapolate that over even a modest Salesforce environment with 50,000 records, it becomes hard to envision sifting through all the potential duplicates one by one. As with many other repetitive tasks, computer scientists took up the challenge of finding a technological solution to this problem. To do this task properly, machines need to identify not only the similarities between records but also their differences. In this article, we will take a deep dive into some of the approaches scientists use to train computers to accurately detect and label duplicates.
The questions that we will explore below include:
- How do computers determine if the values in two fields are similar?
- How do computers decide that certain fields should be given more weight than others when searching for duplicates?
- How do computers process large quantities of data to identify duplicates?
Identifying Similar Field Values within Salesforce Records
One of the methods used by researchers to solve the problem of duplicate records is string metrics. Under this approach, computers compare the sequences of characters in two values. If the sequences are similar, then the field values are likely to be similar. In our previous blog post on machine learning algorithms, we talked about one of the most famous string metrics, the Hamming distance, but there are many others as well. This method is ideal for a scenario where you compare names such as “stephen/steven/steve”, since these are all somewhat similar sequences of characters. However, it will run into issues when the combination is something like “steve/stove”, because even though the strings are similar, there is no semantic relationship between them.
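To make the idea of a string metric concrete, here is a minimal sketch. Hamming distance is defined only for strings of equal length, so this example swaps in Levenshtein edit distance, which also handles pairs such as “stephen/steven”. The function names and the normalization into a 0 to 1 score are assumptions made for illustration, not a description of any particular product.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def string_similarity(a: str, b: str) -> float:
    """Illustrative normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


print(levenshtein("stephen", "steven"))  # 2
print(levenshtein("steve", "stove"))     # 1
```

Note that “steve/stove” scores as closer than “stephen/steven”, which is exactly the weakness described above: character-level closeness says nothing about meaning.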
Another method researchers use is called set similarity. The idea is that when two names contain many of the same words, their meanings are likely to be similar. For example, the names in a pair such as “Joseph Robinette Biden/Joe Biden” are relatively far apart if you compare the sequences of characters. However, since they share the same last name, they may refer to the same person. A note of caution here: this method is only effective if you have large volumes of data. Given just the two names, there is insufficient information to definitively determine that “Joseph Robinette Biden” and “Joe Biden” are indeed the same person. The computer would need additional information, and it would then separate the words into parts, or tokens, to compare them effectively.
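As a rough illustration of set similarity, the sketch below splits each name into word tokens and computes the Jaccard index (shared tokens divided by all distinct tokens). The choice of Jaccard and the whitespace tokenizer are assumptions for this example; other set measures work along the same lines.

```python
def tokenize(name: str) -> set:
    """Split a name into lowercase word tokens."""
    return set(name.lower().split())


def jaccard(a: str, b: str) -> float:
    """Shared tokens divided by all distinct tokens across both names."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(jaccard("Joseph Robinette Biden", "Joe Biden"))  # 0.25
```

Only “biden” is shared out of four distinct tokens, so the score is low. This matches the caution above: with just two short names, the overlap alone is not enough to decide.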
The last method, which happens to be the most popular, is called semantic similarity. This approach examines the words surrounding the two names. If those surrounding words are similar, then it is safe to conclude that the names are likely to be similar. This method allows computers to recognize and match nicknames and actual names that have no apparent similarities at the individual character level. This is how “Bill/William” would be determined to be duplicates.
The next logical question is: if the semantic similarity method is the most popular, then is it also the best? The answer is…it depends. Actually, the most successful solutions incorporate all three approaches.
When trying to identify duplicate records in Salesforce, there is a lot more data to compare than just someone’s first and last name. These include data points such as street addresses, email addresses, phone numbers and others.
So, how does all this fit into a deduplication process? Let’s turn to machine learning for some answers.
The Beauty of Machine Learning for Identifying Duplicates
A separate string metric can be used for each of the fields mentioned above (or others). For example, when comparing street addresses, a learnable distance metric is appropriate, since it takes into consideration that edit operations have varying significance in different domains. There is also the Sørensen–Dice coefficient, which measures how similar two strings are in terms of the number of bigrams they have in common (a bigram is a pair of adjacent letters in the string). This can be useful when comparing company names or working with abbreviations. The application of string metrics for comparing each field is a discussion unto itself. Besides picking an appropriate method, a determination will need to be made regarding which fields will be given greater significance when comparing records.
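The Sørensen–Dice coefficient mentioned above can be sketched in a few lines. This version treats each string as a set of bigrams and computes twice the number of shared bigrams divided by the total number of bigrams in both strings. Lowercasing the input and ignoring repeated bigrams are simplifying assumptions.

```python
def bigrams(text: str) -> set:
    """All pairs of adjacent characters in the lowercased string."""
    cleaned = text.lower()
    return {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}


def dice_coefficient(a: str, b: str) -> float:
    """2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


# "night" -> ni, ig, gh, ht and "nacht" -> na, ac, ch, ht share only "ht".
print(dice_coefficient("night", "nacht"))  # 0.25
```

Because the score depends on shared fragments rather than position, an abbreviated company name still overlaps heavily with its full form.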
You may decide that a particular “Email” field is the most important, but exactly how much more weight (or importance) should it be given compared with the “Last Name” field or, for that matter, any other field? This is the beauty of machine learning. It can figure out exactly this ratio.
Initially, someone will need to determine and assign field weights. Then the system will search for duplicates based on these weights and present them to a user, who will decide whether they are indeed duplicates or not. Regardless of which decision the user makes, the system will learn from that information and adjust the weights for each field accordingly. A human user would never be able to accurately determine how much more or less important one field is compared to another, but with enough training, a machine learning system can do exactly that.
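The feedback loop described above can be sketched as follows. The article does not say which learning algorithm is used, so this example assumes a simple logistic-regression style update: each field contributes its similarity score multiplied by a weight, and every user label nudges the weights. The field names, starting weights, and learning rate are all hypothetical.

```python
import math


def duplicate_probability(weights: dict, similarities: dict) -> float:
    """Weighted sum of per-field similarities, squashed into the range 0..1."""
    z = weights["bias"] + sum(weights[f] * s for f, s in similarities.items())
    return 1.0 / (1.0 + math.exp(-z))


def learn_from_label(weights: dict, similarities: dict,
                     is_duplicate: bool, learning_rate: float = 0.5) -> dict:
    """One gradient step on a single pair that a user has labeled."""
    target = 1.0 if is_duplicate else 0.0
    error = target - duplicate_probability(weights, similarities)
    updated = dict(weights)
    updated["bias"] += learning_rate * error
    for field, sim in similarities.items():
        # Fields that agreed strongly on this pair receive the largest adjustment.
        updated[field] += learning_rate * error * sim
    return updated


# Hypothetical starting point: both fields weighted equally.
weights = {"bias": 0.0, "email": 1.0, "last_name": 1.0}

# A pair with identical emails but different last names, confirmed as a duplicate.
weights = learn_from_label(weights, {"email": 1.0, "last_name": 0.2}, is_duplicate=True)
print(weights["email"] > weights["last_name"])  # True
```

After a single confirmed duplicate, the email weight has already moved ahead of the last-name weight. Repeating this over many labeled pairs is what lets the ratio between fields emerge from the data rather than from a guess.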
Is This Scalable?
As we discussed earlier, machine learning can be extremely useful for detecting duplicate records, but is this approach scalable?
Consider that if you start with a modest number of Salesforce records, such as 50,000, adding another 5,000 will require 250,000,000 comparisons, because each new record must be checked against each existing one. To extrapolate further, if a computer manages to make 10,000 comparisons per second (which will require enormous computational power), it would still take almost seven hours to do the full comparison and identify duplicates.
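The arithmetic behind those figures is easy to check. The short calculation below assumes that every new record is compared with every existing record exactly once and that the rate of 10,000 comparisons per second holds constant.

```python
existing_records = 50_000
new_records = 5_000
comparisons_per_second = 10_000

# Each new record is compared against each existing record.
comparisons = existing_records * new_records
hours = comparisons / comparisons_per_second / 3600

print(f"{comparisons:,} comparisons")  # 250,000,000 comparisons
print(f"{hours:.2f} hours")            # 6.94 hours
```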
One of the ways machine learning solves this problem is by using a process called blocking. This is when the system only compares records that have a specific trait in common. For example, let’s take the three names listed below:
- Jay Leno
- Jayson Williams
- Jayson Werth
All of these individuals share the same first three letters in their first name, and this could be the trait that mandates a comparison. While such a trait may produce many blocks, each block consists of only a few records, which significantly reduces the number of comparisons that need to be made and makes the entire process scalable.
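Blocking on the first three letters of the name can be sketched like this. Records are grouped by a block key, and pairs are generated only inside each group. The three extra names in the sample list are made up for illustration, and real systems typically combine several blocking traits rather than relying on one.

```python
from collections import defaultdict
from itertools import combinations


def block_key(name: str) -> str:
    """The trait used for blocking: the first three letters, lowercased."""
    return name.strip().lower()[:3]


def candidate_pairs(names: list) -> list:
    """Only records that fall into the same block are paired for comparison."""
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


names = [
    "Jay Leno", "Jayson Williams", "Jayson Werth",  # from the example above
    "Maria Lopez", "Tom Baker", "Tomas Rivera",     # hypothetical extra records
]

blocked = candidate_pairs(names)
exhaustive = list(combinations(names, 2))
print(len(blocked), "pairs with blocking vs", len(exhaustive), "without")
# 4 pairs with blocking vs 15 without
```

The saving grows quickly with the size of the dataset, because the number of exhaustive pairs rises with the square of the record count while blocks stay small.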
The Advantage of the Machine Learning Approach for Detecting Duplicates in Salesforce
If we look at Salesforce’s built-in deduping features or some of the popular deduping tools on the AppExchange, they all rely on developing rules or filters. This requires administrators to create combinations of conditions that capture all possible variations of duplicates. To demonstrate how complex this is, simply consider that each duplicate rule must include a set of fields to compare along with how the comparison is executed (fuzzy, exact, and so on). It is also unrealistic to believe that existing rules would cover every possible duplicate scenario, so new ones must be created as the need arises. On top of that, rules must be maintained by someone. Considering all the responsibilities that an administrator already has with running a Salesforce instance, creating and maintaining good rules may not be a priority.
Machine learning has a big advantage in that it does not require rules or filters. There is no complex setup or maintenance process at all. Simply connect to Salesforce and start deduping right away. When you label records as either duplicates or not, the system will automatically learn and improve the algorithm so that future duplicate detection is more accurate.
Start Realizing the Benefits of Machine Learning with DataGroomr
DataGroomr is the only Salesforce app that allows you to realize the benefits of machine learning to dedupe your data. Recently, we have even introduced a Matching Rules feature, which allows you to create, train, and apply customized machine learning models designed specifically for your organization. You simply choose the fields that you deem important, and DataGroomr will convert these fields into a model. Then you train the model against your own data and assign it to a dataset. That’s basically the entire process. The next time your dataset is analyzed for duplicates, it will use the rule you created.
Try DataGroomr for yourself today with our free 14-day trial. There is no complex setup. DataGroomr will help you clean up your data, make it easier for your sales team to work with the data in Salesforce, and reduce the workload of your Salesforce admins. Additionally, you get the benefit of a data quality assessment as part of our free trial so you can see just how well the algorithm works for yourself.