If you invested in a deduplication tool, odds are you suspect your Salesforce data contains duplicate records and other hygiene issues. But even after you have installed the product and created some rules, there are always lingering doubts that all of your duplicates have been detected. When new duplicate combinations are identified in your records, new match rules need to be created or old ones adjusted. On top of everything, there is also the question of data normalization: matching rules depend on your data following a standard. The truth of the matter is that a rule-based approach will never perform comprehensive Salesforce deduplication. It’s simply impossible to create a rule that accounts for every possible scenario.
The machine learning approach used by DataGroomr is a much better alternative and it will offer you greater peace of mind that you are working with clean data. Let’s start by looking at exactly how machine learning goes about data cleansing and how this will increase your match confidence level.
The Blocking Technique
In a previous blog post, How Machine Learning Algorithms Get Duplicates in Salesforce, we talked about how machine learning uses “blocking,” the process of grouping similar-seeming records into blocks that a machine learning component then explores exhaustively. In many blocking approaches, records are grouped into blocks by shared properties that are indicators of duplication. This step alone significantly reduces the number of comparisons that need to be made. If the blocks are well defined, there is greater confidence that only likely duplicates are compared. You don’t have to worry about choosing the blocking properties, because the machine learning algorithms take care of this for you.
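To make the idea concrete, here is a minimal sketch of blocking on a shared property. The records, field names, and choice of “last name” as the blocking key are invented for illustration; they are not DataGroomr’s internals.

```python
from collections import defaultdict

# Hypothetical records; field names are illustrative only.
records = [
    {"id": 1, "first": "Bob", "last": "Jones", "street": "Northwest First Street"},
    {"id": 2, "first": "Robert", "last": "Jones", "street": "1st St. NW"},
    {"id": 3, "first": "Alice", "last": "Smith", "street": "5 Oak Ave"},
]

# Group records into blocks keyed on a shared property (here, last name).
blocks = defaultdict(list)
for rec in records:
    blocks[rec["last"].lower()].append(rec["id"])

# Only records inside the same block are ever compared to each other,
# which cuts the number of pairwise comparisons dramatically.
candidate_pairs = [
    (a, b)
    for ids in blocks.values()
    for i, a in enumerate(ids)
    for b in ids[i + 1:]
]
print(candidate_pairs)  # [(1, 2)]
```

With three records there are three possible pairs, but blocking leaves only one candidate pair to score, and the savings grow quadratically with record count.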
For example, let’s say you have the following two records:
| First Name | Last Name | Street |
| --- | --- | --- |
| Bob | Jones | Northwest First Street |
| Robert | Jones | 1st St. NW |
The above records have identical entries in the “Last Name” field, so the machine learning model would block these records together and assign scores to this and other pairs of records. Pairs scoring above a threshold are said to represent the same entity. Transitive closure is then performed on this same-entity relationship to identify the sets of duplicate records. Now that we know about the blocking method, let’s dive deeper and look at some of the various blocking techniques used by machine learning.
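The scoring-and-transitive-closure step just described can be sketched with a union-find structure. The pair-scoring function below is a toy stand-in for a trained model, and the threshold value is arbitrary; both are assumptions for illustration only.

```python
from difflib import SequenceMatcher

# Toy stand-in for a trained model's pair score (0.0 to 1.0).
def pair_score(a, b):
    return SequenceMatcher(None, a, b).ratio()

records = {1: "bob jones", 2: "robert jones", 3: "alice smith"}
THRESHOLD = 0.7  # arbitrary illustrative cutoff

# Union-find gives us the transitive closure of the "same entity" relation:
# if A matches B and B matches C, all three end up in one cluster.
parent = {rid: rid for rid in records}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

ids = list(records)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        if pair_score(records[a], records[b]) >= THRESHOLD:
            union(a, b)  # pair scored as "same entity": merge clusters

# Group records by cluster root to read off the duplicate sets.
clusters = {}
for rid in ids:
    clusters.setdefault(find(rid), []).append(rid)
print(list(clusters.values()))  # [[1, 2], [3]]
```

“bob jones” and “robert jones” score above the cutoff and collapse into one cluster, while “alice smith” remains on her own.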
Various Types of Blocking Techniques
In the example above, the records are blocked by a common attribute; this is called predicate blocking. Even though in our example the “Last Name” fields are identical, the blocking attributes may be based on common integers, token fields, n-grams, or virtually anything else. Another popular approach is Index Blocking, which involves creating a special data structure, such as an inverted index, that lets you quickly find records similar to a target record. If you are not familiar with inverted indexes, an inverted index is a data structure that maps words to the documents (or sets of documents) in which they appear. The index is then populated with all the unique values that appear in the field.
A real-life example of this is the index at the back of a book or a reverse phone lookup. What makes inverted indexes so useful is that each keyword appears only once in the index. An example of an inverted index for our two records would look something like this:

| Keyword | Documents |
| --- | --- |
| jones | Record 1, Record 2 |
| northwest | Record 1 |
| street | Record 1 |
| nw | Record 2 |
As we can see, the table above pairs each keyword with the documents that contain it. During the blocking process, the system searches the index for values similar to the record’s field. It then “blocks” together records that share at least one common search result. Since you need to build an index from all the unique values in the field, this method can be more time-consuming to set up than predicate blocking, but it is very useful. In fact, the best machine learning solutions combine several blocking methods to achieve results, not just the ones described in this article.
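As a rough illustration of index blocking, the sketch below builds an inverted index from record tokens and uses it to find candidate matches. The record contents are invented, and a production system would index normalized field values rather than raw strings.

```python
from collections import defaultdict

# Hypothetical records keyed by ID; values are illustrative only.
records = {
    1: "Bob Jones Northwest First Street",
    2: "Robert Jones 1st St. NW",
    3: "Alice Smith Oak Avenue",
}

# Build the inverted index: each unique token maps to the set of
# records it appears in (no duplicate keywords in the index).
index = defaultdict(set)
for rid, text in records.items():
    for token in text.lower().split():
        index[token].add(rid)

# Block together records that share at least one index entry.
def candidates(rid):
    hits = set()
    for token in records[rid].lower().split():
        hits |= index[token]
    hits.discard(rid)  # a record is not its own candidate
    return hits

print(candidates(1))  # {2} -- shares the token "jones" with record 1
```

Looking up record 1 returns record 2 as a candidate because both contain the token “jones”, while record 3 shares no tokens and is never compared.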
Well, now that the system has blocked together similar records, what happens next? Let’s take a look at the next step of the process.
Choosing a Good Threshold
Machine learning systems can predict the probability that a pair of records are duplicates, but how exactly does this work? To illustrate the answer, we need to know a little bit about Precision and Recall. Precision answers the question: what proportion of positive identifications was actually correct? Recall answers a different question: what proportion of actual positives was identified? Precision and Recall are in a constant tug of war (i.e., improving precision typically reduces recall, and vice versa).
Based on the precision vs. recall trade-off, we can calculate the F-score, which balances the concerns of precision and recall in a single number. This score can be used to choose a duplicate-decision threshold that is optimal for our priorities. The threshold is calculated by measuring precision and recall on data whose true labels we know (i.e., we know which records are actually duplicates and have all duplicate instances labeled). This threshold is our confidence level that records are in fact duplicates.
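A minimal sketch of threshold selection, assuming a small set of labeled pairs: sweep candidate thresholds and keep the one that maximizes the F-score. The scores and labels below are made up for illustration, not real model output.

```python
# Labeled pairs as (model_score, is_actually_duplicate) -- made-up data.
labeled = [(0.95, True), (0.90, True), (0.80, False), (0.70, True),
           (0.40, False), (0.30, False), (0.20, True), (0.10, False)]

def f1_at(threshold):
    # Count true positives, false positives, and false negatives
    # when every pair scoring >= threshold is called a duplicate.
    tp = sum(1 for s, dup in labeled if s >= threshold and dup)
    fp = sum(1 for s, dup in labeled if s >= threshold and not dup)
    fn = sum(1 for s, dup in labeled if s < threshold and dup)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Sweep every observed score as a candidate threshold; keep the best.
best = max((s for s, _ in labeled), key=f1_at)
print(best, round(f1_at(best), 3))  # 0.7 0.75
```

On this toy data, 0.7 wins: a higher cutoff misses true duplicates (hurting recall), while a lower one admits false matches (hurting precision).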
All of this probably sounds very complicated and confusing, so why is the machine learning approach better than the traditional rule-based one? We explore this in the next section.
Why Is Machine Learning-Based Deduplication Superior to Rule Creation?
When we were describing all the processes involved in calculating the confidence level, you may have noticed a common theme … machine learning is doing all the work. This becomes clear in a real-life scenario when we compare the amount of work that needs to be done by your Salesforce admins with rule-based tools. Imagine that one of your sales professionals spots a duplicate record. They will report this issue to the Salesforce admin who will then proceed to create a rule to prevent duplicates from recurring. This process would have to be repeated over and over again every time a duplicate is spotted that does not conform to the existing rules.
With a machine learning approach, the system learns to identify duplicates based on user actions. Consequently, Salesforce administrators’ responsibilities will be significantly reduced or even eliminated over time. As opposed to the rule-based approach, machine learning uses a process called “Active Learning” to track unlabeled records and then adjust the weights of the records’ fields to correspond to the importance of each field in identifying duplicates. For example, if you tell the system that the “Last Name” field is more important than the “First Name” field, could you specify exactly how much more important it is? Is it 2 times more, or 2.5? The system can calculate this and apply the same logic to subsequent records.
Machine Learning Gives You Peace of Mind
All the machine learning features mentioned above speed up deduplication, increase accuracy, and reduce the amount of work required on your part. The computational power and methodology of machine learning algorithms result in greater match confidence and give you peace of mind that a comprehensive deduplication is being performed. The deduplication algorithms used by DataGroomr are fully customizable to fit your individual needs, and, overall, machine learning offers you greater value for your investment than traditional data cleansing approaches.
Try DataGroomr for yourself today with our free 14-day trial.