
Applying Algorithms for Data Cleansing in Salesforce

By Il'ya Dudkin | September 14, 2021 | September 21, 2021

Duplicate data causes slowdowns in workflow, errors in databases, and loss in revenue. According to CIO.com, 77 percent of companies believe they lose revenue because of inaccurate and incomplete contact data. Duplicates are more insidious than most people realize, and anyone who manages a Salesforce database knows the problems that arise. But identifying duplicates is more complicated than one person alone can manage.

When a human sales professional compares two Salesforce records, such as the ones shown below, they can determine pretty quickly that these are duplicates:

First Name    Last Name    Address
Michael       Bolton       123 Lockwood Drive
Mike          bolton       123 Lockwood Dr

If we were to ask them to explain exactly why they feel these records are dupes, they would undoubtedly point to all of the similarities between them. However, if we want to train a machine learning system to look at these records just like a human would, we need to define exactly what we mean by “similar.” How similar are they? Is there a way to quantify these similarities?

Recently, I wrote about this topic for Salesforce Ben in “How to Find Duplicates in Salesforce by Using Machine Learning.” There, I detail some of the ways data scientists go about teaching machines to identify duplicates just like a human. A lot of this is done through string metrics, which let the system detect the same similarities a human would see and quantify how much each one matters in deciding whether two records are duplicates or unique.
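To make "quantifying similarity" concrete, here is a minimal sketch that scores the two records above field by field. It uses the ratio from Python's standard `difflib.SequenceMatcher` as a stand-in string metric. The `field_similarity` helper and the dictionary field names are assumptions made for illustration, not the metrics DataGroomr or the Salesforce Ben article actually use.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two field values."""
    # Normalize case and surrounding whitespace so "bolton" equals "Bolton".
    a_norm, b_norm = a.strip().lower(), b.strip().lower()
    return SequenceMatcher(None, a_norm, b_norm).ratio()


# The two example records from the table above (field names are illustrative).
record_a = {"first_name": "Michael", "last_name": "Bolton", "address": "123 Lockwood Drive"}
record_b = {"first_name": "Mike", "last_name": "bolton", "address": "123 Lockwood Dr"}

# One score per field: 1.0 means identical after normalization.
scores = {field: field_similarity(record_a[field], record_b[field]) for field in record_a}

for field, score in scores.items():
    print(f"{field}: {score:.2f}")
```

The last names score a perfect 1.0 once case is ignored, the addresses score high, and the first names land somewhere in between. Those per-field numbers are the kind of input a learning system can weigh.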

Machine learning-based deduplication offers significant advantages over the standard rule-based approach. Most notably, machine learning does the heavy lifting for you and keeps refining its model as you label pairs of records as duplicates or unique. The old way of doing things requires you to continuously adjust the rules to account for every possible fuzzy duplicate, which is ultimately a futile approach.
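As a rough illustration of learning from labels, the sketch below fits a tiny logistic regression to a handful of labeled record pairs, where each pair is described by three per-field similarity scores. Everything here is hypothetical: the training values, the `train` and `duplicate_probability` helpers, and the choice of logistic regression are assumptions for the example, not a description of DataGroomr's actual model.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def duplicate_probability(features, weights, bias):
    """Estimated probability that a pair of records is a duplicate."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(pairs, labels, epochs=1000, learning_rate=0.5):
    """Fit per-field weights with plain stochastic gradient descent."""
    weights = [0.0] * len(pairs[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(pairs, labels):
            error = duplicate_probability(features, weights, bias) - label
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical labeled pairs: [first_name_sim, last_name_sim, address_sim].
pairs = [
    [0.55, 1.00, 0.91],  # "Michael Bolton" vs "Mike bolton"
    [1.00, 1.00, 0.95],
    [0.40, 1.00, 0.90],
    [1.00, 0.20, 0.10],  # same first name, different person
    [0.30, 0.30, 0.20],
    [0.90, 0.10, 0.30],
    [0.20, 1.00, 0.15],  # same last name, different address
]
labels = [1, 1, 1, 0, 0, 0, 0]  # 1 = duplicate, 0 = unique

weights, bias = train(pairs, labels)
print("learned weights:", [round(w, 2) for w in weights])
print("P(duplicate):", round(duplicate_probability([0.55, 1.00, 0.91], weights, bias), 3))
```

Each new label nudges the weights, so the system works out for itself how much a matching address should count relative to a matching first name, instead of someone hand-tuning a rule for it.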

When we take a step back and look at the big picture, we see that machine learning is a smarter way to dedupe your Salesforce data. This is something we looked at in greater detail in a previous blog post, “Machine Learning vs. Automation: What’s the Difference?” In that article, we talked about how rule creation simply automates the mundane process of comparing every field, which still runs into the issue of fuzzy duplicates mentioned earlier. AI and machine learning, on the other hand, teach the system to think like a human and mimic human functions such as learning and problem-solving.
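The limitation of rules is easy to see in a small sketch. The two functions below are hypothetical matching rules written for this example (they are not Salesforce's built-in duplicate rules): one compares every field exactly, the other ignores case and whitespace. Both miss the Michael/Mike pair from earlier.

```python
FIELDS = ("first_name", "last_name", "address")


def exact_match_rule(a: dict, b: dict) -> bool:
    """Rule 1: every field must be identical."""
    return all(a[f] == b[f] for f in FIELDS)


def normalized_match_rule(a: dict, b: dict) -> bool:
    """Rule 2: every field must match after ignoring case and whitespace."""
    return all(a[f].strip().lower() == b[f].strip().lower() for f in FIELDS)


record_a = {"first_name": "Michael", "last_name": "Bolton", "address": "123 Lockwood Drive"}
record_b = {"first_name": "Mike", "last_name": "bolton", "address": "123 Lockwood Dr"}

# Both rules report "not a duplicate": "Mike" is not "Michael", "Dr" is not "Drive".
print(exact_match_rule(record_a, record_b))
print(normalized_match_rule(record_a, record_b))
```

Catching this pair with rules alone would take a nickname table, an address abbreviation table, and so on, one new rule for each kind of variation.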

But see for yourself how easy machine learning can be. We offer a free 14-day trial of DataGroomr, and please feel free to forward this article to your Trailblazer colleagues. They can start the free trial by logging in with their Salesforce credentials. There is no setup required, so you can get a handle on duplicate management in your data right away!

Happy DataGrooming, Trailblazers! 

Il'ya Dudkin

About Il'ya Dudkin

Il'ya Dudkin is a Business Development Manager at Softwarium. He is a frequent contributor to popular Salesforce outlets such as SalesforceBen, Force Talks, SFDC Panther, and many others.