Precision, recall and F1 are terms that you may have come across while reading about classification models in machine learning. While all three are specific ways of measuring the accuracy of a model, the definitions and explanations you would find in scientific literature tend to be complex and aimed at data science researchers. Since the terms are relevant to this article (and to how DataGroomr identifies duplicates), we will try to explain them in layman's terms and provide some examples that illustrate their use. Let's start with a brief overview of classification models.
What are Classification Models?
In machine learning, there are many different classification models, all producing different kinds of outcomes. A classification model is a model that predicts a class-type outcome. In terms of data cleansing in Salesforce, these classes would be duplicate or unique records, but in other use cases they can represent essentially anything you want to identify. The model relies on various attributes to determine the class. For example, when comparing apples to oranges, these attributes can be shape, color or size. The attributes for duplicate records can be the first and last name, email address or any other set of metadata. The model takes all these attributes and classifies each record as a duplicate or a unique record (or any other class of your choosing). It then calculates a confidence score that reflects how sure it is of the choice it has made.
In the examples we described above, we had a binary class label (i.e., apples or oranges, duplicate or unique records). However, the class label can include multiple outcomes. For example, we can expand our fruit classes to include lemons, bananas, pears, etc. In the context of data quality, the goal would be to identify valid or invalid data. For any model like this, we need some way of measuring how well it performs. This is where precision, recall and F1 come into play. Let's look at precision and recall first and defer the F1 score to a later section.
What are Precision and Recall?
Precision and recall (as well as the F1 score) are all used to measure how well a model performs. Every prediction a model makes, whether correct or incorrect, falls into one of four buckets (see the short code sketch after this list):
- True positives – an outcome where the model correctly predicts the positive class
- True negatives – an outcome where the model correctly predicts the negative class
- False positives – an outcome where the model predicts the positive class but the actual class is negative
- False negatives – an outcome where the model predicts the negative class but the actual class is positive
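To make these buckets concrete, here is a minimal Python sketch (our own illustration, not DataGroomr's actual code) that counts each bucket for a handful of made-up records, treating "duplicate" as the positive class:

```python
# Toy example: count the four outcome buckets for a binary classifier.
# "duplicate" is the positive class; the data below is made up for illustration.
actual    = ["duplicate", "unique", "duplicate", "unique", "duplicate", "unique"]
predicted = ["duplicate", "unique", "unique",    "unique", "duplicate", "duplicate"]

tp = sum(1 for a, p in zip(actual, predicted) if a == "duplicate" and p == "duplicate")
tn = sum(1 for a, p in zip(actual, predicted) if a == "unique" and p == "unique")
fp = sum(1 for a, p in zip(actual, predicted) if a == "unique" and p == "duplicate")
fn = sum(1 for a, p in zip(actual, predicted) if a == "duplicate" and p == "unique")

print(tp, tn, fp, fn)  # 2 true positives, 2 true negatives, 1 false positive, 1 false negative
```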
It is important to understand that precision and recall measure two different things. Precision is the ratio of true positives to all predicted positives (true positives plus false positives), while recall is the ratio of true positives to all actual positives (true positives plus false negatives). In other words, precision measures how many of the items the model flagged as positive really are positive, while recall measures how many of the actual positives the model managed to find. The difference between precision and recall is illustrated below:
While this may sound complicated, it can be easily illustrated using an example. Let’s say we have two sets of classes represented by apples and oranges as shown below:
So, how does our model work? Everything on side 1 should be classified as an apple, while everything on side 2 should be classified as an orange. Based on the results above, we can determine the accuracy of the model by dividing the number of correct classifications by the total number of observations. In our case, the model got 8 apples and 7 oranges right, so 15 in total; dividing by the 20 observations gives an accuracy of 15/20 = 75%.
Even though accuracy is one of the most widely used metrics for evaluating classification models, it becomes ineffective for imbalanced classes. An imbalanced class means that you have far more of one type than the other – say, 990 apples and only 10 oranges. If a model simply classified everything as an apple, it would be 99% (990/1000) accurate, yet it would misclassify every single orange.
This is why we need other metrics, like precision and recall, to evaluate the performance of our model. Let's return to our example model and focus exclusively on side 1. We can calculate the precision by dividing the number of correctly classified apples by the total number of observations on the apple side: 8/10, which is 80% precision. We can then calculate the recall by dividing the number of apples the model correctly classified by the total number of apples in the data set: 8/11, which is roughly a 73% recall rate.
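Translating the apples-and-oranges example into code makes the difference between the three metrics easy to compare. The counts below come straight from the example; the rest is just an illustrative sketch:

```python
# Counts from the apples-and-oranges example above ("apple" is the positive class).
tp = 8   # apples correctly classified as apples (side 1)
fp = 2   # oranges incorrectly classified as apples (side 1)
tn = 7   # oranges correctly classified as oranges (side 2)
fn = 3   # apples incorrectly classified as oranges (side 2)

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 15 / 20 = 0.75
precision = tp / (tp + fp)                    # 8 / 10  = 0.80
recall    = tp / (tp + fn)                    # 8 / 11  ≈ 0.73

print(f"accuracy={accuracy:.0%}, precision={precision:.0%}, recall={recall:.0%}")
```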
Obviously, things get more complex as the sample size increases. Laying out a large data set the way we did above is impractical, which is why data scientists use a Confusion Matrix to organize the results.
What is a Confusion Matrix and How Do You Interpret It?
As the name might suggest, a Confusion Matrix can be very confusing, especially when seeing it for the first time. In fact, it is just a mechanism for laying out how many correct classifications were made and how many were not. A typical Confusion Matrix tracks how many times:
- Class A correctly predicted as Class A
- Class B correctly predicted as Class B
- Class A incorrectly predicted as Class B
- Class B incorrectly predicted as Class A
So, how do we organize all of this information in a way that shows the number of true positives and negatives alongside the false positives and negatives? This can be done by drawing a grid where the columns represent the predicted classes and the rows represent the actual class labels:
| N = 1,000 | Predicted Class A | Predicted Class B |
| --- | --- | --- |
| Actual Class A | 500 | 100 |
| Actual Class B | 100 | 300 |
From the table above we can see that 500 cases from Class A were correctly predicted as their true class and 300 cases from Class B were correctly predicted as Class B. In other words, the cells on the diagonal (500 and 300) represent the true positives and negatives, while the off-diagonal cells (100 and 100) represent the false positives and negatives.
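If you want to build such a matrix yourself, here is a minimal sketch using scikit-learn's confusion_matrix helper (one common tool, not necessarily what DataGroomr uses); the labels are recreated purely to match the table above:

```python
# Minimal sketch: rebuild the same 2x2 confusion matrix with scikit-learn.
from sklearn.metrics import confusion_matrix

# Recreate 1,000 labels matching the table: rows are actual, columns are predicted.
y_true = ["A"] * 600 + ["B"] * 400
y_pred = (["A"] * 500 + ["B"] * 100) + (["A"] * 100 + ["B"] * 300)

print(confusion_matrix(y_true, y_pred, labels=["A", "B"]))
# [[500 100]
#  [100 300]]
```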
Now, let’s imagine a situation where your model does a great job of predicting one class, but doesn’t do a very good job predicting the other one. Since it would be misleading to view precision and recall in isolation, we introduce a third metric, the F1 score.
What is the F1 Score?
The F1 score takes into account both precision and recall and is based on a balance of the two. So, for example, if your model does a good job of predicting both apples and oranges, then you will have a high F1 score. How is the F1 score calculated? The F1 score is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Expressed in terms of the raw counts, this works out to:

F1 = TP / (TP + ½ × (FP + FN))

where:
- TP = True Positives
- FP = False Positives
- FN = False Negatives
The highest possible F1 score is 1.0, which means you have perfect precision and recall, while the lowest F1 score is 0, which means that either precision or recall is zero. Now that we know all about precision, recall and the F1 score, we can look at some business applications and the role of these terms in machine learning as a whole.
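Putting the formula into code, here is a small sketch that computes the F1 score from the precision (80%) and recall (roughly 73%) of our apples-and-oranges example:

```python
# Minimal sketch: F1 as the harmonic mean of precision and recall.
def f1_score(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0  # lowest possible score when either metric is zero
    return 2 * precision * recall / (precision + recall)

# Precision and recall from the apples-and-oranges example above.
print(round(f1_score(0.80, 8 / 11), 2))  # ≈ 0.76
```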
What are the Practical Business Applications of Precision, Recall and F1?
The F1 score attempts to strike a balance between precision and recall and factors in false positives and negatives, which can have a big impact on your business. We can contrast this with accuracy, which focuses on true positives and negatives: accuracy is a useful metric when the class distribution is roughly balanced, while the F1 score is the better metric when the classes are imbalanced. Therefore, if a software vendor tells you that their product is 95% accurate, ask them what that really means. What is the false positive rate? What effect would that have on your data and business intelligence?
Now let us shift over to machine learning applications. Our goal with DataGroomr's models is both a high precision and a high recall score, since we want to correctly identify duplicates while reducing false positives. However, this may not be the right approach for other applications of machine learning. In the healthcare field, there are situations where you may want to favor recall over precision (or vice versa). For example, AI and machine learning technologies are being used in hospitals to help doctors detect cancers in medical images. If the system that does this were trained with Class A, which shows cancer, and Class B, which shows no cancer, the stakes of mistaking a cancerous image for a non-cancerous one (or even the reverse) can be very high.
In a case like this, you want to focus on recall, because a false negative (a missed cancer) is far more costly than a false positive. Essentially, you are asking, "Of all the patients who actually have cancer, how many did the model catch? Did it miss anything that could indicate a malignancy?" Common sense dictates that any borderline cases should also be flagged as potentially cancerous and referred to a human for review.
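One simple way to favor recall in practice is to lower the confidence threshold at which a case is flagged as positive. The sketch below uses made-up scores purely to illustrate the tradeoff; it is not how a real medical system would be built:

```python
# Simplified sketch: lowering the decision threshold trades precision for recall.
# Scores and labels are made up for illustration (1 = cancer, 0 = no cancer).
scores = [0.95, 0.85, 0.70, 0.55, 0.50, 0.45, 0.40, 0.10]  # model confidence per image
labels = [1,    1,    0,    1,    0,    0,    1,    0]

def precision_recall_at(threshold):
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return tp / (tp + fp), tp / (tp + fn)

for threshold in (0.60, 0.35):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.60: precision=0.67, recall=0.50 -> half of the real cancers are missed
# threshold=0.35: precision=0.57, recall=1.00 -> every cancer is flagged, at the cost of more false alarms
```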
DataGroomr Incorporates Precision, Recall and the F1 Score into Algorithm Training
Precision, recall and the F1 score all play an important role in duplicate detection, since the model needs to correctly identify duplicates while avoiding false positives. The algorithms used by DataGroomr have a high level of both precision and recall, making sure that duplicate records are correctly identified and are not mistaken for unique records. This also saves you the hassle of having to create complex rules, so you can start deduping faster.
Try DataGroomr for yourself with our free trial.