Handling Imbalanced Classification in Machine Learning

TLDR - Machine learning models are powerful tools for spotting risks like fraud, project issues, or asset failures, but understanding how well they work can be confusing, especially when the events of interest are rare. This guide explains why traditional metrics like accuracy can be misleading and introduces a better way to measure model performance: the Area Under the Precision-Recall Curve (AUC-PR). It also shows how to simplify model predictions into easy-to-understand risk categories that help drive smarter business decisions. Even small improvements in model performance can lead to big wins… if you know how to spot them.

Machine Learning (ML) models have become essential tools that help companies make informed decisions and gain a competitive advantage. Approximately 75% of these models involve binary outcomes—situations with two clear results, such as identifying transactions as either fraudulent or legitimate (fraud/no-fraud). Typically, the outcome of interest (e.g., fraud) occurs infrequently. These models are known as “class-imbalanced binary classification models.”

Common examples include:

  • Detecting fraudulent transactions (fraudulent vs. normal)
  • Filtering spam emails (spam vs. normal)
  • Predicting machinery failure (failure vs. no failure)
  • Forecasting maintenance needs (break vs. no-break)

Businesses typically have teams composed of technical specialists and business stakeholders collaborating to develop these models. Yet, confusion often arises when discussing how to evaluate model performance.

Challenges Faced by Business Stakeholders

  • Understanding Model Performance: The best way to evaluate these types of models is by using a metric called the Area Under the Precision-Recall Curve (also known as AUC-PR). However, AUC-PR can be unintuitive, causing confusion among stakeholders:
    • Small improvements in AUC-PR may appear insignificant but often indicate substantial practical benefits.
    • Stakeholders may undervalue a model because AUC-PR rarely achieves a perfect score (1.0), although lower scores can still indicate strong predictive performance.
  • Interpreting Model Predictions: Stakeholders often mistake prediction scores for exact probabilities, leading to misunderstandings about risks. These scores indicate relative risk levels rather than precise probabilities. We will explore this in more detail later in the article.

Key Questions to Ask

Stakeholders aiming for clarity might ask:

  • What exactly is AUC-PR, and why is it the best metric for these situations?
  • Why can small changes in AUC-PR reflect major performance improvements?
  • Why doesn’t the AUC-PR usually reach its maximum (1.0), and why isn’t that problematic?
  • How does AUC-PR compare to familiar metrics like accuracy?
  • How can predictions be simplified to help stakeholders interpret risks?

Understanding AUC-PR: A Fraud Detection Analogy

Imagine you oversee fraud detection at a bank. Two critical concepts help you measure model effectiveness:

  • Precision: The percentage of transactions flagged as fraud that are genuinely fraudulent. In other words, if the model identifies 100 transactions as fraud, how many of those are actually fraud?
  • Recall: The percentage of actual fraudulent transactions correctly identified by your system. If there are 100 fraudulent transactions in total, how many of those were flagged as fraud by the model? (A short sketch of both calculations follows this list.)
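
To make these definitions concrete, here is a minimal sketch using scikit-learn with made-up labels and flags (the numbers are purely illustrative and are not taken from the example that follows):

```python
from sklearn.metrics import precision_score, recall_score

# Toy data (hypothetical): 1 = fraud, 0 = legitimate.
actual  = [1, 0, 1, 1, 0, 0, 0, 1]   # ground truth labels
flagged = [1, 0, 1, 0, 1, 0, 1, 1]   # what the model flagged as fraud

# Precision: of everything flagged as fraud, how much was really fraud?
print("Precision:", precision_score(actual, flagged))  # 3 / 5 = 0.60
# Recall: of all the real fraud, how much did the model catch?
print("Recall:   ", recall_score(actual, flagged))     # 3 / 4 = 0.75
```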

Now, let’s visualise precision and recall. Imagine your team develops a model to detect fraudulent transactions using 50,000 historical transactions, where 1% are labelled as fraud. The model is trained on 40,000 samples and tested on the remaining 10,000, which we’ll refer to as the test dataset. This test dataset consists of 9,900 non-fraudulent and 100 fraudulent transactions.

Let’s plot what the density of the model’s predictions over the test dataset might look like. If you’re wondering what density is, think of it as a histogram where the count in each bin is divided by the total number of cases—10,000 in this example. We will first plot the overall density of all predictions (Fig. 1):

Fig. 1: Density of model predictions for the whole test dataset. Each bar in the chart shows how many predictions fall within a certain range. Taller bars mean more predictions in that range. Keep in mind that predictions on the right side of the chart (like Prediction B) are considered higher risk, while those on the left (like Prediction A) are seen as lower risk by the model. As you can see, the bars on the left are generally taller, which makes sense because most transactions are not fraudulent.
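
If you want to reproduce a plot like Fig. 1, a density-normalised histogram is enough. The snippet below is a rough sketch: the scores are simulated with beta distributions purely for illustration, since the article’s actual model outputs are not available.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Simulated prediction scores for a 10,000-row test set (illustrative only):
# 9,900 non-fraud cases skewed towards 0 and 100 fraud cases skewed towards 1.
non_fraud_scores = rng.beta(a=1.5, b=8.0, size=9_900)
fraud_scores = rng.beta(a=5.0, b=2.5, size=100)
all_scores = np.concatenate([non_fraud_scores, fraud_scores])

# density=True rescales the bin counts so the histogram integrates to 1
# (counts divided by the total number of cases and the bin width).
plt.hist(all_scores, bins=50, density=True, color="steelblue", edgecolor="white")
plt.xlabel("Model prediction score")
plt.ylabel("Density")
plt.title("Density of predictions over the whole test dataset")
plt.show()
```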

The first key point to understand is that most ML classification models output a value between 0 and 1. Although predictions with a high score are considered riskier than predictions close to 0, a prediction score of 0.6 does not necessarily mean that the probability of the event occurring is 60%. Also notice that, at this point, we have not defined what threshold will be used to decide that a prediction is risky enough to be flagged as fraud. You might then ask yourself: if all transactions with a score larger than 0.2 are flagged as fraud, would that be a good threshold?

Now, let’s exploit the fact that our test dataset has information about which transactions are actually fraud. In a dataset of 10,000 examples with a fraud rate of 1%, there are only 100 fraud cases. To analyse their distribution, we need to plot density curves for fraud and non-fraud cases separately and observe where fraud cases fall relative to non-fraud cases (see Fig. 2):

Fig. 2: This plot shows what the prediction distribution looks like when we separate fraud and non-fraud cases. Notice that the bars representing fraud are scaled differently — they are based on 100 fraud cases, while the non-fraud bars come from around 9,900 cases. This allows the fraud pattern to be visible, even though fraud cases are much fewer overall.
Fig. 3: Setting a threshold results in four possible classification outcomes: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN).
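
A per-class version of the same plot (Fig. 2, with the threshold line of Fig. 3) can be sketched in the same way; again the scores are simulated rather than taken from a real model, and the 0.2 threshold is simply the example value used in the text.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
non_fraud_scores = rng.beta(a=1.5, b=8.0, size=9_900)  # simulated non-fraud scores
fraud_scores = rng.beta(a=5.0, b=2.5, size=100)        # simulated fraud scores

# Each class gets its own density scale, so the 100 fraud cases remain visible
# next to the 9,900 non-fraud cases, as explained in the Fig. 2 caption.
plt.hist(non_fraud_scores, bins=50, density=True, alpha=0.6, label="non-fraud")
plt.hist(fraud_scores, bins=50, density=True, alpha=0.6, label="fraud")
plt.axvline(0.2, color="black", linestyle="--", label="threshold = 0.2")
plt.xlabel("Model prediction score")
plt.ylabel("Density (per class)")
plt.legend()
plt.show()
```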

First, notice that setting a classification threshold creates the following four possible outcomes (a short sketch of how to compute these counts follows the list):

  • Correct Classifications:

    • True Positives (TP): transactions correctly identified as fraud. All fraud transactions with a score greater than or equal to the threshold (0.2 in this case)

    • True Negatives (TN): transactions correctly identified as non-fraud. All non-fraud transactions with a score smaller than the threshold (0.2 in this case)

  • Misclassifications:

    • False Positives (FP): transactions incorrectly flagged as fraud, sometimes called false alarms. Non-fraud transactions with a score greater than or equal to the threshold (0.2 in this case)

    • False Negatives (FN): transactions that are actually fraud but were missed, sometimes called missed detections. Fraud transactions with a score smaller than the threshold (0.2 in this case)
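
As a rough sketch, these four counts can be computed directly from the true labels and the prediction scores. The arrays below are tiny made-up examples, not the 10,000-row test set from the text.

```python
import numpy as np

def confusion_counts(y_true, y_score, threshold=0.2):
    """Count TP, FP, FN, TN for a given decision threshold."""
    y_true = np.asarray(y_true)
    flagged = np.asarray(y_score) >= threshold   # flagged as fraud

    tp = int(np.sum(flagged & (y_true == 1)))    # fraud correctly flagged
    fp = int(np.sum(flagged & (y_true == 0)))    # false alarms
    fn = int(np.sum(~flagged & (y_true == 1)))   # missed fraud
    tn = int(np.sum(~flagged & (y_true == 0)))   # non-fraud correctly ignored
    return tp, fp, fn, tn

# Toy usage with made-up labels (1 = fraud) and scores:
y_true = np.array([1, 0, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.1, 0.3, 0.05, 0.02, 0.6, 0.4, 0.15])
print(confusion_counts(y_true, y_score, threshold=0.2))  # (2, 2, 1, 3)
```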

Back to our definitions of precision and recall, let’s see how they are calculated from True Positives, False Positives, and False Negatives:

  • Precision = (TP) / (TP + FP)
    Precision calculates the proportion of fraud among all cases to the right of the threshold line.

  • Recall = (TP) / (TP + FN)
    Recall calculates the proportion of cases identified as fraud from all transactions that we know were actually fraud.

Observe how fraudulent transactions correctly identified by the model (True Positives) play a central role in both precision and recall. Precision and recall measure the proportion of correctly identified fraud cases in relation to the two possible types of errors: falsely flagging non-fraudulent transactions as fraud (False Positives) and failing to detect actual fraud cases (False Negatives). Also notice that True Negatives play no role in these calculations.
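
Given any TP, FP, and FN counts (for example, the ones from the sketch above), the two formulas translate directly into code; note that TN never appears.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion counts; TN is not needed."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of flagged cases that are real fraud
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of real fraud that was flagged
    return precision, recall

# Using the toy counts from the previous sketch (TP=2, FP=2, FN=1):
print(precision_recall(tp=2, fp=2, fn=1))  # (0.5, 0.666...)
```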

Now, our choice to set the threshold at 0.2 was arbitrary. However, it’s clear that adjusting this threshold—say from 0.2 to 0.5—directly impacts precision and recall. For example, at a threshold of 0.5, there would be no False Positives (FP), resulting in 100% precision. However, recall would decrease because fewer fraud cases would be correctly identified (True Positives) as the threshold increases. The key thing to remember is that precision and recall are defined for one specific threshold out of all possible thresholds, and adjusting how strictly your system flags transactions shifts the balance between the two. Plotting these trade-offs creates the Precision-Recall Curve, as can be seen in the next image (Fig. 4):

Fig. 4: The Precision-Recall Curve (aka PR-Curve) is built by moving the threshold from 0.0 to 1.0 and plotting precision against recall. Precision is on the y-axis and recall on the x-axis.
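
In practice the curve is rarely built by hand; scikit-learn’s precision_recall_curve sweeps every threshold for you. The sketch below uses simulated scores rather than the article’s actual model outputs.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(42)

# Simulated test set: 9,900 non-fraud (label 0) and 100 fraud (label 1) cases.
y_true = np.array([0] * 9_900 + [1] * 100)
y_score = np.concatenate([rng.beta(1.5, 8.0, 9_900), rng.beta(5.0, 2.5, 100)])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()
```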

The Precision-Recall Curve always passes through two defined points:

  • Highest Threshold: When the threshold is set to 1.0, all predictions are classified as no-fraud. This yields a recall of 0%, because no fraud transaction is identified, but a precision of 100%, because there are no False Positives.

  • Lowest Threshold: When the threshold is set to 0.0, all predictions are classified as fraud. This yields a recall of 100%, because all fraud cases are recovered, but a precision of 1%, which is equal to the fraud rate.

In the Precision-Recall (PR) Curve of Fig. 4, we can observe multiple points where precision is perfect but recall is low—these correspond to high threshold values. As recall improves, precision decreases until recall reaches 100%, but at the cost of lower precision—this happens at lower threshold values. Another way to interpret the PR curve is that it considers all possible thresholds that can be applied to the model’s outputs. The area under this curve (AUC-PR) provides a summary of the model’s overall effectiveness, which is especially important when fraud (the event of interest) is rare. The following image (Fig. 5) shows the area under the PR-Curve highlighted:

Fig. 5: The filled region under the Precision-Recall Curve is what we call the Area Under the PR-Curve, or AUC-PR. In this case, the value is 0.308. The larger the value, the better, with a maximum of 1.0.
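
The area under that curve can then be summarised in one number. The sketch below, again on simulated scores, shows two common options: integrating the curve with sklearn.metrics.auc, or using average_precision_score, a closely related step-wise estimate.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

rng = np.random.default_rng(42)
y_true = np.array([0] * 9_900 + [1] * 100)   # 9,900 non-fraud, 100 fraud labels
y_score = np.concatenate([rng.beta(1.5, 8.0, 9_900), rng.beta(5.0, 2.5, 100)])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# The two summaries differ slightly in how the area is interpolated,
# but both approach 1.0 as the model gets better at separating the classes.
print("AUC-PR (trapezoidal):  ", auc(recall, precision))
print("Average precision (AP):", average_precision_score(y_true, y_score))
```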

Why AUC-PR is Superior to Accuracy

Unlike metrics like accuracy—which can be inflated by correctly identifying numerous legitimate transactions—AUC-PR specifically focuses on accurately detecting rare but significant fraud. Let’s look closer at how accuracy is calculated:

  • Accuracy = (Correct Classifications) / (Total cases) = (TP + TN) / (TP + TN + FP + FN)

For example, a classifier that always predicts no-fraud in our bank transactions example will have an accuracy of (0 + 9.9k) / (10k) = 0.99, or 99%. As you can see, the model achieves very high accuracy despite failing to detect a single fraudulent transaction. This is largely because of the role the large number of True Negatives (TN) plays in the accuracy calculation.
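
The pitfall is easy to reproduce: a “model” that always predicts no-fraud reaches 99% accuracy on this class balance, while its AUC-PR stays at the 1% baseline. A minimal sketch:

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# Test set with a 1% fraud rate: 9,900 non-fraud (0) and 100 fraud (1) cases.
y_true = np.array([0] * 9_900 + [1] * 100)

# A useless "model" that always predicts no-fraud and gives every
# transaction the same zero risk score.
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true))

print("Accuracy:", accuracy_score(y_true, y_pred))            # 0.99
print("AUC-PR:  ", average_precision_score(y_true, y_score))  # 0.01, the fraud rate
```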

Why Low AUC-PR Scores Can Still Be Valuable

A perfect AUC-PR (1.0) is extremely rare, typically due to data limitations rather than model quality. For example, if fraud occurs in just 1% of transactions, even exceptional models may struggle to identify every fraudulent case perfectly.

Predictions from a model with perfect AUC-PR might look like the following image (Fig. 6):

Fig. 6: Predictions from a model that achieves perfect AUC-PR. There is no overlap between the fraud and no-fraud classes.

For perfect separability, the model’s predictors must contain enough information to consistently distinguish between fraud and non-fraud cases. However, this is rarely achievable in practice. Consider this example: A bank customer never makes transactions after 11 PM. But tonight, his son is sick, and he needs to buy medicine in the middle of the night. The model doesn’t know about the emergency, so it might wrongly flag the transaction as fraud. This challenge is even greater because there are so many legitimate transactions happening every day. With such a large volume, it’s very likely that some genuine transactions will look suspicious — and some fraudulent ones will look completely normal.

In real-world scenarios, most problems are not perfectly separable because we lack complete observability of all relevant factors. This limitation constrains the maximum achievable AUC-PR for any given model and use case. Even so, modest numerical improvements can signal substantial progress. Consider that randomly guessing fraudulent transactions might yield an AUC-PR of 0.01. Improving this score to 0.3 represents a substantial, 30-fold increase, providing tangible benefits to the business.

Simplifying Predictions for Stakeholders

Predictions can be grouped into easy-to-understand risk categories (Critical, High, Low, Minimal). This helps stakeholders quickly understand the urgency and take appropriate action.

For instance:

  • A “Critical” risk prediction might indicate a 60%-90% actual likelihood of fraud, significantly higher than a typical fraud rate of about 1%.

  • A “Minimal” risk prediction may reflect only a 0.2% chance, clearly indicating very low concern.

Going back to the original example, categorising model predictions into risk buckets requires finding appropriate thresholds according to the business requirements (see Fig. 7).

This approach directly links predictions to clear, actionable business decisions.

Fig. 7: Predictions can be categorised into Critical, High, Low and Minimal risk buckets as part of the Business Logic.
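
One simple way to implement this kind of business logic is to bucket the scores with pandas. The threshold values below are placeholders chosen for illustration; in practice they would be agreed with the business, as described above.

```python
import pandas as pd

# Hypothetical bucket boundaries on the model score (illustrative only).
bins = [0.0, 0.2, 0.4, 0.7, 1.0]
labels = ["Minimal", "Low", "High", "Critical"]

scores = pd.Series([0.05, 0.18, 0.35, 0.62, 0.91])  # example prediction scores
risk_bucket = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"score": scores, "risk": risk_bucket}))
```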

How to Know if You Are Creating Value – A Real Case Example

Insightfactory.ai worked with a large construction client facing a significant challenge: predicting which of their projects were likely to experience a drop in margin within the next three months. During our data discovery phase, we found that the client used an internally developed custom score to flag risky projects.

When we built our initial ML model, we achieved an AUC-PR of 0.25, whereas their existing approach had an AUC-PR below 0.2. We initially thought this was a clear success, but when we presented this improvement, our client wasn’t impressed. They perceived going from 0.2 to 0.25 as minimal progress.

Recognising this communication gap, we decided to integrate business logic with our predictions. When we presented results in terms of actionable risk categories, the client quickly saw the real-world value. Their internal scoring method flagged about 15% of projects as critical, with only about a 30% chance these projects would indeed experience margin drops. In contrast, our model flagged less than 1% of projects but achieved an 80% likelihood of an actual margin drop.

Most compellingly, our model successfully identified critical projects that ultimately suffered large margin drops within 90 days, projects that their internal score had mistakenly categorised as low risk.

This exercise highlighted a crucial insight: creating value isn’t about achieving a perfect performance metric; it’s about outperforming the existing strategy in practical, measurable ways. Assessing a model’s true value requires proper operational evaluation, a topic we’ll explore further in future discussions.

Final Thoughts

Evaluating ML models doesn’t have to be complicated. By understanding concepts like AUC-PR through clear, relatable analogies such as fraud detection, stakeholders can confidently leverage machine learning to achieve significant business value.

I am glad you have reached this point of the article. Let’s recap the questions we set out to answer:

  • What exactly is AUC-PR, and why is it the best metric for these situations?

  • Why can small changes in AUC-PR reflect major performance improvements?

  • Why doesn’t the AUC-PR usually reach its maximum (1.0), and why isn’t that problematic?

  • How does AUC-PR compare to familiar metrics like accuracy?

  • How can predictions be simplified to help stakeholders interpret risks?
