Imbalanced Dataset In Machine Learning
An imbalanced dataset is a dataset in which the classes are not represented equally. For example, in a binary classification problem, if one class has far more samples than the other, the dataset is considered imbalanced. This can cause problems when training machine learning models: with too few examples of the minority class, the model may fail to learn its patterns and end up biased toward the majority class.
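A quick way to check whether a dataset is imbalanced is to count the samples per class. A minimal sketch, using a made-up target vector for illustration:
from collections import Counter
# Hypothetical target vector: 95 samples of class 0, 5 samples of class 1
y = [0] * 95 + [1] * 5
print(Counter(y))  # Counter({0: 95, 1: 5}) -- a 95/5 imbalance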
Ways to handle Imbalanced Datasets
- Over-sampling the minority class
- Under-sampling the majority class
- Combining over-sampling and under-sampling
- Changing the algorithm
- Using different metrics
- Anomaly detection
Over-sampling the minority class
Over-sampling the minority class: This is a technique that involves adding samples to the minority class, either by duplicating existing samples or by generating synthetic ones, to balance the class distribution. Common over-sampling techniques include Random Over-sampling, SMOTE (Synthetic Minority Over-sampling Technique), and ADASYN (Adaptive Synthetic Sampling).
- Random Over-sampling
- SMOTE (Synthetic Minority Over-sampling Technique)
- ADASYN (Adaptive Synthetic Sampling)
Random Over-sampling
Random Over-sampling is a technique used to handle imbalanced datasets by randomly duplicating minority class samples (sampling with replacement) until the class distribution is balanced. Here's an example of how Random Over-sampling can be used to handle an imbalanced binary classification problem in Python:
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Input data and labels
X = ... # feature matrix
y = ... # target vector
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the RandomOverSampler object
ros = RandomOverSampler(random_state=42)
# Apply RandomOverSampler to the training data
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)
# Train the model on the resampled data
model = ... # your model here
model.fit(X_train_resampled, y_train_resampled)
# Evaluate the model on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
In this example, we first split the data into training and testing sets using the train_test_split function from the sklearn.model_selection module. Then, we initialize the RandomOverSampler object with a random state of 42 and apply it to the training data using the fit_resample method, which returns the resampled feature matrix and target vector. Note that resampling is applied only to the training set: the test set keeps the original class distribution, so the evaluation stays realistic and no duplicated samples leak into it. Finally, we train the model on the resampled data and evaluate it on the test set using the accuracy_score function from the sklearn.metrics module.
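To verify what fit_resample did, you can compare the class counts before and after resampling. This assumes the variables from the example above; the counts shown are illustrative:
from collections import Counter
print("Before:", Counter(y_train))            # e.g. Counter({0: 760, 1: 40})
print("After: ", Counter(y_train_resampled))  # e.g. Counter({0: 760, 1: 760})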
SMOTE (Synthetic Minority Over-sampling Technique)
SMOTE (Synthetic Minority Over-sampling Technique) is a technique used to handle imbalanced datasets by creating synthetic samples of the minority class. This is done to balance the class distribution and to ensure that the model is not biased towards the majority class. Here's an example of how SMOTE can be used to handle an imbalanced binary classification problem in Python:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Input data and labels
X = ... # feature matrix
y = ... # target vector
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the SMOTE object
smote = SMOTE(random_state=42)
# Apply SMOTE to the training data
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
# Train the model on the resampled data
model = ... # your model here
model.fit(X_train_resampled, y_train_resampled)
# Evaluate the model on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
The workflow is the same as in the Random Over-sampling example: split the data, resample only the training set with fit_resample, then train and evaluate. The difference is in how the new samples are created: instead of duplicating existing samples, SMOTE generates each synthetic sample by interpolating between a minority class sample and one of its k nearest minority class neighbors, producing new points along the line segments connecting them.
ADASYN (Adaptive Synthetic Sampling)
ADASYN (Adaptive Synthetic Sampling) is a technique used to handle imbalanced datasets by creating synthetic samples of the minority class. It is an extension of the SMOTE algorithm that adapts the number of synthetic samples to the local difficulty: minority samples surrounded by many majority class neighbors, i.e. the harder-to-learn ones, receive more synthetic samples. Here's an example of how ADASYN can be used to handle an imbalanced binary classification problem in Python:
from imblearn.over_sampling import ADASYN
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Input data and labels
X = ... # feature matrix
y = ... # target vector
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the ADASYN object
adasyn = ADASYN(random_state=42)
# Apply ADASYN to the training data
X_train_resampled, y_train_resampled = adasyn.fit_resample(X_train, y_train)
# Train the model on the resampled data
model = ... # your model here
model.fit(X_train_resampled, y_train_resampled)
# Evaluate the model on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
The process is similar to the previous examples of Random Over-sampling and SMOTE: the data is split into training and testing sets using an 80/20 ratio, the minority class in the training set is over-sampled with ADASYN, and the model is then fit on the resampled training set and evaluated on the untouched test set. The main difference is that we use the ADASYN class instead of RandomOverSampler or SMOTE. ADASYN also takes a number of nearest neighbors (n_neighbors, 5 by default), which controls the neighborhood used both to decide how many synthetic samples each minority sample receives and to generate those samples; larger values spread the synthetic samples over a wider neighborhood.
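If you want to experiment with the neighborhood size, you can pass n_neighbors explicitly when creating the object (the value 10 below is just an illustration, not a recommendation):
# Use a larger neighborhood than the default of 5
adasyn = ADASYN(n_neighbors=10, random_state=42)
X_train_resampled, y_train_resampled = adasyn.fit_resample(X_train, y_train)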
Conclusion
Oversampling methods can lead to overfitting if not used carefully: duplicated or synthetic samples make the training set look easier than it really is. It's therefore important to evaluate the model on a separate validation set, or with cross-validation in which resampling happens inside each fold, and to pick the oversampling method that works best for your specific case, since each method has its own pros and cons.
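One way to get a leakage-free cross-validated estimate is to put the over-sampler inside an imblearn Pipeline, so that resampling is re-fit on the training folds only and the validation folds stay untouched. A minimal sketch; the LogisticRegression estimator and F1 scoring are illustrative choices, not requirements:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# The sampler runs only on the training part of each fold
pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring="f1")
print("Mean F1: {:.3f}".format(scores.mean()))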
Under-sampling the majority class
Under-sampling the majority class: This is a technique that involves removing samples from the majority class to balance the class distribution. There are several techniques to under-sample the majority class such as Tomek links, NearMiss, or Cluster Centroids.
- Tomek links
- NearMiss
- Cluster Centroids
Tomek links
Tomek links is a technique used to handle imbalanced datasets by removing samples from the majority class. A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors; the under-sampling strategy removes the majority class sample from each such pair. This cleans up the class boundary and helps reduce the risk of the model overfitting noisy borderline samples. Here's an example of how Tomek links can be used to handle an imbalanced binary classification problem in Python:
from imblearn.under_sampling import TomekLinks
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Input data and labels
X = ... # feature matrix
y = ... # target vector
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the TomekLinks object
tl = TomekLinks()
# Apply TomekLinks to the training data
X_train_resampled, y_train_resampled = tl.fit_resample(X_train, y_train)
# Train the model on the resampled data
model = ... # your model here
model.fit(X_train_resampled, y_train_resampled)
# Evaluate the model on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
The workflow mirrors the earlier examples: split the data, apply fit_resample to the training set only, then train and evaluate. Keep in mind that Tomek links only removes the majority class samples that actually form links, so it typically removes relatively few samples and does not fully balance the classes; it is often used as a cleaning step in combination with other methods.
NearMiss
NearMiss is a technique used to handle imbalanced datasets by removing samples from the majority class. The idea behind NearMiss is to keep only the majority class samples that are closest, under some distance criterion, to the minority class samples, and discard the rest. This balances the class distribution while retaining the majority samples near the decision boundary. Here's an example of how NearMiss can be used to handle an imbalanced binary classification problem in Python:
from imblearn.under_sampling import NearMiss
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Input data and labels
X = ... # feature matrix
y = ... # target vector
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the NearMiss object
nr = NearMiss()
# Apply NearMiss to the training data
X_train_resampled, y_train_resampled = nr.fit_resample(X_train, y_train)
# Train the model on the resampled data
model = ... # your model here
model.fit(X_train_resampled, y_train_resampled)
# Evaluate the model on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
Again the workflow is the same: split the data, resample the training set with fit_resample, then train and evaluate. By default, NearMiss keeps just enough majority samples to match the size of the minority class.
The NearMiss algorithm accepts a version parameter that controls which variant is used. It has 3 possible values (a usage sketch follows the list):
- version=1 : NearMiss-1 keeps the majority class samples whose average distance to their n_neighbors nearest minority class samples is smallest.
- version=2 : NearMiss-2 keeps the majority class samples whose average distance to their n_neighbors farthest minority class samples is smallest.
- version=3 : NearMiss-3 works in two steps: for each minority class sample, a short list of its nearest majority class neighbors is kept, and from that short list the majority samples with the largest average distance to their nearest minority class samples are selected.
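To select a specific variant, pass version when creating the object; for example (the parameter values are illustrative):
# NearMiss-2: keep the majority samples closest on average to the farthest minority samples
nr = NearMiss(version=2, n_neighbors=3)
X_train_resampled, y_train_resampled = nr.fit_resample(X_train, y_train)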
Cluster Centroids
Cluster Centroids is a technique used to handle imbalanced datasets by reducing the number of majority class samples. The idea behind Cluster Centroids is to run k-means clustering on the majority class and replace it with the resulting cluster centroids, where the number of clusters equals the desired number of majority samples. This balances the class distribution while preserving the overall structure of the majority class. Here's an example of how Cluster Centroids can be used to handle an imbalanced binary classification problem in Python:
from imblearn.under_sampling import ClusterCentroids
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Input data and labels
X = ... # feature matrix
y = ... # target vector
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the ClusterCentroids object
cc = ClusterCentroids()
# Apply ClusterCentroids to the training data
X_train_resampled, y_train_resampled = cc.fit_resample(X_train, y_train)
# Train the model on the resampled data
model = ... # your model here
model.fit(X_train_resampled, y_train_resampled)
# Evaluate the model on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
The workflow is the same as in the previous under-sampling examples: split, apply fit_resample to the training set, train, and evaluate. Note that the resampled majority samples returned by ClusterCentroids are synthetic points (the k-means centroids), not a subset of the original samples.
Note
It's important to note that removing samples can lead to loss of information, so it's important to evaluate the performance of the model on a separate validation set and use techniques like cross-validation to ensure that the model is generalizing well. Additionally, it's important to use the right technique for under-sampling, since different techniques have different characteristics, some may be more beneficial than others depending on the context and the data.
Combining over-sampling and under-sampling
This is a technique that involves both over-sampling the minority class and under-sampling the majority class to balance the class distribution. A common combination is SMOTE + Tomek links, which first generates synthetic minority samples with SMOTE and then cleans the result by removing Tomek links; another is SMOTE + Edited Nearest Neighbours (ENN).
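imbalanced-learn exposes this combination directly as the SMOTETomek class (and SMOTE + ENN as SMOTEENN) in the imblearn.combine module. A sketch following the same pattern as the earlier examples, assuming X_train and y_train are defined as before:
from imblearn.combine import SMOTETomek
# Over-sample the minority class with SMOTE, then remove Tomek links
smt = SMOTETomek(random_state=42)
X_train_resampled, y_train_resampled = smt.fit_resample(X_train, y_train)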
Changing the algorithm
Some machine learning algorithms are more robust to imbalanced datasets than others. For example, tree-based ensembles such as random forests tend to be less sensitive to class imbalance than a plain logistic regression, and many algorithms can compensate for imbalance directly through class weights. It is therefore important to choose an algorithm, and a configuration, suited to the dataset.
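For example, many scikit-learn estimators accept a class_weight parameter that compensates for imbalance without any resampling. A minimal sketch; the choice of RandomForestClassifier is illustrative:
from sklearn.ensemble import RandomForestClassifier
# 'balanced' weights classes inversely proportional to their frequencies
model = RandomForestClassifier(class_weight="balanced", random_state=42)
model.fit(X_train, y_train)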
Using different metrics
Instead of using accuracy as the evaluation metric, metrics such as precision, recall, F1-score, or AUC-ROC can be used to evaluate the model on imbalanced datasets. Accuracy is misleading here because a model that always predicts the majority class can still score highly; per-class metrics give a more complete picture of performance.
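For example, scikit-learn's classification_report prints per-class precision, recall, and F1, and roc_auc_score computes AUC-ROC from predicted probabilities (this assumes a fitted classifier that supports predict_proba):
from sklearn.metrics import classification_report, roc_auc_score
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
# AUC-ROC needs scores or probabilities rather than hard labels
y_proba = model.predict_proba(X_test)[:, 1]
print("AUC-ROC: {:.3f}".format(roc_auc_score(y_test, y_proba)))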
Anomaly detection
Anomaly detection is a technique that can be used when the minority class is rare enough to be treated as an anomaly: instead of training a standard classifier, a detector is trained to flag samples that deviate from the majority class. Anomaly detection techniques such as Isolation Forest, Local Outlier Factor (LOF), and One-class SVM can be used for severely imbalanced datasets.
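A minimal sketch with scikit-learn's IsolationForest, treating the minority class as the anomaly; the contamination value is an assumption you would tune to your data:
from sklearn.ensemble import IsolationForest
# contamination is the expected fraction of anomalies (minority samples)
iso = IsolationForest(contamination=0.05, random_state=42)
iso.fit(X_train)
# predict returns +1 for inliers (majority class) and -1 for anomalies (minority class)
anomaly_pred = iso.predict(X_test)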
Summary
It's important to note that there is no one-size-fits-all solution to handling imbalanced datasets, and the best approach will depend on the specific characteristics of the dataset and the requirements of the application. It's also important to keep in mind that oversampling methods can lead to overfitting if not used carefully.