Statistical Oversampling Methods

Description

A family of data augmentation techniques that generate synthetic minority-class examples through geometric interpolation in feature space, addressing class imbalance problems that degrade classifier performance on underrepresented groups. The foundational method, SMOTE (Synthetic Minority Over-sampling Technique), creates new instances by interpolating between existing minority samples and their k-nearest neighbours. Extensions include Borderline-SMOTE (focusing generation near decision boundaries), ADASYN (adaptively weighting harder-to-learn instances), and SVMSMOTE (using support vectors to guide generation). These methods operate directly on feature vectors without requiring neural network training, making them computationally lightweight and applicable to any tabular classification task.
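The core interpolation step can be illustrated with a short sketch. The `smote_sample` helper below is a hypothetical name for this toy implementation, not part of any library: for each synthetic point it picks a minority seed, picks one of that seed's k nearest minority neighbours, and returns a random point on the line segment between them.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=3, n_new=10, seed=0):
    """Toy SMOTE: interpolate between minority points and their neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point's nearest neighbour is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    seeds = rng.integers(0, len(X_min), size=n_new)
    partners = idx[seeds, rng.integers(0, k, size=n_new)]
    gaps = rng.random((n_new, 1))  # interpolation weights in [0, 1)
    return X_min[seeds] + gaps * (X_min[partners] - X_min[seeds])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote_sample(X_min)  # 10 synthetic points inside the unit square
```

Because each synthetic point lies on a segment between two existing minority points, all generated samples stay within the convex hull of the minority class; this is also why interpolation can land in implausible regions when that hull overlaps the majority class.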

Example Use Cases

Fairness

Applying SMOTE to oversample underrepresented ethnic groups in a loan approval training dataset, ensuring the classifier has sufficient examples from each demographic group to learn fair decision boundaries.

Oversampling underrepresented protected groups in a hiring dataset to enable equitable evaluation of candidates across gender and ethnicity, ensuring the model does not simply learn to predict the majority class.

Reliability

Using ADASYN to generate synthetic examples of rare manufacturing defect types, improving the reliability of quality control classifiers on edge cases that occur in fewer than 1% of production runs.

Applying Borderline-SMOTE to augment minority-class examples near the decision boundary in a medical diagnosis task, reducing false-negative rates for rare conditions without collecting additional patient data.

Limitations

  • Interpolation can place synthetic instances in regions of feature space that the true data distribution never occupies, introducing noise or unrealistic feature combinations; this is particularly problematic for high-dimensional data.
  • Does not address the root cause of class imbalance (e.g. data collection bias, structural underrepresentation) and may mask rather than resolve underlying fairness issues in the data pipeline.
  • Performance degrades on high-dimensional datasets where the nearest-neighbour assumption breaks down, and oversampling in such spaces can amplify noise dimensions without improving classification.
  • Assumes the minority class forms coherent clusters in feature space; when minority examples are scattered or overlap heavily with majority-class regions, interpolation produces ambiguous or mislabelled synthetic samples.

Resources

Research Papers

SMOTE: Synthetic Minority Over-sampling Technique
N. V. Chawla et al., Jun 1, 2002

An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.

ADASYN: Adaptive synthetic sampling approach for imbalanced learning
Haibo He et al., Jun 1, 2008

This paper presents a novel adaptive synthetic (ADASYN) sampling approach for learning from imbalanced data sets. The essential idea of ADASYN is to use a weighted distribution for different minority class examples according to their level of difficulty in learning, where more synthetic data is generated for minority class examples that are harder to learn compared to those minority examples that are easier to learn. As a result, the ADASYN approach improves learning with respect to the data distributions in two ways: (1) reducing the bias introduced by the class imbalance, and (2) adaptively shifting the classification decision boundary toward the difficult examples. Simulation analyses on several machine learning data sets show the effectiveness of this method across five evaluation metrics.

Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning
Guillaume Lemaître and Fernando Nogueira, Jan 1, 2017

Software Packages

scikit-learn-contrib/imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

Tutorials

SMOTE for Imbalanced Classification with Python
Jason Brownlee, Jan 16, 2020

Imbalanced classification involves developing predictive models on classification datasets that have a severe class imbalance. The challenge of working with imbalanced datasets is that most machine learning techniques will ignore, and in turn have poor performance on, the minority class, although typically it is performance on the minority class that is most important.
