Preferential Sampling
Description
A preprocessing fairness technique developed by Kamiran and Calders that addresses dataset imbalances by re-sampling training data with preference for underrepresented combinations of group and class label, in order to achieve discrimination-free classification. The method compares each combination's actual size with the size expected if the sensitive attribute and the class label were independent, then prioritises borderline objects (instances near the decision boundary, as scored by a ranker trained on the original data) for duplication in underrepresented combinations and for removal from overrepresented ones. Unlike relabelling approaches, preferential sampling maintains original class labels whilst creating a more balanced dataset that prevents models from learning biased patterns due to skewed group representation.
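A minimal sketch of the procedure is given below, assuming NumPy arrays, binary 0/1 class labels, a single binary sensitive attribute `s`, and scikit-learn's `LogisticRegression` as the ranker (the original work used a Naive Bayes ranker). The function name `preferential_sampling` and the use of distance from a predicted probability of 0.5 as the borderline score are illustrative choices rather than a reference implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def preferential_sampling(X, y, s, random_state=0):
    """Re-sample (X, y, s) so that each (group, class) partition reaches the
    size expected if the sensitive attribute s and the label y were independent.

    Borderline instances are duplicated in underrepresented partitions and
    removed from overrepresented ones; all original labels are kept.
    Assumes X, y and s are NumPy arrays and y contains binary 0/1 labels.
    """
    n = len(y)

    # Train a ranker on the original data; instances whose predicted
    # probability of the positive class is close to 0.5 are treated as borderline.
    ranker = LogisticRegression(max_iter=1000, random_state=random_state).fit(X, y)
    borderline_score = np.abs(ranker.predict_proba(X)[:, 1] - 0.5)

    keep = []
    for group in np.unique(s):
        for label in np.unique(y):
            part = np.where((s == group) & (y == label))[0]
            if len(part) == 0:
                continue
            # Expected partition size if s and y were independent.
            expected = int(round((s == group).sum() * (y == label).sum() / n))
            # Order partition members from most borderline to least.
            part = part[np.argsort(borderline_score[part])]
            if expected <= len(part):
                # Overrepresented: drop the most borderline instances.
                keep.extend(part[len(part) - expected:])
            else:
                # Underrepresented: keep everything, then duplicate the most
                # borderline instances (cycling through them) to make up the shortfall.
                keep.extend(part)
                keep.extend(np.resize(part, expected - len(part)))

    keep = np.asarray(keep)
    return X[keep], y[keep], s[keep]
```

Because duplication and removal target the instances the ranker is least certain about, the decision boundary of a model trained on the re-sampled data shifts with comparatively few modifications to the original dataset.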
Example Use Cases
Fairness
Preprocessing hiring datasets by preferentially sampling candidates from underrepresented gender and ethnic groups, particularly focusing on borderline cases near decision boundaries, to ensure fair representation whilst maintaining original qualifications and labels for unbiased recruitment model training.
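As a hedged illustration of this use case, the snippet below applies the `preferential_sampling` sketch from the Description to a small synthetic hiring dataset; the feature choices, group sizes, and hiring rates are invented purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic hiring data: two features (e.g. assessment score, years of experience),
# a binary "hired" label y, and a binary sensitive attribute s
# (0 = underrepresented group, 1 = majority group). All values are invented.
n_minority, n_majority = 200, 800
X = np.vstack([
    rng.normal(loc=[60, 4], scale=[10, 2], size=(n_minority, 2)),
    rng.normal(loc=[65, 5], scale=[10, 2], size=(n_majority, 2)),
])
s = np.concatenate([np.zeros(n_minority, dtype=int), np.ones(n_majority, dtype=int)])
y = np.concatenate([
    rng.binomial(1, 0.25, n_minority),   # skewed positive (hired) rate in the raw data
    rng.binomial(1, 0.55, n_majority),
])

# preferential_sampling is the function defined in the sketch under Description.
X_bal, y_bal, s_bal = preferential_sampling(X, y, s)

# The hired rate is now approximately equal across groups, while every retained
# or duplicated candidate keeps their original label.
for group in (0, 1):
    print(group, y_bal[s_bal == group].mean().round(3))
```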
Reliability
Balancing medical training datasets by oversampling patients from underrepresented demographic groups to ensure reliable diagnostic performance across all populations, preventing models from exhibiting reduced accuracy for minority patient groups due to insufficient training examples.
Transparency
Creating transparent credit scoring datasets by documenting and adjusting the sampling process to ensure equal representation across demographic groups, providing clear evidence to regulators that training data imbalances have been addressed without altering original creditworthiness labels.
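A short sketch of such documentation is shown below, assuming pandas is available and that `y`/`s` and `y_bal`/`s_bal` hold the labels and sensitive attribute before and after re-sampling; the function name `representation_report` is illustrative.

```python
import pandas as pd


def representation_report(y_before, s_before, y_after, s_after):
    """Tabulate group-by-label counts before and after preferential sampling,
    giving an auditable record that imbalances were addressed without
    altering any labels."""
    before = pd.crosstab(pd.Series(s_before, name="group"),
                         pd.Series(y_before, name="label"))
    after = pd.crosstab(pd.Series(s_after, name="group"),
                        pd.Series(y_after, name="label"))
    return pd.concat({"before": before, "after": after}, axis=1)


# Example (variable names are placeholders for your own arrays):
# print(representation_report(y, s, y_bal, s_bal))
```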
Limitations
- Oversampling minority groups can cause overfitting to duplicated examples, particularly when borderline instances are repeatedly sampled, potentially reducing model generalisation.
- Undersampling majority groups may remove important examples that contain valuable information, potentially degrading overall model performance.
- Does not address inherent algorithmic bias in the learning process itself, only correcting for representation imbalances in the training data.
- Selection of borderline objects depends on the ranker used to score instances near the decision boundary, so results may be sensitive to the choice of ranker and to how the borderline region is defined.
- May not address intersectional fairness issues when multiple protected attributes create complex group combinations that require nuanced sampling strategies.