Preferential Sampling

Description

A preprocessing fairness technique developed by Kamiran and Calders that addresses dataset imbalances by re-sampling the training data with preference for underrepresented groups, with the aim of discrimination-free classification. The method modifies the training distribution by duplicating borderline objects (instances close to the decision boundary) from underrepresented group-label combinations whilst removing borderline objects from overrepresented ones. Unlike relabelling approaches, preferential sampling keeps the original class labels intact whilst producing a more balanced dataset, preventing models from learning biased patterns caused by skewed group representation.
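
The sketch below illustrates one way this scheme could be implemented. It assumes a pandas DataFrame of numeric features with a binary sensitive attribute and a binary class label (column names are supplied by the caller), and uses a Gaussian naive Bayes model as one possible ranker for scoring how close each instance sits to the decision boundary; Kamiran and Calders describe the ranker generically, so any probabilistic classifier could stand in here.

import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB


def preferential_sample(df, sensitive_col, label_col, pos_label=1):
    """Re-sample df so each (group, label) cell reaches its expected size."""
    features = df.drop(columns=[sensitive_col, label_col])

    # Ranker: posterior probability of the positive class; instances whose
    # score lies close to the decision boundary are treated as borderline.
    ranker = GaussianNB().fit(features, df[label_col])
    pos_idx = list(ranker.classes_).index(pos_label)
    df = df.assign(_score=ranker.predict_proba(features)[:, pos_idx])

    n, parts = len(df), []
    for s in df[sensitive_col].unique():        # deprived / favoured group
        for c in df[label_col].unique():        # positive / negative label
            cell = df[(df[sensitive_col] == s) & (df[label_col] == c)]
            # Expected cell size if group membership and label were independent.
            expected = round((df[sensitive_col] == s).sum()
                             * (df[label_col] == c).sum() / n)
            # Order the cell so borderline objects come first: positives by
            # ascending score, negatives by descending score.
            cell = cell.sort_values("_score", ascending=(c == pos_label))
            need = expected - len(cell)
            if need >= 0:
                # Under-represented cell: duplicate borderline objects.
                duplicates = cell.iloc[np.arange(need) % len(cell)]
                parts.append(pd.concat([cell, duplicates]))
            else:
                # Over-represented cell: drop the most borderline objects.
                parts.append(cell.iloc[-need:])

    return pd.concat(parts).drop(columns="_score").reset_index(drop=True)

The resampled frame can then be used to train any downstream classifier; feature values and labels are never modified, only the multiplicity of instances changes.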

Example Use Cases

Fairness

Preprocessing hiring datasets by preferentially sampling candidates from underrepresented gender and ethnic groups, particularly focusing on borderline cases near decision boundaries, to ensure fair representation whilst maintaining original qualifications and labels for unbiased recruitment model training.

Reliability

Balancing medical training datasets by oversampling patients from underrepresented demographic groups to ensure reliable diagnostic performance across all populations, preventing models from exhibiting reduced accuracy for minority patient groups due to insufficient training examples.
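
A much simpler variant of the idea in this use case, plain group-wise random oversampling rather than the full borderline-aware procedure, can be sketched as follows; the demographic column name is a placeholder and labels are left untouched.

import pandas as pd


def oversample_groups(df, group_col, seed=0):
    """Randomly duplicate rows so every demographic group matches the largest one."""
    target = df[group_col].value_counts().max()
    resampled = []
    for _, group in df.groupby(group_col):
        shortfall = target - len(group)
        if shortfall > 0:
            group = pd.concat([group,
                               group.sample(shortfall, replace=True, random_state=seed)])
        resampled.append(group)
    return pd.concat(resampled).reset_index(drop=True)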

Transparency

Creating transparent credit scoring datasets by documenting and adjusting the sampling process to ensure equal representation across demographic groups, providing clear evidence to regulators that training data imbalances have been addressed without altering original creditworthiness labels.
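
One way to produce the evidence this use case describes is to tabulate group-by-label counts before and after resampling; the sketch below (column names hypothetical) builds such a comparison table with pandas.

import pandas as pd


def representation_report(before, after, group_col, label_col):
    """Cross-tabulate group vs. label counts before and after resampling."""
    return pd.concat(
        {"before": pd.crosstab(before[group_col], before[label_col]),
         "after": pd.crosstab(after[group_col], after[label_col])},
        axis=1,
    )

Because preferential sampling never alters labels, the two tables differ only in counts, which is precisely the property such documentation needs to evidence.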

Limitations

  • Oversampling minority groups can cause overfitting to duplicated examples, particularly when borderline instances are repeatedly sampled, potentially reducing model generalisation.
  • Undersampling majority groups may remove important examples that contain valuable information, potentially degrading overall model performance.
  • Does not address inherent algorithmic bias in the learning process itself, only correcting for representation imbalances in the training data.
  • Selection of borderline objects depends on the ranker or scoring function used to order instances by their closeness to the decision boundary, and the resulting sample can be sensitive to this choice.
  • May not address intersectional fairness issues when multiple protected attributes create complex group combinations that require nuanced sampling strategies.

Resources

  • Data preprocessing techniques for classification without discrimination. Research Paper, Faisal Kamiran and Toon Calders, 1 June 2012.
  • Classification with no discrimination by preferential sampling. Research Paper, Faisal Kamiran and Toon Calders, 27 May 2010.
  • A Survey on Bias and Fairness in Machine Learning. Documentation, Ninareh Mehrabi et al., 25 August 2019.

Tags

Applicable Models:
Assurance Goal Category:
Data Type:
Expertise Needed:
Fairness Approach:
Technique Type: