Preferential Sampling

Description

A preprocessing fairness technique developed by Kamiran and Calders that addresses dataset imbalances by re-sampling training data with preference for underrepresented groups, aiming for discrimination-free classification. The method modifies the training distribution by duplicating borderline objects (instances closest to the decision boundary) from underrepresented group–label combinations whilst removing borderline instances from overrepresented combinations. Unlike relabelling approaches, preferential sampling keeps the original class labels whilst producing a more balanced dataset, which prevents models from learning biased patterns from skewed group representation.
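
As a rough illustration of the procedure described above, the sketch below re-samples a pandas DataFrame so that each group–label partition reaches the size it would have if group membership and class label were independent, duplicating or removing borderline objects first. The Gaussian Naive Bayes ranker, the parameter names and the 0/1 label encoding are illustrative assumptions rather than part of the original method's specification, and the ordering of borderline objects is a simplification of Kamiran and Calders' procedure.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def preferential_sampling(df, feature_cols, group_col, label_col, positive=1):
    """Re-sample a pandas DataFrame towards discrimination-free group/label sizes."""
    df = df.reset_index(drop=True)
    X = df[feature_cols].to_numpy()
    y = df[label_col].to_numpy()
    g = df[group_col].to_numpy()
    n = len(df)

    # Rank every instance with a simple probabilistic learner; instances whose
    # scores sit closest to this learner's decision boundary are the
    # "borderline objects" that get duplicated or removed first.
    scores = GaussianNB().fit(X, y).predict_proba(X)[:, 1]

    keep = []
    for grp in np.unique(g):
        for cls in np.unique(y):
            part = np.flatnonzero((g == grp) & (y == cls))
            if len(part) == 0:
                continue
            # Expected partition size if group membership and label were independent.
            expected = int(round((g == grp).sum() * (y == cls).sum() / n))
            # Order the partition borderline-first: positive-class instances with
            # the lowest scores, negative-class instances with the highest.
            order = part[np.argsort(scores[part] if cls == positive else -scores[part])]
            if expected <= len(order):
                # Over-represented partition: drop the most borderline objects.
                keep.append(order[len(order) - expected:])
            else:
                # Under-represented partition: duplicate borderline objects
                # (cycling through the partition) until the expected size is met.
                extra = order[np.arange(expected - len(order)) % len(order)]
                keep.append(np.concatenate([order, extra]))

    # Original class labels are untouched; only the sampling frequencies change.
    return df.iloc[np.concatenate(keep)].reset_index(drop=True)
```

On a hiring dataset, for example, one might call preferential_sampling(df, ["age", "years_experience"], group_col="gender", label_col="hired") and then train any standard classifier on the returned frame; these column names are hypothetical and the features must be numeric for the Naive Bayes ranker used here.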

Example Use Cases

Fairness

Preprocessing hiring datasets by preferentially sampling candidates from underrepresented gender and ethnic groups, particularly focusing on borderline cases near decision boundaries, to ensure fair representation whilst maintaining original qualifications and labels for unbiased recruitment model training.

Reliability

Balancing medical training datasets by oversampling patients from underrepresented demographic groups to ensure reliable diagnostic performance across all populations, preventing models from exhibiting reduced accuracy for minority patient groups due to insufficient training examples.

Transparency

Creating transparent credit scoring datasets by documenting and adjusting the sampling process to ensure equal representation across demographic groups, providing clear evidence to regulators that training data imbalances have been addressed without altering original creditworthiness labels.

Limitations

  • Oversampling minority groups can cause overfitting to duplicated examples, particularly when borderline instances are repeatedly sampled, potentially reducing model generalisation.
  • Undersampling majority groups may remove important examples that contain valuable information, potentially degrading overall model performance.
  • Does not address inherent algorithmic bias in the learning process itself, only correcting for representation imbalances in the training data.
  • Selection of borderline objects requires careful threshold tuning and may be sensitive to the choice of distance metrics or similarity measures used.
  • May not address intersectional fairness issues when multiple protected attributes create complex group combinations that require nuanced sampling strategies.

Resources

Research Papers

Data preprocessing techniques for classification without discrimination
Faisal Kamiran and Toon Calders, Jun 1, 2012
Classification with no discrimination by preferential sampling
Faisal Kamiran and Toon Calders, May 27, 2010
A Survey on Bias and Fairness in Machine Learning
Ninareh Mehrabi et al., Aug 23, 2019

With the widespread use of AI systems and applications in our everyday lives, it is important to take fairness issues into consideration while designing and engineering these types of systems. Such systems can be used in many sensitive environments to make important and life-changing decisions; thus, it is crucial to ensure that the decisions do not reflect discriminatory behavior toward certain groups or populations. We have recently seen work in machine learning, natural language processing, and deep learning that addresses such challenges in different subdomains. With the commercialization of these systems, researchers are becoming aware of the biases that these applications can contain and have attempted to address them. In this survey we investigated different real-world applications that have shown biases in various ways, and we listed different sources of biases that can affect AI applications. We then created a taxonomy for fairness definitions that machine learning researchers have defined in order to avoid the existing bias in AI systems. In addition to that, we examined different domains and subdomains in AI showing what researchers have observed with regard to unfair outcomes in the state-of-the-art methods and how they have tried to address them. There are still many future directions and solutions that can be taken to mitigate the problem of bias in AI systems. We are hoping that this survey will motivate researchers to tackle these issues in the near future by observing existing work in their respective fields.
