Empirical Calibration
Description
Empirical calibration adjusts a model's predicted probabilities to match observed frequencies. For example, if events predicted with 80% confidence occur only 60% of the time, calibration corrects this overconfidence. Common techniques include Platt scaling and isotonic regression, which learn transformations that map the model's raw scores to well-calibrated probabilities, improving the reliability of confidence measures for downstream decisions.
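As a rough illustration of both techniques, the sketch below fits a base classifier, then learns a Platt-style (logistic) mapping and an isotonic mapping from its raw scores on a held-out calibration split. The dataset, model choice, and split sizes are placeholders for illustration, not part of the description above.

```python
# Minimal sketch (illustrative data and model): Platt scaling and isotonic
# regression learned on a held-out calibration split, using scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the base model on the training split only.
base = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
raw_scores = base.predict_proba(X_cal)[:, 1]

# Platt scaling: a logistic regression that maps raw scores to calibrated probabilities.
platt = LogisticRegression().fit(raw_scores.reshape(-1, 1), y_cal)
platt_probs = platt.predict_proba(raw_scores.reshape(-1, 1))[:, 1]

# Isotonic regression: a monotone, non-parametric mapping from scores to observed frequencies.
iso = IsotonicRegression(out_of_bounds="clip").fit(raw_scores, y_cal)
iso_probs = iso.predict(raw_scores)
```

At inference time, the learned mapping is applied to the base model's scores on new inputs; the calibration split is used only to fit that mapping.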
Example Use Cases
Reliability
Adjusting a credit default prediction model's probabilities to ensure that loan applicants with a predicted 30% default risk actually default 30% of the time, improving decision-making.
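One way to check this property (a sketch using simulated data as stand-ins for real default labels and model outputs) is a reliability curve that compares mean predicted risk to observed default frequency within probability bins:

```python
# Minimal sketch with simulated data: compare predicted default risk to the
# observed default rate per probability bin via a reliability curve.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=10_000)                      # stand-in predicted default risks
y_true = rng.binomial(1, np.clip(0.7 * y_prob, 0, 1))  # stand-in outcomes from an overconfident model

# Bin predictions and compare the mean predicted risk to the observed default frequency.
observed, predicted = calibration_curve(y_true, y_prob, n_bins=10)
for p, o in zip(predicted, observed):
    print(f"predicted ~{p:.0%} default risk -> observed {o:.0%} default rate")
```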
Transparency
Calibrating a medical diagnosis model's confidence scores so that stakeholders can meaningfully interpret probability outputs, enabling doctors to make informed decisions about treatment urgency based on reliable confidence estimates.
Fairness
Ensuring that a hiring algorithm's confidence scores are equally well-calibrated across different demographic groups, preventing systematically overconfident predictions for certain populations that could lead to biased decision-making.
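A sketch of such a group-wise check is below; the labels, scores, and group indicator are simulated placeholders, and the calibration measure used is a simple binned expected calibration error (ECE), one common choice rather than the only one.

```python
# Minimal sketch with simulated data: compare a simple expected calibration
# error (ECE) across demographic groups to spot group-specific overconfidence.
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-size-weighted average of |observed frequency - mean predicted probability|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (y_prob >= lo) & (y_prob < hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
    return ece

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=5_000)        # stand-in model confidence scores
group = rng.integers(0, 2, size=5_000)  # stand-in demographic group labels
# Simulate outcomes where the model is overconfident for group 1 only.
y_true = rng.binomial(1, np.where(group == 1, 0.7 * y_prob, y_prob))

for g in np.unique(group):
    mask = group == g
    print(f"group {g}: ECE = {expected_calibration_error(y_true[mask], y_prob[mask]):.3f}")
```

A large gap in per-group ECE would indicate that the calibration step should be fitted or evaluated separately for the affected group.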
Limitations
- Requires a separate held-out calibration dataset, which reduces the amount of data available for model training.
- Calibration performance can degrade over time if the underlying data distribution shifts, requiring periodic recalibration.
- May sacrifice some discriminative power in favour of calibration, potentially reducing the model's ability to distinguish between classes.
- Calibration methods assume that the calibration set is representative of future data, which may not hold in dynamic environments.