Illustrating the tradeoff between balance and calibration

by Nate TeBlunthuis

I wrote here about the bias encoded into the ORES models deployed on Wikipedia for helping editors to monitor changes to the encyclopedia. There I showed that the models were unfair to newcomers and anonymous editors using two different notions of fairness: balance and calibration. I brought up the fact that there is an inherent tradeoff between these two quantified notions of fairness such that in non-trivial situations it is impossible to satisfy them both. Here, I'm going to illustrate this point with a simple simulation and show how a straightforward approach to creating a balanced model from an imbalanced one results in a model which is not calibrated.
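For reference, here is a rough sketch of the two notions in the group-level form that the checks later in this post compute. The symbols are notation introduced here for illustration: $s$ is the model score, $Y$ indicates whether the edit is damaging, and $A$ is the editor's group (anon or non-anon).

```latex
\begin{align*}
\text{Calibration within groups:}\quad
  & \mathbb{E}[\, s \mid A = a \,] = \Pr(Y = 1 \mid A = a)
  \quad \text{for each group } a \\
\text{Balance:}\quad
  & \mathbb{E}[\, s \mid Y = y,\ A = \text{anon} \,]
    = \mathbb{E}[\, s \mid Y = y,\ A = \text{non-anon} \,]
  \quad \text{for } y \in \{0, 1\}
\end{align*}
```

When the score is a yes/no prediction rather than a probability, balance amounts to equal true positive rates and equal false positive rates across the two groups.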

# I'm going to use these R packages
library(ggplot2)
theme_set(theme_bw())
library(data.table)


Let's say that whether an edit is damaging is a stochastic function of two observable variables: whether the editor is anonymous and X, which stands for everything else we can observe and include in our model. We'll say the linear probit model with these two variables is the true model.

# generate a dataset according to the model
B_anon <- 2
B_X <- 1
n <- 4000
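# note: X has only n/2 draws, so data.table recycles it and both groups share the same X values
edits <- data.table(anon=c(rep(TRUE,n/2),rep(FALSE,n/2)), X = rnorm(n/2,0,1))
# pnorm(q,1,1) is the normal CDF with mean 1 and sd 1, i.e. a probit model with intercept -1
edits[,p_damaging := pnorm(B_anon*anon + B_X*X,1,1)]
# draw one Bernoulli outcome per edit
edits[,damaging := sapply(p_damaging, function(p) rbinom(1,1,p))]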
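Written out, the data-generating model in the code above is a probit model. Because `pnorm(q, 1, 1)` is the normal CDF with mean 1 and standard deviation 1, it equals the standard normal CDF $\Phi$ evaluated at $q - 1$, so the simulation has an implicit intercept of $-1$:

```latex
\Pr(\text{damaging} = 1 \mid \text{anon}, X)
  = \Phi\left( B_{\text{anon}} \cdot \text{anon} + B_X \cdot X - 1 \right),
\qquad B_{\text{anon}} = 2,\quad B_X = 1,\quad X \sim \mathcal{N}(0, 1)
```

The linear predictor is therefore $X + 1$ for anons and $X - 1$ for non-anons, which gives expected damaging rates of $\Phi(1/\sqrt{2}) \approx 0.76$ and $\Phi(-1/\sqrt{2}) \approx 0.24$ respectively.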


Next I'll fit a model to the generated data and generate model predictions.

glm_mod = glm(damaging ~ anon + X - 1, data = edits,family=binomial(link='probit'))
edits[,p.calibration := pnorm(predict(glm_mod,newdata=edits))]
edits[,calibration.pred:= p.calibration > 0.5]


The true model should be calibrated, but not balanced. Let's verify that is the case.

edits[, .(model=mean(p.calibration), true=mean(damaging)),by=c("anon")]

anon   model      true
TRUE   0.7609820  0.7610
FALSE  0.2364288  0.2365

So we see that it is calibrated, but is it balanced?

edits[,mean(p.calibration),by=c("damaging","anon")]

damaging  anon   V1
1         TRUE   0.8258881
0         TRUE   0.5543147
0         FALSE  0.1674396
1         FALSE  0.4591487

Not even close! The model is super unbalanced. Non-damaging anonymous edits have a higher average score (0.55) than damaging non-anonymous edits (0.46)!

Some people think that you can solve algorithmic bias problems by using feature engineering and ignoring protected classes. There are some merits to this approach, but it doesn't help solve the balance vs calibration tradeoff. To illustrate this point, let's fit another model that only uses X and ignores anons.

glm_mod2 = glm(damaging ~  X , data = edits,family=binomial(link='probit'))
edits[,p.try_balance := pnorm(predict(glm_mod2,newdata=edits))]
edits[,mean(p.try_balance),by=c("damaging","anon")]

damaging  anon   V1
1         TRUE   0.5573522
0         TRUE   0.3137468
0         FALSE  0.4388632
1         FALSE  0.6936935

The model is still imbalanced! But that did seem to make things a little bit better. Is the model still calibrated?

edits[, .(model=mean(p.try_balance), true=mean(damaging)),by=c("anon")]

anon   model      true
TRUE   0.4991305  0.7610
FALSE  0.4991305  0.2365

No, it's really not calibrated now! So ignoring anons makes a choice about the tradeoff between balance and calibration, but it does so in an arbitrary way that depends on myriad factors including the correlation between anonymous editing and X.

A better approach to creating a balanced model comes from Hardt et al. (2016). Since the point where the ROC curves for the two protected classes intersect corresponds to choices of thresholds with equal false positive and false negative rates across the groups, you can transform a good predictor into a worse predictor that is balanced by using different thresholds for different types of editors.

Plot the ROC curves.

roc_x <- 0:100/100
tpr_anon <- edits[anon==TRUE, sapply(roc_x, function(x) sum( (p.calibration > x) & (damaging==TRUE) )/sum(damaging==TRUE))]

fpr_anon <- edits[anon==TRUE, sapply(roc_x, function(x) sum((p.calibration > x) & (damaging==FALSE))/sum(damaging==FALSE))]

tpr_nonanon <- edits[anon==FALSE, sapply(roc_x, function(x) sum( (p.calibration > x) & damaging==TRUE)/sum(damaging==TRUE))]

fpr_nonanon <- edits[anon==FALSE, sapply(roc_x, function(x) sum((p.calibration > x) & damaging==FALSE)/sum(damaging==FALSE))]

roc <- data.table(x=roc_x,tpr_anon=tpr_anon,fpr_anon=fpr_anon,tpr_nonanon=tpr_nonanon, fpr_nonanon=fpr_nonanon)
ggplot(roc) + geom_line(aes(x=fpr_nonanon,y=tpr_nonanon,color="Non anon")) + geom_line(aes(x=fpr_anon,y=tpr_anon,color="Anon")) + ylab("True positive rate") + xlab("False positive rate")


So it looks like we can find balance when the FPR is around 0.2.

(t.nonanon <- roc_x[which.min(abs(fpr_nonanon - 0.21))])

0.28

(t.anon <- roc_x[which.min(abs(fpr_anon - 0.21))])

0.78
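The threshold lookup above can be wrapped in a small helper. The following is a minimal sketch, not part of the original analysis: `threshold_for_fpr` is a hypothetical name, and the toy scores are simulated only to make the block self-contained.

```r
# Hypothetical helper: pick the threshold on a grid whose false positive
# rate is closest to a target value.
threshold_for_fpr <- function(score, truth, target, grid = 0:100/100) {
  # FPR at each candidate threshold: share of non-damaging items scored above it
  fpr <- sapply(grid, function(t) mean(score[!truth] > t))
  grid[which.min(abs(fpr - target))]
}

# Toy data (an assumption for illustration): two groups with shifted scores
set.seed(1)
score_a <- pnorm(rnorm(2000) + 1)
score_b <- pnorm(rnorm(2000) - 1)
truth_a <- runif(2000) < score_a
truth_b <- runif(2000) < score_b

t_a <- threshold_for_fpr(score_a, truth_a, target = 0.2)
t_b <- threshold_for_fpr(score_b, truth_b, target = 0.2)

# Realized false positive rates at the chosen thresholds
fpr_a <- mean(score_a[!truth_a] > t_a)
fpr_b <- mean(score_b[!truth_b] > t_b)
```

With the objects from this post, the analogous calls would take `p.calibration` as the score and `damaging == TRUE` as the truth, once for each group.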

Let's make new predictions and check balance and calibration. Note that now our threshold for classifying an edit as damaging is much higher for anons than for non-anons.

## for anons, fpr_anon is about 0.21 at a threshold of about 0.78
## you could use linear programming to do this, but I'm lazy
edits[anon==TRUE, balance.pred := p.calibration > t.anon]
edits[anon==FALSE, balance.pred := p.calibration > t.nonanon]
edits[,mean(balance.pred),by=.(damaging,anon)]

damaging  anon   V1
1         TRUE   0.7049934
0         TRUE   0.2071130
0         FALSE  0.2082515
1         FALSE  0.7103594

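Check that our new predictor is balanced. For comparison, the first four cells compute the true/false positive and negative rates of the calibrated predictor (anons first, then non-anons); the next four do the same for the balanced predictor.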

#tpr
(edits[anon==TRUE, sum( (calibration.pred==TRUE) & (damaging==TRUE) )/sum(damaging==TRUE)])
(edits[anon==FALSE,sum( (calibration.pred==TRUE) & damaging==TRUE)/sum(damaging==TRUE)])

0.931668856767411
0.431289640591966
#fpr
(edits[anon==TRUE, sum((calibration.pred==TRUE) & (damaging==FALSE))/sum(damaging==FALSE)])
(edits[anon==FALSE, sum((calibration.pred==TRUE) & damaging==FALSE)/sum(damaging==FALSE)])

0.610878661087866
0.0576293385723641
#tnr
(edits[anon==TRUE, sum( (calibration.pred==FALSE) & (damaging==FALSE) )/sum(damaging==FALSE)])
(edits[anon==FALSE,sum( (calibration.pred==FALSE) & damaging==FALSE)/sum(damaging==FALSE)])

0.389121338912134
0.942370661427636
#fnr
(edits[anon==TRUE, sum((calibration.pred==FALSE) & (damaging==TRUE))/sum(damaging==TRUE)])
(edits[anon==FALSE, sum((calibration.pred==FALSE) & damaging==TRUE)/sum(damaging==TRUE)])

0.0683311432325887
0.568710359408034
#tpr
(edits[anon==TRUE, sum( (balance.pred==TRUE) & (damaging==TRUE) )/sum(damaging==TRUE)])
(edits[anon==FALSE,sum( (balance.pred==TRUE) & damaging==TRUE)/sum(damaging==TRUE)])

0.704993429697766
0.710359408033827
#fpr
(edits[anon==TRUE, sum((balance.pred==TRUE) & (damaging==FALSE))/sum(damaging==FALSE)])
(edits[anon==FALSE, sum((balance.pred==TRUE) & damaging==FALSE)/sum(damaging==FALSE)])

0.207112970711297
0.208251473477407
#tnr
(edits[anon==TRUE, sum( (balance.pred==FALSE) & (damaging==FALSE) )/sum(damaging==FALSE)])
(edits[anon==FALSE,sum( (balance.pred==FALSE) & damaging==FALSE)/sum(damaging==FALSE)])

0.792887029288703
0.791748526522593
#fnr
(edits[anon==TRUE, sum((balance.pred==FALSE) & (damaging==TRUE))/sum(damaging==TRUE)])
(edits[anon==FALSE, sum((balance.pred==FALSE) & damaging==TRUE)/sum(damaging==TRUE)])

0.295006570302234
0.289640591966173
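The repeated rate calculations above can be collapsed into one grouped summary. This is a sketch under the assumption that predictions and outcomes are stored as logical or 0/1 columns; `confusion_rates` and the toy table `toy` are hypothetical names introduced here.

```r
library(data.table)

# Hypothetical helper: error rates of a binary prediction within each group.
confusion_rates <- function(dt, pred_col, truth_col = "damaging", group_col = "anon") {
  dt[, {
    p <- as.logical(get(pred_col))   # predicted damaging?
    y <- as.logical(get(truth_col))  # actually damaging?
    list(tpr = sum(p & y) / sum(y),
         fpr = sum(p & !y) / sum(!y),
         tnr = sum(!p & !y) / sum(!y),
         fnr = sum(!p & y) / sum(y))
  }, by = group_col]
}

# Tiny hand-made example so the rates are easy to verify by eye
toy <- data.table(
  anon     = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  damaging = c(1, 1, 0, 0, 1, 1, 0, 0),
  pred     = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)
rates <- confusion_rates(toy, "pred")
```

With the table from this post, `confusion_rates(edits, "balance.pred")` and `confusion_rates(edits, "calibration.pred")` would give the four rates per group in one call each.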

Using different thresholds for the different classes gives us a nearly balanced classifier!
The next question is whether the balanced predictor is calibrated. What do you expect?

## check if the classifier is calibrated. No way!
edits[,.(Predicted=mean(balance.pred), True=mean(damaging)), by=c("anon")]

anon   Predicted  True
TRUE   0.586      0.7610
FALSE  0.327      0.2365

Nope! Not calibrated. The predicted rate of vandalism for anons is lower than the true rate, and for non-anons the predicted rate of vandalism is greater than the true rate. Finally, we can visualize the difference between calibration and balance. I'm going to do this using a sample of points, coloring them according to whether they are false positives, false negatives, true positives, or true negatives, and then showing how predictions change between the calibrated and balanced predictors.

idx <- sample.int(n,100)
samp <- edits[idx]
samp2 <- edits[idx]
samp[anon==TRUE,threshhold := 0.5]
samp[anon==FALSE,threshhold := 0.5]
samp[ (damaging==TRUE) & (calibration.pred ==TRUE), type:="True Positive"]
samp[(damaging==FALSE) & (calibration.pred ==TRUE), type:="False Positive"]
samp[(damaging==TRUE) & (calibration.pred ==FALSE), type:="False Negative"]
samp[(damaging==FALSE) & (calibration.pred ==FALSE),type:="True Negative"]
samp[,type:=factor(type,levels = c("True Positive","True Negative","False Positive","False Negative"))]
samp[,model:="calibration"]

samp2[anon==TRUE,threshhold := t.anon]
samp2[anon==FALSE,threshhold := t.nonanon]
samp2[ (damaging==TRUE) & (balance.pred ==TRUE), type:="True Positive"]
samp2[(damaging==FALSE) & (balance.pred ==TRUE), type:="False Positive"]
samp2[(damaging==TRUE) & (balance.pred ==FALSE), type:="False Negative"]
samp2[(damaging==FALSE) & (balance.pred ==FALSE),type:="True Negative"]
samp2[,type:=factor(type,levels = c("True Positive","True Negative","False Positive","False Negative"))]
samp2[,model:="balance"]

samp = rbind(samp,samp2)
samp[,model:=factor(model,levels=c("calibration","balance"))]

my_labeller = as_labeller(c('FALSE' = "Not Anon", "TRUE" = "Anon","balance"="Balance",'calibration'="Calibration"))

ggplot(samp, aes(x=X,y=p.calibration,color=type)) + geom_point(alpha=0.7) + geom_hline(data=samp,aes(yintercept = threshhold)) + facet_grid(anon~model, labeller=my_labeller) + scale_color_brewer("",palette = 'Set1') + ylab("Predicted probability") + ggtitle("Illustrating calibration vs balance")


What we can see from this plot is that balancing the model reduces the overall accuracy, as it introduces more false classifications than it corrects. Specifically, in order to correct a small handful of false positives for anons, we introduce even more false negatives. Similarly, to correct a handful of false negatives for non-anons, we accept even more false positives. Such sacrifices must be made to achieve balance.

Finally, we can use a plot to illustrate that balance means that, within the groups of damaging or non-damaging edits, the model predicts damage with equal probabilities for anonymous and non-anonymous edits.

balance.rates = edits[,mean(balance.pred),by=c('damaging','anon')]
balance.rates[,level := 'Balanced model']
calibration.rates = edits[,mean(calibration.pred),by=c('damaging','anon')]
calibration.rates[,level:='Calibrated model']
true.rates = edits[,mean(p_damaging),by=c('damaging','anon')]
true.rates[,level:='True model']
dt <- rbind(balance.rates, calibration.rates, true.rates)
ggplot(dt,aes(x=damaging==TRUE,color=anon,group=anon,y=V1)) + geom_point() + facet_wrap(.~level) + xlab("Damaging") + ylab("Probability of predicting damage")


How much accuracy did we lose by making our model balanced? Of course, this will depend on the particulars of how I simulated the data.

(acc_calib <- edits[,mean(calibration.pred == (damaging==TRUE))])
(acc_trybal <- edits[,mean((p.try_balance > 0.5) == (damaging==TRUE))])
(acc_bal <- edits[,mean(balance.pred == (damaging == TRUE))])

0.81175
0.67475
0.74925

Accuracy is just one measure of model fitness, so let's also take a look at overall precision and recall.

names(edits)

1. 'anon'
2. 'X'
3. 'p_damaging'
4. 'damaging'
5. 'p.calibration'
6. 'calibration.pred'
7. 'p.try_balance'
8. 'balance.pred'
(recall.calib <- edits[,sum(calibration.pred == TRUE & damaging == TRUE)/sum(damaging==TRUE)])
(precision.calib <- edits[,sum(calibration.pred == TRUE & damaging == TRUE)/sum(calibration.pred==TRUE)])

0.813032581453634
0.81018981018981
(recall.balance <- edits[,sum(balance.pred == TRUE & damaging == TRUE)/sum(damaging==TRUE)])
(precision.balance <- edits[,sum(balance.pred == TRUE & damaging == TRUE)/sum(balance.pred==TRUE)])

0.706265664160401
0.771631982475356
(recall.try_balance <- edits[,sum( (p.try_balance > 0.5) & damaging == TRUE)/sum(damaging==TRUE)])
(precision.try_balance <- edits[,sum( (p.try_balance > 0.5) & damaging == TRUE)/sum(p.try_balance > 0.5)])

0.672681704260652
0.674371859296482
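For reference, the two quantities computed above follow the usual definitions, written here in terms of true positives (TP), false positives (FP) and false negatives (FN):

```latex
\text{recall} = \frac{TP}{TP + FN},
\qquad
\text{precision} = \frac{TP}{TP + FP}
```

In the code, `sum(damaging==TRUE)` is $TP + FN$ (all edits that really are damaging), while the denominator for precision should be the number of edits that the predictor in question flags as damaging, $TP + FP$.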

Even though in this simulated data anons were about three times as likely as non-anons to make damaging edits, balancing the model only costs about 6 percentage points of accuracy (0.81 versus 0.75). Moreover, the balanced model is more accurate than the model that ignores that anons exist! Similarly, we see that balancing the model results in a substantial hit to precision and recall, but ignoring the very informative fact that the editor is anonymous makes things even worse!

Choosing the point where the ROC curves for the two groups intersect is a good way to choose thresholds that will balance the model, but this comes at the cost of calibration. Removing the anon variable from the model is a way to compromise between balance and calibration, but potentially at the cost of accuracy (in this exercise, the cost in accuracy was quite high, but if we increase the effect of X enough it will not matter much).

However, total balance and total calibration can be thought of as boundaries that define the space of possible tradeoffs between these two different notions of fairness. Wikipedians might want to make a principled compromise between balance and calibration. One way to do this is to choose different thresholds for anons and for non-anons that may not accomplish total balance, but that preserve more calibration. There are also good approaches based on adding constraints (e.g. using KKT conditions) to the model that carefully penalize deviations from balance and calibration.
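As a rough illustration of the threshold-based compromise, the sketch below grid-searches pairs of group-specific thresholds and scores each pair by a weighted sum of a balance gap and a calibration gap. Everything here is an assumption made for illustration: the toy data, the two gap definitions, the `weight` parameter, and the function names are hypothetical and are not taken from ORES or from Hardt et al.

```r
# Toy data (assumed): calibrated scores that differ by group
set.seed(2)
n <- 2000
anon <- rep(c(TRUE, FALSE), each = n / 2)
score <- pnorm(rnorm(n) + ifelse(anon, 1, -1))
damaging <- runif(n) < score

# Balance and calibration gaps for one pair of thresholds
gaps <- function(t_anon, t_non) {
  pred <- score > ifelse(anon, t_anon, t_non)
  rate <- function(sel) mean(pred[sel])
  # balance gap: difference in TPR plus difference in FPR between groups
  balance <- abs(rate(anon & damaging) - rate(!anon & damaging)) +
             abs(rate(anon & !damaging) - rate(!anon & !damaging))
  # calibration gap: predicted vs. true damaging rate within each group
  calibration <- abs(rate(anon) - mean(damaging[anon])) +
                 abs(rate(!anon) - mean(damaging[!anon]))
  c(balance = balance, calibration = calibration)
}

# Evaluate every pair of thresholds on a coarse grid
grid <- expand.grid(t_anon = seq(0.05, 0.95, by = 0.05),
                    t_non  = seq(0.05, 0.95, by = 0.05))
g <- t(mapply(gaps, grid$t_anon, grid$t_non))

best_thresholds <- function(weight) {
  # weight = 1 cares only about balance, weight = 0 only about calibration
  loss <- weight * g[, "balance"] + (1 - weight) * g[, "calibration"]
  grid[which.min(loss), ]
}

best_thresholds(0.5)
```

Sweeping `weight` from 0 to 1 traces out one possible path between the calibrated and the balanced ends of the tradeoff; how the two gaps should be measured and weighted is a judgment call rather than something the simulation can settle.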