Algorithmic fairness has garnered increasing attention throughout the insurance industry. As the use of machine learning has become more common, from marketing and underwriting to claims management, regulators and consumer rights organisations have raised questions about the ethical risks posed by such technology.
A catalyst of this social discourse was an article entitled Machine Bias, published by the investigative news organisation ProPublica.1 The article focussed its criticism on a case management and decision support tool used in the U.S. judicial system, the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS).
A conclusion of the academic research surrounding this debate is that there are competing definitions of fairness, and these definitions may be incompatible with one another.2 The concepts of fairness at the centre of the discussion are calibration (aka predictive parity) and classification parity (aka error rate balance). There is also the concept of anti-classification, which calls for sensitive (aka protected) attributes not to be explicitly used in decision-making.3
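For readers who prefer formulas, these notions can be stated in a common notation; the notation below is added here for illustration and is not taken from the cited sources. Write Y for the true outcome, Ŷ for the predicted class and A for the sensitive attribute.

```latex
% One common formalisation (notation assumed, not from the cited sources).
\begin{align*}
\text{Predictive parity:}   &\quad \Pr\!\left(Y=1 \mid \hat{Y}=1,\, A=a\right) \text{ is the same for every group } a,\\
\text{Error rate balance:}  &\quad \Pr\!\left(\hat{Y}=1 \mid Y=0,\, A=a\right) \text{ and } \Pr\!\left(\hat{Y}=0 \mid Y=1,\, A=a\right) \text{ are the same for every } a,\\
\text{Anti-classification:} &\quad \hat{Y} \text{ is computed without using } A \text{ as an input.}
\end{align*}
```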
A stylized example, designed for educational purposes, illustrates the intrinsic incompatibility of predictive parity and error rate balance. The example uses the Adult dataset, which figures prominently in machine learning research, with gender as the sensitive attribute. It demonstrates that a classification model satisfying predictive parity across two groups cannot also satisfy error rate balance if the baseline prevalence differs between the groups.4
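The reason for the conflict can be sketched with standard confusion-matrix algebra; the identity below is added here as an illustration and is not quoted from the cited source. Within a group with prevalence p, the false positive rate (FPR), the false negative rate (FNR) and the positive predictive value (PPV, defined more fully below) are tied together as follows.

```latex
% Identity linking error rates, PPV and prevalence p within a group
% (standard confusion-matrix algebra, added for illustration):
\mathrm{FPR} \;=\; \frac{p}{1-p}\,\cdot\,\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\,\cdot\,\bigl(1-\mathrm{FNR}\bigr)
```

If PPV is held equal across groups while the prevalence p differs, the identity cannot hold with the same FPR and FNR in both groups, which is the incompatibility the example illustrates.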
The dataset, which is publicly available, comprises 48,842 anonymized records of annual income and personal information (age, gender, years of education, etc.) extracted from the 1994 U.S. Census database. The prediction task is to determine whether a person makes more than $50,000 a year.5
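As a minimal sketch, the dataset can be pulled from a public repository; the snippet below assumes the OpenML copy retrieved via scikit-learn, and the column and label names ("class", "sex", ">50K") are assumptions about that copy rather than details taken from the text.

```python
# Minimal sketch: load the Adult dataset from OpenML via scikit-learn.
# Column and label names are assumptions about the OpenML copy and may
# differ in other distributions of the data.
from sklearn.datasets import fetch_openml

adult = fetch_openml("adult", version=2, as_frame=True)
df = adult.frame

# Binary target: does the person earn more than $50,000 a year?
df["high_earner"] = (df["class"] == ">50K").astype(int)
print(df.shape)  # the full dataset has 48,842 records
```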
In the dataset, 24 percent of individuals are high earners. This baseline rate is higher for males (30 percent) than for females (11 percent). The dataset has been studied for its gender imbalance: females make up only 15 percent of high earners but 33 percent of the entire dataset.
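The baseline figures quoted above can be checked with a few lines of pandas; the sketch assumes the DataFrame df and the high_earner column from the previous snippet.

```python
# Illustrative check of the baseline rates quoted above.
overall_rate = df["high_earner"].mean()                # roughly 24 percent
rate_by_sex = df.groupby("sex")["high_earner"].mean()  # roughly 30 percent male, 11 percent female

# Share of each gender among high earners and in the whole dataset
share_among_high = df.loc[df["high_earner"] == 1, "sex"].value_counts(normalize=True)
share_overall = df["sex"].value_counts(normalize=True)

print(f"Overall prevalence: {overall_rate:.1%}")
print(rate_by_sex, share_among_high, share_overall, sep="\n")
```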
The algorithm6 satisfies predictive parity at a chosen threshold on the predicted probability of being a high earner if, among predicted high earners, the empirical probability of actually being a high earner is independent of group membership, where groups are defined by the sensitive attribute.7 In plain English, the Positive Predictive Value (PPV), defined as the ratio of the true positive count to the sum of the true positive and false positive counts, must be equal across groups within an acceptable margin of statistical error. As shown in the boxplot below, at a threshold of 32 percent for the predicted probability of being a high earner,8 the mean and median PPV of the high-earner category equal 60.6 percent for both groups.9 In the boxplot, the median is represented by a horizontal bar within the box, and the box spans the range between the first and third quartiles.
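To make the comparison concrete, the sketch below computes the group-wise PPV of a classifier at the 32 percent threshold. It is illustrative only: the function name, the column names and the fitted classifier model are assumptions, not the procedure used to produce the boxplot.

```python
import numpy as np
import pandas as pd

def group_ppv(y_true, proba, group, threshold=0.32):
    """Positive predictive value per group at a probability threshold.

    PPV = TP / (TP + FP): among records the model flags as high earners,
    the share that truly are high earners.
    """
    y_true = np.asarray(y_true)
    flagged = np.asarray(proba) >= threshold          # predicted high earners
    group = np.asarray(group)
    result = {}
    for g in pd.unique(group):
        mask = flagged & (group == g)                 # predicted positives within group g
        result[g] = y_true[mask].mean() if mask.any() else float("nan")
    return result

# Hypothetical usage, assuming a fitted classifier "model" and feature matrix "X":
# group_ppv(df["high_earner"], model.predict_proba(X)[:, 1], df["sex"])
# Predictive parity at the 32 percent threshold means the two values are
# (nearly) equal, about 60.6 percent in the example above.
```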