# Publications

- Benedikt Höltgen and Robert C Williamson. Forthcoming. “On the Richness of Calibration”. In FAccT2023 .
Probabilistic predictions can be evaluated through comparisons with observed label frequencies, that is, through the lens of calibration. Recent scholarship on algorithmic fairness has started to look at a growing variety of calibration-based objectives under the name of multi-calibration but has still remained fairly restricted. In this paper, we explore and analyse forms of evaluation through calibration by making explicit the choices involved in design- ing calibration scores. We organise these into three grouping choices and a choice concerning the agglomeration of group errors. This provides a framework for comparing previously proposed calibration scores and helps to formulate novel ones with desirable mathematical properties. In particular, we explore the possibility of grouping datapoints based on their input features rather than on predictions and formally demonstrate advantages of such approaches. We also characterise the space of suitable agglomeration functions for group errors, generalising previously proposed calibration scores. Complementary to such population-level scores, we explore calibration scores at the individual level and analyse their relationship to choices of grouping. We draw on these insights to introduce and axiomatise fairness deviation measures for population-level scores. We demonstrate that with appropriate choices of grouping, these novel global fairness scores can provide notions of (sub-)group or individual fairness.

*So many options,*

so much potential, hiding

in calibration.FileOnTheRichenessOfCalibration.pdf271.77 KB Mixable loss functions are of fundamental importance in the context of prediction with expert advice in the online setting since they characterize fast learning rates. By re-interpreting properness from the point of view of differential geometry, we provide a simple geometric characterization of mixability for the binary and multi-class cases: a proper loss function l is η-mixable if and only if the superpredition set spr(ηl) of the scaled loss function ηl slides freely inside the superprediction set spr(llog) of the log loss llog, under fairly general assumptions on the differentiability of l. Our approach provides a way to treat some concepts concerning loss functions (like properness) in a “coordinate-free” manner and reconciles previous results obtained for mixable loss functions for the binary and the multi-class cases.

File2302.11905.pdf737.48 KB- Rabanus Derr and Robert C Williamson. 2/7/2023. “The Set Structure of Precision: Coherent Probabilities on Pre-Dynkin-Systems”. arXiv, arXiv:2302.03522.

In literature on imprecise probability little attention is paid to the fact that imprecise proba- bilities are precise on some events. We call these sets system of precision. We show that, under mild assumptions, the system of precision of a lower and upper probability form a so-called (pre-)Dynkin-system. Interestingly, there are several settings, ranging from machine learning on partial data over frequential probability theory to quantum probability theory and decision making under uncertainty, in which a priori the probabilities are only desired to be precise on a specific underlying set system. At the core of all of these settings lies the observation that precise beliefs, probabilities or frequencies on two events do not necessarily imply this precision to hold for the intersection of those events. Here, (pre-)Dynkin-systems have been adopted as systems of precision, too. We show that, under extendability conditions, those pre-Dynkin- systems equipped with probabilities can be embedded into algebras of sets. Surprisingly, the extendability conditions elaborated in a strand of work in quantum physics are equivalent to coherence in the sense of Walley [Walley, 1991, p. 84]. Thus, literature on probabilities on pre-Dynkin-systems gets linked to the literature on imprecise probability. Finally, we spell out a lattice duality which rigorously relates the system of precision to credal sets of probabilities. In particular, we provide a hitherto undescribed, parametrized family of coherent imprecise probabilities. - Christian Fröhlich, Rabanus Derr, and Robert C Williamson. 2/7/2023. “Strictly Frequentist Imprecise Probability”. arXiv, arXiv:2302.03520.
Strict frequentism defines probability as the limiting relative frequency in an infinite se- quence. What if the limit does not exist? We present a broader theory, which is applicable also to random phenomena that exhibit diverging relative frequencies. In doing so, we develop a close connection with the theory of imprecise probability: the cluster points of relative frequencies yield an upper probability. We show that a natural frequentist definition of conditional probability recovers the generalized Bayes rule. This also sug- gests an independence concept, which is related to epistemic irrelevance in the imprecise probability literature. Finally, we prove constructively that, for a finite set of elementary events, there exists a sequence for which the cluster points of relative frequencies coincide with a prespecified set which demonstrates the naturalness, and arguably completeness, of our theory.

- Rabanus Derr and Robert C Williamson. 10/2022. “Fairness and Randomness in Machine Learning: Statistical Independence and Relativization”. arXiv preprint arXiv:2207.13596.
Fair Machine Learning endeavors to prevent unfairness arising in the context of machine learning applications embedded in society. Despite the variety of definitions of fairness and proposed "fair algorithms", there remain unresolved conceptual problems regarding fairness. In this paper, we dissect the role of statistical independence in fairness and randomness notions regularly used in machine learning. Thereby, we are led to a suprising hypothesis: randomness and fairness can be considered equivalent concepts in machine learning.

In particular, we obtain a relativized notion of randomness expressed as statistical independence by appealing to Von Mises' century-old foundations for probability. This notion turns out to be "orthogonal" in an abstract sense to the commonly used i.i.d.-randomness. Using standard fairness notions in machine learning, which are defined via statistical independence, we then link the ex ante randomness assumptions about the data to the ex post requirements for fair predictions. This connection proves fruitful: we use it to argue that randomness and fairness are essentially relative and that both concepts should reflect their nature as modeling assumptions in machine learning. - Robert C Williamson and Zac Cranko. 9/2022. “Information Processing Equalities and the Information-Risk Bridge”. arXiv preprint arXiv:2207.11987.
We introduce two new classes of measures of information for statistical experiments which generalise and subsume \(\phi\)-divergences, integral probability metrics, \(\mathfrak{N}\)-distances (MMD), and \((f,\Gamma)\)-divergences between two or more distributions. This enables us to derive a simple geometrical relationship between measures of information and the Bayes risk of a statistical decision problem, thus extending the variational \(\phi\)-divergence representation to multiple distributions in an entirely symmetric manner. The new families of divergence are closed under the action of Markov operators which yields an information processing equality which is a refinement and generalisation of the classical data processing inequality. This equality gives insight into the significance of the choice of the hypothesis class in classical risk minimization.

*All Information*

Undergoes transformation.

No gap. Now equal. - Robert C Williamson and Zac Cranko. 9/2022. “The Geometry and Calculus of Losses”. arXiv preprint arXiv:2209.00238.
Statistical decision problems are the foundation of statistical machine learning. The simplest problems are binary and multiclass classification and class probability estimation. Central to their definition is the choice of loss function, which is the means by which the quality of a solution is evaluated. In this paper we systematically develop the theory of loss functions for such problems from a novel perspective whose basic ingredients are convex sets with a particular structure. The loss function is defined as the subgradient of the support function of the convex set. It is consequently automatically proper (calibrated for probability estimation). This perspective provides three novel opportunities. It enables the development of a fundamental relationship between losses and (anti)-norms that appears to have not been noticed before. Second, it enables the development of a calculus of losses induced by the calculus of convex sets which allows the interpolation between different losses, and thus is a potential useful design tool for tailoring losses to particular problems. In doing this we build upon, and considerably extend, existing results on M-sums of convex sets. Third, the perspective leads to a natural theory of `polar' (or `inverse') loss functions, which are derived from the polar dual of the convex set defining the loss, and which form a natural universal substitution function for Vovk's aggregating algorithm.

*Support gradients,*

control proper loss functions:

Secretly convex. - Christian Fröhlich and Robert C Williamson. 8/2022. “Tailoring to the Tails: Risk Measures for Fine-Grained Tail Sensitivity”. arXiv preprint arXiv:2208.03066.
Expected risk minimization (ERM) is at the core of machine learning systems. This means that the risk inherent in a loss distribution is summarized using a single number - its average. In this paper, we propose a general approach to construct risk measures which exhibit a desired tail sensitivity and may replace the expectation operator in ERM. Our method relies on the specification of a reference distribution with a desired tail behaviour, which is in a one-to-one correspondence to a coherent upper probability. Any risk measure, which is compatible with this upper probability, displays a tail sensitivity which is finely tuned to the reference distribution. As a concrete example, we focus on divergence risk measures based on f-divergence ambiguity sets, which are a widespread tool used to foster distributional robustness of machine learning systems. For instance, we show how ambiguity sets based on the Kullback-Leibler divergence are intricately tied to the class of subexponential random variables. We elaborate the connection of divergence risk measures and rearrangement invariant Banach norms.

*Rare loss. Control how?*

The fundamental function.

Your tails are tailored. - Christian Fröhlich and Robert C Williamson. 7/2022. “Risk Measures and Upper Probabilities: Coherence and Stratification”. arXiv preprint arXiv:2206.03183.
Machine learning typically presupposes classical probability theory which implies that aggregation is built upon expectation. There are now multiple reasons to motivate looking at richer alternatives to classical probability theory as a mathematical foundation for machine learning. We systematically examine a powerful and rich class of such alternatives, known variously as spectral risk measures, Choquet integrals or Lorentz norms. We present a range of characterization results, and demonstrate what makes this spectral family so special. In doing so we demonstrate a natural stratification of all coherent risk measures in terms of the upper probabilities that they induce by exploiting results from the theory of rearrangement invariant Banach spaces. We empirically demonstrate how this new approach to uncertainty helps tackling practical machine learning problems.

- Yishay Mansour, Richard Nock, and Robert C Williamson. 5/2022. “What killed the Convex Booster?”. arXiv preprint arXiv:2205.09628.
A landmark negative result of Long and Servedio established a worst-case spectacular failure of a supervised learning trio (loss, algorithm, model) otherwise praised for its high precision machinery. Hundreds of papers followed up on the two suspected culprits: the loss (for being convex) and/or the algorithm (for fitting a classical boosting blueprint). Here, we call to the half-century+ founding theory of losses for class probability estimation (properness), an extension of Long and Servedio's results and a new general boosting algorithm to demonstrate that the real culprit in their specific context was in fact the (linear) model class. We advocate for a more general standpoint on the problem as we argue that the source of the negative result lies in the dark side of a pervasive -- and otherwise prized -- aspect of ML:

*parameterisation*. - Lachlan McCalman, Daniel Steinberg, Grace Abuhamad, Marc-Etienne Brunet, Robert C Williamson, and Richard Zemel. 1/2022. “Assessing AI Fairness in Finance”. Computer, 55, 1, Pp. 94-97.
If society demands that a bank’s use of artificial intelligence systems is “fair,” what is the bank to actually do? This article outlines a pragmatic and defensible answer.