Publications

  • Fröhlich, Christian, and Robert C. Williamson. 2024. “Scoring Rules and Calibration for Imprecise Probabilities”. ArXiv.

    What does it mean to say that, for example, the probability for rain tomorrow is between 20% and 30%? The theory for the evaluation of precise probabilistic forecasts is well-developed and is grounded in the key concepts of proper scoring rules and calibration. For the case of imprecise probabilistic forecasts (sets of probabilities), such theory is still lacking. In this work, we therefore generalize proper scoring rules and calibration to the imprecise case. We develop these concepts relative to data models and decision problems. As a consequence, the imprecision is embedded in a clear context. We establish a close link to the paradigm of (group) distributional robustness and in doing so provide new insights for it. We argue that proper scoring rules and calibration serve two distinct goals, which are aligned in the precise case, but intriguingly are not necessarily aligned in the imprecise case. The concept of decision-theoretic entropy plays a key role for both goals. Finally, we demonstrate the theoretical insights in machine learning practice; in particular, we illustrate subtle pitfalls relating to the choice of loss function in distributional robustness.
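
    A minimal numerical sketch (not from the paper) of the precise-case baseline the abstract builds on: a proper scoring rule such as the Brier score is minimized in expectation by reporting the true probability. All names and values below are illustrative.

    ```python
    import numpy as np

    def brier(q, y):
        """Brier score of a binary forecast q in [0, 1] against an outcome y in {0, 1}."""
        return (q - y) ** 2

    p_true = 0.3                      # true probability of rain
    reports = np.linspace(0, 1, 101)  # candidate forecasts

    # Expected score of reporting q when the outcome is Bernoulli(p_true)
    expected = p_true * brier(reports, 1) + (1 - p_true) * brier(reports, 0)

    print(reports[np.argmin(expected)])  # ~0.3: truthful reporting minimizes expected score
    ```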

  • Höltgen, Benedikt, and Robert C. Williamson. 2024. “Which Distribution Were You Sampled From? Towards a More Tangible Conception of Data”. ArXiv.

    Machine Learning research, like most of Statistics, relies heavily on the concept of a data-generating probability distribution. The standard presumption is that since data points are 'sampled from' such a distribution, one can learn from observed data about this distribution and, thus, predict future data points which, it is presumed, are also drawn from it. Drawing on scholarship across disciplines, we argue here that this framework is not always a good model. Not only do such true probability distributions not exist; the framework can also be misleading and obscure both the choices made and the goals pursued in machine learning practice. We suggest an alternative framework that focuses on finite populations rather than abstract distributions; while classical learning theory can be left almost unchanged, this shift opens new opportunities, especially for modelling sampling. We compile these considerations into five reasons for modelling machine learning -- in some settings -- with finite populations rather than generative distributions, both to be more faithful to practice and to provide novel theoretical insights.

  • Höltgen, Benedikt, and Robert C. Williamson. 2024. “Causal Modelling Without Introducing Counterfactuals or Abstract Distributions”. ArXiv.

    The most common approach to causal modelling is the potential outcomes framework due to Neyman and Rubin. In this framework, outcomes of counterfactual treatments are assumed to be well-defined. This metaphysical assumption is often thought to be problematic yet indispensable. The conventional approach relies not only on counterfactuals but also on abstract notions of distributions and assumptions of independence that are not directly testable. In this paper, we construe causal inference as treatment-wise predictions for finite populations where all assumptions are testable; this means that one can not only test predictions themselves (without any fundamental problem) but also investigate sources of error when they fail. The new framework highlights the model-dependence of causal claims as well as the difference between statistical and scientific inference.

  • Remeli, Mina, Moritz Hardt, and Robert C. Williamson. 2024. “Limits to Predicting Online Speech Using Large Language Models”. ArXiv.

    We study the predictability of online speech on social media, and whether predictability improves with information outside a user's own posts. Recent work suggests that the predictive information contained in posts written by a user's peers can surpass that of the user's own posts. Motivated by the success of large language models, we empirically test this hypothesis. We define unpredictability as a measure of the model's uncertainty, i.e., its negative log-likelihood on future tokens given context. As the basis of our study, we collect a corpus of 6.25M posts from more than five thousand X (previously Twitter) users and their peers. Across three large language models ranging in size from 1 billion to 70 billion parameters, we find that predicting a user's posts from their peers' posts performs poorly. Moreover, the value of the user's own posts for prediction is consistently higher than that of their peers'. Across the board, we find that the predictability of social media posts remains low, comparable to predicting financial news without context. We extend our investigation with a detailed analysis of the causes of unpredictability and the robustness of our findings. Specifically, we observe that a significant amount of predictive uncertainty comes from hashtags and @-mentions. Moreover, our results replicate if, instead of prompting the model with additional context, we finetune on additional context.
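
    A rough sketch of the unpredictability measure described above, i.e. the mean negative log-likelihood of a target post given some context. Assumptions: a Hugging Face causal language model ("gpt2" here as a stand-in) and the standard transformers/torch APIs; the paper's actual models, prompts and corpus differ.

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")               # stand-in model, not the paper's
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def mean_nll(context: str, target: str) -> float:
        """Mean negative log-likelihood (nats per token) of `target` given `context`."""
        ctx_ids = tok(context, return_tensors="pt").input_ids
        tgt_ids = tok(target, return_tensors="pt").input_ids
        input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : ctx_ids.shape[1]] = -100                  # ignore context tokens in the loss
        with torch.no_grad():
            out = model(input_ids, labels=labels)
        return out.loss.item()                                # averaged over target tokens only

    print(mean_nll("Recent posts by the user's peers ...", "The user's next post ..."))
    ```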

  • Pacheco, Armando J. Cabrera, Rabanus Derr, and Robert C. Williamson. 2024. “An Axiomatic Approach to Loss Aggregation and an Adapted Aggregating Algorithm”. ArXiv.

    Supervised learning has gone beyond the expected risk minimization framework. Central to most of these developments is the introduction of more general aggregation functions for losses incurred by the learner. In this paper, we turn towards online learning under expert advice. Via easily justified assumptions we characterize a set of reasonable loss aggregation functions as quasi-sums. Based upon this insight, we suggest a variant of the Aggregating Algorithm tailored to these more general aggregation functions. This variant inherits most of the nice theoretical properties of the AA, such as recovery of Bayes' updating and a time-independent bound on quasi-sum regret. Finally, we argue that generalized aggregations express the attitude of the learner towards losses.
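
    For orientation, a brief sketch of the classical Aggregating Algorithm for binary log loss (the sum-aggregated baseline that the paper's quasi-sum variant generalizes); this is standard Vovk-style exponential weighting, not the adapted algorithm of the paper.

    ```python
    import numpy as np

    def aggregating_algorithm_log_loss(expert_probs, outcomes, eta=1.0):
        """Classical AA for binary log loss: exponentially weighted mixture of experts.

        expert_probs: array (T, K), expert forecasts for P(y_t = 1)
        outcomes:     array (T,), observed labels in {0, 1}
        """
        T, K = expert_probs.shape
        weights = np.full(K, 1.0 / K)
        preds = np.empty(T)
        for t in range(T):
            p = np.clip(expert_probs[t], 1e-12, 1 - 1e-12)
            # For log loss with eta = 1 the weighted mixture is a valid substitution
            # prediction, and the weight update coincides with Bayes' updating.
            preds[t] = weights @ p
            loss = -(outcomes[t] * np.log(p) + (1 - outcomes[t]) * np.log(1 - p))
            weights = weights * np.exp(-eta * loss)
            weights /= weights.sum()
        return preds
    ```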

  • Iacovissi, Laura, Nan Lu, and Robert C. Williamson. 2024. “Corruptions of Supervised Learning Problems: Typology and Mitigations”. ArXiv.

    Corruption is notoriously widespread in data collection. Despite extensive research, the existing literature on corruption predominantly focuses on specific settings and learning scenarios, lacking a unified view. There is still a limited understanding of how to effectively model and mitigate corruption in machine learning problems. In this work, we develop a general theory of corruption from an information-theoretic perspective - with Markov kernels as a foundational mathematical tool. We generalize the definition of corruption beyond the concept of distributional shift: corruption includes all modifications of a learning problem, including changes in model class and loss function. We will focus here on changes in probability distributions. First, we construct a provably exhaustive framework for pairwise Markovian corruptions. The framework not only allows us to study corruption types based on their input space, but also serves to unify prior works on specific corruption models and establish a consistent nomenclature. Second, we systematically analyze the consequences of corruption on learning tasks by comparing Bayes risks in the clean and corrupted scenarios. This examination sheds light on complexities arising from joint and dependent corruptions on both labels and attributes. Notably, while label corruptions affect only the loss function, more intricate cases involving attribute corruptions extend the influence beyond the loss to affect the hypothesis class. Third, building upon these results, we investigate mitigations for various corruption types. We expand the existing loss-correction results for label corruption, and identify the necessity to generalize the classical corruption-corrected learning framework to a new paradigm with weaker requirements. Within the latter setting, we provide a negative result for loss correction in the attribute and the joint corruption case.
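
    As a concrete instance of the label-corruption case discussed above, here is a brief sketch of standard backward loss correction with a known label-noise kernel, in the spirit of the loss-correction results the paper extends; the kernel and prediction below are illustrative.

    ```python
    import numpy as np

    # Markov kernel on labels: T[i, j] = P(observed label j | clean label i)
    T = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    T_inv = np.linalg.inv(T)

    def backward_corrected_loss(loss_vec, observed_label):
        """Unbiased estimate of the clean-label loss from a corrupted label.

        loss_vec[y] is the loss of the current prediction against label y;
        in expectation over the noise, T_inv @ loss_vec recovers the clean loss.
        """
        return (T_inv @ loss_vec)[observed_label]

    p = np.array([0.7, 0.3])        # predicted class probabilities
    loss_vec = -np.log(p)           # cross-entropy against each possible label
    print(backward_corrected_loss(loss_vec, observed_label=1))
    ```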

  • Fröhlich, Christian, Rabanus Derr, and Robert C Williamson. 2024. “Strictly Frequentist Imprecise Probability”. International Journal of Approximate Reasoning 168 (109148).

    Strict frequentism defines probability as the limiting relative frequency in an infinite sequence. What if the limit does not exist? We present a broader theory, which is applicable also to random phenomena that exhibit diverging relative frequencies. In doing so, we develop a close connection with the theory of imprecise probability: the cluster points of relative frequencies yield an upper probability. We show that a natural frequentist definition of conditional probability recovers the generalized Bayes rule. This also suggests an independence concept, which is related to epistemic irrelevance in the imprecise probability literature. Finally, we prove constructively that, for a finite set of elementary events, there exists a sequence for which the cluster points of relative frequencies coincide with a prespecified set, which demonstrates the naturalness, and arguably completeness, of our theory.
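
    A small simulation (not from the paper) of the phenomenon described: a binary sequence whose relative frequency of ones keeps oscillating, so the running relative frequencies do not converge and their cluster points fill an interval, read as a lower and an upper probability.

    ```python
    import numpy as np

    # Blocks of doubling length, alternating all-ones and all-zeros,
    # so the running relative frequency of ones never settles.
    bits, freq_at_block_end = [], []
    for k in range(1, 20):
        bits.extend(([1] if k % 2 else [0]) * 2 ** k)
        freq_at_block_end.append(np.mean(bits))

    print(np.round(freq_at_block_end[-6:], 3))
    # Alternates near 2/3 (after ones-blocks) and 1/3 (after zeros-blocks):
    # the cluster points of the relative frequency fill [1/3, 2/3], giving a
    # lower probability ~1/3 and an upper probability ~2/3 for the event {1}.
    ```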

  • Fröhlich, Christian, and Robert C Williamson. 2024. “Risk Measures and Upper Probabilities: Coherence and Stratification”. Journal of Machine Learning Research 25 (207): 1-100.

    Machine learning typically presupposes classical probability theory, which implies that aggregation is built upon expectation. There are now multiple reasons to motivate looking at richer alternatives to classical probability theory as a mathematical foundation for machine learning. We systematically examine a powerful and rich class of such alternatives, known variously as spectral risk measures, Choquet integrals or Lorentz norms. We present a range of characterization results, and demonstrate what makes this spectral family so special. In doing so we demonstrate a natural stratification of all coherent risk measures in terms of the upper probabilities that they induce by exploiting results from the theory of rearrangement invariant Banach spaces. We empirically demonstrate how this new approach to uncertainty helps in tackling practical machine learning problems.
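
    A minimal sketch (generic notation, not the paper's) of a spectral risk measure as a non-decreasing weighting of sorted losses; the expectation and CVaR below are two special cases of this family.

    ```python
    import numpy as np

    def spectral_risk(losses, spectrum):
        """Empirical spectral risk: non-negative, non-decreasing weights on sorted losses.

        spectrum must sum to 1; uniform weights recover the mean, while weight
        concentrated on the largest losses gives CVaR-type tail measures.
        """
        return float(np.dot(spectrum, np.sort(losses)))

    rng = np.random.default_rng(0)
    losses = rng.lognormal(size=1000)
    n = len(losses)

    mean_spectrum = np.full(n, 1 / n)                           # plain expectation
    cvar_spectrum = np.where(np.arange(n) >= 0.95 * n, 1.0, 0.0)
    cvar_spectrum /= cvar_spectrum.sum()                        # CVaR at level 0.95

    print(spectral_risk(losses, mean_spectrum), spectral_risk(losses, cvar_spectrum))
    ```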

  • Fröhlich, Christian, and Robert C. Williamson. 2024. “Data Models With Two Manifestations of Imprecision”. ArXiv.

    Motivated by recently emerging problems in machine learning and statistics, we propose data models which relax the familiar i.i.d. assumption. In essence, we seek to understand what it means for data to come from a set of probability measures. We show that our frequentist data models, parameterized by such sets, manifest two aspects of imprecision. We characterize the intricate interplay of these manifestations, aggregate (ir)regularity and local (ir)regularity, where a much richer set of behaviours is possible than under an i.i.d. model. In doing so we shed new light on the relationship between non-stationary, locally precise and stationary, locally imprecise data models. We discuss possible applications of these data models in machine learning and how the set of probabilities can be estimated. For the estimation of aggregate irregularity, we provide a negative result but argue that it does not warrant pessimism. Understanding these frequentist aspects of imprecise probabilities paves the way for deriving generalizations of proper scoring rules and calibration to the imprecise case, which can then contribute to tackling practical problems.

  • Williamson, Robert C, and Zac Cranko. 2024. “Information Processing Equalities and the Information-Risk Bridge”. Journal of Machine Learning Research.

    We introduce two new classes of measures of information for statistical experiments which generalise and subsume \(\phi\)-divergences, integral probability metrics, \(\mathfrak{N}\)-distances (MMD), and \((f,\Gamma)\)-divergences between two or more distributions. This enables us to derive a simple geometrical relationship between measures of information and the Bayes risk of a statistical decision problem, thus extending the variational \(\phi\)-divergence representation to multiple distributions in an entirely symmetric manner. The new families of divergences are closed under the action of Markov operators, which yields an information processing equality that is a refinement and generalisation of the classical data processing inequality. This equality gives insight into the significance of the choice of the hypothesis class in classical risk minimization.
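
    As background, the classical two-distribution variational representation that the paper extends can be stated (in standard notation, not necessarily the paper's) as

    \[
    D_\phi(P, Q) \;=\; \sup_{f} \; \mathbb{E}_{P}[f(X)] - \mathbb{E}_{Q}\!\left[\phi^{*}(f(X))\right],
    \]

    where \(\phi^{*}\) is the convex conjugate of \(\phi\) and the supremum runs over suitably bounded measurable functions \(f\); restricting \(f\) to a ball of a function class instead yields integral probability metrics such as MMD. The paper's symmetric, multi-distribution generalisation ties such quantities to the Bayes risk of the associated decision problem.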

    All Information
    Undergoes transformation.
    No gap. Now equal.

  • Mémoli, Facundo, Brantley Vose, and Robert C. Williamson. 2024. “Geometry and Stability of Supervised Learning Problems”. ArXiv.

    We introduce a notion of distance between supervised learning problems, which we call the Risk distance. This optimal-transport-inspired distance facilitates stability results; one can quantify how seriously issues like sampling bias, noise, limited data, and approximations might change a given problem by bounding how much these modifications can move the problem under the Risk distance. With the distance established, we explore the geometry of the resulting space of supervised learning problems, providing explicit geodesics and proving that the set of classification problems is dense in a larger class of problems. We also provide two variants of the Risk distance: one that incorporates specified weights on a problem's predictors, and one that is more sensitive to the contours of a problem's risk landscape.

  • Machine learning is about forecasting. Forecasts, however, obtain their usefulness only through their evaluation. Machine learning has traditionally focused on types of losses and their corresponding regret. Recently, the machine learning community has regained interest in calibration. In this work, we show the conceptual equivalence of calibration and regret in evaluating forecasts. We frame the evaluation problem as a game between a forecaster, a gambler and nature. Putting intuitive restrictions on the gambler and the forecaster, calibration and regret naturally fall out of the framework. In addition, this game links evaluation of forecasts to randomness of outcomes. Random outcomes with respect to forecasts are equivalent to good forecasts with respect to outcomes. We call these dual aspects, calibration and regret, predictiveness and randomness, the four facets of forecast felicity.

  • Williamson, Robert C, and Zac Cranko. 2023. “The Geometry and Calculus of Losses”. Journal of Machine Learning Research 24 (342): 1-72.

    Statistical decision problems are the foundation of statistical machine learning. The simplest problems are binary and multiclass classification and class probability estimation. Central to their definition is the choice of loss function, which is the means by which the quality of a solution is evaluated. In this paper we systematically develop the theory of loss functions for such problems from a novel perspective whose basic ingredients are convex sets with a particular structure. The loss function is defined as the subgradient of the support function of the convex set. It is consequently automatically proper (calibrated for probability estimation). This perspective provides three novel opportunities. First, it enables the development of a fundamental relationship between losses and (anti)-norms that appears not to have been noticed before. Second, it enables the development of a calculus of losses, induced by the calculus of convex sets, which allows interpolation between different losses and is thus a potentially useful design tool for tailoring losses to particular problems. In doing this we build upon, and considerably extend, existing results on M-sums of convex sets. Third, the perspective leads to a natural theory of 'polar' (or 'inverse') loss functions, which are derived from the polar dual of the convex set defining the loss, and which form a natural universal substitution function for Vovk's aggregating algorithm.
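
    For readers new to properness, the property that this construction delivers automatically can be written (in standard notation, not necessarily the paper's) as

    \[
    \langle p, \ell(p) \rangle \;\le\; \langle p, \ell(q) \rangle \qquad \text{for all } p, q \in \Delta^{n},
    \]

    i.e. reporting the true class-probability vector \(p\) minimizes the expected loss. The conditional Bayes risk \(\underline{L}(p) := \langle p, \ell(p) \rangle\) is then a concave function of \(p\), and obtaining \(\ell\) as a subgradient of the support function of a suitably structured convex set makes this property automatic.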

    Support gradients,
    control proper loss functions:
    Secretly convex.

  • Cabrera Pacheco, Armando J., and Robert C Williamson. 2023. “The Geometry of Mixability”. Transactions on Machine Learning Research.

    Mixable loss functions are of fundamental importance in the context of prediction with expert advice in the online setting, since they characterize fast learning rates. By re-interpreting properness from the point of view of differential geometry, we provide a simple geometric characterization of mixability for the binary and multi-class cases: a proper loss function \(\ell\) is \(\eta\)-mixable if and only if the superprediction set \(\mathrm{spr}(\eta \ell)\) of the scaled loss function \(\eta \ell\) slides freely inside the superprediction set \(\mathrm{spr}(\ell_{\log})\) of the log loss \(\ell_{\log}\), under fairly general assumptions on the differentiability of \(\ell\). Our approach provides a way to treat some concepts concerning loss functions (like properness) in a “coordinate-free” manner and reconciles previous results obtained for mixable loss functions for the binary and the multi-class cases.
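
    For reference, the standard definition to which this geometric characterization corresponds (stated here in generic notation; see the paper for the precise geometric statement): a loss \(\ell\) is \(\eta\)-mixable if for every distribution \(\mu\) over predictions there exists a single prediction \(p^{*}\) with

    \[
    e^{-\eta\, \ell(p^{*}, y)} \;\ge\; \int e^{-\eta\, \ell(p, y)} \, d\mu(p) \qquad \text{for every outcome } y,
    \]

    so that any mixture of experts' exponentiated losses can be matched by a single prediction; this is what yields the constant (time-independent) regret of the Aggregating Algorithm.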

  • Derr, Rabanus, and Robert C. Williamson. 2023. “Systems of Precision: Coherent Probabilities on Pre-Dynkin Systems and Coherent Previsions on Linear Subspaces”. Entropy 25 (9): 1283. https://doi.org/10.3390/e25091283.

    In the literature on imprecise probability, little attention is paid to the fact that imprecise probabilities are precise on a set of events. We call these sets systems of precision. We show that, under mild assumptions, the system of precision of a lower and upper probability form a so-called (pre-)Dynkin system. Interestingly, there are several settings, ranging from machine learning on partial data over frequential probability theory to quantum probability theory and decision making under uncertainty, in which, a priori, the probabilities are only desired to be precise on a specific underlying set system. Here, (pre-)Dynkin systems have been adopted as systems of precision, too. We show that, under extendability conditions, those pre-Dynkin systems equipped with probabilities can be embedded into algebras of sets. Surprisingly, the extendability conditions elaborated in a strand of work in quantum probability are equivalent to coherence from the imprecise probability literature. On this basis, we spell out a lattice duality which relates systems of precision to credal sets of probabilities. We conclude the presentation with a generalization of the framework to expectation-type counterparts of imprecise probabilities. The analogue of pre-Dynkin systems turns out to be (sets of) linear subspaces in the space of bounded, real-valued functions. We introduce partial expectations, natural generalizations of probabilities defined on pre-Dynkin systems. Again, coherence and extendability are equivalent. A related but more general lattice duality preserves the relation between systems of precision and credal sets of probabilities.
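
    A tiny worked example (not from the paper) of the structural point at the heart of this work: a (pre-)Dynkin system is closed under complements and disjoint unions, yet precision on two events need not extend to their intersection.

    ```python
    from itertools import combinations

    omega = frozenset({1, 2, 3, 4})
    # A Dynkin system on {1, 2, 3, 4} that is not an algebra of sets:
    D = {frozenset(), omega,
         frozenset({1, 2}), frozenset({3, 4}),
         frozenset({1, 3}), frozenset({2, 4})}

    # Closed under complements and under unions of disjoint members ...
    assert all(omega - A in D for A in D)
    assert all(A | B in D for A, B in combinations(D, 2) if not A & B)

    # ... but not under intersections: a probability can be precise on
    # {1, 2} and on {1, 3} without pinning down a precise value on {1}.
    assert frozenset({1, 2}) & frozenset({1, 3}) not in D
    ```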

  • Mansour, Yishay, Richard Nock, and Robert C. Williamson. 2023. “Random Classification Noise Does Not Defeat All Convex Potential Boosters Irrespective of Model Choice”. In ICML 2023.

    A landmark negative result of Long and Servedio has had a considerable impact on research and development in boosting algorithms, around the now famous tagline that "noise defeats all convex boosters". In this paper, we appeal to the half-century+ founding theory of losses for class probability estimation, an extension of Long and Servedio's results and a new general convex booster to demonstrate that the source of their negative result is in fact the model class, linear separators. Neither losses nor algorithms are to blame. This leads us to a discussion of an otherwise praised aspect of ML: parameterisation.

  • Fröhlich, Christian, and Robert C. Williamson. 2023. “Insights From Insurance for Fair Machine Learning: Responsibility, Performativity and Aggregates”. ArXiv. https://doi.org/10.48550/arXiv.2306.14624.

    We argue that insurance can act as an analogon for the social situatedness of machine learning systems, hence allowing machine learning scholars to take insights from the rich and interdisciplinary insurance literature. Tracing the interaction of uncertainty, fairness and responsibility in insurance provides a fresh perspective on fairness in machine learning. We link insurance fairness conceptions to their machine learning relatives, and use this bridge to problematize fairness as calibration. In this process, we bring to the forefront three themes that have been largely overlooked in the machine learning literature: responsibility, performativity and tensions between aggregate and individual.

  • Höltgen, Benedikt, and Robert C Williamson. 2023. “On the Richness of Calibration”. In FAccT 2023.

    Probabilistic predictions can be evaluated through comparisons with observed label frequencies, that is, through the lens of calibration. Recent scholarship on algorithmic fairness has started to look at a growing variety of calibration-based objectives under the name of multi-calibration but has still remained fairly restricted. In this paper, we explore and analyse forms of evaluation through calibration by making explicit the choices involved in designing calibration scores. We organise these into three grouping choices and a choice concerning the agglomeration of group errors. This provides a framework for comparing previously proposed calibration scores and helps to formulate novel ones with desirable mathematical properties. In particular, we explore the possibility of grouping datapoints based on their input features rather than on predictions and formally demonstrate advantages of such approaches. We also characterise the space of suitable agglomeration functions for group errors, generalising previously proposed calibration scores. Complementary to such population-level scores, we explore calibration scores at the individual level and analyse their relationship to choices of grouping. We draw on these insights to introduce and axiomatise fairness deviation measures for population-level scores. We demonstrate that with appropriate choices of grouping, these novel global fairness scores can provide notions of (sub-)group or individual fairness.
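
    A rough sketch (illustrative only, not the paper's formalism) of the grouping choice highlighted above: a binned calibration score where groups are formed either from the predictions themselves or from an input feature, with an agglomeration function over group errors.

    ```python
    import numpy as np

    def grouped_calibration_error(groups, preds, labels, agglomerate=np.max):
        """Per-group |mean prediction - mean label|, combined by `agglomerate`."""
        errs = [abs(preds[groups == g].mean() - labels[groups == g].mean())
                for g in np.unique(groups)]
        return float(agglomerate(errs))

    rng = np.random.default_rng(1)
    x = rng.uniform(size=5000)                    # a single input feature
    labels = rng.binomial(1, 0.2 + 0.6 * x)       # true P(y = 1 | x) = 0.2 + 0.6 x
    preds = np.full_like(x, labels.mean())        # constant, marginally calibrated predictor

    by_prediction = np.digitize(preds, np.linspace(0, 1, 11))
    by_feature = np.digitize(x, np.linspace(0, 1, 11))

    print(grouped_calibration_error(by_prediction, preds, labels))  # looks calibrated
    print(grouped_calibration_error(by_feature, preds, labels))     # feature groups expose the gap
    ```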

    So many options,
    so much potential, hiding
    in calibration.

  • Derr, Rabanus, and Robert C Williamson. 2023. “The Set Structure of Precision: Coherent Probabilities on Pre-Dynkin-Systems”. ArXiv. arXiv:2302.03522.

    In the literature on imprecise probability, little attention is paid to the fact that imprecise probabilities are precise on some events. We call these sets systems of precision. We show that, under mild assumptions, the system of precision of a lower and upper probability form a so-called (pre-)Dynkin-system. Interestingly, there are several settings, ranging from machine learning on partial data over frequential probability theory to quantum probability theory and decision making under uncertainty, in which a priori the probabilities are only desired to be precise on a specific underlying set system. At the core of all of these settings lies the observation that precise beliefs, probabilities or frequencies on two events do not necessarily imply this precision to hold for the intersection of those events. Here, (pre-)Dynkin-systems have been adopted as systems of precision, too. We show that, under extendability conditions, those pre-Dynkin-systems equipped with probabilities can be embedded into algebras of sets. Surprisingly, the extendability conditions elaborated in a strand of work in quantum physics are equivalent to coherence in the sense of Walley [Walley, 1991, p. 84]. Thus, the literature on probabilities on pre-Dynkin-systems is linked to the literature on imprecise probability. Finally, we spell out a lattice duality which rigorously relates the system of precision to credal sets of probabilities. In particular, we provide a hitherto undescribed, parametrized family of coherent imprecise probabilities.

  • Fröhlich, Christian, and Robert C Williamson. 2023. “Tailoring to the Tails: Risk Measures for Fine-Grained Tail Sensitivity”. Transactions on Machine Learning Research.

    Expected risk minimization (ERM) is at the core of machine learning systems. This means that the risk inherent in a loss distribution is summarized using a single number - its average. In this paper, we propose a general approach to construct risk measures which exhibit a desired tail sensitivity and may replace the expectation operator in ERM. Our method relies on the specification of a reference distribution with a desired tail behaviour, which is in a one-to-one correspondence to a coherent upper probability. Any risk measure, which is compatible with this upper probability, displays a tail sensitivity which is finely tuned to the reference distribution. As a concrete example, we focus on divergence risk measures based on f-divergence ambiguity sets, which are a widespread tool used to foster distributional robustness of machine learning systems. For instance, we show how ambiguity sets based on the Kullback-Leibler divergence are intricately tied to the class of subexponential random variables. We elaborate the connection of divergence risk measures and rearrangement invariant Banach norms.
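
    For concreteness, the Kullback-Leibler case mentioned above admits a well-known dual form (a standard result, stated here in generic notation rather than the paper's):

    \[
    \sup_{Q :\, D_{\mathrm{KL}}(Q \,\|\, P) \le \varepsilon} \mathbb{E}_{Q}[X] \;=\; \inf_{\lambda > 0} \; \lambda \log \mathbb{E}_{P}\!\left[e^{X/\lambda}\right] + \lambda \varepsilon,
    \]

    which is finite only when \(X\) has finite exponential moments under \(P\); this is one way to see the connection to subexponential random variables noted in the abstract.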


    Rare loss. Control how?
    The fundamental function.
    Your tails are tailored.

  • Derr, Rabanus, and Robert C Williamson. 2022. “Fairness and Randomness in Machine Learning: Statistical Independence and Relativization”. ArXiv Preprint ArXiv:2207.13596.

    Fair Machine Learning endeavors to prevent unfairness arising in the context of machine learning applications embedded in society. Despite the variety of definitions of fairness and proposed "fair algorithms", there remain unresolved conceptual problems regarding fairness. In this paper, we dissect the role of statistical independence in fairness and randomness notions regularly used in machine learning. Thereby, we are led to a surprising hypothesis: randomness and fairness can be considered equivalent concepts in machine learning.
    In particular, we obtain a relativized notion of randomness expressed as statistical independence by appealing to Von Mises' century-old foundations for probability. This notion turns out to be "orthogonal" in an abstract sense to the commonly used i.i.d.-randomness. Using standard fairness notions in machine learning, which are defined via statistical independence, we then link the ex ante randomness assumptions about the data to the ex post requirements for fair predictions. This connection proves fruitful: we use it to argue that randomness and fairness are essentially relative and that both concepts should reflect their nature as modeling assumptions in machine learning.

  • Mansour, Yishay, Richard Nock, and Robert C Williamson. 2022. “What Killed the Convex Booster?”. ArXiv Preprint ArXiv:2205.09628.

    A landmark negative result of Long and Servedio established a worst-case spectacular failure of a supervised learning trio (loss, algorithm, model) otherwise praised for its high precision machinery. Hundreds of papers followed up on the two suspected culprits: the loss (for being convex) and/or the algorithm (for fitting a classical boosting blueprint). Here, we call to the half-century+ founding theory of losses for class probability estimation (properness), an extension of Long and Servedio's results and a new general boosting algorithm to demonstrate that the real culprit in their specific context was in fact the (linear) model class. We advocate for a more general standpoint on the problem as we argue that the source of the negative result lies in the dark side of a pervasive -- and otherwise prized -- aspect of ML: parameterisation.

  • McCalman, Lachlan, Daniel Steinberg, Grace Abuhamad, Marc-Etienne Brunet, Robert C Williamson, and Richard Zemel. 2022. “Assessing AI Fairness in Finance”. Computer 55 (1): 94-97.

    If society demands that a bank’s use of artificial intelligence systems is “fair,” what is the bank to actually do? This article outlines a pragmatic and defensible answer.

  • Williamson, Robert C. 2021. “The AI of Ethics”. In Machines We Trust: Perspectives on Dependable AI, 139-60. Boston: MIT Press.

    The spectacular rise of AI has led to anxiety regarding the ethics of AI—how one might conceive and control the ethical impacts of AI technology. I will first sketch the traditional framing of “the ethics of AI.” I will then provide an inverse view that sees AI as an extension of human reasoning, with concomitantly different conclusions regarding decisional autonomy. I will further argue that the solution to many ethical problems relies upon the better use of cognitive technologies such as AI. That is, we have largely had it backwards — it is the AI of ethics that warrants our attention, and the root of the harm is the use of AI, not the technology itself.