Model independent feature attributions: Shapley values that uncover non-linear dependencies

PeerJ Computer Science

Introduction

There are many different meanings of the term “feature importance”, even in the context of Shapley values. Indeed, the meaning of a Shapley value depends on the underlying game formulation, referred to by Merrick & Taly (2019) as the explanation game, though this is so far rarely discussed explicitly in the existing literature. In general, Shapley value explanation games can be distinguished as belonging to either the model-dependent category or the model-independent category. The latter category is distinguished by an absence of assumptions regarding the data generating process (DGP). Here, the term model-dependent refers to when the Shapley value depends on a choice of fitted model (such as the output of a machine learning algorithm), or on a set of fitted models (such as the set of sub-models of a linear model).

Shapley values that uncover non-linear dependencies (Sunnies) is, to the best of our knowledge, the only Shapley-based feature importance method that falls into the model-independent category. In this category, feature importance scores attempt to determine what is a priori important, in the sense of understanding the partial dependence structures within the joint distribution describing the DGP. We show that such model-independent feature importance scores can appropriately be used as model diagnostic procedures, as well as procedures for exploratory data analysis.

Existing methods in the model-dependent category, on the other hand, seek to uncover what is perceived as important by the model (or class of models), either with regard to a performance measure (e.g., a goodness-of-fit measure) or for measuring local influences on model predictions. Model-dependent definitions of feature importance scores can be distinguished further according to whether they depend on a fitted (i.e., trained) model or on an unfitted class of models. We refer to these as within-model scores and between-model scores, respectively. This distinction is important, since the objectives are markedly different.

Within-model Shapley values seek to describe how the model reacts to a variety of inputs, while, e.g., accounting for correlated features in the training data by systematically setting “absent” features to a reference input value, such as a conditional expectation. There are many use cases for within-model Shapley values, such as providing transparency to model predictions, e.g. for explaining a specific credit decision or detecting algorithmic discrimination (Datta, Sen & Zick, 2016), as well as understanding model structure, measuring interaction effects and detecting concept drift (Lundberg et al., 2020).

All within-model Shapley values that we are aware of fall into the class of single reference games, described by Merrick & Taly (2019). These include SAGE (Covert, Lundberg & Lee, 2020); SHAP (Lundberg & Lee, 2017); Shapley Sampling Values (Štrumbelj & Kononenko, 2013); Quantitative Input Influence (Datta, Sen & Zick, 2016); the Interactions-based Method for Explanation (IME) (Štrumbelj, Kononenko & Šikonja, 2009); and TreeExplainer (Lundberg et al., 2020). Note that some within-model feature importance methods, such as SHAP, can be described as model-agnostic, since they may be applied to any trained model. Regardless, such values are dependent on a prior choice of fitted model.

In contrast to within-model Shapley values, between-model Shapley values seek to determine which features influence an outcome of the model fitting procedure, itself, by repeatedly refitting the model to compute each marginal contribution. Such scores have been applied, for example, as a means for feature importance ranking in regression models. These include Shapley Regression Values (Lipovetsky & Conklin, 2001), ANOVA Shapley values (Owen & Prieur, 2017), and our prior work (Fryer, Strumke & Nguyen, 2020). The existing between-model feature importance scores are all global feature importance scores, since they return a single Shapley value for each feature, over the entire data set. Sunnies is also a global score, though not a between-model score.

A number of publications and associated software packages have been produced recently to efficiently estimate or calculate SHAP values. Tree SHAP, Kernel SHAP, Shapley Sampling Values, Max SHAP, Deep SHAP, Linear SHAP and Low-Order SHAP are all methods for either approximating or calculating SHAP values. However, these efficient model-dependent methods are designed for local within-model scores, and are not suitable for Sunnies, which is a global and model-independent score. Although Sunnies does not fit under these model-dependent frameworks for efficient estimation, Shapley values in general can be approximated via a consistent Monte Carlo algorithm introduced by Song, Nelson & Staum (2016). In any case, computational details are not the focus of this paper; we focus instead on the concept and relevance of Sunnies.

In “Shapley Decomposition”, we introduce the concept of the Shapley value and its decomposition. We then introduce the notion of attributed dependence on labels (ADL), and briefly demonstrate the behaviour of the $R^2$ characteristic function on a data set with non-linear dependence, to motivate our alternative measures of non-linear dependence in place of $R^2$. In “Measures of non-linear dependence”, we describe three such measures: the Hilbert Schmidt Independence Criterion (HSIC), the Distance Correlation (DC) and the Affine-Invariant Distance Correlation (AIDC). We use these as characteristic functions throughout the remainder of the work, although we focus primarily on the DC.

The DC, HSIC and AIDC do not constitute an exhaustive list of the available measures of non-linear dependence. We do not provide here a comparison of their strengths and weaknesses. Instead, our objective is to propose and demonstrate a variety of use cases for the general technique of computing Shapley values for model-independent measures of statistical dependence.

In “Exploration”, we demonstrate the value of ADL for exploratory data analysis, using a simulated DGP that exhibits mutual dependence without pairwise dependence. We also leverage this example to compare ADL to popular pairwise and model-dependent measures of dependence, highlighting a drawback of the pairwise methods, and of the popular XGBoost built-in “feature importance” score. We also show that SHAP performs favourably here. In “Diagnostics”, we introduce the concepts of attributed dependence on predictions (ADP) and attributed dependence on residuals (ADR). Using simulated DGPs, we demonstrate the potential for ADL, ADP and ADR to uncover and diagnose model misspecification and concept drift. For the concept drift demonstration (“Demonstration with concept drift”), we see that ADL provides comparable results to SAGE and SHAP, but without the need for a fitted model. Conclusions are drawn in Section “Discussion and Future Work”.

Shapley Decomposition

In approaching the question: “How do the different features X = (X1, …, Xd) in this data set affect the outcome Y?”, the concept of a Shapley value is useful. The Shapley value has a long history in the theory of cooperative games, since its introduction in Shapley (1953), attracting the attention of various Nobel prize-winning economists (cf. Roth, 1988), and enjoying a recent surge of interest in the statistics and machine learning literature. Shapley (1953) formulated the Shapley value as the unique game theoretic solution concept, which satisfies a set of four simple and apparently desirable axioms: efficiency, additivity, symmetry and the null player axiom. For a recent monograph, defining these four axioms and introducing solution concepts in cooperative games, consult Algaba, Fragnelli & Sánchez-Soriano (2019).

As argued by Lipovetsky & Conklin (2001); Israeli (2007); Huettner & Sunder (2012), we can think of the outcome C(S) of a prediction or regression task as the outcome of a cooperative game, in which the set S = {X1, …, Xd} of data features represents a coalition of players in the game. The function C is known as the characteristic function of the game. It maps elements S, in the power set $2^{[d]}$ of players (where $[d] = \{1, \ldots, d\}$ and d is the number of players), to a set of payoffs (or outcomes), and thus fully describes the game. The marginal contribution of a player $v \notin S$ to a team S is defined as $C(S \cup \{v\}) - C(S)$. The average marginal contribution of player v, over the set $S_k$ of all teams of size k that exclude v, is
$$\bar{C}_k^v = \frac{1}{|S_k|} \sum_{S \in S_k} \left[ C(S \cup \{v\}) - C(S) \right], \quad (1)$$
where $|S_k| = \binom{d-1}{k}$. The Shapley value of player v, then, is given by
$$\phi_v(C) = \frac{1}{d} \sum_{k=0}^{d-1} \bar{C}_k^v, \quad (2)$$
i.e., $\phi_v(C)$ is the average of $\bar{C}_k^v$ over all team sizes k.
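
To make Eqs. (1) and (2) concrete, here is a minimal Python sketch that computes the Shapley decomposition of an arbitrary characteristic function by exhaustive enumeration (the function name and interface are ours, for illustration; exact enumeration is feasible only for small d):

```python
from itertools import combinations
from math import comb

def shapley_values(char_fn, d):
    """Exact Shapley decomposition of a characteristic function.

    char_fn: maps a frozenset S of player indices in {0, ..., d-1}
             to a real payoff C(S).
    Returns [phi_0, ..., phi_{d-1}], following Eqs. (1) and (2).
    """
    phi = []
    for v in range(d):
        others = [j for j in range(d) if j != v]
        total = 0.0
        for k in range(d):  # team sizes 0, ..., d-1
            # Average marginal contribution over all teams of size k excluding v
            avg = sum(
                char_fn(frozenset(S) | {v}) - char_fn(frozenset(S))
                for S in combinations(others, k)
            ) / comb(d - 1, k)
            total += avg
        phi.append(total / d)
    return phi
```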

Attributed dependence on labels

The characteristic function C(S) in Eq. (1) produces a single payoff for the features with indices in S. In the context of statistical modelling, the characteristic function will depend on Y and X. To express this, we introduce the notation $X|_S = (X_j)_{j \in S}$ for the projection of the feature vector onto the coordinates specified by S, and we write the characteristic function $C_Y(S)$ with subscript Y to clarify its dependence on Y as well as X (via S). Now, we can define a new characteristic function $R_Y$ in terms of the popular coefficient of multiple correlation $R^2$, as
$$R_Y(S) = R^2(Y, X|_S) = 1 - \frac{|\mathrm{Cor}(Y, X|_S)|}{|\mathrm{Cor}(X|_S)|}, \quad (3)$$
where $|\cdot|$ and $\mathrm{Cor}(\cdot)$ denote the determinant operator and the correlation matrix, respectively, and $\mathrm{Cor}(Y, X|_S)$ is the correlation matrix of the joint vector $(Y, X|_S)$ (cf. Fryer, Strumke & Nguyen, 2020).

The set of Shapley values of all features in X, using characteristic function C, is known as the Shapley decomposition of C amongst the features in X. For example, the Shapley decomposition of $R_Y$, from Eq. (3), is the set $\{\phi_v(R_Y) : v \in [d]\}$, calculated via Eq. (2).

In practice, the joint distribution of $(Y, X^T)$ is unknown, so the Shapley decomposition of C is estimated via substitution of an empirical characteristic function $\hat{C}$ in Eq. (1). In this context, we work with an $n \times |S|$ data matrix $\mathbf{X}|_S$, whose $i$th row is the vector $x_i|_S = (x_{ij})_{j \in S}$, representing a single observation from $X|_S$. As a function of this observed data, along with the vector of observed labels $y = (y_i)_{i \in [n]}$, the empirical characteristic function $\hat{C}_y$ produces an estimate of $C_Y$ that, with Eqs. (1) and (2), gives the estimate $\phi_v(\hat{C}_y)$, which we refer to as the Attributed Dependence on Labels (ADL) for feature v.

Recognising dependence: Example 1

For example, the empirical $R^2$ characteristic function $\hat{R}_y$ is given by
$$\hat{R}_y(S) = 1 - \frac{|\rho(y, \mathbf{X}|_S)|}{|\rho(\mathbf{X}|_S)|},$$
where $\rho(\cdot)$ denotes the empirical Pearson correlation matrix of its arguments.
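
This empirical characteristic function translates directly into code. A minimal sketch (the helper name r2_char_fn and the empty-coalition convention $\hat{R}_y(\emptyset) = 0$ are our own choices):

```python
import numpy as np

def r2_char_fn(y, X, S):
    """Empirical R^2 characteristic function R_hat_y(S).

    y: (n,) label vector; X: (n, d) feature matrix; S: iterable of column indices.
    """
    S = sorted(S)
    if len(S) == 0:
        return 0.0  # convention: no features, no explained variance
    joint = np.column_stack([y, X[:, S]])
    rho_joint = np.corrcoef(joint, rowvar=False)    # correlation of (y, X|S)
    if len(S) == 1:
        det_feats = 1.0  # determinant of a 1x1 correlation matrix
    else:
        det_feats = np.linalg.det(np.corrcoef(X[:, S], rowvar=False))
    return 1.0 - np.linalg.det(rho_joint) / det_feats
```

Combined with the shapley_values sketch above, the ADL under the $R^2$ characteristic function is then shapley_values(lambda S: r2_char_fn(y, X, S), X.shape[1]).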

Regardless of whether we use a population measure or an estimate, $R^2$ measures only the linear relationship between the response (i.e., the labels) Y and the features X. This implies that $R^2$ may perform poorly as a measure of dependence in the presence of non-linearity. The following example, with a non-linear DGP, demonstrates this point.

Suppose the features $X_j$, $j \in [d]$, are independently uniformly distributed on $[-1, 1]$. Given a diagonal matrix $A = \mathrm{diag}(a_1, \ldots, a_d)$, let the response variable Y be determined by the quadratic form
$$Y = X^T A X = a_1 X_1^2 + \cdots + a_d X_d^2. \quad (5)$$
Then, the covariance $\mathrm{Cov}(Y, X_j) = 0$ for all $j \in [d]$. This is because
$$\mathrm{Cov}(X^T A X, X_j) = \sum_{k=1}^{d} a_k \mathrm{Cov}(X_k^2, X_j) = 0,$$
since, for $k \neq j$, independence gives $\mathrm{Cov}(X_k^2, X_j) = 0$, while $\mathrm{Cov}(X_j^2, X_j) = E[X_j^3] - E[X_j^2]E[X_j] = 0$ because $E[X_j] = 0$ and $E[X_j^3] = 0$. In Fig. 1, we display the $X_4$ cross section of 10,000 observations generated from Eq. (5) with d = 5 and $A = \mathrm{diag}(0, 2, 4, 6, 8)$, along with the least squares line of best fit and associated $R^2$ value. We visualize the results for the corresponding Shapley decomposition in Fig. 2. As expected, we see that the $R^2$ is not able to capture the non-linear dependence structure of Eq. (5), and thus neither is its Shapley decomposition.
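
This independence claim is easy to check by simulation. A quick sketch (seed and variable names are ours), reusing the shapley_values and r2_char_fn helpers from the sketches above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
a = np.array([0.0, 2.0, 4.0, 6.0, 8.0])  # A = diag(0, 2, 4, 6, 8)

X = rng.uniform(-1, 1, size=(n, d))
y = X**2 @ a  # Y = X^T A X = sum_j a_j X_j^2

# Pairwise covariances Cov(Y, X_j) are all near zero ...
print(np.round([np.cov(y, X[:, j])[0, 1] for j in range(d)], 3))
# ... and so is the R^2-based Shapley decomposition (ADL):
print(np.round(shapley_values(lambda S: r2_char_fn(y, X, S), d), 4))
```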

Figure 1: Feature X4, from Eq. (5), cross section with 100 least squares lines of best fit, each produced from a random sample of size 1,000, from the simulated population of size 10,000.
The estimate of $R^2$ is 0.0043, with 95% bootstrap confidence interval (0.001, 0.013) over 100 fits. The $R^2$ is close to 0 despite the presence of strong (non-linear) dependence.
Figure 2: Shapley decompositions using the four measures of dependence described in “Measures of non-linear dependence”, normalised for comparability, with sample size 1,000 over 1,000 iterations.

We note that improvements on the results in Figs. 1 and 2 can be obtained by choosing a suitable linearising transformation of the features or response prior to calculating $R^2$, but such a transformation cannot, in general, be discerned from the data, except in the simplest cases.

Measures of non-linear dependence

In the following, we describe three measures of non-linear dependence that, when used as a characteristic function C, have the following properties.

  • Independence is detectable (in theory), i.e., if C(S) = 0, then the variables Y and X|S are independent. Equivalently, dependence is visible, i.e., if Y and X|S are dependent, then C(S) ≠ 0.

  • C is model-independent. Thus, no assumptions are made about the DGP and no associated feature engineering or transformation of X or Y is necessary. Note that, since the Shapley values sum, by efficiency, to the distance correlation, we do get the guarantee that dependence on [d] is visible in the sum of Shapley values, by virtue of equalling C([d]). However, each individual Shapley value is not a distance correlation, but a linear combination of distance correlations, and thus cannot itself generally be interpreted as a distance correlation. The same is true for any sum of a strict subset of the Shapley values, since efficiency applies to the sum of all Shapley values, and not a strict subset of them.

Distance correlation and affine invariant distance correlation

The distance correlation, and its affine invariant adaptation, were both introduced by Székely, Rizzo & Bakirov (2007). Unlike the Pearson correlation, the distance correlation between Y and X is zero if and only if Y and X are statistically independent. However, the distance correlation is equal to 1 only if the dimensions of the linear spaces spanned by Y and X are equal, almost surely, and Y is a linear function of X.

First, the population distance covariance between the response Y and feature vector X is defined as a weighted $L_2$ norm of the difference between the joint characteristic function¹ $f_{YX}$ and the product of the marginal characteristic functions $f_Y f_X$. In essence, this is a measure of squared deviation from the assumption of independence, i.e., from the hypothesis that $f_{YX} = f_Y f_X$.

The empirical distance covariance $\hat{\mathcal{V}}^2$ is based on Euclidean distances between sample elements, and can be computed from data matrices $\mathbf{Y}$, $\mathbf{X}$ as
$$\hat{\mathcal{V}}^2(\mathbf{Y}, \mathbf{X}) = \frac{1}{n^2}\sum_{i,j=1}^{n} A(\mathbf{Y})_{ij}\, A(\mathbf{X})_{ij},$$
where the matrix function $A(\mathbf{W})$, for $\mathbf{W} \in \{\mathbf{Y}, \mathbf{X}\}$, is given by
$$A(\mathbf{W})_{ij} = B(\mathbf{W})_{ij} - \frac{1}{n}\sum_{i'=1}^{n} B(\mathbf{W})_{i'j} - \frac{1}{n}\sum_{j'=1}^{n} B(\mathbf{W})_{ij'} + \frac{1}{n^2}\sum_{i',j'=1}^{n} B(\mathbf{W})_{i'j'},$$
and $B(\mathbf{W})$ is the $n \times n$ distance matrix with $B(\mathbf{W})_{ij} = \|w_i - w_j\|$, with $\|\cdot\|$ denoting the Euclidean norm and $w_i$ the $i$th observation (row) of $\mathbf{W}$. Here, $\mathbf{Y}$ is in general a matrix of observations, with potentially multiple features. Notice the difference between $\mathbf{Y}$ and y, where the latter is the (single column) label vector introduced in “Attributed dependence on labels”.

The empirical distance correlation $\hat{\mathcal{R}}$ is given by
$$\hat{\mathcal{R}}^2(\mathbf{Y}, \mathbf{X}) = \frac{\hat{\mathcal{V}}^2(\mathbf{Y}, \mathbf{X})}{\sqrt{\hat{\mathcal{V}}^2(\mathbf{Y}, \mathbf{Y})\, \hat{\mathcal{V}}^2(\mathbf{X}, \mathbf{X})}}$$
for $\hat{\mathcal{V}}^2(\mathbf{Y}, \mathbf{Y})\, \hat{\mathcal{V}}^2(\mathbf{X}, \mathbf{X}) \neq 0$, and $\hat{\mathcal{R}}(\mathbf{Y}, \mathbf{X}) = 0$ otherwise. For our purposes, we define the distance correlation characteristic function estimator
$$\hat{D}_y(S) = \hat{\mathcal{R}}^2(y, \mathbf{X}|_S). \quad (8)$$
A transformation of the form $x \mapsto Ax + b$, for a matrix A and vector b, is called affine. Affine invariance of the distance correlation is desirable, particularly in the context of hypothesis testing, since statistical independence is preserved under the group of affine transformations. When $\mathbf{Y}$ and $\mathbf{X}$ are first rescaled as $\mathbf{Y} S_Y^{-1/2}$ and $\mathbf{X} S_X^{-1/2}$, where $S_Y$ and $S_X$ are the respective sample covariance matrices, the distance correlation $\hat{\mathcal{R}}(\mathbf{Y} S_Y^{-1/2}, \mathbf{X} S_X^{-1/2})$ becomes invariant under any affine transformation of $\mathbf{Y}$ and $\mathbf{X}$ (Székely, Rizzo & Bakirov, 2007, Section 3.2). Thus, the empirical affine invariant distance correlation is defined by
$$\hat{\mathcal{R}}^*(\mathbf{Y}, \mathbf{X}) = \hat{\mathcal{R}}(\mathbf{Y} S_Y^{-1/2}, \mathbf{X} S_X^{-1/2}),$$
and we define the associated characteristic function estimator $\hat{D}^*_y$ in the same manner as Eq. (8). Monte Carlo studies regarding the properties of these measures are given by Székely, Rizzo & Bakirov (2007).
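
The double-centering above translates into a few lines of NumPy. A sketch of the empirical distance correlation and the characteristic function estimator of Eq. (8), with our own helper names (whiten implements the rescaling used by the affine invariant variant):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def _centered_dist(W):
    """Double-centered Euclidean distance matrix A(W) of the rows of 2-D W."""
    B = squareform(pdist(W))
    return B - B.mean(axis=0) - B.mean(axis=1, keepdims=True) + B.mean()

def dcov2(Y, X):
    """Empirical squared distance covariance between data matrices Y and X."""
    return (_centered_dist(Y) * _centered_dist(X)).mean()

def dcor2(Y, X):
    """Empirical squared distance correlation."""
    denom = np.sqrt(dcov2(Y, Y) * dcov2(X, X))
    return dcov2(Y, X) / denom if denom > 0 else 0.0

def whiten(W):
    """Rescale W by S_W^{-1/2}, as used by the affine invariant variant."""
    S = np.atleast_2d(np.cov(W, rowvar=False))
    vals, vecs = np.linalg.eigh(S)
    return W @ (vecs @ np.diag(vals ** -0.5) @ vecs.T)

def dc_char_fn(y, X, S):
    """DC characteristic function estimator of Eq. (8). The AIDC variant
    replaces the last line by dcor2(whiten(y.reshape(-1, 1)), whiten(X[:, S]))."""
    S = sorted(S)
    if not S:
        return 0.0  # convention for the empty coalition
    return dcor2(y.reshape(-1, 1), X[:, S])
```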

Hilbert–Schmidt independence criterion

The Hilbert Schmidt Independence Criterion (HSIC) is a kernel-based independence criterion, first introduced by Gretton et al. (2005a). Kernel-based independence detection methods have been adopted in a wide range of areas, such as independent component analysis (Gretton et al., 2007). The link between energy distance-based measures, such as the distance correlation, and kernel-based measures, such as the HSIC, was established by Sejdinovic et al. (2013). There, it is shown that the HSIC is a certain formal extension of the distance correlation.

The HSIC makes use of the cross-covariance operator $C_{YX}$ between random vectors Y and X, which generalises the notion of a covariance. The response Y and feature vector X are each mapped to functions in a Reproducing Kernel Hilbert Space (RKHS), and the HSIC is defined as the Hilbert–Schmidt (HS) norm $\|C_{YX}\|^2_{HS}$ of the cross-covariance operator between these two spaces (Gretton et al., 2005b; Gretton et al., 2007; Gretton et al., 2005a). Given two kernels $\ell$ and $k$, associated with the RKHSs of Y and X, respectively, and their empirical evaluation matrices L and K, with row-$i$, column-$j$ elements $\ell_{ij} = \ell(y_i, y_j)$ and $k_{ij} = k(x_i, x_j)$, where $y_i$ and $x_i$ denote the $i$th observations (rows) of the data matrices $\mathbf{Y}$ and $\mathbf{X}$, respectively, the empirical HSIC can be calculated as
$$\widehat{\mathrm{HSIC}}(\mathbf{Y}, \mathbf{X}) = \frac{1}{n^2}\sum_{i,j}^{n} k_{ij}\,\ell_{ij} + \frac{1}{n^4}\sum_{i,j,q,r}^{n} k_{ij}\,\ell_{qr} - \frac{2}{n^3}\sum_{i,j,q}^{n} k_{ij}\,\ell_{iq}.$$
As in “Distance correlation and affine invariant distance correlation”, notice the difference between $\mathbf{Y}$ and y, where the latter is the (single column) label vector introduced in “Attributed dependence on labels”. Intuitively, this approach endows the cross-covariance operator with the ability to detect non-linear dependence, and the HS norm measures the combined magnitude of the resulting dependence. For a thorough discussion of positive definite kernels, with a machine learning emphasis, see the work of Hein & Bousquet (2004).

Calculating the HSIC requires selecting a kernel. The Gaussian kernel is a popular choice that has been subjected to extensive testing in comparison to other kernel methods (see, e.g., Gretton et al., 2005a). For our purposes, we define the empirical HSIC characteristic function by $\hat{H}_y(S) = \widehat{\mathrm{HSIC}}(y, \mathbf{X}|_S)$, and use a Gaussian kernel. Figure 2 shows the Shapley decomposition of $\hat{H}$ amongst the features generated from Eq. (5), again with d = 5 and $A = \mathrm{diag}(0, 2, 4, 6, 8)$. The decomposition has been normalised for comparability with the other measures of dependence presented in the figure. The HSIC can also be generalised to provide a measure of mutual dependence between any finite number of random vectors (Pfister et al., 2016).
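
A corresponding sketch of the biased empirical HSIC with a Gaussian kernel, computed via the algebraically equivalent trace form $\frac{1}{n^2}\mathrm{tr}(KHLH)$ with $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$; the median-heuristic bandwidth is our own choice, not a prescription of the paper:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def _gaussian_gram(W):
    """Gaussian kernel Gram matrix; bandwidth set by the median heuristic."""
    D2 = squareform(pdist(W, "sqeuclidean"))
    sigma2 = np.median(D2[D2 > 0]) if np.any(D2 > 0) else 1.0
    return np.exp(-D2 / (2.0 * sigma2))

def hsic(Y, X):
    """Biased empirical HSIC via (1/n^2) tr(K H L H), which expands to the
    three-sum expression given in the text."""
    n = X.shape[0]
    K, L = _gaussian_gram(X), _gaussian_gram(Y)
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H)) / n**2

def hsic_char_fn(y, X, S):
    """Empirical HSIC characteristic function H_hat_y(S)."""
    S = sorted(S)
    return hsic(y.reshape(-1, 1), X[:, S]) if S else 0.0
```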

Exploration

In machine learning problems, complete formal descriptions of the DGP are often impractical. However, there are advantages to gaining some understanding of the dependence structure. In particular, such an understanding is useful when inference about the data generating process is desired, such as in the contexts of causal inference, scientific inquiries (in general), or in qualitative investigations (cf. Navarro, 2018). In a regression or classification setting, the dependence structure between the features and response is an immediate point of focus. As we demonstrate in “Recognising dependence: Example 2”, the dependence structure cannot always be effectively probed by computing measures of dependence between labels and feature subsets, even when the number of marginal contributions is relatively small. In such cases, the Shapley value may not only allow us to summarise the interactions from many marginal contributions, but also to fairly distribute strength of dependence to the features.

Attributed dependence on labels (ADL) can be used for exploration in the absence of, or prior to, a choice of model; but, ADL can also be used in conjunction with a model—for example, to support, and even validate, model explanations. Even when a machine learning model is not parsimonious enough to be considered explainable, stakeholders in high risk settings may depend on the statement that “feature Xi is important for determining Y” in general. However, it is not always clear, in practice, whether such a statement about feature importance is being used to describe a property of the model, or a property of the DGP. In the following example, we demonstrate that ADL can be used to make statements about the DGP and to help qualify statements about a model.

Recognising dependence: Example 2

Consider a DGP involving the XOR function of two binary random variables $X_1, X_2$, with distributions given by $P(X_1 = 1) = P(X_2 = 1) = 1/2$. The response is given by
$$Y = \mathrm{XOR}(X_1, X_2) = X_1(1 - X_2) + X_2(1 - X_1). \quad (12)$$
Notice that $P(Y = i \mid X_k = j) = P(Y = i)$, for all $i, j \in \{0, 1\}$ and $k \in \{1, 2\}$. Thus, in this example, Y is completely statistically independent of each individual feature. However, since Y is determined entirely in terms of $(X_1, X_2)$, it is clear that Y is statistically dependent on the pair. Thus, the features individually appear to have little impact on the response, yet together they have a strong impact when their mutual influence is considered.
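
Both halves of this argument are easy to verify by simulation, as in the following sketch (which reuses dcor2 from “Distance correlation and affine invariant distance correlation”; we use a smaller sample than the paper's 10,000 to keep the n × n distance matrices small):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000
X1 = rng.integers(0, 2, n)
X2 = rng.integers(0, 2, n)
Y = X1 * (1 - X2) + X2 * (1 - X1)  # XOR, Eq. (12)

# Pairwise, Y looks independent of each feature ...
print(np.corrcoef(Y, X1)[0, 1], np.corrcoef(Y, X2)[0, 1])  # both near 0
# ... yet the pair (X1, X2) determines Y exactly:
print(dcor2(Y.reshape(-1, 1).astype(float),
            np.column_stack([X1, X2]).astype(float)))       # clearly positive
```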

Faced with a sample from $(Y, X_1, X_2)$, when the DGP is unknown, a typical exploratory practice is to take a sample correlation matrix to estimate $\mathrm{Cor}(Y, X)$, producing all pairwise sample correlations as estimates of $\mathrm{Cor}(Y, X_i)$, for $i \in [d]$. A similar approach, in the presence of suspected non-linearity, is to produce all pairwise distance correlations, or all pairwise HSIC values, rather than all pairwise correlations. Both the above approaches are model-independent. For comparison, consider a pairwise model-dependent approach: fitting individual single-feature models $M_i$, for $i \in [d]$, that each predict Y as a function of one feature $X_i$; and reporting a measure of model performance for each of the d models, standardised by the result of a null feature model—that is, a model with no features (that may, for example, guess labels completely at random, or may use empirical moments of the response distribution to inform its guesses, ignoring X entirely).

As demonstrated by the results in Table 1, it is not possible for pairwise methods to capture interaction effects and mutual dependencies between features. However, Shapley feature attributions can overcome this limitation, both in the case of Sunnies and in the case of SHAP. By taking an exhaustive permutations-based approach, Shapley values are able to deal effectively with partial dependencies and interaction effects amongst features. Note that all the Sunnies marginal contributions can, in this example, be derived from Table 1: the pairwise results state that $\hat{D}_y(\{1\}) = \hat{D}_y(\{2\}) = \hat{D}^*_y(\{1\}) = \hat{D}^*_y(\{2\}) = \hat{H}_y(\{1\}) = \hat{H}_y(\{2\}) = 0$, and from Table 1 we can also derive $\hat{D}_y(\{1, 2\}) = \hat{D}^*_y(\{1, 2\}) = 2 \times 0.265 = 0.53$ and $\hat{H}_y(\{1, 2\}) = 2 \times 0.16 = 0.32$.

Table 1:
Importances of features X1 and X2 assigned by various methods, using a sample size of 10,000 from the DGP in Eq. (12).
For pairwise XGBoost, we take the difference in mean squared prediction error between each XGBoost model and the null model (which always guesses 1). Pairwise dependence includes pairwise DC, HSIC, AIDC and Pearson correlation, which all give the same result of 0, due to statistical independence.
Method                     Result X1   Result X2
SHAP                       3.19        3.19
Shapley DC                 0.265       0.265
Shapley AIDC               0.265       0.265
Shapley HSIC               0.16        0.16
Pairwise XGB               0           0
Pairwise dependence        0           0
XGB feature importance     1           0
DOI: 10.7717/peerjcs.582/table-1

The discrete XOR example demonstrates that ADL captures important symmetry between features, while pairwise methods fail to do so. The results in the final two rows of Table 1 are produced as follows: we train an XGBoost classifier on the discrete XOR problem in Eq. (12). Then, to ascertain the importance of each of the features X1 and X2, in determining the target class, we use the XGBoost “feature importance” method, which defines a feature’s gain as “the improvement in accuracy brought by a feature to the branches it is on” (see https://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html).

Common experiences from users suggest that the XGBoost feature importance method can be unstable for less important features and in the presence of strong correlations between features (see e.g.,  https://stats.stackexchange.com/questions/279730/). However, in the current XOR example, features X1 and X2 are statistically independent (thus uncorrelated) and have the maximum importance that two equally important features can share (that is, together they produce the response deterministically).

Although the XGBoost classifier easily achieves perfect classification accuracy on a validation set, the associated XGBoost gain for X1 is Gain(X1) ≈ 0, while Gain(X2) ≈ 1, or vice versa. In other words, the full weight of the XGBoost feature importance under XOR is given to one or the other feature. This is misleading, as both features are equally important in determining XOR, and either feature alone is insufficient to achieve a classification accuracy greater than random guessing. In practice, ADL can help identify such flaws in other model explanation methods.
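
This behaviour is straightforward to reproduce. The sketch below (with our own, arbitrary hyperparameter choices) trains an XGBoost classifier on simulated XOR data and prints the normalised gain importances:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(2)
n = 10_000
X = rng.integers(0, 2, size=(n, 2))
y = X[:, 0] ^ X[:, 1]  # XOR labels

model = xgb.XGBClassifier(n_estimators=50, max_depth=2)
model.fit(X, y)

# Normalised gain importances: typically ~(0, 1) or ~(1, 0), despite the
# two features being perfectly symmetric in the DGP.
print(model.feature_importances_)
```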

Diagnostics

In the following diagnostics sections, we present results using the distance correlation. However, similar results can also be obtained using the HSIC and the AIDC.

Model attributed dependence

Given a fitted model f, with associated predictions $\hat{Y} = f(X)$, we seek to attribute shortcomings of the fitted model to individual features. We can do this by calculating the Shapley decomposition of the estimated strength of dependence between the model residuals $\varepsilon = Y - \hat{Y}$ and the features X. In other words, feature v receives the attribution $\phi_v(C_\varepsilon)$, estimated by $\phi_v(\hat{C}_e)$, where $e = y - \hat{y}$. We refer to this as the Attributed Dependence on Residuals (ADR) for feature v.

A different technique for diagnosing model misspecification is to calculate the Shapley decomposition of the estimated strength of dependence between $\hat{Y}$ and X, so that each feature v receives the attribution $\phi_v(\hat{C}_{\hat{y}})$. We call this the Attributed Dependence on Predictions (ADP) for feature v. This picture of the model-generated dependence structure may then be compared, for example, to the observed dependence structure given by the ADL $\{\phi_v(\hat{C}_y) : v \in [d]\}$. The diagnostic goal, then, may be to check that, for all v, $|\phi_v(\hat{C}_{\hat{y}}) - \phi_v(\hat{C}_y)| < \delta$, for some tolerance $\delta$. In other words, a diagnostic strategy making use of ADP is to compare estimates of feature importance under the model's representation of the joint distribution to estimates of feature importance under the empirical joint distribution, and thus to individually inspect each feature for an apparent change in predictive relevance.

We note that these techniques, ADP and ADR, are agnostic to the chosen model. All that is needed is the model outputs and the corresponding model inputs—the inner workings of the model are irrelevant for attributing dependence on predictions and residuals to individual features in this way.
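
In code, ADP and ADR reuse exactly the machinery already sketched for ADL; only the target vector passed to the dependence measure changes. A minimal sketch, assuming the shapley_values and dc_char_fn helpers from the earlier sections:

```python
def adl(y, X):
    """Attributed Dependence on Labels."""
    return shapley_values(lambda S: dc_char_fn(y, X, S), X.shape[1])

def adp(model, X):
    """Attributed Dependence on Predictions: decompose dependence on y_hat."""
    return adl(model.predict(X), X)

def adr(model, X, y):
    """Attributed Dependence on Residuals: decompose dependence on e = y - y_hat."""
    return adl(y - model.predict(X), X)
```

Note that only model.predict is required, which reflects the model-agnostic character of ADP and ADR described above.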

Demonstration with concept drift

We illustrate the ADR and ADL techniques with a simple and intuitive synthetic demonstration involving concept drift, where the DGP changes over time, impacting the mean squared prediction error (MSE) of a deployed XGBoost model. The model is originally trained under the assumption that the DGP is static, and the performance of the model is monitored over time with the intention of detecting violations of this assumption, as well as attributing any such violation to one or more features. A subset of the deployed features can be selected for scrutiny, by considering removal only of those selected features from the model. To highlight this, our simulated DGP has 50 features, and we perform diagnostics on 4 of those 50 features.

For comparison, we compute SAGE values of the model mean squared error (Covert, Lundberg & Lee, 2020) and we compute the mean SHAP values of the logarithm of the model loss function (Lundberg et al., 2020). We will refer to the latter as SHAPloss. For SAGE and SHAPloss values, we employ a DGP similar to Eq. (14), with sample size 1000, but with Xi ≡ 0 for all i > 4. These features were nullified for tractability of the SAGE computation, since, unlike for Sunnies, the authors are not aware of any established method for selectively computing SAGE values of a subset of the full feature set. SAGE and SHAPloss were chosen for their popularity and ability to provide global feature importance scores.

At the initial time t = 0, we define the DGP as a function of temporal increments $t \in \mathbb{N} \cup \{0\}$:
$$Y = X_1 + X_2 + \left(1 + \frac{t}{10}\right)X_3 + \left(1 - \frac{t}{10}\right)X_4 + \sum_{i=5}^{50} X_i, \quad (14)$$
where $X_i \sim N(0, 4)$, for i = 1, 2, 3, 4, and $X_i \sim N(0, 0.05)$, for $5 \leq i \leq 50$. Features 1 through 4 are the most effectual to begin with, and we can imagine that these were flagged as important during model development, justifying the additional diagnostic attention they enjoy after deployment. We see from Eq. (14) that, after deployment, i.e., during periods $1 \leq t \leq 10$, the effect of X4 decreases linearly to 0, while the effect of X3 increases proportionately over time. In what follows, these changes are clearly captured by the residual and response dependence attributions of those features, using the DC characteristic function.
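
A sketch of a simulator for this drifting DGP (the helper name and seed handling are ours, and we interpret the second parameter of N(·, ·) as a variance):

```python
import numpy as np

def drift_dgp(t, n=1000, rng=None):
    """Sample (X, Y) from Eq. (14) at time increment t."""
    rng = rng or np.random.default_rng()
    X = np.column_stack([
        rng.normal(0, 2.0, size=(n, 4)),             # X_1..X_4 ~ N(0, 4)
        rng.normal(0, np.sqrt(0.05), size=(n, 46)),  # X_5..X_50 ~ N(0, 0.05)
    ])
    coef = np.ones(50)
    coef[2] = 1 + t / 10  # X_3 gains influence as t grows
    coef[3] = 1 - t / 10  # X_4 loses influence, reaching 0 at t = 10
    return X, X @ coef
```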

The results, with a sample size of n = 1,000, from the DGP in Eq. (14), are presented in Fig. 3. According to the ADL (top), X4 shows early signs of significantly reduced importance $\phi_4(\hat{C}_y)$, as X3 shows an increase in importance $\phi_3(\hat{C}_y)$ that is roughly symmetrical to the decrease in $\phi_4(\hat{C}_y)$. The ADR (bottom) shows early significant signs that X3 is disproportionately affecting the residuals, with high $\phi_3(\hat{C}_e)$. The increase in residual attribution $\phi_4(\hat{C}_e)$ is also evident, though the observation $\phi_4(\hat{C}_e) < \phi_3(\hat{C}_e)$ suggests that the drift impact from X3 is the larger of the two.

Figure 3: Attributed dependence on (A) labels (ADL) and (B) residuals (ADR), using the DC characteristic function; results for times t ∈ {0, …, 10}, from a simulation with sample size 1,000 from the DGP Eq. (14).
The bootstrap confidence bands are the 95% middle quantiles (Q0.975 − Q0.025) from 100 subsamples of size 1,000. The ADL of features X3 and X4 appear to decay/increase over time, leading to significantly different ADL, compared to the other features. We also see that X3 and X4 have significantly higher ADR than the other features.

The resulting SAGE and mean SHAPloss values are presented in Fig. 4. Interestingly, the behaviours of SHAPloss and SAGE are (up to scale and translation) analogous to the behaviour of ADL, rather than ADR, despite the model-independence of ADL. A reason for this, in this example, is that the feature with higher (resp. lower) dependence on Y contributes less (resp. more) to the residuals. When interpreting the SHAPloss and SAGE outputs, it is important to note that the model loss is increasing with t, since the true planar trend in (YX3X4) rotates away from the learned trend. So, while the SAGE and SHAPloss results may appear to make the paradoxical suggestion that X3 is utilised better by the model at t = 10 compared with t = 0, this is not the case: SAGE and SHAPloss are not accounting for the change in model loss over time. The model loss may decrease more when marginalising a feature under a misspecified model, than under a model with lower overall loss.

Figure 4: SAGE and SHAPloss values for the modified simulated DGP Eq. (14) (see “Demonstration with concept drift”), with sample size 1,000.
The results are comparable to ADL (see Fig. 3). Features with high SAGE value contribute less to residuals, and vice versa for the SHAPloss values.

Demonstration with misspecified model

To illustrate the ADL, ADP and ADR techniques, we demonstrate a case where the model is misspecified on the training set, due to model bias. The inadequacy of this misspecified model is then detected on the validation set. Unlike the example given in “Demonstration with concept drift”, the DGP is unchanging between the two data sets. The key technique used in this demonstration is the comparison of differences between ADL (calculated in the absence of any model) and ADP (calculated using the output of a fitted model), in order to identify any differences in the attributions between dependence on labels and dependence on the predictions produced by the misspecified model. Such a comparison, between model absence and model outputs, is not possible using purely model-dependent Shapley values.

To make this example intuitive, we avoid using a complex model such as XGBoost, in favour of a linear regression model. Since the simulated DGP is also linear, this example allows a simple comparison between the correct model and the misspecified model. The DGP in this example is
$$Y = X_1 + X_2 + 5X_3X_4X_5 + \varepsilon, \quad (15)$$
where $X_1, X_2, X_3 \sim N(0, 1)$ are continuous, $\varepsilon \sim N(0, 0.1)$ is a small random error, and $X_4, X_5 \sim \mathrm{Bernoulli}(1/2)$ are binary. Hence, we can make the interpretation that the effect of X3 is modulated by X4 and X5, such that X3 is effective only if $X_4 = X_5 = 1$. For this demonstration, we fit a misspecified linear model $EY = \beta_0 + \beta X$, where $X^T = (X_i)_{i=1}^{d}$ is the vector of features, and $\beta_0$ and $\beta = (\beta_i)_{i=1}^{d}$ are real coefficients. This is a simple case where the true DGP is unknown to the analyst, who therefore seeks to summarise the 80 marginal contributions from 5 features into 5 Shapley values.
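
A sketch of this DGP and the misspecified least-squares fit (variable names are ours; we again interpret N(0, 0.1) as specifying a variance):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
X = np.column_stack([
    rng.normal(size=(n, 3)),                       # X_1, X_2, X_3 ~ N(0, 1)
    rng.integers(0, 2, size=(n, 2)).astype(float)  # X_4, X_5 ~ Bernoulli(1/2)
])
y = (X[:, 0] + X[:, 1] + 5 * X[:, 2] * X[:, 3] * X[:, 4]
     + rng.normal(0, np.sqrt(0.1), n))             # Eq. (15)

# Misspecified model: EY = beta_0 + beta X, omitting the three-way interaction
Z = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
residuals = y - Z @ beta
# ADL, ADP and ADR then follow from the helpers in "Model attributed dependence".
```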

Figure 5 shows the outputs for attributed dependence on labels, residuals and predictions, via ordinary least squares estimation. From these results, we make the following observations:

Figure 5: (A) Barbell plots showing differences in attributed dependence on labels (ADL), based on the DC characteristic function between the training and test sets, for each feature, for the misspecified model EY = β0 + βX with DGP (Eq. (15)). Larger differences indicate that the model fails to capture the dependence structure, effectively. (B) Bar chart representing attributed dependence on residuals (ADR) for the test set.
The shaded rectangles represent bootstrap confidence intervals, taken as the 95% middle quantile (Q0.975 − Q0.025) from 100 resamples of size 1,000. Non-overlapping rectangles indicate significant differences. Point markers represent individual observations from each of the 100 resamples.
  1. For X3 the ADP is significantly higher than the ADL.

  2. For X4 and X5 the ADP is significantly lower than the ADL.

  3. For X1, X2 there is no significant difference between ADP and ADL.

  4. For X1, X2 the ADR is negative, while X3, X4, X5 have positive ADR.

Observations 1 and 2 suggest that the model $EY = \beta_0 + \beta X$ overestimates the importance of X3 and underestimates the importance of X4 and X5. Observations 3 and 4 suggest that the model may adequately represent X1 and X2, but that X3, X4 and X5 are significantly more important for determining structure in the residuals than X1 and X2. A residuals-versus-fits plot may be useful for confirming that this structure is present and of large enough magnitude to be considered relevant.

Having observed the result in Fig. 5 for the misspecified linear model $EY = \beta_0 + \beta X$, we now fit the correct model, $EY = \beta_0 + \beta X + X_3X_4X_5$, which includes the three-way interaction effect $X_3X_4X_5$. Figure 6 shows no significant difference between the ADL and ADP for any of the features, and no significant difference in ADR between the features.

Figure 6: (A) Barbell plots showing differences in attributed dependence on labels (ADL), based on the DC characteristic function between the training and test sets, for each feature, for the correctly specified model with DGP (Eq. (15)). The shaded rectangles represent bootstrap confidence intervals, taken as the 95% middle quantile (Q0.975 − Q0.025) from 100 resamples of size 1,000. Overlapping rectangles indicate non-significant differences, suggesting no evidence of misspecification. Point markers represent individual observations from each of the 100 resamples. (B) Bar chart representing attributed dependence on residuals (ADR) for the test set.
Compare to Fig. 5.

Application to Detecting Gender Bias

We analyse a mortality data set produced by the US Centers for Disease Control (CDC) via the National Health and Nutrition Examination Survey (NHANES I) and the NHANES I Epidemiologic Follow-up Study (NHEFS) (Cox, 1998). The data set consists of 79 features from medical examinations of 14,407 individuals, aged between 25 and 75 years, followed between 1971 and 1992. Amongst these people, 4,785 deaths were recorded before 1992. A version of this data set was also recently made available in the SHAP package (Lundberg & Lee, 2017). The same data were recently analysed in Section 2.7, Lundberg et al. (2020) (see also https://github.com/suinleelab/treeexplainer-study/tree/master/notebooks/mortality).

We use a Cox proportional hazards objective function in XGBoost, with learning rate (eta) 0.002, maximum tree depth 3, subsampling ratio 0.5, and 5,000 trees. Our training set contains 3,370 observations, balanced via random sampling to contain an equal number of males and females. We then test the model on three different data sets: an all-male test set of size 1,686, containing all males not in the training data; an all-female test set of size 3,547, containing all females not in the training data; and a gender-balanced test set of size 3,372. The data are labelled with the observed time-to-death of each patient during the follow-up study. For model fitting, we use the 16 features given in Table 2.
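
For concreteness, the configuration described above can be sketched as follows. The hyperparameters are those stated in the text; the data construction is a schematic stand-in for the NHANES I preprocessing, and we rely on the convention of XGBoost's survival:cox objective that censored observations are encoded as negative times:

```python
import numpy as np
import xgboost as xgb

# Schematic stand-in for the prepared NHANES I training data:
# label > 0: observed time-to-death; label < 0: censored at |label|.
rng = np.random.default_rng(5)
X = rng.normal(size=(3370, 16))
time = rng.exponential(10.0, size=3370)
died = rng.integers(0, 2, size=3370).astype(bool)
dtrain = xgb.DMatrix(X, label=np.where(died, time, -time))

params = {
    "objective": "survival:cox",  # Cox proportional hazards objective
    "eta": 0.002,                 # learning rate
    "max_depth": 3,
    "subsample": 0.5,
}
model = xgb.train(params, dtrain, num_boost_round=5000)
```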

Table 2:
The 16 features used for fitting a Cox proportional hazards model to NHANES I and NHEFS data.
Feature name                 Feature name
Age                          Sex
Race                         Serum albumin
Serum cholesterol            Serum iron
Serum magnesium              Serum protein
Poverty index                Physical activity
Red blood cells              Diastolic blood pressure
Systolic blood pressure      Total iron binding capacity
Transferrin saturation       Body mass index
DOI: 10.7717/peerjcs.582/table-2

Of the features in Table 2, we focus on the Shapley values for a subset of well-established risk factors for mortality: age, physical activity, systolic blood pressure, cholesterol and BMI. Note that the results presented here are purely intended as a proof of concept—the results have not been investigated in a controlled study and none of the authors are experts in medicine. We do not intend for our results to be treated as a work of medical literature.

We decompose dependence on the labels, model predictions and residuals, amongst the three features age, systolic blood pressure (SBP) and physical activity (PA), displaying the resulting ADL, ADP and ADR for each of the three test data sets in Fig. 7 (using the DC characteristic function). From this analysis we make the following observations.

Figure 7: (A) Shapley decomposition of attributed dependence on labels (ADL, pink) and predictions (ADP, blue), and (B) residuals (ADR, orange), for the three features age, physical activity (PA) and systolic blood pressure (SBP), on three different test data sets consisting of an equal proportion of females and males (“balanced”), only males (“all male”) and only females (“all female”).

Attributions are based on the DC characteristic function.
  1. Age has a significantly higher attributed dependence on residuals compared with each of the other features, across all three test sets. This suggests that age may play an important role in the structure of the model’s residuals. This observation is supported by the dumbbells for age, which suggest a significant and sizeable difference between attributed dependence on prediction and attributed dependence on labels; that is, we have evidence that the model’s predictions show a greater attributed dependence on age than the labels do.

  2. For SBP, we observe no significant difference between ADL and ADP for the balanced and all male test sets. However, in the all female test set, we do see a significant and moderately sized reduction in the attributed dependence on SBP for the model’s predictions compared with that of labels. This suggests that the model may represent the relationship between SBP and log relative risk of mortality less effectively on the all female test set than on the other two test sets. This observation is supported by the attributed dependence on residuals for SBP, which is significantly higher in the all female test set compared to the other two sets.

  3. For PA, we see a low attributed dependence on residuals, and a non-significant difference between ADL and ADP, for all three test sets. Thus we do not have any reason, from this investigation, to suspect that the effect of physical activity is being poorly represented by the model.

The results regarding potential heterogeneity due to gender and systolic blood pressure are not surprising, given that we expect, a priori, there to be a relationship between systolic blood pressure and risk of mortality (Port et al., 2000), and that studies also indicate this relationship to be non-linear (Boutitie et al., 2002), as well as dependent on age and gender (Port, Garfinkel & Boyle, 2000). Furthermore, mortality risk also depends on age and gender, independently of blood pressure (Port, Garfinkel & Boyle, 2000). We also expect physical activity to be important in predicting mortality risk (Mok et al., 2019).

Discussion and Future Work

After distinguishing between model-dependent and model-independent Shapley values, in “Measures of non-linear dependence”, we introduce energy distance-based and kernel-based characteristic functions, for the Shapley game formulation, as measures of non-linear dependence. We assign the name ‘Sunnies’ to Shapley values that arise from such measures.

In “Recognising dependence: Example 1” and “Exploration”, we demonstrate that the resulting model-independent Shapley values provide reasonable results compared to a number of alternatives on certain DGPs. The alternatives investigated are the XGBoost built-in feature importance score, pairwise measures of non-linear dependence, and the $R^2$ characteristic function. The investigated DGPs are a quadratic form, for its simple non-linearity, and an XOR functional dependence, for its absence of pairwise statistical dependence. These examples are simple but effective, as they act as counter-examples to the validity of the alternative measures of dependence to which we draw comparison.

In “Diagnostics”, we demonstrate how the Shapley value decomposition, with these non-linear dependence measures as characteristic functions, can be used for model diagnostics. In particular, we present a variety of examples in which model misspecification and concept drift can be identified and attributed to specific features. We approach model diagnostics from two angles, by scrutinising two values: the dependence attributed to the model's predictions (ADP), and the dependence between the model residuals and the input features (ADR). These are proofs of concept, and the techniques of attributed dependence on labels (ADL), ADP and ADR require development to become standard tools. However, the examples highlight the techniques' potential, and we hope that this encourages greater interest in them.

We provide two demonstrations of the diagnostic methods: in “Demonstration with concept drift”, we use a data generating process which changes over time, and where the deployed model was trained at one initial point in time. Here, Sunnies successfully uncovers changes in the dependence structures of interest, and attributes them to the correct features, early in the dynamic process. The second demonstration, in “Demonstration with misspecified model”, shows how we use the attributed dependence on labels, model predictions and residuals, to detect which features’ dependencies or interactions are not being correctly captured by the model. Implicit in these demonstrations is the notion that the information from many marginal contributions is being summarised into a human digestible number of quantities. For example, in “Demonstration with misspecified model”, the 80 marginal contributions from 5 features are summarised as 5 Shapley values, in each of ADL, ADP and ADR, facilitating the simple graphical comparison in Fig. 5.

There is a practical difference between model-independent and model-dependent methods, highlighted in “Demonstration with misspecified model”, when comparing the dependence structure in a data set to the dependence structure captured by a model. Model-independent methods can be applied to model predictions and residuals, but can also be applied directly to the data labels. Thus, techniques using model-independent Shapley values will be markedly different from model-dependent methods in both design and interpretation. Indeed, consider the difference in interpretation between (a) the decomposition of a measure of statistical dependence, e.g., as a measure of distance between the joint distribution functions with and without the independence assumption, and (b) the attribution of a measure of the functional dependence of a model on the value of its inputs.

While the DC does provide a population level (asymptotic) guarantee that dependence will be detected, it must be noted that, as discussed in “Distance correlation and affine invariant distance correlation”, the DC tends to be greater for a linear association than for a non-linear association. Such properties are strengths or weaknesses not of using a measure of non-linear statistical dependence as the Shapley value characteristic function (i.e., not of the method we call Sunnies), but rather of the particular choice of characteristic function within this method. Work is needed to investigate other measures of statistical dependence in place of the DC, HSIC or AIDC, and to provide a comparison between these methods, including a detailed analysis of strengths, limitations and computational efficiency. In this paper, we have focused not on such a detailed experimental evaluation and comparison, but on the exposition of the Sunnies method itself. A potential alternative to our use of the distance correlation and HSIC is the class of maximal information-based non-parametric exploration (MINE) statistics, or other mutual information based measures (Kinney & Atwal, 2014; Reshef et al., 2011).

Finally, in “Application to Detecting Gender Bias”, we apply Sunnies to a study on mortality data, with the aim of detecting effects caused by gender differences. We find that, when the model is trained on a gender balanced data set, a significant difference is detected between the model’s representation of the dependence structure via its predictions (ADP) and the dependence structure on the labels (ADL); a difference which is significant for females and not for males, even though the training data was gender balanced. Although we do not claim that our result is causal, it does provide evidence regarding the potential of Sunnies to uncover and attribute discrepancies that may otherwise go unnoticed, in real data.

A well-known limitation when working with Shapley values is their exponential computational time complexity. Ideally, in “Application to Detecting Gender Bias”, we would have calculated Shapley values of all 17 features. However, it is important to note that we do not need to calculate Shapley values of all features if there is prior knowledge available regarding interesting or important features, or if the features can be partitioned into independent blocks. To illustrate the idea of taking advantage of independent blocks, suppose we have a model with 15 features. If we know in advance that these features partition into 3 independent blocks of 5 features, then we can decompose the dependence on each block into 5 Shapley values. In this way, 15 Shapley values are computed from 240 within-block marginal contributions, rather than the full number of 32,768 marginal contributions. In the future, it may be interesting to also consider the computational efficiencies that may arise in scenarios where the sparsity structures of marginal contributions can be directly exploited, as well as the potential for examining such marginal contributions directly (e.g., via visualisation).
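
Where neither prior knowledge nor block structure is available, the usual escape from the exponential cost is Monte Carlo sampling over player orderings, which yields a consistent estimate of Eq. (2), in the spirit of the algorithm of Song, Nelson & Staum (2016) mentioned in the Introduction. A generic permutation-sampling sketch (not the authors' implementation):

```python
import numpy as np

def shapley_mc(char_fn, d, n_perms=500, rng=None):
    """Monte Carlo Shapley estimates: average marginal contributions
    over uniformly random orderings of the d players."""
    rng = rng or np.random.default_rng()
    phi = np.zeros(d)
    for _ in range(n_perms):
        S = frozenset()
        prev = char_fn(S)
        for v in rng.permutation(d):
            S = S | {int(v)}       # add player v to the growing coalition
            cur = char_fn(S)
            phi[v] += cur - prev   # marginal contribution of v in this ordering
            prev = cur
    return phi / n_perms
```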

To close, note that while we have drawn the distinction between model-dependent and model-independent Shapley feature importance methods, this distinction holds for model explanation methods in general. We believe that complete and satisfactory model explanations should ideally include a description from both categories.

All code and data necessary to produce the results in this manuscript are available on github.com/ex2o/sunnies.

¹ In this context, we refer to the characteristic function of a probability distribution. We would like to make the reader aware that this is a different use of the term “characteristic function” than that used to describe a cooperative game in the context of Shapley values, as in Eq. (1).