
One of our contributions is an adaptation of explanation methods for explaining textual units longer than words. If longer units are explained instead of single words, the resulting explanations may be simpler and semantically more meaningful, which is a desirable property for end users [54]. In addition, the words inside longer units may interact, so the resulting explanations may uncover new insights. Below we present existing approaches that extend explanations beyond single words.

Murdoch et al. [55] propose Contextual Decomposition as a way to explain predictions of a Long Short-Term Memory (LSTM) network [56]. The method is able to capture both individual importance and interaction effects. Singh et al. [57] generalize this method to other deep neural network variants and propose a clustering procedure that constructs hierarchical explanations, showing how word-level importances compose into importances of progressively longer text spans. Jin et al. [58] argue that context independence is a desirable property for the phrase importance computed by these methods. They show that existing methods do not take this into account and propose a modified method. Chen et al. [59] propose a method for computing hierarchical explanations in a top-down procedure by iteratively dividing spans into two parts at the point where the least interaction occurs. Similarly, Zhang et al. [60] propose a bottom-up method for computing hierarchical explanations that chooses which words to group by taking into account individual importance and interaction effects.

The process of combining word importances into the importance of longer units involves a step that selects which words to group next. The number of possible choices quickly becomes too large to check exhaustively, so most of the outlined methods rely on the assumption that words frequently interact with their close neighbours [22].

A different approach, which does not require this assumption, uses a predefined tree structure instead of trying to construct one automatically. In the work most similar to ours, Chen et al. [23] construct explanations according to the underlying constituency trees. In contrast, we use dependency trees, which are similar but more suitable for, and more readily available in, morphologically rich languages.

A third approach is to perform the grouping implicitly. Chen et al. [24] propose a method that learns to group correlated words jointly with the used model on a downstream task. However, this explanation method loses the ability to produce hierarchical explanations.

Chapter 3

Background

Before we describe our modifications, we provide an overview of the individual building blocks. In Section 3.1, we describe the studied explanation methods. Then, in Section 3.2, we describe the process of text generation using language models.

3.1 Explanation methods

In this section, we describe the studied explanation methods. First, we describe LIME, a method that produces an explanation using a surrogate model. Next, we describe IME, which uses a pure sampling approach to produce explanations in the form of Shapley values.

3.1.1 LIME

LIME (Local Interpretable Model-agnostic Explanations) is a local perturbation-based model-agnostic explanation method introduced by Ribeiro et al. [5]. The method creates an explanation for a given input by approximating the local behaviour of the interpreted model with a simple model (e.g., linear).

The authors motivate LIME by outlining that an explanation should be:


1. Interpretable. The explanation should present the relation between the inputs and response of a model, taking into account the limits of the end user.

2. Locally faithful. The explanation must represent the true behaviour of the interpreted model in the neighbourhood of the explained instance. This is desired in order to increase the trust of the user in the model. While globally faithful explanations might be even more trustworthy, they are harder to produce in a form that is still interpretable to the end user.

3. Model-agnostic. The explanations should be constructible for an arbitrary model so that the explanation method is generally applicable.

To satisfy the first requirement, the authors first define an interpretable representation of an explained instance. This representation is chosen so that its meaning is comprehensible to humans. In textual LIME, this is a binary vector, where 1 indicates that the original word is present in the sequence, and 0 indicates that it is absent. The absence is simulated by replacing the word with a dummy word (in our case, the word [PAD]).
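To make the interpretable representation concrete, the following minimal Python sketch (with hypothetical helper names, not the exact implementation used in this thesis) maps a binary vector back to a perturbed word sequence using the [PAD] replacement described above:

def apply_mask(words, mask, pad_token="[PAD]"):
    # Keep word i if mask[i] == 1, otherwise replace it with the dummy token.
    return [w if keep else pad_token for w, keep in zip(words, mask)]

words = ["the", "movie", "was", "surprisingly", "good"]
mask = [1, 1, 0, 1, 0]          # interpretable (binary) representation
print(apply_mask(words, mask))  # ['the', 'movie', '[PAD]', 'surprisingly', '[PAD]']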

The interpretable representations of input perturbations are used to train a model that locally approximates the interpreted model in the neighbourhood of the explained instance. As the internals of this model will be used as the explanation, and we want the explanation to be comprehensible to the end user, the model needs to be simple enough, e.g., a linear model or a shallow decision tree.

To ensure the simple model is locally faithful to the interpreted model, its training examples are weighted according to how distant their interpretable representation is from the interpretable representation of the explained instance. For text, the weights are calculated with an exponential kernel over the cosine distances between the interpretable representations. The exponential kernel is parametrized by the kernel width σ, which controls how local the explanations are.
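The following sketch illustrates this weighting, assuming the common form exp(−d²/σ²) of the exponential kernel; the kernel width of 25 is an arbitrary illustrative value, not a parameter prescribed by the method:

import numpy as np

def cosine_distance(u, v):
    # Cosine distance between two interpretable (binary) representations.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def kernel_weight(z, z_explained, sigma=25.0):
    # Exponential kernel exp(-d^2 / sigma^2); sigma is the kernel width.
    d = cosine_distance(z, z_explained)
    return np.exp(-(d ** 2) / sigma ** 2)

z_explained = np.ones(5)                # all five words present
z_sample = np.array([1, 1, 0, 1, 0])    # two words replaced by [PAD]
print(kernel_weight(z_sample, z_explained))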

To explain a text classifier, LIME first samples the local neighbourhood of the instance uniformly at random. In practice, this means constructing binary vectors with a random number of elements zeroed out. From these interpretable representations, their non-interpretable counterparts are created by replacing the absent words in the original sequence with the word [PAD]. The non-interpretable representations are passed through the interpreted model to obtain the predictions that the simple model should approximate. Finally, the sample weights are computed with the exponential kernel, and a simple model is fit to predict the interpreted model's behaviour from the interpretable representations of the neighbourhood. In our experiments, we learn a sparse ridge regression model [61], where the enforced sparsity keeps the explanations simple. To achieve sparsity, we first train a model using all features, then greedily select the K features with the highest absolute importance and retrain the model using only the selected features.
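A simplified sketch of this surrogate-fitting step is shown below; the variable names (Z, y, w, K) and the use of scikit-learn's Ridge are our assumptions for illustration, not a description of the exact implementation:

import numpy as np
from sklearn.linear_model import Ridge

def fit_sparse_surrogate(Z, y, w, K):
    # Z: binary interpretable representations of the sampled neighbourhood,
    # y: predictions of the interpreted model for the perturbed texts,
    # w: kernel weights of the samples, K: number of features to keep.
    full = Ridge(alpha=1.0).fit(Z, y, sample_weight=w)
    # Greedily keep the K features with the largest absolute coefficients.
    top_k = np.argsort(np.abs(full.coef_))[::-1][:K]
    # Refit on the selected features only; these coefficients form the explanation.
    sparse = Ridge(alpha=1.0).fit(Z[:, top_k], y, sample_weight=w)
    return dict(zip(top_k.tolist(), sparse.coef_.tolist()))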

3.1.2 IME

IME (Interactions-based Method for Explanation) [6, 7, 8] is similar to LIME in the sense that it is a local perturbation-based model-agnostic explanation method. In contrast to LIME, however, it does not explain the model by (explicitly) fitting a simpler approximate model; instead, it provides the explanation in the form of Shapley values [41].

The authors motivate IME by outlining the limitation of previous methods, which constructed explanations by perturbing one feature at a time. These methods define the importance of a feature i as the difference between the prediction of the model and the expected prediction of the model if the value of the i-th feature is unknown:

\phi_i(x) = f(x) - E\left[ f(x_1, \ldots, X_i, \ldots, x_n) \right], \qquad (3.1)

where x = [x_1, \ldots, x_n] is the explained instance and f is the explained model.

Such explanations can be misleading when features interact. As a solution, the authors propose a method that perturbs features across all subsets, defining the feature importance as:

\phi_i(x) = \sum_{Q \subseteq S \setminus \{i\}} \frac{|Q|! \, (|S| - |Q| - 1)!}{|S|!} \left( v_{Q \cup \{i\}}(x) - v_Q(x) \right), \qquad (3.2)

where S is the set of all features. This expression corresponds to the Shapley value, a concept from coalitional game theory.

In game theory, a coalitional game is defined by a set of players S, which form the grand coalition, and a characteristic function v, which defines the payout v_Q that the players gain by forming a coalition Q, with v_∅ = 0. In explanation methods for machine learning models, the set of players corresponds to the explained features, and the characteristic function v_Q corresponds to the difference between the expected model prediction when the subset of features Q is known and the expected model prediction. The goal is to distribute the worth of the grand coalition v_S, as defined by Equation 3.3, among the features in a fair way.

v_S(x) = E(f \mid X_i = x_i, \forall i \in S) - E(f \mid X_i = x_i, \forall i \in \emptyset) \qquad (3.3)

Shapley values are the unique solution that satisfies four axioms formalizing fairness: symmetry, efficiency, law of aggregation (also called linearity or additivity), and dummy [41].
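For illustration, the following sketch (with assumed names: f for the model, x for the explained instance, and X_bg for a sampling dataset) computes Shapley values exactly by enumerating all subsets, as in Equation 3.2, estimating v_Q under the feature-independence assumption used later in the derivation. It is only meant to make the formula concrete; the enumeration is infeasible beyond a handful of features. The model f is assumed to map a 2D array of instances to a 1D array of predictions.

import numpy as np
from itertools import combinations
from math import factorial

def v(Q, x, f, X_bg):
    # Estimate E(f | X_i = x_i for i in Q) - E(f) under feature independence
    # by fixing the "known" features of every instance in the sampling dataset.
    Xq = X_bg.copy()
    Xq[:, list(Q)] = x[list(Q)]
    return f(Xq).mean() - f(X_bg).mean()

def shapley_exact(x, f, X_bg):
    # Exact Shapley values via enumeration of all subsets (Equation 3.2).
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        for size in range(n):
            for Q in combinations(rest, size):
                weight = factorial(len(Q)) * factorial(n - len(Q) - 1) / factorial(n)
                phi[i] += weight * (v(set(Q) | {i}, x, f, X_bg) - v(Q, x, f, X_bg))
    return phi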

As the exact calculation of Shapley values has an exponential time complexity, the authors of IME present a sampling method that enables their approximation. In order to derive the method, they first reformulate Equation 3.2 as:

\phi_i(x) = \frac{1}{|S|!} \sum_{O \in \pi(S)} \left( v_{\mathrm{Pre}_i(O) \cup \{i\}}(x) - v_{\mathrm{Pre}_i(O)}(x) \right),

where π(S) is the set of all ordered permutations of the feature indices, and Pre_i(O) is the set of feature indices that are located before i in the permutation O. Next, they assume feature independence, so the equation becomes:

\phi_i(x) = \frac{1}{|S|! \cdot |X|} \sum_{O \in \pi(S)} \sum_{x' \in X} \left( f\big(x'_{[X_i = x_i,\; i \in \mathrm{Pre}_i(O) \cup \{i\}]}\big) - f\big(x'_{[X_i = x_i,\; i \in \mathrm{Pre}_i(O)]}\big) \right),

where x'_{[X_i = x_i,\; i \in Q]} is used to denote an instance x' whose features in Q are set to the corresponding values of x, and X is the sampling dataset.

Instead of computing the mean across all permutations (and the expectation over all examples in the sampling dataset), the mean over m samples {(O_j, x′_j)}_{j=1,…,m} is taken as the approximation ϕ̂_i(x), with variance σ_i²/m, where σ_i² is the sampling population variance.
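The sampling procedure for a single feature can be sketched as follows (a simplified illustration with assumed names; f is assumed here to map a single instance to a scalar prediction, and X_bg is the sampling dataset):

import numpy as np

def ime_estimate(i, x, f, X_bg, m, rng=np.random.default_rng(0)):
    # For feature i: draw m samples, each consisting of a random permutation O and
    # a random instance x' from the sampling dataset, and average the prediction
    # differences between two hybrid instances that differ only in feature i.
    n = len(x)
    diffs = np.empty(m)
    for j in range(m):
        order = rng.permutation(n)
        pre_i = order[: np.where(order == i)[0][0]]  # Pre_i(O): features before i
        x_prime = X_bg[rng.integers(len(X_bg))]
        with_i = x_prime.copy()
        with_i[pre_i] = x[pre_i]
        with_i[i] = x[i]                             # Pre_i(O) ∪ {i} taken from x
        without_i = x_prime.copy()
        without_i[pre_i] = x[pre_i]                  # only Pre_i(O) taken from x
        diffs[j] = f(with_i) - f(without_i)
    return diffs.mean(), diffs.var()                 # estimate and sampling variance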

To compute the explanation for an input instance x, IME computes ϕ̂_i(x) for each feature i using the approximation procedure described above. The approximation is determined by two parameters: the minimum number of samples to take per feature (m_min) and the total sampling budget (m_max ≥ m_min · |S|). First, an initial estimate of each feature importance and its variance is calculated using m_min samples. Then, the remaining sampling budget of (m_max − m_min · |S|) samples is iteratively allocated among the features in a way that decreases the expected error of the estimates. Concretely, an additional sample is allocated to the feature j for which

\frac{\sigma_j^2}{m_j} - \frac{\sigma_j^2}{m_j + 1}

is maximal, assuming the goal is to decrease the expected squared error.^1

^1 The derivation for this, and for the case where the expected absolute error is minimized instead, is available in the original paper [8].
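A simplified sketch of this allocation loop is given below; it reuses the ime_estimate sketch above and, for brevity, keeps the variance estimates fixed, whereas an actual implementation would also update them as new samples arrive:

import numpy as np

def allocate_budget(x, f, X_bg, m_min, m_max):
    # Initial estimates: m_min samples per feature.
    n = len(x)
    estimates = [ime_estimate(i, x, f, X_bg, m_min) for i in range(n)]
    phi = np.array([e[0] for e in estimates])
    var = np.array([e[1] for e in estimates])
    m = np.full(n, m_min)
    # Spend the remaining budget one sample at a time on the feature j that
    # maximises the expected reduction in squared error: var_j/m_j - var_j/(m_j + 1).
    for _ in range(m_max - m_min * n):
        j = int(np.argmax(var / m - var / (m + 1)))
        extra, _ = ime_estimate(j, x, f, X_bg, 1)
        phi[j] = (phi[j] * m[j] + extra) / (m[j] + 1)  # fold the new sample into the mean
        m[j] += 1
    return phi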