
In Section 5.3, we have empirically shown that the perturbations created with generators inside the modified explanation methods are more natural than those created inside the original IME and LIME methods. In this section, we quantitatively and qualitatively evaluate whether using these more natural perturbations leads to better explanations.

5.4.1 Experimental settings

We measure the quality of explanations with two automated metrics that measure the difference in prediction when the important text units (e.g., subwords or bigrams) are removed.

The first metric is the switching point (SP) [32], which measures the average proportion of text units that have to be removed from the input instance to change the predicted class of a model. Initially, the classifier is used to obtain the prediction for the explained instance. Then, text units are removed from the instance in decreasing order of their estimated importance until the prediction of the classifier changes, i.e. the unit deemed the most important for the current prediction is removed first. If the prediction does not change after removing all units from the instance, the SP for that instance is 1.

This is repeated for all evaluated instances and averaged to produce the final value of the metric. Low values of SP imply that the most important units in the explanations are indicative of the predicted class and consequently the produced explanations are good.
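To make the computation concrete, the following is a minimal sketch of the per-instance SP computation; `predict_proba`, `units`, and `importances` are hypothetical placeholders for the classifier's probability function, the tokenized instance, and the per-unit importance scores produced by an explanation method.

```python
import numpy as np

def switching_point(predict_proba, units, importances):
    """Proportion of text units that must be removed (most important first)
    before the predicted class changes; 1.0 if the prediction never changes."""
    original_class = int(np.argmax(predict_proba(units)))
    keep = np.ones(len(units), dtype=bool)
    # Remove units in decreasing order of their estimated importance.
    for num_removed, idx in enumerate(np.argsort(importances)[::-1], start=1):
        keep[idx] = False
        partial = [u for u, k in zip(units, keep) if k]
        if int(np.argmax(predict_proba(partial))) != original_class:
            return num_removed / len(units)
    return 1.0

# The reported SP is the mean over all evaluated instances, e.g.:
# sp = np.mean([switching_point(predict_proba, u, w) for u, w in explanations])
```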

The second metric is the area over the perturbation curve (AOPC) [33], which measures the average decrease in the probability of the predicted class ŷ when the 1, 2, . . . , K most important text units are removed. In our experiments, we use K = 10. As in the calculation of SP, the final AOPC is obtained by averaging the metric over all evaluated instances. A higher value of AOPC indicates better explanations.
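A matching sketch for the per-instance AOPC is given below, under the same hypothetical inputs as the SP sketch above; we average the drop in the predicted class probability over k = 1, . . . , K, following the description in the text.

```python
import numpy as np

def aopc(predict_proba, units, importances, k_max=10):
    """Average drop in the predicted class probability after removing the
    1, 2, ..., K most important text units (K = 10 in our experiments)."""
    probs = predict_proba(units)
    pred_class = int(np.argmax(probs))
    original_prob = probs[pred_class]

    keep = np.ones(len(units), dtype=bool)
    drops = []
    # The most important units are removed first.
    for idx in np.argsort(importances)[::-1][:min(k_max, len(units))]:
        keep[idx] = False
        partial = [u for u, k in zip(units, keep) if k]
        drops.append(original_prob - predict_proba(partial)[pred_class])
    return float(np.mean(drops))

# As with SP, the reported AOPC is the mean over all evaluated instances.
```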

Both metrics measure how well the explanations cover the removal-based importance, which is different from the replacement-based importance computed in the original and modified IME and LIME methods. Despite this misalignment, we use them because the intuition behind them is reasonable and they enable automatic measurement of the explanation quality.

Due to the high cost of computing the explanations, we report all results of this experiment on random samples of 100 test examples per dataset. In XNLI5, where the examples are in five different languages, we include an equal number of examples (i.e., 20) per language.

To compute the explanations, we use the following settings. In the IME methods, we compute the explanations using m_min = 30 minimum samples per feature and m_max maximum samples, where m_max is 2500 (SNLI), 3000 (SST-2), 3500 (IMSyPP-sl), 4500 (XNLI5) or 8500 (SentiNews). We determined these values heuristically as m_max = 2 · L · 30, rounded up to the nearest 500 samples, where L is the maximum text length considered (in number of subwords). As in the first experiment, we use the training dataset as the sampling dataset in the original IME for all datasets except XNLI5, for which we use the subset of the validation set that contains the examples in the same language as the explained instance. To keep the settings uniform across methods, we reuse the values of m_max for the number of samples taken in the LIME methods and for the size of the generated dataset in IME with an external LM. In the LIME methods, we use kernel width σ = 0.5. To obtain comparable explanations, we use the LIME methods to compute dense instead of sparse explanations, i.e. explanations that use all features instead of a subset.
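As a concrete illustration of this heuristic, the sketch below computes m_max from an assumed maximum subword length L; the dataset-specific values of L are not repeated here, so the example call is purely illustrative.

```python
import math

def max_samples(max_subwords, min_samples_per_feature=30, round_to=500):
    """m_max = 2 * L * min_samples_per_feature, rounded up to the nearest 500,
    where L is the maximum considered text length in subwords."""
    raw = 2 * max_subwords * min_samples_per_feature
    return math.ceil(raw / round_to) * round_to

# Illustrative only: a dataset with a maximum length of 50 subwords
# would use max_samples(50) == 3000 maximum samples.
```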

5.4.2 Explanations with generators

Tables 5.3 and 5.4 show the SP and AOPC metrics for the tested explanation methods.

Table 5.3: SP metric of explanations. The best scores for each dataset are marked bold. (u) indicates that the used generator is untuned (i.e. only pre-trained).

Method         SST-2   SNLI    IMSyPP-sl   SentiNews   XNLI5
IME            0.235   0.102   0.427       0.401       0.071
LIME           0.235   0.096   0.449       0.432       0.083
LIME+LM (u)    0.337   0.175   0.493       0.479       0.150
LIME+LM        0.382   0.170   0.532       0.465       0.125
LIME+CLM       0.343   0.169   0.532       0.498       0.121
IME+iLM (u)    0.354   0.217   0.505       0.521       0.184
IME+iLM        0.394   0.236   0.584       0.494       0.180
IME+iCLM       0.400   0.244   0.604       0.493       0.187
IME+eLM (u)    0.336   0.212   0.457       0.419       0.134
IME+eLM        0.396   0.204   0.550       0.446       0.155
IME+eCLM       0.340   0.194   0.534       0.425       0.152

The methods generally achieve the lowest values of SP on SNLI and XNLI5, while they achieve the highest values of SP on IMSyPP-sl and SentiNews.

Table 5.4: AOPC metric of explanations. The best scores for each dataset are marked bold. (u) indicates that the used generator is untuned (i.e. only pre-trained).

Method         SST-2   SNLI    IMSyPP-sl   SentiNews   XNLI5
IME            0.562   0.639   0.260       0.202       0.575
LIME           0.554   0.666   0.245       0.191       0.559
LIME+LM (u)    0.464   0.516   0.235       0.171       0.458
LIME+LM        0.429   0.512   0.204       0.164       0.479
LIME+CLM       0.447   0.542   0.210       0.164       0.481
IME+iLM (u)    0.401   0.494   0.224       0.151       0.377
IME+iLM        0.348   0.490   0.165       0.144       0.370
IME+iCLM       0.344   0.478   0.161       0.151       0.376
IME+eLM (u)    0.435   0.490   0.258       0.177       0.433
IME+eLM        0.370   0.510   0.199       0.176       0.428
IME+eCLM       0.410   0.508   0.192       0.177       0.430

The low values on the NLI datasets show that a change in prediction is commonly possible by removing only a small portion of text units, which is a consequence of the task definition. For example, in entailment examples the hypothesis must confirm the information present in the premise, meaning that some important words are likely to overlap. In addition, NLI datasets contain annotation artifacts [78], which reduce the difficulty of the task. For example, the hypotheses in contradiction examples might contain words such as "no" and "not", and the hypotheses in neutral examples might add adjectives that are not implied in the premise (e.g., the hypothesis "A tall human poking" given the premise "A male with sunglasses is poking at a tree with a pole"). In contrast, a change in prediction requires the removal of significantly more text units in IMSyPP-sl and SentiNews, which is a consequence of multiple factors. SentiNews contains sentiment-annotated news articles, in which the sentiment might not be expressed directly via certain keywords, but more subtly. Additionally, predictions for some classes are less likely to change by removing important words. For example, in IMSyPP-sl a prediction is less likely to change from clean to hate speech by removing text units than in the reverse scenario. This is emphasized by the class imbalance present in both datasets: IMSyPP-sl contains more clean (68) than hate-speech (32) examples, and SentiNews contains more neutral (58) than positive (17) and negative (25) examples. Due to this, we additionally provide the SP and AOPC metric values by class in Appendix A.1. The reverse pattern holds for the AOPC values shown in Table 5.4: the methods with low values of SP typically have high values of AOPC, meaning that removing the 1, 2, . . . , 10 most important text units on average results in a larger decrease of the predicted class probability.

The original methods achieve better values of the SP and AOPC metrics than all modified methods. The best SP values are achieved by IME and LIME on one dataset (SST-2), LIME on one (SNLI), and IME on three (IMSyPP-sl, SentiNews, XNLI5). The best AOPC values are achieved by LIME on one (SNLI) and IME on four (SST-2, IMSyPP-sl, SentiNews, and XNLI5) datasets. Among the modified explanation methods, none performs best across all datasets: the best SP is achieved by IME with an external LM on three (SST-2, IMSyPP-sl, SentiNews) and by the modified LIME on two (SNLI, XNLI5) datasets, while the best AOPC is achieved by IME with an external LM on two (IMSyPP-sl, SentiNews) and by the modified LIME on three (SST-2, SNLI, XNLI5) datasets. The modified methods using tuned MLMs perform the worst, not achieving the best SP or AOPC score on any dataset. The modified methods using untuned MLMs and tuned CMLMs perform comparably: the methods using untuned MLMs perform better according to SP (achieving the best score among the modified methods on three datasets), while the two groups perform equally well according to AOPC (each achieving the best score on two datasets and sharing the best score on one dataset).


Figure 5.1: Explanations computed by IME (top), LIME (middle), and IME with an internal LM (bottom) for an entailment example, showing extreme sparsity in the explanations computed with IME with an internal LM. Only three text units receive nonzero importance (values so small that they are rounded to 0 in the figure) because the language model samples the neighbourhood of the explained instance too locally. In comparison, IME and LIME assign nonzero importance to significantly more units. The importance of units is noted above them and through their color: green indicates a positive and red indicates a negative contribution towards the prediction.

A major issue in the explanations computed with the modified methods (particularly with the modified IME methods with an internal LM) is their sparsity: in many cases, a majority of text units have zero importance. When computing the metrics for such instances, units get removed in an order which is not dependent on their actual importance, leading to high values of SP and low values of AOPC. An example of this is shown in Figure 5.1, where only three text units ("boards", "on", "ledge") have a nonzero importance. The primary reason for this is the use of a generation strategy that creates perturbations too similar to the explained instance. In other words, the generator samples the neighbourhood of the explained instance too locally, so it does not uncover any variance.

The explanations computed by IME methods represent the difference between the prediction and the expected prediction if no words in the explained instance were known. If the perturbations are too similar to the explained instance, the difference between these two values is minimal, which results in sparse explanations. The original IME does not have these issues because it samples perturbations from a large sampling dataset without considering syntax and semantics, which uncovers the underlying variance, although it might come as a result of the model's poorly calibrated prediction on an out-of-distribution perturbation. Sparsity is not an issue in the modified LIME method due to its uniqueness constraint, which ensures that the perturbations contain words different from those in the original instance. Although this restriction could also be used in the modified IME methods, it can lead to assigning nonzero importance to units that naturally do not vary and should therefore not be assigned any importance. For example, perturbing the punctuation at the end of a sentence into a different word might not be sensible if all sentences end with punctuation.

To examine whether the poor results are a consequence of the generation strategy, we perform a small experiment on 50 SNLI examples, randomly sampled from the validation set in order not to overfit the test set. On this sample, we evaluate the methods using three different types of decoding in the perturbation generation strategy, keeping the remaining settings unchanged. We evaluate the methods using greedy decoding, nucleus (top-p) sampling [79] with p = 0.9, and top-k sampling with k = 3. In contrast to greedy decoding, which selects the replacement word as the one with the highest assigned probability, top-p and top-k decoding sample the replacement word from a re-normalized truncated probability distribution: top-p samples from the smallest pool of words whose cumulative probability exceeds p, and top-k samples from the pool of the k most probable words. Table 5.5 shows the SP and AOPC for the original methods and the modified methods using the different types of decoding. Although the best results are still achieved by the original methods, the modified methods using top-p or top-k sampling achieve an improved SP in 6 and an improved AOPC in 9 out of 9 settings compared to the methods using greedy decoding. Because they consider multiple possible replacement words, they have a higher chance of uncovering prediction variance and reducing the problem of sparsity in the computed explanations.

However, as they uncover more variance, they can also increase the variance of the computed explanations, meaning more samples need to be taken to estimate them reliably. This is an inherent trade-off that is determined by how locally or globally the generator samples the neighbourhood, and its effect depends on multiple settings, not only on the type of decoding used.

We leave the detailed exploration of this trade-off for further work.
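To make the compared decoding strategies concrete, the sketch below picks a replacement subword for a single masked position from the generator's output scores; the function and the `logits` input are illustrative assumptions, not the exact interface of our implementation.

```python
import numpy as np

def sample_replacement(logits, strategy="greedy", top_p=0.9, top_k=3, rng=None):
    """Pick a replacement subword id for one masked position using greedy,
    top-p (nucleus) or top-k decoding."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if strategy == "greedy":
        return int(np.argmax(probs))

    order = np.argsort(probs)[::-1]  # token ids sorted by decreasing probability
    if strategy == "top_k":
        pool = order[:top_k]
    elif strategy == "top_p":
        cumulative = np.cumsum(probs[order])
        # Smallest pool whose cumulative probability reaches top_p.
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        pool = order[:cutoff]
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    pool_probs = probs[pool] / probs[pool].sum()  # re-normalize the truncated distribution
    return int(rng.choice(pool, p=pool_probs))
```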

A possible reason for the negative results also lies in the formulation of the (M)LM task. In contrast to the original methods, the generators trained for MLM perform biased sampling of perturbations around the explained instance. Because the generators are trained for reconstruction, there is no guarantee that they will sample a perturbation belonging to a different class than the explained instance. However, we hypothesize that this is an issue in practice only because we sample perturbations too locally. If we sample less locally around the instance or use a more general MLM, the chance of only sampling perturbations belonging to the same class as the explained instance should decrease, as indicated by the results of the decoding experiment (Table 5.5), where the use of top-p or top-k sampling leads to a lower SP in 6 out of 9 settings.

A minor (i.e. less decisive) reason for the inferior performance of the modified methods compared to the original ones is the choice of the SP and AOPC metrics. They allow us to automatically approximate to what extent the explanations capture the units important for the prediction. They are calculated by removing the most important units (according to the explanation) from the instance and observing the change in the predicted probability. Although the metrics are intuitive, their calculation implicitly favours methods that create explanations without considering the dependencies in the data. By removing units from the explained instance, the resulting partial input is often incomprehensible, similarly to the perturbations used by the original IME and LIME to compute the explanations.

Table 5.5: SP (left) and AOPC (right) values for explanation methods using three types of decoding on a random sample of 50 validation examples in SNLI. The best overall metrics are marked bold, while the best decoding strategy for a method is marked with a full underline for SP and with a dashed underline for AOPC. The original methods do not use decoding, so the metrics are displayed in parentheses. (u) indicates that the used generator is untuned (i.e. only pre-trained).

Method         SP (greedy / top-p / top-k)     AOPC (greedy / top-p / top-k)
LIME+LM (u)    0.259 / 0.199 / 0.219           0.491 / 0.505 / 0.512
LIME+LM        0.175 / 0.182 / 0.200           0.525 / 0.529 / 0.507
LIME+CLM       0.212 / 0.217 / 0.239           0.509 / 0.511 / 0.497
IME+iLM (u)    0.259 / 0.222 / 0.214           0.467 / 0.484 / 0.490
IME+iLM        0.280 / 0.259 / 0.261           0.412 / 0.483 / 0.462
IME+iCLM       0.301 / 0.220 / 0.255           0.429 / 0.496 / 0.479
IME+eLM (u)    0.231 / 0.232 / 0.235           0.477 / 0.506 / 0.512
IME+eLM        0.239 / 0.216 / 0.193           0.475 / 0.494 / 0.518
IME+eCLM       0.246 / 0.224 / 0.279           0.463 / 0.506 / 0.494

In further work, we intend to investigate modified versions of the SP and AOPC metrics that take into account the comprehensibility of the partial input.

One positive aspect of the modified methods, which is not captured by the metrics, is the potential to use the natural perturbations to better understand the explanations, i.e. to describe the used reference distribution. Although the perturbations can be shown to users in the original methods as well, they are often incomprehensible, making them unhelpful. In contrast, the perturbations used in the modified methods appear much more natural, which may help to detect errors in the computed explanations or biases being propagated in the perturbations.

An issue that is present for certain datasets in both the original and modified methods is the varying quality of explanations for different class labels.

We illustrate this by presenting the SP and AOPC metrics for all class labels in Appendix A.1. For example, in IMSyPP-sl the methods generally achieve small SP (and high AOPC) when explaining examples of hate speech and a significantly higher SP (and lower AOPC) when explaining clean examples.

Lower values of SP for examples of hate speech are expected, as the hateful content is often concentrated in profane words and slurs, the removal of which can quickly change the prediction. The reverse is often true for clean examples, i.e. a prediction cannot always be switched from a clean comment to hate speech by removing units. However, in our qualitative analysis, we often noticed that the explanations for "neutral" examples are poor. As an example, we show the explanation computed by IME and LIME for a clean tweet in Figure 5.2. None of the words receives a high score, and the importance is spread among many units. Such an explanation is not intuitive and presents a challenge: what would an intuitive explanation look like in such a case?

In summary, the modified IME and LIME are not suitable replacements for the original methods in their current state. Although they use more natural perturbations to avoid constructing misleading explanations, they are prone to producing sparse explanations that are unhelpful to the user.

5.4.3 Explanation based on dependency structure

In this section, we evaluate the proposed explanations of textual units longer than words using the dependency structure of the text. Due to the high computational cost of the used generation strategy, as well as the negative results described in Section 5.4.2, we only augment the original methods with the dependency structure of the explained text.

Figure 5.2: Explanations computed by IME and LIME for a clean tweet, indicating the unintuitive explanations for "neutral" labels. Translation: "@BojanPozar Emergency measures? As far as I am concerned, this has been my lifestyle for the past few years. It was too dynamic before." The importance of units is noted above them and through their color: green indicates a positive and red indicates a negative contribution towards the prediction.

Table 5.6 shows the number of instances where the optimization process yields custom explanations that differ from the word-based explanations, i.e. where at least one merge of words occurs. The merges happen frequently: in all datasets, the methods return custom explanations for at least 75 out of 100 instances. The most custom explanations are computed by LIME on SNLI (93), where short groups of words are often important to determine the relation. For example, Figure 5.3 shows a custom explanation for an example where the word groups "at a dining table" (in the premise) and "at a boxing match" (in the hypothesis) are important to determine that the hypothesis contradicts the premise.

To analyze what kind of merges occur most frequently, we count the universal part-of-speech (UPOS) tag [70] frequencies of the words inside the merged word groups in the custom explanations. Table 5.7 shows the three most common combinations for each dataset. For datasets containing English examples (SST-2, SNLI, XNLI5), we observe that the merge most frequently occurs between a determiner and a noun (e.g., "the movie"), which is a consequence of how frequently determiners are used in English. Although this type of merge does not add new information to the explanation, it makes the explanations less redundant. Similarly, in Slovene, merges including an adposition (i.e. a preposition or a postposition) and a noun (e.g. "za kulturo", translated: "for culture") are common throughout all datasets. In Slovene datasets, this type of merge mostly makes the explanations less redundant. However, the merged words are commonly important in NLI datasets, as shown by the previously discussed example in Figure 5.3. In the sentiment and hate-speech datasets (SST-2, IMSyPP-sl, SentiNews), the class is often determined by words that modify the meaning of nouns, so merges including an adjective (e.g., "deeply appealing") are also frequent.
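The counting behind Table 5.7 can be sketched as follows; representing each custom explanation as a list of explained units, where a merged unit is a list of (word, UPOS) pairs, is an assumption made for illustration rather than the exact data structure of our implementation.

```python
from collections import Counter

def merge_tag_combinations(custom_explanations):
    """Count UPOS-tag combinations inside merged word groups.

    `custom_explanations` is a hypothetical list of explanations, each a list
    of explained units; a merged unit is a list of (word, upos) pairs, while
    single-word units are skipped because only merged groups are of interest.
    """
    counter = Counter()
    for explanation in custom_explanations:
        for unit in explanation:
            if len(unit) > 1:  # only merged word groups
                combination = tuple(sorted(upos for _word, upos in unit))
                counter[combination] += 1
    return counter

# counter.most_common(3) would give the per-dataset rows of Table 5.7,
# e.g. a frequent English combination would be ('DET', 'NOUN').
```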

Table 5.6: Number of test instances (out of 100) for which IME and LIME augmented with the dependency structure compute custom explanations, i.e. explanations where at least one merge of words occurs.

SST-2 SNLI IMSyPP-sl SentiNews XNLI5

IME + dep. 86 91 75 90 85

LIME + dep. 76 93 77 89 82

For the remaining instances, merges do not occur, meaning the individual words are more important for the prediction than compounds. This happens either because there is enough information for the prediction in the individual words or because the model does not learn to handle word interactions defined by the dependency structure. Figure 5.4 shows one example for each scenario. In the first example, the individual words "young" (in the premise) and "older" (in the hypothesis) enable the model to correctly predict that the hypothesis contradicts the premise, so the model does not need to take into account any

Table 5.7: Most frequent types of word merges inside custom explanations.