
University of Ljubljana

Faculty of Computer and Information Science

Matej Klemen

Adaptations of perturbation-based explanation methods for text classification with neural networks

MASTER’S THESIS

THE 2nd CYCLE MASTER’S STUDY PROGRAMME COMPUTER AND INFORMATION SCIENCE

Supervisor: prof. dr. Marko Robnik Šikonja

Ljubljana, 2021


Univerza v Ljubljani

Fakulteta za računalništvo in informatiko

Matej Klemen

Prilagoditve perturbacijskih razlagalnih metod za klasifikacijo besedil z nevronskimi mrežami

MAGISTRSKO DELO

MAGISTRSKI ŠTUDIJSKI PROGRAM DRUGE STOPNJE RAČUNALNIŠTVO IN INFORMATIKA

Mentor: prof. dr. Marko Robnik Šikonja

Ljubljana, 2021


This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.


Acknowledgments

I would like to sincerely thank my thesis mentor, prof. dr. Marko Robnik Šikonja, for motivating me to pursue the topic of model explainability and for providing advice, guidance, and otherwise helpful discussions throughout the process of creating the thesis.

I want to thank my family and friends, who believed in me and supported me throughout the journey.

Finally, I would like to thank the Laboratory for Cognitive Modeling and the Bioinformatics Laboratory for providing computational resources for running the experiments.

Matej Klemen, 2021


Contents

Abstract

Povzetek

Razširjeni povzetek . . . i
I Uvod . . . i
II Pregled sorodnih del . . . ii
III Prilagoditve razlagalnih metod . . . iii
IV Eksperimentalno ovrednotenje . . . v
V Sklep . . . vi

1 Introduction . . . 1

2 Related work . . . 7
2.1 Perturbation-based explanations . . . 7
2.2 Generators in perturbation-based explanations . . . 9
2.3 Explanations beyond single words . . . 10

3 Background . . . 13
3.1 Explanation methods . . . 13
3.2 Text generation . . . 18

4 Methods . . . 21
4.1 Explanation methods with text generators . . . 21
4.2 Explanation based on dependency structure . . . 27

5 Evaluation . . . 31
5.1 Used datasets . . . 31
5.2 Used models and generators . . . 34
5.3 Quality of perturbations . . . 35
5.4 Quality of explanations . . . 40

6 Conclusion . . . 55

A Additional tables . . . 57
A.1 SP and AOPC by class . . . 57

Abstract

Title: Adaptations of perturbation-based explanation methods for text classification with neural networks

Deep neural networks are successfully used for text classification tasks. However, as their functioning is opaque to users, they may learn spurious patterns, so we need mechanisms to explain their predictions. Current machine learning explanation methods are designed for general prediction and commonly assume tabular data. They mostly work by perturbing the inputs and assigning credit to the features that strongly impact the outputs. In our work, we propose modified versions of two popular explanation methods (Interactions-based Method for Explanation - IME, and Local Interpretable Model-agnostic Explanation - LIME) for explaining text classifiers. The methods generate input perturbations considering the input dependence. For that purpose, they use language models as generators of more natural perturbations. We first perform a distribution detection experiment, through which we empirically show that the generated perturbations are more natural than the perturbations used in the original IME and LIME. Then, we evaluate the quality of the computed explanations using automated metrics and compare them to the explanations calculated with the original methods. We find that their quality is generally worse, which we attribute to the generation strategy and metrics that measure a different type of importance. As a second contribution, we propose the calculation of IME and LIME explanations in terms of units longer than words, by using the dependency structure of the text.

Keywords

perturbation-based explanation methods, dependency-based explanations, text generation, IME explanation, LIME explanation

Povzetek

Naslov: Prilagoditve perturbacijskih razlagalnih metod za klasifikacijo besedil z nevronskimi mrežami

Globoke nevronske mreže lahko uspešno klasificirajo besedila. Njihovo delovanje ni transparentno, kar lahko privede do tega, da se naučijo lažnih vzorcev. Zato potrebujemo metode za razlago njihovih napovedi. Trenutne razlagalne metode so splošnonamenske in pogosto predpostavljajo tabelarično strukturo podatkov. Razlage pogosto izračunajo tako, da spreminjajo vhodne atribute in pomembnost pripišejo tistim atributom, katerih spremembe močno vplivajo na izhodne napovedi modela. V delu za razlago besedilnih klasifikacijskih modelov predstavimo prilagojene različice metod IME in LIME, ki upoštevajo odvisnosti med vhodnimi atributi. Odvisnosti upoštevajo z uporabo jezikovnih modelov, s katerimi generirajo naravnejše perturbacije vhodnih besedil. Najprej empirično pokažemo, da so generirane perturbacije naravnejše od perturbacij, uporabljenih v originalnih metodah IME in LIME. Nato s pomočjo avtomatskih metrik preverimo kvaliteto razlag, ustvarjenih na podlagi naravnejših perturbacij. Ugotovimo, da so razlage, ustvarjene s prilagojenimi metodami, večinoma slabše od razlag, ustvarjenih z originalnima metodama IME in LIME. Kot glavna razloga navedemo uporabljeno strategijo generiranja perturbacij ter uporabljene metrike, ki merijo drugačno vrsto pomembnosti. V delu predstavimo tudi način za računanje razlag na podlagi enot, daljših od posameznih besed, ki temelji na upoštevanju skladenjske strukture v besedilu. Preverimo kvaliteto

Ključne besede

perturbacijske razlagalne metode, razlage na podlagi jezikovnih odvisnosti, generiranje besedil, razlage IME, razlage LIME

Razširjeni povzetek

I Uvod

Globoke nevronske mreže so zaradi svoje uspešnosti vse pogosteje uporabljene za reševanje različnih problemov, na primer za klasifikacijo slik [1, 2] in strojno prevajanje [3]. Žal so tudi kompleksne in netransparentne, kar lahko prikrije nepravilno ali pristrano delovanje. Da takšno obnašanje odkrijemo, moramo znati napovedi modelov razložiti, za kar uporabljamo razlagalne metode. Primer teh so perturbacijske razlagalne metode, ki razlage izračunajo s spreminjanjem (perturbiranjem) vhoda v model in opazovanjem učinka na napoved modela. Trenutno uporabljene metode so pogosto splošnonamenske, a pri izračunu razlag predpostavljajo tabelarično obliko vhoda, kar lahko privede do zavajajočih razlag. Druga slabost je generiranje perturbacijskih vzorcev, ki lahko ob slabi izvedbi privede do napadov na razlagalne metode [4], kar omogoči prikrivanje dejanskega delovanja pristranskega modela.

V delu predstavimo prilagoditve uspešnih razlagalnih metod LIME (Local Interpretable Model-agnostic Explanation) [5] in IME (Interactions-based Method for Explanation) [6, 7, 8], ki upoštevajo medsebojne odvisnosti vhodnih besed v besedilnih podatkih. Prilagojene metode odvisnosti besed upoštevajo z uporabo jezikovnih modelov za generiranje slovnično in pomensko smiselnih perturbacij. Naš drugi prispevek je metoda, ki z uporabo skladenjske strukture besedila omogoča računanje razlag, daljših od posameznih besed.


II Pregled sorodnih del

II.I Perturbacijske razlage z generatorji

Perturbacijske metode razlage delovanja napovednega modela izračunajo z opazovanjem učinka, ki ga povzroči odstranitev atributa ali skupine atributov. Ker modeli niso nujno zmožni delovati z manjkajočimi atributi, razlagalne metode učinek manjkajočih atributov simulirajo, na primer z učenjem mnogih modelov, ki delujejo na delnih atributnih prostorih [6, 9, 10], ali z zamenjavo trenutne vrednosti atributa z izhodiščnimi vrednostmi [5, 7, 11].

Med slednje metode spadata tudi LIME, ki vrednost atributa zamenja s posebno besedo, in IME, ki vrednost atributa zamenja z naključno izbranimi vrednostmi iz učne množice. Metodi pri vzorčenju perturbacij predpostavljata neodvisnost vhodnih atributov, kar omogoča njuno splošnost, a lahko privede do težav, ko so atributi medsebojno odvisni (na primer v besedilih).

Problem poizkusijo rešiti prilagojene metode, ki medsebojno odvisnost atributov upoštevajo z uporabo pogojnih generatorjev [12, 13, 14, 15]. Za razlago besedilnih klasifikacijskih modelov omenimo tri rešitve, ki uporabljajo pogojne generatorje. Alvarez-Melis in Jaakkola [16] predstavita perturbacijsko razlagalno metodo, kjer je razlaga predstavljena s particijo bipartitnega grafa med vhodnimi perturbacijami, generiranimi z variacijskim samokodirnikom [17], in pripadajočimi napovedmi. Ross in sod. [18] jezikovni model uporabijo v povezavi z gradientno razlagalno metodo in z njim generirajo razlago s pomočjo protiargumentov. Harbecke in Alt [19] jezikovni model uporabita v metodi Occlusion [20, 21], ki razlago izračuna s spreminjanjem enega atributa naenkrat. Naše delo je najbolj podobno slednjemu, a mi jezikovni model vključimo v metodi IME in LIME, ki razlage izračunata s spreminjanjem večih atributov naenkrat, kar oteži generiranje smiselnih perturbacij.

Poleg jezikovnega modela v metodi vključimo še nadzorovan jezikovni model, ki perturbacije ustvari v skladu s specifičnim nadzornim signalom (na primer želenim razredom).


II.II Razlage na podlagi daljših enot

Razlage se privzeto računajo za posamezne atribute, na primer besede. Za računanje razlag na podlagi daljših enot je potrebno te enote definirati. Pregled vseh možnih enot je računsko prezahteven, zato se uporabljajo hevristični pristopi, ki jih razdelimo v tri skupine. Prvi način za določanje enot uporabi predpostavko, da so besede najbolj odvisne od sosednjih besed, in na ta način zmanjša število možnih enot [22]. Drugi način enote definira glede na vnaprej podano drevesno strukturo, ki predstavlja relacije med besedami [23]. Tukaj spada tudi naš pristop na podlagi skladenjskih dreves, ki jih lahko avtomatsko pridobimo za mnoge jezike in so še posebej primerna za opis strukture besedil v morfološko bogatih jezikih. Tretji način enote definira implicitno z optimizacijo dodatnega kriterija, ki korelirane besede združi v skupno enoto [24].

III Prilagoditve razlagalnih metod

III.I Razlage z generatorji

V delu obravnavamo razlagalni metodi LIME in IME, ki računata lokalne razlage, neodvisne od uporabljenega modela. Obe metodi razlage izračunata z opazovanjem obnašanja modela na perturbacijah vhodnega besedila. LIME perturbacije uporabi za učenje preprostega (na primer linearnega) modela, ki približno opiše obnašanje razlaganega modela v bližini vhodnega primera.

IME perturbacije uporabi za izračun razlike med pričakovanimi napovedmi (definirane z enačbo 3.2), ko besede v vhodnem besedilu poznamo in ko jih ne poznamo.

LIME perturbacije ustvari z enakomernim vzorčenjem binarnih interpretabilnih predstavitev, v katerih vrednost 1 pomeni, da je beseda v vhodnem besedilu prisotna, vrednost 0 pa, da je odsotna (zamenjana z vrednostjo [PAD]). V prilagojeni različici metode LIME vzorčenje obdržimo, definicijo interpretabilne predstavitve pa spremenimo tako, da vrednost 0 pomeni, da je beseda v vhodnem besedilu zamenjana z unikatno besedo, generirano z jezikovnim modelom (vrednost 1 pa obdrži prvoten pomen). Originalen in prilagojen način pridobivanja perturbacij sta prikazana na sliki 4.1.

IME za ocenjevanje pomembnosti atributa i ∈ S v vsaki perturbaciji ustvari par primerov: enega z originalno in enega s spremenjeno vrednostjo atributa i. Za vsako perturbacijo vzorči podmnožico atributov Q ⊆ S \ {i}, katerih vrednosti v perturbaciji ostanejo fiksne. Preostalim atributom dodeli vrednosti iz naključno izbranega primera iz učne množice. V delu predstavimo dve prilagojeni različici. Prva prilagojena različica (IME + iLM) preostalim atributom dodeli vrednosti s pogojnim generatorjem glede na kontekst v besedilu. Druga prilagojena različica (IME + eLM) obdrži naključno dodeljevanje novih vrednosti preostalim atributom, a te naključno izbere iz kvalitetnejše vzorčne množice, generirane s pogojnim generatorjem. Originalen in prilagojena načina pridobivanja perturbacij sta prikazana na sliki 4.2.

III.II Razlage na podlagi skladenjskih dreves

Originalne metode izračunajo razlage za posamezne besede. V predlagani metodi za izračun razlag na podlagi daljših enot posamezne besede združimo v večje enote na podlagi skladenjskih dreves. Metoda je prikazana na primeru na sliki 4.3. V prvem koraku izračuna začetno pomembnost posameznih besed. Nato v vrstnem redu od spodaj navzgor in od leve proti desni poizkusi združiti besede v istem poddrevesu. To stori zgolj, če je absolutna pomembnost združene skupine besed večja od absolutne vsote pomembnosti posameznih enot pred združitvijo.


IV Eksperimentalno ovrednotenje

Predlagane metode ovrednotimo na petih večjezikovnih podatkovnih množicah: SST-2 [25], SNLI [26], IMSyPP-sl [27], SentiNews [28] in primere v petih jezikih (angleščini, francoščini, ruščini, turščini in urdujščini) iz množice XNLI [29]. Za klasifikacijo besedil in generiranje perturbacij uporabimo modele BERT (Bidirectional Encoder Representations from Transformers) [30] in XLM-RoBERTa [31]. Opravimo dve vrsti eksperimentov.

V prvem eksperimentu preverimo, če so perturbacije, generirane z jezikovnimi modeli, naravnejše od perturbacij, ki jih ustvarita originalni metodi IME in LIME. Naravnost perturbacij merimo s klasifikacijsko točnostjo detekcije, če perturbacija pripada empirični distribuciji ali generirani distribuciji perturbacij. Rezultate prikažemo v tabeli 5.2. Ugotovimo, da vse prilagojene metode uporabljajo naravnejše perturbacije, ki jih kompleksen model težje razloči od empiričnih primerov kot perturbacije, uporabljene v IME in LIME.

V drugem eksperimentu kvantitativno, s pomočjo metrik SP (switching point) [32] in AOPC (area over the perturbation curve) [33], ter kvalitativno analiziramo razlage, izračunane s prilagojenimi metodami. Rezultate prikažemo v tabelah 5.3 in 5.4. Ugotovimo, da razlage, izračunane s prilagojenimi metodami, na vseh množicah dosegajo slabše rezultate kot razlage, izračunane z originalnimi metodami. Glavni problem izračunanih razlag je njihova redkost, ki se pojavi zaradi preveč lokalno usmerjene strategije generiranja perturbacij, ki ne razkrije vedno napovedne variance klasifikacijskega modela. Kot možen razlog izpostavimo tudi uporabljeni metriki, ki kvaliteto računata z odstranjevanjem posameznih besed na nenaravnih vhodnih primerih. S tem delujeta v korist originalnih metod IME in LIME, ki razlage prav tako računata na nenaravnih perturbacijah.

Razlage na podlagi skladenjskih dreves zaradi visoke računske zahtevnosti in negativnih rezultatov prilagojenih metod izračunamo zgolj z originalnima metodama IME in LIME. V tabeli 5.6 prikažemo število primerov (izmed 100), kjer predstavljena metoda ustvari razlage na podlagi enot, daljših od posameznih besed. Ugotovimo, da so združitve besed pogoste, saj na vseh množicah prilagojeno razlago dobi vsaj 75 % primerov. V tabeli 5.7 prikažemo frekvenco besednih vrst v združenih skupinah besed. Ugotovimo, da so najpogosteje združene pomožne besede s samostalniki, kar zmanjša redundantnost razlag, v množicah SNLI in XNLI pa tudi poveča informativnost razlag.

V Sklep

V delu smo predstavili prilagoditve metod IME in LIME, ki razlage z uporabo jezikovnih modelov ustvarijo na podlagi naravnejših perturbacij. Empirično smo preverili kvaliteto ustvarjenih perturbacij ter razlag. Pokazali smo, da so perturbacije naravnejše od tistih, ki jih uporabljata originalni metodi IME in LIME, razlage pa v splošnem slabše. Slabša kvaliteta se kaže zaradi redkosti razlag, kar je posledica preveč lokalno usmerjene strategije generiranja perturbacij. Kot drugi prispevek smo predstavili metodo, ki z uporabo skladenjske strukture besedila omogoča računanje prilagojenih razlag. Izračunane razlage imajo zaradi združitve besed manjšo redundanco, predstavljena metoda pa omogoča tudi diagnosticiranje napovedi modela.

Potencial za nadaljnje delo vidimo v podrobni analizi vpliva strategije generiranja perturbacij na kvaliteto razlag, izboljševanju učinkovitosti prilagojenih metod ter boljših metrikah, ki upoštevajo smiselnost besedila.

Chapter 1

Introduction

Neural networks are becoming an increasingly used method for solving tasks across diverse modalities and domains. A big factor in their popularity is their demonstrated practical success across various benchmarks. For example, complex neural networks (deep neural networks, DNNs) are successful in performing tasks such as image classification [1, 2], sentiment analysis [25], and machine translation [3]. However, there are cases where complex models have learned spurious patterns or started behaving unexpectedly given seemingly natural input, as shown by the racially biased predictions of the COMPAS recidivism algorithm [34] or adversarial inputs that cause a network to misclassify an image [35] or start outputting racist text [36]. Therefore, it is crucial that we can explain the behaviour of these models, both to prevent such abnormalities from happening and to find fixable flaws and increase the performance of the model.

Unlike simpler models such as linear regression, where we can explain the model by its weights, neural networks are much more complex and typically made up of multiple layers and a large number of weights, which we cannot make sense of easily. For example, the "base" and "large" variants of Bidirectional Encoder Representations from Transformers (BERT) [30], a commonly used neural model for text processing tasks, contain 110M and 340M weights, respectively.

To explain such models, explanation techniques have been developed that treat the model as a black box and produce the explanation by changing (perturbing) the input and observing the effect on the output of the model. Examples of such methods include Interactions-based Method for Explanation (IME, also called Shapley sampling values) [6, 7, 8], Local Interpretable Model-agnostic Explanation (LIME) [5], SHapley Additive exPlanation (SHAP) [37], and Prediction Difference (PredDiff, also known as Occlusion) [20, 21]. In our work, we study the first two methods in a text processing setting, but the presented modifications are applicable to the other explanation methods in a similar manner.

To explain an instance, both IME and LIME perturb the input by replacing a part of the input feature values with newly selected ones, assuming feature independence in the process. IME selects the new feature values from a randomly selected example in the sampling dataset, while LIME for textual data uses a fixed new feature value, which is intended to represent missingness. Although the methods are conceptually simple and generally applicable, they may fail and lead to misleading explanations when the features are dependent (as is the case when dealing with text). The reason lies in the strategy of obtaining perturbations, which produces unnatural examples due to assumed independence and puts too much emphasis on sparse regions in feature space, where the model does not behave as expected [38].

For example, a question-answering model may output a very high confidence score in an answer even to a nonsensical question, composed of just one word [39]. Apart from causing the explanations to be less faithful to the underlying model, the flaws in the perturbation strategy may also allow an owner of a biased model to detect when the behaviour of a model is being scrutinized and synthesize an explanation that hides the bias in that case [4, 40].

To alleviate this, several authors [12, 15, 14] propose to select the new feature values in a way that they correspond well to the fixed feature values.

Stated differently, they propose sampling new feature values by conditioning on the fixed feature values. In our work, we build upon this approach, applying it to textual data, which is often of much higher dimensionality than tabular data. In text, the features are practically always dependent due to the rules and collocations present in the language. We construct perturbations using (masked) language models that condition on a fixed part of the text and empirically show that they are "more natural", i.e., harder to distinguish from examples in the empirical distribution than original IME and LIME perturbations.

Although both IME and LIME are perturbation-based, they use the perturbations differently and as such compute different explanations. LIME uses the perturbations to learn a simple (e.g., linear) model, which aims to explain the behaviour of the original model in the neighbourhood of the explained instance. The internals of this simple model are then provided as explanations, e.g., the weights of the linear model are presented as word importances.

IME uses the perturbations to estimate the impact of subsets of features and then distributes this impact among individual features to produce feature contributions. The feature contributions correspond to Shapley values, which are a unique solution that satisfies certain desirable mathematical properties [41], which we outline in Chapter 3. However, the uniqueness guarantee may be a source of confusion. Different methods that compute Shapley values may produce significantly different explanations for the same explained instance due to their design choices, particularly due to the way they define the value function, which measures the model's prediction when only a subset of features is present [42].

A possible classification of explanation methods is based on their definition of the value function, related to the used perturbation strategy discussed above. The value function can either be an observational or interventional conditional model expectation [43]. Both types of value functions compute the expected model value over multiple samples. In the observational case, these are obtained by conditioning on the known feature values. In the interventional case, these are obtained by breaking correlations between the known and unknown feature values (intervening on the known feature values). Arguments have been made for and against both [15, 43, 44] and it is not immediately clear which option is better: the observational methods evaluate the models on sensible inputs, but may assign nonzero Shapley values to irrelevant features, while the interventional methods may evaluate models on nonsensical inputs, but do not assign nonzero Shapley values to irrelevant features. This distinction has an important implication for our work: as our methods estimate on-manifold Shapley values [15], we do not expect them to be the same as the Shapley values estimated by IME.

Another way of dividing the explanation methods is based on the reference distribution in contrast to which the methods compute the Shapley values. Merrick and Taly [45] present the framework Formulate, Approximate, Explain, which unifies multiple methods computing Shapley values. The methods formulate the explanations differently by choosing different reference distributions. The framework provides intuition for the proposed methods, which explain instances in line with true text distributions provided by language models. We experiment with pre-trained (untuned) and fine-tuned language models, which intuitively correspond to explaining the instances in a broader and narrower sense, respectively. For example, using an English language model pre-trained on general text will result in explanations in terms of the broader language context, while using a language model fine-tuned to a specific domain will result in explanations in terms of that narrow domain.

To summarize, in our work we modify and apply the perturbation-based explanation methods IME and LIME to textual data. We make the following contributions:

• We show that with language models, we can construct input perturbations that are less distinguishable from empirical data than perturbations constructed by the original IME and LIME methods. The outcome serves as motivation to augment the methods with language models.

• We present a modified version of IME and LIME that uses language models to generate perturbations.

• We evaluate the methods empirically on multiple datasets across different languages.

• We adapt the existing and proposed methods to calculate explanations for units longer than words.

We provide the implementations of the original and modified methods in an open-source Python package pete (PErturbable Text Explanations).¹

The remainder of the thesis is structured as follows. In Chapter 2, we provide an overview of existing literature on explanation methods in general and specifically in natural language processing. In Chapter 3, we describe the relevant existing methods and propose their modifications in Chapter 4.

In Chapter 5 we empirically evaluate the modifications and compare them to the original methods. Finally, in Chapter 6 we provide a summary of our work and possible directions for further research.

¹ https://github.com/matejklemen/pete


Chapter 2

Related work

In this chapter, we provide an overview of related work. First, we describe existing perturbation-based explanation methods. Then, we describe previous attempts at including generators into perturbation-based explanation methods. Finally, we discuss existing work on expanding explanations beyond single words.

2.1 Perturbation-based explanations

Perturbation-based explanation methods produce an explanation by modifying (perturbing) the input to a model and observing the effect on its output. Because these methods only use the model to observe the change in its prediction, they are typically model-agnostic.

The idea behind the methods is to observe the effect of a feature or a group of features by removing them from the input and observing the change in prediction. As the models are not necessarily designed to handle missing inputs, this can be handled by retraining the model on a partial feature space. Leave-One-Covariate-Out [9] estimates feature importances by removing a single feature at a time and retraining the model. The difference in the performance of the retrained and the initial model is the estimated feature importance. Lipovetsky and Conklin [10] and Štrumbelj et al. [6] apply the principle of feature removal and model retraining to estimate feature importances with Shapley values, although they do not limit the approaches to removing one feature at a time, instead removing feature subsets. Missingness can also be incorporated into the model by training the model to assume a certain feature value (e.g., 0) implies the feature is missing [46].

Retraining the model brings along additional computational complexity and can become impractical for larger models. Therefore, methods commonly approximate feature removal by setting the value of a feature to a baseline value such as 0 [11, 47] or a user-defined value [5]. This can lead to the introduction of adversarial artifacts, which might be imperceptible to humans, but can lead to significant changes in the behaviour of a model [48].

Instead of a single value, multiple values per feature can be drawn from a distribution, as in PredDiff [20, 21], where values are drawn according to the estimated feature's distribution, or Anchors [49], where only values (replacement words) with matching part-of-speech tags are drawn according to word embedding similarity. The idea is also used by methods that compute Shapley values, such as IME [7], Kernel SHAP [37], and SAGE [50], which draw replacement values for feature subsets from a sampling dataset.

The methods mentioned so far assume feature independence in their sampling procedures. This allows the methods to be generally applicable, but can lead to problems if the features are dependent, as is the case in textual data. In such cases, the produced input perturbations are unnatural and put too much emphasis on regions in feature space where the explained model does not behave as expected [38]. To fix this, several approaches have been proposed which take into account feature dependence through the use of a conditional data generator. We discuss these next.


2.2 Generators in perturbation-based explanations

The inclusion of generators in perturbation-based methods has mostly been studied for explaining tabular data. Aas et al. [12] extend Kernel SHAP to handle dependent features. They present four different approaches for sampling perturbations: from a multivariate Gaussian distribution, a Gaussian copula distribution, an empirically estimated conditional distribution, and a hybrid between these approaches. In a similar manner, Saito et al. [13] and Vreš and Robnik Šikonja [14] apply generators to perturbation-based explanation methods to observe if the generated perturbations improve their robustness against adversarial attacks [4]. Specifically, the first work applies a generative adversarial network [51] to LIME and the second applies three types of generators to IME, LIME and Kernel SHAP. In both cases, the authors find that the robustness is improved.

The correctness of using a conditional generator in methods estimating Shapley values is still debated and not entirely certain [43]. In recent work, Frye et al. [15] discuss this problem in the frame of perturbations lying on-manifold or off-manifold and argue in favour of the former. In addition, they present two approaches to estimate the so-called on-manifold Shapley values, one of which generates perturbations using a variational autoencoder [52] and a masked decoder. Our work continues this line of research, estimating on-manifold Shapley values on textual data using language models as generators.

In explanation methods for text processing models, the inclusion of generators has received less attention. Alvarez-Melis and Jaakkola [16] present a perturbation-based explanation method, where explanations are represented as partitions of a bipartite graph between the perturbed inputs and outputs, which models their causal dependencies. In the method, they generate textual input perturbations using a variational autoencoder [17]. Instead of using it in an intermediate step towards producing an explanation, Ross et al. [18] use a language model generator to directly construct a counterfactual explanation. As a token masking strategy for training the generator, they use a gradient attribution method [53]. In the work that is most similar to ours, Harbecke and Alt [19] include a language model generator into Occlusion, an explanation method that perturbs one feature at a time. In contrast to their work, we study the inclusion of language models into IME and LIME, which perturb multiple features at a time, making the generation of natural examples more difficult. While they motivate their approach mostly through axiomatic analysis, we focus on empirical evaluation on multilingual data.

Lastly, in addition to including a language model generator, we also include a controlled language model generator that attempts to generate perturbations according to a specified control signal (in our case, the target label).

2.3 Explanations beyond single words

One of our contributions is an adaptation of explanation methods for explaining textual units longer than words. If longer units are explained instead of single words, the resulting explanations may be simpler and semantically more meaningful, which is a desirable property for end users [54]. In addition, the words inside longer units may interact, so the resulting explanations may uncover new insights. Below we present existing approaches that extend explanations beyond single words.

Murdoch et al. [55] propose Contextual Decomposition as a way to explain predictions of a Long Short-Term Memory (LSTM) network [56]. The method is able to capture both individual importance as well as interaction effects. Singh et al. [57] generalize this method for other deep neural network variants and propose a clustering procedure that constructs hierarchical explanations which show the composition of word-level importances into importances of progressively longer text spans. Jin et al. [58] argue that context independence is a desirable property for the phrase importance computed by these methods. They show that existing methods do not take this into account and propose a modified method. Chen et al. [59] propose a method for computing hierarchical explanations in a top-down procedure by iteratively dividing the spans into two parts at the point where the least interaction occurs. Similarly, Zhang et al. [60] propose a bottom-up method for computing hierarchical explanations that chooses the words to group by taking into account individual importance and interaction effects.

The process of combining word importance into the importance of longer units involves a step that selects the words to group next. The number of possible choices quickly becomes too big to check exhaustively, so in most cases the outlined methods use an assumption that words frequently interact with their close neighbours [22].

A different approach, which does not require this assumption, uses a predefined tree structure instead of trying to construct it automatically. In the work most similar to ours, Chen et al. [23] construct explanations according to the underlying constituency trees. In contrast, we use dependency trees, which are similar but more suitable for, and more widely available in, morphologically rich languages.

A third approach is to perform the grouping implicitly. Chen et al. [24] propose a method that learns to group correlated words jointly with the used model on a downstream task. However, this explanation method loses the ability to produce hierarchical explanations.

Chapter 3

Background

Before we describe our modifications, we provide an overview of the individual building blocks. In Section 3.1, we describe the studied explanation methods. Then, in Section 3.2, we describe the process of text generation using language models.

3.1 Explanation methods

In this section, we describe the studied explanation methods. First, we describe LIME, a method that produces an explanation using a surrogate model. Next, we describe IME, which uses a pure sampling approach to produce explanations in the form of Shapley values.

3.1.1 LIME

LIME (Local Interpretable Model-agnostic Explanation) is a local perturbation-based model-agnostic explanation method introduced by Ribeiro et al. [5]. The method creates an explanation for a given input by approximating the local behaviour of the interpreted model with a simple model (e.g., linear).

The authors motivate LIME by outlining that an explanation should be:


1. Interpretable. The explanation should present the relation between the inputs and response of a model, taking into account the limits of the end user.

2. Locally faithful. The explanation must represent the true behaviour of the interpreted model in the neighbourhood of the explained instance. This is desired in order to increase the trust of the user in the model. While globally faithful explanations might be even more trustworthy, they are harder to produce in a form that is still interpretable to the end user.

3. Model-agnostic. The explanations should be constructible for an arbitrary model so that the explanation method is generally applicable.

To satisfy the first requirement, the authors first define an interpretable representation of an explained instance. This representation is chosen in a way that its meaning is comprehensible to humans. In textual LIME, this is a binary vector, where 1 indicates the original word is present in the sequence, and 0 indicates it is absent. The absence is simulated by replacing the word with a dummy word (in our case, the word [PAD]).
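The mapping between the interpretable (binary) representation and the actual text can be illustrated with a small toy example (our own, using the instance that also appears in the figures of Chapter 4):

```python
words = ["This", "show", "is", "very", "good"]
mask = [1, 1, 0, 0, 1]  # interpretable representation: 1 = word present, 0 = absent

# Non-interpretable (textual) representation fed to the explained model.
perturbed = " ".join(w if m else "[PAD]" for w, m in zip(words, mask))
print(perturbed)  # -> "This show [PAD] [PAD] good"
```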

The interpretable representations of input perturbations are used to train a model that locally approximates the interpreted model in the neighbourhood of the explained instance. As the internals of the model will be used as an explanation, and we want the explanation to be comprehensible to the end user, the model needs to be simple enough, e.g., a linear model or a shallow decision tree.

To ensure the simple model is locally faithful to the interpreted model, its training examples are weighted according to how distant their interpretable representation is from the interpretable representation of the explained instance. For text, the weights are calculated with an exponential kernel over the cosine distances between the interpretable representations. The exponential kernel is parametrized by the kernel width σ, which controls how local the explanations are.

In explaining text classifiers, LIME first samples the local neighbourhood of the instance uniformly at random. In practice, this means that binary vectors are constructed with a random number of elements zeroed out. Using the interpretable representations, their non-interpretable counterparts are created next by replacing the absent words in the original sequences with the word [PAD]. The non-interpretable representations are passed through the model in order to obtain the model's predictions, which are then approximated with the simpler model. Finally, the weights for the samples are computed using an exponential kernel, and a simple model is fit to predict the model's behaviour based on the interpretable representations of the neighbourhood. In our experiments, we learn a sparse ridge regression model [61], where the enforced sparsity keeps the explanations simple. To achieve sparsity, we first train a model using all features, then greedily select only the K features with the highest absolute importance and retrain the model using the selected features.
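As a concrete illustration of the procedure above, the following is a minimal sketch of textual LIME: random binary masks, [PAD] replacement, an exponential kernel over the cosine distance to the full instance, and a sparse ridge surrogate. The `predict_proba` callable, the hyperparameter defaults, and the greedy top-K selection are simplifications and assumptions for illustration rather than the exact implementation used in the thesis.

```python
import numpy as np
from sklearn.linear_model import Ridge


def explain_lime_text(words, predict_proba, label, num_samples=1000,
                      kernel_width=0.25, top_k=5, pad_token="[PAD]"):
    """Sample binary interpretable vectors, replace absent words with a dummy
    token, weight samples by proximity, and fit a sparse ridge surrogate."""
    n = len(words)
    # 1) Sample the local neighbourhood: random binary masks (1 = word kept).
    masks = np.random.randint(0, 2, size=(num_samples, n))
    masks[0] = 1                                   # keep the unperturbed instance

    # 2) Build the non-interpretable (textual) representations and query the model.
    texts = [" ".join(w if m else pad_token for w, m in zip(words, mask))
             for mask in masks]
    preds = np.array([predict_proba(text)[label] for text in texts])

    # 3) Weight samples with an exponential kernel over the cosine distance
    #    between each mask and the all-ones vector of the explained instance.
    cos_sim = masks.sum(axis=1) / (np.sqrt(masks.sum(axis=1)) * np.sqrt(n) + 1e-12)
    weights = np.exp(-((1.0 - cos_sim) ** 2) / kernel_width ** 2)

    # 4) Fit a surrogate on all features, then retrain on the top_k features only.
    surrogate = Ridge(alpha=1.0).fit(masks, preds, sample_weight=weights)
    keep = np.argsort(-np.abs(surrogate.coef_))[:top_k]
    sparse = Ridge(alpha=1.0).fit(masks[:, keep], preds, sample_weight=weights)
    return {words[i]: float(c) for i, c in zip(keep, sparse.coef_)}
```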

3.1.2 IME

IME (Interactions-based Method for Explanation) [6, 7, 8] is similar to LIME in the sense that it is a local perturbation-based model-agnostic explanation method. However, in contrast to LIME, it does not explain the model by (explicitly) fitting an approximate simpler model; instead, it provides the explanation in the form of Shapley values [41].

The authors motivate IME by outlining the limitation of previous methods, which constructed explanations by perturbing one feature at a time. Previous methods define the importance of a feature i as the difference between the prediction of a model and the expected prediction of the model if the value of the i-th feature is unknown:

$$\phi_i(x) = f(x) - E\left[f(x_1, \ldots, X_i, \ldots, x_n)\right], \qquad (3.1)$$

where $x = [x_1, \ldots, x_n]$ is the explained instance and $f$ is the explained model.

Such explanations can be misleading when features interact. As a solution, the authors propose a method that perturbs features across all subsets, defining the feature importance as:

$$\begin{aligned} \phi_i(x) &= \sum_{Q \subseteq S \setminus \{i\}} \frac{|Q|!\,(|S| - |Q| - 1)!}{|S|!} \left( v_{Q \cup \{i\}}(x) - v_Q(x) \right) \\ &= \sum_{Q \subseteq S \setminus \{i\}} \frac{|Q|!\,(|S| - |Q| - 1)!}{|S|!} \left( E(f \mid X_j = x_j,\ \forall j \in Q \cup \{i\}) - E(f \mid X_j = x_j,\ \forall j \in Q) \right), \end{aligned} \qquad (3.2)$$

where $S$ is the set of all features. The solution to this equation corresponds to Shapley values, a concept from coalitional game theory.

In game theory, a coalitional game is defined by a set of players $S$ that form the grand coalition, and a characteristic function $v_Q$, which defines the payout the players gain by forming a coalition $Q$, with $v_\emptyset = 0$. In explanation methods for machine learning models, the set of players corresponds to the explained features, and the characteristic function $v_Q$ corresponds to the difference between the expected model prediction when a subset of features $Q$ is known and the expected model prediction. The goal is to distribute the worth of the grand coalition $v_S$, as defined by Equation 3.3, among the features in a fair way.

$$v_S(x) = E(f \mid X_i = x_i,\ \forall i \in S) - E(f \mid X_i = x_i,\ \forall i \in \emptyset) \qquad (3.3)$$

Shapley values are the unique solution that satisfies four axioms that formalize fairness: symmetry, efficiency, law of aggregation (also called linearity or additivity), and dummy [41].
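As a quick illustration (our own example, not taken from the thesis), for an instance with only two features, $S = \{1, 2\}$, Equation 3.2 reduces to

$$\phi_1(x) = \tfrac{1}{2}\left( v_{\{1\}}(x) - v_{\emptyset}(x) \right) + \tfrac{1}{2}\left( v_{\{1,2\}}(x) - v_{\{2\}}(x) \right),$$

where the two terms correspond to the permutations in which feature 1 appears first and second, respectively, and $v_\emptyset(x) = 0$ by the definition above.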

As the exact calculation of Shapley values has an exponential time complexity, the authors of IME present a sampling method that enables their approximation. In order to derive the method, they first reformulate Equation 3.2 as:

$$\phi_i(x) = \frac{1}{|S|!} \sum_{O \in \pi(S)} \left( v_{Pre_i(O) \cup \{i\}}(x) - v_{Pre_i(O)}(x) \right), \qquad (3.4)$$

where $\pi(S)$ is the set of all ordered permutations of the feature indices, and $Pre_i(O)$ are the feature indices that are located before $i$ in the permutation $O$. Next, they assume feature independence, so the equation becomes:

$$\phi_i(x) = \frac{1}{|S|!} \sum_{O \in \pi(S)} \sum_{x' \in X} p(x') \left( f\big(x'_{[x'_j = x_j,\, j \in Pre_i(O) \cup \{i\}]}\big) - f\big(x'_{[x'_j = x_j,\, j \in Pre_i(O)]}\big) \right), \qquad (3.5)$$

where $x'_{[x'_j = x_j,\, j \in Q]}$ is used to denote an instance $x'$ whose features $Q$ are set to the corresponding values in $x$, and $X$ is the sampling dataset.

Instead of computing the mean across all permutations (and computing the expectation over all examples in the sampling dataset), the mean over $m$ samples $\{(O_j, x'_j)\}_{j=1,\ldots,m}$ is taken as an approximation $\hat{\phi}_i(x)$, with variance $\frac{\sigma_i^2}{m}$, where $\sigma_i^2$ is the sampling population variance.

To compute the explanation for an input instance $x$, IME computes $\hat{\phi}_i(x)$ for each feature $i$ using the approximation procedure described above. This approximation is determined by two parameters: the minimum number of samples to take per feature ($m_{min}$) and the total sampling budget ($m_{max} \geq m_{min} \cdot |S|$). First, an initial estimate of the feature importance and its variance is calculated using $m_{min}$ samples. Then, the remaining sampling budget of $(m_{max} - m_{min} \cdot |S|)$ samples is iteratively allocated among the features in a way that decreases the expected error of the estimate. Concretely, an additional sample is allocated to the feature $j$ for which $\left(\frac{\sigma_j^2}{m_j} - \frac{\sigma_j^2}{m_j + 1}\right)$ is maximal, assuming the goal is to decrease the expected squared error.¹

¹ The derivation for this, and for the case where the expected absolute error is minimized instead, is available in the original paper [8].
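The sampling procedure can be summarized with the following minimal sketch (our own simplification): a single sample for feature i draws a random permutation and a random reference example following Equation 3.5, and the remaining budget is allocated greedily using the variance criterion above. The `predict` callable (returning the model's output for the explained class), the structure of `sampling_dataset`, and the default budget values are illustrative assumptions.

```python
import random

import numpy as np


def one_sample(i, x, predict, sampling_dataset):
    """One (permutation, reference example) sample for feature i, per Eq. 3.5."""
    n = len(x)
    order = list(range(n))
    random.shuffle(order)                      # random permutation O
    pre_i = set(order[:order.index(i)])        # features located before i in O
    ref = random.choice(sampling_dataset)      # reference example x'
    with_i = [x[j] if (j in pre_i or j == i) else ref[j] for j in range(n)]
    without_i = [x[j] if j in pre_i else ref[j] for j in range(n)]
    return predict(with_i) - predict(without_i)


def ime(x, predict, sampling_dataset, m_min=30, m_max=2000):
    """Approximate Shapley values with adaptive allocation of the sampling budget."""
    n = len(x)
    samples = {i: [one_sample(i, x, predict, sampling_dataset) for _ in range(m_min)]
               for i in range(n)}
    for _ in range(max(0, m_max - m_min * n)):
        # Allocate the next sample to the feature with the largest expected
        # error reduction: sigma_j^2 / m_j - sigma_j^2 / (m_j + 1).
        gains = {j: np.var(s) / len(s) - np.var(s) / (len(s) + 1)
                 for j, s in samples.items()}
        j = max(gains, key=gains.get)
        samples[j].append(one_sample(j, x, predict, sampling_dataset))
    return {i: float(np.mean(s)) for i, s in samples.items()}
```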


3.2 Text generation

The described explanation methods create unnatural input perturbations (not from the true data distribution), which we propose to replace with more natural perturbations (closer to the data distribution), created with a text generator. In this section, we provide background on text generation: in the first section, we describe language modeling, and in the second, we describe BERT, a specific language model we use.

3.2.1 Language modeling

Language modeling is a task in which the goal is to estimate the probability of observing a text sequence (e.g., a word or a sentence) or a probability of observing a text sequence after another text sequence. It is an important component in many applications, such as machine translation or paraphrase generation. Formally, given a sequence of words $s = w_1, w_2, \ldots, w_n$, the task is to compute

$$P(s) = P(w_1, \ldots, w_n) = P(w_1) \cdot P(w_2 \mid w_1) \cdot \ldots \cdot P(w_n \mid w_1 \ldots w_{n-1}). \qquad (3.6)$$

The prediction for each word is conditioned on its preceding words. In order to learn a good estimate of the probabilities, language models are trained using large corpora containing diverse texts. A trained language model can generate new sequences by sampling from the estimated probability distributions [62].
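Equation 3.6 can be evaluated directly with an autoregressive language model. The snippet below is an illustration only (the thesis itself uses masked language models, described next): it scores a sentence with GPT-2 via the Hugging Face transformers library, whose causal-LM loss is the mean negative log-likelihood of the shifted tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

enc = tokenizer("This show is very good", return_tensors="pt")
with torch.no_grad():
    out = model(**enc, labels=enc["input_ids"])

# loss is the mean NLL over the n-1 predicted tokens, so the summed
# log-probability of tokens 2..n given the first token is approximately:
num_predicted = enc["input_ids"].shape[1] - 1
log_prob = -out.loss.item() * num_predicted
print(log_prob)
```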

Traditional language modeling is limited to approaches conditioning only on the left context. A modification of language modeling which uses parts of both the left and the right context is called masked language modeling (MLM) [30].² Given a partially hidden sequence, the goal of MLM is to estimate the probability of the hidden words given the visible words. Unlike a traditional language model, a trained masked language model cannot generate arbitrary new sequences out of the box but can instead reformulate an existing sequence.

² While the task was named "masked language modeling" by Devlin et al., it is also known as "gap fill" or the "Cloze task" [63].
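For example, a pre-trained masked language model can reformulate an existing sequence by filling in a hidden position. Below is a minimal illustration using the transformers fill-mask pipeline; the specific model name is an assumption.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# Hide one word of an existing sequence and let the MLM propose replacements.
for candidate in fill_mask("This show is [MASK] good."):
    print(candidate["token_str"], round(candidate["score"], 3))
```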

To have finer control over the modeled distribution, the language model can be conditioned on other signals in addition to the preceding and succeeding words. This enables the model to perform controlled text generation, e.g., the generation of examples that have a specific sentiment. While this provides more control over the generated text, the training process is no longer self-supervised but supervised, as the label of the text needs to be provided.

In our work, we use a masked language model called BERT to generate more natural text perturbations, both in a masked language modeling (MLM) and in a controlled masked language modeling (CMLM) setting.

3.2.2 BERT

BERT (Bidirectional Encoder Representations from Transformers) [30] is a bidirectional transformer-based [64] masked language model.

It is built from multiple transformer encoder layers, where each encoder layer is composed of a multi-head self-attention mechanism and a fully connected network. It operates on subword inputs, using a WordPiece vocabulary [65] to encode them.

As a result of a customized training objective, it is able to capture both the left and right context of a word. Instead of being optimized on the language modeling task, it is optimized on two tasks jointly:

• Masked language modeling. Given a partially masked sequence with a portion of words hidden, the model is trained to fill in the gaps correctly.

• Next sentence prediction. Given two sentences, the model is trained to predict whether the sentences are adjacent.


The training of BERT models is divided into two stages: pre-training and fine-tuning. During pre-training, the masked language modeling and next sentence prediction tasks are optimized on a large corpus of text in an unsupervised manner. During fine-tuning, a new layer (typically fully connected) is added on top of the pre-trained model and trained for a downstream task, such as hate-speech classification.
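With the transformers library, adding such a task-specific layer amounts to loading the pre-trained encoder together with a freshly initialized classification head (a sketch; the model name and number of labels are placeholders):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Loads the pre-trained BERT encoder and adds a new, randomly initialized
# fully connected classification head, which is then trained on the downstream task.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
```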

Chapter 4

Methods

In this chapter, we describe the proposed modifications of the explanation methods. In Section 4.1, we describe the inclusion of language models as text generators into the explanation methods, and in Section 4.2, we describe the calculation of explanations for text units longer than words, using dependency trees.

4.1 Explanation methods with text generators

In this section, we describe the modified explanation methods which use language models to generate natural text perturbations: first, we describe the modifications in LIME, and then the modifications for IME.

4.1.1 LIME

To include language models into LIME, we first modify the definition of its interpretable representation. As described in Section 3.1.1, textual LIME uses a binary vector as an interpretable representation of the explained instance, where 1 indicates the original word is present in the sequence, and 0 indicates it is absent (replaced with a dummy word). The modified version of LIME keeps this binary representation, but simulates the absence of a word by replacing it with a word generated using a language model. When using a language model to re-generate the word, it may generate the original word, which violates the definition of the binary interpretable representation. To prevent this, a uniqueness constraint is imposed on the output of the language model, meaning the language model cannot replace the original word in the sequence with the same word. In practice, this is achieved by setting the probability of generating the original word to 0 and re-normalizing the output distribution using the softmax function. The modified process to obtain input perturbations is shown in comparison to the process used in LIME in Figure 4.1. Apart from the modified process to obtain the perturbations, the explanation method remains unchanged from the original LIME method.

To fully describe the modified process of creating perturbations, the user of the method chooses the generation strategy. The strategy determines how to construct LIME's non-interpretable representation (i.e., the actual text) given a binary interpretable representation, which indicates the words that are fixed and the words that need to be re-generated. Stated differently, it defines the local neighbourhood of the explained instance, in which the behaviour of the model is approximated with a surrogate model. In our work, we re-generate the words one by one in a left-to-right order using greedy decoding with dropout [66]. We only "hide" (mask) one word at a time in order to allow the generator to use the most context available. For example, if we need to re-generate two words, we first mask the first word and re-generate it, then mask the second word and re-generate it. Greedy decoding selects the replacement word as the one with the highest assigned probability, while dropout randomly disables (i.e., sets to 0) a portion of the language model weights and introduces variance into the distribution of replacement words, with the goal of making the generated text more diverse [67].

The settings of this generation strategy were selected during development

and we use them as reasonable defaults. We leave the detailed exploration of alternative strategies and their effect on the produced explanations for further work.

Figure 4.1: A comparison between the creation of perturbations in LIME and its modified version when computing an explanation for the instance "The show is very good". In both cases, the interpretable representation of LIME is first determined randomly. Then, instead of replacing the absent words (on positions marked with 0 in the interpretable representation) with a dummy word, the modified version replaces them with words generated by a language model.
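A minimal sketch of this generation strategy is given below: positions are re-generated one at a time in the given order, the original token is forbidden at each step (the uniqueness constraint), dropout is kept active to diversify the outputs, and the most probable remaining token is chosen greedily. The model name and the helper's signature are illustrative assumptions, not the exact implementation from the pete package.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm.train()  # keep dropout active so repeated calls yield more diverse replacements


def regenerate(token_ids, positions):
    """Re-generate the tokens at `positions`, one position at a time, in the
    given order (left-to-right for the modified LIME), forbidding the original
    token at each position."""
    ids = token_ids.clone()
    for pos in positions:
        original = ids[0, pos].item()
        masked = ids.clone()
        masked[0, pos] = tokenizer.mask_token_id   # hide only one word at a time
        with torch.no_grad():
            logits = mlm(input_ids=masked).logits[0, pos]
        logits[original] = float("-inf")           # uniqueness constraint
        ids[0, pos] = int(torch.argmax(logits))    # greedy decoding
    return ids


# Example: re-generate the tokens at positions 3 and 4 (marked absent by LIME).
encoded = tokenizer("The show is very good", return_tensors="pt")["input_ids"]
perturbed = regenerate(encoded, [3, 4])
print(tokenizer.decode(perturbed[0], skip_special_tokens=True))
```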

4.1.2 IME

As in the modified version of LIME, the only difference in the modified version of IME is in the process of creating perturbations.

In IME, the perturbations are pairs of examples that are used to estimate the difference in the prediction of a model when we know the value of the i-th feature and when we do not, across all possible subsets. For a specific feature subset Q, the first example in the perturbation is created by fixing the words Q ∪ {i} and the second example is created by fixing the words Q, while replacing the remaining words with words from a random example in the sampling dataset. Because IME assumes feature independence, the words that are not fixed are replaced randomly, without taking the fixed words into account. We propose two modified versions of IME. The first strongly relaxes the assumption of feature independence. The second keeps the assumption, but generates a sampling dataset which reduces the effect of the assumption on the created perturbations. Both versions of the modified process for creating perturbations are shown in contrast to the original process in Figure 4.2 and described next.

The first version randomly replaces the words which are not fixed according to the output distribution of a language model conditioned on the remaining context. This means that instead of selecting the words blindly, they are selected to fit into the context. As currently defined, the method removes the feature independence assumption and generates replacement words twice per sample, once by assuming the words Q ∪ {i} are fixed and once by assuming the words Q are fixed. However, as the context only differs by one word, we generate the replacement words only once per sample, by assuming the words Q are fixed. We make this assumption in order to reduce the

computational cost of the modified method: by making this assumption, the method makes only half of the generation calls it would otherwise make.

Figure 4.2: A comparison between the creation of perturbations in IME and its modified versions when computing the importance of the word "good" in the instance "The show is very good". Each sample is made of two examples: one where the words Q_i are fixed (locked) and one where the words Q_i ∪ {"good"} are fixed. The original IME (top row) replaces the words that are not fixed with those in a random example from the sampling dataset. IME with an internal LM (middle row) generates the words that are not fixed with a language model. IME with an external LM (bottom row) generates a new sampling dataset with variations of the explained instance, which is then used the same way as in the original IME.

However, the assumption can be removed in case of a lightweight language model or an efficient generation strategy. Because this method uses a language model inside the explanation method, we refer to it as IME with an internal LM (IME + iLM).

The second version uses a language model to generate a dataset of examples that are similar to the explained instance, which is then used as the sampling dataset in the original version of IME. Stated differently, the language model is used to generate the set X in Equation 3.5. To do this, we re-generate the words in the explained instance |X| times, where |X| is the size of the generated dataset, specified by the user. We re-generate all the words instead of randomly choosing which words should be fixed and which not, because the latter process is already done by IME, which uses the generated dataset. The motivation for this method is to further reduce the computational cost of using a generator while creating more natural perturbations than in the original IME. The reduction in computational cost can come either from generating a dataset X whose size is smaller than the maximum number of samples used in IME, or from generating the dataset in advance and benefiting from batched generation. In contrast to the first modified version, this method uses a language model outside the explanation method, so we refer to it as IME with an external LM (IME + eLM).

Both modified versions require the user to specify a generation strategy. For IME with an internal LM, we use the same generation strategy as in the modified version of LIME. In IME with an external LM, we shuffle the generation order of the words instead of generating them left-to-right. The goal is to obtain more variations of naturally occurring text in the generated samples.
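A minimal sketch of the IME with an external LM dataset generation follows, reusing the hypothetical `regenerate` helper sketched in Section 4.1.1: every word of the explained instance is re-generated once per sample, in a freshly shuffled order, and the resulting variations form the sampling dataset passed to the original IME.

```python
import random

import torch


def generate_sampling_dataset(token_ids, num_samples, regenerate):
    """Build a sampling dataset of variations of the explained instance
    (IME + eLM): re-generate every word, in a shuffled order, num_samples times."""
    positions = list(range(1, token_ids.shape[1] - 1))  # skip [CLS] and [SEP]
    samples = []
    for _ in range(num_samples):
        order = positions[:]
        random.shuffle(order)                # shuffled generation order
        samples.append(regenerate(token_ids, order))
    return torch.cat(samples, dim=0)         # shape: (num_samples, sequence_length)
```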

4.2 Explanation based on dependency structure

Our second contribution is a procedure that automatically constructs explanations in terms of textual units longer than single words, using the dependency structure of the text.

To compute an explanation based on units bigger than single words, multiple features need to be combined into disjoint feature groups. In other words, the individual features used originally are combined into superfeatures and treated as atomic features in the process of creating perturbations. For example, one way of grouping words into bigger units might be based on contiguous word-bigrams, i.e. the instance “This show is very good” would be explained through the importance of the word groups “This show”, “is very” and “good”.

As hinted by the example, not all feature groupings are sensible. The number of ways to group features into disjoint groups is prohibitively large¹ when dealing with more than a few features, so we cannot iterate through all the options. We propose an automatic bottom-up grouping procedure based on the dependency structure of the explained text instance. The dependency structure presents the syntax of a sentence through dependency relations among its words. For example, in the sentence “This show is very good”, the word “very” depends on “good”, emphasizing its meaning, and “This show” and “very good” are connected via the verb “is”. Compared to the constituent structure, which also expresses the syntax of a sentence, the dependency structure is more suitable for dealing with morphologically rich languages [69]. Due to the Universal Dependencies project [70], dependency parsers are available for many more languages than constituency parsers.

¹The number of possible partitions of a set with n elements is known as Bell's number B_n [68].

The procedure is illustrated in Figure 4.3 and described next. As an input, the procedure receives a text instance x and its dependency structure, which is described with a tree for each sentence contained in the instance.

In the first step, it calculates the importance of individual words, which serves as the criterion for merging the words. Merging the child words in a subtree with the root of the subtree is attempted in bottom-up, left-to-right order. To merge the child words with the parent word, the child words need to either have been previously merged into one group or be leaves of the tree. If the combined words are assigned an absolute importance that is greater than the absolute sum of their individual importances, the words are merged, otherwise the merge is ignored. For example, in Figure 4.3, the words “this” and “show” are not merged because their joint absolute importance (0.13) is not greater than the absolute sum of their individual importances (|0.03 + 0.10|). The idea is to merge only those words which become more important jointly, due to the newly captured interaction. The procedure terminates once there are no more valid merge options because a deeper subtree was not merged, or when it arrives at the root of the (last) sentence. The procedure does not consider merging the root word of the sentence with its immediate children and terminates right before this step.

For example, in Figure 4.3, the procedure does not consider merging “This show”, “is” and “very good” even if the other conditions are met. Such a merge would produce an explanation in terms of sentences, for which a more efficient approach can be used. In addition, in our experiments some tasks involve single-sentence instances, for which a sentence explanation is not useful.
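A compact sketch of this merging procedure is given below. The Node structure and the explain(groups) callable are hypothetical stand-ins: explain is assumed to return one importance score per feature group (for example, obtained by running IME or LIME with the given groups treated as atomic features), and comparing the joint importance against the sum of the initial word importances is one straightforward reading of the criterion described above.

from dataclasses import dataclass, field

@dataclass
class Node:
    index: int                                   # position of the word in the sentence
    children: list = field(default_factory=list)

def merge_subtree(node, word_scores, explain, groups):
    """Try to merge node's subtree into a single group (bottom-up, left-to-right).
    Returns the merged group of word indices, or None if the subtree stays split.
    Successful merges are recorded in groups."""
    if not node.children:
        return [node.index]                      # a leaf is trivially one group
    child_groups, mergeable = [], True
    for child in node.children:                  # left-to-right order
        group = merge_subtree(child, word_scores, explain, groups)
        if group is None:
            mergeable = False                    # a deeper subtree was not merged
        else:
            child_groups.append(group)
    if not mergeable:
        return None
    candidate = sorted([node.index] + [i for g in child_groups for i in g])
    joint = explain([candidate])[0]              # importance of the joint unit
    if abs(joint) > abs(sum(word_scores[i] for i in candidate)):
        # Keep the merge: drop the groups it subsumes and record the new one.
        subsumed = set(candidate)
        groups[:] = [g for g in groups if not set(g) <= subsumed]
        groups.append(candidate)
        return candidate
    return None

def dependency_explanation(words, sentence_root, explain):
    """Explain words in terms of larger units derived from the dependency tree."""
    word_scores = explain([[i] for i in range(len(words))])   # step 1: single words
    groups = [[i] for i in range(len(words))]
    # The root word is never merged with its immediate children; only its
    # subtrees are considered (steps 2 and 3).
    for child in sentence_root.children:
        merge_subtree(child, word_scores, explain, groups)
    groups.sort()
    return groups, explain(groups)

# Example tree for "This show is very good": is -> (show -> This), (good -> very)
# words = ["This", "show", "is", "very", "good"]
# root = Node(2, children=[Node(1, children=[Node(0)]),
#                          Node(4, children=[Node(3)])])
# groups, importances = dependency_explanation(words, root, explain_fn)

On the example from Figure 4.3, this sketch keeps “This”, “show” and “is” as separate units and merges “very” and “good” into a single unit.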


Figure 4.3: Automatic creation of explanations based on larger text units. As the input, the instance to be explained and its dependency structure are provided. The procedure first computes the initial importance of the words (step 1) and attempts to merge words in a bottom-up, left-to-right manner according to the dependency structure (steps 2 and 3). In the figure, the importance of words is indicated above them and by their color; darker green indicates that a text unit has higher importance for the current prediction (in this example, a positive sentiment).


Chapter 5

Evaluation

In this chapter, we evaluate the proposed modifications to explanation methods and compare them against the originals. We first present the datasets and tasks in Section 5.1 and the used models and generators in Section 5.2. We then present the results of two types of experiments. In Section 5.3, we perform a distribution detection experiment that quantifies whether the perturbations created with a language model are more natural than the perturbations created with the original IME and LIME methods. In Section 5.4, we evaluate the explanations produced by the modified methods quantitatively and qualitatively and compare them to various baselines.

5.1 Used datasets

We use five text classification datasets in multiple languages, summarized in Table 5.1. In this section, we briefly describe them.

The Stanford Sentiment Treebank (SST) [25] contains English sentential movie review excerpts from the review aggregation website Rotten Tomatoes, annotated with their sentiment. In addition to sentence-level annotations, the dataset contains annotations for phrases in the sentences.


Table 5.1: An overview of the used datasets, including the task we use them for, the number of examples in the training, validation and test set, and the number of classes.

Dataset      Task                          #train    #val    #test   #classes
SST-2        sentiment classification       60614    6735      872          2
SNLI         natural language inference    549361    9842     9824          3
IMSyPP-sl    hate-speech classification     35635    3960     7943          2
SentiNews    sentiment classification       71999    9000     9000          3
XNLI5        natural language inference    392662   12450    25050          3

The annotations were obtained through crowdsourcing, by independently showing examples to three annotators. The annotators initially determined the sentiment on a 25-point scale, which the authors discretized into five classes (negative, somewhat negative, neutral, positive or somewhat positive) or into two classes (negative, positive), ignoring the neutral examples. In our experiments, we use a preprocessed version of the binary dataset (SST-2) which is provided in the General Language Understanding Evaluation (GLUE) benchmark [71]. In this version, the training set contains both sentential and phrasal examples, while the validation and test set only contain sentential examples. Because the true labels for the test set are not publicly available, we use the validation set for testing and set aside 10% of randomly selected training examples for the validation set.

The Stanford Natural Language Inference dataset (SNLI) [26] contains English sentence pairs annotated with the relation between them. For a
