
4.2 Explanation based on dependency structure

Our second novelty is a procedure to automatically construct explanations of textual units beyond single words, using the dependency structure of a text.

To compute an explanation based on units bigger than single words, multiple features need to be combined into disjoint feature groups. In other words, the individual features used originally are combined into superfeatures and treated as atomic features when creating perturbations. For example, one way of grouping words into bigger units is based on contiguous word bigrams, i.e. the instance “This show is very good” would be explained through the importance of the word groups “This show”, “is very” and “good”.
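For illustration, contiguous bigram grouping of a tokenized instance could look like the following minimal sketch (the helper name is ours, not part of any explanation method):

```python
def contiguous_bigram_groups(tokens):
    """Group a token sequence into disjoint, contiguous bigrams.

    A trailing odd token forms its own group, as in "This show" / "is very" / "good".
    """
    return [" ".join(tokens[i:i + 2]) for i in range(0, len(tokens), 2)]

# The instance "This show is very good" explained through three superfeatures:
print(contiguous_bigram_groups(["This", "show", "is", "very", "good"]))
# ['This show', 'is very', 'good']
```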

As hinted by the example, not all feature groupings are sensible. The number of ways to partition features into disjoint groups is prohibitively large¹ when dealing with more than a few features, so we cannot iterate through all the options. We propose an automatic bottom-up grouping procedure based on the dependency structure of the explained text instance. The dependency structure represents the syntax of a sentence through dependence relations among its words. For example, in the sentence “This show is very good”, the word “very” depends on “good”, emphasizing its meaning, and “This show” and “very good” are connected via the verb “is”. Compared to the constituency structure, which also expresses the syntax of a sentence, the dependency structure is better suited to morphologically rich languages [69]. Due to the Universal Dependencies project [70], dependency parsers are available for many more languages than constituency parsers.
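Such dependency trees can be obtained with a Universal Dependencies parser. The sketch below uses the Stanza library as one possible tooling choice (not necessarily the one used in this work) and assumes its English models can be downloaded:

```python
import stanza

stanza.download("en")  # fetch the English models on first use
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")

doc = nlp("This show is very good")
for sentence in doc.sentences:
    for word in sentence.words:
        # word.head is the 1-based index of the head word; 0 marks the root
        head = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
        print(f"{word.text} --{word.deprel}--> {head}")
# e.g. "very" --advmod--> "good"
```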

The procedure is illustrated in Figure 4.3 and described next. As input, the procedure receives a text instance x and its dependency structure, which is described with a tree for each sentence contained in the instance.

¹The number of possible partitions of a set with n elements is known as Bell's number Bn [68].

In the first step, it calculates the importance of individual words, which serves as the criterion for merging words. Merging the child words in a subtree with the root of the subtree is attempted in bottom-up, left-to-right order. To merge the child words with the parent word, the child words need to either have been previously merged into one group or be leaves of the tree. If the combined words are assigned an absolute importance that is greater than the absolute value of the sum of their individual importances, the words are merged; otherwise, the merge is discarded. For example, in Figure 4.3, the words “this” and “show” are not merged because their joint absolute importance (0.13) is not greater than the absolute value of the sum of their individual importances (|0.03 + 0.10|). The idea is to merge only those words which become more important jointly, due to the newly captured interaction. The procedure terminates once there are no more valid merge options because a deeper subtree was not merged, or when it arrives at the root of the (last) sentence. The procedure does not consider merging the root word of the sentence with its immediate children and terminates right before this step.

For example, in Figure 4.3, the procedure does not consider merging “This show”, “is” and “very good” even if the other conditions are met. Such a merge would produce an explanation in terms of sentences, for which a more efficient approach can be used. In addition, in our experiments some tasks involve single-sentence instances, for which a sentence explanation is not useful.
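A minimal sketch of this merging rule is given below. The data structures and function names are ours, and `importance_of` stands in for the underlying explanation method, which in the actual procedure assigns importance to a candidate group treated as a single superfeature:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Node:
    """One word in the dependency tree; `position` is its index in the sentence."""
    position: int
    word: str
    importance: float
    children: List["Node"] = field(default_factory=list)
    covered: List[Tuple[int, str]] = field(default_factory=list)  # words covered after merging

    def __post_init__(self):
        if not self.covered:
            self.covered = [(self.position, self.word)]

def merge_subtrees(node: Node,
                   importance_of: Callable[[List[str]], float],
                   is_sentence_root: bool = True) -> Node:
    """Bottom-up, left-to-right merging guided by the dependency tree.

    Children of the sentence root are never merged with it (see text).
    """
    for child in node.children:
        merge_subtrees(child, importance_of, is_sentence_root=False)

    # A parent may absorb its children only if every child is a leaf or has
    # already been merged into a single group (i.e. has no children left).
    if is_sentence_root or not node.children or any(c.children for c in node.children):
        return node

    covered = sorted([p for c in node.children for p in c.covered] + node.covered)
    joint = importance_of([w for _, w in covered])
    individual = node.importance + sum(c.importance for c in node.children)
    if abs(joint) > abs(individual):  # merge only if the interaction adds importance
        node.covered, node.importance, node.children = covered, joint, []
    return node
```

In the sketch, a failed merge deeper in the tree automatically blocks merging higher up, because the unmerged child still has children of its own.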


[Figure 4.3: dependency tree of “This show is very good”, the initial word importances (0.03, 0.10, 0.08, 0.15, 0.35), and the importances computed for merged units (0.13, 0.08, 0.18, 0.38).]

Figure 4.3: Automatic creation of explanations based on larger text units.

As input, the instance to be explained and its dependency structure are provided. The procedure first computes the initial importance of the words (step 1) and then attempts to merge words in a bottom-up, left-to-right manner according to the dependency structure (steps 2 and 3). In the figure, the importance of words is shown above them and through their color: darker green indicates that a text unit is more important for the current prediction (in this example, a positive sentiment).

Chapter 5

Evaluation

In this chapter, we evaluate the proposed modifications to explanation methods and compare them against the originals. We first present the datasets and tasks in Section 5.1 and the models and generators used in Section 5.2.

We then present the results of two types of experiments. In Section 5.3, we perform a distribution detection experiment that quantifies whether the perturbations created with a language model are more natural than the perturbations created with the original IME and LIME methods. In Section 5.4, we evaluate the explanations produced by the modified methods quantitatively and qualitatively and compare them to various baselines.

5.1 Used datasets

We use five text classification datasets in multiple languages, summarized in Table 5.1. In this section, we briefly describe them.

The Stanford Sentiment Treebank (SST) [25] contains English sentential movie review excerpts from the review aggregation website Rotten Tomatoes, annotated with their sentiment. In addition to sentence-level annotations, the dataset contains annotations for phrases in the sentences. The annotations were obtained through crowdsourcing, by independently showing examples to three annotators.

Table 5.1: An overview of the used datasets, including the task we use them for, the number of examples in the training, validation and test set, and the number of classes.

Dataset     Task                          #train    #val    #test   #classes
SST-2       sentiment classification       60614    6735      872          2
SNLI        natural language inference    549361    9842     9824          3
IMSyPP-sl   hate-speech classification     35635    3960     7943          2
SentiNews   sentiment classification       71999    9000     9000          3
XNLI5       natural language inference    392662   12450    25050          3

The annotators initially determined the sentiment on a 25-point scale, which the authors discretized into five classes (negative, somewhat negative, neutral, somewhat positive or positive) or into two classes (negative, positive), ignoring the neutral examples. In our experiments, we use a preprocessed version of the binary dataset (SST-2) which is provided in the General Language Understanding Evaluation (GLUE) benchmark [71]. In this version, the training set contains both sentential and phrasal examples, while the validation and test sets contain only sentential examples. Because the true labels of the test set are not publicly available, we use the validation set for testing and set aside 10% of randomly selected training examples as a validation set.
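As a sketch of this split, assuming the GLUE copy of SST-2 is loaded with the HuggingFace `datasets` library (the loader and random seed below are illustrative):

```python
from datasets import load_dataset

# SST-2 as distributed in the GLUE benchmark (HuggingFace `datasets` copy).
sst2 = load_dataset("glue", "sst2")

# Hold out 10% of the training examples as our validation set; the official
# validation set serves as our test set because the test labels are not public.
split = sst2["train"].train_test_split(test_size=0.1, seed=42)
train_set, val_set = split["train"], split["test"]
test_set = sst2["validation"]
```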

The Stanford Natural Language Inference dataset (SNLI) [26] contains English sentence pairs annotated with the relation between them. For a sentence pair (s1, s2), the goal is to predict whether s1 (the premise) entails s2 (the hypothesis), contradicts it, or is neutral with regard to it. The examples were constructed by taking premises from an existing image captioning dataset (Flickr30k [72]) and, for each premise, asking crowd workers to construct one hypothesis that is true given the premise, one that may or may not be true, and one that is not true, i.e. an entailment, neutral and contradiction example, respectively. Each example was then independently annotated by five annotators, and the true label was set to the label determined by at least three annotators (or “-” if a three-annotator consensus was not reached). In our experiments, we use the dataset split provided by the authors, ignoring the examples for which consensus was not reached.
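The label aggregation rule can be written down compactly; the sketch below assumes the five per-annotator labels are available as a list, as in the `annotator_labels` field of the SNLI distribution:

```python
from collections import Counter

def gold_label(annotator_labels):
    """Majority label of the five annotations, or "-" if fewer than three agree."""
    label, count = Counter(annotator_labels).most_common(1)[0]
    return label if count >= 3 else "-"

print(gold_label(["entailment", "entailment", "neutral", "entailment", "contradiction"]))
# entailment
print(gold_label(["entailment", "entailment", "neutral", "neutral", "contradiction"]))
# "-" (no consensus; such examples are dropped in our experiments)
```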

IMSyPP-sl [27] contains a sample of Slovenian tweets written between December 2017 and February 2020, annotated as appropriate, inappropriate, offensive or violent. Each tweet was annotated twice: in 90% of cases by two different annotators and in 10% by a single annotator. In our experiments, we only keep the tweets for which the two annotations agree and group the inappropriate, offensive and violent tweets into a common “hateful” category. We use the training-test split provided by the authors and set aside 10% of randomly selected training examples as a validation set.
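Our reading of this preprocessing step, as a sketch (the label strings are illustrative, and we take “annotations agree” to mean agreement on the original four-way labels):

```python
HATEFUL_SUBTYPES = {"inappropriate", "offensive", "violent"}

def binary_label(annotation_1, annotation_2):
    """Return the binary label for a tweet, or None if the example is dropped.

    Tweets whose two annotations disagree are discarded; the three hateful
    subtypes are merged into a single "hateful" class.
    """
    if annotation_1 != annotation_2:
        return None
    return "hateful" if annotation_1 in HATEFUL_SUBTYPES else "appropriate"

print(binary_label("offensive", "offensive"))    # hateful
print(binary_label("appropriate", "offensive"))  # None (dropped)
```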

SentiNews [28] contains a sample of Slovenian news articles published between September 2007 and December 2013, spread approximately evenly across five news portals. The annotation was performed at three levels: the sentence, paragraph and document level. Each example was annotated independently by two to six annotators using a five-point Likert scale [73] (very negative, negative, neutral, positive and very positive). The final sentiment was determined by averaging: an example is negative if its average sentiment is lower than or equal to 2.4, neutral if it lies strictly between 2.4 and 3.6, and positive if it is greater than or equal to 3.6. We use the dataset at the paragraph level and the dataset splits provided by the authors.
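The thresholds above map directly to a small helper (a sketch of the averaging rule as stated in the text; the function name is ours):

```python
def sentinews_label(annotations):
    """Map 1-5 Likert annotations to the final three-way sentiment by averaging."""
    avg = sum(annotations) / len(annotations)
    if avg <= 2.4:
        return "negative"
    if avg < 3.6:
        return "neutral"
    return "positive"

print(sentinews_label([2, 2, 3]))  # average 2.33 -> negative
print(sentinews_label([3, 3, 4]))  # average 3.33 -> neutral
print(sentinews_label([4, 5, 4]))  # average 4.33 -> positive
```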

The Cross-lingual Natural Language Inference dataset (XNLI) [29] contains multilingual sentence pairs across 15 languages, annotated with the relation between them. The goal is to predict the relation between the sentences as in SNLI, but in a cross-lingual setting: concretely, to recognize the relations in different languages while learning only from relations in English. The dataset provides validation and test sets, while reusing the training set of the Multi-Genre Natural Language Inference dataset (MNLI) [74], which is similar to SNLI but contains premises from 10 different genres instead of a single one. The English premises in XNLI were first obtained using the same procedure as in MNLI, i.e. gathered from 10 sources, combined with hypotheses written by crowd workers, and annotated.

Then, the examples were human-translated from English into 14 languages, producing the final dataset. In our experiments, we use the dataset split provided by the authors, but use only the examples in five languages, chosen from different language groups in order to reduce the run time of the experiments while maintaining the diversity of the data. As a result, we refer to the dataset, which contains examples in English (a Germanic language), French (a Romance language), Urdu (an Indic language), Turkish (a Turkic language) and Russian (a Slavic language), as XNLI5.
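Restricting the data to these five languages then amounts to a simple filter; in the sketch below, the `language` field name and the per-example dictionaries are assumptions about how the data is represented, and the two-letter ISO codes identify the chosen languages:

```python
# The five selected languages, one per language group (two-letter ISO codes).
XNLI5_LANGUAGES = {"en", "fr", "ur", "tr", "ru"}

def in_xnli5(example):
    """Keep only examples written in one of the five selected languages."""
    return example["language"] in XNLI5_LANGUAGES

examples = [
    {"language": "en", "premise": "A man is eating.", "hypothesis": "Someone eats."},
    {"language": "de", "premise": "Ein Mann isst.", "hypothesis": "Jemand isst."},
]
print([ex["language"] for ex in examples if in_xnli5(ex)])  # ['en']
```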