
it significantly.

We initialize the models and generators using a common pre-trained checkpoint from the Hugging Face model hub¹, which ensures that the vocabularies of the classifier and the generator are aligned and removes the need for encoding conversion. We use the “bert-base-uncased” [30] pre-trained model for the English tasks, CroSloEngual BERT [75] for the Slovenian tasks and “xlm-roberta-base” [31] for the cross-lingual task.
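As a minimal sketch of this setup (using the Hugging Face transformers library), the classifier and the generator can be loaded from the same checkpoint; the variable names are illustrative and the snippet is not taken from the thesis code.

    from transformers import (
        AutoModelForMaskedLM,
        AutoModelForSequenceClassification,
        AutoTokenizer,
    )

    # Both models start from the same pre-trained checkpoint, so they share
    # one tokenizer and vocabulary and no encoding conversion is needed.
    checkpoint = "bert-base-uncased"  # CroSloEngual BERT or "xlm-roberta-base" for the other tasks

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    classifier = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    generator = AutoModelForMaskedLM.from_pretrained(checkpoint)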

We use the generators in three settings: untuned MLM, tuned MLM and tuned CMLM. To tune a CMLM, we treat the control signal (i.e. the target label) as another word and prepend it to the texts. Afterwards, the model is tuned in the same way as an MLM.
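As an illustration of the CMLM input construction, the label can simply be prepended to the text before tokenization; the helper name and the label string below are our own and purely illustrative.

    from transformers import AutoTokenizer

    def build_cmlm_input(text: str, label: str, tokenizer):
        # The control signal (target label) is treated as just another word
        # prepended to the text; the rest of the MLM tuning pipeline is unchanged.
        return tokenizer(f"{label} {text}", truncation=True, return_tensors="pt")

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoded = build_cmlm_input("This show is very good.", "positive", tokenizer)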

To tune the models, we follow the standard model learning procedure. We tune the models using the training set, stopping the procedure when the validation metric does not improve for 5 consecutive evaluation steps in order to avoid overfitting. For classifiers, the validation metric is classification accuracy; for the generators, it is the loss value.
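The stopping criterion can be sketched as follows; train_step and validate stand in for the task-specific training and validation routines and are assumptions made for illustration, not code from the thesis.

    from typing import Callable

    def tune_with_early_stopping(
        train_step: Callable[[], None],
        validate: Callable[[], float],
        patience: int = 5,
        max_steps: int = 100_000,
        lower_is_better: bool = True,  # True for generator loss, False for classifier accuracy
    ) -> float:
        """Stop tuning once the validation metric fails to improve for `patience` consecutive steps."""
        best = float("inf") if lower_is_better else float("-inf")
        steps_without_improvement = 0
        for _ in range(max_steps):
            train_step()
            metric = validate()
            improved = metric < best if lower_is_better else metric > best
            if improved:
                best, steps_without_improvement = metric, 0
            else:
                steps_without_improvement += 1
                if steps_without_improvement >= patience:
                    break
        return best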

5.3 Quality of perturbations

In our first experiment, we quantitatively evaluate whether the perturbations generated in the modified explanation methods (described in Section 4.1) are more natural than those generated in the original IME and LIME.

We formalize the naturalness of perturbations as the extent of the shift in the embedding distribution caused by the process of creating perturbations, as proposed by Rychener et al. [76]. In other words, we evaluate whether the perturbations come from a distribution that is indistinguishable from the natural (empirical) distribution. Given a set of in-distribution samples and a set of perturbations, the distributional shift is measured by the accuracy with which a complex model can distinguish between the samples. A high accuracy indicates that the model can easily distinguish the original samples from their perturbations, while a low accuracy indicates the samples are difficult to distinguish.

¹ https://huggingface.co/models

5.3.1 Experimental settings

Because the training and development sets are used to train the classifier and the generator, we use the test set in this experiment to measure the distributional shift on data unseen by the models. Consequently, we consider these results in isolation and do not use them for any model selection in further experiments to prevent overfitting the test set. In the remainder of this section, we refer to the test set as the dataset in order to avoid confusion with the test set used in the evaluation of the distribution detection model.

We divide the dataset D randomly into two halves and use one to train and evaluate a distribution detection model and one as a reference that indicates the classification accuracy achieved by the distribution detection model when the examples come from the same empirical distribution (i.e. the ideal case). Using the original and modified IME and LIME explanation methods, we randomly create one perturbation for each example in the first randomly selected half of the dataset, perturbing examples on the subword level. In methods using a controlled generator, we create a perturbation using a control label selected uniformly at random. To create perturbations in the original IME, we use the training subset for sampling everywhere except for XNLI5, for which we use the subset of the validation set that contains examples in the same language as the input instance. In total, we obtain |D|/2 examples from the empirical distribution and |D|/2 examples from the distribution of perturbations (as defined by the explanation methods).

Next, we divide the examples randomly into a training, validation and test set in the 80%:10%:10% ratio and embed the examples using a classifier trained for the specific task. To do so, we use the features which are used as the input to the final classification layer in the BERT or XLM-R classifier. Using the training set, we train a random forest [77] distribution detection model with the number of trees that achieves the best classification accuracy on the validation set². We report the results using the classification accuracy on the test set. Because we only construct one perturbation per example, we repeat each experiment five times and report the mean accuracy and its standard deviation.

² Note that an undertrained or a random model could also achieve an ideal accuracy (0.5), so it is essential that we perform this modeling step thoroughly.
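As an illustrative sketch of this detection step, assuming a BERT-style classifier whose pooled output feeds the final classification layer and an arbitrary grid of forest sizes (both assumptions are ours, not stated in the text), the procedure could look as follows.

    import torch
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    checkpoint = "bert-base-uncased"  # placeholder for the task-specific fine-tuned classifier
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    classifier = AutoModelForSequenceClassification.from_pretrained(checkpoint)
    classifier.eval()

    def embed(texts):
        """Return the features used as input to the final classification layer."""
        enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = classifier.bert(**enc)  # backbone only (".roberta" for XLM-R models)
        return outputs.pooler_output.numpy()

    def fit_detection_model(X_train, y_train, X_val, y_val, tree_grid=(50, 100, 250, 500)):
        """Pick the number of trees that achieves the best accuracy on the validation set."""
        best_model, best_acc = None, -1.0
        for n_trees in tree_grid:
            model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
            model.fit(X_train, y_train)
            acc = accuracy_score(y_val, model.predict(X_val))
            if acc > best_acc:
                best_model, best_acc = model, acc
        return best_model

    # Labels: 0 = example from the empirical distribution, 1 = perturbation.
    # detector = fit_detection_model(embed(train_texts), y_train, embed(val_texts), y_val)
    # test_accuracy = accuracy_score(y_test, detector.predict(embed(test_texts)))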

As per the current definition, the ideal accuracy in the experiment could be achieved by a copying mechanism or an overfitted generator. As we are only interested in practically useful solutions, we impose a uniqueness constraint (as described in Section 4.1.1 for the modified LIME) on all explanation methods that use a generator.
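One simple way to realize such a constraint when sampling a replacement from the generator is to mask out the original token before sampling; the function below is an illustrative sketch under that assumption, not necessarily the exact implementation described in Section 4.1.1.

    import torch

    def sample_unique_replacement(logits: torch.Tensor, original_token_id: int) -> int:
        """Sample a replacement token while forbidding the original (trivial copy) token.

        `logits` is the generator's vocabulary distribution at one masked position.
        """
        constrained = logits.clone()
        constrained[original_token_id] = float("-inf")  # the original word gets zero probability
        probs = torch.softmax(constrained, dim=-1)
        return int(torch.multinomial(probs, num_samples=1))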

5.3.2 Results

Table 5.2 shows the classification accuracy of the best distribution detection model for different strategies of creating perturbations. Due to the imposed uniqueness constraint, the modified LIME (LIME+LM) and IME with an internal LM (IME+iLM) methods create perturbations in an identical way, so we consider them as one method.

Because the distribution detection datasets are balanced, the ideal classification accuracy (reference) is around 0.5 for all datasets. The model achieves high (i.e. the worst) accuracy for both the original IME and LIME perturbations, indicating they are easily distinguishable from natural text. Unsurprisingly, the model consistently achieves higher accuracy for LIME perturbations, as LIME replaces words with a replacement word that is not natural and is easily detectable. In contrast, IME uses replacement words drawn from a sampling dataset, which can sometimes create natural perturbations. An example of this can be seen in the top row of Figure 4.2, where IME replaces “good” in “This show is very good” with a valid alternative word “long”. However, the next row shows an example where IME creates a completely incomprehensible perturbation, which explains the high accuracy for IME.

Table 5.2: Average accuracy and its standard deviation in the distribution detection experiment. A lower accuracy indicates better perturbations, as they are harder to distinguish from the empirical distribution. “Reference” marks the accuracy of the distribution detection model trained to distinguish between two samples from the empirical distribution and indicates the ideal (lowest achievable) accuracy. (u) indicates that the used generator is untuned (i.e. only pre-trained).

Method       SST-2    SNLI    IMSyPP-sl    SentiNews    XNLI5
Reference    0.500    …       …            …            …


For the modified explanation methods (in the bottom six rows of Table 5.2), the distribution detection model achieves a lower classification accuracy in all settings and on all datasets, indicating the perturbations are harder to distinguish from examples in the empirical distribution. In general, the perturbations created by the modified LIME and IME with an internal LM achieve an accuracy that is closest to the ideal one. This is because the generators used in these methods have information about the fixed words in the current perturbation. In contrast, IME with an external LM generates the sampling dataset with variations of the input instance in advance, which is later used to create the perturbations. Because of this, the creation of not fully comprehensible perturbations is still possible. An example of such a mistake can be seen in the first perturbation created by IME with an external LM in Figure 4.2, where the word “good” in “This show is very good” is replaced with “salty”, creating an example that is syntactically valid but not meaningful. Nonetheless, the number of such examples is reduced in comparison to IME, because the sampling dataset only contains examples that are similar to the input instance, while the sampling dataset in the original IME contains very different examples, sampled from the training set.

When using tuned generators, we would expect the classification accuracy to drop significantly compared to untuned generators, because the tuning should help create perturbations that are more similar to examples in the domain. However, for many settings, we observe only a small decrease in the accuracy or even an increase. We hypothesize this is a consequence of the uniqueness constraint which we impose to prevent a trivial solution from achieving the ideal accuracy. The generators are tuned to predict a high probability for the words appearing in the domain. If the generators are tuned well, they should generalize to the test set, meaning they will predict a high probability for the word that originally appears in the input instance. By applying the constraint, we prevent the generator from doing so and reduce the effect of the tuning process in this experiment. The methods using controlled generators suffer less from this, likely due to the randomly selected control signal. For example, when generating a replacement word in the text “This movie is very [MASK]” with the original input being “This movie is very good”, a tuned generator might try to predict “good”, which would not be allowed, while a controlled one might predict “good” or “bad”, depending on the desired sentiment.