
University of Ljubljana

Faculty of Computer and Information Science

Sanja Stojanoska

Cross-lingual transfer of POS tagger into a low-resource language

MASTER’S THESIS

THE 2nd CYCLE MASTER’S STUDY PROGRAMME COMPUTER AND INFORMATION SCIENCE

Supervisor: prof. dr. Marko Robnik Šikonja
Co-supervisor: doc. dr. Nikola Ljubešić

Ljubljana, 2021


Univerza v Ljubljani

Fakulteta za računalništvo in informatiko

Sanja Stojanoska

Medjezikovni prenos

oblikoskladenjskega označevalnika v jezik z malo viri

MAGISTRSKO DELO

MAGISTRSKI ŠTUDIJSKI PROGRAM DRUGE STOPNJE RAČUNALNIŠTVO IN INFORMATIKA

Mentor: prof. dr. Marko Robnik Šikonja
Somentor: doc. dr. Nikola Ljubešić

Ljubljana, 2021


Copyright. This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

©2021 Sanja Stojanoska


Acknowledgments

I would like to express my gratitude to my supervisors for their help and guidance. Thank you for being very responsive during the writing of this thesis. I would also like to thank my family, who supported me during my studies.

Sanja Stojanoska, 2021


Contents

Abstract
Povzetek
Razširjeni povzetek
I Kratek pregled sorodnih del
II Predlagana metoda
III Evaluacija
IV Sklep
1 Introduction
2 Related work
3 Datasets
3.1 Universal Dependencies
3.2 Croatian
3.3 Bulgarian
3.4 Serbian
3.5 Macedonian
4 Text processing overview
4.1 Word embeddings
4.2 Neural networks
4.3 Transformer architecture
4.4 BERT
5 Cross-lingual POS tagging
5.1 Methodology
5.2 Models
5.3 Target task
5.4 Evaluation
6 Results
6.1 Serbian language
6.2 Bulgarian language
6.3 Croatian language
6.4 Source languages combined
6.5 Ensemble model approach
7 Conclusion
7.1 Limitations and future work
A Complete results
A.1 Error analysis on zero-shot transfer
A.2 Additional results


List of used acronyms

acronym   meaning
NLP       natural language processing
UD        universal dependencies
MLM       masked language modeling
ELMo      Embeddings from Language Models
LSTM      long short-term memory
RNN       recurrent neural network
UPOS      universal part-of-speech
BERT      bidirectional encoder representations from transformers
GloVe     Global Vectors for Word Representation
OOV       out-of-vocabulary
CA        classification accuracy
LAT       Latin script
CYR       Cyrillic script


Abstract

Title: Cross-lingual transfer of POS tagger into a low-resource language

With the continuous growth of online textual content, machine learning is the only feasible approach for implementing advanced systems for language processing. Although many natural language processing (NLP) applications exist, most of them are anglocentric, and low-resource languages are left behind. We apply a cross-lingual transfer approach from several languages to overcome this limitation. Part-of-speech (POS) tagging, a fundamental text processing task, is a prerequisite for a variety of NLP problems. To implement a POS tagger for the low-resource Macedonian language, we use pretrained multilingual models along with annotated data in Serbian, Croatian and Bulgarian. We show that multilingual models fine-tuned with a set of languages similar to the target language achieve good performance on the POS tagging task.

Keywords

cross-lingual transfer, part-of-speech tagging, multilingual language model, low-resource language, Macedonian language


Povzetek

Naslov: Medjezikovni prenos oblikoskladenjskega označevalnika v jezik z malo viri

Zaradi nenehne rasti količine spletnih besedil je strojno učenje edini izvedljiv pristop za izvajanje naprednih jezikovnih obdelav. Čeprav obstajajo številne aplikacije za obdelavo naravnega jezika, je večina anglocentričnih in jeziki z malo viri so zanemarjeni. V tem delu uporabljamo medjezikovni prenos iz več jezikov v jezik z malo viri. Oblikoskladenjski označevalnik je ena od temeljnih nalog obdelave besedil in je predpogoj za različne jezikovne naloge. Za implementacijo oblikoskladenjskega označevalnika za makedonski jezik, ki ima na voljo le malo virov, uporabljamo večjezikovne modele in označene podatke iz srbskega, hrvaškega in bolgarskega jezika. Pokazali smo, da večjezikovni modeli, prilagojeni z jeziki, podobnimi ciljnemu jeziku, dosegajo dobre rezultate pri oblikoskladenjskem označevanju v makedonščini.

Ključne besede

medjezikovni prenos, oblikoskladenjski označevalnik, večjezikovni model, jezik z malo viri, makedonski jezik


Razširjeni povzetek

Oblikoskladenjski označevalnik preslika besede v njihove slovnične kategorije. Problem je netrivialen, saj ima lahko ena beseda različne oznake, če se uporablja v različnih kontekstih. Označevanje univerzalnih besednih vrst pomaga razločiti nejasnosti in je predpogoj za številne naloge obdelave naravnega jezika (angl. natural language processing, NLP). Dandanes je zaradi povečevanja količine digitalnih vsebin strojno učenje edini uporaben pristop za reševanje te naloge. Trenutne NLP aplikacije temeljijo na globokem učenju. Za učenje globoke nevronske mreže je potrebno veliko označenih podatkov, kar je ovira za jezike z malo viri. Da premagamo to omejitev, izvajamo medjezikovni prenos iz podobnih jezikov.

I Kratek pregled sorodnih del

Nedavni pristopi za reševanje različnih jezikovnih nalog se opirajo na znanje, kodirano v velikih predhodno naučenih (večjezikovnih) modelih. Za jezike z malo viri so večjezikovni modeli, ki uporabljajo arhitekturo transformerjev [21], zelo uspešen pristop k reševanju različnih NLP nalog. Ti modeli so predhodno naučeni z veliko količino podatkov v različnih jezikih, zaradi česar so primerni za medjezikovni prenos. Najbolj priljubljena večjezikovna modela sta multilingual BERT (mBERT) [7] in XLM-RoBERTa [6].

Wu et al. [23] so pokazali, da leksikalno prekrivanje izboljša medjezikovni prenos v modelih BERT in da se jezikovni podatki ohranijo v vseh plasteh.

Tenney et al. [18] so pokazali, da se semantika pojavi v končnih plasteh, medtem ko se osnovne sintaktične informacije pojavijo v prejšnjih plasteh.

Tsai et al. [19] so pokazali, da je oblikoskladenjsko označevanje z uporabo večjezikovnih modelov učinkovito tudi pri jezikih z malo viri.

Ulčar in Robnik-Šikonja [20] sta predstavila dva velika vnaprej naučena jezikovna modela na osnovi modela BERT, trojezična FinEst BERT in CroSloEngual BERT. Pokazala sta, da v učnih jezikih ti modeli delujejo bolje kot večjezikovni mBERT pri številnih nalogah, vključno z označevanjem UPOS.

Aepli et al. [1] predlagajo pristop, ki temelji na izkoriščanju sorodnih jezikov za izboljšanje virov v jeziku z malo viri. Učna množica v makedonskem jeziku je razširjena z razpoložljivimi podatki iz več podobnih jezikov: srbskega, bolgarskega, slovenskega in češkega iz slovanske družine ter angleščine. Oznake iz drugih jezikov se uporabljajo za določanje oblikoskladenjskih oznak v ciljnem jeziku. Njihov označevalnik dosega 88% točnost.

II Predlagana metoda

Cilj naloge je razviti pristop za označevanje UPOS v makedonskem jeziku. Da bi to dosegli, uporabljamo modele za medjezikovni prenos, natančneje model mBERT, učen na več kot sto jezikih, in CroSloEngual BERT, učen samo na hrvaških, slovenskih in angleških besedilih. Poleg tega smo vključili še dva modela: XLM-RoBERTa, učen na 100 jezikih, in BERTić, učen na hrvaškem, srbskem, bosanskem in črnogorskem jeziku.

Naštete modele smo prilagodili z označenimi podatki iz več jezikov. Uporabljamo označene podatke projekta Universal Dependencies (UD) iz treh slovanskih jezikov: hrvaščine, srbščine in bolgarščine. Podatki iz UD so razdeljeni v tri podatkovne množice: učno, validacijsko in testno. Množice iz vseh treh jezikov smo združili in ustvarili večje množice za medjezikovni pristop.

V analizi smo preučili vpliv pisave na prenos znanja, leksikalno prekrivanje in vpliv jezikovnih kombinacij ter korelacije med izvornimi in ciljnimi jeziki.


Poleg prenosa znanja iz enega samega izvornega jezika smo kombinirali več jezikov. Kombinirani pristop ima dve različici: skupno množico iz izvornih jezikov in učenje modelov vsako epoho z drugim jezikom. Preizkusili smo tudi ansambel, ki določa oznako z večino glasov več večjezikovnih modelov.

Za testirane modele poročamo o njihovi klasifikacijski točnosti in oceni F1.

III Evaluacija

V naši primerjavi bomo mBERT in XLM-RoBERTa imenovali mnogojezikovna modela, CroSloEngual BERT in BERTić pa manjjezikovna modela, glede na število jezikov, na katerih so bili modeli naučeni.

Klasifikator UPOS, naučen samo s srbskimi podatki, doseže oceno F1 in klasifikacijsko točnost več kot 86%, če je klasifikator zgrajen z uporabo mnogojezikovnih modelov. To pomeni, da se model uči strukture UPOS iz srbskih zaporedij, ki jih je mogoče uporabiti v makedonskem jeziku.

Pri uporabi mnogojezikovnih modelov evaluacija klasifikatorja na makedonskih podatkih v cirilici doseže višje ocene v primerjavi s podatki v latinici, zato se mnogojezikovni modeli bolje obnesejo pri uporabi podatkov v izvorni pisavi. Po drugi strani pa se manjjezikovni modeli uspešneje učijo na podatkih v latinici.

Prilagoditev mnogojezikovnih modelov z 10 % podatkov iz učne množice v ciljnem jeziku vodi do približno 10 % izboljšave ocene F1 in klasifikacijske točnosti v primerjavi s prilagajanjem modela samo s podatki v izvornem jeziku. V tem primeru XLM-RoBERTa dosega boljše rezultate kot mBERT. Prilagajanje obeh manjjezikovnih modelov z enako količino vzorcev iz ciljnega jezika znatno izboljša rezultate, kar je lahko posledica leksikalnega prekrivanja. Z dodatnim povečanjem dodanih podatkov v ciljnem jeziku modeli dosegajo boljše rezultate. Najuspešnejši je XLM-RoBERTa z oceno F1 0.977 in klasifikacijsko točnostjo 0.981.

Medjezikovni prenos iz bolgarščine sledi podobnemu vzorcu kot prenos iz srbskega jezika. CroSloEngual BERT in BERTić predhodno nista bila učena na bolgarskem jeziku. Pri prilagajanju manjjezikovnih modelov samo s podatki v izvornem jeziku BERTić dosega boljše rezultate, kar je lahko posledica predhodnega učenja na več slovanskih jezikih. V primerjavi s prilagajanjem manjjezikovnih modelov samo z uporabo srbščine daje uporaba bolgarščine boljše rezultate, kar kaže, da ima leksikalno prekrivanje pomemben vpliv na medjezikovni prenos.

Hrvaška podatkovna množica je podobna srbski, zato imajo vsi modeli, prilagojeni s tema dvema jezikoma, primerljive rezultate. Med mnogojezikovnimi modeli ima mBERT klasifikacijsko točnost 0.975 in 0.979, medtem ko ima XLM-RoBERTa točnost 0.974 in 0.981 za prilagajanje z 10% in 30% dodatnih podatkov iz učne množice v ciljnem jeziku. Oba manjjezikovna modela dosegata boljše rezultate s povečanjem dodatnih podatkov, čeprav so rezultati nižji v primerjavi z rezultati mnogojezikovnih modelov.

Pri prilagoditvi modelov s skupno hrvaško-bolgarsko množico opažamo, da imajo modeli, učeni s tem jezikovnim parom, skoraj enake, vendar ne višje rezultate v primerjavi z modeli, učenimi z obema jezikoma ločeno. Razlog, zakaj dosežene ocene niso še višje, je lahko v nasprotujočih si oznakah, ki v posameznih jezikih pripadajo različnim kategorijam UPOS.

Druga varianta pristopa s kombiniranjem jezikov je učenje večjezikovnih modelov vsako epoho z različnim izvornim jezikom, ki dosega nižje rezultate v primerjavi z učenjem s hrvaško-bolgarsko množico. Opazimo lahko, da pri uporabi kombiniranega pristopa ni bistvenega izboljšanja v primerjavi z učenjem samo z enim izvornim jezikom.

Za implementacijo ansambla uporabljamo napovedi iz modelov, učenih s 30 % dodanih podatkov iz učne množice v ciljnem jeziku, saj so predhodno dosegli najvišje ocene. Ansambel modelov ustvarimo iz napovedi za vsakega od izvornih jezikov. Ugotavljamo, da je ansambel modelov najuspešnejši pristop med vsemi, če vključuje napovedi iz mnogojezikovnih modelov. Najboljše rezultate dosežemo z uporabo napovedi dveh modelov XLM-RoBERTa, prilagojenih s srbskimi in hrvaškimi podatki, skupaj z modelom mBERT, prilagojenim z bolgarskimi podatki.


IV Sklep

Čeprav je leksikalno prekrivanje med bolgarskimi in makedonskimi podatki bistveno večje kot pri drugih parih, dajejo vsi izvorni jeziki primerljive rezultate pri prilagajanju modelov. To kaže, da so se večjezikovni modeli naučili globlje jezikovne strukture in se ne zanašajo le na leksikalno prekrivanje. Od vseh poskusov z enim izvornim jezikom je najboljši pristop za medjezikovni prenos uporaba hrvaščine za prilagajanje modela XLM-RoBERTa s 30% dodanih podatkov v ciljnem jeziku; srbščina daje skoraj enake rezultate.

Najboljše rezultate pri prilagajanju modelov z 10% dodanih podatkov v ciljnem jeziku dosega mBERT s hrvaščino kot izvornim jezikom. V celoti gledano je najuspešnejša metoda ansambel večjezikovnih modelov. Izkaže se, da tudi majhna količina podatkov v ciljnem jeziku zadošča, da ansambel modelov doseže visoko klasifikacijsko točnost, vendar je ta pristop bolj zahteven. Za implementacijo ansambla je treba vsak model prilagoditi posebej in nato združiti napovedi. V primeru označevanja UPOS v realnem času ta pristop zahteva tudi več pomnilnika. Sklepamo, da medjezikovni pristop z uporabo ansambla premaga pomanjkanje označenih podatkov, vendar prinaša dodatno kompleksnost.


Chapter 1

Introduction

Learning a language includes acquaintance with its grammatical structure.

The part-of-speech (POS) tagging problem maps words to their grammatical categories. The problem is non-trivial: a single word may have different part-of-speech tags when used in different contexts.

POS tags are useful because they provide linguistic information on how a word is being used within the scope of a phrase, sentence, or document.

They help reduce ambiguity since some words belong to more than one part-of-speech category.

POS tagging is a prerequisite for many NLP problems. Text-to-speech systems perform POS tagging because a word might have different meanings and pronunciations. Word-sense disambiguation, the task of identifying the meaning of a word in a given sentence, also uses POS tagging. Several other applications require POS tagging in the preprocessing steps.

In large corpora, manual POS tagging is not feasible. Moreover, new open-class words appear all the time due to borrowing from other languages and new word formations. The only viable approach to POS tagging relies on machine learning.

The traditional language preprocessing pipeline consists of a series of tasks, from tokenizing raw text to syntactic analysis of sentences. Advanced models are built upon the initial models. Therefore, to process a low-resource language, we have to first construct entry-level models. A common strategy to overcome the lack of data in such languages is to use existing resources in other languages and transfer them to a new language.

Recent approaches to natural language processing use large language models pretrained on many languages. These models provide multilingual context which can help to develop solutions in a low-resource setting. We propose a cross-lingual transfer approach for solving the POS tagging task in a less-resourced language. In this thesis, we fine-tune pretrained multilingual models with a set of similar languages as a source. As the target language, we use Macedonian. By examining transliteration, correlations between languages, and joint-language performance, we aim to find the best transfer setting. We show that using data from similar languages successfully compensates for the lack of resources in the target language.

This thesis is organized as follows: In Chapter 2, we overview related work on POS tagging. In Chapter 3, we present the used datasets, which cover languages similar to Macedonian, namely Bulgarian, Croatian and Serbian. In Chapter 4, we give an overview of text processing methods. In Chapter 5, we present our transfer-learning approach and its implementation.

In Chapter 6, we present and discuss the results. We conclude with Chapter 7 where we summarise our work and discuss possible improvements.


Chapter 2

Related work

With large corpora, there is a need for an automated POS tagger to solve many natural language processing (NLP) tasks. Recent approaches rely on knowledge encoded in large pretrained (multilingual) models to tackle a diverse set of problems.

For low-resource languages, multilingual models, utilizing the transformer architecture [21], are a highly successful approach to solving a wide variety of NLP tasks. These models are pretrained on a large amount of textual data in different languages which makes them suitable for cross-lingual transfer.

The most popular multilingual models are multilingual BERT (mBERT) [7] and XLM-RoBERTa [6].

Wu et al. [23] observe a strong correlation between the percentage of overlapping subwords and cross-lingual transfer performance of BERT models and show that lexical overlap improves cross-lingual transfer. Moreover, they discuss that language-specific information is preserved in all layers. Tenney et al. [18] observe that basic syntactic information appears in the earlier layers while high-level semantics appears in the final layers.

Ulčar and Robnik-Šikonja [20] provide two pretrained trilingual BERT-based language models, Finnish-Estonian-English and Croatian-Slovenian-English. Both the CroSloEngual and FinEst BERT models perform better than the multilingual mBERT on many downstream tasks, including POS sequence labeling. Ljubešić and Lauc present BERTić [10], a transformer language model trained on online text written in Bosnian, Croatian, Montenegrin, and Serbian. In comparison to existing state-of-the-art transformer models, it performed significantly better when evaluated on several token and sequence classification tasks.

Hardalov et al. [9] investigate zero-shot transfer from English as a high-resource language to Bulgarian as a low-resource language for the multiple-choice reading comprehension task. They used mBERT and Slavic BERT [7] models, the latter built using transfer learning from the mBERT model to four Slavic languages: Bulgarian, Czech, Polish, and Russian. The experimental results show that fine-tuning mBERT on large-scale English corpora is beneficial for the model transfer, while the Slavic BERT model produced lower results.

Tsai et al. [19] describe the benefits of multilingual models over models trained on a single language. They showed that solving POS tagging using multilingual models is more effective in low-resource languages.

Aepli et al. [1] propose an approach based on exploiting closely related languages to improve the resources in a low-resourced language. The training set in the Macedonian language is expanded with available annotations from several similar languages: Serbian, Bulgarian, Slovene, and Czech from the Slavic family and English. Annotations from other languages are used to disambiguate POS tags in the target language. The tags are selected with majority voting and the tagger trained on the disambiguated set achieves 88% accuracy.

Bonchanoski and Zdravkova [4] present the whole process of creating a POS tagger. The systems implemented for POS tagging of the Macedonian language are trained and evaluated on the manually annotated novel "1984" by Orwell. The best performing model used a cyclic dependency network, which reached 97.5% accuracy.

Vojnovski et al. [22] present the digitalization, alignment, and annotation of the Macedonian "1984" corpus. A statistical tagger, Trigrams'n'Tags (TnT), was trained to learn a POS tagger for the Macedonian language. Evaluating the prepared corpora resulted in 98.1% overall accuracy.


Chapter 3

Datasets

We have chosen datasets from three Slavic languages: Bulgarian, Croatian and Serbian. These languages share important structural characteristics with the target language, namely lexicon, grammar, and word order, and therefore seem to be a reasonable choice for cross-lingual transfer to Macedonian. While Serbian and Croatian distinguish grammatical cases, Bulgarian and Macedonian use prepositions to express grammatical relations instead.

3.1 Universal Dependencies

To implement a POS tagger, we use annotated data from the Universal Dependencies (UD) framework [13]. The goal of the UD project is to create cross-linguistically consistent treebanks for different languages to simplify cross-lingual learning approaches.

A common way to create a UD treebank for a language is to map the language-specific tagset to the unified UPOS tags. The languages of our interest follow the MULTEXT-East tagset [8] as a structuring principle for the language-specific POS tags. The annotations follow a positional tagset encoding where the first capital letter indicates the part of speech and the following characters represent relevant morphological features given in a fixed order. Croatian, Macedonian, and Serbian MULTEXT-East tags have the same number of POS tags and identical naming. On the other hand, Bulgarian has different tag naming and structure.

Each language on the UD platform has at least one treebank where the data is already split into three balanced sets: train, validation, and test. The data is structured in CoNLL-U format [5].

Files in CoNLL-U format have three types of lines: word lines that contain annotations, blank lines denoting sentence boundaries, and comment lines providing general corpus information. Word lines contain the following fields:

• ID: word index

• FORM: word form

• LEMMA: lemma or stem of the word form

• UPOS: universal part-of-speech tag

• XPOS: language-specific part-of-speech tag

• FEATS: list of morphological features

• HEAD: head of the current word in a dependency relation

• DEPREL: universal dependency relation to the HEAD

• DEPS: dependency graph

• MISC: additional annotation

Each word line has to contain all the fields or mark the missing field with an underscore.
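As an illustration of the format, the following minimal Python sketch reads (FORM, UPOS) pairs from a CoNLL-U file. It is not code used in the thesis; the helper name and the decision to skip multiword-token and empty-node lines are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Token:
    form: str
    upos: str

def read_conllu(path: str) -> List[List[Token]]:
    """Return sentences as lists of (FORM, UPOS) tokens from a CoNLL-U file."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line marks a sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            if line.startswith("#"):          # comment line with corpus information
                continue
            fields = line.split("\t")         # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            if "-" in fields[0] or "." in fields[0]:
                continue                      # skip multiword-token ranges and empty nodes
            current.append(Token(form=fields[1], upos=fields[3]))
    if current:
        sentences.append(current)
    return sentences
```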

There is a total of 17 UPOS categories. The datasets from the Croatian and Serbian treebanks contain all of them, while the Bulgarian treebank excludes the SYM and X tags. Both the Croatian and Serbian datasets are written in Latin script, while the Bulgarian data is in Cyrillic.


Language     num. sentences   num. tokens
Bulgarian    11,138           156,147
Croatian     9,010            198,518
Serbian      4,384            97,356
Macedonian   6,790            113,037

Table 3.1: A comparison of language set sizes in their treebanks.

3.2 Croatian

The Croatian UD treebank [2] is based on an extension of the SETimes-HR corpus. SETimes-HR is a subset of SETimes, a parallel news corpus in nine South-Eastern European languages and English. A small collection of Croatian and Serbian newspaper texts and Wikipedia articles is added to this set.

3.3 Bulgarian

The Bulgarian UD treebank (UD Bulgarian-BTB) is a part of the BulTreeBank project from the Bulgarian Academy of Sciences. Half of the texts are fiction, 30% are news data, 10% are legal texts, and the rest are other genres.

3.4 Serbian

The SETimes-SR corpus, as well as additional news documents from the Serbian web, all written in the Latin alphabet, are published in the Serbian UD treebank. The Serbian data is parallel with the Croatian, and both languages show similar performance when used as source datasets for transfer learning.


3.5 Macedonian

The Macedonian language has limited resources. The dataset for this work is based on George Orwell's novel "1984". MULTEXT-East is a parallel corpus that includes translations of this novel in multiple languages. We created a mapping from the MULTEXT-East tags to the universal POS tags, as the Macedonian language is not supported on the UD platform [11].
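Because the mapping itself is not listed in the thesis, the sketch below only illustrates the general idea of converting MULTEXT-East positional tags to UPOS by inspecting the leading positions of each tag. The dictionary is partial and simplified, and the actual mapping used for the Macedonian data may differ.

```python
# Simplified, illustrative mapping from MULTEXT-East positional tags to UPOS.
# The first character encodes the part of speech; some UPOS distinctions
# (e.g. NOUN vs. PROPN, CCONJ vs. SCONJ, VERB vs. AUX) need further positions
# or a lexicon, so this is only an approximation.
FIRST_LETTER_TO_UPOS = {
    "N": "NOUN", "V": "VERB", "A": "ADJ", "P": "PRON", "R": "ADV",
    "S": "ADP", "C": "CCONJ", "M": "NUM", "Q": "PART", "I": "INTJ",
    "Y": "X", "X": "X", "Z": "PUNCT",
}

def multext_to_upos(tag: str) -> str:
    if tag.startswith("Np"):          # proper noun, e.g. "Npmsn"
        return "PROPN"
    if tag.startswith("Cs"):          # subordinating conjunction
        return "SCONJ"
    return FIRST_LETTER_TO_UPOS.get(tag[0], "X")
```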

The table below shows the UPOS tag counts. The tags are similarly distributed across languages.

POS tag   Bulgarian   Croatian   Serbian   Macedonian
ADJ       13591       23817      10837     11342
ADP       22097       19089      9375      12431
ADV       6558        8934       3349      7962
AUX       9134        12560      6203      4491
CCONJ     4860        8142       3331      6767
DET       2433        7394       3665      311
INTJ      28          12         4         38
NOUN      34152       48578      23818     19390
NUM       2106        3385       2086      866
PART      2167        2069       571       3877
PRON      10094       5334       2426      12774
PROPN     8435        12849      7411      1481
PUNCT     22058       24166      12342     17099
SCONJ     1606        4796       3525      2724
VERB      16828       17393      8413      11484
Total     156147      198518     97356     113037

Table 3.2: UPOS counts for languages.


Chapter 4

Text processing overview

Natural language processing is one of the most widely applied machine learning areas. It covers approaches that analyze and try to understand human language. In this chapter, we give an overview of word representations, deep-learning techniques, and popular architectures.

4.1 Word embeddings

Numerical representations of words are called word embeddings. Embedding algorithms use distances and directions in the vector space to encode semantic relations between words so that words with similar meanings are close to each other in the vector space. To represent meaning and knowledge across languages, cross-lingual word embedding methods can be used. These methods often learn representations of words in a joint embedding space.

Almost all modern neural networks for text processing start with an embedding layer. Early embeddings such as Bag of Words [12] and TF-IDF [17] rely on word counts in a sentence and do not store any contextual information. Neural embeddings such as word2vec [15] and GloVe [14] capture more semantic relations. A variant of word2vec, called fastText [3], takes subword information into account. In ELMo embeddings [15], each word has the entire sentence as a context, so vectors can also be produced for out-of-vocabulary (OOV) words. The core idea of contextual word embeddings is to provide a different representation for each word based on its context. Training language models to produce contextual word embeddings has proved to be a very successful approach. The recent transformer architecture [21], described in Section 4.3, uses attention to capture relationships between words in a sentence.
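As a toy illustration of the closeness idea, the following sketch compares made-up three-dimensional word vectors with cosine similarity; the vectors are invented for the example and do not come from any trained model.

```python
import numpy as np

# Invented vectors for illustration only; real embeddings have hundreds of dimensions.
embeddings = {
    "dog":   np.array([0.80, 0.10, 0.30]),
    "puppy": np.array([0.75, 0.20, 0.35]),
    "car":   np.array([0.10, 0.90, 0.20]),
}

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["dog"], embeddings["puppy"]))  # high similarity
print(cosine(embeddings["dog"], embeddings["car"]))    # lower similarity
```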

4.2 Neural networks

In recent years, deep learning techniques have been used to build cutting-edge NLP systems. A neural network is a biologically inspired programming paradigm that enables learning from observational data. Its processing units, named neurons, are positioned in connected layers. Each neuron input connection has its weight, and when the neuron gets a signal from all the neurons of the previous layer, it calculates a weighted sum to which an activation function is applied. A neural network learns from annotated samples, which propagate through the network. The network output is compared with the actual value and an error is calculated, which is used to update the connection weights to minimize the error.

Over the years, many neural approaches to text processing have been developed, each trying to solve the issues of its predecessors.

Recurrent neural networks (RNNs) are suitable for sequence processing. Here, besides the input signal, a neuron receives the output from the previous step, which serves as a memory. With RNNs, a text is processed as a sequence of words where previous words affect subsequent words.

However, RNNs are unable to store information from many previous steps when dealing with long sequences. A common issue is the exploding and vanishing gradient problem, caused by the gradient propagating back through the network and being multiplied by derivatives of the activation function. If the derivatives are large, the gradient increases exponentially as it propagates through the network, which is called the exploding gradient problem. Conversely, if the derivatives are small, the gradients decrease until they eventually vanish.

The LSTM network is an improvement of the RNN. With LSTM, the information flows through cell gates so that the network can selectively remember or forget. When a cell decides that the information is important, it propagates it further; otherwise, it forgets it. In this way, important information is kept longer within the network, and such a network avoids the gradient problem. However, the inability of RNNs and LSTMs to process sequences in parallel causes slow computation.
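The following sketch shows what a small bidirectional LSTM tagger of the kind described above could look like in PyTorch; the hyperparameters and the assumption of 17 UPOS classes are illustrative only, not a description of the models used later in the thesis.

```python
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    """A minimal BiLSTM sequence tagger over integer-encoded tokens."""
    def __init__(self, vocab_size: int, num_tags: int = 17,
                 emb_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)          # (batch, seq_len, emb_dim)
        out, _ = self.lstm(x)              # (batch, seq_len, 2 * hidden)
        return self.classifier(out)        # per-token tag logits

tagger = LSTMTagger(vocab_size=10000)
logits = tagger(torch.randint(0, 10000, (2, 12)))   # shape (2, 12, 17)
```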

4.3 Transformer architecture

The paper "Attention is all you need" [21] made a revolutionary change in language modeling approaches. The objective behind the attention mechanism is to consider not only the input words in the context vector but also their relative importance. Transformer models use the attention mechanism. They consist of an encoder-decoder configuration which enables parallel sequence processing. The encoder block learns a good representation of a language and is used in the BERT model.

4.4 BERT

Bidirectional Encoder Representations from Transformers (BERT) [7] is currently one of the most successful models for language representation.

The BERT model implements masked language modeling and next sentence prediction training objectives. In masked language modeling (MLM), 15% of the words are randomly masked and the model learns to predict the masked words. Next sentence prediction learns whether two sentences are logically related. By solving both tasks, BERT tries to understand the context.
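A rough sketch of the masking step is shown below; real BERT pretraining additionally keeps some selected tokens unchanged or replaces them with random words, which is omitted here for brevity.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", prob=0.15, seed=0):
    """Mask roughly 15% of the tokens and remember the originals as labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < prob:
            masked.append(mask_token)
            labels.append(tok)       # the model must predict the original token here
        else:
            masked.append(tok)
            labels.append(None)      # no prediction needed at this position
    return masked, labels

print(mask_tokens("the tagger assigns a category to every word".split()))
```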

Fine-tuning these models for a specific task can be done with just one additional output layer. This is an important advantage since BERT models require a large amount of data and their training is computationally expensive.

There are variants of BERT with excellent performance on many language tasks; one of them is RoBERTa. As opposed to BERT, RoBERTa is pretrained only on the MLM task and performs dynamic masking of words in each epoch.


Chapter 5

Cross-lingual POS tagging

The goal of this thesis is to implement a POS tagger for a low-resource language. To achieve that, we use multilingual BERT models for cross-lingual transfer. Since training such models from scratch is too costly, we fine-tune the available models with annotated data from a set of languages. In this chapter, we present an implementation of a POS tagger for the Macedonian language and describe the methodology along with the models and training data used.

5.1 Methodology

We rely on multilingual models and apply cross-lingual transfer from similar languages. We exploit the knowledge encoded in the pretrained models, specifically mBERT, trained on more than a hundred languages, and CroSloEngual BERT, trained only on Croatian, Slovenian, and English. To check whether the performance generalizes well across similar models, we expand the model set with XLM-RoBERTa, another multilingual model, and BERTić [10], trained on Croatian, Serbian, Bosnian, and Montenegrin. To find the best performing setting for this task, we study the significance of the script for the knowledge transfer, the influence of lexical overlap and language combinations, as well as correlations between source and target languages. For transfer learning, we use three source languages, Bulgarian, Croatian, and Serbian, due to their similarity to the Macedonian language. We define lexical overlap as the number of word pieces that are contained both in the source-language training set and in the target validation set. We use this metric to observe whether the models achieve high scores due to duplicate tokens in the training and evaluation sets or because they learn the underlying linguistic structure.
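A possible way to compute this overlap measure is sketched below; the tokenizer checkpoint name is an assumption, and any subword tokenizer from the compared models could be substituted.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def wordpiece_vocab(sentences):
    """Collect the set of word pieces occurring in a list of raw sentences."""
    pieces = set()
    for sentence in sentences:
        pieces.update(tokenizer.tokenize(sentence))
    return pieces

def lexical_overlap(source_train_sentences, target_valid_sentences):
    # Number of word pieces shared by the source training set and the target validation set.
    return len(wordpiece_vocab(source_train_sentences) & wordpiece_vocab(target_valid_sentences))
```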

Besides transfer learning from a single source language, we implemented a combined-languages approach. This approach has two variants: creating a joint set from the source languages, and training the models each epoch with a different language.

5.1.1 Joint source languages approach

Aiming to provide more data to the multilingual models, we created one large joint set and used it for cross-lingual transfer. Since the Serbian and Croatian datasets contain data from the same domain, we used only the Croatian dataset, due to its larger size, and built one concatenated set with Bulgarian. Using data from the UD treebanks, which contain train, validation and test splits, we concatenated all splits for these two languages, which resulted in more than 20 thousand sentences for training.

We also include a small target-language training set for model fine-tuning in addition to the source-language training datasets.
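A sketch of how such a joint set could be assembled is given below; read_conllu is the small reader sketched in Chapter 3, the Macedonian file name is hypothetical, and the UD file names follow the usual treebank naming.

```python
import random

def build_joint_set(target_fraction=0.1, seed=42):
    """Pool all Croatian and Bulgarian UD splits, then mix in a fraction of target sentences."""
    joint = []
    for path in ["hr_set-ud-train.conllu", "hr_set-ud-dev.conllu", "hr_set-ud-test.conllu",
                 "bg_btb-ud-train.conllu", "bg_btb-ud-dev.conllu", "bg_btb-ud-test.conllu"]:
        joint.extend(read_conllu(path))

    mk_train = read_conllu("mk_1984-train.conllu")        # hypothetical Macedonian split
    random.Random(seed).shuffle(mk_train)
    joint.extend(mk_train[: int(target_fraction * len(mk_train))])
    return joint
```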

5.1.2 Approach with different languages per epoch

We are interested in whether the models can separately learn patterns from the source languages; therefore, we fine-tune the multilingual models each epoch with a different language and evaluate the models on Macedonian. Moreover, we modify the experiments and include Macedonian data for training as well.
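The scheduling logic of this variant can be sketched as follows; train_one_epoch stands in for an ordinary supervised fine-tuning pass and is not part of any library.

```python
def train_alternating(model, loaders_by_language, num_epochs, train_one_epoch):
    """Fine-tune the model, using a different source-language dataloader in each epoch."""
    languages = list(loaders_by_language)            # e.g. ["hr", "bg"]
    for epoch in range(num_epochs):
        language = languages[epoch % len(languages)]  # cycle through the source languages
        train_one_epoch(model, loaders_by_language[language])
    return model
```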


5.1.3 Model ensemble

To benefit from multiple models, we implemented a model ensemble. Using a fixed validation and test set, we decide on a prediction by majority voting over the predictions of several multilingual models. We expect that a combination of diverse models, each performing well when fine-tuned with a single language, will further improve the scores.
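A minimal sketch of the voting step: predictions from several models, aligned token by token, are reduced to a single tag by majority vote; tie-breaking by model order is an arbitrary choice made for this example.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """predictions_per_model: list of equally long tag lists, one per model, aligned per token."""
    ensemble = []
    for token_predictions in zip(*predictions_per_model):
        tag, _ = Counter(token_predictions).most_common(1)[0]
        ensemble.append(tag)
    return ensemble

print(majority_vote([["NOUN", "VERB"], ["NOUN", "AUX"], ["PROPN", "VERB"]]))
# ['NOUN', 'VERB']
```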

5.2 Models

To solve the POS tagging task in a low-resource language, we use pretrained multilingual models. Training such models requires a large amount of data and powerful computational resources; therefore, we fine-tune the existing models. Their advantage is the ability to solve classification tasks with few additional training steps on target data. The multilingual models are the most suitable for cross-lingual transfer.

Most of the available models nowadays are pretrained on English and other widely spoken languages such as French, Spanish, and Mandarin Chinese, whereas models for other languages are scarce. There are two massively multilingual variants available: mBERT and XLM-RoBERTa. The multilingual BERT is pretrained on Wikipedia data from the 104 most represented languages. It provides a shared representation across different languages.

CroSloEngual BERT, a multilingual model trained on a smaller set of languages, achieves better results on the languages it has been trained on. It is trained on a mix of news articles and general web crawl data, amounting to 5.9 billion tokens.

There are two variants of the mBERT model based on case information, cased and uncased, while CroSloEngual BERT is a case-sensitive model. Both models use the bert-base architecture, which is a 12-layer bidirectional transformer encoder with a hidden layer size of 768 and 110 million parameters in total.

The XLM-RoBERTa model also uses the bert-base architecture, but compared to mBERT it has a larger vocabulary and a total of 270 million parameters. It is trained on 100 languages, using 2 terabytes of filtered CommonCrawl data, and therefore achieves performance gains for a wide range of cross-lingual transfer tasks. The BERTić model is trained using the ELECTRA approach, which preserves the base configuration: 12 transformer layers and 110 million parameters. For training, a total of 8 billion tokens crawled from Bosnian, Croatian, Serbian, and Montenegrin web domains were used.
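For reference, the four models could be loaded for token classification roughly as follows; the Hugging Face checkpoint identifiers are assumptions for illustration, and each model receives a fresh 17-way classification head for the UPOS tags.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed checkpoint names for the four compared models.
CHECKPOINTS = {
    "mBERT": "bert-base-multilingual-cased",
    "XLM-RoBERTa": "xlm-roberta-base",
    "CroSloEngual BERT": "EMBEDDIA/crosloengual-bert",
    "BERTic": "classla/bcms-bertic",
}

def load_tagger(name: str, num_upos_tags: int = 17):
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINTS[name])
    model = AutoModelForTokenClassification.from_pretrained(
        CHECKPOINTS[name], num_labels=num_upos_tags)
    return tokenizer, model
```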

5.3 Target task

Our target task is tagging Macedonian words with the corresponding UD tags. There are 17 UPOS categories, where each defines a grammatical feature of a word. Ideally, we would train a model directly and solve the task, but the lack of sufficient annotated data is an obstacle. To overcome this, we learn a classifier using pretrained multilingual models and UD data from similar languages.

5.4 Evaluation

POS tagging is a token classification task: each word in a sentence has its corresponding tag. For training, we use the Universal Dependencies treebanks in Croatian, Serbian, and Bulgarian, while we evaluate on the Macedonian language. We report precision, recall, and the F1 score to compare model performance for different settings. The F1(c) score is the harmonic mean of precision p and recall r for a given class c, where the recall is the proportion of correctly classified instances among those actually from the class c, and the precision is the proportion of correctly classified instances among the instances predicted to be from the class.

F1(c) = (2 · p_c · r_c) / (p_c + r_c)

The F1 score takes values from the [0, 1] interval, where 0 means wrong predictions and 1 correct classification. We use an average F1 score over all classes.


Similarly, we calculate the classification accuracy CA as the fraction of correct predictions Nc out of the total number of predictions N.

CA = Nc / N
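Both metrics can be computed, for example, with scikit-learn on flat lists of gold and predicted tags, assuming the per-class F1 scores are combined with a macro average as described above.

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(gold_tags, predicted_tags):
    """Classification accuracy and class-averaged (macro) F1 over flat tag lists."""
    return {
        "CA": accuracy_score(gold_tags, predicted_tags),
        "F1": f1_score(gold_tags, predicted_tags, average="macro"),
    }

print(evaluate(["NOUN", "VERB", "ADP"], ["NOUN", "AUX", "ADP"]))
```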


Chapter 6

Results

We present the results for the three source languages, each in a separate section. Next, we discuss the results from the combined-languages approach as well as the model ensemble. All the experiments are done by fine-tuning multilingual models with UPOS-annotated data. For simplicity of comparison, we will refer to mBERT and XLM-RoBERTa as large multilingual models and to CroSloEngual BERT and BERTić as small multilingual models, according to the number of languages the models have been trained on.

Both large multilingual models, mBERT and XLM-RoBERTa, are trained on Macedonian along with all the source languages used. On the other hand, BERTić is trained on two out of the three source languages, Croatian and Serbian, and CroSloEngual BERT only on Croatian, while the Bulgarian language was not used for training either of these two models. Overall, the used models achieve good performance when solving the POS tagging task.

6.1 Serbian language

The UPOS classifier trained using only Serbian data reaches an F1 score and classification accuracy higher than 86% when the classifier is built using the large multilingual models and evaluated on samples in the original script. This behavior is not due to lexical overlap, because the overlapping set contains only around 20 subtokens (Table A.1 in the Appendix), nor due to domain similarity, as the target data is a translated novel while the source is news data. This means that the model learns UPOS structure from Serbian sequences which can be applied to Macedonian. When utilizing the large models, pretrained on hundreds of languages, evaluating the classifier on Macedonian data in Cyrillic script yields better results than evaluating on Latin-script data.

model target script p r F1 CA

mBERT CYR 0.865 0.870 0.867 0.864

LAT 0.706 0.698 0.702 0.715

XLM-RoBERTa CYR 0.874 0.882 0.878 0.873

LAT 0.729 0.727 0.728 0.736

CSE BERT CYR 0.474 0.427 0.449 0.486

LAT 0.692 0.683 0.688 0.702

BERTi´c CYR 0.624 0.588 0.605 0.636

LAT 0.772 0.773 0.772 0.780

Table 6.1: Zero-shot transfer from the Serbian source language. We report precision (p), recall (r), F1 score and classification accuracy (CA).

To understand what causes this behavior, we did additional experiments with both scripts and discovered that the large models have difficulties recognizing transliterated data, as each language may be romanized differently. Furthermore, the ability to recognize the Cyrillic script well may be due to the influential presence of Russian, which is among the largest training languages by vocabulary size.
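For illustration, romanizing the Macedonian evaluation data can be done with a simple character map such as the one below; the exact romanization scheme used in the experiments may differ in detail from this sketch.

```python
# Illustrative Macedonian Cyrillic-to-Latin transliteration map (lowercase letters only).
MK_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ѓ": "gj", "е": "e",
    "ж": "ž", "з": "z", "ѕ": "dz", "и": "i", "ј": "j", "к": "k", "л": "l",
    "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "ќ": "kj", "у": "u", "ф": "f", "х": "h", "ц": "c",
    "ч": "č", "џ": "dž", "ш": "š",
}

def transliterate(text: str) -> str:
    out = []
    for ch in text:
        low = ch.lower()
        if low in MK_CYR_TO_LAT:
            latin = MK_CYR_TO_LAT[low]
            out.append(latin.capitalize() if ch.isupper() else latin)
        else:
            out.append(ch)                 # keep punctuation, digits, Latin letters as-is
    return "".join(out)

print(transliterate("јазик со малку ресурси"))   # "jazik so malku resursi"
```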

Aiming to discover what causes wrong predictions, we plotted a confusion matrix (Figure A.1 in Appendix). Detailed analysis showed that there are tokens that belong to different grammatical categories in the source and target datasets.

When using the CroSloEngual BERT and BERTić models, zero-shot experiments evaluated on Latin-script target data have higher scores than on the original target samples, reaching F1 scores of 0.68 and 0.77, respectively. This is expected because the models have been trained only on Latin-script text.

model target script p r F1 CA

mBERT CYR 0.961 0.961 0.961 0.968

LAT 0.934 0.934 0.934 0.944

XLM-RoBERTa CYR 0.969 0.968 0.969 0.974

LAT 0.935 0.934 0.935 0.944

CSE BERT CYR 0.885 0.881 0.883 0.901

LAT 0.930 0.929 0.929 0.940

BERTi´c CYR 0.937 0.936 0.937 0.947

LAT 0.935 0.934 0.935 0.944

Table 6.2: Results from the few-shot experiments using the Serbian language with 10% target-language training data.

Fine-tuning the large models with 10% of the original target training samples (545 tokens) leads to an increase of around 10% in the F1 score and classification accuracy (CA) in comparison with the zero-shot transfer, when the evaluation is done on the target Cyrillic script.

For this setting, XLM-RoBERTa outperforms mBERT. However, when using Latin script for the added 10% target data, both models achieve 0.944 CA which again shows that large multilingual models perform better when being evaluated on a language in its original script.

Similarly, fine-tuning both small models with 10% target samples reaches high classification accuracy for CroSloEngual BERT and BERTić. We can note that even a small amount of target samples significantly improves the transfer performance, which may be due to the size of the lexical overlap set.

As expected, increasing the amount of added target samples to 30% of the original target training samples (1635 tokens, Table A.2 in the Appendix) shows better model performance, with XLM-RoBERTa again leading. Among the smaller models, when fine-tuning with 30% Latin-script target data, the BERTić model leads with a 0.966 F1 score while CroSloEngual BERT reaches 0.958.

6.2 Bulgarian language

As opposed to Serbian and Croatian data, the Bulgarian dataset is written in Cyrillic script.

model target script p r F1 CA

mBERT CYR 0.842 0.846 0.844 0.865

LAT 0.767 0.773 0.770 0.798

XLM-RoBERTa CYR 0.851 0.854 0.853 0.873

LAT 0.794 0.799 0.796 0.821

CSE BERT CYR 0.721 0.720 0.721 0.754

LAT 0.768 0.772 0.770 0.797

BERTi´c CYR 0.783 0.785 0.784 0.812

LAT 0.811 0.817 0.814 0.838

Table 6.3: Zero-shot transfer from the Bulgarian source language.

Fine-tuning the large models only with source samples and evaluating on the target data results in 0.844 and 0.853 F1 scores for mBERT and XLM-RoBERTa, respectively. Here, similarly as with Serbian, some words belong to different categories in the Bulgarian and Macedonian datasets, which causes wrong predictions during evaluation.

The experiments follow a similar pattern: the large multilingual models perform better when the amount of added target samples increases, and again XLM-RoBERTa outperforms mBERT with 10% and 30% added Macedonian training data when the models are evaluated on Cyrillic-script text. Otherwise, when evaluating on the Latin-script samples, it is the reverse: mBERT achieves higher scores than XLM-RoBERTa (Table A.3 in the Appendix).


model target script p r F1 CA

mBERT CYR 0.964 0.963 0.963 0.969

LAT 0.923 0.926 0.925 0.935

XLM-RoBERTa CYR 0.966 0.966 0.966 0.972

LAT 0.913 0.915 0.914 0.926

CSE BERT CYR 0.861 0.862 0.861 0.883

LAT 0.914 0.916 0.915 0.927

BERTi´c CYR 0.908 0.912 0.910 0.923

LAT 0.911 0.914 0.913 0.926

Table 6.4: Results from the few-shot experiments using the Bulgarian language with 10% target-language training data.

CroSloEngual BERT and BERTić are more interesting for transfer learning from Bulgarian since neither of them has been pretrained on Bulgarian or any other Cyrillic data. Performing zero-shot transfer from Bulgarian to Macedonian, with both source and target languages in Latin script, leads to surprisingly good results for both models: CroSloEngual BERT reaches 0.797 CA and a 0.770 F1 score, while BERTić reaches 0.838 CA and 0.814 F1. BERTić outperforming CroSloEngual BERT may be due to BERTić being pretrained on more Slavic languages. In comparison with the zero-shot experiments using Serbian as a source language, transfer learning from Bulgarian achieves better results, showing that lexical overlap (Table A.1) has a significant influence on the cross-lingual transfer.

6.3 Croatian language

The Croatian dataset contains data from a similar domain as the Serbian one; consequently, all the models trained with these two languages have comparable results (Table A.4 and Table 6.5).


model target script p r F1 CA

mBERT CYR 0.971 0.971 0.971 0.976

LAT 0.939 0.939 0.939 0.949

XLM-RoBERTa CYR 0.970 0.970 0.970 0.975

LAT 0.941 0.939 0.940 0.949

CSE BERT CYR 0.884 0.883 0.883 0.902

LAT 0.935 0.935 0.935 0.945

BERTi´c CYR 0.941 0.942 0.942 0.951

LAT 0.944 0.944 0.944 0.953

Table 6.5: Results from the few-shot experiments using the Croatian language with 10% target-language training data.

Among the large multilingual models, the amount of added target data samples is positively correlated with the scores: mBERT has a classification accuracy of 0.976 and 0.979, while XLM-RoBERTa has an accuracy of 0.975 and 0.982 for the few-shot experiments with 10% and 30% added target samples.

model target script p r F1 CA

mBERT CYR 0.975 0.976 0.975 0.979

LAT 0.961 0.962 0.962 0.968

XLM-RoBERTa CYR 0.978 0.978 0.978 0.982

LAT 0.966 0.966 0.966 0.971

CSE BERT CYR 0.926 0.925 0.926 0.939

LAT 0.961 0.960 0.960 0.967

BERTi´c CYR 0.962 0.962 0.961 0.968

LAT 0.964 0.965 0.964 0.971

Table 6.6: Results from the few-shot experiments using the Croatian source language with 30% target-language training data.

Similarly, CroSloEngual BERT and BERTić have improved performance with the increase of the added target samples, although these scores are lower compared to the scores of the large multilingual models.

6.4 Source languages combined

Performing zero-shot experiments with the large Croatian-Bulgarian dataset on the large models and evaluating on the Cyrillic target script achieves scores not lower than those of the weaker model fine-tuned with a single source language. This is not the case for the small models: while CSE BERT benefits from this joint set, the scores for the BERTić model lie between the single-language scores.

model         p      r      F1     CA
mBERT         0.850  0.854  0.852  0.872
XLM-RoBERTa   0.857  0.860  0.859  0.878
CSE BERT      0.728  0.727  0.728  0.760
BERTić        0.777  0.776  0.776  0.804

Table 6.7: Results from the zero-shot experiments with the joint Croatian-Bulgarian training dataset.

With this dataset, we fine-tuned XLM-RoBERTa, mBERT, CroSloEngual BERT, and BERTić, and here as well XLM-RoBERTa achieves the best results. Moreover, adding 10% of the target training samples to this combined dataset lets the small models reach higher scores than in the few-shot experiments with 10% added target data using a single language for training. However, there is no significant improvement for the large models (results in Table 6.8). The reason why the achieved scores are not even higher might be the inconsistent tokens which belong to different UPOS categories in each of the languages.

Another variant of combining languages is training the multilingual models each epoch with a different source language.


model         p      r      F1     CA
mBERT         0.962  0.963  0.963  0.969
XLM-RoBERTa   0.970  0.971  0.970  0.975
CSE BERT      0.908  0.911  0.910  0.924
BERTić        0.943  0.946  0.944  0.953

Table 6.8: Results from the experiments with the joint Croatian-Bulgarian set with 10% target-language training data.

We should note that, due to the smaller size and the overlapping domain with Croatian, using the Serbian dataset for the combined approach did not show a significant influence on the results; hence we discuss the setting using only Bulgarian and Croatian samples.

model         p      r      F1     CA
mBERT         0.848  0.853  0.851  0.870
XLM-RoBERTa   0.854  0.858  0.856  0.875
CSE BERT      0.723  0.721  0.722  0.754
BERTić        0.784  0.787  0.786  0.810

Table 6.9: Results from the zero-shot experiments using a different source language in each epoch.

Experimenting with different source languages in each epoch reaches slightly lower scores. XLM-RoBERTa outperforms the other models in this case as well. We can observe that few-shot experiments with the large multilingual models fine-tuned on the joint set, or each epoch with a different language, do not perform better than the few-shot experiments in which the models are fine-tuned with a single language (Table 6.8 and Table 6.10).


model         p      r      F1     CA
mBERT         0.959  0.959  0.959  0.965
XLM-RoBERTa   0.957  0.958  0.957  0.964
CSE BERT      0.809  0.807  0.808  0.834
BERTić        0.856  0.858  0.857  0.876

Table 6.10: Results from the few-shot experiments (with 10% target-language training data) using a different source language in each epoch.

6.5 Ensemble model approach

Our ensemble approach uses predictions from multiple models and decides on the predicted label by majority voting. We use predictions from the models fine-tuned with 30% added target data, as they achieved the highest scores among the experiments. We note that we fine-tuned the large models with the original target script, while BERTić and CroSloEngual BERT were fine-tuned with Latin script. Because XLM-RoBERTa and mBERT are strong models for Bulgarian as a source language, we use predictions only from these two models when including Bulgarian. We create an ensemble model from the models fine-tuned with each of the source languages.

The ensemble approach outperforms all other models when it includes predictions from the large multilingual models. The best performance is reached when using predictions from two XLM-RoBERTa models fine-tuned with Serbian and Croatian data, respectively, along with mBERT fine-tuned with Bulgarian. Table 6.11 shows the results of the model ensemble evaluation.


Serbian    Bulgarian   Croatian    CA     F1     p      r
XLM-R      mBERT       XLM-R       0.986  0.983  0.983  0.984
XLM-R      XLM-R       XLM-R       0.986  0.983  0.983  0.983
XLM-R      mBERT       CSE BERT    0.985  0.982  0.982  0.982
XLM-R      mBERT       mBERT       0.985  0.982  0.982  0.982
BERTić     mBERT       XLM-R       0.984  0.981  0.981  0.981
CSE BERT   mBERT       mBERT       0.984  0.981  0.981  0.981
mBERT      mBERT       mBERT       0.984  0.981  0.981  0.981
mBERT      mBERT       CSE BERT    0.983  0.980  0.980  0.980
mBERT      XLM-R       CSE BERT    0.982  0.979  0.979  0.979
BERTić     XLM-R       CSE BERT    0.980  0.976  0.976  0.975
BERTić     mBERT       CSE BERT    0.980  0.976  0.976  0.976

Table 6.11: Ensemble approach results. We report the source language used for fine-tuning each ensemble member, classification accuracy (CA), F1 score, precision (p) and recall (r).


Chapter 7

Conclusion

The goal of the thesis was to develop a POS tagger for a low-resource language using the cross-lingual transfer approach. To achieve this, we fine-tuned pretrained multilingual models on a set of languages. We trained the models on each language as well as on a joint language set. Moreover, we trained each epoch with a different language and implemented an ensemble model. We evaluated the models on the Macedonian language and used the F1 score and classification accuracy as performance metrics. We hypothesized that using transfer learning from similar languages can overcome the limitations of a low-resource language.

Across all zero-shot experiments, we noted that large models, pretrained on many distinct language scripts, reach higher scores when being evaluated on a language in its original script. This is due to the large models having seen Macedonian data during pretraining.

Although the lexical overlap between the Bulgarian and Macedonian data is significantly larger than for the other source-target pairs, all source languages have comparable performance when used for fine-tuning. This shows that multilingual models have learned deeper linguistic structure and do not rely only on lexical overlap.

Detailed error analysis of the large multilingual models (Table A.2 in the Appendix) showed that the source and target datasets share words that are categorized in different UPOS categories; therefore, the models may have learned wrong patterns when being trained on the target task. Such inconsistent tokens may be the reason why training on a combined-language dataset gives only a small increase over the models trained on a single language.

Another reason why models fine-tuned with the Croatian and Serbian data fail to accurately classify a part of the examples may be the differences in word sequences between the target and source languages. We observed that 45% of the source-language sequences contain consecutive ('NOUN', 'NOUN') pairs, which are characteristic of the genitive grammatical case; however, the target language does not distinguish cases, and this repetitive pattern appears in only 0.5% of the target sequences.

From the single-source experiments, the best approach for cross-lingual transfer is using Croatian as a source language to fine-tune XLM-RoBERTa with 30% added target samples, giving 0.982 CA and a 0.978 F1 score when evaluated on Cyrillic-script text. The model fine-tuned with Serbian data has almost equal scores: 0.981 CA and a 0.977 F1 score. From the few-shot experiments with 10% target data, the best performing approach is using Croatian as a source language to fine-tune mBERT, with 0.976 CA and a 0.971 F1 score. Again, the Serbian model has similar performance. The overall best performance is obtained with ensembles: the best model, using predictions from XLM-RoBERTa and mBERT, achieves 0.986 CA and a 0.983 F1 score.

Using the ensemble approach requires more resources since it has to load all the models in memory to decide on a predicted label. A more lightweight solution would be to use a model trained on a single source language with additional target-language training data. The proposed method is focused on a specific language family and shows that similar languages are beneficial for a low-resourced language.


7.1 Limitations and future work

We discuss the limitations of our work and propose ideas for future improvement. When fine-tuning the models, we did not perform hyperparameter optimization, which could further boost their performance.

There is a domain mismatch between the source languages and the target data. The target data is a translated novel with a specific vocabulary, whereas the source-language domain is general news data. Moreover, the multilingual models are pretrained on Wikipedia and crawled news texts, providing word embeddings based on a general context. We assume that the models will perform better when used to solve a task in a domain similar to the one used for training. To achieve this, we could extend the target data by translating a portion of the source-language datasets into the target language or by using target-language synonyms to generate new sentences. Furthermore, we could retrain the lexical layers of the models with a new vocabulary in the target language or a specific domain.

In addition, we noticed that there are tokens that belong to different POS categories in the source and target. To overcome this, we could exclude the inconsistent tags from the training set or augment the data by replacing the contradictory labels in each of the source languages with the target-language labels from the correct category.

We could use the language-specific (XPOS) tags instead of the UPOS tags for the cross-lingual transfer, since the source languages share linguistic features with the target language. We expect that using these language-specific features for similar languages would achieve better transfer performance.

Furthermore, we could improve the ensemble approach by creating one large ensemble from multiple BERT models and combining their outputs for each token to train a token classifier. Additionally, we could extend our approach with the adapter-transformers framework [16], where we train adapters on simpler linguistic tasks and use the trained adapters for solving the POS tagging problem.


Appendix A

Complete results

In this appendix, we present additional results complementing Chapter 6.

Model          % target data   Bulgarian   Croatian   Serbian
mBERT                      0        1740         18        18
                          10        1974       1578      1577
                          30        2041       1860      1859
XLM-RoBERTa                0        2120         21        21
                          10        2668       1837      1836
                          30        2841       2459      2458
CSE BERT                   0         102         19        18
                          10         111         99        98
                          30         111        107       106
BERTić                     0         195         20        19
                          10         218        205       204
                          30         219        215       214

Table A.1: The size of the lexical overlap between the source languages and the target language. The table lists the models used, the amount of target-language training data added, and the source languages.
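
The overlap counts in Table A.1 depend on the model because each model segments text with its own tokenizer. One plausible way to obtain such a count, stated here as an assumption about the procedure rather than a description of our exact script, is to intersect the sets of subword types produced by the model's tokenizer on the source-language training text and on the target text:

    from transformers import AutoTokenizer

    def subword_overlap(source_sentences, target_sentences,
                        model_name="bert-base-multilingual-cased"):
        """Count subword types shared by the source training text and the
        target text, as segmented by one model's tokenizer."""
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        source_types = {piece for s in source_sentences
                        for piece in tokenizer.tokenize(s)}
        target_types = {piece for s in target_sentences
                        for piece in tokenizer.tokenize(s)}
        return len(source_types & target_types)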


A.1 Error analysis on zero-shot transfer

In Figure A.1, we can observe that the model often wrongly predicts the CCONJ and PRON tags.

Figure A.1: A confusion matrix from the zero-shot experiment using the Serbian source language to fine-tune mBERT.

This happens because there are tokens that are labeled as a coordinating conjunction in one language and as a subordinating conjunction in the other. One such example is the token ‘da’. Moreover, many tokens classified as DET in Serbian belong to the grammatical category of pronouns in Macedonian, which causes uncertainty when the model encounters tokens of the DET class.

In Figure A.2, we show the confusion matrix for the zero-shot experiment from the Bulgarian language. Similarly to Serbian, the most obvious mistake is made for the CCONJ category, but this time it is wrongly predicted as AUX. Again, an example is the token ‘da’, which is categorized as an auxiliary in Bulgarian and as a coordinating conjunction in Macedonian. When this token is discarded, the model makes more reliable predictions.


Figure A.2: A confusion matrix from the zero-shot experiment using the Bulgarian source language to fine-tune mBERT.
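
The confusion matrices in Figures A.1 and A.2 can be reproduced from flattened token-level gold and predicted tags, for instance with scikit-learn. The sketch below is a generic recipe with placeholder inputs; the optional `exclude_tokens` argument makes it easy to check how much of the CCONJ confusion is caused by the single token ‘da’.

    from sklearn.metrics import confusion_matrix

    UPOS_TAGS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN",
                 "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM",
                 "VERB", "X"]

    def pos_confusion(gold_sentences, pred_sentences, exclude_tokens=()):
        """Build a tag-level confusion matrix from aligned sentences.

        gold_sentences, pred_sentences: lists of [(token, tag), ...]
        sequences aligned token by token; exclude_tokens: tokens to skip,
        e.g. {"da"}.
        """
        gold_tags, pred_tags = [], []
        for gold_sent, pred_sent in zip(gold_sentences, pred_sentences):
            for (token, gold_tag), (_, pred_tag) in zip(gold_sent, pred_sent):
                if token.lower() in exclude_tokens:
                    continue
                gold_tags.append(gold_tag)
                pred_tags.append(pred_tag)
        return confusion_matrix(gold_tags, pred_tags, labels=UPOS_TAGS)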

A.2 Additional results

model          target script       p       r      F1      CA
mBERT          CYR             0.973   0.974   0.973   0.978
               LAT             0.962   0.962   0.962   0.968
XLM-RoBERTa    CYR             0.977   0.977   0.977   0.981
               LAT             0.965   0.965   0.965   0.971
CSE BERT       CYR             0.922   0.922   0.922   0.936
               LAT             0.959   0.958   0.958   0.965
BERTić         CYR             0.959   0.959   0.959   0.966
               LAT             0.966   0.966   0.966   0.972

Table A.2: Results from the few-shot experiments using the Serbian language with 30% target-language training data.


model          target script       p       r      F1      CA
mBERT          CYR             0.971   0.972   0.972   0.976
               LAT             0.965   0.966   0.965   0.971
XLM-RoBERTa    CYR             0.974   0.974   0.974   0.978
               LAT             0.953   0.954   0.953   0.961
CSE BERT       CYR             0.926   0.928   0.927   0.940
               LAT             0.959   0.959   0.959   0.965
BERTić         CYR             0.958   0.960   0.959   0.965
               LAT             0.957   0.957   0.957   0.964

Table A.3: Results from the few-shot experiments using the Bulgarian language with 30% target-language training data.

model          target script       p       r      F1      CA
mBERT          CYR             0.869   0.874   0.872   0.867
               LAT             0.708   0.703   0.706   0.715
XLM-RoBERTa    CYR             0.874   0.882   0.878   0.873
               LAT             0.737   0.738   0.738   0.744
CSE BERT       CYR             0.486   0.435   0.459   0.499
               LAT             0.698   0.690   0.694   0.708
BERTić         CYR             0.643   0.609   0.626   0.656
               LAT             0.758   0.759   0.759   0.766

Table A.4: Results from the zero-shot transfer experiments from the Croatian source language.


Bibliography

[1] N. Aepli, R. von Waldenfels, and T. Samardžić. Part-of-speech tag disambiguation by cross-linguistic majority vote. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, pages 76–84, 2014.

[2] Ž. Agić and N. Ljubešić. Universal dependencies for Croatian (that work for Serbian, too). In The 5th Workshop on Balto-Slavic Natural Language Processing, pages 1–8, 2015. URL https://www.aclweb.org/anthology/W15-5301.

[3] P. Bojanowski, É. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.

[4] M. Bonchanoski and K. Zdravkova. Learning syntactic tagging of Macedonian language. Computer Science and Information Systems, 15(3):799–820.

[5] S. Buchholz and E. Marsi. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 149–164, 2006.

[6] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, É. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, 2020.

[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

[8] T. Erjavec. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages. Language Resources and Evaluation, 46(1):131–142, 2012.

[9] M. Hardalov, I. Koychev, and P. Nakov. Beyond English-only reading comprehension: Experiments in zero-shot multilingual transfer for Bulgarian. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 447–459, 2019.

[10] N. Ljubešić and D. Lauc. BERTić - the transformer language model for Bosnian, Croatian, Montenegrin and Serbian. In Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing, pages 37–42, 2021.

[11] N. Ljubešić, K. Zdravkova, S. Stojanoska, and T. Erjavec. The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Macedonian 1.0, 2020. URL http://hdl.handle.net/11356/1373. Slovenian language resource repository CLARIN.SI.

[12] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. 2013.

[13] J. Nivre, M.-C. De Marneffe, F. Ginter, Y. Goldberg, J. Hajic, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, et al. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666, 2016.

[14] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[15] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, 2018.

[16] J. Pfeiffer, I. Vulić, I. Gurevych, and S. Ruder. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, 2020.

[17] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.

[18] I. Tenney, D. Das, and E. Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, 2019.

[19] H. Tsai, J. Riesa, M. Johnson, N. Arivazhagan, X. Li, and A. Archer. Small and practical BERT models for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3623–3627, 2019.

[20] M. Ulčar and M. Robnik-Šikonja. FinEst BERT and CroSloEngual BERT. In International Conference on Text, Speech, and Dialogue, 2020. URL https://doi.org/10.1007/978-3-030-58323-1_11.

[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010, 2017.

[22] V. Vojnovski, S. Džeroski, and T. Erjavec. Learning POS tagging from a tagged Macedonian text corpus.

[23] S. Wu and M. Dredze. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, 2019.
