6.2 Future Work

6.2.2 Applicability to non-English Languages

An important advantage of semantic representations is that they are based on concepts rather than words, and thus language-independent. Once we represent the text in a semantic form, all downstream methods (e.g. domain template construction, article re-ranking in DiversiNews, and the FrameSum summarizer from this thesis) will work without a single change, regardless of the language of the original text. All background knowledge (e.g. that encoded in Cyc or WordNet) is language agnostic and reusable too. However, the methods for converting text to a semantic form depend, to varying extents, on resources or heuristics that are language-specific. Since we only presented results for English, it is natural to ask ourselves if comparable resources exist for other languages (with a focus on Slovenian), and if they do not, how costly and time-consuming it would be to introduce them.

The resources fall into two main groups: static resources (dictionaries, verb usage patterns, labeled training data, etc.) and tools (tokenizers, POS taggers, parsers, etc.). Let’s look at them in order of complexity, from low to high.

Tokenization Tokenization is easy for languages that delimit their words with spaces, but non-trivial for those that do not (e.g. Japanese, Chinese, Korean). However, being such a rudimentary task necessary for almost any further processing, it is well solved for the major languages.
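The contrast can be illustrated with a minimal sketch. The function name and both example sentences are assumptions made for illustration and are not part of the pipelines described in this thesis. A tokenizer that relies on whitespace and punctuation handles English adequately but returns a whole Japanese clause as a single token.

```python
import re

def whitespace_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    This relies on whitespace (and punctuation) to find word boundaries,
    so it cannot segment scripts that do not mark boundaries with spaces.
    """
    return re.findall(r"\w+|[^\w\s]", text)

# English: boundaries are marked by spaces, so the result is usable.
english = whitespace_tokenize("The cat sat on the mat.")

# Japanese: no spaces between words, so the clause stays in one piece.
japanese = whitespace_tokenize("猫がマットの上に座った。")

print(english)
print(japanese)
```

For languages of the second kind, a dedicated word segmenter (typically dictionary-based or statistical) has to replace this step.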

POS tagging Part of speech tagging is one of the most basic natural language processing tasks, and has therefore been made available for a number of languages. Even Slovenian, for example, got its first POS annotator in 1997 [119]. The required features are relatively easy to construct and there is little in terms of dependencies; the main problem is acquiring enough training data.

It is however worth noting that just as languages differ in vocabulary, they differ in grammar, too, so different sets of POS tags apply to different languages. A normalization layer would thus be required for the downstream systems to function unchanged. Luckily, we only use coarse grammatical roles in our work: nouns, verbs, and pronouns, and those are almost certain to exist in all major languages.

For several languages, the Universal Dependencies project provides mappings from language-specific tags to coarse(r) language-independent tags we could use with our approach.
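A normalization layer of this kind could look roughly as follows. This is a sketch under stated assumptions, not the mapping published by Universal Dependencies: the coarse category names, the `normalize` function, and the tagset identifiers are hypothetical, and the Slovenian branch assumes MULTEXT-East-style tags in which the first character encodes the part of speech.

```python
# English: Penn Treebank tags collapsed to the coarse categories we need.
PTB_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "PRP": "PRON", "PRP$": "PRON", "WP": "PRON", "WP$": "PRON",
}

# Slovenian (assumption): positional morphosyntactic tags whose first
# character gives the part of speech, e.g. "Ncmsn" for a common noun.
MSD_FIRST_CHAR_TO_COARSE = {"N": "NOUN", "V": "VERB", "P": "PRON"}

def normalize(tag, tagset):
    """Map a language-specific POS tag to NOUN, VERB, PRON, or OTHER."""
    if tagset == "ptb":
        return PTB_TO_COARSE.get(tag, "OTHER")
    if tagset == "msd":
        return MSD_FIRST_CHAR_TO_COARSE.get(tag[:1], "OTHER")
    raise ValueError(f"unknown tagset: {tagset}")

print(normalize("NNS", "ptb"))     # English plural noun
print(normalize("Vmpr3s", "msd"))  # Slovenian verb form (assumed tag)
```

Because the downstream systems only see the coarse categories, supporting a new language amounts to adding one more mapping table.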

POS tagging alone is enough to build rudimentary semantic frames from text [92], making the approaches discussed in this thesis theoretically viable even for languages that lack more advanced NLP tooling. In practice, however, the decreased precision and recall are likely to critically impact the quality of the end output.
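To make the idea concrete, the following sketch builds subject-verb-object frames from coarse POS tags alone. It is a hypothetical nearest-noun heuristic written for illustration and is not the method of [92]; the function name, the frame fields, and the example sentence are all assumptions.

```python
NOMINAL = ("NOUN", "PRON")

def extract_frames(tagged):
    """Build one frame per verb from (token, coarse_tag) pairs.

    Heuristic (assumption): the closest nominal to the left of the verb
    is its subject, the closest nominal to the right is its object.
    """
    frames = []
    for i, (token, tag) in enumerate(tagged):
        if tag != "VERB":
            continue
        subject = next(
            (t for t, g in reversed(tagged[:i]) if g in NOMINAL), None)
        obj = next(
            (t for t, g in tagged[i + 1:] if g in NOMINAL), None)
        frames.append(
            {"predicate": token, "subject": subject, "object": obj})
    return frames

sentence = [
    ("The", "OTHER"), ("committee", "NOUN"), ("approved", "VERB"),
    ("the", "OTHER"), ("budget", "NOUN"), (".", "OTHER"),
]
frames = extract_frames(sentence)
print(frames)
```

The weakness of such heuristics is easy to see: for a passive sentence such as "The budget was approved by the committee", the same rule would assign the roles of "approved" the wrong way round, which is the kind of precision loss referred to above.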

Parsing There are many variants to the task of parsing: shallow parsing or chunking, constituency parsing, dependency parsing, and more. The SDP approach to text semantization discussed in Section 3.2 operates on dependency parses, but those in turn are usually derived (using handcrafted rules, see e.g. [73]) from constituency parses. While dependency parsers are somewhat less common, constituency parsers have been developed for a number of languages. The same relationship holds between English and non-English languages as it did for POS tagging: English has bigger datasets, higher accuracy, and more readily available tools, but parsing for other languages is doable too, and has been done. The framework is generic and “English” parsers can be reused, even for languages like Japanese; the problematic part is getting the data. For example, the Slovenian dependency treebank [120] has over 300 000 words, which is enough to get to about 60% accuracy on labeled dependencies [121].


As with part of speech tagging, different languages give rise to slightly different sets of relations between sentence constituents, so the labels employed by parsers differ from language to language. What’s more, even parsers for a single language may not define relations in the same way or use the same set of labels (for English, compare e.g. MiniPar [122] and Stanford Parser [90]). Parser-specific normalization would therefore likely be needed before subsequent steps, be that feature generation for our MSRL approach (Section 3.3) or the rule-based conversion of trees into frames in the SDP approach (Section 3.2).
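Such parser-specific normalization could be as simple as a per-parser lookup table applied to every dependency edge. The sketch below is illustrative only: the common label names and the function are hypothetical, and the parser label inventories shown (in particular the MiniPar-style ones) are assumptions that would need to be checked against each parser's documentation.

```python
# Per-parser maps from native relation labels to a shared vocabulary.
# The label inventories here are deliberately tiny and partly assumed.
LABEL_MAPS = {
    "stanford": {"nsubj": "SUBJECT", "dobj": "OBJECT",
                 "iobj": "INDIRECT_OBJECT"},
    "minipar": {"subj": "SUBJECT", "s": "SUBJECT", "obj": "OBJECT"},
}

def normalize_edges(edges, parser):
    """Rewrite (head, label, dependent) triples using shared labels.

    Labels without a mapping become "OTHER" so that downstream rules
    never see a parser-specific label.
    """
    mapping = LABEL_MAPS[parser]
    return [(head, mapping.get(label, "OTHER"), dep)
            for head, label, dep in edges]

stanford_edges = [("approved", "nsubj", "committee"),
                  ("approved", "dobj", "budget")]
minipar_edges = [("approved", "subj", "committee"),
                 ("approved", "obj", "budget")]

print(normalize_edges(stanford_edges, "stanford"))
print(normalize_edges(minipar_edges, "minipar"))
```

After normalization, both parsers yield the same triples for this sentence, so the feature generation or tree-to-frame rules can be written once against the shared labels.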

Coreference resolution can be seen as a subtask of (semantic) parsing. Here, too, the biggest problem is getting enough annotated data. For example, Hendrickx et al. [123] report annotating a corpus of over 300 000 words to create a reasonably performing coreference resolution system for Dutch. This is comparable to what is needed for POS tagging [119], but the use case is more limited and the expense therefore harder to justify. I am not aware of a coreference resolution system for Slovenian. Coreference resolution is “optional” for text semantization in that the pipelines will still work without it, but recall will suffer significantly, as a lot of facts in natural language are expressed using pronouns.

Semantic role labeling With SRL, the required amount of training data is gargantuan, and as we saw in Section 3.3.1, problematic even for English. The time and money requirements make it unrealistic to build a comprehensive set of verbs and roles with sufficient training data in the near future. There is research on SRL for non-English languages; notably, the CoNLL-2009 challenge provided datasets for Catalan, Chinese, Czech, German, Japanese, and Spanish [124]. However, the goal of that research is not to create a comprehensive SRL solution, but rather to see, on a limited set of roles, how well the systems can handle new languages (with new grammatical structures, poorer tooling, etc.). The results vary by language; in general, F1 scores tend to be about 5% lower than for English [125].

In summary, the semantization technology is capable of consuming non-English languages, but depends on non-trivial, costly amounts of training data. Therefore, while work has been done on many languages other than English, the training data falls short of its English counterpart, and so does performance of the resulting systems.


