
The key contributions to science are listed in Section 6.1. In brief, however, they are:

• A new method for semantically representing text from “any” domain, with broader scope than supervised relation extraction algorithms but still sufficient accuracy. (Section 3.2)

• Two new methods for obtaining domain templates, evaluated against the state of the art. (Sections 4.2, 4.3)

• An interface for exposing opinions in news, based on navigating along novel dimensions, validated in a user study. (Section 5.1)

Let us summarize the novelties in a more descriptive way as well.

In Chapter 3 we propose and evaluate several techniques for text semantization.

While there is no shortage of related work (see Section 2.2), it mostly focuses on extracting a small number of semantic objects or relations with high precision and/or recall. There is a much smaller set of projects that valiantly attempt to extract a large number (“all”) of objects and/or relations. As this is a much harder task, they focus on precision and sacrifice (sentence-level) recall, with the goal of aggregating the extracted information over a large dataset and reconstructing “common sense” facts or other relations that are relatively pervasive throughout a set of analyzed documents. Our work also deals with general-purpose semantic representations (i.e. a large number of objects/relations), but sacrifices precision for recall, exploring whether it is viable to semantically represent a single document well enough to enable common text mining tasks, e.g. document similarity measurement. Prior work in this direction is very scarce, and little was known about the empirical limitations of current tooling and static resources. We demonstrate that it is possible to extract (shallowly) semantic representations with a balance of reasonable recall (most sentences generate at least one feature) and precision.

We “test-drove” the new representation on the little-researched task of domain template construction (Chapter 4) – only a few papers exist on the topic, and none of them employ structured data representations or background knowledge. As the task’s output is inherently structured, we deemed it promising to devise an algorithm for it that uses the abovementioned semantic representation. The results confirmed our hypothesis: our method allows one to infer templates for a collection of documents, keeping the quality of the produced templates on par with prior state of the art, but unlike any prior work, also providing additional structure and type information for the templates.

Finally, we combined those same representations with additional semantic data and used them as the foundation of a news exploration system (Chapter 5). The innovation is on the system level rather than in individual data analysis components. To our knowledge, no existing system provides a comparable level of in-depth analysis for individual news events. It is now easier than before to understand the details of a controversial news story, its different aspects, and the viewpoints of various stakeholders.

Chapter 2 Background

2.1 Terminology and Notation

Before diving deeper, let us expand on some of the key terms and expressions used throughout the thesis. Some of them appear directly in the title, “Semantic approaches to domain template construction and opinion mining from natural language”; others simply cannot be avoided when speaking of commonalities and differences in collections of news. Some deserve to be mentioned because they are specific to a narrower domain and not widely used (e.g. role filler); others are quite commonplace and used in a number of contexts (e.g. story), so we explain more precisely what we mean by them.

• Semantic data is a loosely defined term. While the dictionary definition – “semantic — Of or relating to meaning, especially meaning in language.” – is clear, there is no unanimous definition of the properties that a data representation should have to be deemed semantic. We use the adjective semantic to refer to data that is meaningful and interpretable without further human intervention by virtue of the rich context in which it has been placed. The context is typically ontological (e.g. the string “President Obama” can be given context by associating it with Obama’s DBpedia page with its many relations and attributes) or structural (e.g. the string “Luke” in a list is meaningful if we also encode the fact that this is the list of the 10 most frequent baby names in the US in 2013).

• Many of our experiments deal with news. We use the term article to refer to the text from a single news webpage and story to refer to the informally defined collection of articles that are reporting on the same event. Because there is a one-to-one correspondence between events (which happen in real life) and stories (which report on them), we sometimes use the two terms interchangeably.

• When abstracting away the set of common attributes for a collection of articles on related events (e.g. earthquakes), we present them in terms of recurring roles (e.g. magnitude, location). The collection of all roles is called a domain template. Values that fill the roles (e.g. “3.4” for the magnitude) are role fillers. Note that the terminology in related work is highly varied; Table 2.1 contains the details.

• Opinion or viewpoint is a person’s take on a topic. When the person authors a document (e.g. a news article) on the topic, the opinion is reflected in the aspects they emphasize, their judgment statements, their disposition towards the subject matter and the like. A “common sense” definition suffices as we do not model opinions explicitly in our work; we instead model properties that are likely to correlate with opinions: sentiment, geographical provenance and topical focus.
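To make the relationship between a domain template, its roles and role fillers concrete, the following is a minimal Python sketch. It is purely illustrative and not the data model used in the thesis: the class names `DomainTemplate` and `Story`, the method `missing_roles`, and the article identifiers are assumptions introduced here; only the earthquake example (roles magnitude and location, filler “3.4”) comes from the definitions above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DomainTemplate:
    """The collection of all roles that recur across stories of one domain."""
    domain: str
    roles: tuple[str, ...]


@dataclass
class Story:
    """Articles reporting on the same event, plus the role fillers found in them."""
    articles: list[str]                                     # hypothetical article identifiers
    fillers: dict[str, str] = field(default_factory=dict)   # role -> role filler

    def missing_roles(self, template: DomainTemplate) -> list[str]:
        """Return the template roles for which this story has no filler yet."""
        return [role for role in template.roles if role not in self.fillers]


# Example from the text: earthquakes have the roles magnitude and location.
earthquake = DomainTemplate(domain="earthquake", roles=("magnitude", "location"))

# One story with a filler for the magnitude role only.
story = Story(articles=["article-1", "article-2"], fillers={"magnitude": "3.4"})

print(story.missing_roles(earthquake))  # ['location']
```

Keeping the template (roles only) separate from the story (fillers) mirrors the distinction drawn above: the template describes a whole domain, while fillers belong to one event.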

Several methods in this thesis are based on a graph-like representation of documents, roles, and summaries, with labeled nodes denoting concepts and labeled edges denoting relations between them. We use the following notation:

• Node for concepts extracted directly from documents, e.g. Obama.

• NodeType for generic, automatically inferred concepts, e.g. politician.

• Node1 --relation--> Node2 for relations.
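The graph-like representation described above can be sketched as a set of labeled edges between labeled nodes. This is a minimal illustration under stated assumptions, not the thesis implementation: the classes `Node` and `Relation`, the `inferred` flag (standing in for the Node/NodeType distinction), the helper `outgoing`, and the relation labels `is_a` and `visit` together with the node `Berlin` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A labeled concept; inferred=True marks a generic, automatically inferred type."""
    label: str
    inferred: bool = False


@dataclass(frozen=True)
class Relation:
    """A labeled, directed edge between two concepts."""
    source: Node
    label: str
    target: Node

    def __str__(self) -> str:
        # Plain-text rendering of the relation notation.
        return f"{self.source.label} --{self.label}--> {self.target.label}"


def outgoing(graph: set[Relation], node: Node) -> list[tuple[str, str]]:
    """List (relation label, target label) pairs leaving `node`, sorted for stable output."""
    return sorted((r.label, r.target.label) for r in graph if r.source == node)


obama = Node("Obama")                           # concept extracted from a document
politician = Node("politician", inferred=True)  # generic, inferred concept

# A document (or role, or summary) is then simply a set of relations.
graph = {
    Relation(obama, "is_a", politician),        # hypothetical relation label
    Relation(obama, "visit", Node("Berlin")),   # hypothetical relation and node
}

print(Relation(obama, "is_a", politician))  # Obama --is_a--> politician
print(outgoing(graph, obama))
```

Frozen dataclasses are hashable, so relations can be stored in a set and duplicate edges collapse automatically.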

Throughout the thesis, we use “quoted sans-serif text” to present (snippets of) actual input text, and bolded text to emphasize important points or concepts.

Additional terms and notation specific to individual sections are introduced in those sections.