
2.2 Related Work

2.2.1 Semantic Representations of Text

Almost any formalization for semantically representing text can be recast as a collection of relations. The task of semanticizing text therefore reduces to that of relation extraction, a subfield of information extraction (IE). The narrower field of semantic fact extraction is much less researched. In “standard” IE, the topic domain is constructed beforehand and remains fixed. There is a large body of IE research available; see e.g. [9] for a survey or the very active TAC (Text Analysis Conference) challenge [10]. Of even more interest are Open Information Extraction systems; “open” in the task name refers to the fact that these systems construct new concepts and relations on the fly. Of similar interest are systems that do not quite perform open IE but consider a very large number of predefined relations.

The first open IE system was TextRunner [11, 12]. TextRunner considers each noun phrase in a sentence as a possible entity and models binary relations with noncontiguous sequences of words appearing between two entities. For a candidate pair of entities, a sequence tagger (named O-CRF, based on conditional random fields) decides for each word whether it is part of the relation phrase or not. The system starts with a large number of heuristically labeled training examples and can bootstrap itself by alternately learning relation phrases and entity pairs. TextRunner focuses on relations that can be expressed as verb phrases. It attempts to link entities to Freebase; the relations are always kept at the level of string sequences.

1 With the exception of Diversity-Aware Summarization, where I am the sole author of its only section partially included in this thesis.
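To make the candidate-generation step concrete, here is a minimal sketch (not TextRunner's actual code; the chunk format with tag "NP" for noun phrases is invented for illustration) that pairs up noun phrases and treats the intervening words as a potential relation phrase, the part that O-CRF would then label word by word:

```python
# Illustrative sketch, not TextRunner's actual code: the (text, tag) chunk
# format, with tag == "NP" marking noun phrases, is invented for this example.

def candidate_relations(chunks):
    """Pair consecutive noun phrases; the words between them form a
    candidate relation phrase (which O-CRF would label word by word)."""
    np_positions = [i for i, (_, tag) in enumerate(chunks) if tag == "NP"]
    candidates = []
    for a, b in zip(np_positions, np_positions[1:]):
        between = " ".join(text for text, _ in chunks[a + 1:b])
        if between:
            candidates.append((chunks[a][0], between, chunks[b][0]))
    return candidates

chunks = [("Einstein", "NP"), ("was born in", "O"), ("Ulm", "NP")]
print(candidate_relations(chunks))  # [('Einstein', 'was born in', 'Ulm')]
```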

ReVerb [13] is the successor to TextRunner. Unlike TextRunner, it identifies potential relation phrases first, using a handcrafted regular expression over POS tags. All relations include a verb. If a relation phrase is surrounded by two noun phrases, the triple constitutes a candidate relation. Results are further refined by only keeping relation phrases that occur with multiple distinct noun phrase pairs. Finally, the authors train a supervised model that assigns a confidence score to every relation. The model was trained on a small hand-labeled dataset but is independent of the relation phrase; the features are lexical and POS-tag based.
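The flavor of ReVerb's syntactic constraint can be conveyed by a simplified regular expression over coarse POS tags (the actual system matches a pattern of the shape V | VP | VW*P over Penn Treebank tags; the tag names below are placeholders, not ReVerb's exact tag set):

```python
import re

# Simplified rendition of ReVerb's verb-centered POS pattern.
# V = verb (+ optional particle/adverb), W = noun/adj/adv/pron/det,
# P = preposition/particle/infinitive marker. Tag names are placeholders.
V = r"VERB(?: PRT)?(?: ADV)?"
W = r"(?:NOUN|ADJ|ADV|PRON|DET)"
P = r"(?:ADP|PRT|TO)"
RELATION = re.compile(rf"{V}(?:(?: {W})* {P})?")

def longest_relation_span(pos_tags):
    """Return the longest match of the relation pattern in a tag sequence."""
    text = " ".join(pos_tags)
    matches = [m.group(0) for m in RELATION.finditer(text)]
    return max(matches, key=len) if matches else None

print(longest_relation_span(["NOUN", "VERB", "ADP", "NOUN"]))  # 'VERB ADP'
```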

SOFIE [14] and its successor PROSPERA [15] are interesting in that they perform relation extraction simultaneously with alignment to the target ontology. The ontology is then also central to placing type constraints on relation candidates. For example, for presidentOf(X, Y) to hold, X has to be of type Person. Both systems use YAGO [16], a lightweight ontology built by cleaning Wikipedia/DBpedia, as the ontology, restricting themselves to extracting Wikipedia entities and infobox relations.
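Schematically, such a type check is trivial once entities are mapped into the ontology; the following sketch (with an invented micro-ontology, not YAGO's actual interface) shows the idea:

```python
# Minimal sketch of SOFIE/PROSPERA-style type constraints; the ontology
# and relation signatures below are invented for illustration.
ENTITY_TYPES = {"Lincoln": "Person", "USA": "Country", "1861": "Year"}
RELATION_SIGNATURES = {"presidentOf": ("Person", "Country")}

def satisfies_signature(relation, x, y):
    """Accept a candidate fact only if both arguments match the
    domain/range types the ontology declares for the relation."""
    domain, range_ = RELATION_SIGNATURES[relation]
    return ENTITY_TYPES.get(x) == domain and ENTITY_TYPES.get(y) == range_

print(satisfies_signature("presidentOf", "Lincoln", "USA"))  # True
print(satisfies_signature("presidentOf", "1861", "USA"))     # False
```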

O-CRF, ReVerb, PROSPERA and the majority of other related work are based on lexical and POS patterns. In contrast, Ollie [17] uses syntactic features derived from dependency parse trees. Ollie uses ReVerb to generate a seed set of relations; using those relations, it finds new sentences that contain the same words but different phrasing, and finally it learns link patterns in the dependency tree that connect the relation constituents. The patterns are in fact lexico-syntactic, as the system allows constraints on the content of tree nodes that appear in the pattern. By using patterns of this kind, Ollie is able to find relations that are not expressed by verbs.
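The notion of a path through the dependency tree connecting relation constituents can be illustrated with a toy parse (hand-built here; a real system like Ollie obtains heads and labels from a parser and additionally constrains node contents):

```python
# Toy sketch of extracting the dependency path between two tokens; the parse
# is hand-built, mapping each token to (head_token, dependency_label).
heads = {
    "Einstein": ("born", "nsubjpass"),
    "was": ("born", "auxpass"),
    "born": (None, "root"),
    "in": ("born", "prep"),
    "Ulm": ("in", "pobj"),
}

def path_to_root(token):
    path = []
    while token is not None:
        path.append(token)
        token = heads[token][0]
    return path

def dependency_path(a, b):
    """Shortest path between two tokens via their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(t for t in pa if t in pb)
    up = pa[:pa.index(common) + 1]
    down = list(reversed(pb[:pb.index(common)]))
    return up + down

print(dependency_path("Einstein", "Ulm"))  # ['Einstein', 'born', 'in', 'Ulm']
```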

Another Open IE system using dependency parse trees is KNext [18]; the transformation of parse trees into the structured representation of choice is simply a matter of manual rules, not unlike in our SDP approach (Section 3.2). Its output tends towards heavier formal logic; for example, the fragment “those in the US” would be recognized as extraction-worthy and converted to ∃x, y, z. thing-referred-to(x) ∧ country(y) ∧ exemplar-of(z, y) ∧ in(x, z).

Also prominent is NELL, the Never Ending Language Learner [19, 20]. Not unlike SOFIE/PROSPERA, it relies on existing knowledge to provide constraints and hints during the acquisition of new statements; however, the ontology in this case is built by the system from scratch. NELL is unique in that it automatically proposes new categories, relations, and even ontological rules. Here, we describe only candidate relation extraction from text. Each relation is seeded with a small number of samples, from which two cooperating subsystems mutually bootstrap themselves, also with the help of other subsystems (e.g. rule inference, learning of entity types).

Two of these subsystems deserve mention. The Coupled Pattern Learner (CPL) searches for frequently co-occurring lexical patterns between pairs of noun phrases, not unlike TextRunner. Also based on co-occurrence statistics, CSEAL learns HTML patterns that capture relations expressed as lists or tables on webpages.
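Schematically, the coupled bootstrapping behind CPL alternates between promoting patterns and promoting instances; the sketch below (corpus and seeds invented, and without the frequency thresholds and mutual-exclusion coupling NELL uses to limit semantic drift) shows the core loop:

```python
# Schematic of CPL-style coupled bootstrapping; corpus and seed are invented.
corpus = [
    ("Paris", "is the capital of", "France"),
    ("Berlin", "is the capital of", "Germany"),
    ("Paris", "is located in", "France"),
]
instances = {("Paris", "France")}  # seed instances of capitalOf
patterns = set()

for _ in range(3):  # a few bootstrap rounds
    # promote patterns that connect known instance pairs
    patterns |= {p for x, p, y in corpus if (x, y) in instances}
    # promote instance pairs matched by a trusted pattern
    instances |= {(x, y) for x, p, y in corpus if p in patterns}

print(patterns)   # both phrasings get promoted (a source of semantic drift)
print(instances)  # {('Paris', 'France'), ('Berlin', 'Germany')}
```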

A further very abridged but reference-rich overview can be found in a recent tutorial by Suchanek and Weikum [21].

The most established and successful projects of the above are KnowItAll (encompassing TextRunner, ReVerb, Ollie and more) and NELL. They both aim to keep learning through time, bootstrapping their precision and recall from previously acquired knowledge. Both have been running for several years, with the long-term goal of capturing and structuring as much common-sense knowledge from the internet as possible. In fact, most of the open IE systems above aim to extract universal truths, “web-scale information extraction” being a common keyphrase. Precision is crucial, particularly if bootstrapping is intended. Our requirements are a bit different in that we need semantic representations of a single piece of text in order to perform further computations on it; we therefore care primarily about recall at the level of statements within an individual document, not about precision at the level of universally true statements, as web-scale extraction systems do.

A very different but highly relevant take on semantic representations is provided by deep learning methods, which have recently enjoyed a lot of popularity. These methods convert inputs (images, sound, ..., text) to low-dimensional vectors that carry a lot of semantics, but little to no formal structure. Mikolov et al.'s word2vec approach [22] acts on individual words and is one of the seminal papers in the area dealing with text. Even more closely related to our work are approaches that model whole sentences or paragraphs, based on various recursive or hierarchical neural net designs. One of the more prominent topologies here is the Dynamic Convolutional Neural Net [23]. Alternatively, the approach by Grefenstette et al. [24] maps text directly to a structured representation, though it requires training data in the form of sentence-parse pairs. The algorithm proceeds in two steps. In the first, a latent “interlingua” vector is computed using a simple word2vec-like network mapping sentences to their parses. In the second step, only the projection of sentences to the latent space is retained, and it is in turn used as input to training a generative recursive neural network that produces parses.
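As a usage illustration, training word2vec embeddings takes a few lines with gensim (assuming gensim >= 4, where the dimensionality parameter is vector_size; a toy corpus this small yields meaningless vectors and only shows the API shape):

```python
# Minimal word2vec usage sketch with gensim (gensim >= 4 assumed).
from gensim.models import Word2Vec

sentences = [["king", "rules", "kingdom"], ["queen", "rules", "kingdom"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vec = model.wv["king"]  # a dense, low-dimensional embedding of the word
print(vec.shape)        # (50,)
print(model.wv.most_similar("king", topn=2))
```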

Semantic Role Labeling (SRL). There is a relatively large amount of existing work on automated SRL. The basic design of all prominent methods is unchanged since the first attempt by Gildea and Jurafsky [25] – a supervised learning approach on top of PropBank or FrameNet annotated data (see Section 2.3.2), with hand-constructed features from parse trees.

A basic preprocessing step is constituency parsing (although a few rare examples opt for chunking or other shallower methods [26]). This gives rise to most of the features; feature engineering was shown to be very important [27]. The problem is then typically divided into frame selection, role detection, and role identification steps; all of them are almost always performed using classic ML techniques. Here, too, deep learning has recently brought improvements to the state of the art; for example, Hermann and Das [28] improve the frame selection phase by augmenting the feature set with a word2vec-based description of the trigger word's context.
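The staged decomposition can be written down as three stubbed functions (the frame name, feature arguments, and decision rules below are invented placeholders; real systems train a classifier per stage on parse-tree features):

```python
# Schematic of the classic three-stage SRL pipeline; all stage bodies
# are stubs standing in for trained classifiers.
def select_frame(predicate, features):
    """Stage 1: pick the frame / sense evoked by the trigger word."""
    return "Commerce_buy" if predicate == "bought" else "Unknown"

def detect_role_spans(constituents, features):
    """Stage 2: decide which constituents fill some role at all."""
    return [c for c in constituents if c["is_np"]]

def identify_roles(frame, spans, features):
    """Stage 3: assign a concrete role label to each retained span."""
    return {s["text"]: ("Buyer" if s["before_verb"] else "Goods")
            for s in spans}

constituents = [
    {"text": "Alice", "is_np": True, "before_verb": True},
    {"text": "bought", "is_np": False, "before_verb": False},
    {"text": "a car", "is_np": True, "before_verb": False},
]
frame = select_frame("bought", None)
spans = detect_role_spans(constituents, None)
print(frame, identify_roles(frame, spans, None))
# Commerce_buy {'Alice': 'Buyer', 'a car': 'Goods'}
```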

The best insight into SRL is offered by various challenges [29, 30, 31]. More recently, methods have been proposed that perform sequence labeling directly [32, 33] and avoid the need for explicit deep parsing by using structured learning. Additional tricks can be employed outside the core learning method, for example using text rewriting to increase the training set size [34].
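For illustration, casting SRL as sequence labeling amounts to emitting one BIO-style label per token, here with PropBank-flavored roles (the example labeling is hand-made):

```python
# Hand-made example of SRL as BIO sequence labeling with PropBank-style
# roles (ARG0 = agent, ARG1 = patient, ARGM-TMP = temporal modifier).
tokens = ["Alice",  "bought", "a",      "car",    "yesterday"]
labels = ["B-ARG0", "B-V",    "B-ARG1", "I-ARG1", "B-ARGM-TMP"]

for tok, lab in zip(tokens, labels):
    print(f"{tok:10s} {lab}")
```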