
5.2 Data Processing Pipeline

5.2.7 Summarization

In response to the user's manipulation of the controls, DiversiNews shows not only an appropriately ranked list of articles, but also a focused summary of their content.

While automatic multi-document summarization is a reasonably well-researched topic, there are some context-specific constraints under which the summarizer in DiversiNews must operate:

• The algorithm needs to be able to provide focused summaries, giving priority to some factoids, entities, sentences or documents over others. This way, we can generate summaries that reflect the user's current interest.


• Automatic summarization is not an easy problem, and solutions may take a while to compute. In our case, the user interface needs to be responsive, so the summaries need to be generated quickly. On the other hand, there is ample time for preprocessing, which the summarization algorithm may exploit to generate partial results in advance.

Based on the above constraints and our interest in the utility of semantic representations of text, we developed a multi-document summarization algorithm called FrameSum [8]. FrameSum performs extractive summarization, meaning that it selects a subset of sentences and presents them as the summary in unchanged form.

The algorithm is built around a key assumption: in the multi-document summarization setting, a strong signal for the importance of a piece of information is that piece being repeatedly reported by multiple sources. In practice, for newswire, we observe this to be a sound assumption with 10–20 input documents (depending on the task) or more.

At the same time, we have to consider the standard tradeoff of diversity vs. relevance: an algorithm that only considers relevance will end up constructing the summary out of sentences that convey almost exactly the same, albeit very relevant, information. We combat this by requiring that no two sentences in the final summary be too similar to each other.

In the following subsections, we present the algorithm that formalizes these two considerations.

5.2.7.1 Preprocessing and Initial Content Scoring

We represent all input data in the semantic space, using the SDP algorithm presented in Section 3.2.

This representation loses a lot of detail, but allows for similarity comparisons between sentences and even sentence fragments that go beyond simple string matching. In fact, the basic unit of information in FrameSum is a frame, not a whole sentence.

In this first stage of the algorithm, each frame is scored separately. The score of a frame T is based primarily on the position of its sentence in the document, a well-established heuristic/feature [114]:

\[
\mathrm{score}(T) = \frac{\alpha \cdot \mathrm{score}_0(T) + \beta^{\,\mathrm{pos}(p(T))}}{\sum_{p'} \mathrm{sim}(p(T), p')}
\]

where:

• score₀(T) is a prior belief about the importance of a frame or its originating sentence. This allows FrameSum to be used in a "guided summary" framework where a certain aspect of the summary is emphasized: for frames or sentences relating to that aspect, we can boost score₀(T). If FrameSum is used on its own, or if no emphasis is desired, score₀(·) ≡ 1 can be used for a uniform prior and hence a "standard" multi-document summary.

• p(T) is the sentence containing T.

• pos(p(T)) is the zero-based index of the sentence within the document.

• sim(·, ·) is the similarity function between sentences, defined as the Jaccard similarity coefficient⁴ of the sets of character 4-grams of the two sentences; it ranges from 0 (no similarity) to 1 (identity).

• α = 0.4 and β = 0.8 are constants determined empirically by observing the behavior on several tens of collections of documents.

In the numerator, the exponentially decaying factor β^pos(p(T)) quantifies the intuition that, especially in news reporting, the important facts tend to be given early on. α is a smoothing factor, corresponding to the firmness of our belief in the score₀(·) prior.

The denominator is a normalization, particularly affecting sentences that not only have similar content but are almost completely identical. This tends to happen frequently with journalistic texts as the content is often partially copied from a press release or a news syndication network’s article.
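To make the scoring stage concrete, the following is a minimal Python sketch under the definitions above. The Sentence class, the frame_score interface, and the uniform score0 default are our own illustration rather than the published implementation; only the constants α = 0.4 and β = 0.8 and the character-4-gram Jaccard similarity follow the text.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Sentence:
        text: str
        index: int  # zero-based position of the sentence within its document

    def char_ngrams(text, n=4):
        # Set of character 4-grams used by the similarity function.
        return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

    def sim(p, q):
        # Jaccard similarity of the character-4-gram sets of two sentences.
        a, b = char_ngrams(p.text), char_ngrams(q.text)
        return len(a & b) / len(a | b) if a | b else 0.0

    ALPHA, BETA = 0.4, 0.8  # empirical constants from the text

    def frame_score(sentence, all_sentences, score0=1.0):
        # score(T) = (alpha * score0(T) + beta ** pos(p(T))) / sum_p' sim(p(T), p')
        numerator = ALPHA * score0 + BETA ** sentence.index
        denominator = sum(sim(sentence, q) for q in all_sentences)
        return numerator / denominator

Note that as long as the sentence itself is among all_sentences, the denominator is at least 1 (a sentence is identical to itself), so no division by zero can occur.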

As pointed out previously in Section 5.2.6, in DiversiNews the score prior score0(·) is based on article ranking for a specific user control configuration. Due to a project-imposed schedule, we only evaluated a simple step-function weighting scheme: frames from the top k ranked articles (k = 20) are weighted with 1, the remaining frames with 0.
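For illustration, this step-function prior is a one-liner; the function name and the zero-based rank argument are our own:

    def step_prior(rank, k=20):
        # score0 prior used in DiversiNews: frames from the top-k ranked
        # articles (rank < k, zero-based) are weighted 1, the rest 0.
        return 1.0 if rank < k else 0.0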

5.2.7.2 Frame Graph

In the second stage, the frames are connected into a directed weighted graph, with weights representing the information flow between frames. Intuitively, a large flow from T to T′ means that T conveys a lot of the information in T′. Formally, the information flow w between two frames T and T′ is defined as follows:

1:  w ← 0                                          ▷ initialize flow
2:  for all attributes a ∈ {subject, verb, object, instrument, location, time} do
3:      X ← value of a in T
4:      X′ ← value of a in T′
5:      if X = X′ then
6:          w ← w + 1
7:      else if X is a hypernym of X′ then
8:          w ← w + 0.7                            ▷ X strongly infers X′
9:      else if X is a hyponym of X′ then
10:         w ← w + 0.25                           ▷ X weakly infers X′
11:     end if
12: end for
13: return w

⁴The Jaccard similarity coefficient of two sets A and B is defined as |A ∩ B| / |A ∪ B|.
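A direct transcription of the pseudocode into Python might look as follows. The representation of a frame as a dictionary of attribute values and the is_hypernym predicate (in practice backed by a taxonomy such as WordNet) are assumptions of this sketch:

    ATTRIBUTES = ("subject", "verb", "object", "instrument", "location", "time")

    def information_flow(t, t_prime, is_hypernym):
        # Information flow w from frame t to frame t_prime; both frames are
        # dicts mapping attribute names to values, and is_hypernym(x, y) is
        # an assumed predicate that holds when x is a hypernym of y.
        w = 0.0
        for a in ATTRIBUTES:
            x, x_prime = t.get(a), t_prime.get(a)
            if x is None or x_prime is None:
                continue  # attribute missing in one frame (our own assumption)
            if x == x_prime:
                w += 1.0
            elif is_hypernym(x, x_prime):
                w += 0.7   # X strongly infers X'
            elif is_hypernym(x_prime, x):
                w += 0.25  # X weakly infers X' (hyponym direction)
        return w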

5.2.7.3 Greedy Optimization

In the final stage, the most relevant frames are selected greedily, and the sentences containing them are promoted into the summary. To determine the relevance of a frame, we first define its information content, which is initialized to

\[
\mathrm{IC}(T) = \mathrm{score}(T) + \sum_{T'} w_{T \to T'} \cdot \mathrm{score}(T')
\]

where IC(·) is the information content, score(·) was defined above, and w_{T→T′} is the information flow as defined in the previous subsection. The following two steps are then repeated greedily until the desired length of the summary is reached:

1. Promote the shortest sentence containing the frame T with the highest IC(T) into the summary. Ignore sentences that were originally part of quoted speech or that begin with a linking word ("however", "also", …). If no suitable sentences are found, skip T.

2. For every frame T contained in this sentence, decrease IC(T′) for all remaining input frames T′ by a factor of 1 − w_{T→T′}.

The fact that IC(T) is based on the score (which in turn is based on frequency) ensures that the summary contains relevant information. The second step of the greedy iteration above ensures that the information contained in the summary is not redundant. Intuitively, the performed decrease in IC(T′) can be understood as: "if T is told, then w_{T→T′} of T′ is already told as well, so its information content decreases by that much."
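A compact sketch of the greedy stage follows, assuming precomputed scores and pairwise flows. The sentence_of and is_suitable helpers (shortest containing sentence; quoted-speech and linking-word filter) are hypothetical stand-ins for the checks described above, and the clamping of flows to [0, 1] is our own assumption, since the summed flow between two frames can exceed 1:

    def greedy_summary(frames, scores, flow, sentence_of, is_suitable, max_sentences):
        # frames      -- list of frame identifiers
        # scores      -- dict: T -> score(T) from the first stage
        # flow        -- dict: (T, T2) -> information flow w_{T->T2}, 0 if absent
        # sentence_of -- assumed helper: shortest sentence containing a frame,
        #                exposing a .frames attribute (frames it contains)
        # is_suitable -- assumed helper: rejects quoted speech and sentences
        #                beginning with a linking word
        frames = list(frames)
        # Initialize IC(T) = score(T) + sum_{T'} w_{T->T'} * score(T').
        ic = {t: scores[t] + sum(flow.get((t, t2), 0.0) * scores[t2]
                                 for t2 in frames if t2 != t)
              for t in frames}
        summary, remaining = [], set(frames)
        while remaining and len(summary) < max_sentences:
            best = max(remaining, key=ic.get)  # frame with the highest IC
            remaining.discard(best)
            sentence = sentence_of(best)
            if sentence is None or not is_suitable(sentence):
                continue  # no suitable sentence was found: skip T
            summary.append(sentence)
            # Step 2: for every frame T in the promoted sentence, decrease
            # IC(T') of every remaining frame T' by a factor of 1 - w_{T->T'}.
            for t in sentence.frames:
                for t2 in remaining:
                    ic[t2] *= 1.0 - min(flow.get((t, t2), 0.0), 1.0)
        return summary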

The order of the output sentences is determined by their positions in their respective original documents. In case of ties, higher-scoring sentences are placed first.
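Assuming each selected sentence records its original document position and its score (the attribute names here are illustrative), this ordering is a single sort with a tie-breaking key:

    # Order by position in the source document; break ties by descending score.
    summary.sort(key=lambda s: (s.doc_position, -s.score))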