Document (#40123)

Finegan-Dollak, C.
Radev, D.R.
Sentence simplification, compression, and disaggregation for summarization of sophisticated documents
Journal of the Association for Information Science and Technology. 67(2016) no.10, pp. 2437-2453
Sophisticated documents like legal cases and biomedical articles can contain unusually long sentences. Extractive summarizers can select such sentences, potentially adding hundreds of unnecessary words to the summary, or exclude them and lose important content. Sentence simplification or compression seems on the surface to be a promising solution. However, compression removes words before the selection algorithm can use them, and simplification generates sentences that may be ambiguous in an extractive summary. We therefore compare the performance of an extractive summarizer selecting from the sentences of the original document with that of the summarizer selecting from sentences shortened in three ways: simplification, compression, and disaggregation, which splits one sentence into several according to rules designed to keep all meaning. We find that on legal cases and biomedical articles, these shortening methods generate ungrammatical output. Human evaluators performed an extrinsic evaluation consisting of comprehension questions about the summaries. Evaluators given compressed, simplified, or disaggregated versions of the summaries answered fewer questions correctly than did those given summaries with unaltered sentences. Error analysis suggests two causes: altered sentences sometimes interact with the sentence selection algorithm, and alterations to sentences sometimes obscure information in the summary. We discuss future work to alleviate these problems.
Automatic abstracting

Similar documents (author)

  1. Otterbacher, J.; Radev, D.: Exploring fact-focused relevance and novelty detection (2008) 4.57
    4.5682592 = sum of:
      4.5682592 = weight(author_txt:radev in 3210) [ClassicSimilarity], result of:
        4.5682592 = fieldWeight in 3210, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.1365185 = idf(docFreq=12, maxDocs=44421)
          0.5 = fieldNorm(doc=3210)
  2. Radev, D.R.; Libner, K.; Fan, W.: Getting answers to natural language questions on the Web (2002) 3.43
    3.4261944 = sum of:
      3.4261944 = weight(author_txt:radev in 204) [ClassicSimilarity], result of:
        3.4261944 = fieldWeight in 204, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.1365185 = idf(docFreq=12, maxDocs=44421)
          0.375 = fieldNorm(doc=204)
  3. Otterbacher, J.; Radev, D.; Kareem, O.: Hierarchical summarization for delivering information to mobile devices (2008) 3.43
    3.4261944 = sum of:
      3.4261944 = weight(author_txt:radev in 3071) [ClassicSimilarity], result of:
        3.4261944 = fieldWeight in 3071, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.1365185 = idf(docFreq=12, maxDocs=44421)
          0.375 = fieldNorm(doc=3071)
  4. Otterbacher, J.; Erkan, G.; Radev, D.R.: Biased LexRank : passage retrieval using random walks with question-based priors (2009) 3.43
    3.4261944 = sum of:
      3.4261944 = weight(author_txt:radev in 3450) [ClassicSimilarity], result of:
        3.4261944 = fieldWeight in 3450, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.1365185 = idf(docFreq=12, maxDocs=44421)
          0.375 = fieldNorm(doc=3450)
  5. Lam, W.; Chan, K.; Radev, D.; Saggion, H.; Teufel, S.: Context-based generic cross-lingual retrieval of documents and automated summaries (2005) 2.86
    2.8551621 = sum of:
      2.8551621 = weight(author_txt:radev in 2965) [ClassicSimilarity], result of:
        2.8551621 = fieldWeight in 2965, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.1365185 = idf(docFreq=12, maxDocs=44421)
          0.3125 = fieldNorm(doc=2965)
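
The author-match scores above can be checked by hand. The following is a minimal sketch, assuming the classic Lucene TF-IDF formulas tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which agree with the values printed in the trees. The function names are illustrative and are not part of any Lucene API; the fieldNorm values are taken as given from the trees rather than derived.

```python
import math


def tf(freq: float) -> float:
    # Assumed classic TF-IDF term-frequency factor: square root of the raw count.
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    # Assumed classic inverse document frequency: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight is the product of tf, idf and the stored field norm.
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Entry 1 (doc 3210): freq=1.0, docFreq=12, maxDocs=44421, fieldNorm=0.5
print(round(field_weight(1.0, 12, 44421, 0.5), 4))     # 4.5683
# Entry 5 (doc 2965): same term statistics, fieldNorm=0.3125
print(round(field_weight(1.0, 12, 44421, 0.3125), 4))  # 2.8552
```

Because every entry matches the same single term with freq=1.0, the ranking in this list is determined by fieldNorm alone.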

Similar documents (content)

  1. Ling, X.; Jiang, J.; He, X.; Mei, Q.; Zhai, C.; Schatz, B.: Generating gene summaries from biomedical literature : a study of semi-structured summarization (2007) 0.27
    0.2705969 = sum of:
      0.2705969 = product of:
        0.96641743 = sum of:
          0.01709279 = weight(abstract_txt:given in 1946) [ClassicSimilarity], result of:
            0.01709279 = score(doc=1946,freq=1.0), product of:
              0.05821927 = queryWeight, product of:
                1.0098257 = boost
                4.6974936 = idf(docFreq=1100, maxDocs=44421)
                0.012273096 = queryNorm
              0.29359335 = fieldWeight in 1946, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6974936 = idf(docFreq=1100, maxDocs=44421)
                0.0625 = fieldNorm(doc=1946)
          0.018007196 = weight(abstract_txt:articles in 1946) [ClassicSimilarity], result of:
            0.018007196 = score(doc=1946,freq=1.0), product of:
              0.06027754 = queryWeight, product of:
                1.0275213 = boost
                4.7798095 = idf(docFreq=1013, maxDocs=44421)
                0.012273096 = queryNorm
              0.2987381 = fieldWeight in 1946, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7798095 = idf(docFreq=1013, maxDocs=44421)
                0.0625 = fieldNorm(doc=1946)
          0.13346739 = weight(abstract_txt:biomedical in 1946) [ClassicSimilarity], result of:
            0.13346739 = score(doc=1946,freq=5.0), product of:
              0.13400415 = queryWeight, product of:
                1.5320473 = boost
                7.1267567 = idf(docFreq=96, maxDocs=44421)
                0.012273096 = queryNorm
              0.9959945 = fieldWeight in 1946, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.1267567 = idf(docFreq=96, maxDocs=44421)
                0.0625 = fieldNorm(doc=1946)
          0.13400492 = weight(abstract_txt:summary in 1946) [ClassicSimilarity], result of:
            0.13400492 = score(doc=1946,freq=4.0), product of:
              0.16568469 = queryWeight, product of:
                2.086411 = boost
                6.470359 = idf(docFreq=186, maxDocs=44421)
                0.012273096 = queryNorm
              0.80879486 = fieldWeight in 1946, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.470359 = idf(docFreq=186, maxDocs=44421)
                0.0625 = fieldNorm(doc=1946)
          0.12194773 = weight(abstract_txt:summaries in 1946) [ClassicSimilarity], result of:
            0.12194773 = score(doc=1946,freq=2.0), product of:
              0.19603232 = queryWeight, product of:
                2.26946 = boost
                7.0380287 = idf(docFreq=105, maxDocs=44421)
                0.012273096 = queryNorm
              0.62207973 = fieldWeight in 1946, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.0380287 = idf(docFreq=105, maxDocs=44421)
                0.0625 = fieldNorm(doc=1946)
          0.14987323 = weight(abstract_txt:sentence in 1946) [ClassicSimilarity], result of:
            0.14987323 = score(doc=1946,freq=2.0), product of:
              0.24755639 = queryWeight, product of:
                2.9448633 = boost
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.012273096 = queryNorm
              0.60541046 = fieldWeight in 1946, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.0625 = fieldNorm(doc=1946)
          0.39202416 = weight(abstract_txt:sentences in 1946) [ClassicSimilarity], result of:
            0.39202416 = score(doc=1946,freq=3.0), product of:
              0.5172648 = queryWeight, product of:
                6.020042 = boost
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.012273096 = queryNorm
              0.7578791 = fieldWeight in 1946, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.0625 = fieldNorm(doc=1946)
        0.28 = coord(7/25)
  2. Bando, L.L.; Scholer, F.; Turpin, A.: Query-biased summary generation assisted by query expansion : temporality (2015) 0.26
    0.2649906 = sum of:
      0.2649906 = product of:
        1.1041275 = sum of:
          0.043878887 = weight(abstract_txt:words in 2820) [ClassicSimilarity], result of:
            0.043878887 = score(doc=2820,freq=3.0), product of:
              0.07568122 = queryWeight, product of:
                1.1513493 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.012273096 = queryNorm
              0.5797857 = fieldWeight in 2820, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.0625 = fieldNorm(doc=2820)
          0.036219448 = weight(abstract_txt:selection in 2820) [ClassicSimilarity], result of:
            0.036219448 = score(doc=2820,freq=2.0), product of:
              0.07623294 = queryWeight, product of:
                1.1555384 = boost
                5.375318 = idf(docFreq=558, maxDocs=44421)
                0.012273096 = queryNorm
              0.47511548 = fieldWeight in 2820, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.375318 = idf(docFreq=558, maxDocs=44421)
                0.0625 = fieldNorm(doc=2820)
          0.09475578 = weight(abstract_txt:summary in 2820) [ClassicSimilarity], result of:
            0.09475578 = score(doc=2820,freq=2.0), product of:
              0.16568469 = queryWeight, product of:
                2.086411 = boost
                6.470359 = idf(docFreq=186, maxDocs=44421)
                0.012273096 = queryNorm
              0.5719043 = fieldWeight in 2820, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.470359 = idf(docFreq=186, maxDocs=44421)
                0.0625 = fieldNorm(doc=2820)
          0.21121968 = weight(abstract_txt:summaries in 2820) [ClassicSimilarity], result of:
            0.21121968 = score(doc=2820,freq=6.0), product of:
              0.19603232 = queryWeight, product of:
                2.26946 = boost
                7.0380287 = idf(docFreq=105, maxDocs=44421)
                0.012273096 = queryNorm
              1.0774738 = fieldWeight in 2820, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                7.0380287 = idf(docFreq=105, maxDocs=44421)
                0.0625 = fieldNorm(doc=2820)
          0.21195275 = weight(abstract_txt:sentence in 2820) [ClassicSimilarity], result of:
            0.21195275 = score(doc=2820,freq=4.0), product of:
              0.24755639 = queryWeight, product of:
                2.9448633 = boost
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.012273096 = queryNorm
              0.85617965 = fieldWeight in 2820, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.0625 = fieldNorm(doc=2820)
          0.506101 = weight(abstract_txt:sentences in 2820) [ClassicSimilarity], result of:
            0.506101 = score(doc=2820,freq=5.0), product of:
              0.5172648 = queryWeight, product of:
                6.020042 = boost
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.012273096 = queryNorm
              0.9784177 = fieldWeight in 2820, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.0625 = fieldNorm(doc=2820)
        0.24 = coord(6/25)
  3. Aker, A.; Gaizauskas, R.: Generating descriptive multi-document summaries of geo-located entities using entity type models (2015) 0.21
    0.20722914 = sum of:
      0.20722914 = product of:
        0.86345476 = sum of:
          0.035826962 = weight(abstract_txt:words in 2726) [ClassicSimilarity], result of:
            0.035826962 = score(doc=2726,freq=2.0), product of:
              0.07568122 = queryWeight, product of:
                1.1513493 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.012273096 = queryNorm
              0.47339305 = fieldWeight in 2726, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.0625 = fieldNorm(doc=2726)
          0.22360511 = weight(abstract_txt:summarizer in 2726) [ClassicSimilarity], result of:
            0.22360511 = score(doc=2726,freq=3.0), product of:
              0.22411564 = queryWeight, product of:
                1.9812951 = boost
                9.216561 = idf(docFreq=11, maxDocs=44421)
                0.012273096 = queryNorm
              0.997722 = fieldWeight in 2726, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.216561 = idf(docFreq=11, maxDocs=44421)
                0.0625 = fieldNorm(doc=2726)
          0.06700246 = weight(abstract_txt:summary in 2726) [ClassicSimilarity], result of:
            0.06700246 = score(doc=2726,freq=1.0), product of:
              0.16568469 = queryWeight, product of:
                2.086411 = boost
                6.470359 = idf(docFreq=186, maxDocs=44421)
                0.012273096 = queryNorm
              0.40439743 = fieldWeight in 2726, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.470359 = idf(docFreq=186, maxDocs=44421)
                0.0625 = fieldNorm(doc=2726)
          0.14935485 = weight(abstract_txt:summaries in 2726) [ClassicSimilarity], result of:
            0.14935485 = score(doc=2726,freq=3.0), product of:
              0.19603232 = queryWeight, product of:
                2.26946 = boost
                7.0380287 = idf(docFreq=105, maxDocs=44421)
                0.012273096 = queryNorm
              0.7618889 = fieldWeight in 2726, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.0380287 = idf(docFreq=105, maxDocs=44421)
                0.0625 = fieldNorm(doc=2726)
          0.10597637 = weight(abstract_txt:sentence in 2726) [ClassicSimilarity], result of:
            0.10597637 = score(doc=2726,freq=1.0), product of:
              0.24755639 = queryWeight, product of:
                2.9448633 = boost
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.012273096 = queryNorm
              0.42808983 = fieldWeight in 2726, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.0625 = fieldNorm(doc=2726)
          0.281689 = weight(abstract_txt:extractive in 2726) [ClassicSimilarity], result of:
            0.281689 = score(doc=2726,freq=2.0), product of:
              0.3425509 = queryWeight, product of:
                3.0 = boost
                9.303573 = idf(docFreq=10, maxDocs=44421)
                0.012273096 = queryNorm
              0.8223274 = fieldWeight in 2726, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.303573 = idf(docFreq=10, maxDocs=44421)
                0.0625 = fieldNorm(doc=2726)
        0.24 = coord(6/25)
  4. Ye, S.; Chua, T.-S.; Kan, M.-Y.; Qiu, L.: Document concept lattice for text understanding and summarization (2007) 0.19
    0.18863702 = sum of:
      0.18863702 = product of:
        0.7859876 = sum of:
          0.01709279 = weight(abstract_txt:given in 1941) [ClassicSimilarity], result of:
            0.01709279 = score(doc=1941,freq=1.0), product of:
              0.05821927 = queryWeight, product of:
                1.0098257 = boost
                4.6974936 = idf(docFreq=1100, maxDocs=44421)
                0.012273096 = queryNorm
              0.29359335 = fieldWeight in 1941, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6974936 = idf(docFreq=1100, maxDocs=44421)
                0.0625 = fieldNorm(doc=1941)
          0.025611019 = weight(abstract_txt:selection in 1941) [ClassicSimilarity], result of:
            0.025611019 = score(doc=1941,freq=1.0), product of:
              0.07623294 = queryWeight, product of:
                1.1555384 = boost
                5.375318 = idf(docFreq=558, maxDocs=44421)
                0.012273096 = queryNorm
              0.33595738 = fieldWeight in 1941, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.375318 = idf(docFreq=558, maxDocs=44421)
                0.0625 = fieldNorm(doc=1941)
          0.04546312 = weight(abstract_txt:selecting in 1941) [ClassicSimilarity], result of:
            0.04546312 = score(doc=1941,freq=1.0), product of:
              0.11176288 = queryWeight, product of:
                1.3991423 = boost
                6.5085106 = idf(docFreq=179, maxDocs=44421)
                0.012273096 = queryNorm
              0.4067819 = fieldWeight in 1941, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.5085106 = idf(docFreq=179, maxDocs=44421)
                0.0625 = fieldNorm(doc=1941)
          0.12909847 = weight(abstract_txt:summarizer in 1941) [ClassicSimilarity], result of:
            0.12909847 = score(doc=1941,freq=1.0), product of:
              0.22411564 = queryWeight, product of:
                1.9812951 = boost
                9.216561 = idf(docFreq=11, maxDocs=44421)
                0.012273096 = queryNorm
              0.5760351 = fieldWeight in 1941, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.216561 = idf(docFreq=11, maxDocs=44421)
                0.0625 = fieldNorm(doc=1941)
          0.11605167 = weight(abstract_txt:summary in 1941) [ClassicSimilarity], result of:
            0.11605167 = score(doc=1941,freq=3.0), product of:
              0.16568469 = queryWeight, product of:
                2.086411 = boost
                6.470359 = idf(docFreq=186, maxDocs=44421)
                0.012273096 = queryNorm
              0.7004369 = fieldWeight in 1941, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.470359 = idf(docFreq=186, maxDocs=44421)
                0.0625 = fieldNorm(doc=1941)
          0.4526705 = weight(abstract_txt:sentences in 1941) [ClassicSimilarity], result of:
            0.4526705 = score(doc=1941,freq=4.0), product of:
              0.5172648 = queryWeight, product of:
                6.020042 = boost
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.012273096 = queryNorm
              0.8751234 = fieldWeight in 1941, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.0625 = fieldNorm(doc=1941)
        0.24 = coord(6/25)
  5. Vanderwende, L.; Suzuki, H.; Brockett, J.M.; Nenkova, A.: Beyond SumBasic : task-focused summarization with sentence simplification and lexical expansion (2007) 0.18
    0.18099347 = sum of:
      0.18099347 = product of:
        0.7541395 = sum of:
          0.025333488 = weight(abstract_txt:words in 1948) [ClassicSimilarity], result of:
            0.025333488 = score(doc=1948,freq=1.0), product of:
              0.07568122 = queryWeight, product of:
                1.1513493 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.012273096 = queryNorm
              0.33473945 = fieldWeight in 1948, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.0625 = fieldNorm(doc=1948)
          0.06700246 = weight(abstract_txt:summary in 1948) [ClassicSimilarity], result of:
            0.06700246 = score(doc=1948,freq=1.0), product of:
              0.16568469 = queryWeight, product of:
                2.086411 = boost
                6.470359 = idf(docFreq=186, maxDocs=44421)
                0.012273096 = queryNorm
              0.40439743 = fieldWeight in 1948, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.470359 = idf(docFreq=186, maxDocs=44421)
                0.0625 = fieldNorm(doc=1948)
          0.14935485 = weight(abstract_txt:summaries in 1948) [ClassicSimilarity], result of:
            0.14935485 = score(doc=1948,freq=3.0), product of:
              0.19603232 = queryWeight, product of:
                2.26946 = boost
                7.0380287 = idf(docFreq=105, maxDocs=44421)
                0.012273096 = queryNorm
              0.7618889 = fieldWeight in 1948, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.0380287 = idf(docFreq=105, maxDocs=44421)
                0.0625 = fieldNorm(doc=1948)
          0.10597637 = weight(abstract_txt:sentence in 1948) [ClassicSimilarity], result of:
            0.10597637 = score(doc=1948,freq=1.0), product of:
              0.24755639 = queryWeight, product of:
                2.9448633 = boost
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.012273096 = queryNorm
              0.42808983 = fieldWeight in 1948, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.849437 = idf(docFreq=127, maxDocs=44421)
                0.0625 = fieldNorm(doc=1948)
          0.1991842 = weight(abstract_txt:extractive in 1948) [ClassicSimilarity], result of:
            0.1991842 = score(doc=1948,freq=1.0), product of:
              0.3425509 = queryWeight, product of:
                3.0 = boost
                9.303573 = idf(docFreq=10, maxDocs=44421)
                0.012273096 = queryNorm
              0.5814733 = fieldWeight in 1948, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.303573 = idf(docFreq=10, maxDocs=44421)
                0.0625 = fieldNorm(doc=1948)
          0.20728816 = weight(abstract_txt:simplification in 1948) [ClassicSimilarity], result of:
            0.20728816 = score(doc=1948,freq=1.0), product of:
              0.38718432 = queryWeight, product of:
                3.6828747 = boost
                8.565973 = idf(docFreq=22, maxDocs=44421)
                0.012273096 = queryNorm
              0.53537333 = fieldWeight in 1948, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.565973 = idf(docFreq=22, maxDocs=44421)
                0.0625 = fieldNorm(doc=1948)
        0.24 = coord(6/25)
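
The content-based scores combine several term scores and a coordination factor. The following is a minimal sketch that recomputes the score of entry 1 (doc 1946) from the numbers in its tree, assuming the same classic formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))). The helper names are illustrative, not Lucene API calls; boost, queryNorm and fieldNorm are copied from the tree as given.

```python
import math

MAX_DOCS = 44421
QUERY_NORM = 0.012273096   # copied from the explain output
FIELD_NORM = 0.0625        # fieldNorm(doc=1946), copied from the explain output


def idf(doc_freq: int, max_docs: int) -> float:
    # Assumed classic inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, boost: float) -> float:
    # score = queryWeight * fieldWeight
    term_idf = idf(doc_freq, MAX_DOCS)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * FIELD_NORM
    return query_weight * field_weight


def document_score(term_scores, matched: int, total_clauses: int) -> float:
    # The summed term scores are scaled by coord(matched/total_clauses).
    return sum(term_scores) * (matched / total_clauses)


# (freq, docFreq, boost) for the seven query terms matched in doc 1946
terms_1946 = [
    (1.0, 1100, 1.0098257),  # given
    (1.0, 1013, 1.0275213),  # articles
    (5.0, 96, 1.5320473),    # biomedical
    (4.0, 186, 2.086411),    # summary
    (2.0, 105, 2.26946),     # summaries
    (2.0, 127, 2.9448633),   # sentence
    (3.0, 109, 6.020042),    # sentences
]

scores = [term_score(f, df, b) for f, df, b in terms_1946]
total = document_score(scores, matched=7, total_clauses=25)
print(round(total, 2))  # 0.27
```

The coord factor explains why a document matching 7 of the 25 query terms (0.28) can outrank one with a larger raw sum that matches only 6 (0.24), as happens between entries 1 and 2.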