Document (#39643)

Author
Giannella, C.
Title
¬An improved algorithm for unsupervised decomposition of a multi-author document
Source
Journal of the Association for Information Science and Technology. 67(2016) no.2, pp.400-411
Year
2016
Abstract
This article addresses the problem of unsupervised decomposition of a multi-author text document: identifying the sentences written by each author assuming the number of authors is unknown. An approach, BayesAD, is developed for solving this problem: apply a Bayesian segmentation algorithm, followed by a segment clustering algorithm. Results are presented from an empirical comparison between BayesAD and AK, a modified version of an approach published by Akiva and Koppel in 2013. BayesAD exhibited greater accuracy than AK in all experiments. However, BayesAD has a parameter that needs to be set and which had a nontrivial impact on accuracy. Developing an effective method for eliminating this need would be a fruitful direction for future work. When controlling for topic, the accuracy levels of BayesAD and AK were, in all but one case, worse than a baseline approach wherein one author was assumed to write all sentences in the input text document. Hence, room for improved solutions exists.
Content
Cf.: http://onlinelibrary.wiley.com/doi/10.1002/asi.23375/abstract.

Similar documents (content)

  1. Aldebei, K.; He, X.; Jia, W.; Yeh, W.: SUDMAD: Sequential and unsupervised decomposition of a multi-author document based on a hidden markov model (2018) 0.41
    0.40519354 = sum of:
      0.40519354 = product of:
        1.0129838 = sum of:
          0.02407319 = weight(abstract_txt:than in 37) [ClassicSimilarity], result of:
            0.02407319 = score(doc=37,freq=2.0), product of:
              0.07996048 = queryWeight, product of:
                1.0256742 = boost
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.020026762 = queryNorm
              0.3010636 = fieldWeight in 37, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.0546875 = fieldNorm(doc=37)
          0.07040687 = weight(abstract_txt:room in 37) [ClassicSimilarity], result of:
            0.07040687 = score(doc=37,freq=1.0), product of:
              0.16352957 = queryWeight, product of:
                1.0371819 = boost
                7.872826 = idf(docFreq=45, maxDocs=44421)
                0.020026762 = queryNorm
              0.43054518 = fieldWeight in 37, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.872826 = idf(docFreq=45, maxDocs=44421)
                0.0546875 = fieldNorm(doc=37)
          0.045330193 = weight(abstract_txt:approach in 37) [ClassicSimilarity], result of:
            0.045330193 = score(doc=37,freq=4.0), product of:
              0.11078095 = queryWeight, product of:
                1.4785973 = boost
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.020026762 = queryNorm
              0.40918761 = fieldWeight in 37, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.0546875 = fieldNorm(doc=37)
          0.06046226 = weight(abstract_txt:multi in 37) [ClassicSimilarity], result of:
            0.06046226 = score(doc=37,freq=1.0), product of:
              0.18614548 = queryWeight, product of:
                1.5649412 = boost
                5.9394164 = idf(docFreq=317, maxDocs=44421)
                0.020026762 = queryNorm
              0.32481185 = fieldWeight in 37, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9394164 = idf(docFreq=317, maxDocs=44421)
                0.0546875 = fieldNorm(doc=37)
          0.076640956 = weight(abstract_txt:document in 37) [ClassicSimilarity], result of:
            0.076640956 = score(doc=37,freq=5.0), product of:
              0.14595221 = queryWeight, product of:
                1.6971598 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.020026762 = queryNorm
              0.52510995 = fieldWeight in 37, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0546875 = fieldNorm(doc=37)
          0.1715108 = weight(abstract_txt:sentences in 37) [ClassicSimilarity], result of:
            0.1715108 = score(doc=37,freq=3.0), product of:
              0.25863275 = queryWeight, product of:
                1.844648 = boost
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.020026762 = queryNorm
              0.6631442 = fieldWeight in 37, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.000987 = idf(docFreq=109, maxDocs=44421)
                0.0546875 = fieldNorm(doc=37)
          0.18084426 = weight(abstract_txt:unsupervised in 37) [ClassicSimilarity], result of:
            0.18084426 = score(doc=37,freq=2.0), product of:
              0.30670637 = queryWeight, product of:
                2.0087836 = boost
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.020026762 = queryNorm
              0.5896332 = fieldWeight in 37, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.0546875 = fieldNorm(doc=37)
          0.20253293 = weight(abstract_txt:decomposition in 37) [ClassicSimilarity], result of:
            0.20253293 = score(doc=37,freq=2.0), product of:
              0.33076286 = queryWeight, product of:
                2.086076 = boost
                7.917278 = idf(docFreq=43, maxDocs=44421)
                0.020026762 = queryNorm
              0.61232066 = fieldWeight in 37, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.917278 = idf(docFreq=43, maxDocs=44421)
                0.0546875 = fieldNorm(doc=37)
          0.08037386 = weight(abstract_txt:algorithm in 37) [ClassicSimilarity], result of:
            0.08037386 = score(doc=37,freq=1.0), product of:
              0.25761428 = queryWeight, product of:
                2.2547705 = boost
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.020026762 = queryNorm
              0.31199303 = fieldWeight in 37, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.0546875 = fieldNorm(doc=37)
          0.10080847 = weight(abstract_txt:author in 37) [ClassicSimilarity], result of:
            0.10080847 = score(doc=37,freq=2.0), product of:
              0.2617345 = queryWeight, product of:
                2.6243227 = boost
                4.980042 = idf(docFreq=829, maxDocs=44421)
                0.020026762 = queryNorm
              0.38515547 = fieldWeight in 37, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.980042 = idf(docFreq=829, maxDocs=44421)
                0.0546875 = fieldNorm(doc=37)
        0.4 = coord(10/25)
    
  2. Koppel, M.; Winter, Y.: Determining if two documents are written by the same author (2014) 0.15
    0.1476715 = sum of:
      0.1476715 = product of:
        0.7383575 = sum of:
          0.088647954 = weight(abstract_txt:problem in 2602) [ClassicSimilarity], result of:
            0.088647954 = score(doc=2602,freq=3.0), product of:
              0.10493371 = queryWeight, product of:
                1.1749767 = boost
                4.4593854 = idf(docFreq=1396, maxDocs=44421)
                0.020026762 = queryNorm
              0.8447996 = fieldWeight in 2602, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.4593854 = idf(docFreq=1396, maxDocs=44421)
                0.109375 = fieldNorm(doc=2602)
          0.06854976 = weight(abstract_txt:document in 2602) [ClassicSimilarity], result of:
            0.06854976 = score(doc=2602,freq=1.0), product of:
              0.14595221 = queryWeight, product of:
                1.6971598 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.020026762 = queryNorm
              0.46967265 = fieldWeight in 2602, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.109375 = fieldNorm(doc=2602)
          0.2557524 = weight(abstract_txt:unsupervised in 2602) [ClassicSimilarity], result of:
            0.2557524 = score(doc=2602,freq=1.0), product of:
              0.30670637 = queryWeight, product of:
                2.0087836 = boost
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.020026762 = queryNorm
              0.8338673 = fieldWeight in 2602, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.109375 = fieldNorm(doc=2602)
          0.18284264 = weight(abstract_txt:accuracy in 2602) [ClassicSimilarity], result of:
            0.18284264 = score(doc=2602,freq=1.0), product of:
              0.2807103 = queryWeight, product of:
                2.3536754 = boost
                5.9552646 = idf(docFreq=312, maxDocs=44421)
                0.020026762 = queryNorm
              0.65135705 = fieldWeight in 2602, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9552646 = idf(docFreq=312, maxDocs=44421)
                0.109375 = fieldNorm(doc=2602)
          0.1425647 = weight(abstract_txt:author in 2602) [ClassicSimilarity], result of:
            0.1425647 = score(doc=2602,freq=1.0), product of:
              0.2617345 = queryWeight, product of:
                2.6243227 = boost
                4.980042 = idf(docFreq=829, maxDocs=44421)
                0.020026762 = queryNorm
              0.5446921 = fieldWeight in 2602, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.980042 = idf(docFreq=829, maxDocs=44421)
                0.109375 = fieldNorm(doc=2602)
        0.2 = coord(5/25)
    
  3. Lochbaum, K.E.; Streeter, A.R.: Comparing and combining the effectiveness of latent semantic indexing and the ordinary vector space model for information retrieval (1989) 0.12
    0.122277014 = sum of:
      0.122277014 = product of:
        0.61138505 = sum of:
          0.09195595 = weight(abstract_txt:write in 4458) [ClassicSimilarity], result of:
            0.09195595 = score(doc=4458,freq=1.0), product of:
              0.15404166 = queryWeight, product of:
                1.0066439 = boost
                7.6410246 = idf(docFreq=57, maxDocs=44421)
                0.020026762 = queryNorm
              0.59695506 = fieldWeight in 4458, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6410246 = idf(docFreq=57, maxDocs=44421)
                0.078125 = fieldNorm(doc=4458)
          0.024317596 = weight(abstract_txt:than in 4458) [ClassicSimilarity], result of:
            0.024317596 = score(doc=4458,freq=1.0), product of:
              0.07996048 = queryWeight, product of:
                1.0256742 = boost
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.020026762 = queryNorm
              0.30412018 = fieldWeight in 4458, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.078125 = fieldNorm(doc=4458)
          0.07517682 = weight(abstract_txt:improved in 4458) [ClassicSimilarity], result of:
            0.07517682 = score(doc=4458,freq=1.0), product of:
              0.16968793 = queryWeight, product of:
                1.4941604 = boost
                5.6707826 = idf(docFreq=415, maxDocs=44421)
                0.020026762 = queryNorm
              0.44302988 = fieldWeight in 4458, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6707826 = idf(docFreq=415, maxDocs=44421)
                0.078125 = fieldNorm(doc=4458)
          0.28933278 = weight(abstract_txt:decomposition in 4458) [ClassicSimilarity], result of:
            0.28933278 = score(doc=4458,freq=2.0), product of:
              0.33076286 = queryWeight, product of:
                2.086076 = boost
                7.917278 = idf(docFreq=43, maxDocs=44421)
                0.020026762 = queryNorm
              0.8747438 = fieldWeight in 4458, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.917278 = idf(docFreq=43, maxDocs=44421)
                0.078125 = fieldNorm(doc=4458)
          0.13060188 = weight(abstract_txt:accuracy in 4458) [ClassicSimilarity], result of:
            0.13060188 = score(doc=4458,freq=1.0), product of:
              0.2807103 = queryWeight, product of:
                2.3536754 = boost
                5.9552646 = idf(docFreq=312, maxDocs=44421)
                0.020026762 = queryNorm
              0.46525505 = fieldWeight in 4458, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9552646 = idf(docFreq=312, maxDocs=44421)
                0.078125 = fieldNorm(doc=4458)
        0.2 = coord(5/25)
    
  4. D'Angelo, C.A.; Giuffrida, C.; Abramo, G.: ¬A heuristic approach to author name disambiguation in bibliometrics databases for large-scale research assessments (2011) 0.12
    0.1220484 = sum of:
      0.1220484 = product of:
        0.508535 = sum of:
          0.024317596 = weight(abstract_txt:than in 190) [ClassicSimilarity], result of:
            0.024317596 = score(doc=190,freq=1.0), product of:
              0.07996048 = queryWeight, product of:
                1.0256742 = boost
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.020026762 = queryNorm
              0.30412018 = fieldWeight in 190, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.078125 = fieldNorm(doc=190)
          0.0365578 = weight(abstract_txt:problem in 190) [ClassicSimilarity], result of:
            0.0365578 = score(doc=190,freq=1.0), product of:
              0.10493371 = queryWeight, product of:
                1.1749767 = boost
                4.4593854 = idf(docFreq=1396, maxDocs=44421)
                0.020026762 = queryNorm
              0.34838948 = fieldWeight in 190, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4593854 = idf(docFreq=1396, maxDocs=44421)
                0.078125 = fieldNorm(doc=190)
          0.045790404 = weight(abstract_txt:approach in 190) [ClassicSimilarity], result of:
            0.045790404 = score(doc=190,freq=2.0), product of:
              0.11078095 = queryWeight, product of:
                1.4785973 = boost
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.020026762 = queryNorm
              0.41334188 = fieldWeight in 190, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.078125 = fieldNorm(doc=190)
          0.07517682 = weight(abstract_txt:improved in 190) [ClassicSimilarity], result of:
            0.07517682 = score(doc=190,freq=1.0), product of:
              0.16968793 = queryWeight, product of:
                1.4941604 = boost
                5.6707826 = idf(docFreq=415, maxDocs=44421)
                0.020026762 = queryNorm
              0.44302988 = fieldWeight in 190, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.6707826 = idf(docFreq=415, maxDocs=44421)
                0.078125 = fieldNorm(doc=190)
          0.18268031 = weight(abstract_txt:unsupervised in 190) [ClassicSimilarity], result of:
            0.18268031 = score(doc=190,freq=1.0), product of:
              0.30670637 = queryWeight, product of:
                2.0087836 = boost
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.020026762 = queryNorm
              0.59561956 = fieldWeight in 190, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.62393 = idf(docFreq=58, maxDocs=44421)
                0.078125 = fieldNorm(doc=190)
          0.1440121 = weight(abstract_txt:author in 190) [ClassicSimilarity], result of:
            0.1440121 = score(doc=190,freq=2.0), product of:
              0.2617345 = queryWeight, product of:
                2.6243227 = boost
                4.980042 = idf(docFreq=829, maxDocs=44421)
                0.020026762 = queryNorm
              0.5502221 = fieldWeight in 190, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.980042 = idf(docFreq=829, maxDocs=44421)
                0.078125 = fieldNorm(doc=190)
        0.24 = coord(6/25)
    
  5. Li, T.; Zhu, S.; Ogihara, M.: Text categorization via generalized discriminant analysis (2008) 0.12
    0.121931635 = sum of:
      0.121931635 = product of:
        0.5080485 = sum of:
          0.048658483 = weight(abstract_txt:text in 3119) [ClassicSimilarity], result of:
            0.048658483 = score(doc=3119,freq=5.0), product of:
              0.086162314 = queryWeight, product of:
                1.0647078 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.020026762 = queryNorm
              0.56473047 = fieldWeight in 3119, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
          0.050655972 = weight(abstract_txt:problem in 3119) [ClassicSimilarity], result of:
            0.050655972 = score(doc=3119,freq=3.0), product of:
              0.10493371 = queryWeight, product of:
                1.1749767 = boost
                4.4593854 = idf(docFreq=1396, maxDocs=44421)
                0.020026762 = queryNorm
              0.4827426 = fieldWeight in 3119, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.4593854 = idf(docFreq=1396, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
          0.036632326 = weight(abstract_txt:approach in 3119) [ClassicSimilarity], result of:
            0.036632326 = score(doc=3119,freq=2.0), product of:
              0.11078095 = queryWeight, product of:
                1.4785973 = boost
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.020026762 = queryNorm
              0.33067352 = fieldWeight in 3119, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.741144 = idf(docFreq=2864, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
          0.16925907 = weight(abstract_txt:multi in 3119) [ClassicSimilarity], result of:
            0.16925907 = score(doc=3119,freq=6.0), product of:
              0.18614548 = queryWeight, product of:
                1.5649412 = boost
                5.9394164 = idf(docFreq=317, maxDocs=44421)
                0.020026762 = queryNorm
              0.90928376 = fieldWeight in 3119, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                5.9394164 = idf(docFreq=317, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
          0.03917129 = weight(abstract_txt:document in 3119) [ClassicSimilarity], result of:
            0.03917129 = score(doc=3119,freq=1.0), product of:
              0.14595221 = queryWeight, product of:
                1.6971598 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.020026762 = queryNorm
              0.26838437 = fieldWeight in 3119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
          0.16367134 = weight(abstract_txt:decomposition in 3119) [ClassicSimilarity], result of:
            0.16367134 = score(doc=3119,freq=1.0), product of:
              0.33076286 = queryWeight, product of:
                2.086076 = boost
                7.917278 = idf(docFreq=43, maxDocs=44421)
                0.020026762 = queryNorm
              0.49482986 = fieldWeight in 3119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.917278 = idf(docFreq=43, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
        0.24 = coord(6/25)
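
The score trees above all follow the same arithmetic: each matching term contributes queryWeight (boost x idf x queryNorm) times fieldWeight (tf x idf x fieldNorm), the contributions are summed, and the sum is scaled by coord (matched terms / query terms). The sketch below recomputes the first hit from the numbers in its listing. The tf and idf formulas are inferred from the printed values (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))), and the function names are illustrative only, not part of any library. The engine evidently works in single precision, so expect agreement to roughly six significant digits rather than exact equality.

```python
import math

def tf(freq):
    # Term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)

def idf(doc_freq, max_docs):
    # Inverse document frequency, inferred from the values in the listing.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # queryWeight * fieldWeight for a single matching term.
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = tf(freq) * term_idf * field_norm
    return query_weight * field_weight

def document_score(term_scores, matched, total):
    # Sum of the term scores, scaled by the coordination factor.
    return sum(term_scores) * (matched / total)

MAX_DOCS = 44421          # maxDocs, as printed in every idf line
QUERY_NORM = 0.020026762  # queryNorm, as printed in every queryWeight block

# Term "than" in doc 37 (first hit): freq=2.0, docFreq=2461.
than_score = term_score(
    freq=2.0, doc_freq=2461, max_docs=MAX_DOCS,
    boost=1.0256742, query_norm=QUERY_NORM, field_norm=0.0546875,
)

# The ten per-term scores printed for the first hit, in listing order.
first_hit_terms = [
    0.02407319, 0.07040687, 0.045330193, 0.06046226, 0.076640956,
    0.1715108, 0.18084426, 0.20253293, 0.08037386, 0.10080847,
]
first_hit = document_score(first_hit_terms, matched=10, total=25)

print(f"than in 37: {than_score:.8f} (listing: 0.02407319)")
print(f"first hit:  {first_hit:.8f} (listing: 0.40519354)")
```

The same functions apply to the other four hits by substituting their freq, docFreq, boost, and fieldNorm values. Only fieldNorm and the set of matching terms vary per document; idf, boost, and queryNorm depend on the query term and the collection.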