Document (#40331)

Kocher, M.
Savoy, J.
¬A simple and efficient algorithm for authorship verification
Journal of the Association for Information Science and Technology. 68(2017) no.1, p.259-269
This paper describes and evaluates an unsupervised and effective authorship verification model called Spatium-L1. As features, we suggest using the 200 most frequent terms of the disputed text (isolated words and punctuation symbols). Applying a simple distance measure and a set of impostors, we can determine whether or not the disputed text was written by the proposed author. Moreover, based on a simple rule we can define when there is enough evidence to propose an answer or when the attribution scheme is unable to make a decision with a high degree of certainty. Evaluations based on 6 test collections (PAN CLEF 2014 evaluation campaign) indicate that Spatium-L1 usually appears in the top 3 best verification systems, and on an aggregate measure, presents the best performance. The suggested strategy can be adapted without any problem to different Indo-European languages (such as English, Dutch, Spanish, and Greek) or genres (essay, novel, review, and newspaper article).
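
The verification procedure summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract only states that the 200 most frequent terms of the disputed text, a simple distance measure, a set of impostors, and a simple decision rule are used. The L1 (Manhattan) distance is inferred from the model's name; the relative-frequency representation, the ratio-based rule, the `margin` value, and all function names are assumptions made for this sketch.

```python
import re
from collections import Counter


def tokenize(text):
    # Isolated words and punctuation symbols, lowercased (assumed preprocessing).
    return re.findall(r"\w+|[^\w\s]", text.lower())


def relative_freqs(tokens, vocab):
    # Relative frequency of each vocabulary term in the token list.
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[t] / total for t in vocab]


def l1_distance(a, b):
    # Manhattan distance between two equal-length vectors.
    return sum(abs(x - y) for x, y in zip(a, b))


def verify(disputed, author_texts, impostor_texts, k=200, margin=0.1):
    """Return 'same author', 'different author', or 'undecided'.

    The thresholding on the distance ratio is a hypothetical rule.
    """
    d_tokens = tokenize(disputed)
    # Features: the k most frequent terms of the disputed text.
    vocab = [t for t, _ in Counter(d_tokens).most_common(k)]
    d_vec = relative_freqs(d_tokens, vocab)

    # Author profile: all known texts of the proposed author, concatenated.
    author_tokens = [tok for text in author_texts for tok in tokenize(text)]
    d_author = l1_distance(d_vec, relative_freqs(author_tokens, vocab))

    # Mean distance to the impostor texts.
    d_imp = [
        l1_distance(d_vec, relative_freqs(tokenize(t), vocab))
        for t in impostor_texts
    ]
    mean_imp = sum(d_imp) / len(d_imp)

    if mean_imp == 0:
        ratio = float("inf") if d_author > 0 else 1.0
    else:
        ratio = d_author / mean_imp

    if ratio < 1 - margin:
        return "same author"
    if ratio > 1 + margin:
        return "different author"
    return "undecided"
```

A ratio close to 1 means the proposed author is no closer to the disputed text than the impostors are, which is the case in which this sketch declines to answer.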

Similar documents (author)

  1. Savoy, J.: Stemming of French words based on grammatical categories (1993) 5.21
    5.2088575 = sum of:
      5.2088575 = weight(author_txt:savoy in 4649) [ClassicSimilarity], result of:
        5.2088575 = fieldWeight in 4649, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.334172 = idf(docFreq=28, maxDocs=44421)
          0.625 = fieldNorm(doc=4649)
  2. Savoy, J.: Effectiveness of information retrieval systems used in a hypertext environment (1993) 5.21
    5.2088575 = sum of:
      5.2088575 = weight(author_txt:savoy in 6510) [ClassicSimilarity], result of:
        5.2088575 = fieldWeight in 6510, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.334172 = idf(docFreq=28, maxDocs=44421)
          0.625 = fieldNorm(doc=6510)
  3. Savoy, J.: ¬A learning scheme for information retrieval in hypertext (1994) 5.21
    5.2088575 = sum of:
      5.2088575 = weight(author_txt:savoy in 7291) [ClassicSimilarity], result of:
        5.2088575 = fieldWeight in 7291, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.334172 = idf(docFreq=28, maxDocs=44421)
          0.625 = fieldNorm(doc=7291)
  4. Savoy, J.: Bayesian inference networks and spreading activation in hypertext systems (1992) 5.21
    5.2088575 = sum of:
      5.2088575 = weight(author_txt:savoy in 260) [ClassicSimilarity], result of:
        5.2088575 = fieldWeight in 260, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.334172 = idf(docFreq=28, maxDocs=44421)
          0.625 = fieldNorm(doc=260)
  5. Savoy, J.: Searching information in legal hypertext systems (1993/94) 5.21
    5.2088575 = sum of:
      5.2088575 = weight(author_txt:savoy in 825) [ClassicSimilarity], result of:
        5.2088575 = fieldWeight in 825, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.334172 = idf(docFreq=28, maxDocs=44421)
          0.625 = fieldNorm(doc=825)
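
The author scores above can be reproduced from the listed components. The sketch below assumes the usual ClassicSimilarity definitions, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which are consistent with the figures shown; the function names are illustrative only.

```python
import math


def tf(freq):
    # Term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    # Inverse document frequency as it appears in the explain output.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Values taken from the author_txt:savoy entries above.
score = field_weight(freq=1.0, doc_freq=28, max_docs=44421, field_norm=0.625)
print(round(score, 4))
```

Because all five hits share the same term frequency, document frequency, and field norm, they receive the same score of about 5.21.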

Similar documents (content)

  1. Adamovic, S.; Miskovic, V.; Milosavljevic, M.; Sarac, M.; Veinovic, M.: Automated language-independent authorship verification (for Indo-European languages) : facilitating adaptive visual exploration of scientific publications by citation links (2019) 0.30
    0.2989583 = sum of:
      0.2989583 = product of:
        1.0677083 = sum of:
          0.05329029 = weight(abstract_txt:spanish in 327) [ClassicSimilarity], result of:
            0.05329029 = score(doc=327,freq=1.0), product of:
              0.12241374 = queryWeight, product of:
                1.0037432 = boost
                6.965269 = idf(docFreq=113, maxDocs=44421)
                0.017509336 = queryNorm
              0.43532932 = fieldWeight in 327, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.965269 = idf(docFreq=113, maxDocs=44421)
                0.0625 = fieldNorm(doc=327)
          0.09817721 = weight(abstract_txt:dutch in 327) [ClassicSimilarity], result of:
            0.09817721 = score(doc=327,freq=2.0), product of:
              0.14601426 = queryWeight, product of:
                1.0962387 = boost
                7.607123 = idf(docFreq=59, maxDocs=44421)
                0.017509336 = queryNorm
              0.672381 = fieldWeight in 327, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.607123 = idf(docFreq=59, maxDocs=44421)
                0.0625 = fieldNorm(doc=327)
          0.020810999 = weight(abstract_txt:text in 327) [ClassicSimilarity], result of:
            0.020810999 = score(doc=327,freq=1.0), product of:
              0.082401805 = queryWeight, product of:
                1.1646378 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.017509336 = queryNorm
              0.25255513 = fieldWeight in 327, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=327)
          0.022479115 = weight(abstract_txt:when in 327) [ClassicSimilarity], result of:
            0.022479115 = score(doc=327,freq=1.0), product of:
              0.08674829 = queryWeight, product of:
                1.1949589 = boost
                4.1460857 = idf(docFreq=1910, maxDocs=44421)
                0.017509336 = queryNorm
              0.25913036 = fieldWeight in 327, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1460857 = idf(docFreq=1910, maxDocs=44421)
                0.0625 = fieldNorm(doc=327)
          0.09018017 = weight(abstract_txt:greek in 327) [ClassicSimilarity], result of:
            0.09018017 = score(doc=327,freq=1.0), product of:
              0.17383564 = queryWeight, product of:
                1.196126 = boost
                8.30027 = idf(docFreq=29, maxDocs=44421)
                0.017509336 = queryNorm
              0.5187669 = fieldWeight in 327, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.30027 = idf(docFreq=29, maxDocs=44421)
                0.0625 = fieldNorm(doc=327)
          0.23922697 = weight(abstract_txt:authorship in 327) [ClassicSimilarity], result of:
            0.23922697 = score(doc=327,freq=5.0), product of:
              0.24544726 = queryWeight, product of:
                2.0100257 = boost
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.017509336 = queryNorm
              0.9746573 = fieldWeight in 327, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.0625 = fieldNorm(doc=327)
          0.54354346 = weight(abstract_txt:verification in 327) [ClassicSimilarity], result of:
            0.54354346 = score(doc=327,freq=6.0), product of:
              0.45695943 = queryWeight, product of:
                3.3589766 = boost
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.017509336 = queryNorm
              1.1894786 = fieldWeight in 327, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.0625 = fieldNorm(doc=327)
        0.28 = coord(7/25)
  2. Savoy, J.: Estimating the probability of an authorship attribution (2016) 0.25
    0.24505794 = sum of:
      0.24505794 = product of:
        0.87520695 = sum of:
          0.06281364 = weight(abstract_txt:2014 in 3937) [ClassicSimilarity], result of:
            0.06281364 = score(doc=3937,freq=1.0), product of:
              0.13659477 = queryWeight, product of:
                1.0602897 = boost
                7.357662 = idf(docFreq=76, maxDocs=44421)
                0.017509336 = queryNorm
              0.4598539 = fieldWeight in 3937, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.357662 = idf(docFreq=76, maxDocs=44421)
                0.0625 = fieldNorm(doc=3937)
          0.10974218 = weight(abstract_txt:attribution in 3937) [ClassicSimilarity], result of:
            0.10974218 = score(doc=3937,freq=2.0), product of:
              0.15726684 = queryWeight, product of:
                1.1376957 = boost
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.017509336 = queryNorm
              0.69780874 = fieldWeight in 3937, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.0625 = fieldNorm(doc=3937)
          0.029431196 = weight(abstract_txt:text in 3937) [ClassicSimilarity], result of:
            0.029431196 = score(doc=3937,freq=2.0), product of:
              0.082401805 = queryWeight, product of:
                1.1646378 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.017509336 = queryNorm
              0.3571669 = fieldWeight in 3937, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=3937)
          0.022479115 = weight(abstract_txt:when in 3937) [ClassicSimilarity], result of:
            0.022479115 = score(doc=3937,freq=1.0), product of:
              0.08674829 = queryWeight, product of:
                1.1949589 = boost
                4.1460857 = idf(docFreq=1910, maxDocs=44421)
                0.017509336 = queryNorm
              0.25913036 = fieldWeight in 3937, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1460857 = idf(docFreq=1910, maxDocs=44421)
                0.0625 = fieldNorm(doc=3937)
          0.16598947 = weight(abstract_txt:certainty in 3937) [ClassicSimilarity], result of:
            0.16598947 = score(doc=3937,freq=2.0), product of:
              0.20722485 = queryWeight, product of:
                1.3059556 = boost
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.017509336 = queryNorm
              0.80101144 = fieldWeight in 3937, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.06241 = idf(docFreq=13, maxDocs=44421)
                0.0625 = fieldNorm(doc=3937)
          0.2139711 = weight(abstract_txt:authorship in 3937) [ClassicSimilarity], result of:
            0.2139711 = score(doc=3937,freq=4.0), product of:
              0.24544726 = queryWeight, product of:
                2.0100257 = boost
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.017509336 = queryNorm
              0.87175995 = fieldWeight in 3937, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.0625 = fieldNorm(doc=3937)
          0.27078024 = weight(abstract_txt:disputed in 3937) [ClassicSimilarity], result of:
            0.27078024 = score(doc=3937,freq=1.0), product of:
              0.45584732 = queryWeight, product of:
                2.7392535 = boost
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.017509336 = queryNorm
              0.5940152 = fieldWeight in 3937, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.0625 = fieldNorm(doc=3937)
        0.28 = coord(7/25)
  3. Schaalje, G.B.; Blades, N.J.; Funai, T.: ¬An open-set size-adjusted Bayesian classifier for authorship attribution (2013) 0.13
    0.12833957 = sum of:
      0.12833957 = product of:
        0.6416979 = sum of:
          0.10974218 = weight(abstract_txt:attribution in 2041) [ClassicSimilarity], result of:
            0.10974218 = score(doc=2041,freq=2.0), product of:
              0.15726684 = queryWeight, product of:
                1.1376957 = boost
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.017509336 = queryNorm
              0.69780874 = fieldWeight in 2041, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.0625 = fieldNorm(doc=2041)
          0.036045708 = weight(abstract_txt:text in 2041) [ClassicSimilarity], result of:
            0.036045708 = score(doc=2041,freq=3.0), product of:
              0.082401805 = queryWeight, product of:
                1.1646378 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.017509336 = queryNorm
              0.4374383 = fieldWeight in 2041, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=2041)
          0.039825317 = weight(abstract_txt:best in 2041) [ClassicSimilarity], result of:
            0.039825317 = score(doc=2041,freq=1.0), product of:
              0.12701283 = queryWeight, product of:
                1.4459269 = boost
                5.0168557 = idf(docFreq=799, maxDocs=44421)
                0.017509336 = queryNorm
              0.31355348 = fieldWeight in 2041, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.0168557 = idf(docFreq=799, maxDocs=44421)
                0.0625 = fieldNorm(doc=2041)
          0.1853044 = weight(abstract_txt:authorship in 2041) [ClassicSimilarity], result of:
            0.1853044 = score(doc=2041,freq=3.0), product of:
              0.24544726 = queryWeight, product of:
                2.0100257 = boost
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.017509336 = queryNorm
              0.75496626 = fieldWeight in 2041, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.0625 = fieldNorm(doc=2041)
          0.27078024 = weight(abstract_txt:disputed in 2041) [ClassicSimilarity], result of:
            0.27078024 = score(doc=2041,freq=1.0), product of:
              0.45584732 = queryWeight, product of:
                2.7392535 = boost
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.017509336 = queryNorm
              0.5940152 = fieldWeight in 2041, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.504243 = idf(docFreq=8, maxDocs=44421)
                0.0625 = fieldNorm(doc=2041)
        0.2 = coord(5/25)
  4. Stamatatos, E.: Masking topic-related information to enhance authorship attribution (2018) 0.11
    0.107820444 = sum of:
      0.107820444 = product of:
        0.5391022 = sum of:
          0.10249727 = weight(abstract_txt:genres in 124) [ClassicSimilarity], result of:
            0.10249727 = score(doc=124,freq=3.0), product of:
              0.13127013 = queryWeight, product of:
                1.0394186 = boost
                7.212831 = idf(docFreq=88, maxDocs=44421)
                0.017509336 = queryNorm
              0.78081185 = fieldWeight in 124, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.212831 = idf(docFreq=88, maxDocs=44421)
                0.0625 = fieldNorm(doc=124)
          0.19007905 = weight(abstract_txt:attribution in 124) [ClassicSimilarity], result of:
            0.19007905 = score(doc=124,freq=6.0), product of:
              0.15726684 = queryWeight, product of:
                1.1376957 = boost
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.017509336 = queryNorm
              1.2086403 = fieldWeight in 124, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.0625 = fieldNorm(doc=124)
          0.029431196 = weight(abstract_txt:text in 124) [ClassicSimilarity], result of:
            0.029431196 = score(doc=124,freq=2.0), product of:
              0.082401805 = queryWeight, product of:
                1.1646378 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.017509336 = queryNorm
              0.3571669 = fieldWeight in 124, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=124)
          0.031790268 = weight(abstract_txt:when in 124) [ClassicSimilarity], result of:
            0.031790268 = score(doc=124,freq=2.0), product of:
              0.08674829 = queryWeight, product of:
                1.1949589 = boost
                4.1460857 = idf(docFreq=1910, maxDocs=44421)
                0.017509336 = queryNorm
              0.36646566 = fieldWeight in 124, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1460857 = idf(docFreq=1910, maxDocs=44421)
                0.0625 = fieldNorm(doc=124)
          0.1853044 = weight(abstract_txt:authorship in 124) [ClassicSimilarity], result of:
            0.1853044 = score(doc=124,freq=3.0), product of:
              0.24544726 = queryWeight, product of:
                2.0100257 = boost
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.017509336 = queryNorm
              0.75496626 = fieldWeight in 124, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.0625 = fieldNorm(doc=124)
        0.2 = coord(5/25)
  5. Stover, J.A.; Winter, Y.; Koppel, M.; Kestemont, M.: Computational authorship verification method attributes a new work to a major 2nd century African author (2016) 0.10
    0.104778424 = sum of:
      0.104778424 = product of:
        0.65486515 = sum of:
          0.077599436 = weight(abstract_txt:attribution in 3503) [ClassicSimilarity], result of:
            0.077599436 = score(doc=3503,freq=1.0), product of:
              0.15726684 = queryWeight, product of:
                1.1376957 = boost
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.017509336 = queryNorm
              0.4934253 = fieldWeight in 3503, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.0625 = fieldNorm(doc=3503)
          0.041621998 = weight(abstract_txt:text in 3503) [ClassicSimilarity], result of:
            0.041621998 = score(doc=3503,freq=4.0), product of:
              0.082401805 = queryWeight, product of:
                1.1646378 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.017509336 = queryNorm
              0.50511026 = fieldWeight in 3503, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=3503)
          0.15130042 = weight(abstract_txt:authorship in 3503) [ClassicSimilarity], result of:
            0.15130042 = score(doc=3503,freq=2.0), product of:
              0.24544726 = queryWeight, product of:
                2.0100257 = boost
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.017509336 = queryNorm
              0.61642736 = fieldWeight in 3503, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.0625 = fieldNorm(doc=3503)
          0.38434327 = weight(abstract_txt:verification in 3503) [ClassicSimilarity], result of:
            0.38434327 = score(doc=3503,freq=3.0), product of:
              0.45695943 = queryWeight, product of:
                3.3589766 = boost
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.017509336 = queryNorm
              0.8410884 = fieldWeight in 3503, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.0625 = fieldNorm(doc=3503)
        0.16 = coord(4/25)
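
As a cross-check, the content score of the last hit (doc 3503) can be recomputed from its explain tree. The sketch assumes the same ClassicSimilarity definitions that the listed numbers follow: each term contributes queryWeight * fieldWeight, with queryWeight = boost * idf * queryNorm and fieldWeight = tf * idf * fieldNorm, and the sum is scaled by coord (matched terms / query terms). Function and variable names are illustrative.

```python
import math

MAX_DOCS = 44421
QUERY_NORM = 0.017509336
FIELD_NORM = 0.0625


def idf(doc_freq, max_docs=MAX_DOCS):
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost):
    # queryWeight * fieldWeight for a single matching term.
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    field_weight = math.sqrt(freq) * idf(doc_freq) * FIELD_NORM
    return query_weight * field_weight


# (term, freq, docFreq, boost) as listed for doc 3503.
terms = [
    ("attribution", 1.0, 44, 1.1376957),
    ("text", 4.0, 2122, 1.1646378),
    ("authorship", 2.0, 112, 2.0100257),
    ("verification", 3.0, 50, 3.3589766),
]

raw_sum = sum(term_score(f, df, b) for _, f, df, b in terms)
coord = len(terms) / 25  # 4 of the 25 query terms matched
final_score = raw_sum * coord
print(round(final_score, 4))
```

The coord factor explains the ordering of the list: a hit that matches only 4 of 25 query terms is scaled down more strongly than one matching 7, even when its individual term weights are high.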