Document (#30160)

Author
Hoad, T.C.
Zobel, J.
Title
Methods for identifying versioned and plagiarized documents
Source
Journal of the American Society for Information Science and Technology. 54(2003) no.3, pp.203-215
Year
2003
Abstract
Hoad and Zobel use the term co-derivatives for documents that originate from the same source, whether as versions or as plagiarisms. Co-derivatives are normally identified in one of two ways: by fingerprinting, which hashes substrings of the text into integer strings that serve as surrogates for comparison, or by ranking with a similarity measure, as in information retrieval. For use in ranking, Hoad and Zobel derive several variants of what they term an identity measure, which rewards documents with similar numbers of occurrences of words and penalizes those with dissimilar numbers. They then review fingerprinting strategies and characterize them by four properties: granularity (the size of the substrings used), the character of the hashing function, resolution (the size of the document fingerprint), and the substring selection strategy. The experiments use two measures: highest false match (HFM), the highest percentage score given to an incorrect result, and separation, the difference between the lowest correct result and HFM. These were applied to two collections, one of 3,300 documents and the other of 80,000 with 53 query documents. The new identity measure outperforms the alternatives. Only one fingerprinting strategy, the anchor strategy, identified all of the documents that humans had identified as similar. The key parameter in fingerprinting appears to be granularity, with three to five words producing the best results.
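
The terms in the abstract can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the method evaluated in the paper: the function names, the use of MD5 truncated to 32 bits, the "keep the smallest hashes" selection rule, and the overlap score are all hypothetical choices made here. In particular, the anchor strategy mentioned above is not implemented.

```python
import hashlib


def fingerprint(text, granularity=3, resolution=None):
    """Hash every run of `granularity` consecutive words.

    granularity: substring size in words (the abstract reports 3 to 5 as best).
    resolution:  maximum number of hashes kept; None keeps all of them.
    The selection rule (keep the numerically smallest hashes) is an
    assumption of this sketch, not a strategy taken from the paper.
    """
    words = text.lower().split()
    hashes = set()
    for i in range(len(words) - granularity + 1):
        chunk = " ".join(words[i:i + granularity])
        digest = hashlib.md5(chunk.encode("utf-8")).digest()
        hashes.add(int.from_bytes(digest[:4], "big"))  # 32-bit integer surrogate
    if resolution is not None:
        hashes = set(sorted(hashes)[:resolution])
    return hashes


def overlap(fp_a, fp_b):
    """Share of the smaller fingerprint that also occurs in the other one."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b))


def hfm_and_separation(scores, correct_ids):
    """Return (HFM, separation) for one query.

    scores:      mapping of document id to percentage score.
    correct_ids: ids judged to be co-derivatives of the query.
    HFM is the highest score given to an incorrect result; separation is
    the lowest correct score minus HFM.
    """
    hfm = max(s for d, s in scores.items() if d not in correct_ids)
    lowest_correct = min(s for d, s in scores.items() if d in correct_ids)
    return hfm, lowest_correct - hfm


original = "the quick brown fox jumps over the lazy dog"
revised = "the quick brown fox leaps over the lazy dog"
print(overlap(fingerprint(original), fingerprint(revised)))
```

A positive separation means that every correct document scored above every incorrect one for that query, which is the property the two measures are meant to capture.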

Similar documents (author)

  1. Kaszkiel, M.; Zobel, J.: Effective ranking with arbitrary passages (2001) 4.70
    4.6994414 = sum of:
      4.6994414 = weight(author_txt:zobel in 6764) [ClassicSimilarity], result of:
        4.6994414 = fieldWeight in 6764, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.398883 = idf(docFreq=9, maxDocs=44421)
          0.5 = fieldNorm(doc=6764)
    
  2. Heinz, S.; Zobel, J.: Efficient single-pass index construction for text databases (2003) 4.70
    4.6994414 = sum of:
      4.6994414 = weight(author_txt:zobel in 2678) [ClassicSimilarity], result of:
        4.6994414 = fieldWeight in 2678, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.398883 = idf(docFreq=9, maxDocs=44421)
          0.5 = fieldNorm(doc=2678)
    
  3. Uitdenbogerd, A.L.; Zobel, J.: An architecture for effective music information retrieval (2004) 4.70
    4.6994414 = sum of:
      4.6994414 = weight(author_txt:zobel in 4055) [ClassicSimilarity], result of:
        4.6994414 = fieldWeight in 4055, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.398883 = idf(docFreq=9, maxDocs=44421)
          0.5 = fieldNorm(doc=4055)
    
  4. Moffat, A.; Zobel, J.: Self-indexing inverted files for fast text retrieval (1996) 4.70
    4.6994414 = sum of:
      4.6994414 = weight(author_txt:zobel in 1009) [ClassicSimilarity], result of:
        4.6994414 = fieldWeight in 1009, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.398883 = idf(docFreq=9, maxDocs=44421)
          0.5 = fieldNorm(doc=1009)
    
  5. Hawking, D.; Zobel, J.: Does topic metadata help with Web search? (2007) 4.70
    4.6994414 = sum of:
      4.6994414 = weight(author_txt:zobel in 1204) [ClassicSimilarity], result of:
        4.6994414 = fieldWeight in 1204, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.398883 = idf(docFreq=9, maxDocs=44421)
          0.5 = fieldNorm(doc=1204)
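
The author scores above can be reproduced by hand. The following Python sketch assumes the formulas that the displayed numbers are consistent with: tf is the square root of the term frequency, and idf is 1 + ln(maxDocs / (docFreq + 1)). The function names are hypothetical, and the idf formula is inferred from the values shown here rather than quoted from a specific Lucene version.

```python
import math


def tf(freq):
    # Term-frequency factor as shown in the explain output: sqrt(freq).
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    # Inverse document frequency, inferred from the displayed values.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# author_txt:zobel occurs once in each listed record and in 9 of 44421 documents.
print(round(field_weight(1.0, 9, 44421, 0.5), 4))  # approximately 4.6994
```

All five records receive the same score because each matches the single author term once and carries the same fieldNorm of 0.5.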
    

Similar documents (content)

  1. Wartik, S.; Fox, E.; Heath, L.; Chen, Q.-F.: Hashing algorithms (1992) 0.12
    0.11799698 = sum of:
      0.11799698 = product of:
        0.9833082 = sum of:
          0.01685831 = weight(abstract_txt:with in 4510) [ClassicSimilarity], result of:
            0.01685831 = score(doc=4510,freq=1.0), product of:
              0.05403002 = queryWeight, product of:
                1.1930034 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.018143645 = queryNorm
              0.31201747 = fieldWeight in 4510, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.125 = fieldNorm(doc=4510)
          0.094896086 = weight(abstract_txt:technique in 4510) [ClassicSimilarity], result of:
            0.094896086 = score(doc=4510,freq=1.0), product of:
              0.13570045 = queryWeight, product of:
                1.3369026 = boost
                5.5944448 = idf(docFreq=448, maxDocs=44421)
                0.018143645 = queryNorm
              0.6993056 = fieldWeight in 4510, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5944448 = idf(docFreq=448, maxDocs=44421)
                0.125 = fieldNorm(doc=4510)
          0.8715538 = weight(abstract_txt:hashing in 4510) [ClassicSimilarity], result of:
            0.8715538 = score(doc=4510,freq=3.0), product of:
              0.41264015 = queryWeight, product of:
                2.3312824 = boost
                9.755557 = idf(docFreq=6, maxDocs=44421)
                0.018143645 = queryNorm
              2.11214 = fieldWeight in 4510, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                9.755557 = idf(docFreq=6, maxDocs=44421)
                0.125 = fieldNorm(doc=4510)
        0.12 = coord(3/25)
    
  2. Lihui, C.; Lian, C.W.: Using Web structure and summarisation techniques for Web content mining (2005) 0.10
    0.09848965 = sum of:
      0.09848965 = product of:
        0.35174873 = sum of:
          0.011920625 = weight(abstract_txt:with in 2046) [ClassicSimilarity], result of:
            0.011920625 = score(doc=2046,freq=2.0), product of:
              0.05403002 = queryWeight, product of:
                1.1930034 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.018143645 = queryNorm
              0.22062966 = fieldWeight in 2046, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=2046)
          0.035837136 = weight(abstract_txt:result in 2046) [ClassicSimilarity], result of:
            0.035837136 = score(doc=2046,freq=1.0), product of:
              0.11254459 = queryWeight, product of:
                1.2175069 = boost
                5.0948176 = idf(docFreq=739, maxDocs=44421)
                0.018143645 = queryNorm
              0.3184261 = fieldWeight in 2046, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.0948176 = idf(docFreq=739, maxDocs=44421)
                0.0625 = fieldNorm(doc=2046)
          0.03823934 = weight(abstract_txt:similar in 2046) [ClassicSimilarity], result of:
            0.03823934 = score(doc=2046,freq=1.0), product of:
              0.11751934 = queryWeight, product of:
                1.2441244 = boost
                5.206202 = idf(docFreq=661, maxDocs=44421)
                0.018143645 = queryNorm
              0.32538763 = fieldWeight in 2046, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.206202 = idf(docFreq=661, maxDocs=44421)
                0.0625 = fieldNorm(doc=2046)
          0.082182415 = weight(abstract_txt:technique in 2046) [ClassicSimilarity], result of:
            0.082182415 = score(doc=2046,freq=3.0), product of:
              0.13570045 = queryWeight, product of:
                1.3369026 = boost
                5.5944448 = idf(docFreq=448, maxDocs=44421)
                0.018143645 = queryNorm
              0.6056164 = fieldWeight in 2046, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.5944448 = idf(docFreq=448, maxDocs=44421)
                0.0625 = fieldNorm(doc=2046)
          0.047504798 = weight(abstract_txt:ranking in 2046) [ClassicSimilarity], result of:
            0.047504798 = score(doc=2046,freq=1.0), product of:
              0.13580865 = queryWeight, product of:
                1.3374355 = boost
                5.5966744 = idf(docFreq=447, maxDocs=44421)
                0.018143645 = queryNorm
              0.34979215 = fieldWeight in 2046, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5966744 = idf(docFreq=447, maxDocs=44421)
                0.0625 = fieldNorm(doc=2046)
          0.05546622 = weight(abstract_txt:size in 2046) [ClassicSimilarity], result of:
            0.05546622 = score(doc=2046,freq=1.0), product of:
              0.15058723 = queryWeight, product of:
                1.408326 = boost
                5.8933253 = idf(docFreq=332, maxDocs=44421)
                0.018143645 = queryNorm
              0.36833283 = fieldWeight in 2046, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8933253 = idf(docFreq=332, maxDocs=44421)
                0.0625 = fieldNorm(doc=2046)
          0.080598205 = weight(abstract_txt:documents in 2046) [ClassicSimilarity], result of:
            0.080598205 = score(doc=2046,freq=2.0), product of:
              0.22114804 = queryWeight, product of:
                2.9560468 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.018143645 = queryNorm
              0.3644536 = fieldWeight in 2046, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=2046)
        0.28 = coord(7/25)
    
  3. Ku, L.-W.; Chen, H.-H.: Mining opinions from the Web : beyond relevance retrieval (2007) 0.09
    0.08733826 = sum of:
      0.08733826 = product of:
        0.43669128 = sum of:
          0.011920625 = weight(abstract_txt:with in 1605) [ClassicSimilarity], result of:
            0.011920625 = score(doc=1605,freq=2.0), product of:
              0.05403002 = queryWeight, product of:
                1.1930034 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.018143645 = queryNorm
              0.22062966 = fieldWeight in 1605, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=1605)
          0.083264135 = weight(abstract_txt:words in 1605) [ClassicSimilarity], result of:
            0.083264135 = score(doc=1605,freq=4.0), product of:
              0.12437156 = queryWeight, product of:
                1.2798812 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.018143645 = queryNorm
              0.6694789 = fieldWeight in 1605, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.0625 = fieldNorm(doc=1605)
          0.12172066 = weight(abstract_txt:granularity in 1605) [ClassicSimilarity], result of:
            0.12172066 = score(doc=1605,freq=1.0), product of:
              0.25429937 = queryWeight, product of:
                1.8301293 = boost
                7.6584163 = idf(docFreq=56, maxDocs=44421)
                0.018143645 = queryNorm
              0.47865102 = fieldWeight in 1605, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6584163 = idf(docFreq=56, maxDocs=44421)
                0.0625 = fieldNorm(doc=1605)
          0.09234891 = weight(abstract_txt:measure in 1605) [ClassicSimilarity], result of:
            0.09234891 = score(doc=1605,freq=2.0), product of:
              0.19219586 = queryWeight, product of:
                1.9486183 = boost
                5.4361663 = idf(docFreq=525, maxDocs=44421)
                0.018143645 = queryNorm
              0.48049375 = fieldWeight in 1605, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.4361663 = idf(docFreq=525, maxDocs=44421)
                0.0625 = fieldNorm(doc=1605)
          0.12743697 = weight(abstract_txt:documents in 1605) [ClassicSimilarity], result of:
            0.12743697 = score(doc=1605,freq=5.0), product of:
              0.22114804 = queryWeight, product of:
                2.9560468 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.018143645 = queryNorm
              0.5762518 = fieldWeight in 1605, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=1605)
        0.2 = coord(5/25)
    
  4. Fricke, M.: Measuring recall (1998) 0.09
    0.085872926 = sum of:
      0.085872926 = product of:
        0.42936462 = sum of:
          0.014751022 = weight(abstract_txt:with in 4802) [ClassicSimilarity], result of:
            0.014751022 = score(doc=4802,freq=1.0), product of:
              0.05403002 = queryWeight, product of:
                1.1930034 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.018143645 = queryNorm
              0.2730153 = fieldWeight in 4802, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.109375 = fieldNorm(doc=4802)
          0.083034076 = weight(abstract_txt:technique in 4802) [ClassicSimilarity], result of:
            0.083034076 = score(doc=4802,freq=1.0), product of:
              0.13570045 = queryWeight, product of:
                1.3369026 = boost
                5.5944448 = idf(docFreq=448, maxDocs=44421)
                0.018143645 = queryNorm
              0.6118924 = fieldWeight in 4802, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5944448 = idf(docFreq=448, maxDocs=44421)
                0.109375 = fieldNorm(doc=4802)
          0.11756837 = weight(abstract_txt:ranking in 4802) [ClassicSimilarity], result of:
            0.11756837 = score(doc=4802,freq=2.0), product of:
              0.13580865 = queryWeight, product of:
                1.3374355 = boost
                5.5966744 = idf(docFreq=447, maxDocs=44421)
                0.018143645 = queryNorm
              0.86569136 = fieldWeight in 4802, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5966744 = idf(docFreq=447, maxDocs=44421)
                0.109375 = fieldNorm(doc=4802)
          0.114275955 = weight(abstract_txt:measure in 4802) [ClassicSimilarity], result of:
            0.114275955 = score(doc=4802,freq=1.0), product of:
              0.19219586 = queryWeight, product of:
                1.9486183 = boost
                5.4361663 = idf(docFreq=525, maxDocs=44421)
                0.018143645 = queryNorm
              0.5945807 = fieldWeight in 4802, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4361663 = idf(docFreq=525, maxDocs=44421)
                0.109375 = fieldNorm(doc=4802)
          0.09973519 = weight(abstract_txt:documents in 4802) [ClassicSimilarity], result of:
            0.09973519 = score(doc=4802,freq=1.0), product of:
              0.22114804 = queryWeight, product of:
                2.9560468 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.018143645 = queryNorm
              0.45098835 = fieldWeight in 4802, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.109375 = fieldNorm(doc=4802)
        0.2 = coord(5/25)
    
  5. Savoy, J.: Text representation strategies : an example with the State of the union addresses (2016) 0.08
    0.08407796 = sum of:
      0.08407796 = product of:
        0.35032484 = sum of:
          0.05974 = weight(abstract_txt:term in 4042) [ClassicSimilarity], result of:
            0.05974 = score(doc=4042,freq=4.0), product of:
              0.09967645 = queryWeight, product of:
                1.145791 = boost
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.018143645 = queryNorm
              0.5993391 = fieldWeight in 4042, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.0625 = fieldNorm(doc=4042)
          0.011920625 = weight(abstract_txt:with in 4042) [ClassicSimilarity], result of:
            0.011920625 = score(doc=4042,freq=2.0), product of:
              0.05403002 = queryWeight, product of:
                1.1930034 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.018143645 = queryNorm
              0.22062966 = fieldWeight in 4042, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=4042)
          0.041632067 = weight(abstract_txt:words in 4042) [ClassicSimilarity], result of:
            0.041632067 = score(doc=4042,freq=1.0), product of:
              0.12437156 = queryWeight, product of:
                1.2798812 = boost
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.018143645 = queryNorm
              0.33473945 = fieldWeight in 4042, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.355831 = idf(docFreq=569, maxDocs=44421)
                0.0625 = fieldNorm(doc=4042)
          0.047448043 = weight(abstract_txt:technique in 4042) [ClassicSimilarity], result of:
            0.047448043 = score(doc=4042,freq=1.0), product of:
              0.13570045 = queryWeight, product of:
                1.3369026 = boost
                5.5944448 = idf(docFreq=448, maxDocs=44421)
                0.018143645 = queryNorm
              0.3496528 = fieldWeight in 4042, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5944448 = idf(docFreq=448, maxDocs=44421)
                0.0625 = fieldNorm(doc=4042)
          0.11310386 = weight(abstract_txt:measure in 4042) [ClassicSimilarity], result of:
            0.11310386 = score(doc=4042,freq=3.0), product of:
              0.19219586 = queryWeight, product of:
                1.9486183 = boost
                5.4361663 = idf(docFreq=525, maxDocs=44421)
                0.018143645 = queryNorm
              0.58848226 = fieldWeight in 4042, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.4361663 = idf(docFreq=525, maxDocs=44421)
                0.0625 = fieldNorm(doc=4042)
          0.07648023 = weight(abstract_txt:strategy in 4042) [ClassicSimilarity], result of:
            0.07648023 = score(doc=4042,freq=1.0), product of:
              0.2135497 = queryWeight, product of:
                2.054018 = boost
                5.7302055 = idf(docFreq=391, maxDocs=44421)
                0.018143645 = queryNorm
              0.35813785 = fieldWeight in 4042, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7302055 = idf(docFreq=391, maxDocs=44421)
                0.0625 = fieldNorm(doc=4042)
        0.24 = coord(6/25)
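
The content scores combine more factors than the author scores: each matching term contributes queryWeight × fieldWeight, and the sum is scaled by coord, the fraction of query terms that matched. The Python sketch below recomputes the first entry (Wartik et al., 0.12) from the values listed for it. The function names are hypothetical, and the idf formula is inferred from the displayed numbers rather than taken from Lucene documentation.

```python
import math

MAX_DOCS = 44421          # maxDocs as shown in the explain output
QUERY_NORM = 0.018143645  # queryNorm as shown in the explain output


def idf(doc_freq, max_docs=MAX_DOCS):
    # Inferred from the displayed values: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost, doc_freq, freq, field_norm):
    # queryWeight = boost * idf * queryNorm
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    # fieldWeight = sqrt(freq) * idf * fieldNorm
    field_weight = math.sqrt(freq) * idf(doc_freq) * field_norm
    return query_weight * field_weight


def document_score(terms, matched, total):
    # Sum of the per-term scores, scaled by coord = matched / total.
    return sum(term_score(*t) for t in terms) * (matched / total)


# (boost, docFreq, freq, fieldNorm) for "with", "technique" and "hashing"
# in document 4510, copied from the explain output above.
wartik_terms = [
    (1.1930034, 9949, 1.0, 0.125),
    (1.3369026, 448, 1.0, 0.125),
    (2.3312824, 6, 3.0, 0.125),
]
wartik_score = document_score(wartik_terms, matched=3, total=25)
print(round(wartik_score, 2))
```

The rare term "hashing" (docFreq=6) supplies most of the score, which is why this record ranks first even though only 3 of the 25 query terms match.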