Document (#30060)

Author
Conrad, J.G.
Schriber, C.P.
Title
Managing déjà vu : collection building for the identification of nonidentical duplicate documents
Source
Journal of the American Society for Information Science and Technology. 57(2006) no.7, pp.921-932
Year
2006
Abstract
As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close variants. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its deleterious effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling nonidentical duplicate documents. We subsequently examine a flexible method of characterizing and comparing documents to permit the identification of near duplicates. This method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.
Content
A contribution on the problem of automatic duplicate detection

Similar documents (content)

  1. Desrichard, Y.: Le dedoublonage des banques de donnees bibliographiques : un etat de l'art (1997) 0.14
    0.13957231 = sum of:
      0.13957231 = product of:
        1.1631026 = sum of:
          0.025461404 = weight(abstract_txt:users in 669) [ClassicSimilarity], result of:
            0.025461404 = score(doc=669,freq=1.0), product of:
              0.057099655 = queryWeight, product of:
                1.016419 = boost
                3.5672934 = idf(docFreq=3408, maxDocs=44421)
                0.015747871 = queryNorm
              0.44591168 = fieldWeight in 669, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.5672934 = idf(docFreq=3408, maxDocs=44421)
                0.125 = fieldNorm(doc=669)
          0.5551028 = weight(abstract_txt:duplicates in 669) [ClassicSimilarity], result of:
            0.5551028 = score(doc=669,freq=1.0), product of:
              0.5101031 = queryWeight, product of:
                3.7207515 = boost
                8.705735 = idf(docFreq=19, maxDocs=44421)
                0.015747871 = queryNorm
              1.0882169 = fieldWeight in 669, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.705735 = idf(docFreq=19, maxDocs=44421)
                0.125 = fieldNorm(doc=669)
          0.58253837 = weight(abstract_txt:duplicate in 669) [ClassicSimilarity], result of:
            0.58253837 = score(doc=669,freq=1.0), product of:
              0.5797912 = queryWeight, product of:
                4.580436 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.015747871 = queryNorm
              1.0047382 = fieldWeight in 669, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.125 = fieldNorm(doc=669)
        0.12 = coord(3/25)
    
  2. Zhang, J.; Mostafa, J.; Tripathy, H.: Information retrieval by semantic analysis and visualization of the concept space of D-Lib® magazine (2002) 0.14
    0.13910754 = sum of:
      0.13910754 = product of:
        0.3161535 = sum of:
          0.02111148 = weight(abstract_txt:users in 2211) [ClassicSimilarity], result of:
            0.02111148 = score(doc=2211,freq=11.0), product of:
              0.057099655 = queryWeight, product of:
                1.016419 = boost
                3.5672934 = idf(docFreq=3408, maxDocs=44421)
                0.015747871 = queryNorm
              0.36973044 = fieldWeight in 2211, product of:
                3.3166249 = tf(freq=11.0), with freq of:
                  11.0 = termFreq=11.0
                3.5672934 = idf(docFreq=3408, maxDocs=44421)
                0.03125 = fieldNorm(doc=2211)
          0.026308319 = weight(abstract_txt:subsequently in 2211) [ClassicSimilarity], result of:
            0.026308319 = score(doc=2211,freq=1.0), product of:
              0.11671786 = queryWeight, product of:
                1.027566 = boost
                7.212831 = idf(docFreq=88, maxDocs=44421)
                0.015747871 = queryNorm
              0.22540097 = fieldWeight in 2211, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.212831 = idf(docFreq=88, maxDocs=44421)
                0.03125 = fieldNorm(doc=2211)
          0.016764874 = weight(abstract_txt:search in 2211) [ClassicSimilarity], result of:
            0.016764874 = score(doc=2211,freq=6.0), product of:
              0.059928723 = queryWeight, product of:
                1.0412945 = boost
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.015747871 = queryNorm
              0.2797469 = fieldWeight in 2211, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.03125 = fieldNorm(doc=2211)
          0.041774794 = weight(abstract_txt:algorithmic in 2211) [ClassicSimilarity], result of:
            0.041774794 = score(doc=2211,freq=2.0), product of:
              0.1260883 = queryWeight, product of:
                1.0680177 = boost
                7.496775 = idf(docFreq=66, maxDocs=44421)
                0.015747871 = queryNorm
              0.3313138 = fieldWeight in 2211, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.496775 = idf(docFreq=66, maxDocs=44421)
                0.03125 = fieldNorm(doc=2211)
          0.02953924 = weight(abstract_txt:variants in 2211) [ClassicSimilarity], result of:
            0.02953924 = score(doc=2211,freq=1.0), product of:
              0.1260883 = queryWeight, product of:
                1.0680177 = boost
                7.496775 = idf(docFreq=66, maxDocs=44421)
                0.015747871 = queryNorm
              0.23427422 = fieldWeight in 2211, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.496775 = idf(docFreq=66, maxDocs=44421)
                0.03125 = fieldNorm(doc=2211)
          0.013390095 = weight(abstract_txt:both in 2211) [ClassicSimilarity], result of:
            0.013390095 = score(doc=2211,freq=3.0), product of:
              0.06499809 = queryWeight, product of:
                1.084442 = boost
                3.8060317 = idf(docFreq=2684, maxDocs=44421)
                0.015747871 = queryNorm
              0.20600751 = fieldWeight in 2211, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.8060317 = idf(docFreq=2684, maxDocs=44421)
                0.03125 = fieldNorm(doc=2211)
          0.032397572 = weight(abstract_txt:permit in 2211) [ClassicSimilarity], result of:
            0.032397572 = score(doc=2211,freq=1.0), product of:
              0.13409632 = queryWeight, product of:
                1.1014112 = boost
                7.731176 = idf(docFreq=52, maxDocs=44421)
                0.015747871 = queryNorm
              0.24159925 = fieldWeight in 2211, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.731176 = idf(docFreq=52, maxDocs=44421)
                0.03125 = fieldNorm(doc=2211)
          0.04625857 = weight(abstract_txt:minimizing in 2211) [ClassicSimilarity], result of:
            0.04625857 = score(doc=2211,freq=1.0), product of:
              0.17003436 = queryWeight, product of:
                1.2402505 = boost
                8.705735 = idf(docFreq=19, maxDocs=44421)
                0.015747871 = queryNorm
              0.27205423 = fieldWeight in 2211, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.705735 = idf(docFreq=19, maxDocs=44421)
                0.03125 = fieldNorm(doc=2211)
          0.03127328 = weight(abstract_txt:method in 2211) [ClassicSimilarity], result of:
            0.03127328 = score(doc=2211,freq=6.0), product of:
              0.0908135 = queryWeight, product of:
                1.2818325 = boost
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.015747871 = queryNorm
              0.34436816 = fieldWeight in 2211, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.03125 = fieldNorm(doc=2211)
          0.018015845 = weight(abstract_txt:test in 2211) [ClassicSimilarity], result of:
            0.018015845 = score(doc=2211,freq=1.0), product of:
              0.114249684 = queryWeight, product of:
                1.4377506 = boost
                5.046027 = idf(docFreq=776, maxDocs=44421)
                0.015747871 = queryNorm
              0.15768835 = fieldWeight in 2211, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.046027 = idf(docFreq=776, maxDocs=44421)
                0.03125 = fieldNorm(doc=2211)
          0.03931946 = weight(abstract_txt:documents in 2211) [ClassicSimilarity], result of:
            0.03931946 = score(doc=2211,freq=4.0), product of:
              0.15257391 = queryWeight, product of:
                2.3496933 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.015747871 = queryNorm
              0.25770763 = fieldWeight in 2211, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.03125 = fieldNorm(doc=2211)
        0.44 = coord(11/25)
    
  3. Lawrence, S.; Giles, C.L.: Inquirus, the NECI meta search engine (1998) 0.13
    0.12631224 = sum of:
      0.12631224 = product of:
        0.6315612 = sum of:
          0.029037612 = weight(abstract_txt:search in 4604) [ClassicSimilarity], result of:
            0.029037612 = score(doc=4604,freq=2.0), product of:
              0.059928723 = queryWeight, product of:
                1.0412945 = boost
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.015747871 = queryNorm
              0.4845358 = fieldWeight in 4604, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.09375 = fieldNorm(doc=4604)
          0.023192324 = weight(abstract_txt:both in 4604) [ClassicSimilarity], result of:
            0.023192324 = score(doc=4604,freq=1.0), product of:
              0.06499809 = queryWeight, product of:
                1.084442 = boost
                3.8060317 = idf(docFreq=2684, maxDocs=44421)
                0.015747871 = queryNorm
              0.35681546 = fieldWeight in 4604, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8060317 = idf(docFreq=2684, maxDocs=44421)
                0.09375 = fieldNorm(doc=4604)
          0.083448336 = weight(abstract_txt:identification in 4604) [ClassicSimilarity], result of:
            0.083448336 = score(doc=4604,freq=1.0), product of:
              0.15262167 = queryWeight, product of:
                1.6617441 = boost
                5.8321705 = idf(docFreq=353, maxDocs=44421)
                0.015747871 = queryNorm
              0.546766 = fieldWeight in 4604, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8321705 = idf(docFreq=353, maxDocs=44421)
                0.09375 = fieldNorm(doc=4604)
          0.058979195 = weight(abstract_txt:documents in 4604) [ClassicSimilarity], result of:
            0.058979195 = score(doc=4604,freq=1.0), product of:
              0.15257391 = queryWeight, product of:
                2.3496933 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.015747871 = queryNorm
              0.38656145 = fieldWeight in 4604, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.09375 = fieldNorm(doc=4604)
          0.43690374 = weight(abstract_txt:duplicate in 4604) [ClassicSimilarity], result of:
            0.43690374 = score(doc=4604,freq=1.0), product of:
              0.5797912 = queryWeight, product of:
                4.580436 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.015747871 = queryNorm
              0.7535536 = fieldWeight in 4604, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.09375 = fieldNorm(doc=4604)
        0.2 = coord(5/25)
    
  4. Yu, L.-C.; Wu, C.-H.; Chang, R.-Y.; Liu, C.-H.; Hovy, E.H.: Annotation and verification of sense pools in OntoNotes (2010) 0.12
    0.11502812 = sum of:
      0.11502812 = product of:
        0.5751406 = sum of:
          0.051069047 = weight(abstract_txt:method in 236) [ClassicSimilarity], result of:
            0.051069047 = score(doc=236,freq=4.0), product of:
              0.0908135 = queryWeight, product of:
                1.2818325 = boost
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.015747871 = queryNorm
              0.5623508 = fieldWeight in 236, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.0625 = fieldNorm(doc=236)
          0.050956503 = weight(abstract_txt:test in 236) [ClassicSimilarity], result of:
            0.050956503 = score(doc=236,freq=2.0), product of:
              0.114249684 = queryWeight, product of:
                1.4377506 = boost
                5.046027 = idf(docFreq=776, maxDocs=44421)
                0.015747871 = queryNorm
              0.44601 = fieldWeight in 236, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.046027 = idf(docFreq=776, maxDocs=44421)
                0.0625 = fieldNorm(doc=236)
          0.017707516 = weight(abstract_txt:results in 236) [ClassicSimilarity], result of:
            0.017707516 = score(doc=236,freq=1.0), product of:
              0.08144556 = queryWeight, product of:
                1.4867413 = boost
                3.4786456 = idf(docFreq=3724, maxDocs=44421)
                0.015747871 = queryNorm
              0.21741535 = fieldWeight in 236, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4786456 = idf(docFreq=3724, maxDocs=44421)
                0.0625 = fieldNorm(doc=236)
          0.16413838 = weight(abstract_txt:near in 236) [ClassicSimilarity], result of:
            0.16413838 = score(doc=236,freq=3.0), product of:
              0.21768656 = queryWeight, product of:
                1.9845948 = boost
                6.965269 = idf(docFreq=113, maxDocs=44421)
                0.015747871 = queryNorm
              0.75401247 = fieldWeight in 236, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.965269 = idf(docFreq=113, maxDocs=44421)
                0.0625 = fieldNorm(doc=236)
          0.29126918 = weight(abstract_txt:duplicate in 236) [ClassicSimilarity], result of:
            0.29126918 = score(doc=236,freq=1.0), product of:
              0.5797912 = queryWeight, product of:
                4.580436 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.015747871 = queryNorm
              0.5023691 = fieldWeight in 236, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.0625 = fieldNorm(doc=236)
        0.2 = coord(5/25)
    
  5. Diodato, V.: User preferences for features in back of the book indexes (1994) 0.11
    0.11175446 = sum of:
      0.11175446 = product of:
        0.6984654 = sum of:
          0.019096052 = weight(abstract_txt:users in 7761) [ClassicSimilarity], result of:
            0.019096052 = score(doc=7761,freq=1.0), product of:
              0.057099655 = queryWeight, product of:
                1.016419 = boost
                3.5672934 = idf(docFreq=3408, maxDocs=44421)
                0.015747871 = queryNorm
              0.33443376 = fieldWeight in 7761, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.5672934 = idf(docFreq=3408, maxDocs=44421)
                0.09375 = fieldNorm(doc=7761)
          0.023192324 = weight(abstract_txt:both in 7761) [ClassicSimilarity], result of:
            0.023192324 = score(doc=7761,freq=1.0), product of:
              0.06499809 = queryWeight, product of:
                1.084442 = boost
                3.8060317 = idf(docFreq=2684, maxDocs=44421)
                0.015747871 = queryNorm
              0.35681546 = fieldWeight in 7761, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8060317 = idf(docFreq=2684, maxDocs=44421)
                0.09375 = fieldNorm(doc=7761)
          0.03830179 = weight(abstract_txt:method in 7761) [ClassicSimilarity], result of:
            0.03830179 = score(doc=7761,freq=1.0), product of:
              0.0908135 = queryWeight, product of:
                1.2818325 = boost
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.015747871 = queryNorm
              0.42176312 = fieldWeight in 7761, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.09375 = fieldNorm(doc=7761)
          0.6178752 = weight(abstract_txt:duplicate in 7761) [ClassicSimilarity], result of:
            0.6178752 = score(doc=7761,freq=2.0), product of:
              0.5797912 = queryWeight, product of:
                4.580436 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.015747871 = queryNorm
              1.0656857 = fieldWeight in 7761, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.09375 = fieldNorm(doc=7761)
        0.16 = coord(4/25)
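
The score trees above can be checked by hand. The following is a minimal sketch, not the catalog's actual implementation. It assumes the classic Lucene TF-IDF formulas, which appear consistent with the numbers shown: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost * idf * queryNorm, and fieldWeight = tf * idf * fieldNorm. The helper names `idf` and `term_score` are hypothetical. The values for boost, queryNorm, and fieldNorm are copied from the tree for entry 1 (doc 669) rather than derived.

```python
import math

def idf(doc_freq, max_docs):
    # Assumed classic formula: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # score = queryWeight * fieldWeight, as in each "product of" node
    i = idf(doc_freq, max_docs)
    query_weight = boost * i * query_norm
    field_weight = math.sqrt(freq) * i * field_norm  # tf = sqrt(freq)
    return query_weight * field_weight

# Constants copied from the explain output for entry 1 (doc 669)
MAX_DOCS = 44421
QUERY_NORM = 0.015747871
FIELD_NORM = 0.125

# (termFreq, docFreq, boost) for the matching terms: users, duplicates, duplicate
terms = [
    (1.0, 3408, 1.016419),
    (1.0, 19, 3.7207515),
    (1.0, 38, 4.580436),
]

raw = sum(
    term_score(f, df, MAX_DOCS, b, QUERY_NORM, FIELD_NORM)
    for f, df, b in terms
)
coord = 3 / 25  # 3 of the 25 query terms matched
score = raw * coord
print(round(score, 4))
```

Under these assumptions the result rounds to about 0.1396, in line with the 0.13957231 reported for entry 1. Small differences in the last digits are expected because the catalog's values are single-precision.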