Document (#11498)

Author
Kirriemuir, J.W.
Willett, P.
Title
Identification of duplicate and near-duplicate full-text records in database search-outputs using hierarchic cluster analysis
Source
Program. 29(1995) no.3, pp.241-256
Year
1995
Abstract
Clustering the output of a multi-database online search enables users to obtain an overview of the information that has been retrieved without the need to inspect any documents that contain only redundant information. Describes a classification scheme that characterizes the degree of relationship between pairs of documents in database search outputs and then reports the application of a range of clustering methods and similarity coefficients to 20 such outputs. Results indicate that clustering is capable of grouping documents in the search output on the basis of their term similarities.

Similar documents (content)

  1. Tombros, A.; Villa, R.; Rijsbergen, C.J. Van: The effectiveness of query-specific hierarchic clustering in information retrieval (2002) 0.21
    0.20925832 = sum of:
      0.20925832 = product of:
        1.3078645 = sum of:
          0.02363212 = weight(abstract_txt:that in 3586) [ClassicSimilarity], result of:
            0.02363212 = score(doc=3586,freq=4.0), product of:
              0.06395337 = queryWeight, product of:
                1.6259031 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.016632194 = queryNorm
              0.3695211 = fieldWeight in 3586, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.078125 = fieldNorm(doc=3586)
          0.7958195 = weight(title_txt:hierarchic in 3586) [ClassicSimilarity], result of:
            0.7958195 = score(doc=3586,freq=1.0), product of:
              0.26466593 = queryWeight, product of:
                1.6537962 = boost
                9.622026 = idf(docFreq=7, maxDocs=44421)
                0.016632194 = queryNorm
              3.0068831 = fieldWeight in 3586, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.622026 = idf(docFreq=7, maxDocs=44421)
                0.3125 = fieldNorm(doc=3586)
          0.061666436 = weight(abstract_txt:search in 3586) [ClassicSimilarity], result of:
            0.061666436 = score(doc=3586,freq=2.0), product of:
              0.15272291 = queryWeight, product of:
                2.512552 = boost
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.016632194 = queryNorm
              0.40377986 = fieldWeight in 3586, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.078125 = fieldNorm(doc=3586)
          0.4267465 = weight(abstract_txt:clustering in 3586) [ClassicSimilarity], result of:
            0.4267465 = score(doc=3586,freq=7.0), product of:
              0.33188123 = queryWeight, product of:
                3.2076347 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.016632194 = queryNorm
              1.285841 = fieldWeight in 3586, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.078125 = fieldNorm(doc=3586)
        0.16 = coord(4/25)
    
  2. Kishida, K.: High-speed rough clustering for very large document collections (2010) 0.15
    0.15136215 = sum of:
      0.15136215 = product of:
        0.6306757 = sum of:
          0.044745553 = weight(abstract_txt:obtain in 450) [ClassicSimilarity], result of:
            0.044745553 = score(doc=450,freq=1.0), product of:
              0.11357992 = queryWeight, product of:
                1.0833873 = boost
                6.3033047 = idf(docFreq=220, maxDocs=44421)
                0.016632194 = queryNorm
              0.39395654 = fieldWeight in 450, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3033047 = idf(docFreq=220, maxDocs=44421)
                0.0625 = fieldNorm(doc=450)
          0.050165597 = weight(abstract_txt:cluster in 450) [ClassicSimilarity], result of:
            0.050165597 = score(doc=450,freq=1.0), product of:
              0.12257605 = queryWeight, product of:
                1.1254748 = boost
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.016632194 = queryNorm
              0.409261 = fieldWeight in 450, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.0625 = fieldNorm(doc=450)
          0.11591583 = weight(abstract_txt:grouping in 450) [ClassicSimilarity], result of:
            0.11591583 = score(doc=450,freq=2.0), product of:
              0.17004094 = queryWeight, product of:
                1.3255914 = boost
                7.7124834 = idf(docFreq=53, maxDocs=44421)
                0.016632194 = queryNorm
              0.6816937 = fieldWeight in 450, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.7124834 = idf(docFreq=53, maxDocs=44421)
                0.0625 = fieldNorm(doc=450)
          0.013368347 = weight(abstract_txt:that in 450) [ClassicSimilarity], result of:
            0.013368347 = score(doc=450,freq=2.0), product of:
              0.06395337 = queryWeight, product of:
                1.6259031 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.016632194 = queryNorm
              0.20903271 = fieldWeight in 450, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=450)
          0.06508316 = weight(abstract_txt:documents in 450) [ClassicSimilarity], result of:
            0.06508316 = score(doc=450,freq=3.0), product of:
              0.14580779 = queryWeight, product of:
                2.1261013 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.016632194 = queryNorm
              0.4463627 = fieldWeight in 450, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=450)
          0.3413972 = weight(abstract_txt:clustering in 450) [ClassicSimilarity], result of:
            0.3413972 = score(doc=450,freq=7.0), product of:
              0.33188123 = queryWeight, product of:
                3.2076347 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.016632194 = queryNorm
              1.0286728 = fieldWeight in 450, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.0625 = fieldNorm(doc=450)
        0.24 = coord(6/25)
    
  3. Na, S.-H.; Kang, I.-S.; Lee, J.-H.: Adaptive document clustering based on query-based similarity (2007) 0.14
    0.14064713 = sum of:
      0.14064713 = product of:
        0.5860297 = sum of:
          0.060947977 = weight(abstract_txt:similarity in 1920) [ClassicSimilarity], result of:
            0.060947977 = score(doc=1920,freq=3.0), product of:
              0.09676852 = queryWeight, product of:
                5.8181453 = idf(docFreq=358, maxDocs=44421)
                0.016632194 = queryNorm
              0.6298327 = fieldWeight in 1920, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.8181453 = idf(docFreq=358, maxDocs=44421)
                0.0625 = fieldNorm(doc=1920)
          0.07094487 = weight(abstract_txt:cluster in 1920) [ClassicSimilarity], result of:
            0.07094487 = score(doc=1920,freq=2.0), product of:
              0.12257605 = queryWeight, product of:
                1.1254748 = boost
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.016632194 = queryNorm
              0.57878244 = fieldWeight in 1920, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.0625 = fieldNorm(doc=1920)
          0.08712034 = weight(abstract_txt:similarities in 1920) [ClassicSimilarity], result of:
            0.08712034 = score(doc=1920,freq=3.0), product of:
              0.12279318 = queryWeight, product of:
                1.1264712 = boost
                6.553973 = idf(docFreq=171, maxDocs=44421)
                0.016632194 = queryNorm
              0.7094884 = fieldWeight in 1920, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.553973 = idf(docFreq=171, maxDocs=44421)
                0.0625 = fieldNorm(doc=1920)
          0.013368347 = weight(abstract_txt:that in 1920) [ClassicSimilarity], result of:
            0.013368347 = score(doc=1920,freq=2.0), product of:
              0.06395337 = queryWeight, product of:
                1.6259031 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.016632194 = queryNorm
              0.20903271 = fieldWeight in 1920, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=1920)
          0.037575778 = weight(abstract_txt:documents in 1920) [ClassicSimilarity], result of:
            0.037575778 = score(doc=1920,freq=1.0), product of:
              0.14580779 = queryWeight, product of:
                2.1261013 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.016632194 = queryNorm
              0.25770763 = fieldWeight in 1920, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=1920)
          0.3160724 = weight(abstract_txt:clustering in 1920) [ClassicSimilarity], result of:
            0.3160724 = score(doc=1920,freq=6.0), product of:
              0.33188123 = queryWeight, product of:
                3.2076347 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.016632194 = queryNorm
              0.952366 = fieldWeight in 1920, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.0625 = fieldNorm(doc=1920)
        0.24 = coord(6/25)
    
  4. Conrad, J.G.; Schriber, C.P.: Managing déjà vu : collection building for the identification of nonidentical duplicate documents (2006) 0.14
    0.13916421 = sum of:
      0.13916421 = product of:
        0.69582105 = sum of:
          0.044304278 = weight(abstract_txt:identification in 59) [ClassicSimilarity], result of:
            0.044304278 = score(doc=59,freq=1.0), product of:
              0.09723563 = queryWeight, product of:
                1.0024107 = boost
                5.8321705 = idf(docFreq=353, maxDocs=44421)
                0.016632194 = queryNorm
              0.45563832 = fieldWeight in 59, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8321705 = idf(docFreq=353, maxDocs=44421)
                0.078125 = fieldNorm(doc=59)
          0.10672931 = weight(abstract_txt:near in 59) [ClassicSimilarity], result of:
            0.10672931 = score(doc=59,freq=2.0), product of:
              0.1386886 = queryWeight, product of:
                1.1971631 = boost
                6.965269 = idf(docFreq=113, maxDocs=44421)
                0.016632194 = queryNorm
              0.7695608 = fieldWeight in 59, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.965269 = idf(docFreq=113, maxDocs=44421)
                0.078125 = fieldNorm(doc=59)
          0.08135395 = weight(abstract_txt:documents in 59) [ClassicSimilarity], result of:
            0.08135395 = score(doc=59,freq=3.0), product of:
              0.14580779 = queryWeight, product of:
                2.1261013 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.016632194 = queryNorm
              0.55795336 = fieldWeight in 59, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.078125 = fieldNorm(doc=59)
          0.061666436 = weight(abstract_txt:search in 59) [ClassicSimilarity], result of:
            0.061666436 = score(doc=59,freq=2.0), product of:
              0.15272291 = queryWeight, product of:
                2.512552 = boost
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.016632194 = queryNorm
              0.40377986 = fieldWeight in 59, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.654598 = idf(docFreq=3123, maxDocs=44421)
                0.078125 = fieldNorm(doc=59)
          0.40176708 = weight(abstract_txt:duplicate in 59) [ClassicSimilarity], result of:
            0.40176708 = score(doc=59,freq=3.0), product of:
              0.3693863 = queryWeight, product of:
                2.7630475 = boost
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.016632194 = queryNorm
              1.087661 = fieldWeight in 59, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.037906 = idf(docFreq=38, maxDocs=44421)
                0.078125 = fieldNorm(doc=59)
        0.2 = coord(5/25)
    
  5. Hu, G.; Zhou, S.; Guan, J.; Hu, X.: Towards effective document clustering : a constrained K-means based approach (2008) 0.12
    0.123881005 = sum of:
      0.123881005 = product of:
        0.61940503 = sum of:
          0.1064173 = weight(abstract_txt:cluster in 3113) [ClassicSimilarity], result of:
            0.1064173 = score(doc=3113,freq=2.0), product of:
              0.12257605 = queryWeight, product of:
                1.1254748 = boost
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.016632194 = queryNorm
              0.86817366 = fieldWeight in 3113, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.09375 = fieldNorm(doc=3113)
          0.08385283 = weight(abstract_txt:pairs in 3113) [ClassicSimilarity], result of:
            0.08385283 = score(doc=3113,freq=1.0), product of:
              0.1317506 = queryWeight, product of:
                1.1668345 = boost
                6.7888126 = idf(docFreq=135, maxDocs=44421)
                0.016632194 = queryNorm
              0.6364512 = fieldWeight in 3113, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.7888126 = idf(docFreq=135, maxDocs=44421)
                0.09375 = fieldNorm(doc=3113)
          0.014179273 = weight(abstract_txt:that in 3113) [ClassicSimilarity], result of:
            0.014179273 = score(doc=3113,freq=1.0), product of:
              0.06395337 = queryWeight, product of:
                1.6259031 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.016632194 = queryNorm
              0.22171268 = fieldWeight in 3113, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.09375 = fieldNorm(doc=3113)
          0.07971027 = weight(abstract_txt:documents in 3113) [ClassicSimilarity], result of:
            0.07971027 = score(doc=3113,freq=2.0), product of:
              0.14580779 = queryWeight, product of:
                2.1261013 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.016632194 = queryNorm
              0.54668045 = fieldWeight in 3113, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.09375 = fieldNorm(doc=3113)
          0.33524537 = weight(abstract_txt:clustering in 3113) [ClassicSimilarity], result of:
            0.33524537 = score(doc=3113,freq=3.0), product of:
              0.33188123 = queryWeight, product of:
                3.2076347 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.016632194 = queryNorm
              1.0101366 = fieldWeight in 3113, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.09375 = fieldNorm(doc=3113)
        0.2 = coord(5/25)
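
The score breakdowns above can be checked by hand. The following is a minimal sketch in Python, assuming that tf is the square root of the term frequency and that idf is 1 + ln(maxDocs / (docFreq + 1)). Both assumptions are consistent with the values printed in the listing (for example, docFreq=7 gives 9.622026), but they are inferred from the numbers rather than stated by the source. The function and variable names are illustrative only. The sketch recomputes the score of the first hit (doc 3586) from its four matching terms, then applies the coord factor to the printed term-score sums of all five hits.

```python
import math

MAX_DOCS = 44421          # maxDocs, as printed in every idf(...) line
QUERY_NORM = 0.016632194  # queryNorm, as printed in every queryWeight block
QUERY_TERMS = 25          # denominator of every coord(n/25) line


def idf(doc_freq, max_docs=MAX_DOCS):
    # Assumed form, inferred from the printed idf values.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm):
    # score = queryWeight * fieldWeight, following the nesting of the listing.
    tf = math.sqrt(freq)
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    field_weight = tf * idf(doc_freq) * field_norm
    return query_weight * field_weight


# Hit 1 (doc 3586): term -> (freq, docFreq, boost, fieldNorm), copied from the listing.
doc_3586 = {
    "that": (4.0, 11344, 1.6259031, 0.078125),
    "hierarchic": (1.0, 7, 1.6537962, 0.3125),
    "search": (2.0, 3123, 2.512552, 0.078125),
    "clustering": (7.0, 239, 3.2076347, 0.078125),
}
scores_3586 = {term: term_score(*args) for term, args in doc_3586.items()}
# coord(4/25): four of the 25 query terms matched this document.
total_3586 = sum(scores_3586.values()) * len(doc_3586) / QUERY_TERMS

# All five hits: (doc id, printed sum of term scores, number of matched terms).
hits = [
    (3586, 1.3078645, 4),
    (450, 0.6306757, 6),
    (1920, 0.5860297, 6),
    (59, 0.69582105, 5),
    (3113, 0.61940503, 5),
]
by_final = sorted(hits, key=lambda h: h[1] * h[2] / QUERY_TERMS, reverse=True)
by_raw_sum = sorted(hits, key=lambda h: h[1], reverse=True)

print(round(total_3586, 4))
print([doc for doc, _, _ in by_final])
print([doc for doc, _, _ in by_raw_sum])
```

The recomputed total for doc 3586 agrees with the printed 0.20925832 to within rounding. Small differences in the last digits are expected because the sketch uses double precision, while the listed values look like single-precision results. Comparing the two orderings shows the effect of coord: doc 59 has the second-largest raw term sum, but it matches only 5 of 25 query terms and so ranks below docs 450 and 1920, which match 6.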