Document (#35464)

Kishida, K.
High-speed rough clustering for very large document collections
Journal of the American Society for Information Science and Technology. 61(2010) no.6, pp. 1092-1104
Document clustering is an important tool, but it is not yet widely used in practice, probably because of its high computational complexity. This article explores techniques for high-speed rough clustering of documents, on the assumption that it is sometimes necessary to obtain a clustering result quickly, even if that result is only an approximate outline of the document clusters. A promising approach is to reduce the number of documents that must be checked when generating cluster vectors in the leader-follower clustering algorithm. Based on this idea, the article proposes a modified Crouch algorithm and an incomplete single-pass leader-follower algorithm. It also develops a two-stage grouping technique, in which the first stage applies a quick merging technique to decrease the number of documents to be processed in the second stage. An experiment using part of the Reuters corpus RCV1 showed empirically that both the modified Crouch and the incomplete single-pass leader-follower algorithms produce clustering results more efficiently than the original methods and also improve the effectiveness of those results. The two-stage grouping technique, on the other hand, did not reduce the processing time in this experiment.
Automatic classification
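
The abstract above builds on the leader-follower clustering algorithm. As a rough illustration only, the following is a minimal sketch of the basic single-pass leader-follower idea, not the modified Crouch or incomplete single-pass variants proposed in the article. It assumes documents are sparse term-weight dictionaries, cosine similarity as the measure, and a fixed similarity threshold; the names `cosine`, `leader_follower`, and `threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leader_follower(docs, threshold):
    """Assign each document, in one pass, to the most similar cluster.

    A document that is not similar enough to any existing cluster
    vector becomes the leader of a new cluster. Cluster vectors are
    kept as summed term weights; cosine similarity ignores the scale,
    so the sum behaves like a centroid here.
    """
    cluster_vectors = []
    labels = []
    for doc in docs:
        best, best_sim = -1, threshold
        for i, vec in enumerate(cluster_vectors):
            sim = cosine(doc, vec)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            # No cluster reached the threshold: start a new one.
            cluster_vectors.append(dict(doc))
            best = len(cluster_vectors) - 1
        else:
            # Follower: fold the document into the cluster vector.
            for term, weight in doc.items():
                cluster_vectors[best][term] = cluster_vectors[best].get(term, 0.0) + weight
        labels.append(best)
    return labels, cluster_vectors


# Toy input (made up for illustration, not taken from RCV1).
toy_docs = [
    {"cluster": 1.0, "document": 1.0},
    {"cluster": 1.0, "document": 2.0},
    {"stock": 1.0, "market": 1.0},
    {"market": 2.0, "stock": 1.0},
]
toy_labels, toy_vectors = leader_follower(toy_docs, threshold=0.5)
```

Every document is compared against every existing cluster vector, which is the cost the article aims to cut by checking fewer documents when generating those vectors.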

Similar documents (content)

  1. Zamir, O.; Etzioni, O.: Grouper : a dynamic clustering interface to Web search results (1999) 0.23
    0.23392819 = sum of:
      0.23392819 = product of:
        0.8354578 = sum of:
          0.019844418 = weight(abstract_txt:time in 207) [ClassicSimilarity], result of:
            0.019844418 = score(doc=207,freq=1.0), product of:
              0.06122598 = queryWeight, product of:
                1.0652121 = boost
                4.1487055 = idf(docFreq=1905, maxDocs=44421)
                0.013854378 = queryNorm
              0.3241176 = fieldWeight in 207, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1487055 = idf(docFreq=1905, maxDocs=44421)
                0.078125 = fieldNorm(doc=207)
          0.029223593 = weight(abstract_txt:documents in 207) [ClassicSimilarity], result of:
            0.029223593 = score(doc=207,freq=1.0), product of:
              0.0907186 = queryWeight, product of:
                1.5880423 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.013854378 = queryNorm
              0.32213452 = fieldWeight in 207, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.078125 = fieldNorm(doc=207)
          0.04668081 = weight(abstract_txt:document in 207) [ClassicSimilarity], result of:
            0.04668081 = score(doc=207,freq=2.0), product of:
              0.09839118 = queryWeight, product of:
                1.6538342 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.013854378 = queryNorm
              0.47444102 = fieldWeight in 207, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.078125 = fieldNorm(doc=207)
          0.079515904 = weight(abstract_txt:speed in 207) [ClassicSimilarity], result of:
            0.079515904 = score(doc=207,freq=1.0), product of:
              0.15445887 = queryWeight, product of:
                1.6918999 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.013854378 = queryNorm
              0.5148031 = fieldWeight in 207, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.078125 = fieldNorm(doc=207)
          0.047565598 = weight(abstract_txt:high in 207) [ClassicSimilarity], result of:
            0.047565598 = score(doc=207,freq=1.0), product of:
              0.12552664 = queryWeight, product of:
                1.8680212 = boost
                4.8502827 = idf(docFreq=944, maxDocs=44421)
                0.013854378 = queryNorm
              0.37892833 = fieldWeight in 207, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8502827 = idf(docFreq=944, maxDocs=44421)
                0.078125 = fieldNorm(doc=207)
          0.07740381 = weight(abstract_txt:algorithm in 207) [ClassicSimilarity], result of:
            0.07740381 = score(doc=207,freq=1.0), product of:
              0.17366628 = queryWeight, product of:
                2.1972103 = boost
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.013854378 = queryNorm
              0.44570434 = fieldWeight in 207, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.078125 = fieldNorm(doc=207)
          0.53522366 = weight(abstract_txt:clustering in 207) [ClassicSimilarity], result of:
            0.53522366 = score(doc=207,freq=4.0), product of:
              0.5506391 = queryWeight, product of:
                6.3889832 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.013854378 = queryNorm
              0.9720045 = fieldWeight in 207, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.078125 = fieldNorm(doc=207)
        0.28 = coord(7/25)
  2. Lee, Y.-H.; Wei, C.-P.; Hu, P.J.-H.: An ontology-based technique for preserving user preferences in document-category evolutions (2011) 0.20
    0.19866928 = sum of:
      0.19866928 = product of:
        0.70953315 = sum of:
          0.045971435 = weight(abstract_txt:vectors in 353) [ClassicSimilarity], result of:
            0.045971435 = score(doc=353,freq=1.0), product of:
              0.107917905 = queryWeight, product of:
                7.7894444 = idf(docFreq=49, maxDocs=44421)
                0.013854378 = queryNorm
              0.42598525 = fieldWeight in 353, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.7894444 = idf(docFreq=49, maxDocs=44421)
                0.0546875 = fieldNorm(doc=353)
          0.013891093 = weight(abstract_txt:time in 353) [ClassicSimilarity], result of:
            0.013891093 = score(doc=353,freq=1.0), product of:
              0.06122598 = queryWeight, product of:
                1.0652121 = boost
                4.1487055 = idf(docFreq=1905, maxDocs=44421)
                0.013854378 = queryNorm
              0.22688234 = fieldWeight in 353, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1487055 = idf(docFreq=1905, maxDocs=44421)
                0.0546875 = fieldNorm(doc=353)
          0.035431724 = weight(abstract_txt:documents in 353) [ClassicSimilarity], result of:
            0.035431724 = score(doc=353,freq=3.0), product of:
              0.0907186 = queryWeight, product of:
                1.5880423 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.013854378 = queryNorm
              0.39056736 = fieldWeight in 353, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0546875 = fieldNorm(doc=353)
          0.06535314 = weight(abstract_txt:document in 353) [ClassicSimilarity], result of:
            0.06535314 = score(doc=353,freq=8.0), product of:
              0.09839118 = queryWeight, product of:
                1.6538342 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.013854378 = queryNorm
              0.6642174 = fieldWeight in 353, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0546875 = fieldNorm(doc=353)
          0.08924447 = weight(abstract_txt:grouping in 353) [ClassicSimilarity], result of:
            0.08924447 = score(doc=353,freq=1.0), product of:
              0.21159188 = queryWeight, product of:
                1.9802396 = boost
                7.7124834 = idf(docFreq=53, maxDocs=44421)
                0.013854378 = queryNorm
              0.42177644 = fieldWeight in 353, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.7124834 = idf(docFreq=53, maxDocs=44421)
                0.0546875 = fieldNorm(doc=353)
          0.1351792 = weight(abstract_txt:technique in 353) [ClassicSimilarity], result of:
            0.1351792 = score(doc=353,freq=7.0), product of:
              0.16699974 = queryWeight, product of:
                2.1546254 = boost
                5.5944448 = idf(docFreq=448, maxDocs=44421)
                0.013854378 = queryNorm
              0.80945754 = fieldWeight in 353, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                5.5944448 = idf(docFreq=448, maxDocs=44421)
                0.0546875 = fieldNorm(doc=353)
          0.3244621 = weight(abstract_txt:clustering in 353) [ClassicSimilarity], result of:
            0.3244621 = score(doc=353,freq=3.0), product of:
              0.5506391 = queryWeight, product of:
                6.3889832 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.013854378 = queryNorm
              0.5892464 = fieldWeight in 353, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.0546875 = fieldNorm(doc=353)
        0.28 = coord(7/25)
  3. Zhan, J.; Loh, H.T.: Using latent semantic indexing to improve the accuracy of document clustering (2007) 0.18
    0.18209358 = sum of:
      0.18209358 = product of:
        0.91046786 = sum of:
          0.019844418 = weight(abstract_txt:time in 1264) [ClassicSimilarity], result of:
            0.019844418 = score(doc=1264,freq=1.0), product of:
              0.06122598 = queryWeight, product of:
                1.0652121 = boost
                4.1487055 = idf(docFreq=1905, maxDocs=44421)
                0.013854378 = queryNorm
              0.3241176 = fieldWeight in 1264, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1487055 = idf(docFreq=1905, maxDocs=44421)
                0.078125 = fieldNorm(doc=1264)
          0.0690069 = weight(abstract_txt:reduce in 1264) [ClassicSimilarity], result of:
            0.0690069 = score(doc=1264,freq=1.0), product of:
              0.1405309 = queryWeight, product of:
                1.6138165 = boost
                6.285367 = idf(docFreq=224, maxDocs=44421)
                0.013854378 = queryNorm
              0.49104428 = fieldWeight in 1264, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.285367 = idf(docFreq=224, maxDocs=44421)
                0.078125 = fieldNorm(doc=1264)
          0.06601664 = weight(abstract_txt:document in 1264) [ClassicSimilarity], result of:
            0.06601664 = score(doc=1264,freq=4.0), product of:
              0.09839118 = queryWeight, product of:
                1.6538342 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.013854378 = queryNorm
              0.6709609 = fieldWeight in 1264, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.078125 = fieldNorm(doc=1264)
          0.047565598 = weight(abstract_txt:high in 1264) [ClassicSimilarity], result of:
            0.047565598 = score(doc=1264,freq=1.0), product of:
              0.12552664 = queryWeight, product of:
                1.8680212 = boost
                4.8502827 = idf(docFreq=944, maxDocs=44421)
                0.013854378 = queryNorm
              0.37892833 = fieldWeight in 1264, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8502827 = idf(docFreq=944, maxDocs=44421)
                0.078125 = fieldNorm(doc=1264)
          0.70803434 = weight(abstract_txt:clustering in 1264) [ClassicSimilarity], result of:
            0.70803434 = score(doc=1264,freq=7.0), product of:
              0.5506391 = queryWeight, product of:
                6.3889832 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.013854378 = queryNorm
              1.285841 = fieldWeight in 1264, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.078125 = fieldNorm(doc=1264)
        0.2 = coord(5/25)
  4. Mu, T.; Goulermas, J.Y.; Korkontzelos, I.; Ananiadou, S.: Descriptive document clustering via discriminant learning in a co-embedded space of multilevel similarities (2016) 0.16
    0.15923369 = sum of:
      0.15923369 = product of:
        0.6634737 = sum of:
          0.05469968 = weight(abstract_txt:approximate in 3496) [ClassicSimilarity], result of:
            0.05469968 = score(doc=3496,freq=1.0), product of:
              0.11085706 = queryWeight, product of:
                1.0135261 = boost
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.013854378 = queryNorm
              0.4934253 = fieldWeight in 3496, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.0625 = fieldNorm(doc=3496)
          0.061854687 = weight(abstract_txt:documents in 3496) [ClassicSimilarity], result of:
            0.061854687 = score(doc=3496,freq=7.0), product of:
              0.0907186 = queryWeight, product of:
                1.5880423 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.013854378 = queryNorm
              0.6818303 = fieldWeight in 3496, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=3496)
          0.055205517 = weight(abstract_txt:reduce in 3496) [ClassicSimilarity], result of:
            0.055205517 = score(doc=3496,freq=1.0), product of:
              0.1405309 = queryWeight, product of:
                1.6138165 = boost
                6.285367 = idf(docFreq=224, maxDocs=44421)
                0.013854378 = queryNorm
              0.39283544 = fieldWeight in 3496, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.285367 = idf(docFreq=224, maxDocs=44421)
                0.0625 = fieldNorm(doc=3496)
          0.045737665 = weight(abstract_txt:document in 3496) [ClassicSimilarity], result of:
            0.045737665 = score(doc=3496,freq=3.0), product of:
              0.09839118 = queryWeight, product of:
                1.6538342 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.013854378 = queryNorm
              0.46485534 = fieldWeight in 3496, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0625 = fieldNorm(doc=3496)
          0.143208 = weight(abstract_txt:stage in 3496) [ClassicSimilarity], result of:
            0.143208 = score(doc=3496,freq=2.0), product of:
              0.26531494 = queryWeight, product of:
                3.1359136 = boost
                6.106756 = idf(docFreq=268, maxDocs=44421)
                0.013854378 = queryNorm
              0.5397661 = fieldWeight in 3496, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.106756 = idf(docFreq=268, maxDocs=44421)
                0.0625 = fieldNorm(doc=3496)
          0.3027682 = weight(abstract_txt:clustering in 3496) [ClassicSimilarity], result of:
            0.3027682 = score(doc=3496,freq=2.0), product of:
              0.5506391 = queryWeight, product of:
                6.3889832 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.013854378 = queryNorm
              0.54984874 = fieldWeight in 3496, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.0625 = fieldNorm(doc=3496)
        0.24 = coord(6/25)
  5. Guerrero, V.P.; Moya Anegón, F. de: Reduction of the dimension of a document space using the fuzzified output of a Kohonen network (2001) 0.16
    0.15737586 = sum of:
      0.15737586 = product of:
        0.65573275 = sum of:
          0.11145159 = weight(abstract_txt:vectors in 935) [ClassicSimilarity], result of:
            0.11145159 = score(doc=935,freq=2.0), product of:
              0.107917905 = queryWeight, product of:
                7.7894444 = idf(docFreq=49, maxDocs=44421)
                0.013854378 = queryNorm
              1.0327442 = fieldWeight in 935, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.7894444 = idf(docFreq=49, maxDocs=44421)
                0.09375 = fieldNorm(doc=935)
          0.023589617 = weight(abstract_txt:number in 935) [ClassicSimilarity], result of:
            0.023589617 = score(doc=935,freq=1.0), product of:
              0.06084197 = queryWeight, product of:
                1.0618664 = boost
                4.1356745 = idf(docFreq=1930, maxDocs=44421)
                0.013854378 = queryNorm
              0.38771948 = fieldWeight in 935, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1356745 = idf(docFreq=1930, maxDocs=44421)
                0.09375 = fieldNorm(doc=935)
          0.049594082 = weight(abstract_txt:documents in 935) [ClassicSimilarity], result of:
            0.049594082 = score(doc=935,freq=2.0), product of:
              0.0907186 = queryWeight, product of:
                1.5880423 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.013854378 = queryNorm
              0.54668045 = fieldWeight in 935, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.09375 = fieldNorm(doc=935)
          0.05707872 = weight(abstract_txt:high in 935) [ClassicSimilarity], result of:
            0.05707872 = score(doc=935,freq=1.0), product of:
              0.12552664 = queryWeight, product of:
                1.8680212 = boost
                4.8502827 = idf(docFreq=944, maxDocs=44421)
                0.013854378 = queryNorm
              0.454714 = fieldWeight in 935, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8502827 = idf(docFreq=944, maxDocs=44421)
                0.09375 = fieldNorm(doc=935)
          0.092884585 = weight(abstract_txt:algorithm in 935) [ClassicSimilarity], result of:
            0.092884585 = score(doc=935,freq=1.0), product of:
              0.17366628 = queryWeight, product of:
                2.1972103 = boost
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.013854378 = queryNorm
              0.53484523 = fieldWeight in 935, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7050157 = idf(docFreq=401, maxDocs=44421)
                0.09375 = fieldNorm(doc=935)
          0.32113418 = weight(abstract_txt:clustering in 935) [ClassicSimilarity], result of:
            0.32113418 = score(doc=935,freq=1.0), product of:
              0.5506391 = queryWeight, product of:
                6.3889832 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.013854378 = queryNorm
              0.58320266 = fieldWeight in 935, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.09375 = fieldNorm(doc=935)
        0.24 = coord(6/25)
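
The score breakdowns above follow the ClassicSimilarity (TF-IDF) pattern: each term contributes queryWeight × fieldWeight, the contributions are summed, and the sum is scaled by the coord factor. As a hedged check, the sketch below recomputes the score of the first entry (doc 207) from the listed inputs. It assumes tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which is consistent with the numbers shown; the function and variable names are hypothetical and not part of any Lucene API.

```python
import math

MAX_DOCS = 44421          # maxDocs as listed in the explanations
QUERY_NORM = 0.013854378  # queryNorm as listed in the explanations


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency, assumed form matching the listing."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost, doc_freq, freq, field_norm):
    """queryWeight * fieldWeight for a single matching term."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def document_score(terms, field_norm, matched, query_terms):
    """Sum of term scores scaled by coord = matched / query_terms."""
    total = sum(term_score(b, df, f, field_norm) for _, b, df, f in terms)
    return total * (matched / query_terms)


# (term, boost, docFreq, freq) copied from the doc 207 explanation.
DOC_207_TERMS = [
    ("time", 1.0652121, 1905, 1.0),
    ("documents", 1.5880423, 1954, 1.0),
    ("document", 1.6538342, 1647, 2.0),
    ("speed", 1.6918999, 165, 1.0),
    ("high", 1.8680212, 944, 1.0),
    ("algorithm", 2.1972103, 401, 1.0),
    ("clustering", 6.3889832, 239, 4.0),
]

score_207 = document_score(DOC_207_TERMS, field_norm=0.078125, matched=7, query_terms=25)
print(score_207)
```

The result should land close to the listed 0.23392819, with small differences possible because the original values were computed in single precision. Terms listed without a boost line (such as "vectors" in entry 2) can be treated as having boost 1.0 under the same assumptions.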