Document (#33357)

Author
Rooney, N.
Patterson, D.
Galushka, M.
Dobrynin, V.
Smirnova, E.
Title
An investigation into the stability of contextual document clustering
Source
Journal of the American Society for Information Science and Technology. 59(2008) no.2, pp. 256-266
Year
2008
Abstract
In this article, we assess the effectiveness of Contextual Document Clustering (CDC) as a means of indexing within a dynamic and rapidly changing environment. We simulate a dynamic environment by splitting two chronologically ordered datasets into time-ordered segments and assessing how the technique performs under two different scenarios. The first is when new documents are added incrementally without reclustering [incremental CDC (iCDC)], and the second is when reclustering is performed [nonincremental CDC (nCDC)]. The datasets are very large, are independent of each other, and belong to two very different domains. We show that CDC itself is effective at clustering very large document corpora, and that, significantly, it lends itself to a very simple, efficient incremental document addition process that is seen to be very stable over time despite the size of the corpus growing considerably. It was seen to be effective at incrementally clustering new documents even when the corpus grew to six times its original size. This is in contrast to what other researchers have found when applying similar simple incremental approaches to document clustering. The stability of iCDC is accounted for by the unique manner in which CDC discovers cluster themes.
Theme
Automatic classification

Similar documents (author)

  1. Patterson, E.L.: The bibliographic control of microforms (1992) 6.19
    6.190705 = sum of:
      6.190705 = weight(author_txt:patterson in 1285) [ClassicSimilarity], result of:
        6.190705 = fieldWeight in 1285, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.905128 = idf(docFreq=5, maxDocs=44218)
          0.625 = fieldNorm(doc=1285)
    
  2. Patterson, C.D.: Origins of systematic serials control : remembering Carolyn Ulrich (1988) 6.19
    6.190705 = sum of:
      6.190705 = weight(author_txt:patterson in 2475) [ClassicSimilarity], result of:
        6.190705 = fieldWeight in 2475, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.905128 = idf(docFreq=5, maxDocs=44218)
          0.625 = fieldNorm(doc=2475)
    
  3. Rohlfing, H.; Schappacher, N.; Patterson, S.J.: Das Zentralarchiv für Mathematiker-Nachlässe an der Niedersächsischen Staats- und Universitätsbibliothek (2003) 3.71
    3.7144227 = sum of:
      3.7144227 = weight(author_txt:patterson in 2321) [ClassicSimilarity], result of:
        3.7144227 = fieldWeight in 2321, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.905128 = idf(docFreq=5, maxDocs=44218)
          0.375 = fieldNorm(doc=2321)
    
  4. Barry, E.; Bedoya, J.K.; Groom, C.; Patterson, L.: Virtual reference in UK academic libraries : the virtual enquiry project 2008-2009 (2010) 3.10
    3.0953524 = sum of:
      3.0953524 = weight(author_txt:patterson in 2966) [ClassicSimilarity], result of:
        3.0953524 = fieldWeight in 2966, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.905128 = idf(docFreq=5, maxDocs=44218)
          0.3125 = fieldNorm(doc=2966)
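
The author scores above are each a single fieldWeight, the product of tf, idf, and fieldNorm. The following is a minimal Python sketch of that arithmetic, not the search engine's own code. It assumes tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), formulas inferred from the fact that they reproduce the numbers shown in the listing. The function names are hypothetical, and fieldNorm is taken as given from the listing rather than recomputed.

```python
import math


def classic_tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw frequency (assumed form)."""
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1)) (assumed form)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as laid out in the explanation tree."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from entry 1 (doc=1285): freq=1.0, docFreq=5, maxDocs=44218, fieldNorm=0.625
idf_patterson = classic_idf(5, 44218)
weight_1285 = field_weight(1.0, 5, 44218, 0.625)

# Entry 3 (doc=2321) differs only in its fieldNorm of 0.375
weight_2321 = field_weight(1.0, 5, 44218, 0.375)

print(idf_patterson, weight_1285, weight_2321)
```

Because tf and idf are identical for all four entries, the ranking among them is determined entirely by fieldNorm, which is smaller for records with longer author fields.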
    

Similar documents (content)

  1. Can, F.: Incremental clustering for dynamic information processing (1993) 0.32
    0.31648132 = sum of:
      0.31648132 = product of:
        0.98900414 = sum of:
          0.033683576 = weight(abstract_txt:large in 6627) [ClassicSimilarity], result of:
            0.033683576 = score(doc=6627,freq=1.0), product of:
              0.08066553 = queryWeight, product of:
                1.0603648 = boost
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.017079445 = queryNorm
              0.41757086 = fieldWeight in 6627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.09375 = fieldNorm(doc=6627)
          0.03797048 = weight(abstract_txt:environment in 6627) [ClassicSimilarity], result of:
            0.03797048 = score(doc=6627,freq=1.0), product of:
              0.0873722 = queryWeight, product of:
                1.1035651 = boost
                4.635553 = idf(docFreq=1165, maxDocs=44218)
                0.017079445 = queryNorm
              0.43458307 = fieldWeight in 6627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.635553 = idf(docFreq=1165, maxDocs=44218)
                0.09375 = fieldNorm(doc=6627)
          0.043118194 = weight(abstract_txt:effective in 6627) [ClassicSimilarity], result of:
            0.043118194 = score(doc=6627,freq=1.0), product of:
              0.09510052 = queryWeight, product of:
                1.1513379 = boost
                4.8362236 = idf(docFreq=953, maxDocs=44218)
                0.017079445 = queryNorm
              0.45339596 = fieldWeight in 6627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8362236 = idf(docFreq=953, maxDocs=44218)
                0.09375 = fieldNorm(doc=6627)
          0.07322512 = weight(abstract_txt:dynamic in 6627) [ClassicSimilarity], result of:
            0.07322512 = score(doc=6627,freq=1.0), product of:
              0.13536797 = queryWeight, product of:
                1.3736285 = boost
                5.7699614 = idf(docFreq=374, maxDocs=44218)
                0.017079445 = queryNorm
              0.54093385 = fieldWeight in 6627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7699614 = idf(docFreq=374, maxDocs=44218)
                0.09375 = fieldNorm(doc=6627)
          0.0753781 = weight(abstract_txt:document in 6627) [ClassicSimilarity], result of:
            0.0753781 = score(doc=6627,freq=1.0), product of:
              0.1873064 = queryWeight, product of:
                2.5548043 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.017079445 = queryNorm
              0.40243202 = fieldWeight in 6627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.09375 = fieldNorm(doc=6627)
          0.28829944 = weight(abstract_txt:incremental in 6627) [ClassicSimilarity], result of:
            0.28829944 = score(doc=6627,freq=1.0), product of:
              0.38636887 = queryWeight, product of:
                2.8422222 = boost
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.017079445 = queryNorm
              0.74617666 = fieldWeight in 6627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.09375 = fieldNorm(doc=6627)
          0.11360031 = weight(abstract_txt:very in 6627) [ClassicSimilarity], result of:
            0.11360031 = score(doc=6627,freq=1.0), product of:
              0.24621181 = queryWeight, product of:
                2.9291105 = boost
                4.921521 = idf(docFreq=875, maxDocs=44218)
                0.017079445 = queryNorm
              0.4613926 = fieldWeight in 6627, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.921521 = idf(docFreq=875, maxDocs=44218)
                0.09375 = fieldNorm(doc=6627)
          0.3237289 = weight(abstract_txt:clustering in 6627) [ClassicSimilarity], result of:
            0.3237289 = score(doc=6627,freq=2.0), product of:
              0.39279583 = queryWeight, product of:
                3.699685 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.017079445 = queryNorm
              0.8241658 = fieldWeight in 6627, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.09375 = fieldNorm(doc=6627)
        0.32 = coord(8/25)
    
  2. Cai, X.; Li, W.: Enhancing sentence-level clustering with integrated and interactive frameworks for theme-based summarization (2011) 0.14
    0.13947088 = sum of:
      0.13947088 = product of:
        0.871693 = sum of:
          0.08519396 = weight(abstract_txt:discovers in 4770) [ClassicSimilarity], result of:
            0.08519396 = score(doc=4770,freq=1.0), product of:
              0.15573967 = queryWeight, product of:
                1.0418278 = boost
                8.752448 = idf(docFreq=18, maxDocs=44218)
                0.017079445 = queryNorm
              0.547028 = fieldWeight in 4770, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.752448 = idf(docFreq=18, maxDocs=44218)
                0.0625 = fieldNorm(doc=4770)
          0.072361685 = weight(abstract_txt:datasets in 4770) [ClassicSimilarity], result of:
            0.072361685 = score(doc=4770,freq=1.0), product of:
              0.17598507 = queryWeight, product of:
                1.5662073 = boost
                6.578893 = idf(docFreq=166, maxDocs=44218)
                0.017079445 = queryNorm
              0.41118082 = fieldWeight in 4770, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.578893 = idf(docFreq=166, maxDocs=44218)
                0.0625 = fieldNorm(doc=4770)
          0.123091914 = weight(abstract_txt:document in 4770) [ClassicSimilarity], result of:
            0.123091914 = score(doc=4770,freq=6.0), product of:
              0.1873064 = queryWeight, product of:
                2.5548043 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.017079445 = queryNorm
              0.65716875 = fieldWeight in 4770, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.0625 = fieldNorm(doc=4770)
          0.5910455 = weight(abstract_txt:clustering in 4770) [ClassicSimilarity], result of:
            0.5910455 = score(doc=4770,freq=15.0), product of:
              0.39279583 = queryWeight, product of:
                3.699685 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.017079445 = queryNorm
              1.5047143 = fieldWeight in 4770, product of:
                3.8729835 = tf(freq=15.0), with freq of:
                  15.0 = termFreq=15.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.0625 = fieldNorm(doc=4770)
        0.16 = coord(4/25)
    
  3. Ruocco, A.S.; Frieder, O.: Clustering and classification of large document bases in a parallel environment (1997) 0.11
    0.113363475 = sum of:
      0.113363475 = product of:
        0.7085217 = sum of:
          0.033683576 = weight(abstract_txt:large in 1661) [ClassicSimilarity], result of:
            0.033683576 = score(doc=1661,freq=1.0), product of:
              0.08066553 = queryWeight, product of:
                1.0603648 = boost
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.017079445 = queryNorm
              0.41757086 = fieldWeight in 1661, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.09375 = fieldNorm(doc=1661)
          0.08645759 = weight(abstract_txt:corpus in 1661) [ClassicSimilarity], result of:
            0.08645759 = score(doc=1661,freq=1.0), product of:
              0.15122071 = queryWeight, product of:
                1.451834 = boost
                6.0984654 = idf(docFreq=269, maxDocs=44218)
                0.017079445 = queryNorm
              0.57173115 = fieldWeight in 1661, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0984654 = idf(docFreq=269, maxDocs=44218)
                0.09375 = fieldNorm(doc=1661)
          0.13055868 = weight(abstract_txt:document in 1661) [ClassicSimilarity], result of:
            0.13055868 = score(doc=1661,freq=3.0), product of:
              0.1873064 = queryWeight, product of:
                2.5548043 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.017079445 = queryNorm
              0.6970327 = fieldWeight in 1661, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.09375 = fieldNorm(doc=1661)
          0.45782188 = weight(abstract_txt:clustering in 1661) [ClassicSimilarity], result of:
            0.45782188 = score(doc=1661,freq=4.0), product of:
              0.39279583 = queryWeight, product of:
                3.699685 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.017079445 = queryNorm
              1.1655467 = fieldWeight in 1661, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.09375 = fieldNorm(doc=1661)
        0.16 = coord(4/25)
    
  4. Zhan, J.; Loh, H.T.: Using latent semantic indexing to improve the accuracy of document clustering (2007) 0.11
    0.11260071 = sum of:
      0.11260071 = product of:
        0.7037544 = sum of:
          0.028069647 = weight(abstract_txt:large in 264) [ClassicSimilarity], result of:
            0.028069647 = score(doc=264,freq=1.0), product of:
              0.08066553 = queryWeight, product of:
                1.0603648 = boost
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.017079445 = queryNorm
              0.34797573 = fieldWeight in 264, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.454089 = idf(docFreq=1397, maxDocs=44218)
                0.078125 = fieldNorm(doc=264)
          0.04535346 = weight(abstract_txt:when in 264) [ClassicSimilarity], result of:
            0.04535346 = score(doc=264,freq=1.0), product of:
              0.13994165 = queryWeight, product of:
                1.9751488 = boost
                4.148331 = idf(docFreq=1897, maxDocs=44218)
                0.017079445 = queryNorm
              0.32408836 = fieldWeight in 264, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.148331 = idf(docFreq=1897, maxDocs=44218)
                0.078125 = fieldNorm(doc=264)
          0.12563016 = weight(abstract_txt:document in 264) [ClassicSimilarity], result of:
            0.12563016 = score(doc=264,freq=4.0), product of:
              0.1873064 = queryWeight, product of:
                2.5548043 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.017079445 = queryNorm
              0.67072004 = fieldWeight in 264, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.078125 = fieldNorm(doc=264)
          0.50470114 = weight(abstract_txt:clustering in 264) [ClassicSimilarity], result of:
            0.50470114 = score(doc=264,freq=7.0), product of:
              0.39279583 = queryWeight, product of:
                3.699685 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.017079445 = queryNorm
              1.2848943 = fieldWeight in 264, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.078125 = fieldNorm(doc=264)
        0.16 = coord(4/25)
    
  5. Zamir, O.; Etzioni, O.: Grouper : a dynamic clustering interface to Web search results (1999) 0.09
    0.08866617 = sum of:
      0.08866617 = product of:
        0.5541636 = sum of:
          0.035931826 = weight(abstract_txt:effective in 6207) [ClassicSimilarity], result of:
            0.035931826 = score(doc=6207,freq=1.0), product of:
              0.09510052 = queryWeight, product of:
                1.1513379 = boost
                4.8362236 = idf(docFreq=953, maxDocs=44218)
                0.017079445 = queryNorm
              0.37782997 = fieldWeight in 6207, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8362236 = idf(docFreq=953, maxDocs=44218)
                0.078125 = fieldNorm(doc=6207)
          0.04787966 = weight(abstract_txt:simple in 6207) [ClassicSimilarity], result of:
            0.04787966 = score(doc=6207,freq=1.0), product of:
              0.11515887 = queryWeight, product of:
                1.2669516 = boost
                5.321862 = idf(docFreq=586, maxDocs=44218)
                0.017079445 = queryNorm
              0.41577047 = fieldWeight in 6207, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.321862 = idf(docFreq=586, maxDocs=44218)
                0.078125 = fieldNorm(doc=6207)
          0.08883394 = weight(abstract_txt:document in 6207) [ClassicSimilarity], result of:
            0.08883394 = score(doc=6207,freq=2.0), product of:
              0.1873064 = queryWeight, product of:
                2.5548043 = boost
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.017079445 = queryNorm
              0.4742707 = fieldWeight in 6207, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.2926083 = idf(docFreq=1642, maxDocs=44218)
                0.078125 = fieldNorm(doc=6207)
          0.3815182 = weight(abstract_txt:clustering in 6207) [ClassicSimilarity], result of:
            0.3815182 = score(doc=6207,freq=4.0), product of:
              0.39279583 = queryWeight, product of:
                3.699685 = boost
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.017079445 = queryNorm
              0.9712888 = fieldWeight in 6207, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.2162485 = idf(docFreq=239, maxDocs=44218)
                0.078125 = fieldNorm(doc=6207)
        0.16 = coord(4/25)
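
Each content score above combines per-term scores (queryWeight times fieldWeight) and then scales their sum by a coord factor, the fraction of query terms that matched. The sketch below retraces that arithmetic in Python using numbers copied from entry 1 (doc=6627). It is an illustration under stated assumptions, not the engine's implementation: the function names are hypothetical, tf is assumed to be sqrt(freq), and the boost, idf, queryNorm, and fieldNorm values are taken from the listing as given.

```python
import math


def term_score(freq: float, boost: float, idf: float,
               query_norm: float, field_norm: float) -> float:
    """Score of one matching term: queryWeight * fieldWeight."""
    query_weight = boost * idf * query_norm           # query side
    field_weight = math.sqrt(freq) * idf * field_norm  # document side
    return query_weight * field_weight


def document_score(term_scores: list[float], matched: int, query_terms: int) -> float:
    """Sum of term scores scaled by coord = matched / query_terms."""
    return sum(term_scores) * (matched / query_terms)


# "clustering" in doc 6627: freq=2.0, boost=3.699685, idf=6.2162485,
# queryNorm=0.017079445, fieldNorm=0.09375 (all copied from the listing)
clustering_6627 = term_score(2.0, 3.699685, 6.2162485, 0.017079445, 0.09375)

# The eight per-term scores listed for doc 6627, in order of appearance
scores_6627 = [
    0.033683576,  # large
    0.03797048,   # environment
    0.043118194,  # effective
    0.07322512,   # dynamic
    0.0753781,    # document
    0.28829944,   # incremental
    0.11360031,   # very
    0.3237289,    # clustering
]
total_6627 = document_score(scores_6627, matched=8, query_terms=25)

print(clustering_6627, total_6627)
```

The coord factor explains why entry 1 ranks well ahead of the others despite a lower "clustering" term score than entries 2 to 5: it matches 8 of the 25 query terms, while the rest match only 4.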