Document (#27564)

Author
Wu, K.J.
Chen, M.-C.
Sun, Y.
Title
Automatic topics discovery from hyperlinked documents
Source
Information processing and management. 40(2004) no.2, p.239-255
Year
2004
Abstract
Topic discovery is an important tool for marketing, e-business and social science studies. It can also serve various purposes, such as identifying a group with certain properties and observing the emergence and decline of a cyber community. Previous topic discovery work (J.M. Kleinberg, Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California, p. 668) requires manual judgment of the usefulness of its outcomes and therefore cannot keep pace with the explosive growth of the Internet. In this paper, we propose the Automatic Topic Discovery (ATD) method, which combines a method of base set construction, a clustering algorithm and an iterative principal eigenvector computation method to discover the topics relevant to a given query without manual examination. Given a query, ATD returns the topics associated with the query and the top representative pages for each topic. Our experiments show that the ATD method performs better than the traditional eigenvector method in terms of computation time and topic discovery quality.
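To illustrate what an "iterative principal eigenvector computation" can look like in link analysis, the following is a minimal sketch of a Kleinberg-style hub/authority iteration on a toy link graph. It is not the ATD method itself: the base set construction and clustering steps are omitted, and the function name `hits`, the toy graph and the fixed iteration count are assumptions made for illustration only.

```python
import math


def hits(links, iterations=50):
    """Hub/authority power iteration (illustrative sketch, not ATD).

    links: dict mapping each page to the list of pages it links to.
    Returns two dicts: hub scores and authority scores.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score of a page: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        # Normalize both vectors to unit length so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical toy graph: pages "a" and "b" both link to "c"; "a" also links to "d".
toy_links = {"a": ["c", "d"], "b": ["c"], "c": [], "d": []}
hub, auth = hits(toy_links)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
print(top_authority, top_hub)
```

In this toy graph the page with the most incoming links from good hubs ends up with the highest authority score, which is the sense in which the principal eigenvector picks out "representative pages".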
Theme
Data Mining
Automatic classification

Similar documents (author)

  1. Chen, Y.N.; Chen, S.J.: ¬A metadata practice of the IFLA FRBR model : a case study for the National Palace Museum in Taipei (2004) 4.34
    4.3394766 = sum of:
      4.3394766 = weight(author_txt:chen in 4384) [ClassicSimilarity], result of:
        4.3394766 = score(doc=4384,freq=2.0), product of:
          0.99999994 = queryWeight, product of:
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.16294746 = queryNorm
          4.339477 = fieldWeight in 4384, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.5 = fieldNorm(doc=4384)
    
  2. Chen, C.C.; Chen, H.H.; Chen, K.H.: ¬The design of the XML/Metadata management system (2000) 3.99
    3.9860637 = sum of:
      3.9860637 = weight(author_txt:chen in 5633) [ClassicSimilarity], result of:
        3.9860637 = score(doc=5633,freq=3.0), product of:
          0.99999994 = queryWeight, product of:
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.16294746 = queryNorm
          3.986064 = fieldWeight in 5633, product of:
            1.7320508 = tf(freq=3.0), with freq of:
              3.0 = termFreq=3.0
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.375 = fieldNorm(doc=5633)
    
  3. Chen, W.Y.: Observations on cataloguing and classification (1991) 3.84
    3.8355918 = sum of:
      3.8355918 = weight(author_txt:chen in 4183) [ClassicSimilarity], result of:
        3.8355918 = score(doc=4183,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.16294746 = queryNorm
          3.835592 = fieldWeight in 4183, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.625 = fieldNorm(doc=4183)
    
  4. Chen, H.: Knowledge-based document retrieval : framework and design (1992) 3.84
    3.8355918 = sum of:
      3.8355918 = weight(author_txt:chen in 5282) [ClassicSimilarity], result of:
        3.8355918 = score(doc=5282,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.16294746 = queryNorm
          3.835592 = fieldWeight in 5282, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.625 = fieldNorm(doc=5282)
    
  5. Chen, P.S.: On inference rules of logic-based information retrieval systems (1994) 3.84
    3.8355918 = sum of:
      3.8355918 = weight(author_txt:chen in 6730) [ClassicSimilarity], result of:
        3.8355918 = score(doc=6730,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.16294746 = queryNorm
          3.835592 = fieldWeight in 6730, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.136947 = idf(docFreq=260, maxDocs=44421)
            0.625 = fieldNorm(doc=6730)
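
The score breakdowns above follow a common pattern: each score is queryWeight times fieldWeight, where tf is the square root of the term frequency. As a rough sketch, the figures for the first entry (doc=4384) can be reproduced as follows. The idf formula `1 + ln(maxDocs / (docFreq + 1))` is an assumption that happens to match the printed values; the function names are hypothetical, and fieldNorm and queryNorm are simply copied from the listing rather than derived.

```python
import math


def classic_idf(doc_freq, max_docs):
    # Assumed idf form; it reproduces 6.136947 for docFreq=260, maxDocs=44421.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq):
    # tf is the square root of the raw term frequency.
    return math.sqrt(freq)


def term_score(freq, doc_freq, max_docs, field_norm, query_norm, boost=1.0):
    idf = classic_idf(doc_freq, max_docs)
    query_weight = boost * idf * query_norm              # 0.99999994 in the listing
    field_weight = classic_tf(freq) * idf * field_norm   # 4.339477 in the listing
    return query_weight * field_weight


# Values taken from the first entry (author_txt:chen in doc 4384).
score = term_score(freq=2.0, doc_freq=260, max_docs=44421,
                   field_norm=0.5, query_norm=0.16294746)
print(round(score, 2))
```

The same function with freq=1.0 and field_norm=0.625 gives the 3.84 shown for entries 3 to 5, which is why documents with a single matching author can still rank close to those with several.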
    

Similar documents (content)

  1. Potha, N.; Stamatatos, E.: Improving author verification based on topic modeling (2019) 0.14
    0.1371746 = sum of:
      0.1371746 = product of:
        0.4899093 = sum of:
          0.010342889 = weight(abstract_txt:with in 385) [ClassicSimilarity], result of:
            0.010342889 = score(doc=385,freq=2.0), product of:
              0.046878964 = queryWeight, product of:
                1.0625628 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.017674798 = queryNorm
              0.22062966 = fieldWeight in 385, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=385)
          0.08564052 = weight(abstract_txt:cyber in 385) [ClassicSimilarity], result of:
            0.08564052 = score(doc=385,freq=1.0), product of:
              0.16761228 = queryWeight, product of:
                1.1599998 = boost
                8.175107 = idf(docFreq=33, maxDocs=44421)
                0.017674798 = queryNorm
              0.5109442 = fieldWeight in 385, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.175107 = idf(docFreq=33, maxDocs=44421)
                0.0625 = fieldNorm(doc=385)
          0.032495823 = weight(abstract_txt:given in 385) [ClassicSimilarity], result of:
            0.032495823 = score(doc=385,freq=1.0), product of:
              0.1106831 = queryWeight, product of:
                1.3330936 = boost
                4.6974936 = idf(docFreq=1100, maxDocs=44421)
                0.017674798 = queryNorm
              0.29359335 = fieldWeight in 385, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6974936 = idf(docFreq=1100, maxDocs=44421)
                0.0625 = fieldNorm(doc=385)
          0.053250507 = weight(abstract_txt:certain in 385) [ClassicSimilarity], result of:
            0.053250507 = score(doc=385,freq=1.0), product of:
              0.15384337 = queryWeight, product of:
                1.5716629 = boost
                5.5381527 = idf(docFreq=474, maxDocs=44421)
                0.017674798 = queryNorm
              0.34613454 = fieldWeight in 385, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5381527 = idf(docFreq=474, maxDocs=44421)
                0.0625 = fieldNorm(doc=385)
          0.061600726 = weight(abstract_txt:topics in 385) [ClassicSimilarity], result of:
            0.061600726 = score(doc=385,freq=1.0), product of:
              0.19406651 = queryWeight, product of:
                2.1619265 = boost
                5.078731 = idf(docFreq=751, maxDocs=44421)
                0.017674798 = queryNorm
              0.3174207 = fieldWeight in 385, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.078731 = idf(docFreq=751, maxDocs=44421)
                0.0625 = fieldNorm(doc=385)
          0.07136099 = weight(abstract_txt:method in 385) [ClassicSimilarity], result of:
            0.07136099 = score(doc=385,freq=1.0), product of:
              0.25379527 = queryWeight, product of:
                3.1917713 = boost
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.017674798 = queryNorm
              0.2811754 = fieldWeight in 385, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.0625 = fieldNorm(doc=385)
          0.17521784 = weight(abstract_txt:topic in 385) [ClassicSimilarity], result of:
            0.17521784 = score(doc=385,freq=3.0), product of:
              0.32027382 = queryWeight, product of:
                3.5855083 = boost
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.017674798 = queryNorm
              0.5470876 = fieldWeight in 385, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.0625 = fieldNorm(doc=385)
        0.28 = coord(7/25)
    
  2. Alkhodair, S.A.; Fung, B.C.M.; Patrick, O.R.; Hung, C.K.: Improving interpretations of topic modeling in microblogs (2018) 0.12
    0.12321871 = sum of:
      0.12321871 = product of:
        0.6160935 = sum of:
          0.058276813 = weight(abstract_txt:performs in 181) [ClassicSimilarity], result of:
            0.058276813 = score(doc=181,freq=1.0), product of:
              0.12967318 = queryWeight, product of:
                1.0203052 = boost
                7.190608 = idf(docFreq=90, maxDocs=44421)
                0.017674798 = queryNorm
              0.449413 = fieldWeight in 181, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.190608 = idf(docFreq=90, maxDocs=44421)
                0.0625 = fieldNorm(doc=181)
          0.010342889 = weight(abstract_txt:with in 181) [ClassicSimilarity], result of:
            0.010342889 = score(doc=181,freq=2.0), product of:
              0.046878964 = queryWeight, product of:
                1.0625628 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.017674798 = queryNorm
              0.22062966 = fieldWeight in 181, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.0625 = fieldNorm(doc=181)
          0.13774341 = weight(abstract_txt:topics in 181) [ClassicSimilarity], result of:
            0.13774341 = score(doc=181,freq=5.0), product of:
              0.19406651 = queryWeight, product of:
                2.1619265 = boost
                5.078731 = idf(docFreq=751, maxDocs=44421)
                0.017674798 = queryNorm
              0.70977426 = fieldWeight in 181, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.078731 = idf(docFreq=751, maxDocs=44421)
                0.0625 = fieldNorm(doc=181)
          0.123600855 = weight(abstract_txt:method in 181) [ClassicSimilarity], result of:
            0.123600855 = score(doc=181,freq=3.0), product of:
              0.25379527 = queryWeight, product of:
                3.1917713 = boost
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.017674798 = queryNorm
              0.4870101 = fieldWeight in 181, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.0625 = fieldNorm(doc=181)
          0.28612953 = weight(abstract_txt:topic in 181) [ClassicSimilarity], result of:
            0.28612953 = score(doc=181,freq=8.0), product of:
              0.32027382 = queryWeight, product of:
                3.5855083 = boost
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.017674798 = queryNorm
              0.89339036 = fieldWeight in 181, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.0625 = fieldNorm(doc=181)
        0.2 = coord(5/25)
    
  3. Finn, A.; Kushmerick, N.: Learning to classify documents according to genre (2006) 0.11
    0.10715927 = sum of:
      0.10715927 = product of:
        0.44649696 = sum of:
          0.00914191 = weight(abstract_txt:with in 10) [ClassicSimilarity], result of:
            0.00914191 = score(doc=10,freq=1.0), product of:
              0.046878964 = queryWeight, product of:
                1.0625628 = boost
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.017674798 = queryNorm
              0.19501092 = fieldWeight in 10, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4961398 = idf(docFreq=9949, maxDocs=44421)
                0.078125 = fieldNorm(doc=10)
          0.04061978 = weight(abstract_txt:given in 10) [ClassicSimilarity], result of:
            0.04061978 = score(doc=10,freq=1.0), product of:
              0.1106831 = queryWeight, product of:
                1.3330936 = boost
                4.6974936 = idf(docFreq=1100, maxDocs=44421)
                0.017674798 = queryNorm
              0.3669917 = fieldWeight in 10, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.6974936 = idf(docFreq=1100, maxDocs=44421)
                0.078125 = fieldNorm(doc=10)
          0.0777288 = weight(abstract_txt:automatic in 10) [ClassicSimilarity], result of:
            0.0777288 = score(doc=10,freq=2.0), product of:
              0.13540487 = queryWeight, product of:
                1.4744741 = boost
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.017674798 = queryNorm
              0.5740473 = fieldWeight in 10, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.078125 = fieldNorm(doc=10)
          0.063174605 = weight(abstract_txt:query in 10) [ClassicSimilarity], result of:
            0.063174605 = score(doc=10,freq=1.0), product of:
              0.1700781 = queryWeight, product of:
                2.0239036 = boost
                4.754492 = idf(docFreq=1039, maxDocs=44421)
                0.017674798 = queryNorm
              0.37144467 = fieldWeight in 10, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.754492 = idf(docFreq=1039, maxDocs=44421)
                0.078125 = fieldNorm(doc=10)
          0.07700091 = weight(abstract_txt:topics in 10) [ClassicSimilarity], result of:
            0.07700091 = score(doc=10,freq=1.0), product of:
              0.19406651 = queryWeight, product of:
                2.1619265 = boost
                5.078731 = idf(docFreq=751, maxDocs=44421)
                0.017674798 = queryNorm
              0.39677587 = fieldWeight in 10, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.078731 = idf(docFreq=751, maxDocs=44421)
                0.078125 = fieldNorm(doc=10)
          0.17883097 = weight(abstract_txt:topic in 10) [ClassicSimilarity], result of:
            0.17883097 = score(doc=10,freq=2.0), product of:
              0.32027382 = queryWeight, product of:
                3.5855083 = boost
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.017674798 = queryNorm
              0.558369 = fieldWeight in 10, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.078125 = fieldNorm(doc=10)
        0.24 = coord(6/25)
    
  4. Lempel, R.; Moran, S.: SALSA: the stochastic approach for link-structure analysis (2001) 0.10
    0.09997727 = sum of:
      0.09997727 = product of:
        0.49988633 = sum of:
          0.058276813 = weight(abstract_txt:performs in 1010) [ClassicSimilarity], result of:
            0.058276813 = score(doc=1010,freq=1.0), product of:
              0.12967318 = queryWeight, product of:
                1.0203052 = boost
                7.190608 = idf(docFreq=90, maxDocs=44421)
                0.017674798 = queryNorm
              0.449413 = fieldWeight in 1010, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.190608 = idf(docFreq=90, maxDocs=44421)
                0.0625 = fieldNorm(doc=1010)
          0.21572304 = weight(abstract_txt:kleinberg in 1010) [ClassicSimilarity], result of:
            0.21572304 = score(doc=1010,freq=2.0), product of:
              0.24628653 = queryWeight, product of:
                1.4061295 = boost
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.017674798 = queryNorm
              0.8759027 = fieldWeight in 1010, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.0625 = fieldNorm(doc=1010)
          0.053250507 = weight(abstract_txt:certain in 1010) [ClassicSimilarity], result of:
            0.053250507 = score(doc=1010,freq=1.0), product of:
              0.15384337 = queryWeight, product of:
                1.5716629 = boost
                5.5381527 = idf(docFreq=474, maxDocs=44421)
                0.017674798 = queryNorm
              0.34613454 = fieldWeight in 1010, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5381527 = idf(docFreq=474, maxDocs=44421)
                0.0625 = fieldNorm(doc=1010)
          0.071473904 = weight(abstract_txt:query in 1010) [ClassicSimilarity], result of:
            0.071473904 = score(doc=1010,freq=2.0), product of:
              0.1700781 = queryWeight, product of:
                2.0239036 = boost
                4.754492 = idf(docFreq=1039, maxDocs=44421)
                0.017674798 = queryNorm
              0.42024165 = fieldWeight in 1010, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.754492 = idf(docFreq=1039, maxDocs=44421)
                0.0625 = fieldNorm(doc=1010)
          0.10116207 = weight(abstract_txt:topic in 1010) [ClassicSimilarity], result of:
            0.10116207 = score(doc=1010,freq=1.0), product of:
              0.32027382 = queryWeight, product of:
                3.5855083 = boost
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.017674798 = queryNorm
              0.3158612 = fieldWeight in 1010, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.0625 = fieldNorm(doc=1010)
        0.2 = coord(5/25)
    
  5. Pons-Porrata, A.; Berlanga-Llavori, R.; Ruiz-Shulcloper, J.: Topic discovery based on text mining techniques (2007) 0.10
    0.09961163 = sum of:
      0.09961163 = product of:
        0.6225727 = sum of:
          0.07700091 = weight(abstract_txt:topics in 1916) [ClassicSimilarity], result of:
            0.07700091 = score(doc=1916,freq=1.0), product of:
              0.19406651 = queryWeight, product of:
                2.1619265 = boost
                5.078731 = idf(docFreq=751, maxDocs=44421)
                0.017674798 = queryNorm
              0.39677587 = fieldWeight in 1916, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.078731 = idf(docFreq=751, maxDocs=44421)
                0.078125 = fieldNorm(doc=1916)
          0.089201234 = weight(abstract_txt:method in 1916) [ClassicSimilarity], result of:
            0.089201234 = score(doc=1916,freq=1.0), product of:
              0.25379527 = queryWeight, product of:
                3.1917713 = boost
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.017674798 = queryNorm
              0.35146925 = fieldWeight in 1916, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.078125 = fieldNorm(doc=1916)
          0.28275657 = weight(abstract_txt:topic in 1916) [ClassicSimilarity], result of:
            0.28275657 = score(doc=1916,freq=5.0), product of:
              0.32027382 = queryWeight, product of:
                3.5855083 = boost
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.017674798 = queryNorm
              0.8828589 = fieldWeight in 1916, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.078125 = fieldNorm(doc=1916)
          0.17361404 = weight(abstract_txt:discovery in 1916) [ClassicSimilarity], result of:
            0.17361404 = score(doc=1916,freq=1.0), product of:
              0.3956333 = queryWeight, product of:
                3.9850743 = boost
                5.616968 = idf(docFreq=438, maxDocs=44421)
                0.017674798 = queryNorm
              0.43882564 = fieldWeight in 1916, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.616968 = idf(docFreq=438, maxDocs=44421)
                0.078125 = fieldNorm(doc=1916)
        0.16 = coord(4/25)
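
For the content-based list, each document score is the sum of the per-term scores multiplied by a coordination factor (matched query terms divided by total query terms). The sketch below recomputes the last entry (doc=1916) under the same assumed idf formula, `1 + ln(maxDocs / (docFreq + 1))`. The helper names are hypothetical; the boosts, fieldNorm and queryNorm are copied from the listing rather than derived.

```python
import math

MAX_DOCS = 44421
QUERY_NORM = 0.017674798   # copied from the listing
FIELD_NORM = 0.078125      # fieldNorm(doc=1916), copied from the listing


def term_score(freq, doc_freq, boost):
    # Assumed idf form that matches the printed idf values.
    idf = 1.0 + math.log(MAX_DOCS / (doc_freq + 1))
    query_weight = boost * idf * QUERY_NORM
    field_weight = math.sqrt(freq) * idf * FIELD_NORM
    return query_weight * field_weight


# (term, freq, docFreq, boost) for the four matching terms of doc 1916.
matched_terms = [
    ("topics", 1.0, 751, 2.1619265),
    ("method", 1.0, 1342, 3.1917713),
    ("topic", 5.0, 770, 3.5855083),
    ("discovery", 1.0, 438, 3.9850743),
]

raw_sum = sum(term_score(f, df, b) for _, f, df, b in matched_terms)
coord = len(matched_terms) / 25   # 4 of the 25 query terms matched
total = raw_sum * coord
print(round(total, 2))
```

The coordination factor explains why this entry, which has the largest raw sum (0.6225727) of the five, still ranks last: it matches only 4 of the 25 query terms.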