Document (#40415)

Author
Lu, K.
Cai, X.
Ajiferuke, I.
Wolfram, D.
Title
Vocabulary size and its effect on topic representation
Source
Information processing and management. 53(2017) no.3, pp.653-665
Year
2017
Abstract
This study investigates how computational overhead for topic model training may be reduced by selectively removing terms from the vocabulary of text corpora being modeled. We compare the impact of removing singly occurring terms, the top 0.5%, 1% and 5% most frequently occurring terms and both top 0.5% most frequent and singly occurring terms, along with changes in the number of topics modeled (10, 20, 30, 40, 50, 100) using three datasets. Four outcome measures are compared. The removal of singly occurring terms has little impact on outcomes for all of the measures tested. Document discriminative capacity, as measured by the document space density, is reduced by the removal of frequently occurring terms, but increases with higher numbers of topics. Vocabulary size does not greatly influence entropy, but entropy is affected by the number of topics. Finally, topic similarity, as measured by pairwise topic similarity and Jensen-Shannon divergence, decreases with the removal of frequent terms. The findings have implications for information science research in information retrieval and informetrics that makes use of topic modeling.
Content
Cf.: http://www.sciencedirect.com/science/article/pii/S0306457317300298.
Theme
Computational linguistics

Similar documents (author)

  1. Wolfram, D.: Inter-record linkage structure in a hypertext bibliographic retrieval system (1996) 5.09
    5.0913243 = sum of:
      5.0913243 = weight(author_txt:wolfram in 6829) [ClassicSimilarity], result of:
        5.0913243 = fieldWeight in 6829, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.146119 = idf(docFreq=34, maxDocs=44421)
          0.625 = fieldNorm(doc=6829)
    
  2. Wolfram, D.: Applied informetrics for information retrieval research (2003) 5.09
    5.0913243 = sum of:
      5.0913243 = weight(author_txt:wolfram in 5589) [ClassicSimilarity], result of:
        5.0913243 = fieldWeight in 5589, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.146119 = idf(docFreq=34, maxDocs=44421)
          0.625 = fieldNorm(doc=5589)
    
  3. Wolfram, D.: Search characteristics in different types of Web-based IR environments : are they the same? (2008) 5.09
    5.0913243 = sum of:
      5.0913243 = weight(author_txt:wolfram in 3093) [ClassicSimilarity], result of:
        5.0913243 = fieldWeight in 3093, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.146119 = idf(docFreq=34, maxDocs=44421)
          0.625 = fieldNorm(doc=3093)
    
  4. Wolfram, D.: The symbiotic relationship between information retrieval and informetrics (2015) 5.09
    5.0913243 = sum of:
      5.0913243 = weight(author_txt:wolfram in 2689) [ClassicSimilarity], result of:
        5.0913243 = fieldWeight in 2689, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.146119 = idf(docFreq=34, maxDocs=44421)
          0.625 = fieldNorm(doc=2689)
    
  5. Wolfram, S.: A new kind of science (2002) 5.09
    5.0913243 = sum of:
      5.0913243 = weight(author_txt:wolfram in 2866) [ClassicSimilarity], result of:
        5.0913243 = fieldWeight in 2866, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.146119 = idf(docFreq=34, maxDocs=44421)
          0.625 = fieldNorm(doc=2866)
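
The author scores above all reduce to the same three-factor product. As a minimal sketch, assuming the standard Lucene ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), the fieldWeight can be recomputed from the numbers in the explanation. The function names are illustrative and are not part of the retrieval system; the fieldNorm value is simply copied from the listing.

```python
import math


def classic_tf(freq: float) -> float:
    # Term-frequency factor: square root of the raw frequency (assumed ClassicSimilarity form).
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1)) (assumed ClassicSimilarity form).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight = tf * idf * fieldNorm, as laid out in the explanation tree.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from the "author_txt:wolfram" explanations.
author_weight = field_weight(freq=1.0, doc_freq=34, max_docs=44421, field_norm=0.625)
print(author_weight)
```

The result should agree with the listed 5.0913243 to several decimal places. Lucene computes in 32-bit floats, so the trailing digits may differ. Because the query matches only on the surname, every record with an author named Wolfram, including Wolfram, S., receives the same score.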
    

Similar documents (content)

  1. Shibata, N.; Kajikawa, Y.; Sakata, I.: Measuring relatedness between communities in a citation network (2011) 0.17
    0.16705142 = sum of:
      0.16705142 = product of:
        0.5966122 = sum of:
          0.029018542 = weight(abstract_txt:number in 484) [ClassicSimilarity], result of:
            0.029018542 = score(doc=484,freq=2.0), product of:
              0.063507386 = queryWeight, product of:
                1.0043247 = boost
                4.1356745 = idf(docFreq=1930, maxDocs=44421)
                0.015289869 = queryNorm
              0.45693177 = fieldWeight in 484, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1356745 = idf(docFreq=1930, maxDocs=44421)
                0.078125 = fieldNorm(doc=484)
          0.0654928 = weight(abstract_txt:measures in 484) [ClassicSimilarity], result of:
            0.0654928 = score(doc=484,freq=2.0), product of:
              0.10927049 = queryWeight, product of:
                1.3173872 = boost
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.015289869 = queryNorm
              0.59936404 = fieldWeight in 484, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.078125 = fieldNorm(doc=484)
          0.057131447 = weight(abstract_txt:similarity in 484) [ClassicSimilarity], result of:
            0.057131447 = score(doc=484,freq=1.0), product of:
              0.12568997 = queryWeight, product of:
                1.412903 = boost
                5.8181453 = idf(docFreq=358, maxDocs=44421)
                0.015289869 = queryNorm
              0.4545426 = fieldWeight in 484, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8181453 = idf(docFreq=358, maxDocs=44421)
                0.078125 = fieldNorm(doc=484)
          0.07409555 = weight(abstract_txt:measured in 484) [ClassicSimilarity], result of:
            0.07409555 = score(doc=484,freq=1.0), product of:
              0.14947845 = queryWeight, product of:
                1.5408179 = boost
                6.3448815 = idf(docFreq=211, maxDocs=44421)
                0.015289869 = queryNorm
              0.49569386 = fieldWeight in 484, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3448815 = idf(docFreq=211, maxDocs=44421)
                0.078125 = fieldNorm(doc=484)
          0.1823273 = weight(abstract_txt:removing in 484) [ClassicSimilarity], result of:
            0.1823273 = score(doc=484,freq=1.0), product of:
              0.27244884 = queryWeight, product of:
                2.080197 = boost
                8.565973 = idf(docFreq=22, maxDocs=44421)
                0.015289869 = queryNorm
              0.66921663 = fieldWeight in 484, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.565973 = idf(docFreq=22, maxDocs=44421)
                0.078125 = fieldNorm(doc=484)
          0.09360744 = weight(abstract_txt:topic in 484) [ClassicSimilarity], result of:
            0.09360744 = score(doc=484,freq=1.0), product of:
              0.237085 = queryWeight, product of:
                3.0682027 = boost
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.015289869 = queryNorm
              0.3948265 = fieldWeight in 484, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.078125 = fieldNorm(doc=484)
          0.094939135 = weight(abstract_txt:terms in 484) [ClassicSimilarity], result of:
            0.094939135 = score(doc=484,freq=2.0), product of:
              0.21250054 = queryWeight, product of:
                3.4369724 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.015289869 = queryNorm
              0.44677126 = fieldWeight in 484, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.078125 = fieldNorm(doc=484)
        0.28 = coord(7/25)
    
  2. Zhang, J.; Wolfram, D.; Wang, P.; Hong, Y.; Gillis, R.: Visualization of health-subject analysis based on query term co-occurrences (2008) 0.16
    0.16286333 = sum of:
      0.16286333 = product of:
        0.5816547 = sum of:
          0.022192547 = weight(abstract_txt:impact in 3376) [ClassicSimilarity], result of:
            0.022192547 = score(doc=3376,freq=1.0), product of:
              0.0776477 = queryWeight, product of:
                1.1105198 = boost
                4.572972 = idf(docFreq=1246, maxDocs=44421)
                0.015289869 = queryNorm
              0.28581074 = fieldWeight in 3376, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.572972 = idf(docFreq=1246, maxDocs=44421)
                0.0625 = fieldNorm(doc=3376)
          0.045705155 = weight(abstract_txt:similarity in 3376) [ClassicSimilarity], result of:
            0.045705155 = score(doc=3376,freq=1.0), product of:
              0.12568997 = queryWeight, product of:
                1.412903 = boost
                5.8181453 = idf(docFreq=358, maxDocs=44421)
                0.015289869 = queryNorm
              0.36363408 = fieldWeight in 3376, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8181453 = idf(docFreq=358, maxDocs=44421)
                0.0625 = fieldNorm(doc=3376)
          0.05032038 = weight(abstract_txt:frequently in 3376) [ClassicSimilarity], result of:
            0.05032038 = score(doc=3376,freq=1.0), product of:
              0.1340149 = queryWeight, product of:
                1.4589438 = boost
                6.0077353 = idf(docFreq=296, maxDocs=44421)
                0.015289869 = queryNorm
              0.37548345 = fieldWeight in 3376, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0077353 = idf(docFreq=296, maxDocs=44421)
                0.0625 = fieldNorm(doc=3376)
          0.053637255 = weight(abstract_txt:vocabulary in 3376) [ClassicSimilarity], result of:
            0.053637255 = score(doc=3376,freq=1.0), product of:
              0.16007811 = queryWeight, product of:
                1.9528712 = boost
                5.3611083 = idf(docFreq=566, maxDocs=44421)
                0.015289869 = queryNorm
              0.33506927 = fieldWeight in 3376, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3611083 = idf(docFreq=566, maxDocs=44421)
                0.0625 = fieldNorm(doc=3376)
          0.07488595 = weight(abstract_txt:topic in 3376) [ClassicSimilarity], result of:
            0.07488595 = score(doc=3376,freq=1.0), product of:
              0.237085 = queryWeight, product of:
                3.0682027 = boost
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.015289869 = queryNorm
              0.3158612 = fieldWeight in 3376, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.0625 = fieldNorm(doc=3376)
          0.10741138 = weight(abstract_txt:terms in 3376) [ClassicSimilarity], result of:
            0.10741138 = score(doc=3376,freq=4.0), product of:
              0.21250054 = queryWeight, product of:
                3.4369724 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.015289869 = queryNorm
              0.505464 = fieldWeight in 3376, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.0625 = fieldNorm(doc=3376)
          0.22750206 = weight(abstract_txt:occurring in 3376) [ClassicSimilarity], result of:
            0.22750206 = score(doc=3376,freq=1.0), product of:
              0.49731025 = queryWeight, product of:
                4.4437103 = boost
                7.319441 = idf(docFreq=79, maxDocs=44421)
                0.015289869 = queryNorm
              0.45746505 = fieldWeight in 3376, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.319441 = idf(docFreq=79, maxDocs=44421)
                0.0625 = fieldNorm(doc=3376)
        0.28 = coord(7/25)
    
  3. Sparck Jones, K.: A statistical interpretation of term specificity and its application in retrieval (2004) 0.16
    0.15778613 = sum of:
      0.15778613 = product of:
        0.78893065 = sum of:
          0.027563503 = weight(abstract_txt:document in 5420) [ClassicSimilarity], result of:
            0.027563503 = score(doc=5420,freq=1.0), product of:
              0.06846773 = queryWeight, product of:
                1.0428095 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.015289869 = queryNorm
              0.40257657 = fieldWeight in 5420, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.09375 = fieldNorm(doc=5420)
          0.075480565 = weight(abstract_txt:frequently in 5420) [ClassicSimilarity], result of:
            0.075480565 = score(doc=5420,freq=1.0), product of:
              0.1340149 = queryWeight, product of:
                1.4589438 = boost
                6.0077353 = idf(docFreq=296, maxDocs=44421)
                0.015289869 = queryNorm
              0.56322515 = fieldWeight in 5420, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0077353 = idf(docFreq=296, maxDocs=44421)
                0.09375 = fieldNorm(doc=5420)
          0.16449915 = weight(abstract_txt:frequent in 5420) [ClassicSimilarity], result of:
            0.16449915 = score(doc=5420,freq=2.0), product of:
              0.17879778 = queryWeight, product of:
                1.6851674 = boost
                6.939294 = idf(docFreq=116, maxDocs=44421)
                0.015289869 = queryNorm
              0.92002904 = fieldWeight in 5420, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.939294 = idf(docFreq=116, maxDocs=44421)
                0.09375 = fieldNorm(doc=5420)
          0.18013436 = weight(abstract_txt:terms in 5420) [ClassicSimilarity], result of:
            0.18013436 = score(doc=5420,freq=5.0), product of:
              0.21250054 = queryWeight, product of:
                3.4369724 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.015289869 = queryNorm
              0.8476889 = fieldWeight in 5420, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.09375 = fieldNorm(doc=5420)
          0.3412531 = weight(abstract_txt:occurring in 5420) [ClassicSimilarity], result of:
            0.3412531 = score(doc=5420,freq=1.0), product of:
              0.49731025 = queryWeight, product of:
                4.4437103 = boost
                7.319441 = idf(docFreq=79, maxDocs=44421)
                0.015289869 = queryNorm
              0.6861976 = fieldWeight in 5420, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.319441 = idf(docFreq=79, maxDocs=44421)
                0.09375 = fieldNorm(doc=5420)
        0.2 = coord(5/25)
    
  4. Wolfram, D.; Zhang, J.: The influence of indexing practices and weighting algorithms on document spaces (2008) 0.13
    0.132665 = sum of:
      0.132665 = product of:
        0.66332495 = sum of:
          0.045011014 = weight(abstract_txt:document in 2963) [ClassicSimilarity], result of:
            0.045011014 = score(doc=2963,freq=6.0), product of:
              0.06846773 = queryWeight, product of:
                1.0428095 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.015289869 = queryNorm
              0.6574048 = fieldWeight in 2963, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0625 = fieldNorm(doc=2963)
          0.08092749 = weight(abstract_txt:discriminative in 2963) [ClassicSimilarity], result of:
            0.08092749 = score(doc=2963,freq=1.0), product of:
              0.14600842 = queryWeight, product of:
                1.0768023 = boost
                8.868255 = idf(docFreq=16, maxDocs=44421)
                0.015289869 = queryNorm
              0.5542659 = fieldWeight in 2963, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.868255 = idf(docFreq=16, maxDocs=44421)
                0.0625 = fieldNorm(doc=2963)
          0.05032038 = weight(abstract_txt:frequently in 2963) [ClassicSimilarity], result of:
            0.05032038 = score(doc=2963,freq=1.0), product of:
              0.1340149 = queryWeight, product of:
                1.4589438 = boost
                6.0077353 = idf(docFreq=296, maxDocs=44421)
                0.015289869 = queryNorm
              0.37548345 = fieldWeight in 2963, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0077353 = idf(docFreq=296, maxDocs=44421)
                0.0625 = fieldNorm(doc=2963)
          0.09302098 = weight(abstract_txt:terms in 2963) [ClassicSimilarity], result of:
            0.09302098 = score(doc=2963,freq=3.0), product of:
              0.21250054 = queryWeight, product of:
                3.4369724 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.015289869 = queryNorm
              0.43774468 = fieldWeight in 2963, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.0625 = fieldNorm(doc=2963)
          0.3940451 = weight(abstract_txt:occurring in 2963) [ClassicSimilarity], result of:
            0.3940451 = score(doc=2963,freq=3.0), product of:
              0.49731025 = queryWeight, product of:
                4.4437103 = boost
                7.319441 = idf(docFreq=79, maxDocs=44421)
                0.015289869 = queryNorm
              0.7923527 = fieldWeight in 2963, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.319441 = idf(docFreq=79, maxDocs=44421)
                0.0625 = fieldNorm(doc=2963)
        0.2 = coord(5/25)
    
  5. Alipour, O.; Soheili, F.; Khasseh, A.A.: A co-word analysis of global research on knowledge organization: 1900-2019 (2022) 0.13
    0.12844673 = sum of:
      0.12844673 = product of:
        0.45873833 = sum of:
          0.012311526 = weight(abstract_txt:number in 2108) [ClassicSimilarity], result of:
            0.012311526 = score(doc=2108,freq=1.0), product of:
              0.063507386 = queryWeight, product of:
                1.0043247 = boost
                4.1356745 = idf(docFreq=1930, maxDocs=44421)
                0.015289869 = queryNorm
              0.19385974 = fieldWeight in 2108, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1356745 = idf(docFreq=1930, maxDocs=44421)
                0.046875 = fieldNorm(doc=2108)
          0.037740283 = weight(abstract_txt:frequently in 2108) [ClassicSimilarity], result of:
            0.037740283 = score(doc=2108,freq=1.0), product of:
              0.1340149 = queryWeight, product of:
                1.4589438 = boost
                6.0077353 = idf(docFreq=296, maxDocs=44421)
                0.015289869 = queryNorm
              0.28161258 = fieldWeight in 2108, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0077353 = idf(docFreq=296, maxDocs=44421)
                0.046875 = fieldNorm(doc=2108)
          0.058159236 = weight(abstract_txt:frequent in 2108) [ClassicSimilarity], result of:
            0.058159236 = score(doc=2108,freq=1.0), product of:
              0.17879778 = queryWeight, product of:
                1.6851674 = boost
                6.939294 = idf(docFreq=116, maxDocs=44421)
                0.015289869 = queryNorm
              0.3252794 = fieldWeight in 2108, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.939294 = idf(docFreq=116, maxDocs=44421)
                0.046875 = fieldNorm(doc=2108)
          0.058814783 = weight(abstract_txt:reduced in 2108) [ClassicSimilarity], result of:
            0.058814783 = score(doc=2108,freq=1.0), product of:
              0.18013883 = queryWeight, product of:
                1.6914754 = boost
                6.965269 = idf(docFreq=113, maxDocs=44421)
                0.015289869 = queryNorm
              0.326497 = fieldWeight in 2108, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.965269 = idf(docFreq=113, maxDocs=44421)
                0.046875 = fieldNorm(doc=2108)
          0.059236635 = weight(abstract_txt:topics in 2108) [ClassicSimilarity], result of:
            0.059236635 = score(doc=2108,freq=3.0), product of:
              0.14365913 = queryWeight, product of:
                1.8500108 = boost
                5.078731 = idf(docFreq=751, maxDocs=44421)
                0.015289869 = queryNorm
              0.4123416 = fieldWeight in 2108, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.078731 = idf(docFreq=751, maxDocs=44421)
                0.046875 = fieldNorm(doc=2108)
          0.15304731 = weight(abstract_txt:removal in 2108) [ClassicSimilarity], result of:
            0.15304731 = score(doc=2108,freq=1.0), product of:
              0.3901191 = queryWeight, product of:
                3.0486407 = boost
                8.369263 = idf(docFreq=27, maxDocs=44421)
                0.015289869 = queryNorm
              0.3923092 = fieldWeight in 2108, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.369263 = idf(docFreq=27, maxDocs=44421)
                0.046875 = fieldNorm(doc=2108)
          0.079428546 = weight(abstract_txt:topic in 2108) [ClassicSimilarity], result of:
            0.079428546 = score(doc=2108,freq=2.0), product of:
              0.237085 = queryWeight, product of:
                3.0682027 = boost
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.015289869 = queryNorm
              0.33502138 = fieldWeight in 2108, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.053779 = idf(docFreq=770, maxDocs=44421)
                0.046875 = fieldNorm(doc=2108)
        0.28 = coord(7/25)
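
The content-based scores combine per-term scores with a coordination factor. As a rough sketch under the same assumed ClassicSimilarity formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))), the score of the first hit (doc 484) can be rebuilt as follows. The boost, queryNorm and fieldNorm values are copied from the explanation rather than derived, and the helper name is illustrative only.

```python
import math

MAX_DOCS = 44421
QUERY_NORM = 0.015289869  # copied from the explanation, not derived here
FIELD_NORM = 0.078125     # fieldNorm(doc=484), copied from the explanation


def term_score(boost: float, freq: float, doc_freq: int) -> float:
    # score = queryWeight * fieldWeight for one matching term.
    idf = 1.0 + math.log(MAX_DOCS / (doc_freq + 1))
    query_weight = boost * idf * QUERY_NORM
    field_weight = math.sqrt(freq) * idf * FIELD_NORM
    return query_weight * field_weight


# (boost, termFreq, docFreq) for the seven query terms that match doc 484.
matched_terms = [
    (1.0043247, 2.0, 1930),  # number
    (1.3173872, 2.0, 531),   # measures
    (1.412903, 1.0, 358),    # similarity
    (1.5408179, 1.0, 211),   # measured
    (2.080197, 1.0, 22),     # removing
    (3.0682027, 1.0, 770),   # topic
    (3.4369724, 2.0, 2116),  # terms
]

term_sum = sum(term_score(b, f, df) for b, f, df in matched_terms)
coord = 7 / 25  # 7 of the 25 query terms occur in the abstract
doc_score = term_sum * coord
print(term_sum, doc_score)
```

The two printed values should land close to the listed 0.5966122 and 0.16705142. The coord factor explains why a record matching five terms strongly (coord 5/25) can still rank below one matching seven terms weakly (coord 7/25).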