Document (#32265)

Author
Zhan, J.
Loh, H.T.
Title
Using latent semantic indexing to improve the accuracy of document clustering
Source
Journal of information and knowledge management. 6(2007) no.3, p.181-188
Year
2007
Abstract
Document clustering is a significant research issue in information retrieval and text mining. Traditionally, most clustering methods have been based on the vector space model, which has limitations such as high dimensionality and weak handling of synonymy and polysemy. Latent semantic indexing (LSI) can deal with such problems to some extent. Previous studies have shown that using LSI can reduce the time needed to cluster a large document set while having little effect on clustering accuracy. However, when clustering a small document set, accuracy is of greater concern than efficiency. In this paper, we demonstrate that LSI can improve the clustering accuracy of a small document set, and we also recommend the number of dimensions needed to achieve the best clustering performance.
Object
Latent Semantic Indexing

Similar documents (content)

  1. Zhu, W.Z.; Allen, R.B.: Document clustering using the LSI subspace signature model (2013) 0.33
    0.32908484 = sum of:
      0.32908484 = product of:
        1.0283902 = sum of:
          0.04043861 = weight(abstract_txt:efficiency in 1690) [ClassicSimilarity], result of:
            0.04043861 = score(doc=1690,freq=1.0), product of:
              0.08511899 = queryWeight, product of:
                1.0005951 = boost
                6.0810666 = idf(docFreq=275, maxDocs=44421)
                0.013989053 = queryNorm
              0.47508332 = fieldWeight in 1690, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0810666 = idf(docFreq=275, maxDocs=44421)
                0.078125 = fieldNorm(doc=1690)
          0.04983524 = weight(abstract_txt:vector in 1690) [ClassicSimilarity], result of:
            0.04983524 = score(doc=1690,freq=1.0), product of:
              0.09784079 = queryWeight, product of:
                1.0727663 = boost
                6.519684 = idf(docFreq=177, maxDocs=44421)
                0.013989053 = queryNorm
              0.5093503 = fieldWeight in 1690, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.519684 = idf(docFreq=177, maxDocs=44421)
                0.078125 = fieldNorm(doc=1690)
          0.014399583 = weight(abstract_txt:such in 1690) [ClassicSimilarity], result of:
            0.014399583 = score(doc=1690,freq=1.0), product of:
              0.053877268 = queryWeight, product of:
                1.1258043 = boost
                3.42101 = idf(docFreq=3945, maxDocs=44421)
                0.013989053 = queryNorm
              0.2672664 = fieldWeight in 1690, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.42101 = idf(docFreq=3945, maxDocs=44421)
                0.078125 = fieldNorm(doc=1690)
          0.041875727 = weight(abstract_txt:indexing in 1690) [ClassicSimilarity], result of:
            0.041875727 = score(doc=1690,freq=2.0), product of:
              0.08712388 = queryWeight, product of:
                1.4316232 = boost
                4.3503094 = idf(docFreq=1557, maxDocs=44421)
                0.013989053 = queryNorm
              0.48064584 = fieldWeight in 1690, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.3503094 = idf(docFreq=1557, maxDocs=44421)
                0.078125 = fieldNorm(doc=1690)
          0.055807207 = weight(abstract_txt:semantic in 1690) [ClassicSimilarity], result of:
            0.055807207 = score(doc=1690,freq=3.0), product of:
              0.09217052 = queryWeight, product of:
                1.472503 = boost
                4.4745317 = idf(docFreq=1375, maxDocs=44421)
                0.013989053 = queryNorm
              0.6054778 = fieldWeight in 1690, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.4745317 = idf(docFreq=1375, maxDocs=44421)
                0.078125 = fieldNorm(doc=1690)
          0.2129316 = weight(abstract_txt:latent in 1690) [ClassicSimilarity], result of:
            0.2129316 = score(doc=1690,freq=3.0), product of:
              0.2250567 = queryWeight, product of:
                2.3009443 = boost
                6.9919376 = idf(docFreq=110, maxDocs=44421)
                0.013989053 = queryNorm
              0.94612426 = fieldWeight in 1690, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.9919376 = idf(docFreq=110, maxDocs=44421)
                0.078125 = fieldNorm(doc=1690)
          0.12331591 = weight(abstract_txt:document in 1690) [ClassicSimilarity], result of:
            0.12331591 = score(doc=1690,freq=3.0), product of:
              0.21222243 = queryWeight, product of:
                3.5328548 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.013989053 = queryNorm
              0.5810692 = fieldWeight in 1690, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.078125 = fieldNorm(doc=1690)
          0.48978624 = weight(abstract_txt:clustering in 1690) [ClassicSimilarity], result of:
            0.48978624 = score(doc=1690,freq=2.0), product of:
              0.71261233 = queryWeight, product of:
                8.188735 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.013989053 = queryNorm
              0.68731093 = fieldWeight in 1690, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.078125 = fieldNorm(doc=1690)
        0.32 = coord(8/25)
    
  2. Cai, X.; Li, W.: Enhancing sentence-level clustering with integrated and interactive frameworks for theme-based summarization (2011) 0.26
    0.26146016 = sum of:
      0.26146016 = product of:
        1.3073008 = sum of:
          0.03986819 = weight(abstract_txt:vector in 770) [ClassicSimilarity], result of:
            0.03986819 = score(doc=770,freq=1.0), product of:
              0.09784079 = queryWeight, product of:
                1.0727663 = boost
                6.519684 = idf(docFreq=177, maxDocs=44421)
                0.013989053 = queryNorm
              0.40748024 = fieldWeight in 770, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.519684 = idf(docFreq=177, maxDocs=44421)
                0.0625 = fieldNorm(doc=770)
          0.042962827 = weight(abstract_txt:traditionally in 770) [ClassicSimilarity], result of:
            0.042962827 = score(doc=770,freq=1.0), product of:
              0.1028405 = queryWeight, product of:
                1.0998342 = boost
                6.684188 = idf(docFreq=150, maxDocs=44421)
                0.013989053 = queryNorm
              0.41776174 = fieldWeight in 770, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.684188 = idf(docFreq=150, maxDocs=44421)
                0.0625 = fieldNorm(doc=770)
          0.0118857445 = weight(abstract_txt:using in 770) [ClassicSimilarity], result of:
            0.0118857445 = score(doc=770,freq=1.0), product of:
              0.055012733 = queryWeight, product of:
                1.1376057 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.013989053 = queryNorm
              0.21605442 = fieldWeight in 770, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.0625 = fieldNorm(doc=770)
          0.13951604 = weight(abstract_txt:document in 770) [ClassicSimilarity], result of:
            0.13951604 = score(doc=770,freq=6.0), product of:
              0.21222243 = queryWeight, product of:
                3.5328548 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.013989053 = queryNorm
              0.6574048 = fieldWeight in 770, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0625 = fieldNorm(doc=770)
          1.073068 = weight(abstract_txt:clustering in 770) [ClassicSimilarity], result of:
            1.073068 = score(doc=770,freq=15.0), product of:
              0.71261233 = queryWeight, product of:
                8.188735 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.013989053 = queryNorm
              1.5058229 = fieldWeight in 770, product of:
                3.8729835 = tf(freq=15.0), with freq of:
                  15.0 = termFreq=15.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.0625 = fieldNorm(doc=770)
        0.2 = coord(5/25)
    
  3. Liu, X.; Yu, S.; Janssens, F.; Glänzel, W.; Moreau, Y.; Moor, B.de: Weighted hybrid clustering by combining text mining and bibliometrics on a large-scale journal database (2010) 0.21
    0.21489015 = sum of:
      0.21489015 = product of:
        1.0744507 = sum of:
          0.05718883 = weight(abstract_txt:efficiency in 451) [ClassicSimilarity], result of:
            0.05718883 = score(doc=451,freq=2.0), product of:
              0.08511899 = queryWeight, product of:
                1.0005951 = boost
                6.0810666 = idf(docFreq=275, maxDocs=44421)
                0.013989053 = queryNorm
              0.6718693 = fieldWeight in 451, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0810666 = idf(docFreq=275, maxDocs=44421)
                0.078125 = fieldNorm(doc=451)
          0.042280767 = weight(abstract_txt:mining in 451) [ClassicSimilarity], result of:
            0.042280767 = score(doc=451,freq=1.0), product of:
              0.08768477 = queryWeight, product of:
                1.0155638 = boost
                6.1720386 = idf(docFreq=251, maxDocs=44421)
                0.013989053 = queryNorm
              0.48219052 = fieldWeight in 451, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.1720386 = idf(docFreq=251, maxDocs=44421)
                0.078125 = fieldNorm(doc=451)
          0.0148571795 = weight(abstract_txt:using in 451) [ClassicSimilarity], result of:
            0.0148571795 = score(doc=451,freq=1.0), product of:
              0.055012733 = queryWeight, product of:
                1.1376057 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.013989053 = queryNorm
              0.27006802 = fieldWeight in 451, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.078125 = fieldNorm(doc=451)
          0.043817844 = weight(abstract_txt:improve in 451) [ClassicSimilarity], result of:
            0.043817844 = score(doc=451,freq=1.0), product of:
              0.11313742 = queryWeight, product of:
                1.6314106 = boost
                4.9574084 = idf(docFreq=848, maxDocs=44421)
                0.013989053 = queryNorm
              0.38729754 = fieldWeight in 451, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9574084 = idf(docFreq=848, maxDocs=44421)
                0.078125 = fieldNorm(doc=451)
          0.91630614 = weight(abstract_txt:clustering in 451) [ClassicSimilarity], result of:
            0.91630614 = score(doc=451,freq=7.0), product of:
              0.71261233 = queryWeight, product of:
                8.188735 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.013989053 = queryNorm
              1.285841 = fieldWeight in 451, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.078125 = fieldNorm(doc=451)
        0.2 = coord(5/25)
    
  4. Shah, B.; Raghavan, V.; Dhatric, P.; Zhao, X.: A cluster-based approach for efficient content-based image retrieval using a similarity-preserving space transformation method (2006) 0.20
    0.20083886 = sum of:
      0.20083886 = product of:
        0.71728164 = sum of:
          0.040032186 = weight(abstract_txt:efficiency in 118) [ClassicSimilarity], result of:
            0.040032186 = score(doc=118,freq=2.0), product of:
              0.08511899 = queryWeight, product of:
                1.0005951 = boost
                6.0810666 = idf(docFreq=275, maxDocs=44421)
                0.013989053 = queryNorm
              0.4703085 = fieldWeight in 118, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0810666 = idf(docFreq=275, maxDocs=44421)
                0.0546875 = fieldNorm(doc=118)
          0.034884665 = weight(abstract_txt:vector in 118) [ClassicSimilarity], result of:
            0.034884665 = score(doc=118,freq=1.0), product of:
              0.09784079 = queryWeight, product of:
                1.0727663 = boost
                6.519684 = idf(docFreq=177, maxDocs=44421)
                0.013989053 = queryNorm
              0.3565452 = fieldWeight in 118, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.519684 = idf(docFreq=177, maxDocs=44421)
                0.0546875 = fieldNorm(doc=118)
          0.010400026 = weight(abstract_txt:using in 118) [ClassicSimilarity], result of:
            0.010400026 = score(doc=118,freq=1.0), product of:
              0.055012733 = queryWeight, product of:
                1.1376057 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.013989053 = queryNorm
              0.18904762 = fieldWeight in 118, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.0546875 = fieldNorm(doc=118)
          0.020028433 = weight(abstract_txt:problems in 118) [ClassicSimilarity], result of:
            0.020028433 = score(doc=118,freq=1.0), product of:
              0.08515397 = queryWeight, product of:
                1.4153459 = boost
                4.300847 = idf(docFreq=1636, maxDocs=44421)
                0.013989053 = queryNorm
              0.23520258 = fieldWeight in 118, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.300847 = idf(docFreq=1636, maxDocs=44421)
                0.0546875 = fieldNorm(doc=118)
          0.020727428 = weight(abstract_txt:indexing in 118) [ClassicSimilarity], result of:
            0.020727428 = score(doc=118,freq=1.0), product of:
              0.08712388 = queryWeight, product of:
                1.4316232 = boost
                4.3503094 = idf(docFreq=1557, maxDocs=44421)
                0.013989053 = queryNorm
              0.23790754 = fieldWeight in 118, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3503094 = idf(docFreq=1557, maxDocs=44421)
                0.0546875 = fieldNorm(doc=118)
          0.10634526 = weight(abstract_txt:accuracy in 118) [ClassicSimilarity], result of:
            0.10634526 = score(doc=118,freq=1.0), product of:
              0.32653445 = queryWeight, product of:
                3.919581 = boost
                5.9552646 = idf(docFreq=312, maxDocs=44421)
                0.013989053 = queryNorm
              0.32567853 = fieldWeight in 118, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9552646 = idf(docFreq=312, maxDocs=44421)
                0.0546875 = fieldNorm(doc=118)
          0.48486364 = weight(abstract_txt:clustering in 118) [ClassicSimilarity], result of:
            0.48486364 = score(doc=118,freq=4.0), product of:
              0.71261233 = queryWeight, product of:
                8.188735 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.013989053 = queryNorm
              0.6804031 = fieldWeight in 118, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.0546875 = fieldNorm(doc=118)
        0.28 = coord(7/25)
    
  5. Cribbin, T.: Discovering latent topical structure by second-order similarity analysis (2011) 0.19
    0.19468331 = sum of:
      0.19468331 = product of:
        0.6083854 = sum of:
          0.035722252 = weight(abstract_txt:reduce in 470) [ClassicSimilarity], result of:
            0.035722252 = score(doc=470,freq=1.0), product of:
              0.090934396 = queryWeight, product of:
                1.0342112 = boost
                6.285367 = idf(docFreq=224, maxDocs=44421)
                0.013989053 = queryNorm
              0.39283544 = fieldWeight in 470, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.285367 = idf(docFreq=224, maxDocs=44421)
                0.0625 = fieldNorm(doc=470)
          0.06905373 = weight(abstract_txt:vector in 470) [ClassicSimilarity], result of:
            0.06905373 = score(doc=470,freq=3.0), product of:
              0.09784079 = queryWeight, product of:
                1.0727663 = boost
                6.519684 = idf(docFreq=177, maxDocs=44421)
                0.013989053 = queryNorm
              0.70577645 = fieldWeight in 470, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.519684 = idf(docFreq=177, maxDocs=44421)
                0.0625 = fieldNorm(doc=470)
          0.07696368 = weight(abstract_txt:synonymous in 470) [ClassicSimilarity], result of:
            0.07696368 = score(doc=470,freq=1.0), product of:
              0.15169089 = queryWeight, product of:
                1.335749 = boost
                8.117949 = idf(docFreq=35, maxDocs=44421)
                0.013989053 = queryNorm
              0.5073718 = fieldWeight in 470, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.117949 = idf(docFreq=35, maxDocs=44421)
                0.0625 = fieldNorm(doc=470)
          0.022889636 = weight(abstract_txt:problems in 470) [ClassicSimilarity], result of:
            0.022889636 = score(doc=470,freq=1.0), product of:
              0.08515397 = queryWeight, product of:
                1.4153459 = boost
                4.300847 = idf(docFreq=1636, maxDocs=44421)
                0.013989053 = queryNorm
              0.26880294 = fieldWeight in 470, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.300847 = idf(docFreq=1636, maxDocs=44421)
                0.0625 = fieldNorm(doc=470)
          0.036453113 = weight(abstract_txt:semantic in 470) [ClassicSimilarity], result of:
            0.036453113 = score(doc=470,freq=2.0), product of:
              0.09217052 = queryWeight, product of:
                1.472503 = boost
                4.4745317 = idf(docFreq=1375, maxDocs=44421)
                0.013989053 = queryNorm
              0.39549646 = fieldWeight in 470, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.4745317 = idf(docFreq=1375, maxDocs=44421)
                0.0625 = fieldNorm(doc=470)
          0.14000046 = weight(abstract_txt:polysemous in 470) [ClassicSimilarity], result of:
            0.14000046 = score(doc=470,freq=1.0), product of:
              0.22604172 = queryWeight, product of:
                1.6305699 = boost
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.013989053 = queryNorm
              0.61935675 = fieldWeight in 470, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.0625 = fieldNorm(doc=470)
          0.17034528 = weight(abstract_txt:latent in 470) [ClassicSimilarity], result of:
            0.17034528 = score(doc=470,freq=3.0), product of:
              0.2250567 = queryWeight, product of:
                2.3009443 = boost
                6.9919376 = idf(docFreq=110, maxDocs=44421)
                0.013989053 = queryNorm
              0.7568994 = fieldWeight in 470, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.9919376 = idf(docFreq=110, maxDocs=44421)
                0.0625 = fieldNorm(doc=470)
          0.05695718 = weight(abstract_txt:document in 470) [ClassicSimilarity], result of:
            0.05695718 = score(doc=470,freq=1.0), product of:
              0.21222243 = queryWeight, product of:
                3.5328548 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.013989053 = queryNorm
              0.26838437 = fieldWeight in 470, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0625 = fieldNorm(doc=470)
        0.32 = coord(8/25)
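
The score trees above can be reproduced by hand. The following is a minimal sketch, assuming the usual Lucene ClassicSimilarity formulas (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), and a final coord factor); the function names `idf` and `term_score` are illustrative and not part of any library. The input numbers are copied from the explain tree for doc 1690 (entry 1). Lucene computes in 32-bit floats, so the last digits may differ slightly.

```python
import math

def idf(doc_freq, max_docs):
    # Inverse document frequency as shown in the explain output,
    # e.g. idf(docFreq=275, maxDocs=44421) is about 6.0810666.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # Per-term score = queryWeight * fieldWeight.
    i = idf(doc_freq, max_docs)
    query_weight = boost * i * query_norm          # boost * idf * queryNorm
    field_weight = math.sqrt(freq) * i * field_norm  # tf * idf * fieldNorm
    return query_weight * field_weight

# Constants shared by every term in the tree for doc 1690.
MAX_DOCS = 44421
QUERY_NORM = 0.013989053
FIELD_NORM = 0.078125

# (term, termFreq, docFreq, boost), copied from the explain tree.
terms = [
    ("efficiency", 1.0, 275, 1.0005951),
    ("vector", 1.0, 177, 1.0727663),
    ("such", 1.0, 3945, 1.1258043),
    ("indexing", 2.0, 1557, 1.4316232),
    ("semantic", 3.0, 1375, 1.472503),
    ("latent", 3.0, 110, 2.3009443),
    ("document", 3.0, 1647, 3.5328548),
    ("clustering", 2.0, 239, 8.188735),
]

# Sum of the per-term scores (the "sum of" node in the tree).
total = sum(
    term_score(freq, doc_freq, MAX_DOCS, boost, QUERY_NORM, FIELD_NORM)
    for _, freq, doc_freq, boost in terms
)

# coord(8/25): 8 of the 25 query terms matched this document.
coord = 8 / 25
score = total * coord

print(f"sum = {total:.5f}, score = {score:.5f}")
```

The result should agree with the 0.33 listed for entry 1 up to floating-point rounding. The same calculation applies to the other entries once their own fieldNorm, term frequencies, and coord values are substituted.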