Document (#36798)

Author
Li, T.
Zhu, S.
Ogihara, M.
Title
Hierarchical document classification using automatically generated hierarchy
Source
Journal of intelligent information systems. 29(2007) no.2, p.211-230
Year
2007
Abstract
Automated text categorization has witnessed a booming interest with the exponential growth of information and the ever-increasing needs for organizations. The underlying hierarchical structure identifies the relationships of dependence between different categories and provides valuable sources of information for categorization. Although considerable research has been conducted in the field of hierarchical document categorization, little has been done on automatic generation of topic hierarchies. In this paper, we propose the method of using linear discriminant projection to generate more meaningful intermediate levels of hierarchies in large flat sets of classes. The linear discriminant projection approach first transforms all documents onto a low-dimensional space and then clusters the categories into hierarchies accordingly. The paper also investigates the effect of using generated hierarchical structure for text classification. Our experiments show that generated hierarchies improve classification performance in most cases.
Theme
Automatisches Klassifizieren (automatic classification)

Similar documents (content)

  1. Ruiz, M.E.; Srinivasan, P.: Combining machine learning and hierarchical indexing structures for text categorization (2001) 0.28
    0.27623364 = sum of:
      0.27623364 = product of:
        0.86323017 = sum of:
          0.018413445 = weight(abstract_txt:paper in 2595) [ClassicSimilarity], result of:
            0.018413445 = score(doc=2595,freq=1.0), product of:
              0.056739513 = queryWeight, product of:
                1.0164586 = boost
                3.4616103 = idf(docFreq=3788, maxDocs=44421)
                0.016125668 = queryNorm
              0.32452595 = fieldWeight in 2595, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4616103 = idf(docFreq=3788, maxDocs=44421)
                0.09375 = fieldNorm(doc=2595)
          0.04142324 = weight(abstract_txt:text in 2595) [ClassicSimilarity], result of:
            0.04142324 = score(doc=2595,freq=2.0), product of:
              0.07731818 = queryWeight, product of:
                1.1865547 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.016125668 = queryNorm
              0.5357503 = fieldWeight in 2595, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.09375 = fieldNorm(doc=2595)
          0.11874338 = weight(abstract_txt:flat in 2595) [ClassicSimilarity], result of:
            0.11874338 = score(doc=2595,freq=1.0), product of:
              0.15602416 = queryWeight, product of:
                1.1918671 = boost
                8.117949 = idf(docFreq=35, maxDocs=44421)
                0.016125668 = queryNorm
              0.7610577 = fieldWeight in 2595, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.117949 = idf(docFreq=35, maxDocs=44421)
                0.09375 = fieldNorm(doc=2595)
          0.036742996 = weight(abstract_txt:structure in 2595) [ClassicSimilarity], result of:
            0.036742996 = score(doc=2595,freq=1.0), product of:
              0.089931525 = queryWeight, product of:
                1.2796844 = boost
                4.3580413 = idf(docFreq=1545, maxDocs=44421)
                0.016125668 = queryNorm
              0.40856636 = fieldWeight in 2595, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3580413 = idf(docFreq=1545, maxDocs=44421)
                0.09375 = fieldNorm(doc=2595)
          0.06162583 = weight(abstract_txt:categories in 2595) [ClassicSimilarity], result of:
            0.06162583 = score(doc=2595,freq=1.0), product of:
              0.12695138 = queryWeight, product of:
                1.5204272 = boost
                5.177905 = idf(docFreq=680, maxDocs=44421)
                0.016125668 = queryNorm
              0.4854286 = fieldWeight in 2595, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.177905 = idf(docFreq=680, maxDocs=44421)
                0.09375 = fieldNorm(doc=2595)
          0.027506873 = weight(abstract_txt:using in 2595) [ClassicSimilarity], result of:
            0.027506873 = score(doc=2595,freq=1.0), product of:
              0.08487637 = queryWeight, product of:
                1.5226004 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.016125668 = queryNorm
              0.32408163 = fieldWeight in 2595, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.09375 = fieldNorm(doc=2595)
          0.26943883 = weight(abstract_txt:categorization in 2595) [ClassicSimilarity], result of:
            0.26943883 = score(doc=2595,freq=2.0), product of:
              0.308406 = queryWeight, product of:
                2.9023778 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.016125668 = queryNorm
              0.87364984 = fieldWeight in 2595, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.09375 = fieldNorm(doc=2595)
          0.28933555 = weight(abstract_txt:hierarchical in 2595) [ClassicSimilarity], result of:
            0.28933555 = score(doc=2595,freq=3.0), product of:
              0.31095654 = queryWeight, product of:
                3.3652067 = boost
                5.7302055 = idf(docFreq=391, maxDocs=44421)
                0.016125668 = queryNorm
              0.9304694 = fieldWeight in 2595, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.7302055 = idf(docFreq=391, maxDocs=44421)
                0.09375 = fieldNorm(doc=2595)
        0.32 = coord(8/25)
    
  2. Li, T.; Zhu, S.; Ogihara, M.: Text categorization via generalized discriminant analysis (2008) 0.25
    0.24847631 = sum of:
      0.24847631 = product of:
        0.69021195 = sum of:
          0.012275631 = weight(abstract_txt:paper in 3119) [ClassicSimilarity], result of:
            0.012275631 = score(doc=3119,freq=1.0), product of:
              0.056739513 = queryWeight, product of:
                1.0164586 = boost
                3.4616103 = idf(docFreq=3788, maxDocs=44421)
                0.016125668 = queryNorm
              0.21635064 = fieldWeight in 3119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4616103 = idf(docFreq=3788, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
          0.013974397 = weight(abstract_txt:been in 3119) [ClassicSimilarity], result of:
            0.013974397 = score(doc=3119,freq=1.0), product of:
              0.061860267 = queryWeight, product of:
                1.0613358 = boost
                3.614442 = idf(docFreq=3251, maxDocs=44421)
                0.016125668 = queryNorm
              0.22590263 = fieldWeight in 3119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.614442 = idf(docFreq=3251, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
          0.04366393 = weight(abstract_txt:text in 3119) [ClassicSimilarity], result of:
            0.04366393 = score(doc=3119,freq=5.0), product of:
              0.07731818 = queryWeight, product of:
                1.1865547 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.016125668 = queryNorm
              0.56473047 = fieldWeight in 3119, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
          0.023433702 = weight(abstract_txt:document in 3119) [ClassicSimilarity], result of:
            0.023433702 = score(doc=3119,freq=1.0), product of:
              0.08731396 = queryWeight, product of:
                1.2609235 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.016125668 = queryNorm
              0.26838437 = fieldWeight in 3119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
          0.024495332 = weight(abstract_txt:structure in 3119) [ClassicSimilarity], result of:
            0.024495332 = score(doc=3119,freq=1.0), product of:
              0.089931525 = queryWeight, product of:
                1.2796844 = boost
                4.3580413 = idf(docFreq=1545, maxDocs=44421)
                0.016125668 = queryNorm
              0.27237758 = fieldWeight in 3119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3580413 = idf(docFreq=1545, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
          0.018337917 = weight(abstract_txt:using in 3119) [ClassicSimilarity], result of:
            0.018337917 = score(doc=3119,freq=1.0), product of:
              0.08487637 = queryWeight, product of:
                1.5226004 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.016125668 = queryNorm
              0.21605442 = fieldWeight in 3119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
          0.0847316 = weight(abstract_txt:classification in 3119) [ClassicSimilarity], result of:
            0.0847316 = score(doc=3119,freq=9.0), product of:
              0.11319735 = queryWeight, product of:
                1.7583717 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.016125668 = queryNorm
              0.7485299 = fieldWeight in 3119, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
          0.21527004 = weight(abstract_txt:discriminant in 3119) [ClassicSimilarity], result of:
            0.21527004 = score(doc=3119,freq=1.0), product of:
              0.3829824 = queryWeight, product of:
                2.640805 = boost
                8.993418 = idf(docFreq=14, maxDocs=44421)
                0.016125668 = queryNorm
              0.5620886 = fieldWeight in 3119, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.993418 = idf(docFreq=14, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
          0.2540294 = weight(abstract_txt:categorization in 3119) [ClassicSimilarity], result of:
            0.2540294 = score(doc=3119,freq=4.0), product of:
              0.308406 = queryWeight, product of:
                2.9023778 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.016125668 = queryNorm
              0.823685 = fieldWeight in 3119, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.0625 = fieldNorm(doc=3119)
        0.36 = coord(9/25)
    
  3. Sebastiani, F.: Machine learning in automated text categorization (2002) 0.19
    0.18525924 = sum of:
      0.18525924 = product of:
        0.6616401 = sum of:
          0.024408879 = weight(abstract_txt:text in 4389) [ClassicSimilarity], result of:
            0.024408879 = score(doc=4389,freq=1.0), product of:
              0.07731818 = queryWeight, product of:
                1.1865547 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.016125668 = queryNorm
              0.3156939 = fieldWeight in 4389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=4389)
          0.029292125 = weight(abstract_txt:document in 4389) [ClassicSimilarity], result of:
            0.029292125 = score(doc=4389,freq=1.0), product of:
              0.08731396 = queryWeight, product of:
                1.2609235 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.016125668 = queryNorm
              0.33548045 = fieldWeight in 4389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.078125 = fieldNorm(doc=4389)
          0.12652582 = weight(abstract_txt:witnessed in 4389) [ClassicSimilarity], result of:
            0.12652582 = score(doc=4389,freq=1.0), product of:
              0.18380578 = queryWeight, product of:
                1.2936342 = boost
                8.811096 = idf(docFreq=17, maxDocs=44421)
                0.016125668 = queryNorm
              0.6883669 = fieldWeight in 4389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.811096 = idf(docFreq=17, maxDocs=44421)
                0.078125 = fieldNorm(doc=4389)
          0.14894933 = weight(abstract_txt:booming in 4389) [ClassicSimilarity], result of:
            0.14894933 = score(doc=4389,freq=1.0), product of:
              0.2049268 = queryWeight, product of:
                1.365939 = boost
                9.303573 = idf(docFreq=10, maxDocs=44421)
                0.016125668 = queryNorm
              0.7268416 = fieldWeight in 4389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.303573 = idf(docFreq=10, maxDocs=44421)
                0.078125 = fieldNorm(doc=4389)
          0.07262673 = weight(abstract_txt:categories in 4389) [ClassicSimilarity], result of:
            0.07262673 = score(doc=4389,freq=2.0), product of:
              0.12695138 = queryWeight, product of:
                1.5204272 = boost
                5.177905 = idf(docFreq=680, maxDocs=44421)
                0.016125668 = queryNorm
              0.57208306 = fieldWeight in 4389, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.177905 = idf(docFreq=680, maxDocs=44421)
                0.078125 = fieldNorm(doc=4389)
          0.035304833 = weight(abstract_txt:classification in 4389) [ClassicSimilarity], result of:
            0.035304833 = score(doc=4389,freq=1.0), product of:
              0.11319735 = queryWeight, product of:
                1.7583717 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.016125668 = queryNorm
              0.31188744 = fieldWeight in 4389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.078125 = fieldNorm(doc=4389)
          0.22453237 = weight(abstract_txt:categorization in 4389) [ClassicSimilarity], result of:
            0.22453237 = score(doc=4389,freq=2.0), product of:
              0.308406 = queryWeight, product of:
                2.9023778 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.016125668 = queryNorm
              0.7280415 = fieldWeight in 4389, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.078125 = fieldNorm(doc=4389)
        0.28 = coord(7/25)
    
  4. Desale, S.K.; Kumbhar, R.: Research on automatic classification of documents in library environment : a literature review (2013) 0.16
    0.15651053 = sum of:
      0.15651053 = product of:
        0.55896616 = sum of:
          0.015344538 = weight(abstract_txt:paper in 2071) [ClassicSimilarity], result of:
            0.015344538 = score(doc=2071,freq=1.0), product of:
              0.056739513 = queryWeight, product of:
                1.0164586 = boost
                3.4616103 = idf(docFreq=3788, maxDocs=44421)
                0.016125668 = queryNorm
              0.2704383 = fieldWeight in 2071, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4616103 = idf(docFreq=3788, maxDocs=44421)
                0.078125 = fieldNorm(doc=2071)
          0.024408879 = weight(abstract_txt:text in 2071) [ClassicSimilarity], result of:
            0.024408879 = score(doc=2071,freq=1.0), product of:
              0.07731818 = queryWeight, product of:
                1.1865547 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.016125668 = queryNorm
              0.3156939 = fieldWeight in 2071, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=2071)
          0.029292125 = weight(abstract_txt:document in 2071) [ClassicSimilarity], result of:
            0.029292125 = score(doc=2071,freq=1.0), product of:
              0.08731396 = queryWeight, product of:
                1.2609235 = boost
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.016125668 = queryNorm
              0.33548045 = fieldWeight in 2071, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.29415 = idf(docFreq=1647, maxDocs=44421)
                0.078125 = fieldNorm(doc=2071)
          0.039702754 = weight(abstract_txt:using in 2071) [ClassicSimilarity], result of:
            0.039702754 = score(doc=2071,freq=3.0), product of:
              0.08487637 = queryWeight, product of:
                1.5226004 = boost
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.016125668 = queryNorm
              0.46777156 = fieldWeight in 2071, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.078125 = fieldNorm(doc=2071)
          0.08647884 = weight(abstract_txt:classification in 2071) [ClassicSimilarity], result of:
            0.08647884 = score(doc=2071,freq=6.0), product of:
              0.11319735 = queryWeight, product of:
                1.7583717 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.016125668 = queryNorm
              0.7639652 = fieldWeight in 2071, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.078125 = fieldNorm(doc=2071)
          0.22453237 = weight(abstract_txt:categorization in 2071) [ClassicSimilarity], result of:
            0.22453237 = score(doc=2071,freq=2.0), product of:
              0.308406 = queryWeight, product of:
                2.9023778 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.016125668 = queryNorm
              0.7280415 = fieldWeight in 2071, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.078125 = fieldNorm(doc=2071)
          0.13920663 = weight(abstract_txt:hierarchical in 2071) [ClassicSimilarity], result of:
            0.13920663 = score(doc=2071,freq=1.0), product of:
              0.31095654 = queryWeight, product of:
                3.3652067 = boost
                5.7302055 = idf(docFreq=391, maxDocs=44421)
                0.016125668 = queryNorm
              0.4476723 = fieldWeight in 2071, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7302055 = idf(docFreq=391, maxDocs=44421)
                0.078125 = fieldNorm(doc=2071)
        0.28 = coord(7/25)
    
  5. Yoon, Y.; Lee, C.; Lee, G.G.: An effective procedure for constructing a hierarchical text classification system (2006) 0.15
    0.14793412 = sum of:
      0.14793412 = product of:
        0.7396706 = sum of:
          0.024408879 = weight(abstract_txt:text in 273) [ClassicSimilarity], result of:
            0.024408879 = score(doc=273,freq=1.0), product of:
              0.07731818 = queryWeight, product of:
                1.1865547 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.016125668 = queryNorm
              0.3156939 = fieldWeight in 273, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=273)
          0.078944005 = weight(abstract_txt:classification in 273) [ClassicSimilarity], result of:
            0.078944005 = score(doc=273,freq=5.0), product of:
              0.11319735 = queryWeight, product of:
                1.7583717 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.016125668 = queryNorm
              0.6974015 = fieldWeight in 273, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.078125 = fieldNorm(doc=273)
          0.15876837 = weight(abstract_txt:categorization in 273) [ClassicSimilarity], result of:
            0.15876837 = score(doc=273,freq=1.0), product of:
              0.308406 = queryWeight, product of:
                2.9023778 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.016125668 = queryNorm
              0.5148031 = fieldWeight in 273, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.078125 = fieldNorm(doc=273)
          0.19913605 = weight(abstract_txt:hierarchies in 273) [ClassicSimilarity], result of:
            0.19913605 = score(doc=273,freq=1.0), product of:
              0.35868517 = queryWeight, product of:
                3.1300354 = boost
                7.1063476 = idf(docFreq=98, maxDocs=44421)
                0.016125668 = queryNorm
              0.5551834 = fieldWeight in 273, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.1063476 = idf(docFreq=98, maxDocs=44421)
                0.078125 = fieldNorm(doc=273)
          0.27841327 = weight(abstract_txt:hierarchical in 273) [ClassicSimilarity], result of:
            0.27841327 = score(doc=273,freq=4.0), product of:
              0.31095654 = queryWeight, product of:
                3.3652067 = boost
                5.7302055 = idf(docFreq=391, maxDocs=44421)
                0.016125668 = queryNorm
              0.8953446 = fieldWeight in 273, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.7302055 = idf(docFreq=391, maxDocs=44421)
                0.078125 = fieldNorm(doc=273)
        0.2 = coord(5/25)
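
The score trees above all follow the same arithmetic, so a short sketch may help in reading them. The Python below recomputes the score of the last entry (doc=273) from the freq, boost, docFreq, maxDocs, queryNorm, fieldNorm and coord values printed in its tree. It assumes tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which is consistent with the numbers shown on this page. The function and constant names are illustrative only and are not part of any Lucene API. Results are computed in double precision, so they may differ from the listed 32-bit values in the last digits.

```python
import math

# Values copied from the explain tree for doc=273 (entry 5 above).
MAX_DOCS = 44421          # maxDocs
QUERY_NORM = 0.016125668  # queryNorm
FIELD_NORM = 0.078125     # fieldNorm(doc=273)
QUERY_TERMS = 25          # denominator of coord(5/25)

# (term, freq in the document, boost, docFreq), as printed in the tree.
TERMS = [
    ("text",           1.0, 1.1865547, 2122),
    ("classification", 5.0, 1.7583717, 2228),
    ("categorization", 1.0, 2.9023778, 165),
    ("hierarchies",    1.0, 3.1300354, 98),
    ("hierarchical",   4.0, 3.3652067, 391),
]


def idf(doc_freq, max_docs):
    """Assumed idf formula; it reproduces the idf values listed on this page."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq):
    """Term-frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def term_score(freq, boost, doc_freq):
    """score = queryWeight * fieldWeight for one matching term."""
    term_idf = idf(doc_freq, MAX_DOCS)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = tf(freq) * term_idf * FIELD_NORM
    return query_weight * field_weight


def document_score(terms):
    """Sum of the per-term scores, scaled by coord = matched terms / query terms."""
    coord = len(terms) / QUERY_TERMS
    return coord * sum(term_score(f, b, df) for _, f, b, df in terms)


score = document_score(TERMS)
print(f"{score:.4f}")  # the page lists 0.14793412 for this entry
```

The coord factor explains why entry 5 ranks last even though its per-term sum (0.7396706) is larger than that of entries 2 to 4: it matches only 5 of the 25 query terms.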