Document (#35173)

Author
Wang, J.
Title
¬An extensive study on automated Dewey Decimal Classification
Source
Journal of the American Society for Information Science and Technology. 60(2009) no.11, p.2269-2286
Year
2009
Abstract
In this paper, we present a theoretical analysis and extensive experiments on the automated assignment of Dewey Decimal Classification (DDC) classes to bibliographic data with a supervised machine-learning approach. Library classification systems such as the DDC pose major obstacles for state-of-the-art text categorization (TC) technologies, including deep hierarchy, data sparseness, and skewed distribution. We first statistically analyze the document and category distributions over the DDC, and discuss the obstacles imposed by bibliographic corpora and library classification schemes on TC technology. To overcome these obstacles, we propose an innovative algorithm to reshape the DDC structure into a balanced virtual tree by balancing the category distribution and flattening the hierarchy. To improve the classification effectiveness to a level acceptable to real-world applications, we propose an interactive classification model that is able to predict a class of any depth within a limited number of user interactions. The experiments are conducted on a large bibliographic collection created by the Library of Congress within the science and technology domains over 10 years. With no more than three interactions, a classification accuracy of nearly 90% is achieved, thus providing a practical solution to the automatic bibliographic classification problem.
Theme
Automatic classification
Object
DDC

Similar documents (author)

  1. Wang, H.; Wang, C.: Ontologies for universal information systems (1995) 4.62
    4.6221313 = sum of:
      4.6221313 = weight(author_txt:wang in 3262) [ClassicSimilarity], result of:
        4.6221313 = score(doc=3262,freq=2.0), product of:
          0.99999994 = queryWeight, product of:
            6.5366817 = idf(docFreq=174, maxDocs=44421)
            0.15298282 = queryNorm
          4.622132 = fieldWeight in 3262, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            6.5366817 = idf(docFreq=174, maxDocs=44421)
            0.5 = fieldNorm(doc=3262)
    
  2. Wang, F.; Wang, X.: Tracing theory diffusion : a text mining and citation-based analysis of TAM (2020) 4.62
    4.6221313 = sum of:
      4.6221313 = weight(author_txt:wang in 980) [ClassicSimilarity], result of:
        4.6221313 = score(doc=980,freq=2.0), product of:
          0.99999994 = queryWeight, product of:
            6.5366817 = idf(docFreq=174, maxDocs=44421)
            0.15298282 = queryNorm
          4.622132 = fieldWeight in 980, product of:
            1.4142135 = tf(freq=2.0), with freq of:
              2.0 = termFreq=2.0
            6.5366817 = idf(docFreq=174, maxDocs=44421)
            0.5 = fieldNorm(doc=980)
    
  3. Wang, C.: ¬The online catalogue, subject access and user reactions : a review (1985) 4.09
    4.0854254 = sum of:
      4.0854254 = weight(author_txt:wang in 985) [ClassicSimilarity], result of:
        4.0854254 = score(doc=985,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.5366817 = idf(docFreq=174, maxDocs=44421)
            0.15298282 = queryNorm
          4.085426 = fieldWeight in 985, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.5366817 = idf(docFreq=174, maxDocs=44421)
            0.625 = fieldNorm(doc=985)
    
  4. Wang, C.: Bibliometrics : a textbook (1990) 4.09
    4.0854254 = sum of:
      4.0854254 = weight(author_txt:wang in 5108) [ClassicSimilarity], result of:
        4.0854254 = score(doc=5108,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.5366817 = idf(docFreq=174, maxDocs=44421)
            0.15298282 = queryNorm
          4.085426 = fieldWeight in 5108, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.5366817 = idf(docFreq=174, maxDocs=44421)
            0.625 = fieldNorm(doc=5108)
    
  5. Wang, P.: Users' information needs at different stages of a research project : a cognitive view (1997) 4.09
    4.0854254 = sum of:
      4.0854254 = weight(author_txt:wang in 1320) [ClassicSimilarity], result of:
        4.0854254 = score(doc=1320,freq=1.0), product of:
          0.99999994 = queryWeight, product of:
            6.5366817 = idf(docFreq=174, maxDocs=44421)
            0.15298282 = queryNorm
          4.085426 = fieldWeight in 1320, product of:
            1.0 = tf(freq=1.0), with freq of:
              1.0 = termFreq=1.0
            6.5366817 = idf(docFreq=174, maxDocs=44421)
            0.625 = fieldNorm(doc=1320)
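
The breakdowns in this list are labeled ClassicSimilarity, Lucene's TF-IDF scoring. The following is a minimal sketch of how one term's score is assembled from the printed factors. It assumes tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which agree with the tf and idf values shown above. The function names are hypothetical helpers for illustration, not Lucene API calls.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency (assumed form: 1 + ln(maxDocs / (docFreq + 1)))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq: float) -> float:
    """Term-frequency factor (assumed form: square root of the raw frequency)."""
    return math.sqrt(freq)


def term_score(freq, doc_freq, max_docs, field_norm, query_norm, boost=1.0):
    """Score of one term = queryWeight * fieldWeight, as in the breakdowns."""
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm      # "queryWeight, product of"
    field_weight = tf(freq) * term_idf * field_norm   # "fieldWeight ..., product of"
    return query_weight * field_weight


# Factors copied from entry 1 (doc=3262) and entry 3 (doc=985) of the author list.
score_entry_1 = term_score(2.0, 174, 44421, field_norm=0.5, query_norm=0.15298282)
score_entry_3 = term_score(1.0, 174, 44421, field_norm=0.625, query_norm=0.15298282)
print(round(score_entry_1, 4), round(score_entry_3, 4))
```

Entries 1 and 2 outrank entries 3 to 5 because the term "wang" occurs twice in their author field (freq=2.0). The larger tf outweighs their smaller fieldNorm of 0.5 compared with 0.625.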
    

Similar documents (content)

  1. Riesthuis, G.J.A.: Fiction in need of transcending traditional classification (1997) 0.16
    0.15678607 = sum of:
      0.15678607 = product of:
        0.979913 = sum of:
          0.04735722 = weight(abstract_txt:library in 2808) [ClassicSimilarity], result of:
            0.04735722 = score(doc=2808,freq=1.0), product of:
              0.07920381 = queryWeight, product of:
                1.2811294 = boost
                3.188885 = idf(docFreq=4976, maxDocs=44421)
                0.01938716 = queryNorm
              0.59791595 = fieldWeight in 2808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.188885 = idf(docFreq=4976, maxDocs=44421)
                0.1875 = fieldNorm(doc=2808)
          0.19541836 = weight(abstract_txt:dewey in 2808) [ClassicSimilarity], result of:
            0.19541836 = score(doc=2808,freq=1.0), product of:
              0.17800617 = queryWeight, product of:
                1.5681654 = boost
                5.8550286 = idf(docFreq=345, maxDocs=44421)
                0.01938716 = queryNorm
              1.0978179 = fieldWeight in 2808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8550286 = idf(docFreq=345, maxDocs=44421)
                0.1875 = fieldNorm(doc=2808)
          0.30797413 = weight(abstract_txt:decimal in 2808) [ClassicSimilarity], result of:
            0.30797413 = score(doc=2808,freq=2.0), product of:
              0.19133349 = queryWeight, product of:
                1.6258101 = boost
                6.0702558 = idf(docFreq=278, maxDocs=44421)
                0.01938716 = queryNorm
              1.6096196 = fieldWeight in 2808, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0702558 = idf(docFreq=278, maxDocs=44421)
                0.1875 = fieldNorm(doc=2808)
          0.42916328 = weight(abstract_txt:classification in 2808) [ClassicSimilarity], result of:
            0.42916328 = score(doc=2808,freq=3.0), product of:
              0.3310189 = queryWeight, product of:
                4.276916 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.01938716 = queryNorm
              1.2964917 = fieldWeight in 2808, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.1875 = fieldNorm(doc=2808)
        0.16 = coord(4/25)
    
  2. Rafferty, P.: ¬The representation of knowledge in library classification schemes (2001) 0.16
    0.15618324 = sum of:
      0.15618324 = product of:
        0.5577973 = sum of:
          0.02388094 = weight(abstract_txt:within in 1640) [ClassicSimilarity], result of:
            0.02388094 = score(doc=1640,freq=1.0), product of:
              0.09118147 = queryWeight, product of:
                1.1223483 = boost
                4.19049 = idf(docFreq=1827, maxDocs=44421)
                0.01938716 = queryNorm
              0.2619056 = fieldWeight in 1640, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.19049 = idf(docFreq=1827, maxDocs=44421)
                0.0625 = fieldNorm(doc=1640)
          0.024784803 = weight(abstract_txt:over in 1640) [ClassicSimilarity], result of:
            0.024784803 = score(doc=1640,freq=1.0), product of:
              0.093467936 = queryWeight, product of:
                1.1363331 = boost
                4.242705 = idf(docFreq=1734, maxDocs=44421)
                0.01938716 = queryNorm
              0.26516905 = fieldWeight in 1640, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.242705 = idf(docFreq=1734, maxDocs=44421)
                0.0625 = fieldNorm(doc=1640)
          0.015785739 = weight(abstract_txt:library in 1640) [ClassicSimilarity], result of:
            0.015785739 = score(doc=1640,freq=1.0), product of:
              0.07920381 = queryWeight, product of:
                1.2811294 = boost
                3.188885 = idf(docFreq=4976, maxDocs=44421)
                0.01938716 = queryNorm
              0.19930531 = fieldWeight in 1640, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.188885 = idf(docFreq=4976, maxDocs=44421)
                0.0625 = fieldNorm(doc=1640)
          0.06513945 = weight(abstract_txt:dewey in 1640) [ClassicSimilarity], result of:
            0.06513945 = score(doc=1640,freq=1.0), product of:
              0.17800617 = queryWeight, product of:
                1.5681654 = boost
                5.8550286 = idf(docFreq=345, maxDocs=44421)
                0.01938716 = queryNorm
              0.3659393 = fieldWeight in 1640, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.8550286 = idf(docFreq=345, maxDocs=44421)
                0.0625 = fieldNorm(doc=1640)
          0.0725902 = weight(abstract_txt:decimal in 1640) [ClassicSimilarity], result of:
            0.0725902 = score(doc=1640,freq=1.0), product of:
              0.19133349 = queryWeight, product of:
                1.6258101 = boost
                6.0702558 = idf(docFreq=278, maxDocs=44421)
                0.01938716 = queryNorm
              0.37939098 = fieldWeight in 1640, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0702558 = idf(docFreq=278, maxDocs=44421)
                0.0625 = fieldNorm(doc=1640)
          0.06950734 = weight(abstract_txt:bibliographic in 1640) [ClassicSimilarity], result of:
            0.06950734 = score(doc=1640,freq=2.0), product of:
              0.18587719 = queryWeight, product of:
                2.2662218 = boost
                4.230674 = idf(docFreq=1755, maxDocs=44421)
                0.01938716 = queryNorm
              0.37394226 = fieldWeight in 1640, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.230674 = idf(docFreq=1755, maxDocs=44421)
                0.0625 = fieldNorm(doc=1640)
          0.28610885 = weight(abstract_txt:classification in 1640) [ClassicSimilarity], result of:
            0.28610885 = score(doc=1640,freq=12.0), product of:
              0.3310189 = queryWeight, product of:
                4.276916 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.01938716 = queryNorm
              0.86432785 = fieldWeight in 1640, product of:
                3.4641016 = tf(freq=12.0), with freq of:
                  12.0 = termFreq=12.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.0625 = fieldNorm(doc=1640)
        0.28 = coord(7/25)
    
  3. Mitchell, J.S.: DDC21 and beyond : the Dewey Decimal Classification prepares for the future (1995) 0.15
    0.15124877 = sum of:
      0.15124877 = product of:
        0.7562438 = sum of:
          0.085337214 = weight(abstract_txt:distribution in 564) [ClassicSimilarity], result of:
            0.085337214 = score(doc=564,freq=1.0), product of:
              0.16264366 = queryWeight, product of:
                1.4989698 = boost
                5.5966744 = idf(docFreq=447, maxDocs=44421)
                0.01938716 = queryNorm
              0.52468824 = fieldWeight in 564, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5966744 = idf(docFreq=447, maxDocs=44421)
                0.09375 = fieldNorm(doc=564)
          0.19541836 = weight(abstract_txt:dewey in 564) [ClassicSimilarity], result of:
            0.19541836 = score(doc=564,freq=4.0), product of:
              0.17800617 = queryWeight, product of:
                1.5681654 = boost
                5.8550286 = idf(docFreq=345, maxDocs=44421)
                0.01938716 = queryNorm
              1.0978179 = fieldWeight in 564, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.8550286 = idf(docFreq=345, maxDocs=44421)
                0.09375 = fieldNorm(doc=564)
          0.15398706 = weight(abstract_txt:decimal in 564) [ClassicSimilarity], result of:
            0.15398706 = score(doc=564,freq=2.0), product of:
              0.19133349 = queryWeight, product of:
                1.6258101 = boost
                6.0702558 = idf(docFreq=278, maxDocs=44421)
                0.01938716 = queryNorm
              0.8048098 = fieldWeight in 564, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0702558 = idf(docFreq=278, maxDocs=44421)
                0.09375 = fieldNorm(doc=564)
          0.07372367 = weight(abstract_txt:bibliographic in 564) [ClassicSimilarity], result of:
            0.07372367 = score(doc=564,freq=1.0), product of:
              0.18587719 = queryWeight, product of:
                2.2662218 = boost
                4.230674 = idf(docFreq=1755, maxDocs=44421)
                0.01938716 = queryNorm
              0.39662567 = fieldWeight in 564, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.230674 = idf(docFreq=1755, maxDocs=44421)
                0.09375 = fieldNorm(doc=564)
          0.24777754 = weight(abstract_txt:classification in 564) [ClassicSimilarity], result of:
            0.24777754 = score(doc=564,freq=4.0), product of:
              0.3310189 = queryWeight, product of:
                4.276916 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.01938716 = queryNorm
              0.7485299 = fieldWeight in 564, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.09375 = fieldNorm(doc=564)
        0.2 = coord(5/25)
    
  4. Olson, H.A.: ¬The ubiquitous hierarchy : an army to overcome the threat of a mob (2004) 0.14
    0.1445501 = sum of:
      0.1445501 = product of:
        0.7227505 = sum of:
          0.027625045 = weight(abstract_txt:library in 958) [ClassicSimilarity], result of:
            0.027625045 = score(doc=958,freq=1.0), product of:
              0.07920381 = queryWeight, product of:
                1.2811294 = boost
                3.188885 = idf(docFreq=4976, maxDocs=44421)
                0.01938716 = queryNorm
              0.3487843 = fieldWeight in 958, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.188885 = idf(docFreq=4976, maxDocs=44421)
                0.109375 = fieldNorm(doc=958)
          0.19744347 = weight(abstract_txt:dewey in 958) [ClassicSimilarity], result of:
            0.19744347 = score(doc=958,freq=3.0), product of:
              0.17800617 = queryWeight, product of:
                1.5681654 = boost
                5.8550286 = idf(docFreq=345, maxDocs=44421)
                0.01938716 = queryNorm
              1.1091945 = fieldWeight in 958, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.8550286 = idf(docFreq=345, maxDocs=44421)
                0.109375 = fieldNorm(doc=958)
          0.12703285 = weight(abstract_txt:decimal in 958) [ClassicSimilarity], result of:
            0.12703285 = score(doc=958,freq=1.0), product of:
              0.19133349 = queryWeight, product of:
                1.6258101 = boost
                6.0702558 = idf(docFreq=278, maxDocs=44421)
                0.01938716 = queryNorm
              0.66393423 = fieldWeight in 958, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0702558 = idf(docFreq=278, maxDocs=44421)
                0.109375 = fieldNorm(doc=958)
          0.22611223 = weight(abstract_txt:hierarchy in 958) [ClassicSimilarity], result of:
            0.22611223 = score(doc=958,freq=2.0), product of:
              0.22304183 = queryWeight, product of:
                1.7553653 = boost
                6.553973 = idf(docFreq=171, maxDocs=44421)
                0.01938716 = queryNorm
              1.013766 = fieldWeight in 958, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.553973 = idf(docFreq=171, maxDocs=44421)
                0.109375 = fieldNorm(doc=958)
          0.1445369 = weight(abstract_txt:classification in 958) [ClassicSimilarity], result of:
            0.1445369 = score(doc=958,freq=1.0), product of:
              0.3310189 = queryWeight, product of:
                4.276916 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.01938716 = queryNorm
              0.43664244 = fieldWeight in 958, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.109375 = fieldNorm(doc=958)
        0.2 = coord(5/25)
    
  5. Jouguelet, S.: Various applications of the Dewey Decimal Classification at the Bibliothèque Nationale de France (1998) 0.14
    0.14333294 = sum of:
      0.14333294 = product of:
        0.7166647 = sum of:
          0.05910231 = weight(abstract_txt:within in 892) [ClassicSimilarity], result of:
            0.05910231 = score(doc=892,freq=2.0), product of:
              0.09118147 = queryWeight, product of:
                1.1223483 = boost
                4.19049 = idf(docFreq=1827, maxDocs=44421)
                0.01938716 = queryNorm
              0.64818335 = fieldWeight in 892, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.19049 = idf(docFreq=1827, maxDocs=44421)
                0.109375 = fieldNorm(doc=892)
          0.027625045 = weight(abstract_txt:library in 892) [ClassicSimilarity], result of:
            0.027625045 = score(doc=892,freq=1.0), product of:
              0.07920381 = queryWeight, product of:
                1.2811294 = boost
                3.188885 = idf(docFreq=4976, maxDocs=44421)
                0.01938716 = queryNorm
              0.3487843 = fieldWeight in 892, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.188885 = idf(docFreq=4976, maxDocs=44421)
                0.109375 = fieldNorm(doc=892)
          0.16121192 = weight(abstract_txt:dewey in 892) [ClassicSimilarity], result of:
            0.16121192 = score(doc=892,freq=2.0), product of:
              0.17800617 = queryWeight, product of:
                1.5681654 = boost
                5.8550286 = idf(docFreq=345, maxDocs=44421)
                0.01938716 = queryNorm
              0.90565354 = fieldWeight in 892, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.8550286 = idf(docFreq=345, maxDocs=44421)
                0.109375 = fieldNorm(doc=892)
          0.17965157 = weight(abstract_txt:decimal in 892) [ClassicSimilarity], result of:
            0.17965157 = score(doc=892,freq=2.0), product of:
              0.19133349 = queryWeight, product of:
                1.6258101 = boost
                6.0702558 = idf(docFreq=278, maxDocs=44421)
                0.01938716 = queryNorm
              0.93894476 = fieldWeight in 892, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.0702558 = idf(docFreq=278, maxDocs=44421)
                0.109375 = fieldNorm(doc=892)
          0.2890738 = weight(abstract_txt:classification in 892) [ClassicSimilarity], result of:
            0.2890738 = score(doc=892,freq=4.0), product of:
              0.3310189 = queryWeight, product of:
                4.276916 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.01938716 = queryNorm
              0.8732849 = fieldWeight in 892, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.109375 = fieldNorm(doc=892)
        0.2 = coord(5/25)
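
In the content-based entries, the final score is the sum of the matching term scores multiplied by a coordination factor, coord(matched/total). The denominator 25 is taken from the breakdowns and presumably reflects the number of terms in the generated query. The sketch below recombines the printed per-term scores for two entries. The helper name `combined_score` is hypothetical and not part of Lucene.

```python
def combined_score(term_scores, matched, total_query_terms):
    """Sum the per-term scores, then scale by coord = matched / total_query_terms."""
    coord = matched / total_query_terms
    return sum(term_scores) * coord


# Per-term scores copied from entry 1 (doc=2808): library, dewey, decimal, classification.
riesthuis_terms = [0.04735722, 0.19541836, 0.30797413, 0.42916328]
riesthuis_score = combined_score(riesthuis_terms, matched=4, total_query_terms=25)

# Per-term scores copied from entry 3 (doc=564):
# distribution, dewey, decimal, bibliographic, classification.
mitchell_terms = [0.085337214, 0.19541836, 0.15398706, 0.07372367, 0.24777754]
mitchell_score = combined_score(mitchell_terms, matched=5, total_query_terms=25)

print(round(riesthuis_score, 6), round(mitchell_score, 6))
```

The coordination factor rewards documents that match more of the query terms. Entry 2 (doc=1640) has a lower term sum than entry 3 (doc=564), yet ranks higher because it matches 7 of 25 terms rather than 5.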