Document (#30173)

Author
Giorgetti, D.
Sebastiani, F.
Title
Automating survey coding by multiclass text categorization techniques
Source
Journal of the American Society for Information Science and Technology. 54(2003) no.14, pp.1269-1277
Year
2003
Abstract
In this issue, Giorgetti and Sebastiani suggest that answers to open-ended questions in survey instruments can be coded automatically by creating classifiers that learn from training sets of manually coded answers. The only manual effort required is classifying a representative set of documents; no dictionary of words that trigger an assignment has to be created. They use a naive Bayesian probabilistic learner from McCallum's RAINBOW package and the multi-class support vector machine learner from Hsu and Lin's BSVM package, both examples of text categorization techniques. Data from the 1996 General Social Survey by the U.S. National Opinion Research Center provided a set of answers to three questions (previously tested by Viechnicki using a dictionary approach), their associated manually assigned category codes, and a complete set of predefined category codes. The learners were run on three random disjoint subsets of the answer sets to create the classifiers, and a remaining set was used as a test set. RAINBOW outperforms the dictionary approach by 18% and BSVM outperforms it by 17%, while the standard deviation of the results is reduced by 28% and 34% respectively relative to the dictionary approach.
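The learning-from-coded-answers idea in the abstract can be illustrated with a minimal sketch. This is not RAINBOW or BSVM: it is a toy multinomial naive Bayes classifier with add-one smoothing, written from scratch, and the answers and category codes (`ECONOMIC`, `FAMILY`) are hypothetical placeholders rather than General Social Survey data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately simple: lowercase and split on whitespace.
    return text.lower().split()


def train(coded_answers):
    """Build count tables from (answer_text, category_code) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, code in coded_answers:
        class_counts[code] += 1
        for word in tokenize(text):
            word_counts[code][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the category code with the highest log posterior."""
    class_counts, word_counts, vocab = model
    total_answers = sum(class_counts.values())
    best_code, best_logp = None, -math.inf
    for code in class_counts:
        logp = math.log(class_counts[code] / total_answers)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(word_counts[code].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                logp += math.log((word_counts[code][word] + 1) / denom)
        if logp > best_logp:
            best_code, best_logp = code, logp
    return best_code


# Hypothetical manually coded answers (illustration only).
training = [
    ("more money and better pay", "ECONOMIC"),
    ("higher pay for workers", "ECONOMIC"),
    ("spend time with family", "FAMILY"),
    ("family and children come first", "FAMILY"),
]
model = train(training)
print(classify(model, "better pay"))
print(classify(model, "family time"))
```

The point of the sketch is that the only manual input is the list of coded answers; no trigger-word dictionary is written by hand.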
Theme
Automatic classification

Similar documents (author)

  1. Sebastiani, F.: On the role of logic in information retrieval (1998) 5.94
    5.937289 = sum of:
      5.937289 = weight(author_txt:sebastiani in 1140) [ClassicSimilarity], result of:
        5.937289 = fieldWeight in 1140, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.499662 = idf(docFreq=8, maxDocs=44218)
          0.625 = fieldNorm(doc=1140)
    
  2. Sebastiani, F.: Machine learning in automated text categorization (2002) 5.94
    5.937289 = sum of:
      5.937289 = weight(author_txt:sebastiani in 3389) [ClassicSimilarity], result of:
        5.937289 = fieldWeight in 3389, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.499662 = idf(docFreq=8, maxDocs=44218)
          0.625 = fieldNorm(doc=3389)
    
  3. Sebastiani, F.: A tutorial on automated text categorisation (1999) 5.94
    5.937289 = sum of:
      5.937289 = weight(author_txt:sebastiani in 3390) [ClassicSimilarity], result of:
        5.937289 = fieldWeight in 3390, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.499662 = idf(docFreq=8, maxDocs=44218)
          0.625 = fieldNorm(doc=3390)
    
  4. Sebastiani, F.: Classification of text, automatic (2006) 5.94
    5.937289 = sum of:
      5.937289 = weight(author_txt:sebastiani in 5003) [ClassicSimilarity], result of:
        5.937289 = fieldWeight in 5003, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.499662 = idf(docFreq=8, maxDocs=44218)
          0.625 = fieldNorm(doc=5003)
    
  5. Debole, F.; Sebastiani, F.: An analysis of the relative hardness of Reuters-21578 subsets (2005) 4.75
    4.749831 = sum of:
      4.749831 = weight(author_txt:sebastiani in 3456) [ClassicSimilarity], result of:
        4.749831 = fieldWeight in 3456, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.499662 = idf(docFreq=8, maxDocs=44218)
          0.5 = fieldNorm(doc=3456)
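
The author scores above can be reproduced by hand. The sketch below assumes the classic TF-IDF formulas that match the numbers shown, namely tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the function names are illustrative and are not Lucene API calls.

```python
import math


def idf(doc_freq, max_docs):
    # Inverse document frequency in the form that matches the explain output.
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def tf(freq):
    # Term frequency is dampened with a square root.
    return math.sqrt(freq)


def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm, as in the explain tree.
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Entries 1-4: fieldNorm = 0.625; entry 5: fieldNorm = 0.5.
score_single_author = field_weight(1.0, 8, 44218, 0.625)
score_two_authors = field_weight(1.0, 8, 44218, 0.5)
print(round(score_single_author, 4), round(score_two_authors, 4))
```

Only fieldNorm differs between the entries: the record with two authors has a longer author field and therefore a smaller norm (0.5 instead of 0.625), which accounts for the lower score of 4.75.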
    

Similar documents (content)

  1. Sun, A.; Lim, E.-P.; Ng, W.-K.: Performance measurement framework for hierarchical text classification (2003) 0.12
    0.117500804 = sum of:
      0.117500804 = product of:
        0.587504 = sum of:
          0.1184501 = weight(abstract_txt:naive in 1808) [ClassicSimilarity], result of:
            0.1184501 = score(doc=1808,freq=2.0), product of:
              0.16088542 = queryWeight, product of:
                8.329592 = idf(docFreq=28, maxDocs=44218)
                0.019314922 = queryNorm
              0.73623884 = fieldWeight in 1808, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.329592 = idf(docFreq=28, maxDocs=44218)
                0.0625 = fieldNorm(doc=1808)
          0.01730944 = weight(abstract_txt:from in 1808) [ClassicSimilarity], result of:
            0.01730944 = score(doc=1808,freq=2.0), product of:
              0.07085466 = queryWeight, product of:
                1.3272595 = boost
                2.7638826 = idf(docFreq=7577, maxDocs=44218)
                0.019314922 = queryNorm
              0.24429502 = fieldWeight in 1808, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.7638826 = idf(docFreq=7577, maxDocs=44218)
                0.0625 = fieldNorm(doc=1808)
          0.122831926 = weight(abstract_txt:category in 1808) [ClassicSimilarity], result of:
            0.122831926 = score(doc=1808,freq=3.0), product of:
              0.18141791 = queryWeight, product of:
                1.5017469 = boost
                6.2544694 = idf(docFreq=230, maxDocs=44218)
                0.019314922 = queryNorm
              0.67706615 = fieldWeight in 1808, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.2544694 = idf(docFreq=230, maxDocs=44218)
                0.0625 = fieldNorm(doc=1808)
          0.053013086 = weight(abstract_txt:survey in 1808) [ClassicSimilarity], result of:
            0.053013086 = score(doc=1808,freq=1.0), product of:
              0.1710536 = queryWeight, product of:
                1.7859462 = boost
                4.9587345 = idf(docFreq=843, maxDocs=44218)
                0.019314922 = queryNorm
              0.3099209 = fieldWeight in 1808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9587345 = idf(docFreq=843, maxDocs=44218)
                0.0625 = fieldNorm(doc=1808)
          0.27589947 = weight(abstract_txt:classifiers in 1808) [ClassicSimilarity], result of:
            0.27589947 = score(doc=1808,freq=5.0), product of:
              0.26243615 = queryWeight, product of:
                1.806211 = boost
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.019314922 = queryNorm
              1.0513014 = fieldWeight in 1808, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.0625 = fieldNorm(doc=1808)
        0.2 = coord(5/25)
    
  2. Sebastiani, F.: Classification of text, automatic (2006) 0.11
    0.11078773 = sum of:
      0.11078773 = product of:
        0.5539386 = sum of:
          0.03179947 = weight(abstract_txt:from in 5003) [ClassicSimilarity], result of:
            0.03179947 = score(doc=5003,freq=3.0), product of:
              0.07085466 = queryWeight, product of:
                1.3272595 = boost
                2.7638826 = idf(docFreq=7577, maxDocs=44218)
                0.019314922 = queryNorm
              0.4487986 = fieldWeight in 5003, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.7638826 = idf(docFreq=7577, maxDocs=44218)
                0.09375 = fieldNorm(doc=5003)
          0.03426347 = weight(abstract_txt:approach in 5003) [ClassicSimilarity], result of:
            0.03426347 = score(doc=5003,freq=1.0), product of:
              0.09758211 = queryWeight, product of:
                1.3489237 = boost
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.019314922 = queryNorm
              0.3511245 = fieldWeight in 5003, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.09375 = fieldNorm(doc=5003)
          0.07951963 = weight(abstract_txt:survey in 5003) [ClassicSimilarity], result of:
            0.07951963 = score(doc=5003,freq=1.0), product of:
              0.1710536 = queryWeight, product of:
                1.7859462 = boost
                4.9587345 = idf(docFreq=843, maxDocs=44218)
                0.019314922 = queryNorm
              0.46488136 = fieldWeight in 5003, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9587345 = idf(docFreq=843, maxDocs=44218)
                0.09375 = fieldNorm(doc=5003)
          0.18507901 = weight(abstract_txt:classifiers in 5003) [ClassicSimilarity], result of:
            0.18507901 = score(doc=5003,freq=1.0), product of:
              0.26243615 = queryWeight, product of:
                1.806211 = boost
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.019314922 = queryNorm
              0.7052344 = fieldWeight in 5003, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.09375 = fieldNorm(doc=5003)
          0.22327703 = weight(abstract_txt:learner in 5003) [ClassicSimilarity], result of:
            0.22327703 = score(doc=5003,freq=1.0), product of:
              0.297405 = queryWeight, product of:
                1.9227853 = boost
                8.008008 = idf(docFreq=39, maxDocs=44218)
                0.019314922 = queryNorm
              0.7507508 = fieldWeight in 5003, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.008008 = idf(docFreq=39, maxDocs=44218)
                0.09375 = fieldNorm(doc=5003)
        0.2 = coord(5/25)
    
  3. Ko, Y.; Park, J.; Seo, J.: Improving text categorization using the importance of sentences (2004) 0.11
    0.10501599 = sum of:
      0.10501599 = product of:
        0.43756664 = sum of:
          0.08375687 = weight(abstract_txt:naive in 2557) [ClassicSimilarity], result of:
            0.08375687 = score(doc=2557,freq=1.0), product of:
              0.16088542 = queryWeight, product of:
                8.329592 = idf(docFreq=28, maxDocs=44218)
                0.019314922 = queryNorm
              0.5205995 = fieldWeight in 2557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.329592 = idf(docFreq=28, maxDocs=44218)
                0.0625 = fieldNorm(doc=2557)
          0.026942046 = weight(abstract_txt:techniques in 2557) [ClassicSimilarity], result of:
            0.026942046 = score(doc=2557,freq=1.0), product of:
              0.095162705 = queryWeight, product of:
                1.0876522 = boost
                4.5298495 = idf(docFreq=1295, maxDocs=44218)
                0.019314922 = queryNorm
              0.2831156 = fieldWeight in 2557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5298495 = idf(docFreq=1295, maxDocs=44218)
                0.0625 = fieldNorm(doc=2557)
          0.057144735 = weight(abstract_txt:sets in 2557) [ClassicSimilarity], result of:
            0.057144735 = score(doc=2557,freq=2.0), product of:
              0.12468682 = queryWeight, product of:
                1.2449931 = boost
                5.185142 = idf(docFreq=672, maxDocs=44218)
                0.019314922 = queryNorm
              0.45830613 = fieldWeight in 2557, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.185142 = idf(docFreq=672, maxDocs=44218)
                0.0625 = fieldNorm(doc=2557)
          0.012239622 = weight(abstract_txt:from in 2557) [ClassicSimilarity], result of:
            0.012239622 = score(doc=2557,freq=1.0), product of:
              0.07085466 = queryWeight, product of:
                1.3272595 = boost
                2.7638826 = idf(docFreq=7577, maxDocs=44218)
                0.019314922 = queryNorm
              0.17274266 = fieldWeight in 2557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.7638826 = idf(docFreq=7577, maxDocs=44218)
                0.0625 = fieldNorm(doc=2557)
          0.08298922 = weight(abstract_txt:categorization in 2557) [ClassicSimilarity], result of:
            0.08298922 = score(doc=2557,freq=1.0), product of:
              0.20146249 = queryWeight, product of:
                1.5825366 = boost
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.019314922 = queryNorm
              0.41193387 = fieldWeight in 2557, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.0625 = fieldNorm(doc=2557)
          0.17449415 = weight(abstract_txt:classifiers in 2557) [ClassicSimilarity], result of:
            0.17449415 = score(doc=2557,freq=2.0), product of:
              0.26243615 = queryWeight, product of:
                1.806211 = boost
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.019314922 = queryNorm
              0.6649013 = fieldWeight in 2557, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.0625 = fieldNorm(doc=2557)
        0.24 = coord(6/25)
    
  4. Liu, R.-L.: A passage extractor for classification of disease aspect information (2013) 0.09
    0.08969114 = sum of:
      0.08969114 = product of:
        0.4484557 = sum of:
          0.024964483 = weight(abstract_txt:three in 1107) [ClassicSimilarity], result of:
            0.024964483 = score(doc=1107,freq=1.0), product of:
              0.090447135 = queryWeight, product of:
                1.0603617 = boost
                4.41619 = idf(docFreq=1451, maxDocs=44218)
                0.019314922 = queryNorm
              0.27601188 = fieldWeight in 1107, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.41619 = idf(docFreq=1451, maxDocs=44218)
                0.0625 = fieldNorm(doc=1107)
          0.038101807 = weight(abstract_txt:techniques in 1107) [ClassicSimilarity], result of:
            0.038101807 = score(doc=1107,freq=2.0), product of:
              0.095162705 = queryWeight, product of:
                1.0876522 = boost
                4.5298495 = idf(docFreq=1295, maxDocs=44218)
                0.019314922 = queryNorm
              0.40038592 = fieldWeight in 1107, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.5298495 = idf(docFreq=1295, maxDocs=44218)
                0.0625 = fieldNorm(doc=1107)
          0.012239622 = weight(abstract_txt:from in 1107) [ClassicSimilarity], result of:
            0.012239622 = score(doc=1107,freq=1.0), product of:
              0.07085466 = queryWeight, product of:
                1.3272595 = boost
                2.7638826 = idf(docFreq=7577, maxDocs=44218)
                0.019314922 = queryNorm
              0.17274266 = fieldWeight in 1107, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.7638826 = idf(docFreq=7577, maxDocs=44218)
                0.0625 = fieldNorm(doc=1107)
          0.07091705 = weight(abstract_txt:category in 1107) [ClassicSimilarity], result of:
            0.07091705 = score(doc=1107,freq=1.0), product of:
              0.18141791 = queryWeight, product of:
                1.5017469 = boost
                6.2544694 = idf(docFreq=230, maxDocs=44218)
                0.019314922 = queryNorm
              0.39090434 = fieldWeight in 1107, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.2544694 = idf(docFreq=230, maxDocs=44218)
                0.0625 = fieldNorm(doc=1107)
          0.30223274 = weight(abstract_txt:classifiers in 1107) [ClassicSimilarity], result of:
            0.30223274 = score(doc=1107,freq=6.0), product of:
              0.26243615 = queryWeight, product of:
                1.806211 = boost
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.019314922 = queryNorm
              1.1516429 = fieldWeight in 1107, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.0625 = fieldNorm(doc=1107)
        0.2 = coord(5/25)
    
  5. Sebastiani, F.: Machine learning in automated text categorization (2002) 0.08
    0.08222633 = sum of:
      0.08222633 = product of:
        0.3426097 = sum of:
          0.031205606 = weight(abstract_txt:three in 3389) [ClassicSimilarity], result of:
            0.031205606 = score(doc=3389,freq=1.0), product of:
              0.090447135 = queryWeight, product of:
                1.0603617 = boost
                4.41619 = idf(docFreq=1451, maxDocs=44218)
                0.019314922 = queryNorm
              0.34501487 = fieldWeight in 3389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.41619 = idf(docFreq=1451, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.03367756 = weight(abstract_txt:techniques in 3389) [ClassicSimilarity], result of:
            0.03367756 = score(doc=3389,freq=1.0), product of:
              0.095162705 = queryWeight, product of:
                1.0876522 = boost
                4.5298495 = idf(docFreq=1295, maxDocs=44218)
                0.019314922 = queryNorm
              0.3538945 = fieldWeight in 3389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.5298495 = idf(docFreq=1295, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.015299528 = weight(abstract_txt:from in 3389) [ClassicSimilarity], result of:
            0.015299528 = score(doc=3389,freq=1.0), product of:
              0.07085466 = queryWeight, product of:
                1.3272595 = boost
                2.7638826 = idf(docFreq=7577, maxDocs=44218)
                0.019314922 = queryNorm
              0.21592833 = fieldWeight in 3389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.7638826 = idf(docFreq=7577, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.049455054 = weight(abstract_txt:approach in 3389) [ClassicSimilarity], result of:
            0.049455054 = score(doc=3389,freq=3.0), product of:
              0.09758211 = queryWeight, product of:
                1.3489237 = boost
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.019314922 = queryNorm
              0.5068045 = fieldWeight in 3389, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.745328 = idf(docFreq=2839, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.1467056 = weight(abstract_txt:categorization in 3389) [ClassicSimilarity], result of:
            0.1467056 = score(doc=3389,freq=2.0), product of:
              0.20146249 = queryWeight, product of:
                1.5825366 = boost
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.019314922 = queryNorm
              0.72820306 = fieldWeight in 3389, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.590942 = idf(docFreq=164, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
          0.06626636 = weight(abstract_txt:survey in 3389) [ClassicSimilarity], result of:
            0.06626636 = score(doc=3389,freq=1.0), product of:
              0.1710536 = queryWeight, product of:
                1.7859462 = boost
                4.9587345 = idf(docFreq=843, maxDocs=44218)
                0.019314922 = queryNorm
              0.38740113 = fieldWeight in 3389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.9587345 = idf(docFreq=843, maxDocs=44218)
                0.078125 = fieldNorm(doc=3389)
        0.24 = coord(6/25)
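
The content scores combine a per-term product with a coordination factor. As a worked check, the sketch below recomputes the first entry (doc=1808) from the values printed in its explain tree, assuming score = queryWeight * fieldWeight per term, with queryWeight = boost * idf * queryNorm, fieldWeight = sqrt(freq) * idf * fieldNorm, and the sum scaled by coord = matched terms / query terms. The helper names are illustrative, not Lucene API calls.

```python
import math

QUERY_NORM = 0.019314922
MAX_DOCS = 44218


def idf(doc_freq, max_docs):
    # Inverse document frequency in the form that matches the explain output.
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def term_score(freq, doc_freq, field_norm, boost=1.0):
    i = idf(doc_freq, MAX_DOCS)
    query_weight = boost * i * QUERY_NORM
    field_weight = math.sqrt(freq) * i * field_norm
    return query_weight * field_weight


# (term, freq, docFreq, boost) as listed for doc=1808; "naive" has no boost line.
terms_1808 = [
    ("naive", 2.0, 28, 1.0),
    ("from", 2.0, 7577, 1.3272595),
    ("category", 3.0, 230, 1.5017469),
    ("survey", 1.0, 843, 1.7859462),
    ("classifiers", 5.0, 64, 1.806211),
]

raw_sum = sum(term_score(f, df, 0.0625, b) for _, f, df, b in terms_1808)
coord = 5 / 25  # 5 of the 25 query terms matched
final_score = raw_sum * coord
print(round(raw_sum, 4), round(final_score, 4))
```

The coord factor explains why entries that match six of the 25 query terms (coord 0.24) can rank close to entries with a larger raw sum but only five matches (coord 0.2).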