Document (#38497)

Author
Aphinyanaphongs, Y.
Fu, L.D.
Li, Z.
Peskin, E.R.
Efstathiadis, E.
Aliferis, C.F.
Statnikov, A.
Title
A comprehensive empirical comparison of modern supervised classification and feature selection methods for text categorization
Source
Journal of the Association for Information Science and Technology. 65(2014) no.10, pp. 1964-1987
Year
2014
Abstract
An important aspect of performing text categorization is selecting appropriate supervised classification and feature selection methods. A comprehensive benchmark is needed to inform best practices in this broad application field. Previous benchmarks have evaluated performance for only a few supervised classification and feature selection methods and limited ways to optimize them. The present work updates prior benchmarks by increasing the number of classifiers and feature selection methods by an order of magnitude, including recently developed, state-of-the-art methods. Specifically, this study used 229 text categorization data sets/tasks, and evaluated 28 classification methods (both well-established and proprietary/commercial) and 19 feature selection methods according to 4 classification performance metrics. We report several key findings that will be helpful in establishing best methodological practices for text categorization.
Theme
Automatic classification

Similar documents (content)

  1. Seki, K.; Mostafa, J.: Gene ontology annotation as text categorization : an empirical study (2008) 0.29
    0.29098937 = sum of:
      0.29098937 = product of:
        0.9093418 = sum of:
          0.04218902 = weight(abstract_txt:performing in 3123) [ClassicSimilarity], result of:
            0.04218902 = score(doc=3123,freq=1.0), product of:
              0.10037934 = queryWeight, product of:
                1.0427247 = boost
                6.724734 = idf(docFreq=144, maxDocs=44421)
                0.014315271 = queryNorm
              0.42029586 = fieldWeight in 3123, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.724734 = idf(docFreq=144, maxDocs=44421)
                0.0625 = fieldNorm(doc=3123)
          0.027356602 = weight(abstract_txt:performance in 3123) [ClassicSimilarity], result of:
            0.027356602 = score(doc=3123,freq=1.0), product of:
              0.09474642 = queryWeight, product of:
                1.4326625 = boost
                4.619759 = idf(docFreq=1189, maxDocs=44421)
                0.014315271 = queryNorm
              0.28873494 = fieldWeight in 3123, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.619759 = idf(docFreq=1189, maxDocs=44421)
                0.0625 = fieldNorm(doc=3123)
          0.06341957 = weight(abstract_txt:text in 3123) [ClassicSimilarity], result of:
            0.06341957 = score(doc=3123,freq=3.0), product of:
              0.14497946 = queryWeight, product of:
                2.5062866 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.014315271 = queryNorm
              0.4374383 = fieldWeight in 3123, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=3123)
          0.1732993 = weight(abstract_txt:supervised in 3123) [ClassicSimilarity], result of:
            0.1732993 = score(doc=3123,freq=1.0), product of:
              0.3713211 = queryWeight, product of:
                3.4736252 = boost
                7.467361 = idf(docFreq=68, maxDocs=44421)
                0.014315271 = queryNorm
              0.46671006 = fieldWeight in 3123, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.467361 = idf(docFreq=68, maxDocs=44421)
                0.0625 = fieldNorm(doc=3123)
          0.22454457 = weight(abstract_txt:categorization in 3123) [ClassicSimilarity], result of:
            0.22454457 = score(doc=3123,freq=2.0), product of:
              0.38552845 = queryWeight, product of:
                4.08701 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.014315271 = queryNorm
              0.5824332 = fieldWeight in 3123, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.0625 = fieldNorm(doc=3123)
          0.10773497 = weight(abstract_txt:selection in 3123) [ClassicSimilarity], result of:
            0.10773497 = score(doc=3123,freq=1.0), product of:
              0.32068047 = queryWeight, product of:
                4.167434 = boost
                5.375318 = idf(docFreq=558, maxDocs=44421)
                0.014315271 = queryNorm
              0.33595738 = fieldWeight in 3123, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.375318 = idf(docFreq=558, maxDocs=44421)
                0.0625 = fieldNorm(doc=3123)
          0.06908211 = weight(abstract_txt:methods in 3123) [ClassicSimilarity], result of:
            0.06908211 = score(doc=3123,freq=1.0), product of:
              0.26676023 = queryWeight, product of:
                4.497354 = boost
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.014315271 = queryNorm
              0.25896704 = fieldWeight in 3123, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.0625 = fieldNorm(doc=3123)
          0.20171559 = weight(abstract_txt:feature in 3123) [ClassicSimilarity], result of:
            0.20171559 = score(doc=3123,freq=2.0), product of:
              0.3866497 = queryWeight, product of:
                4.576056 = boost
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.014315271 = queryNorm
              0.52170116 = fieldWeight in 3123, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.0625 = fieldNorm(doc=3123)
        0.32 = coord(8/25)
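
The score breakdown above for doc 3123 can be reproduced by hand. The following is a minimal sketch, assuming the usual ClassicSimilarity definitions that the numbers above are consistent with: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost * idf * queryNorm, and fieldWeight = tf * idf * fieldNorm. The function names (`idf`, `term_score`, `document_score`) and the `TERMS_DOC_3123` table are hypothetical helpers introduced for illustration; queryNorm, fieldNorm, boosts, and frequencies are copied from the listing rather than derived.

```python
import math

# Constants copied from the explain output for doc 3123 (not derived here).
MAX_DOCS = 44421
QUERY_NORM = 0.014315271
FIELD_NORM = 0.0625


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost,
               query_norm=QUERY_NORM, field_norm=FIELD_NORM):
    """Score of one matching term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


# (term, freq in doc 3123, docFreq, boost), as listed in the breakdown.
TERMS_DOC_3123 = [
    ("performing", 1.0, 144, 1.0427247),
    ("performance", 1.0, 1189, 1.4326625),
    ("text", 3.0, 2122, 2.5062866),
    ("supervised", 1.0, 68, 3.4736252),
    ("categorization", 2.0, 165, 4.08701),
    ("selection", 1.0, 558, 4.167434),
    ("methods", 1.0, 1915, 4.497354),
    ("feature", 2.0, 329, 4.576056),
]


def document_score(terms, matched, total_clauses):
    """Sum of per-term scores, scaled by coord = matched / total_clauses."""
    total = sum(term_score(freq, doc_freq, boost)
                for _, freq, doc_freq, boost in terms)
    return total * (matched / total_clauses)


if __name__ == "__main__":
    # coord(8/25): 8 of the 25 query clauses matched this document.
    print(round(document_score(TERMS_DOC_3123, 8, 25), 4))
```

Under these assumptions the result should land close to the 0.29098937 reported for this document. Small differences in the last digits are expected, since the listing appears to use single-precision arithmetic.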
    
  2. Wang, H.; Hong, M.: Supervised Hebb rule based feature selection for text classification (2019) 0.26
    0.25703332 = sum of:
      0.25703332 = product of:
        0.91797614 = sum of:
          0.038688075 = weight(abstract_txt:performance in 36) [ClassicSimilarity], result of:
            0.038688075 = score(doc=36,freq=2.0), product of:
              0.09474642 = queryWeight, product of:
                1.4326625 = boost
                4.619759 = idf(docFreq=1189, maxDocs=44421)
                0.014315271 = queryNorm
              0.40833285 = fieldWeight in 36, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.619759 = idf(docFreq=1189, maxDocs=44421)
                0.0625 = fieldNorm(doc=36)
          0.06341957 = weight(abstract_txt:text in 36) [ClassicSimilarity], result of:
            0.06341957 = score(doc=36,freq=3.0), product of:
              0.14497946 = queryWeight, product of:
                2.5062866 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.014315271 = queryNorm
              0.4374383 = fieldWeight in 36, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=36)
          0.044133436 = weight(abstract_txt:classification in 36) [ClassicSimilarity], result of:
            0.044133436 = score(doc=36,freq=1.0), product of:
              0.17688046 = queryWeight, product of:
                3.095084 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.014315271 = queryNorm
              0.24950996 = fieldWeight in 36, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.0625 = fieldNorm(doc=36)
          0.1732993 = weight(abstract_txt:supervised in 36) [ClassicSimilarity], result of:
            0.1732993 = score(doc=36,freq=1.0), product of:
              0.3713211 = queryWeight, product of:
                3.4736252 = boost
                7.467361 = idf(docFreq=68, maxDocs=44421)
                0.014315271 = queryNorm
              0.46671006 = fieldWeight in 36, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.467361 = idf(docFreq=68, maxDocs=44421)
                0.0625 = fieldNorm(doc=36)
          0.21546994 = weight(abstract_txt:selection in 36) [ClassicSimilarity], result of:
            0.21546994 = score(doc=36,freq=4.0), product of:
              0.32068047 = queryWeight, product of:
                4.167434 = boost
                5.375318 = idf(docFreq=558, maxDocs=44421)
                0.014315271 = queryNorm
              0.67191476 = fieldWeight in 36, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.375318 = idf(docFreq=558, maxDocs=44421)
                0.0625 = fieldNorm(doc=36)
          0.09769685 = weight(abstract_txt:methods in 36) [ClassicSimilarity], result of:
            0.09769685 = score(doc=36,freq=2.0), product of:
              0.26676023 = queryWeight, product of:
                4.497354 = boost
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.014315271 = queryNorm
              0.3662347 = fieldWeight in 36, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.0625 = fieldNorm(doc=36)
          0.28526893 = weight(abstract_txt:feature in 36) [ClassicSimilarity], result of:
            0.28526893 = score(doc=36,freq=4.0), product of:
              0.3866497 = queryWeight, product of:
                4.576056 = boost
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.014315271 = queryNorm
              0.73779684 = fieldWeight in 36, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.0625 = fieldNorm(doc=36)
        0.28 = coord(7/25)
    
  3. Mengle, S.S.R.; Goharian, N.: Ambiguity measure feature-selection algorithm (2009) 0.22
    0.22204888 = sum of:
      0.22204888 = product of:
        0.9252037 = sum of:
          0.07540654 = weight(abstract_txt:benchmark in 3804) [ClassicSimilarity], result of:
            0.07540654 = score(doc=3804,freq=2.0), product of:
              0.11733854 = queryWeight, product of:
                1.1273736 = boost
                7.270651 = idf(docFreq=83, maxDocs=44421)
                0.014315271 = queryNorm
              0.6426408 = fieldWeight in 3804, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.270651 = idf(docFreq=83, maxDocs=44421)
                0.0625 = fieldNorm(doc=3804)
          0.08968882 = weight(abstract_txt:text in 3804) [ClassicSimilarity], result of:
            0.08968882 = score(doc=3804,freq=6.0), product of:
              0.14497946 = queryWeight, product of:
                2.5062866 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.014315271 = queryNorm
              0.61863124 = fieldWeight in 3804, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=3804)
          0.044133436 = weight(abstract_txt:classification in 3804) [ClassicSimilarity], result of:
            0.044133436 = score(doc=3804,freq=1.0), product of:
              0.17688046 = queryWeight, product of:
                3.095084 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.014315271 = queryNorm
              0.24950996 = fieldWeight in 3804, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.0625 = fieldNorm(doc=3804)
          0.24090272 = weight(abstract_txt:selection in 3804) [ClassicSimilarity], result of:
            0.24090272 = score(doc=3804,freq=5.0), product of:
              0.32068047 = queryWeight, product of:
                4.167434 = boost
                5.375318 = idf(docFreq=558, maxDocs=44421)
                0.014315271 = queryNorm
              0.75122356 = fieldWeight in 3804, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.375318 = idf(docFreq=558, maxDocs=44421)
                0.0625 = fieldNorm(doc=3804)
          0.09769685 = weight(abstract_txt:methods in 3804) [ClassicSimilarity], result of:
            0.09769685 = score(doc=3804,freq=2.0), product of:
              0.26676023 = queryWeight, product of:
                4.497354 = boost
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.014315271 = queryNorm
              0.3662347 = fieldWeight in 3804, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.0625 = fieldNorm(doc=3804)
          0.3773753 = weight(abstract_txt:feature in 3804) [ClassicSimilarity], result of:
            0.3773753 = score(doc=3804,freq=7.0), product of:
              0.3866497 = queryWeight, product of:
                4.576056 = boost
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.014315271 = queryNorm
              0.9760135 = fieldWeight in 3804, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.0625 = fieldNorm(doc=3804)
        0.24 = coord(6/25)
    
  4. Goller, C.; Löning, J.; Will, T.; Wolff, W.: Automatic document classification : a thorough evaluation of various methods (2000) 0.21
    0.20781 = sum of:
      0.20781 = product of:
        0.865875 = sum of:
          0.10458728 = weight(abstract_txt:classifiers in 6480) [ClassicSimilarity], result of:
            0.10458728 = score(doc=6480,freq=2.0), product of:
              0.12576137 = queryWeight, product of:
                1.1671351 = boost
                7.5270805 = idf(docFreq=64, maxDocs=44421)
                0.014315271 = queryNorm
              0.83163273 = fieldWeight in 6480, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.5270805 = idf(docFreq=64, maxDocs=44421)
                0.078125 = fieldNorm(doc=6480)
          0.045769133 = weight(abstract_txt:text in 6480) [ClassicSimilarity], result of:
            0.045769133 = score(doc=6480,freq=1.0), product of:
              0.14497946 = queryWeight, product of:
                2.5062866 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.014315271 = queryNorm
              0.3156939 = fieldWeight in 6480, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=6480)
          0.12335671 = weight(abstract_txt:classification in 6480) [ClassicSimilarity], result of:
            0.12335671 = score(doc=6480,freq=5.0), product of:
              0.17688046 = queryWeight, product of:
                3.095084 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.014315271 = queryNorm
              0.6974015 = fieldWeight in 6480, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.078125 = fieldNorm(doc=6480)
          0.19045033 = weight(abstract_txt:selection in 6480) [ClassicSimilarity], result of:
            0.19045033 = score(doc=6480,freq=2.0), product of:
              0.32068047 = queryWeight, product of:
                4.167434 = boost
                5.375318 = idf(docFreq=558, maxDocs=44421)
                0.014315271 = queryNorm
              0.59389436 = fieldWeight in 6480, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.375318 = idf(docFreq=558, maxDocs=44421)
                0.078125 = fieldNorm(doc=6480)
          0.14956716 = weight(abstract_txt:methods in 6480) [ClassicSimilarity], result of:
            0.14956716 = score(doc=6480,freq=3.0), product of:
              0.26676023 = queryWeight, product of:
                4.497354 = boost
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.014315271 = queryNorm
              0.5606801 = fieldWeight in 6480, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.1434727 = idf(docFreq=1915, maxDocs=44421)
                0.078125 = fieldNorm(doc=6480)
          0.2521445 = weight(abstract_txt:feature in 6480) [ClassicSimilarity], result of:
            0.2521445 = score(doc=6480,freq=2.0), product of:
              0.3866497 = queryWeight, product of:
                4.576056 = boost
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.014315271 = queryNorm
              0.65212643 = fieldWeight in 6480, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.078125 = fieldNorm(doc=6480)
        0.24 = coord(6/25)
    
  5. Maghsoodi, N.; Homayounpour, M.M.: Improving Farsi multiclass text classification using a thesaurus and two-stage feature selection (2011) 0.19
    0.1930309 = sum of:
      0.1930309 = product of:
        0.8042954 = sum of:
          0.040946223 = weight(abstract_txt:adding in 775) [ClassicSimilarity], result of:
            0.040946223 = score(doc=775,freq=1.0), product of:
              0.09839822 = queryWeight, product of:
                1.0323837 = boost
                6.6580424 = idf(docFreq=154, maxDocs=44421)
                0.014315271 = queryNorm
              0.41612765 = fieldWeight in 775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.6580424 = idf(docFreq=154, maxDocs=44421)
                0.0625 = fieldNorm(doc=775)
          0.03661531 = weight(abstract_txt:text in 775) [ClassicSimilarity], result of:
            0.03661531 = score(doc=775,freq=1.0), product of:
              0.14497946 = queryWeight, product of:
                2.5062866 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.014315271 = queryNorm
              0.25255513 = fieldWeight in 775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=775)
          0.062414106 = weight(abstract_txt:classification in 775) [ClassicSimilarity], result of:
            0.062414106 = score(doc=775,freq=2.0), product of:
              0.17688046 = queryWeight, product of:
                3.095084 = boost
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.014315271 = queryNorm
              0.35286036 = fieldWeight in 775, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.9921594 = idf(docFreq=2228, maxDocs=44421)
                0.0625 = fieldNorm(doc=775)
          0.158777 = weight(abstract_txt:categorization in 775) [ClassicSimilarity], result of:
            0.158777 = score(doc=775,freq=1.0), product of:
              0.38552845 = queryWeight, product of:
                4.08701 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.014315271 = queryNorm
              0.4118425 = fieldWeight in 775, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.0625 = fieldNorm(doc=775)
          0.18660244 = weight(abstract_txt:selection in 775) [ClassicSimilarity], result of:
            0.18660244 = score(doc=775,freq=3.0), product of:
              0.32068047 = queryWeight, product of:
                4.167434 = boost
                5.375318 = idf(docFreq=558, maxDocs=44421)
                0.014315271 = queryNorm
              0.58189523 = fieldWeight in 775, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.375318 = idf(docFreq=558, maxDocs=44421)
                0.0625 = fieldNorm(doc=775)
          0.31894037 = weight(abstract_txt:feature in 775) [ClassicSimilarity], result of:
            0.31894037 = score(doc=775,freq=5.0), product of:
              0.3866497 = queryWeight, product of:
                4.576056 = boost
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.014315271 = queryNorm
              0.824882 = fieldWeight in 775, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.9023747 = idf(docFreq=329, maxDocs=44421)
                0.0625 = fieldNorm(doc=775)
        0.24 = coord(6/25)