Document (#34079)

Author
Liu, R.-L.
Title
Interactive high-quality text classification
Source
Information processing and management. 44(2008) no.3, S.1062-1075
Year
2008
Abstract
Automatic text classification (TC) is essential for information sharing and management. Its ideal goals are to achieve high-quality TC: (1) accepting almost all documents that should be accepted (i.e., high recall) and (2) rejecting almost all documents that should be rejected (i.e., high precision). Unfortunately, these ideal goals are rarely achieved, making automatic TC unsuitable for applications in which a classifier's erroneous decision may incur high cost and/or serious problems. One way to pursue the ideal is to consult users to confirm the classifier's decisions so that potential errors may be corrected. However, the main challenge lies in controlling the number of confirmations, which may impose a heavy cognitive load on the users. We thus develop an intelligent and classifier-independent confirmation strategy, ICCOM. Empirical evaluation shows that ICCOM may help various kinds of classifiers to achieve very high precision and recall by conducting fewer confirmations. The contributions are significant to the archiving and recommendation of critical information, since identification of possible TC errors (those that require confirmation) is the key to processing information more properly.

Similar documents (content)

  1. Taghva, K.; Borsack, J.; Condit, A.: Effects of OCR errors on ranking and feedback using the vector space model (1996) 0.17
    0.16608104 = sum of:
      0.16608104 = product of:
        0.6920043 = sum of:
          0.036804873 = weight(abstract_txt:text in 5019) [ClassicSimilarity], result of:
            0.036804873 = score(doc=5019,freq=1.0), product of:
              0.08327432 = queryWeight, product of:
                1.0086334 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.020431563 = queryNorm
              0.44197148 = fieldWeight in 5019, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.109375 = fieldNorm(doc=5019)
          0.039103765 = weight(abstract_txt:documents in 5019) [ClassicSimilarity], result of:
            0.039103765 = score(doc=5019,freq=1.0), product of:
              0.08670682 = queryWeight, product of:
                1.0292109 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.020431563 = queryNorm
              0.45098835 = fieldWeight in 5019, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.109375 = fieldNorm(doc=5019)
          0.19451866 = weight(abstract_txt:corrected in 5019) [ClassicSimilarity], result of:
            0.19451866 = score(doc=5019,freq=1.0), product of:
              0.20054187 = queryWeight, product of:
                1.1067902 = boost
                8.868255 = idf(docFreq=16, maxDocs=44421)
                0.020431563 = queryNorm
              0.96996534 = fieldWeight in 5019, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.868255 = idf(docFreq=16, maxDocs=44421)
                0.109375 = fieldNorm(doc=5019)
          0.09399947 = weight(abstract_txt:precision in 5019) [ClassicSimilarity], result of:
            0.09399947 = score(doc=5019,freq=1.0), product of:
              0.15559338 = queryWeight, product of:
                1.3787113 = boost
                5.5235233 = idf(docFreq=481, maxDocs=44421)
                0.020431563 = queryNorm
              0.6041354 = fieldWeight in 5019, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5235233 = idf(docFreq=481, maxDocs=44421)
                0.109375 = fieldNorm(doc=5019)
          0.10608824 = weight(abstract_txt:recall in 5019) [ClassicSimilarity], result of:
            0.10608824 = score(doc=5019,freq=1.0), product of:
              0.1686627 = queryWeight, product of:
                1.4354475 = boost
                5.750825 = idf(docFreq=383, maxDocs=44421)
                0.020431563 = queryNorm
              0.6289965 = fieldWeight in 5019, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.750825 = idf(docFreq=383, maxDocs=44421)
                0.109375 = fieldNorm(doc=5019)
          0.22148934 = weight(abstract_txt:errors in 5019) [ClassicSimilarity], result of:
            0.22148934 = score(doc=5019,freq=2.0), product of:
              0.21867515 = queryWeight, product of:
                1.634472 = boost
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.020431563 = queryNorm
              1.0128692 = fieldWeight in 5019, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.109375 = fieldNorm(doc=5019)
        0.24 = coord(6/25)
    
  2. Ringlstetter, C.; Stubbe, A.: Practical aspects of automatic genre classification (2008) 0.15
    0.15388079 = sum of:
      0.15388079 = product of:
        0.42744663 = sum of:
          0.02974283 = weight(abstract_txt:text in 2954) [ClassicSimilarity], result of:
            0.02974283 = score(doc=2954,freq=2.0), product of:
              0.08327432 = queryWeight, product of:
                1.0086334 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.020431563 = queryNorm
              0.3571669 = fieldWeight in 2954, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=2954)
          0.054733872 = weight(abstract_txt:documents in 2954) [ClassicSimilarity], result of:
            0.054733872 = score(doc=2954,freq=6.0), product of:
              0.08670682 = queryWeight, product of:
                1.0292109 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.020431563 = queryNorm
              0.6312522 = fieldWeight in 2954, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=2954)
          0.097273484 = weight(abstract_txt:rejected in 2954) [ClassicSimilarity], result of:
            0.097273484 = score(doc=2954,freq=1.0), product of:
              0.1834788 = queryWeight, product of:
                1.0586581 = boost
                8.482592 = idf(docFreq=24, maxDocs=44421)
                0.020431563 = queryNorm
              0.530162 = fieldWeight in 2954, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.482592 = idf(docFreq=24, maxDocs=44421)
                0.0625 = fieldNorm(doc=2954)
          0.025498455 = weight(abstract_txt:those in 2954) [ClassicSimilarity], result of:
            0.025498455 = score(doc=2954,freq=1.0), product of:
              0.09468376 = queryWeight, product of:
                1.0755126 = boost
                4.3088202 = idf(docFreq=1623, maxDocs=44421)
                0.020431563 = queryNorm
              0.26930127 = fieldWeight in 2954, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.3088202 = idf(docFreq=1623, maxDocs=44421)
                0.0625 = fieldNorm(doc=2954)
          0.027732529 = weight(abstract_txt:should in 2954) [ClassicSimilarity], result of:
            0.027732529 = score(doc=2954,freq=1.0), product of:
              0.10013653 = queryWeight, product of:
                1.1060482 = boost
                4.4311547 = idf(docFreq=1436, maxDocs=44421)
                0.020431563 = queryNorm
              0.27694717 = fieldWeight in 2954, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4311547 = idf(docFreq=1436, maxDocs=44421)
                0.0625 = fieldNorm(doc=2954)
          0.06322398 = weight(abstract_txt:automatic in 2954) [ClassicSimilarity], result of:
            0.06322398 = score(doc=2954,freq=2.0), product of:
              0.13767153 = queryWeight, product of:
                1.2968801 = boost
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.020431563 = queryNorm
              0.45923787 = fieldWeight in 2954, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.1956835 = idf(docFreq=668, maxDocs=44421)
                0.0625 = fieldNorm(doc=2954)
          0.053713977 = weight(abstract_txt:precision in 2954) [ClassicSimilarity], result of:
            0.053713977 = score(doc=2954,freq=1.0), product of:
              0.15559338 = queryWeight, product of:
                1.3787113 = boost
                5.5235233 = idf(docFreq=481, maxDocs=44421)
                0.020431563 = queryNorm
              0.3452202 = fieldWeight in 2954, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5235233 = idf(docFreq=481, maxDocs=44421)
                0.0625 = fieldNorm(doc=2954)
          0.060621854 = weight(abstract_txt:recall in 2954) [ClassicSimilarity], result of:
            0.060621854 = score(doc=2954,freq=1.0), product of:
              0.1686627 = queryWeight, product of:
                1.4354475 = boost
                5.750825 = idf(docFreq=383, maxDocs=44421)
                0.020431563 = queryNorm
              0.35942656 = fieldWeight in 2954, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.750825 = idf(docFreq=383, maxDocs=44421)
                0.0625 = fieldNorm(doc=2954)
          0.01490567 = weight(abstract_txt:that in 2954) [ClassicSimilarity], result of:
            0.01490567 = score(doc=2954,freq=2.0), product of:
              0.07130783 = queryWeight, product of:
                1.4757622 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.020431563 = queryNorm
              0.20903271 = fieldWeight in 2954, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=2954)
        0.36 = coord(9/25)
    
  3. Tseng, Y.-H.: Solving vocabulary problems with interactive query expansion (1998) 0.14
    0.13747312 = sum of:
      0.13747312 = product of:
        0.49097544 = sum of:
          0.02974283 = weight(abstract_txt:text in 6159) [ClassicSimilarity], result of:
            0.02974283 = score(doc=6159,freq=2.0), product of:
              0.08327432 = queryWeight, product of:
                1.0086334 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.020431563 = queryNorm
              0.3571669 = fieldWeight in 6159, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=6159)
          0.022345008 = weight(abstract_txt:documents in 6159) [ClassicSimilarity], result of:
            0.022345008 = score(doc=6159,freq=1.0), product of:
              0.08670682 = queryWeight, product of:
                1.0292109 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.020431563 = queryNorm
              0.25770763 = fieldWeight in 6159, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.0625 = fieldNorm(doc=6159)
          0.107427955 = weight(abstract_txt:precision in 6159) [ClassicSimilarity], result of:
            0.107427955 = score(doc=6159,freq=4.0), product of:
              0.15559338 = queryWeight, product of:
                1.3787113 = boost
                5.5235233 = idf(docFreq=481, maxDocs=44421)
                0.020431563 = queryNorm
              0.6904404 = fieldWeight in 6159, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.5235233 = idf(docFreq=481, maxDocs=44421)
                0.0625 = fieldNorm(doc=6159)
          0.13555458 = weight(abstract_txt:recall in 6159) [ClassicSimilarity], result of:
            0.13555458 = score(doc=6159,freq=5.0), product of:
              0.1686627 = queryWeight, product of:
                1.4354475 = boost
                5.750825 = idf(docFreq=383, maxDocs=44421)
                0.020431563 = queryNorm
              0.80370224 = fieldWeight in 6159, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.750825 = idf(docFreq=383, maxDocs=44421)
                0.0625 = fieldNorm(doc=6159)
          0.01825564 = weight(abstract_txt:that in 6159) [ClassicSimilarity], result of:
            0.01825564 = score(doc=6159,freq=3.0), product of:
              0.07130783 = queryWeight, product of:
                1.4757622 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.020431563 = queryNorm
              0.25601172 = fieldWeight in 6159, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=6159)
          0.06854023 = weight(abstract_txt:achieve in 6159) [ClassicSimilarity], result of:
            0.06854023 = score(doc=6159,freq=1.0), product of:
              0.18304728 = queryWeight, product of:
                1.4954071 = boost
                5.9910407 = idf(docFreq=301, maxDocs=44421)
                0.020431563 = queryNorm
              0.37444004 = fieldWeight in 6159, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.9910407 = idf(docFreq=301, maxDocs=44421)
                0.0625 = fieldNorm(doc=6159)
          0.10910915 = weight(abstract_txt:high in 6159) [ClassicSimilarity], result of:
            0.10910915 = score(doc=6159,freq=1.0), product of:
              0.35992673 = queryWeight, product of:
                3.6319969 = boost
                4.8502827 = idf(docFreq=944, maxDocs=44421)
                0.020431563 = queryNorm
              0.30314267 = fieldWeight in 6159, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8502827 = idf(docFreq=944, maxDocs=44421)
                0.0625 = fieldNorm(doc=6159)
        0.28 = coord(7/25)
    
  4. Toepfer, M.; Seifert, C.: Content-based quality estimation for automatic subject indexing of short texts under precision and recall constraints 0.12
    0.122129865 = sum of:
      0.122129865 = product of:
        0.4361781 = sum of:
          0.026289197 = weight(abstract_txt:text in 309) [ClassicSimilarity], result of:
            0.026289197 = score(doc=309,freq=1.0), product of:
              0.08327432 = queryWeight, product of:
                1.0086334 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.020431563 = queryNorm
              0.3156939 = fieldWeight in 309, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=309)
          0.02793126 = weight(abstract_txt:documents in 309) [ClassicSimilarity], result of:
            0.02793126 = score(doc=309,freq=1.0), product of:
              0.08670682 = queryWeight, product of:
                1.0292109 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.020431563 = queryNorm
              0.32213452 = fieldWeight in 309, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.078125 = fieldNorm(doc=309)
          0.07983185 = weight(abstract_txt:quality in 309) [ClassicSimilarity], result of:
            0.07983185 = score(doc=309,freq=4.0), product of:
              0.11000786 = queryWeight, product of:
                1.1592834 = boost
                4.6444306 = idf(docFreq=1160, maxDocs=44421)
                0.020431563 = queryNorm
              0.7256923 = fieldWeight in 309, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.6444306 = idf(docFreq=1160, maxDocs=44421)
                0.078125 = fieldNorm(doc=309)
          0.06714247 = weight(abstract_txt:precision in 309) [ClassicSimilarity], result of:
            0.06714247 = score(doc=309,freq=1.0), product of:
              0.15559338 = queryWeight, product of:
                1.3787113 = boost
                5.5235233 = idf(docFreq=481, maxDocs=44421)
                0.020431563 = queryNorm
              0.43152526 = fieldWeight in 309, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5235233 = idf(docFreq=481, maxDocs=44421)
                0.078125 = fieldNorm(doc=309)
          0.075777315 = weight(abstract_txt:recall in 309) [ClassicSimilarity], result of:
            0.075777315 = score(doc=309,freq=1.0), product of:
              0.1686627 = queryWeight, product of:
                1.4354475 = boost
                5.750825 = idf(docFreq=383, maxDocs=44421)
                0.020431563 = queryNorm
              0.44928318 = fieldWeight in 309, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.750825 = idf(docFreq=383, maxDocs=44421)
                0.078125 = fieldNorm(doc=309)
          0.02281955 = weight(abstract_txt:that in 309) [ClassicSimilarity], result of:
            0.02281955 = score(doc=309,freq=3.0), product of:
              0.07130783 = queryWeight, product of:
                1.4757622 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.020431563 = queryNorm
              0.32001466 = fieldWeight in 309, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.078125 = fieldNorm(doc=309)
          0.13638644 = weight(abstract_txt:high in 309) [ClassicSimilarity], result of:
            0.13638644 = score(doc=309,freq=1.0), product of:
              0.35992673 = queryWeight, product of:
                3.6319969 = boost
                4.8502827 = idf(docFreq=944, maxDocs=44421)
                0.020431563 = queryNorm
              0.37892833 = fieldWeight in 309, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8502827 = idf(docFreq=944, maxDocs=44421)
                0.078125 = fieldNorm(doc=309)
        0.28 = coord(7/25)
    
  5. Taghva, K.; Borsack, J.; Condit, A.: Evaluation of model-based retrieval effectiveness with OCR text (1996) 0.12
    0.12191701 = sum of:
      0.12191701 = product of:
        0.50798756 = sum of:
          0.054641068 = weight(abstract_txt:text in 4553) [ClassicSimilarity], result of:
            0.054641068 = score(doc=4553,freq=3.0), product of:
              0.08327432 = queryWeight, product of:
                1.0086334 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.020431563 = queryNorm
              0.6561575 = fieldWeight in 4553, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.09375 = fieldNorm(doc=4553)
          0.033517513 = weight(abstract_txt:documents in 4553) [ClassicSimilarity], result of:
            0.033517513 = score(doc=4553,freq=1.0), product of:
              0.08670682 = queryWeight, product of:
                1.0292109 = boost
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.020431563 = queryNorm
              0.38656145 = fieldWeight in 4553, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.123322 = idf(docFreq=1954, maxDocs=44421)
                0.09375 = fieldNorm(doc=4553)
          0.080570966 = weight(abstract_txt:precision in 4553) [ClassicSimilarity], result of:
            0.080570966 = score(doc=4553,freq=1.0), product of:
              0.15559338 = queryWeight, product of:
                1.3787113 = boost
                5.5235233 = idf(docFreq=481, maxDocs=44421)
                0.020431563 = queryNorm
              0.5178303 = fieldWeight in 4553, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5235233 = idf(docFreq=481, maxDocs=44421)
                0.09375 = fieldNorm(doc=4553)
          0.09093279 = weight(abstract_txt:recall in 4553) [ClassicSimilarity], result of:
            0.09093279 = score(doc=4553,freq=1.0), product of:
              0.1686627 = queryWeight, product of:
                1.4354475 = boost
                5.750825 = idf(docFreq=383, maxDocs=44421)
                0.020431563 = queryNorm
              0.53913987 = fieldWeight in 4553, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.750825 = idf(docFreq=383, maxDocs=44421)
                0.09375 = fieldNorm(doc=4553)
          0.01580985 = weight(abstract_txt:that in 4553) [ClassicSimilarity], result of:
            0.01580985 = score(doc=4553,freq=1.0), product of:
              0.07130783 = queryWeight, product of:
                1.4757622 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.020431563 = queryNorm
              0.22171268 = fieldWeight in 4553, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.09375 = fieldNorm(doc=4553)
          0.23251536 = weight(abstract_txt:errors in 4553) [ClassicSimilarity], result of:
            0.23251536 = score(doc=4553,freq=3.0), product of:
              0.21867515 = queryWeight, product of:
                1.634472 = boost
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.020431563 = queryNorm
              1.0632912 = fieldWeight in 4553, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.548176 = idf(docFreq=172, maxDocs=44421)
                0.09375 = fieldNorm(doc=4553)
        0.24 = coord(6/25)
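
The score trees above follow the pattern of Lucene's ClassicSimilarity (TF-IDF) explanation output: each matching term contributes queryWeight * fieldWeight, the contributions are summed, and the sum is scaled by the coord factor (matched terms / query terms). The following Python sketch recomputes the score of the first similar document (doc 5019) from the numbers in its listing. It assumes the usual ClassicSimilarity definitions, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which agree with the values shown; the function and variable names are illustrative and not part of any library.

```python
import math

def idf(doc_freq, max_docs):
    # Assumed ClassicSimilarity form: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))

def term_score(freq, doc_freq, max_docs, boost, query_norm, field_norm):
    # One "weight(abstract_txt:term in doc)" node of the tree.
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm            # queryWeight
    field_weight = math.sqrt(freq) * term_idf * field_norm  # fieldWeight
    return query_weight * field_weight

# Constants copied from the listing for doc 5019.
MAX_DOCS = 44421
QUERY_NORM = 0.020431563
FIELD_NORM = 0.109375
QUERY_TERMS = 25  # denominator of coord(6/25)

# (term, termFreq, docFreq, boost) as shown in the listing.
matched = [
    ("text",      1.0, 2122, 1.0086334),
    ("documents", 1.0, 1954, 1.0292109),
    ("corrected", 1.0,   16, 1.1067902),
    ("precision", 1.0,  481, 1.3787113),
    ("recall",    1.0,  383, 1.4354475),
    ("errors",    2.0,  172, 1.634472),
]

total = sum(
    term_score(freq, df, MAX_DOCS, boost, QUERY_NORM, FIELD_NORM)
    for _, freq, df, boost in matched
)
coord = len(matched) / QUERY_TERMS
score = total * coord

# Compare with 0.6920043 (sum) and 0.16608104 (final score) in the listing;
# small differences are expected because Lucene computes in 32-bit floats.
print(f"sum of term scores: {total:.7f}")
print(f"coord:              {coord:.2f}")
print(f"final score:        {score:.7f}")
```

The same function can be applied to the other four documents by substituting their fieldNorm, term frequencies, and coord values. Rare terms such as "corrected" (docFreq=16) dominate the sum because idf enters the product twice, once in queryWeight and once in fieldWeight.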