Document (#28457)

Author
Debole, F.
Sebastiani, F.
Title
An analysis of the relative hardness of Reuters-21578 subsets
Source
Journal of the American Society for Information Science and Technology. 56(2005) no.6, pp. 584-596
Year
2005
Abstract
The existence, public availability, and widespread acceptance of a standard benchmark for a given information retrieval (IR) task are beneficial to research on this task, because they allow different researchers to experimentally compare their own systems by comparing the results they have obtained on this benchmark. The Reuters-21578 test collection, together with its earlier variants, has been such a standard benchmark for the text categorization (TC) task throughout the last 10 years. However, the benefits that this has brought about have somehow been limited by the fact that different researchers have "carved" different subsets out of this collection and tested their systems on one of these subsets only; systems that have been tested on different Reuters-21578 subsets are thus not readily comparable. In this article, we present a systematic, comparative experimental study of the three subsets of Reuters-21578 that have been most popular among TC researchers. The results we obtain allow us to determine the relative hardness of these subsets, thus establishing an indirect means for comparing TC systems that have been, or will be, tested on these different subsets.
Theme
Retrieval studies

Similar documents (author)

  1. Sebastiani, F.: On the role of logic in information retrieval (1998) 5.94
    5.9401517 = sum of:
      5.9401517 = weight(author_txt:sebastiani in 2140) [ClassicSimilarity], result of:
        5.9401517 = fieldWeight in 2140, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.504243 = idf(docFreq=8, maxDocs=44421)
          0.625 = fieldNorm(doc=2140)
    
  2. Sebastiani, F.: Machine learning in automated text categorization (2002) 5.94
    5.9401517 = sum of:
      5.9401517 = weight(author_txt:sebastiani in 4389) [ClassicSimilarity], result of:
        5.9401517 = fieldWeight in 4389, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.504243 = idf(docFreq=8, maxDocs=44421)
          0.625 = fieldNorm(doc=4389)
    
  3. Sebastiani, F.: A tutorial on automated text categorisation (1999) 5.94
    5.9401517 = sum of:
      5.9401517 = weight(author_txt:sebastiani in 4390) [ClassicSimilarity], result of:
        5.9401517 = fieldWeight in 4390, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.504243 = idf(docFreq=8, maxDocs=44421)
          0.625 = fieldNorm(doc=4390)
    
  4. Sebastiani, F.: Classification of text, automatic (2006) 5.94
    5.9401517 = sum of:
      5.9401517 = weight(author_txt:sebastiani in 3) [ClassicSimilarity], result of:
        5.9401517 = fieldWeight in 3, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.504243 = idf(docFreq=8, maxDocs=44421)
          0.625 = fieldNorm(doc=3)
    
  5. Giorgetti, D.; Sebastiani, F.: Automating survey coding by multiclass text categorization techniques (2003) 4.75
    4.7521214 = sum of:
      4.7521214 = weight(author_txt:sebastiani in 172) [ClassicSimilarity], result of:
        4.7521214 = fieldWeight in 172, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.504243 = idf(docFreq=8, maxDocs=44421)
          0.5 = fieldNorm(doc=172)
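
The author-similarity scores above follow the tf, idf and fieldNorm factors printed in each tree. The sketch below reproduces them, assuming the usual ClassicSimilarity formulas (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1))). The function names are illustrative and are not part of Lucene. The fieldNorm values are copied from the listing rather than derived.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, assumed form: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq: float) -> float:
    """Term-frequency factor, assumed form: sqrt(freq)."""
    return math.sqrt(freq)


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as laid out in the explain trees."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values taken from the listing: docFreq=8, maxDocs=44421, termFreq=1.0.
idf = classic_idf(8, 44421)
score_single_author = field_weight(1.0, 8, 44421, 0.625)  # entries 1 to 4
score_two_authors = field_weight(1.0, 8, 44421, 0.5)      # entry 5

print(f"idf={idf:.6f}")
print(f"single-author entry: {score_single_author:.4f}")
print(f"two-author entry:    {score_two_authors:.4f}")
```

Under these assumptions, the only factor that separates entry 5 from entries 1 to 4 is the smaller fieldNorm (0.5 instead of 0.625), which is consistent with its author field holding two names instead of one.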
    

Similar documents (content)

  1. Fagni, T.; Sebastiani, F.: Selecting negative examples for hierarchical text classification: An experimental comparison (2010) 0.24
    0.24200074 = sum of:
      0.24200074 = product of:
        0.60500187 = sum of:
          0.008865369 = weight(abstract_txt:they in 101) [ClassicSimilarity], result of:
            0.008865369 = score(doc=101,freq=1.0), product of:
              0.0378478 = queryWeight, product of:
                3.7477977 = idf(docFreq=2845, maxDocs=44421)
                0.010098677 = queryNorm
              0.23423736 = fieldWeight in 101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7477977 = idf(docFreq=2845, maxDocs=44421)
                0.0625 = fieldNorm(doc=101)
          0.017808942 = weight(abstract_txt:standard in 101) [ClassicSimilarity], result of:
            0.017808942 = score(doc=101,freq=1.0), product of:
              0.06025617 = queryWeight, product of:
                1.2617707 = boost
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.010098677 = queryNorm
              0.29555383 = fieldWeight in 101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.0625 = fieldNorm(doc=101)
          0.008145515 = weight(abstract_txt:these in 101) [ClassicSimilarity], result of:
            0.008145515 = score(doc=101,freq=1.0), product of:
              0.040946696 = queryWeight, product of:
                1.2738982 = boost
                3.1828754 = idf(docFreq=5006, maxDocs=44421)
                0.010098677 = queryNorm
              0.19892971 = fieldWeight in 101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1828754 = idf(docFreq=5006, maxDocs=44421)
                0.0625 = fieldNorm(doc=101)
          0.005568857 = weight(abstract_txt:that in 101) [ClassicSimilarity], result of:
            0.005568857 = score(doc=101,freq=1.0), product of:
              0.037676174 = queryWeight, product of:
                1.5775498 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.010098677 = queryNorm
              0.14780845 = fieldWeight in 101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=101)
          0.011731467 = weight(abstract_txt:this in 101) [ClassicSimilarity], result of:
            0.011731467 = score(doc=101,freq=4.0), product of:
              0.039003566 = queryWeight, product of:
                1.6050991 = boost
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.010098677 = queryNorm
              0.30077934 = fieldWeight in 101, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.0625 = fieldNorm(doc=101)
          0.0277928 = weight(abstract_txt:researchers in 101) [ClassicSimilarity], result of:
            0.0277928 = score(doc=101,freq=1.0), product of:
              0.09280287 = queryWeight, product of:
                1.917812 = boost
                4.791714 = idf(docFreq=1001, maxDocs=44421)
                0.010098677 = queryNorm
              0.29948214 = fieldWeight in 101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.791714 = idf(docFreq=1001, maxDocs=44421)
                0.0625 = fieldNorm(doc=101)
          0.027547538 = weight(abstract_txt:been in 101) [ClassicSimilarity], result of:
            0.027547538 = score(doc=101,freq=3.0), product of:
              0.070404574 = queryWeight, product of:
                1.9288352 = boost
                3.614442 = idf(docFreq=3251, maxDocs=44421)
                0.010098677 = queryNorm
              0.39127484 = fieldWeight in 101, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.614442 = idf(docFreq=3251, maxDocs=44421)
                0.0625 = fieldNorm(doc=101)
          0.023399431 = weight(abstract_txt:have in 101) [ClassicSimilarity], result of:
            0.023399431 = score(doc=101,freq=2.0), product of:
              0.08274531 = queryWeight, product of:
                2.5610144 = boost
                3.199388 = idf(docFreq=4924, maxDocs=44421)
                0.010098677 = queryNorm
              0.2827886 = fieldWeight in 101, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.199388 = idf(docFreq=4924, maxDocs=44421)
                0.0625 = fieldNorm(doc=101)
          0.1463631 = weight(abstract_txt:reuters in 101) [ClassicSimilarity], result of:
            0.1463631 = score(doc=101,freq=1.0), product of:
              0.30917698 = queryWeight, product of:
                4.042018 = boost
                7.574333 = idf(docFreq=61, maxDocs=44421)
                0.010098677 = queryNorm
              0.47339582 = fieldWeight in 101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.574333 = idf(docFreq=61, maxDocs=44421)
                0.0625 = fieldNorm(doc=101)
          0.32777885 = weight(abstract_txt:21578 in 101) [ClassicSimilarity], result of:
            0.32777885 = score(doc=101,freq=1.0), product of:
              0.52922463 = queryWeight, product of:
                5.2882833 = boost
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.010098677 = queryNorm
              0.61935675 = fieldWeight in 101, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.0625 = fieldNorm(doc=101)
        0.4 = coord(10/25)
    
  2. Sun, A.; Lim, E.-P.; Ng, W.-K.: Performance measurement framework for hierarchical text classification (2003) 0.15
    0.14917427 = sum of:
      0.14917427 = product of:
        0.53276527 = sum of:
          0.016928501 = weight(abstract_txt:collection in 2808) [ClassicSimilarity], result of:
            0.016928501 = score(doc=2808,freq=1.0), product of:
              0.058253467 = queryWeight, product of:
                1.2406251 = boost
                4.649612 = idf(docFreq=1154, maxDocs=44421)
                0.010098677 = queryNorm
              0.29060075 = fieldWeight in 2808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.649612 = idf(docFreq=1154, maxDocs=44421)
                0.0625 = fieldNorm(doc=2808)
          0.008145515 = weight(abstract_txt:these in 2808) [ClassicSimilarity], result of:
            0.008145515 = score(doc=2808,freq=1.0), product of:
              0.040946696 = queryWeight, product of:
                1.2738982 = boost
                3.1828754 = idf(docFreq=5006, maxDocs=44421)
                0.010098677 = queryNorm
              0.19892971 = fieldWeight in 2808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1828754 = idf(docFreq=5006, maxDocs=44421)
                0.0625 = fieldNorm(doc=2808)
          0.011137714 = weight(abstract_txt:that in 2808) [ClassicSimilarity], result of:
            0.011137714 = score(doc=2808,freq=4.0), product of:
              0.037676174 = queryWeight, product of:
                1.5775498 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.010098677 = queryNorm
              0.2956169 = fieldWeight in 2808, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=2808)
          0.0058657336 = weight(abstract_txt:this in 2808) [ClassicSimilarity], result of:
            0.0058657336 = score(doc=2808,freq=1.0), product of:
              0.039003566 = queryWeight, product of:
                1.6050991 = boost
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.010098677 = queryNorm
              0.15038967 = fieldWeight in 2808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.0625 = fieldNorm(doc=2808)
          0.016545897 = weight(abstract_txt:have in 2808) [ClassicSimilarity], result of:
            0.016545897 = score(doc=2808,freq=1.0), product of:
              0.08274531 = queryWeight, product of:
                2.5610144 = boost
                3.199388 = idf(docFreq=4924, maxDocs=44421)
                0.010098677 = queryNorm
              0.19996175 = fieldWeight in 2808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.199388 = idf(docFreq=4924, maxDocs=44421)
                0.0625 = fieldNorm(doc=2808)
          0.1463631 = weight(abstract_txt:reuters in 2808) [ClassicSimilarity], result of:
            0.1463631 = score(doc=2808,freq=1.0), product of:
              0.30917698 = queryWeight, product of:
                4.042018 = boost
                7.574333 = idf(docFreq=61, maxDocs=44421)
                0.010098677 = queryNorm
              0.47339582 = fieldWeight in 2808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.574333 = idf(docFreq=61, maxDocs=44421)
                0.0625 = fieldNorm(doc=2808)
          0.32777885 = weight(abstract_txt:21578 in 2808) [ClassicSimilarity], result of:
            0.32777885 = score(doc=2808,freq=1.0), product of:
              0.52922463 = queryWeight, product of:
                5.2882833 = boost
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.010098677 = queryNorm
              0.61935675 = fieldWeight in 2808, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.0625 = fieldNorm(doc=2808)
        0.28 = coord(7/25)
    
  3. Egghe, L.; Rousseau, R.: A theoretical study of recall and precision using a topological approach to information retrieval (1998) 0.13
    0.13143197 = sum of:
      0.13143197 = product of:
        0.8214498 = sum of:
          0.035617884 = weight(abstract_txt:standard in 4267) [ClassicSimilarity], result of:
            0.035617884 = score(doc=4267,freq=1.0), product of:
              0.06025617 = queryWeight, product of:
                1.2617707 = boost
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.010098677 = queryNorm
              0.59110767 = fieldWeight in 4267, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.7288613 = idf(docFreq=1066, maxDocs=44421)
                0.125 = fieldNorm(doc=4267)
          0.053477407 = weight(abstract_txt:systems in 4267) [ClassicSimilarity], result of:
            0.053477407 = score(doc=4267,freq=4.0), product of:
              0.06270849 = queryWeight, product of:
                1.8203623 = boost
                3.411175 = idf(docFreq=3984, maxDocs=44421)
                0.010098677 = queryNorm
              0.85279375 = fieldWeight in 4267, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.411175 = idf(docFreq=3984, maxDocs=44421)
                0.125 = fieldNorm(doc=4267)
          0.041274946 = weight(abstract_txt:different in 4267) [ClassicSimilarity], result of:
            0.041274946 = score(doc=4267,freq=1.0), product of:
              0.09022505 = queryWeight, product of:
                2.4412556 = boost
                3.6597328 = idf(docFreq=3107, maxDocs=44421)
                0.010098677 = queryNorm
              0.4574666 = fieldWeight in 4267, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6597328 = idf(docFreq=3107, maxDocs=44421)
                0.125 = fieldNorm(doc=4267)
          0.69107956 = weight(abstract_txt:subsets in 4267) [ClassicSimilarity], result of:
            0.69107956 = score(doc=4267,freq=1.0), product of:
              0.6605882 = queryWeight, product of:
                7.8159018 = boost
                8.369263 = idf(docFreq=27, maxDocs=44421)
                0.010098677 = queryNorm
              1.0461578 = fieldWeight in 4267, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.369263 = idf(docFreq=27, maxDocs=44421)
                0.125 = fieldNorm(doc=4267)
        0.16 = coord(4/25)
    
  4. Hung, C.-M.; Chien, L.-F.: Web-based text classification in the absence of manually labeled training documents (2007) 0.12
    0.12418715 = sum of:
      0.12418715 = product of:
        0.62093574 = sum of:
          0.01108171 = weight(abstract_txt:they in 1087) [ClassicSimilarity], result of:
            0.01108171 = score(doc=1087,freq=1.0), product of:
              0.0378478 = queryWeight, product of:
                3.7477977 = idf(docFreq=2845, maxDocs=44421)
                0.010098677 = queryNorm
              0.2927967 = fieldWeight in 1087, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7477977 = idf(docFreq=2845, maxDocs=44421)
                0.078125 = fieldNorm(doc=1087)
          0.009844442 = weight(abstract_txt:that in 1087) [ClassicSimilarity], result of:
            0.009844442 = score(doc=1087,freq=2.0), product of:
              0.037676174 = queryWeight, product of:
                1.5775498 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.010098677 = queryNorm
              0.2612909 = fieldWeight in 1087, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.078125 = fieldNorm(doc=1087)
          0.0073321667 = weight(abstract_txt:this in 1087) [ClassicSimilarity], result of:
            0.0073321667 = score(doc=1087,freq=1.0), product of:
              0.039003566 = queryWeight, product of:
                1.6050991 = boost
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.010098677 = queryNorm
              0.18798709 = fieldWeight in 1087, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4062347 = idf(docFreq=10885, maxDocs=44421)
                0.078125 = fieldNorm(doc=1087)
          0.18295386 = weight(abstract_txt:reuters in 1087) [ClassicSimilarity], result of:
            0.18295386 = score(doc=1087,freq=1.0), product of:
              0.30917698 = queryWeight, product of:
                4.042018 = boost
                7.574333 = idf(docFreq=61, maxDocs=44421)
                0.010098677 = queryNorm
              0.5917448 = fieldWeight in 1087, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.574333 = idf(docFreq=61, maxDocs=44421)
                0.078125 = fieldNorm(doc=1087)
          0.40972355 = weight(abstract_txt:21578 in 1087) [ClassicSimilarity], result of:
            0.40972355 = score(doc=1087,freq=1.0), product of:
              0.52922463 = queryWeight, product of:
                5.2882833 = boost
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.010098677 = queryNorm
              0.7741959 = fieldWeight in 1087, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.078125 = fieldNorm(doc=1087)
        0.2 = coord(5/25)
    
  5. Whyte, G.; Bytheway, A.; Edwards, C.: Understanding user perceptions of information systems success (1997) 0.12
    0.1168 = sum of:
      0.1168 = product of:
        0.584 = sum of:
          0.013298052 = weight(abstract_txt:they in 2367) [ClassicSimilarity], result of:
            0.013298052 = score(doc=2367,freq=1.0), product of:
              0.0378478 = queryWeight, product of:
                3.7477977 = idf(docFreq=2845, maxDocs=44421)
                0.010098677 = queryNorm
              0.35135603 = fieldWeight in 2367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.7477977 = idf(docFreq=2845, maxDocs=44421)
                0.09375 = fieldNorm(doc=2367)
          0.012218271 = weight(abstract_txt:these in 2367) [ClassicSimilarity], result of:
            0.012218271 = score(doc=2367,freq=1.0), product of:
              0.040946696 = queryWeight, product of:
                1.2738982 = boost
                3.1828754 = idf(docFreq=5006, maxDocs=44421)
                0.010098677 = queryNorm
              0.29839456 = fieldWeight in 2367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.1828754 = idf(docFreq=5006, maxDocs=44421)
                0.09375 = fieldNorm(doc=2367)
          0.0118133295 = weight(abstract_txt:that in 2367) [ClassicSimilarity], result of:
            0.0118133295 = score(doc=2367,freq=2.0), product of:
              0.037676174 = queryWeight, product of:
                1.5775498 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.010098677 = queryNorm
              0.31354907 = fieldWeight in 2367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.09375 = fieldNorm(doc=2367)
          0.02836068 = weight(abstract_txt:systems in 2367) [ClassicSimilarity], result of:
            0.02836068 = score(doc=2367,freq=2.0), product of:
              0.06270849 = queryWeight, product of:
                1.8203623 = boost
                3.411175 = idf(docFreq=3984, maxDocs=44421)
                0.010098677 = queryNorm
              0.4522622 = fieldWeight in 2367, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.411175 = idf(docFreq=3984, maxDocs=44421)
                0.09375 = fieldNorm(doc=2367)
          0.51830965 = weight(abstract_txt:subsets in 2367) [ClassicSimilarity], result of:
            0.51830965 = score(doc=2367,freq=1.0), product of:
              0.6605882 = queryWeight, product of:
                7.8159018 = boost
                8.369263 = idf(docFreq=27, maxDocs=44421)
                0.010098677 = queryNorm
              0.7846184 = fieldWeight in 2367, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.369263 = idf(docFreq=27, maxDocs=44421)
                0.09375 = fieldNorm(doc=2367)
        0.2 = coord(5/25)
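
The content-similarity trees combine, for each matched term, a queryWeight (boost * idf * queryNorm) with a fieldWeight (tf * idf * fieldNorm), sum the per-term scores, and scale the sum by coord (matched terms / query terms). The sketch below recomputes entry 5 (doc 2367) from the numbers in its tree, assuming the same ClassicSimilarity formulas for tf and idf as the listing suggests. The helper names are hypothetical. The queryNorm and boost values are copied from the listing, not derived.

```python
import math

MAX_DOCS = 44421
QUERY_NORM = 0.010098677  # copied from the explain output, not recomputed


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Assumed ClassicSimilarity idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, field_norm: float, boost: float = 1.0) -> float:
    """Per-term score = queryWeight * fieldWeight, following the tree layout."""
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    field_weight = math.sqrt(freq) * idf(doc_freq) * field_norm
    return query_weight * field_weight


FIELD_NORM = 0.09375  # fieldNorm(doc=2367)

# (term, termFreq, docFreq, boost) for the five terms matched in doc 2367.
# "they" shows no boost line in the tree, so a boost of 1.0 is assumed.
matched_terms = [
    ("they", 1.0, 2845, 1.0),
    ("these", 1.0, 5006, 1.2738982),
    ("that", 2.0, 11344, 1.5775498),
    ("systems", 2.0, 3984, 1.8203623),
    ("subsets", 1.0, 27, 7.8159018),
]

raw_sum = sum(term_score(freq, df, FIELD_NORM, boost) for _, freq, df, boost in matched_terms)
coord = len(matched_terms) / 25  # 5 of 25 query terms matched
final_score = raw_sum * coord

print(f"sum of term scores: {raw_sum:.4f}")
print(f"coord: {coord}")
print(f"final score: {final_score:.4f}")
```

If the assumptions hold, the result should land close to the 0.1168 reported for that entry. Nearly all of it comes from the rare, heavily boosted term "subsets", which explains why a document unrelated to text categorization still ranks among the similar documents.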