Document (#29422)

Author
Robertson, S.
Title
Understanding inverse document frequency : on theoretical arguments for IDF
Source
Journal of documentation. 60(2004) no.5, pp. 503-520
Year
2004
Abstract
The term-weighting function known as IDF was proposed in 1972, and has since been extremely widely used, usually as part of a TF*IDF function. It is often described as a heuristic, and many papers have been written (some based on Shannon's Information Theory) seeking to establish some theoretical basis for it. Some of these attempts are reviewed, and it is shown that the Information Theory approaches are problematic, but that there are good theoretical justifications of both IDF and TF*IDF in the traditional probabilistic model of information retrieval.
Footnote
See also: http://www.emeraldinsight.com/10.1108/00220410410560582.
Theme
Retrieval algorithms
Object
IDF
TF*IDF

Similar documents (author)

  1. Robertson, M.A.: Windows 3.0 for the online searcher (1991) 4.57
    4.574651 = sum of:
      4.574651 = weight(author_txt:robertson in 591) [ClassicSimilarity], result of:
        4.574651 = fieldWeight in 591, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          7.319441 = idf(docFreq=79, maxDocs=44421)
          0.625 = fieldNorm(doc=591)
    
  2. Robertson, S.E.: Some recent theories and models in information retrieval (1980) 4.57
    4.574651 = sum of:
      4.574651 = weight(author_txt:robertson in 1325) [ClassicSimilarity], result of:
        4.574651 = fieldWeight in 1325, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          7.319441 = idf(docFreq=79, maxDocs=44421)
          0.625 = fieldNorm(doc=1325)
    
  3. Robertson, S.E.: Theories and models in information retrieval (1977) 4.57
    4.574651 = sum of:
      4.574651 = weight(author_txt:robertson in 1843) [ClassicSimilarity], result of:
        4.574651 = fieldWeight in 1843, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          7.319441 = idf(docFreq=79, maxDocs=44421)
          0.625 = fieldNorm(doc=1843)
    
  4. Robertson, S.E.: On term selection for query expansion (1990) 4.57
    4.574651 = sum of:
      4.574651 = weight(author_txt:robertson in 2649) [ClassicSimilarity], result of:
        4.574651 = fieldWeight in 2649, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          7.319441 = idf(docFreq=79, maxDocs=44421)
          0.625 = fieldNorm(doc=2649)
    
  5. Robertson, S.E.: On relevance weight estimation and query expansion (1986) 4.57
    4.574651 = sum of:
      4.574651 = weight(author_txt:robertson in 3874) [ClassicSimilarity], result of:
        4.574651 = fieldWeight in 3874, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          7.319441 = idf(docFreq=79, maxDocs=44421)
          0.625 = fieldNorm(doc=3874)
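
The five author matches above all carry the same score because each tree is the same product: tf x idf x fieldNorm. The sketch below recomputes that product in Python. It assumes the idf is 1 + ln(maxDocs / (docFreq + 1)) and the tf is sqrt(freq), which is consistent with the numbers printed above and with the classic Lucene TF-IDF formula; the function names are illustrative and are not part of any Lucene API.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Assumed form, inferred from the explain output:
    # idf = 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def classic_tf(freq: float) -> float:
    # Assumed form: tf = sqrt(termFreq), e.g. tf(freq=1.0) = 1.0
    return math.sqrt(freq)


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight = tf * idf * fieldNorm, as in each "product of:" tree above
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from the author_txt:robertson trees above.
score = field_weight(freq=1.0, doc_freq=79, max_docs=44421, field_norm=0.625)
print(f"idf   = {classic_idf(79, 44421):.6f}")  # listing reports 7.319441
print(f"score = {score:.6f}")                   # listing reports 4.574651
```

Lucene computes these values in single precision, so a double-precision recomputation may differ from the listing in the last digit.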
    

Similar documents (content)

  1. Wong, S.K.M.; Yao, Y.Y.: An information-theoretic measure of term specificity (1992) 0.21
    0.20646715 = sum of:
      0.20646715 = product of:
        0.8602798 = sum of:
          0.11395829 = weight(abstract_txt:frequency in 4806) [ClassicSimilarity], result of:
            0.11395829 = score(doc=4806,freq=2.0), product of:
              0.1444852 = queryWeight, product of:
                1.0708815 = boost
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.022680137 = queryNorm
              0.7887195 = fieldWeight in 4806, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.09375 = fieldNorm(doc=4806)
          0.18361059 = weight(abstract_txt:weighting in 4806) [ClassicSimilarity], result of:
            0.18361059 = score(doc=4806,freq=2.0), product of:
              0.198575 = queryWeight, product of:
                1.2554286 = boost
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.022680137 = queryNorm
              0.924641 = fieldWeight in 4806, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.09375 = fieldNorm(doc=4806)
          0.022983268 = weight(abstract_txt:information in 4806) [ClassicSimilarity], result of:
            0.022983268 = score(doc=4806,freq=2.0), product of:
              0.07166509 = queryWeight, product of:
                1.3063037 = boost
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.022680137 = queryNorm
              0.3207038 = fieldWeight in 4806, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.09375 = fieldNorm(doc=4806)
          0.17434174 = weight(abstract_txt:inverse in 4806) [ClassicSimilarity], result of:
            0.17434174 = score(doc=4806,freq=1.0), product of:
              0.24169648 = queryWeight, product of:
                1.3850482 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.022680137 = queryNorm
              0.7213251 = fieldWeight in 4806, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.09375 = fieldNorm(doc=4806)
          0.30822745 = weight(abstract_txt:justifications in 4806) [ClassicSimilarity], result of:
            0.30822745 = score(doc=4806,freq=1.0), product of:
              0.3533868 = queryWeight, product of:
                1.6747688 = boost
                9.303573 = idf(docFreq=10, maxDocs=44421)
                0.022680137 = queryNorm
              0.8722099 = fieldWeight in 4806, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.303573 = idf(docFreq=10, maxDocs=44421)
                0.09375 = fieldNorm(doc=4806)
          0.057158478 = weight(abstract_txt:some in 4806) [ClassicSimilarity], result of:
            0.057158478 = score(doc=4806,freq=1.0), product of:
              0.16574112 = queryWeight, product of:
                1.9865773 = boost
                3.6785707 = idf(docFreq=3049, maxDocs=44421)
                0.022680137 = queryNorm
              0.344866 = fieldWeight in 4806, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6785707 = idf(docFreq=3049, maxDocs=44421)
                0.09375 = fieldNorm(doc=4806)
        0.24 = coord(6/25)
    
  2. Robertson, S.E.; Sparck Jones, K.: Relevance weighting of search terms (1976) 0.18
    0.17763075 = sum of:
      0.17763075 = product of:
        0.74012816 = sum of:
          0.078001745 = weight(abstract_txt:shown in 139) [ClassicSimilarity], result of:
            0.078001745 = score(doc=139,freq=1.0), product of:
              0.1275776 = queryWeight, product of:
                1.0062755 = boost
                5.59 = idf(docFreq=450, maxDocs=44421)
                0.022680137 = queryNorm
              0.61140627 = fieldWeight in 139, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.59 = idf(docFreq=450, maxDocs=44421)
                0.109375 = fieldNorm(doc=139)
          0.1383745 = weight(abstract_txt:probabilistic in 139) [ClassicSimilarity], result of:
            0.1383745 = score(doc=139,freq=1.0), product of:
              0.18695724 = queryWeight, product of:
                1.2181503 = boost
                6.7669935 = idf(docFreq=138, maxDocs=44421)
                0.022680137 = queryNorm
              0.7401399 = fieldWeight in 139, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.7669935 = idf(docFreq=138, maxDocs=44421)
                0.109375 = fieldNorm(doc=139)
          0.2623555 = weight(abstract_txt:weighting in 139) [ClassicSimilarity], result of:
            0.2623555 = score(doc=139,freq=3.0), product of:
              0.198575 = queryWeight, product of:
                1.2554286 = boost
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.022680137 = queryNorm
              1.321191 = fieldWeight in 139, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.109375 = fieldNorm(doc=139)
          0.02681381 = weight(abstract_txt:information in 139) [ClassicSimilarity], result of:
            0.02681381 = score(doc=139,freq=2.0), product of:
              0.07166509 = queryWeight, product of:
                1.3063037 = boost
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.022680137 = queryNorm
              0.37415442 = fieldWeight in 139, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.109375 = fieldNorm(doc=139)
          0.08301166 = weight(abstract_txt:theory in 139) [ClassicSimilarity], result of:
            0.08301166 = score(doc=139,freq=1.0), product of:
              0.16754866 = queryWeight, product of:
                1.6308544 = boost
                4.529811 = idf(docFreq=1301, maxDocs=44421)
                0.022680137 = queryNorm
              0.49544805 = fieldWeight in 139, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.529811 = idf(docFreq=1301, maxDocs=44421)
                0.109375 = fieldNorm(doc=139)
          0.15157095 = weight(abstract_txt:theoretical in 139) [ClassicSimilarity], result of:
            0.15157095 = score(doc=139,freq=1.0), product of:
              0.28652066 = queryWeight, product of:
                2.6119707 = boost
                4.83662 = idf(docFreq=957, maxDocs=44421)
                0.022680137 = queryNorm
              0.5290053 = fieldWeight in 139, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.83662 = idf(docFreq=957, maxDocs=44421)
                0.109375 = fieldNorm(doc=139)
        0.24 = coord(6/25)
    
  3. Cornelius, I.: Theorizing information for information science (2002) 0.13
    0.13311954 = sum of:
      0.13311954 = product of:
        0.47542694 = sum of:
          0.044166792 = weight(abstract_txt:attempts in 5244) [ClassicSimilarity], result of:
            0.044166792 = score(doc=5244,freq=2.0), product of:
              0.13767788 = queryWeight, product of:
                1.0453502 = boost
                5.807065 = idf(docFreq=362, maxDocs=44421)
                0.022680137 = queryNorm
              0.32079804 = fieldWeight in 5244, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.807065 = idf(docFreq=362, maxDocs=44421)
                0.0390625 = fieldNorm(doc=5244)
          0.071265794 = weight(abstract_txt:problematic in 5244) [ClassicSimilarity], result of:
            0.071265794 = score(doc=5244,freq=2.0), product of:
              0.18940336 = queryWeight, product of:
                1.2260934 = boost
                6.8111186 = idf(docFreq=132, maxDocs=44421)
                0.022680137 = queryNorm
              0.3762647 = fieldWeight in 5244, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.8111186 = idf(docFreq=132, maxDocs=44421)
                0.0390625 = fieldNorm(doc=5244)
          0.015061376 = weight(abstract_txt:been in 5244) [ClassicSimilarity], result of:
            0.015061376 = score(doc=5244,freq=1.0), product of:
              0.10667517 = queryWeight, product of:
                1.3012968 = boost
                3.614442 = idf(docFreq=3251, maxDocs=44421)
                0.022680137 = queryNorm
              0.14118914 = fieldWeight in 5244, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.614442 = idf(docFreq=3251, maxDocs=44421)
                0.0390625 = fieldNorm(doc=5244)
          0.03948435 = weight(abstract_txt:information in 5244) [ClassicSimilarity], result of:
            0.03948435 = score(doc=5244,freq=34.0), product of:
              0.07166509 = queryWeight, product of:
                1.3063037 = boost
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.022680137 = queryNorm
              0.5509565 = fieldWeight in 5244, product of:
                5.8309517 = tf(freq=34.0), with freq of:
                  34.0 = termFreq=34.0
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.0390625 = fieldNorm(doc=5244)
          0.15730418 = weight(abstract_txt:shannon's in 5244) [ClassicSimilarity], result of:
            0.15730418 = score(doc=5244,freq=2.0), product of:
              0.32109022 = queryWeight, product of:
                1.5964056 = boost
                8.868255 = idf(docFreq=16, maxDocs=44421)
                0.022680137 = queryNorm
              0.4899065 = fieldWeight in 5244, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.868255 = idf(docFreq=16, maxDocs=44421)
                0.0390625 = fieldNorm(doc=5244)
          0.10689386 = weight(abstract_txt:theory in 5244) [ClassicSimilarity], result of:
            0.10689386 = score(doc=5244,freq=13.0), product of:
              0.16754866 = queryWeight, product of:
                1.6308544 = boost
                4.529811 = idf(docFreq=1301, maxDocs=44421)
                0.022680137 = queryNorm
              0.63798696 = fieldWeight in 5244, product of:
                3.6055512 = tf(freq=13.0), with freq of:
                  13.0 = termFreq=13.0
                4.529811 = idf(docFreq=1301, maxDocs=44421)
                0.0390625 = fieldNorm(doc=5244)
          0.04125058 = weight(abstract_txt:some in 5244) [ClassicSimilarity], result of:
            0.04125058 = score(doc=5244,freq=3.0), product of:
              0.16574112 = queryWeight, product of:
                1.9865773 = boost
                3.6785707 = idf(docFreq=3049, maxDocs=44421)
                0.022680137 = queryNorm
              0.2488856 = fieldWeight in 5244, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.6785707 = idf(docFreq=3049, maxDocs=44421)
                0.0390625 = fieldNorm(doc=5244)
        0.28 = coord(7/25)
    
  4. Liu, X.; Croft, W.B.: Statistical language modeling for information retrieval (2004) 0.12
    0.1236782 = sum of:
      0.1236782 = product of:
        0.44170785 = sum of:
          0.0470054 = weight(abstract_txt:frequency in 5277) [ClassicSimilarity], result of:
            0.0470054 = score(doc=5277,freq=1.0), product of:
              0.1444852 = queryWeight, product of:
                1.0708815 = boost
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.022680137 = queryNorm
              0.3253302 = fieldWeight in 5277, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5277)
          0.06918725 = weight(abstract_txt:probabilistic in 5277) [ClassicSimilarity], result of:
            0.06918725 = score(doc=5277,freq=1.0), product of:
              0.18695724 = queryWeight, product of:
                1.2181503 = boost
                6.7669935 = idf(docFreq=138, maxDocs=44421)
                0.022680137 = queryNorm
              0.37006995 = fieldWeight in 5277, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.7669935 = idf(docFreq=138, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5277)
          0.029820003 = weight(abstract_txt:been in 5277) [ClassicSimilarity], result of:
            0.029820003 = score(doc=5277,freq=2.0), product of:
              0.10667517 = queryWeight, product of:
                1.3012968 = boost
                3.614442 = idf(docFreq=3251, maxDocs=44421)
                0.022680137 = queryNorm
              0.27954024 = fieldWeight in 5277, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.614442 = idf(docFreq=3251, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5277)
          0.02119818 = weight(abstract_txt:information in 5277) [ClassicSimilarity], result of:
            0.02119818 = score(doc=5277,freq=5.0), product of:
              0.07166509 = queryWeight, product of:
                1.3063037 = boost
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.022680137 = queryNorm
              0.29579505 = fieldWeight in 5277, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5277)
          0.1557232 = weight(abstract_txt:shannon's in 5277) [ClassicSimilarity], result of:
            0.1557232 = score(doc=5277,freq=1.0), product of:
              0.32109022 = queryWeight, product of:
                1.5964056 = boost
                8.868255 = idf(docFreq=16, maxDocs=44421)
                0.022680137 = queryNorm
              0.48498267 = fieldWeight in 5277, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.868255 = idf(docFreq=16, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5277)
          0.04150583 = weight(abstract_txt:theory in 5277) [ClassicSimilarity], result of:
            0.04150583 = score(doc=5277,freq=1.0), product of:
              0.16754866 = queryWeight, product of:
                1.6308544 = boost
                4.529811 = idf(docFreq=1301, maxDocs=44421)
                0.022680137 = queryNorm
              0.24772403 = fieldWeight in 5277, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.529811 = idf(docFreq=1301, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5277)
          0.07726801 = weight(abstract_txt:function in 5277) [ClassicSimilarity], result of:
            0.07726801 = score(doc=5277,freq=1.0), product of:
              0.2535526 = queryWeight, product of:
                2.0062208 = boost
                5.5724173 = idf(docFreq=458, maxDocs=44421)
                0.022680137 = queryNorm
              0.30474156 = fieldWeight in 5277, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5724173 = idf(docFreq=458, maxDocs=44421)
                0.0546875 = fieldNorm(doc=5277)
        0.28 = coord(7/25)
    
  5. Huang, X.; Robertson, S.E.: Application of probabilistic methods to Chinese text retrieval (1997) 0.12
    0.11809974 = sum of:
      0.11809974 = product of:
        0.5904987 = sum of:
          0.06561555 = weight(abstract_txt:good in 5706) [ClassicSimilarity], result of:
            0.06561555 = score(doc=5706,freq=1.0), product of:
              0.1259913 = queryWeight, product of:
                5.5551386 = idf(docFreq=466, maxDocs=44421)
                0.022680137 = queryNorm
              0.5207943 = fieldWeight in 5706, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5551386 = idf(docFreq=466, maxDocs=44421)
                0.09375 = fieldNorm(doc=5706)
          0.20543288 = weight(abstract_txt:probabilistic in 5706) [ClassicSimilarity], result of:
            0.20543288 = score(doc=5706,freq=3.0), product of:
              0.18695724 = queryWeight, product of:
                1.2181503 = boost
                6.7669935 = idf(docFreq=138, maxDocs=44421)
                0.022680137 = queryNorm
              1.0988228 = fieldWeight in 5706, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.7669935 = idf(docFreq=138, maxDocs=44421)
                0.09375 = fieldNorm(doc=5706)
          0.12983231 = weight(abstract_txt:weighting in 5706) [ClassicSimilarity], result of:
            0.12983231 = score(doc=5706,freq=1.0), product of:
              0.198575 = queryWeight, product of:
                1.2554286 = boost
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.022680137 = queryNorm
              0.65382 = fieldWeight in 5706, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.9740796 = idf(docFreq=112, maxDocs=44421)
                0.09375 = fieldNorm(doc=5706)
          0.057158478 = weight(abstract_txt:some in 5706) [ClassicSimilarity], result of:
            0.057158478 = score(doc=5706,freq=1.0), product of:
              0.16574112 = queryWeight, product of:
                1.9865773 = boost
                3.6785707 = idf(docFreq=3049, maxDocs=44421)
                0.022680137 = queryNorm
              0.344866 = fieldWeight in 5706, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.6785707 = idf(docFreq=3049, maxDocs=44421)
                0.09375 = fieldNorm(doc=5706)
          0.13245945 = weight(abstract_txt:function in 5706) [ClassicSimilarity], result of:
            0.13245945 = score(doc=5706,freq=1.0), product of:
              0.2535526 = queryWeight, product of:
                2.0062208 = boost
                5.5724173 = idf(docFreq=458, maxDocs=44421)
                0.022680137 = queryNorm
              0.5224141 = fieldWeight in 5706, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5724173 = idf(docFreq=458, maxDocs=44421)
                0.09375 = fieldNorm(doc=5706)
        0.2 = coord(5/25)
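
The content-similarity trees above combine per-term scores as sum(queryWeight x fieldWeight) x coord, where queryWeight = boost x idf x queryNorm and fieldWeight = tf x idf x fieldNorm. The following Python sketch recomputes the last entry (doc 5706) from the values in its tree. It assumes the same idf and tf forms the listing appears to use (1 + ln(maxDocs / (docFreq + 1)) and sqrt(freq)), and it assumes a boost of 1.0 for the term "good", whose tree shows no boost line. All names are illustrative.

```python
import math

MAX_DOCS = 44421          # maxDocs from the listing
QUERY_NORM = 0.022680137  # queryNorm from the listing
FIELD_NORM = 0.09375      # fieldNorm(doc=5706)
QUERY_TERMS = 25          # denominator of coord(5/25)


def classic_idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    # Assumed form, inferred from the explain output.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost: float, doc_freq: int, freq: float, field_norm: float) -> float:
    idf = classic_idf(doc_freq)
    query_weight = boost * idf * QUERY_NORM            # query side
    field_weight = math.sqrt(freq) * idf * field_norm  # document side
    return query_weight * field_weight


# (term, boost, docFreq, termFreq) copied from the doc 5706 tree.
# The boost of 1.0 for "good" is an assumption: the tree omits it.
matched = [
    ("good",          1.0,       466,  1.0),
    ("probabilistic", 1.2181503, 138,  3.0),
    ("weighting",     1.2554286, 112,  1.0),
    ("some",          1.9865773, 3049, 1.0),
    ("function",      2.0062208, 458,  1.0),
]

raw = sum(term_score(b, df, f, FIELD_NORM) for _, b, df, f in matched)
coord = len(matched) / QUERY_TERMS  # fraction of query terms that matched
final = raw * coord

print(f"sum   = {raw:.7f}")    # listing reports 0.5904987
print(f"coord = {coord}")      # listing reports 0.2
print(f"score = {final:.8f}")  # listing reports 0.11809974
```

Because idf enters both queryWeight and fieldWeight, each term's contribution grows with the square of its idf. This is why rare terms such as "probabilistic" dominate the sum, while a frequent term such as "information" contributes little in the other trees.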