Document (#29156)

Author
Aizawa, A.
Title
¬An information-theoretic perspective of tf-idf measures
Source
Information processing and management. 39(2003) no.1, pp. 45-65
Year
2003
Abstract
This paper presents a mathematical definition of the "probability-weighted amount of information" (PWI), a measure of specificity of terms in documents that is based on an information-theoretic view of retrieval events. The proposed PWI is expressed as a product of the occurrence probabilities of terms and their amounts of information, and corresponds well with the conventional term frequency - inverse document frequency measures that are commonly used in today's information retrieval systems. The mathematical definition of the PWI is shown, together with some illustrative examples of the calculation.
Theme
Retrieval algorithms
Object
TF/iDF
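
The abstract describes the PWI as a product of a term's occurrence probability and its amount of information. The sketch below illustrates only that general idea, assuming the amount of information is the self-information -log2(p) of a single probability p. The function names are hypothetical, and the paper's exact definition (which this record does not reproduce) may differ.

```python
import math


def amount_of_information(p: float) -> float:
    """Self-information in bits of an event with probability p (assumed form)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


def probability_weighted_information(p: float) -> float:
    """Occurrence probability multiplied by its amount of information.

    This is an illustrative reading of the abstract, not the paper's formula.
    """
    return p * amount_of_information(p)


# A term with occurrence probability 0.25 carries 2 bits of information,
# so its probability-weighted amount under this reading is 0.25 * 2 = 0.5.
example_pwi = probability_weighted_information(0.25)
```

Under this reading, very rare terms carry much information but contribute little weight because their probability is small, which is the kind of trade-off that tf-idf style measures also make.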

Similar documents (content)

  1. Bruza, P.D.; Huibers, T.W.C.: ¬A study of aboutness in information retrieval (1996) 0.20
    0.20494144 = sum of:
      0.20494144 = product of:
        0.85392267 = sum of:
          0.079142325 = weight(abstract_txt:expressed in 774) [ClassicSimilarity], result of:
            0.079142325 = score(doc=774,freq=1.0), product of:
              0.13597448 = queryWeight, product of:
                1.0598305 = boost
                6.208406 = idf(docFreq=242, maxDocs=44421)
                0.02066526 = queryNorm
              0.58203804 = fieldWeight in 774, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.208406 = idf(docFreq=242, maxDocs=44421)
                0.09375 = fieldNorm(doc=774)
          0.062145576 = weight(abstract_txt:retrieval in 774) [ClassicSimilarity], result of:
            0.062145576 = score(doc=774,freq=5.0), product of:
              0.08527303 = queryWeight, product of:
                1.1869395 = boost
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.02066526 = queryNorm
              0.7287835 = fieldWeight in 774, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.09375 = fieldNorm(doc=774)
          0.17342071 = weight(abstract_txt:probabilities in 774) [ClassicSimilarity], result of:
            0.17342071 = score(doc=774,freq=1.0), product of:
              0.22939582 = queryWeight, product of:
                1.3765769 = boost
                8.063882 = idf(docFreq=37, maxDocs=44421)
                0.02066526 = queryNorm
              0.75598896 = fieldWeight in 774, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.063882 = idf(docFreq=37, maxDocs=44421)
                0.09375 = fieldNorm(doc=774)
          0.15640718 = weight(abstract_txt:definition in 774) [ClassicSimilarity], result of:
            0.15640718 = score(doc=774,freq=2.0), product of:
              0.21413583 = queryWeight, product of:
                1.8809073 = boost
                5.509105 = idf(docFreq=488, maxDocs=44421)
                0.02066526 = queryNorm
              0.73041105 = fieldWeight in 774, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.509105 = idf(docFreq=488, maxDocs=44421)
                0.09375 = fieldNorm(doc=774)
          0.057328112 = weight(abstract_txt:information in 774) [ClassicSimilarity], result of:
            0.057328112 = score(doc=774,freq=6.0), product of:
              0.10320551 = queryWeight, product of:
                2.0646393 = boost
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.02066526 = queryNorm
              0.5554753 = fieldWeight in 774, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.09375 = fieldNorm(doc=774)
          0.3254788 = weight(abstract_txt:theoretic in 774) [ClassicSimilarity], result of:
            0.3254788 = score(doc=774,freq=1.0), product of:
              0.43975422 = queryWeight, product of:
                2.6954281 = boost
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.02066526 = queryNorm
              0.74013793 = fieldWeight in 774, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.09375 = fieldNorm(doc=774)
        0.24 = coord(6/25)
    
  2. Wong, S.K.M.; Yao, Y.Y.: ¬An information-theoretic measure of term specificity (1992) 0.19
    0.18869124 = sum of:
      0.18869124 = product of:
        0.9434562 = sum of:
          0.10248393 = weight(abstract_txt:occurrence in 4806) [ClassicSimilarity], result of:
            0.10248393 = score(doc=4806,freq=1.0), product of:
              0.16154322 = queryWeight, product of:
                1.1551865 = boost
                6.7669935 = idf(docFreq=138, maxDocs=44421)
                0.02066526 = queryNorm
              0.6344056 = fieldWeight in 4806, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.7669935 = idf(docFreq=138, maxDocs=44421)
                0.09375 = fieldNorm(doc=4806)
          0.15064259 = weight(abstract_txt:inverse in 4806) [ClassicSimilarity], result of:
            0.15064259 = score(doc=4806,freq=1.0), product of:
              0.20884146 = queryWeight, product of:
                1.3134577 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.02066526 = queryNorm
              0.7213251 = fieldWeight in 4806, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.09375 = fieldNorm(doc=4806)
          0.19693476 = weight(abstract_txt:frequency in 4806) [ClassicSimilarity], result of:
            0.19693476 = score(doc=4806,freq=2.0), product of:
              0.24968924 = queryWeight, product of:
                2.0310595 = boost
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.02066526 = queryNorm
              0.7887195 = fieldWeight in 4806, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.09375 = fieldNorm(doc=4806)
          0.0330984 = weight(abstract_txt:information in 4806) [ClassicSimilarity], result of:
            0.0330984 = score(doc=4806,freq=2.0), product of:
              0.10320551 = queryWeight, product of:
                2.0646393 = boost
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.02066526 = queryNorm
              0.3207038 = fieldWeight in 4806, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.09375 = fieldNorm(doc=4806)
          0.4602965 = weight(abstract_txt:theoretic in 4806) [ClassicSimilarity], result of:
            0.4602965 = score(doc=4806,freq=2.0), product of:
              0.43975422 = queryWeight, product of:
                2.6954281 = boost
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.02066526 = queryNorm
              1.0467131 = fieldWeight in 4806, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.09375 = fieldNorm(doc=4806)
        0.2 = coord(5/25)
    
  3. Rölleke, T.; Tsikrika, T.; Kazai, G.: ¬A general matrix framework for modelling Information Retrieval (2006) 0.19
    0.18530874 = sum of:
      0.18530874 = product of:
        0.5790898 = sum of:
          0.052761547 = weight(abstract_txt:expressed in 1957) [ClassicSimilarity], result of:
            0.052761547 = score(doc=1957,freq=1.0), product of:
              0.13597448 = queryWeight, product of:
                1.0598305 = boost
                6.208406 = idf(docFreq=242, maxDocs=44421)
                0.02066526 = queryNorm
              0.38802537 = fieldWeight in 1957, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.208406 = idf(docFreq=242, maxDocs=44421)
                0.0625 = fieldNorm(doc=1957)
          0.03705646 = weight(abstract_txt:retrieval in 1957) [ClassicSimilarity], result of:
            0.03705646 = score(doc=1957,freq=4.0), product of:
              0.08527303 = queryWeight, product of:
                1.1869395 = boost
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.02066526 = queryNorm
              0.4345625 = fieldWeight in 1957, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.0625 = fieldNorm(doc=1957)
          0.10042839 = weight(abstract_txt:inverse in 1957) [ClassicSimilarity], result of:
            0.10042839 = score(doc=1957,freq=1.0), product of:
              0.20884146 = queryWeight, product of:
                1.3134577 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.02066526 = queryNorm
              0.4808834 = fieldWeight in 1957, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.0625 = fieldNorm(doc=1957)
          0.02915734 = weight(abstract_txt:terms in 1957) [ClassicSimilarity], result of:
            0.02915734 = score(doc=1957,freq=1.0), product of:
              0.11536861 = queryWeight, product of:
                1.3805958 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.02066526 = queryNorm
              0.252732 = fieldWeight in 1957, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.0625 = fieldNorm(doc=1957)
          0.070398636 = weight(abstract_txt:measures in 1957) [ClassicSimilarity], result of:
            0.070398636 = score(doc=1957,freq=1.0), product of:
              0.20763403 = queryWeight, product of:
                1.8521323 = boost
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.02066526 = queryNorm
              0.3390515 = fieldWeight in 1957, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.0625 = fieldNorm(doc=1957)
          0.16079657 = weight(abstract_txt:frequency in 1957) [ClassicSimilarity], result of:
            0.16079657 = score(doc=1957,freq=3.0), product of:
              0.24968924 = queryWeight, product of:
                2.0310595 = boost
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.02066526 = queryNorm
              0.64398676 = fieldWeight in 1957, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.0625 = fieldNorm(doc=1957)
          0.015602735 = weight(abstract_txt:information in 1957) [ClassicSimilarity], result of:
            0.015602735 = score(doc=1957,freq=1.0), product of:
              0.10320551 = queryWeight, product of:
                2.0646393 = boost
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.02066526 = queryNorm
              0.15118122 = fieldWeight in 1957, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.0625 = fieldNorm(doc=1957)
          0.11288812 = weight(abstract_txt:mathematical in 1957) [ClassicSimilarity], result of:
            0.11288812 = score(doc=1957,freq=1.0), product of:
              0.28446 = queryWeight, product of:
                2.1678705 = boost
                6.3496094 = idf(docFreq=210, maxDocs=44421)
                0.02066526 = queryNorm
              0.3968506 = fieldWeight in 1957, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3496094 = idf(docFreq=210, maxDocs=44421)
                0.0625 = fieldNorm(doc=1957)
        0.32 = coord(8/25)
    
  4. Dang, E.K.F.; Luk, R.W.P.; Allan, J.: Beyond bag-of-words : bigram-enhanced context-dependent term weights (2014) 0.17
    0.16986693 = sum of:
      0.16986693 = product of:
        0.5308342 = sum of:
          0.059782293 = weight(abstract_txt:occurrence in 2283) [ClassicSimilarity], result of:
            0.059782293 = score(doc=2283,freq=1.0), product of:
              0.16154322 = queryWeight, product of:
                1.1551865 = boost
                6.7669935 = idf(docFreq=138, maxDocs=44421)
                0.02066526 = queryNorm
              0.37006995 = fieldWeight in 2283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.7669935 = idf(docFreq=138, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.08889763 = weight(abstract_txt:probability in 2283) [ClassicSimilarity], result of:
            0.08889763 = score(doc=2283,freq=2.0), product of:
              0.16704129 = queryWeight, product of:
                1.1746801 = boost
                6.881186 = idf(docFreq=123, maxDocs=44421)
                0.02066526 = queryNorm
              0.53218955 = fieldWeight in 2283, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.881186 = idf(docFreq=123, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.036251586 = weight(abstract_txt:retrieval in 2283) [ClassicSimilarity], result of:
            0.036251586 = score(doc=2283,freq=5.0), product of:
              0.08527303 = queryWeight, product of:
                1.1869395 = boost
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.02066526 = queryNorm
              0.42512372 = fieldWeight in 2283, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.08787484 = weight(abstract_txt:inverse in 2283) [ClassicSimilarity], result of:
            0.08787484 = score(doc=2283,freq=1.0), product of:
              0.20884146 = queryWeight, product of:
                1.3134577 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.02066526 = queryNorm
              0.42077297 = fieldWeight in 2283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.09832917 = weight(abstract_txt:calculation in 2283) [ClassicSimilarity], result of:
            0.09832917 = score(doc=2283,freq=1.0), product of:
              0.22509298 = queryWeight, product of:
                1.3636054 = boost
                7.9878955 = idf(docFreq=40, maxDocs=44421)
                0.02066526 = queryNorm
              0.43683803 = fieldWeight in 2283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.9878955 = idf(docFreq=40, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.025512673 = weight(abstract_txt:terms in 2283) [ClassicSimilarity], result of:
            0.025512673 = score(doc=2283,freq=1.0), product of:
              0.11536861 = queryWeight, product of:
                1.3805958 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.02066526 = queryNorm
              0.2211405 = fieldWeight in 2283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.11487861 = weight(abstract_txt:frequency in 2283) [ClassicSimilarity], result of:
            0.11487861 = score(doc=2283,freq=2.0), product of:
              0.24968924 = queryWeight, product of:
                2.0310595 = boost
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.02066526 = queryNorm
              0.46008635 = fieldWeight in 2283, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.0193074 = weight(abstract_txt:information in 2283) [ClassicSimilarity], result of:
            0.0193074 = score(doc=2283,freq=2.0), product of:
              0.10320551 = queryWeight, product of:
                2.0646393 = boost
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.02066526 = queryNorm
              0.18707721 = fieldWeight in 2283, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
        0.32 = coord(8/25)
    
  5. Losee, R.M.: Term dependence : a basis for Luhn and Zipf models (2001) 0.15
    0.14619449 = sum of:
      0.14619449 = product of:
        0.7309724 = sum of:
          0.023160286 = weight(abstract_txt:retrieval in 976) [ClassicSimilarity], result of:
            0.023160286 = score(doc=976,freq=1.0), product of:
              0.08527303 = queryWeight, product of:
                1.1869395 = boost
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.02066526 = queryNorm
              0.27160156 = fieldWeight in 976, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.078125 = fieldNorm(doc=976)
          0.17753398 = weight(abstract_txt:inverse in 976) [ClassicSimilarity], result of:
            0.17753398 = score(doc=976,freq=2.0), product of:
              0.20884146 = queryWeight, product of:
                1.3134577 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.02066526 = queryNorm
              0.8500897 = fieldWeight in 976, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.078125 = fieldNorm(doc=976)
          0.10308676 = weight(abstract_txt:terms in 976) [ClassicSimilarity], result of:
            0.10308676 = score(doc=976,freq=8.0), product of:
              0.11536861 = queryWeight, product of:
                1.3805958 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.02066526 = queryNorm
              0.8935425 = fieldWeight in 976, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.078125 = fieldNorm(doc=976)
          0.04361097 = weight(abstract_txt:information in 976) [ClassicSimilarity], result of:
            0.04361097 = score(doc=976,freq=5.0), product of:
              0.10320551 = queryWeight, product of:
                2.0646393 = boost
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.02066526 = queryNorm
              0.4225644 = fieldWeight in 976, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                2.4188995 = idf(docFreq=10748, maxDocs=44421)
                0.078125 = fieldNorm(doc=976)
          0.38358042 = weight(abstract_txt:theoretic in 976) [ClassicSimilarity], result of:
            0.38358042 = score(doc=976,freq=2.0), product of:
              0.43975422 = queryWeight, product of:
                2.6954281 = boost
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.02066526 = queryNorm
              0.8722609 = fieldWeight in 976, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.078125 = fieldNorm(doc=976)
        0.2 = coord(5/25)
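
The score trees above can be reproduced by hand. The sketch below recomputes the score of the first similar document (doc=774) from the values printed in its tree, assuming the formulas that the numbers themselves are consistent with: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm, and a final multiplication by coord (matched terms / query terms). The function names are illustrative and are not part of any library API.

```python
import math

MAX_DOCS = 44421          # maxDocs as printed in the trees
QUERY_NORM = 0.02066526   # queryNorm as printed in the trees


def tf(freq: float) -> float:
    """Term frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm,
               query_norm=QUERY_NORM, max_docs=MAX_DOCS):
    """Score of one matching term: queryWeight * fieldWeight."""
    i = idf(doc_freq, max_docs)
    query_weight = boost * i * query_norm
    field_weight = tf(freq) * i * field_norm
    return query_weight * field_weight


def document_score(matches, field_norm, matched, query_terms):
    """Sum of the term scores, scaled by coord = matched / query_terms."""
    total = sum(term_score(freq, doc_freq, boost, field_norm)
                for _term, freq, doc_freq, boost in matches)
    return total * (matched / query_terms)


# (term, freq, docFreq, boost) copied from the tree for doc=774
matches_774 = [
    ("expressed",     1.0,   242, 1.0598305),
    ("retrieval",     5.0,  3732, 1.1869395),
    ("probabilities", 1.0,    37, 1.3765769),
    ("definition",    2.0,   488, 1.8809073),
    ("information",   6.0, 10748, 2.0646393),
    ("theoretic",     1.0,    44, 2.6954281),
]

score_774 = document_score(matches_774, field_norm=0.09375,
                           matched=6, query_terms=25)
print(round(score_774, 4))  # close to the 0.20494144 shown for entry 1
```

Small differences in the last digits are expected, because the trees were computed in single precision while Python uses double precision.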