Document (#29156)

Author
Aizawa, A.
Title
¬An information-theoretic perspective of tf-idf measures
Source
Information processing and management. 39(2003) no.1, pp. 45-65
Year
2003
Abstract
This paper presents a mathematical definition of the "probability-weighted amount of information" (PWI), a measure of specificity of terms in documents that is based on an information-theoretic view of retrieval events. The proposed PWI is expressed as a product of the occurrence probabilities of terms and their amounts of information, and corresponds well with the conventional term frequency - inverse document frequency measures that are commonly used in today's information retrieval systems. The mathematical definition of the PWI is shown, together with some illustrative examples of the calculation.
Theme
Retrieval algorithms
Object
TF/iDF

Similar documents (content)

  1. Bruza, P.D.; Huibers, T.W.C.: ¬A study of aboutness in information retrieval (1996) 0.20
    0.20467827 = sum of:
      0.20467827 = product of:
        0.8528261 = sum of:
          0.07892635 = weight(abstract_txt:expressed in 7705) [ClassicSimilarity], result of:
            0.07892635 = score(doc=7705,freq=1.0), product of:
              0.13570353 = queryWeight, product of:
                1.0582983 = boost
                6.203826 = idf(docFreq=242, maxDocs=44218)
                0.020669188 = queryNorm
              0.58160865 = fieldWeight in 7705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.203826 = idf(docFreq=242, maxDocs=44218)
                0.09375 = fieldNorm(doc=7705)
          0.062040444 = weight(abstract_txt:retrieval in 7705) [ClassicSimilarity], result of:
            0.062040444 = score(doc=7705,freq=5.0), product of:
              0.08516211 = queryWeight, product of:
                1.1856343 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.020669188 = queryNorm
              0.7284982 = fieldWeight in 7705, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.09375 = fieldNorm(doc=7705)
          0.17303558 = weight(abstract_txt:probabilities in 7705) [ClassicSimilarity], result of:
            0.17303558 = score(doc=7705,freq=1.0), product of:
              0.22901648 = queryWeight, product of:
                1.3748202 = boost
                8.059301 = idf(docFreq=37, maxDocs=44218)
                0.020669188 = queryNorm
              0.7555595 = fieldWeight in 7705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.059301 = idf(docFreq=37, maxDocs=44218)
                0.09375 = fieldNorm(doc=7705)
          0.15663554 = weight(abstract_txt:definition in 7705) [ClassicSimilarity], result of:
            0.15663554 = score(doc=7705,freq=2.0), product of:
              0.21430716 = queryWeight, product of:
                1.8808142 = boost
                5.512738 = idf(docFreq=484, maxDocs=44218)
                0.020669188 = queryNorm
              0.7308927 = fieldWeight in 7705, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.512738 = idf(docFreq=484, maxDocs=44218)
                0.09375 = fieldNorm(doc=7705)
          0.057443958 = weight(abstract_txt:information in 7705) [ClassicSimilarity], result of:
            0.057443958 = score(doc=7705,freq=6.0), product of:
              0.10332664 = queryWeight, product of:
                2.0649223 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.020669188 = queryNorm
              0.5559453 = fieldWeight in 7705, product of:
                2.4494898 = tf(freq=6.0), with freq of:
                  6.0 = termFreq=6.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.09375 = fieldNorm(doc=7705)
          0.3247442 = weight(abstract_txt:theoretic in 7705) [ClassicSimilarity], result of:
            0.3247442 = score(doc=7705,freq=1.0), product of:
              0.4390164 = queryWeight, product of:
                2.6919558 = boost
                7.890225 = idf(docFreq=44, maxDocs=44218)
                0.020669188 = queryNorm
              0.7397086 = fieldWeight in 7705, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.890225 = idf(docFreq=44, maxDocs=44218)
                0.09375 = fieldNorm(doc=7705)
        0.24 = coord(6/25)
    
  2. Wong, S.K.M.; Yao, Y.Y.: ¬An information-theoretic measure of term specificity (1992) 0.19
    0.18852559 = sum of:
      0.18852559 = product of:
        0.9426279 = sum of:
          0.10321559 = weight(abstract_txt:occurrence in 4807) [ClassicSimilarity], result of:
            0.10321559 = score(doc=4807,freq=1.0), product of:
              0.16228312 = queryWeight, product of:
                1.1573087 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.020669188 = queryNorm
              0.63602173 = fieldWeight in 4807, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.09375 = fieldNorm(doc=4807)
          0.15029576 = weight(abstract_txt:inverse in 4807) [ClassicSimilarity], result of:
            0.15029576 = score(doc=4807,freq=1.0), product of:
              0.20848475 = queryWeight, product of:
                1.3117459 = boost
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.020669188 = queryNorm
              0.7208957 = fieldWeight in 4807, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.09375 = fieldNorm(doc=4807)
          0.19669363 = weight(abstract_txt:frequency in 4807) [ClassicSimilarity], result of:
            0.19669363 = score(doc=4807,freq=2.0), product of:
              0.24944222 = queryWeight, product of:
                2.0291424 = boost
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.020669188 = queryNorm
              0.7885338 = fieldWeight in 4807, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.09375 = fieldNorm(doc=4807)
          0.033165287 = weight(abstract_txt:information in 4807) [ClassicSimilarity], result of:
            0.033165287 = score(doc=4807,freq=2.0), product of:
              0.10332664 = queryWeight, product of:
                2.0649223 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.020669188 = queryNorm
              0.32097518 = fieldWeight in 4807, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.09375 = fieldNorm(doc=4807)
          0.45925763 = weight(abstract_txt:theoretic in 4807) [ClassicSimilarity], result of:
            0.45925763 = score(doc=4807,freq=2.0), product of:
              0.4390164 = queryWeight, product of:
                2.6919558 = boost
                7.890225 = idf(docFreq=44, maxDocs=44218)
                0.020669188 = queryNorm
              1.0461059 = fieldWeight in 4807, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.890225 = idf(docFreq=44, maxDocs=44218)
                0.09375 = fieldNorm(doc=4807)
        0.2 = coord(5/25)
    
  3. Rölleke, T.; Tsikrika, T.; Kazai, G.: ¬A general matrix framework for modelling Information Retrieval (2006) 0.19
    0.18521631 = sum of:
      0.18521631 = product of:
        0.578801 = sum of:
          0.05261757 = weight(abstract_txt:expressed in 957) [ClassicSimilarity], result of:
            0.05261757 = score(doc=957,freq=1.0), product of:
              0.13570353 = queryWeight, product of:
                1.0582983 = boost
                6.203826 = idf(docFreq=242, maxDocs=44218)
                0.020669188 = queryNorm
              0.38773912 = fieldWeight in 957, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.203826 = idf(docFreq=242, maxDocs=44218)
                0.0625 = fieldNorm(doc=957)
          0.036993776 = weight(abstract_txt:retrieval in 957) [ClassicSimilarity], result of:
            0.036993776 = score(doc=957,freq=4.0), product of:
              0.08516211 = queryWeight, product of:
                1.1856343 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.020669188 = queryNorm
              0.43439242 = fieldWeight in 957, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.0625 = fieldNorm(doc=957)
          0.10019717 = weight(abstract_txt:inverse in 957) [ClassicSimilarity], result of:
            0.10019717 = score(doc=957,freq=1.0), product of:
              0.20848475 = queryWeight, product of:
                1.3117459 = boost
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.020669188 = queryNorm
              0.48059714 = fieldWeight in 957, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.0625 = fieldNorm(doc=957)
          0.029145561 = weight(abstract_txt:terms in 957) [ClassicSimilarity], result of:
            0.029145561 = score(doc=957,freq=1.0), product of:
              0.1153176 = queryWeight, product of:
                1.3796704 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.020669188 = queryNorm
              0.25274166 = fieldWeight in 957, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=957)
          0.07077429 = weight(abstract_txt:measures in 957) [ClassicSimilarity], result of:
            0.07077429 = score(doc=957,freq=1.0), product of:
              0.208336 = queryWeight, product of:
                1.8544269 = boost
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.020669188 = queryNorm
              0.33971223 = fieldWeight in 957, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4353957 = idf(docFreq=523, maxDocs=44218)
                0.0625 = fieldNorm(doc=957)
          0.16059966 = weight(abstract_txt:frequency in 957) [ClassicSimilarity], result of:
            0.16059966 = score(doc=957,freq=3.0), product of:
              0.24944222 = queryWeight, product of:
                2.0291424 = boost
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.020669188 = queryNorm
              0.6438351 = fieldWeight in 957, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.0625 = fieldNorm(doc=957)
          0.015634267 = weight(abstract_txt:information in 957) [ClassicSimilarity], result of:
            0.015634267 = score(doc=957,freq=1.0), product of:
              0.10332664 = queryWeight, product of:
                2.0649223 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.020669188 = queryNorm
              0.15130915 = fieldWeight in 957, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.0625 = fieldNorm(doc=957)
          0.11283867 = weight(abstract_txt:mathematical in 957) [ClassicSimilarity], result of:
            0.11283867 = score(doc=957,freq=1.0), product of:
              0.28432778 = queryWeight, product of:
                2.1663928 = boost
                6.3497796 = idf(docFreq=209, maxDocs=44218)
                0.020669188 = queryNorm
              0.39686123 = fieldWeight in 957, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3497796 = idf(docFreq=209, maxDocs=44218)
                0.0625 = fieldNorm(doc=957)
        0.32 = coord(8/25)
    
  4. Dang, E.K.F.; Luk, R.W.P.; Allan, J.: Beyond bag-of-words : bigram-enhanced context-dependent term weights (2014) 0.17
    0.1698418 = sum of:
      0.1698418 = product of:
        0.53075564 = sum of:
          0.0602091 = weight(abstract_txt:occurrence in 1283) [ClassicSimilarity], result of:
            0.0602091 = score(doc=1283,freq=1.0), product of:
              0.16228312 = queryWeight, product of:
                1.1573087 = boost
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.020669188 = queryNorm
              0.3710127 = fieldWeight in 1283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.784232 = idf(docFreq=135, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.08898781 = weight(abstract_txt:probability in 1283) [ClassicSimilarity], result of:
            0.08898781 = score(doc=1283,freq=2.0), product of:
              0.16712534 = queryWeight, product of:
                1.1744478 = boost
                6.8847027 = idf(docFreq=122, maxDocs=44218)
                0.020669188 = queryNorm
              0.5324615 = fieldWeight in 1283, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.8847027 = idf(docFreq=122, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.03619026 = weight(abstract_txt:retrieval in 1283) [ClassicSimilarity], result of:
            0.03619026 = score(doc=1283,freq=5.0), product of:
              0.08516211 = queryWeight, product of:
                1.1856343 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.020669188 = queryNorm
              0.4249573 = fieldWeight in 1283, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.08767253 = weight(abstract_txt:inverse in 1283) [ClassicSimilarity], result of:
            0.08767253 = score(doc=1283,freq=1.0), product of:
              0.20848475 = queryWeight, product of:
                1.3117459 = boost
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.020669188 = queryNorm
              0.4205225 = fieldWeight in 1283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.09810922 = weight(abstract_txt:calculation in 1283) [ClassicSimilarity], result of:
            0.09810922 = score(doc=1283,freq=1.0), product of:
              0.22471833 = queryWeight, product of:
                1.3618579 = boost
                7.983315 = idf(docFreq=40, maxDocs=44218)
                0.020669188 = queryNorm
              0.43658754 = fieldWeight in 1283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.983315 = idf(docFreq=40, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.025502367 = weight(abstract_txt:terms in 1283) [ClassicSimilarity], result of:
            0.025502367 = score(doc=1283,freq=1.0), product of:
              0.1153176 = queryWeight, product of:
                1.3796704 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.020669188 = queryNorm
              0.22114895 = fieldWeight in 1283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.11473794 = weight(abstract_txt:frequency in 1283) [ClassicSimilarity], result of:
            0.11473794 = score(doc=1283,freq=2.0), product of:
              0.24944222 = queryWeight, product of:
                2.0291424 = boost
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.020669188 = queryNorm
              0.45997804 = fieldWeight in 1283, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.947494 = idf(docFreq=313, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.019346418 = weight(abstract_txt:information in 1283) [ClassicSimilarity], result of:
            0.019346418 = score(doc=1283,freq=2.0), product of:
              0.10332664 = queryWeight, product of:
                2.0649223 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.020669188 = queryNorm
              0.18723552 = fieldWeight in 1283, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
        0.32 = coord(8/25)
    
  5. Losee, R.M.: Term dependence : a basis for Luhn and Zipf models (2001) 0.15
    0.14594106 = sum of:
      0.14594106 = product of:
        0.7297053 = sum of:
          0.02312111 = weight(abstract_txt:retrieval in 6976) [ClassicSimilarity], result of:
            0.02312111 = score(doc=6976,freq=1.0), product of:
              0.08516211 = queryWeight, product of:
                1.1856343 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.020669188 = queryNorm
              0.27149525 = fieldWeight in 6976, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.078125 = fieldNorm(doc=6976)
          0.17712526 = weight(abstract_txt:inverse in 6976) [ClassicSimilarity], result of:
            0.17712526 = score(doc=6976,freq=2.0), product of:
              0.20848475 = queryWeight, product of:
                1.3117459 = boost
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.020669188 = queryNorm
              0.84958375 = fieldWeight in 6976, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.689554 = idf(docFreq=54, maxDocs=44218)
                0.078125 = fieldNorm(doc=6976)
          0.10304512 = weight(abstract_txt:terms in 6976) [ClassicSimilarity], result of:
            0.10304512 = score(doc=6976,freq=8.0), product of:
              0.1153176 = queryWeight, product of:
                1.3796704 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.020669188 = queryNorm
              0.89357674 = fieldWeight in 6976, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=6976)
          0.0436991 = weight(abstract_txt:information in 6976) [ClassicSimilarity], result of:
            0.0436991 = score(doc=6976,freq=5.0), product of:
              0.10332664 = queryWeight, product of:
                2.0649223 = boost
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.020669188 = queryNorm
              0.42292193 = fieldWeight in 6976, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                2.4209464 = idf(docFreq=10677, maxDocs=44218)
                0.078125 = fieldNorm(doc=6976)
          0.3827147 = weight(abstract_txt:theoretic in 6976) [ClassicSimilarity], result of:
            0.3827147 = score(doc=6976,freq=2.0), product of:
              0.4390164 = queryWeight, product of:
                2.6919558 = boost
                7.890225 = idf(docFreq=44, maxDocs=44218)
                0.020669188 = queryNorm
              0.8717549 = fieldWeight in 6976, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.890225 = idf(docFreq=44, maxDocs=44218)
                0.078125 = fieldNorm(doc=6976)
        0.2 = coord(5/25)
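
The score trees above all follow the same arithmetic, so one of them can be recomputed by hand as a check. The sketch below redoes entry 1 (doc 7705) from the numbers printed in its tree. The formulas are assumptions inferred from the printed values rather than taken from the catalog software: tf = sqrt(termFreq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = boost * idf * queryNorm, fieldWeight = tf * idf * fieldNorm, and the document score is coord times the sum of the per-term products. The names idf, term_score, document_score and TERMS_7705 are illustrative only.

```python
import math

# Constants copied from the explain tree for doc 7705 (entry 1).
MAX_DOCS = 44218
QUERY_NORM = 0.020669188
FIELD_NORM = 0.09375   # fieldNorm(doc=7705)
COORD = 6 / 25         # 6 of the 25 query terms matched

# term -> (termFreq, docFreq, boost), as printed in the tree
TERMS_7705 = {
    "expressed":     (1, 242,   1.0582983),
    "retrieval":     (5, 3720,  1.1856343),
    "probabilities": (1, 37,    1.3748202),
    "definition":    (2, 484,   1.8808142),
    "information":   (6, 10677, 2.0649223),
    "theoretic":     (1, 44,    2.6919558),
}


def idf(doc_freq, max_docs=MAX_DOCS):
    # Assumed form; it reproduces e.g. idf(docFreq=242) = 6.203826.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost,
               field_norm=FIELD_NORM, query_norm=QUERY_NORM):
    # score = queryWeight * fieldWeight, matching the "product of" lines.
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def document_score(terms, coord=COORD):
    # Top level of the tree: coord * sum of the per-term scores.
    return coord * sum(term_score(*t) for t in terms.values())


if __name__ == "__main__":
    for name, args in TERMS_7705.items():
        print(f"{name:14s} {term_score(*args):.6f}")
    print(f"total          {document_score(TERMS_7705):.6f}")
```

Under these assumptions the total agrees with the 0.20467827 shown for entry 1 to within rounding. The other entries can be checked the same way by swapping in their own term table, fieldNorm and coord.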