Document (#26610)

Author
Bookstein, A.
Kulyukin, V.
Raita, T.
Nicholson, J.
Title
Adapting measures of clumping strength to assess term-term similarity
Source
Journal of the American Society for Information Science and Technology. 54(2003) no.7, S.611-620
Year
2003
Abstract
Automated information retrieval relies heavily on statistical regularities that emerge as terms are deposited to produce text. This paper examines statistical patterns expected of a pair of terms that are semantically related to each other. Guided by a conceptualization of the text generation process, we derive measures of how tightly two terms are semantically associated. Our main objective is to probe whether such measures yield reasonable results. Specifically, we examine how the tendency of a content-bearing term to clump, as quantified by previously developed measures of term clumping, is influenced by the presence of other terms. This approach allows us to present a toolkit from which a range of measures can be constructed. As an illustration, one of several suggested measures is evaluated on a large text corpus built from an on-line encyclopedia.
Theme
Computational linguistics

Similar documents (author)

  1. Bookstein, A.: Probability and Fuzzy-set applications to information retrieval (1985) 1.89
    1.8928305 = sum of:
      1.8928305 = product of:
        3.785661 = sum of:
          3.785661 = weight(author_txt:bookstein in 780) [ClassicSimilarity], result of:
            3.785661 = score(doc=780,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                8.565973 = idf(docFreq=22, maxDocs=44421)
                0.08254833 = queryNorm
              5.353733 = fieldWeight in 780, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.565973 = idf(docFreq=22, maxDocs=44421)
                0.625 = fieldNorm(doc=780)
        0.5 = coord(1/2)
    
  2. Bookstein, A.: Relevance (1979) 1.89
    1.8928305 = sum of:
      1.8928305 = product of:
        3.785661 = sum of:
          3.785661 = weight(author_txt:bookstein in 838) [ClassicSimilarity], result of:
            3.785661 = score(doc=838,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                8.565973 = idf(docFreq=22, maxDocs=44421)
                0.08254833 = queryNorm
              5.353733 = fieldWeight in 838, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.565973 = idf(docFreq=22, maxDocs=44421)
                0.625 = fieldNorm(doc=838)
        0.5 = coord(1/2)
    
  3. Nicholson, D.: Subject-based interoperability : issues from the High Level Thesaurus (HILT) Project (2002) 1.89
    1.8928305 = sum of:
      1.8928305 = product of:
        3.785661 = sum of:
          3.785661 = weight(author_txt:nicholson in 2916) [ClassicSimilarity], result of:
            3.785661 = score(doc=2916,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                8.565973 = idf(docFreq=22, maxDocs=44421)
                0.08254833 = queryNorm
              5.353733 = fieldWeight in 2916, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.565973 = idf(docFreq=22, maxDocs=44421)
                0.625 = fieldNorm(doc=2916)
        0.5 = coord(1/2)
    
  4. Bookstein, A.: Fuzzy requests : an approach to weighted Boolean searches (1979) 1.89
    1.8928305 = sum of:
      1.8928305 = product of:
        3.785661 = sum of:
          3.785661 = weight(author_txt:bookstein in 5503) [ClassicSimilarity], result of:
            3.785661 = score(doc=5503,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                8.565973 = idf(docFreq=22, maxDocs=44421)
                0.08254833 = queryNorm
              5.353733 = fieldWeight in 5503, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.565973 = idf(docFreq=22, maxDocs=44421)
                0.625 = fieldNorm(doc=5503)
        0.5 = coord(1/2)
    
  5. Bookstein, A.: Informetric distributions : I. Unified overview (1990) 1.89
    1.8928305 = sum of:
      1.8928305 = product of:
        3.785661 = sum of:
          3.785661 = weight(author_txt:bookstein in 6901) [ClassicSimilarity], result of:
            3.785661 = score(doc=6901,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                8.565973 = idf(docFreq=22, maxDocs=44421)
                0.08254833 = queryNorm
              5.353733 = fieldWeight in 6901, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.565973 = idf(docFreq=22, maxDocs=44421)
                0.625 = fieldNorm(doc=6901)
        0.5 = coord(1/2)
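
The five author matches above all reduce to the same arithmetic. As a rough check, the sketch below recomputes the first entry (doc 780) in Python. It assumes the ClassicSimilarity formulas that the listed figures appear to follow: tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), queryWeight = idf * queryNorm, and fieldWeight = tf * idf * fieldNorm. The function names are hypothetical, and queryNorm and fieldNorm are copied from the listing rather than derived.

```python
import math

def classic_idf(doc_freq, max_docs):
    # Assumed idf form; it reproduces 8.565973 for docFreq=22, maxDocs=44421.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def classic_tf(freq):
    # Assumed tf form: square root of the raw term frequency.
    return math.sqrt(freq)

def term_score(freq, doc_freq, max_docs, query_norm, field_norm, boost=1.0):
    # score = queryWeight * fieldWeight, as in the explain tree.
    idf = classic_idf(doc_freq, max_docs)
    query_weight = boost * idf * query_norm
    field_weight = classic_tf(freq) * idf * field_norm
    return query_weight * field_weight

# Values copied from the explanation for doc 780.
raw_score = term_score(freq=1.0, doc_freq=22, max_docs=44421,
                       query_norm=0.08254833, field_norm=0.625)
final_score = raw_score * (1 / 2)  # coord(1/2): one of two query clauses matched

print(raw_score, final_score)
```

Because every author match has the same freq, docFreq, and fieldNorm, all five entries end up with the identical score of about 1.89.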
    

Similar documents (content)

  1. Kim, W.; Wilbur, W.J.: Corpus-based statistical screening for content-bearing terms (2001) 0.20
    0.20048149 = sum of:
      0.20048149 = product of:
        0.62650466 = sum of:
          0.03999209 = weight(abstract_txt:strength in 188) [ClassicSimilarity], result of:
            0.03999209 = score(doc=188,freq=1.0), product of:
              0.1207296 = queryWeight, product of:
                7.0667386 = idf(docFreq=102, maxDocs=44421)
                0.017084204 = queryNorm
              0.33125338 = fieldWeight in 188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.0667386 = idf(docFreq=102, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
          0.04355488 = weight(abstract_txt:yield in 188) [ClassicSimilarity], result of:
            0.04355488 = score(doc=188,freq=1.0), product of:
              0.12779747 = queryWeight, product of:
                1.0288552 = boost
                7.270651 = idf(docFreq=83, maxDocs=44421)
                0.017084204 = queryNorm
              0.34081176 = fieldWeight in 188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.270651 = idf(docFreq=83, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
          0.050555892 = weight(abstract_txt:pair in 188) [ClassicSimilarity], result of:
            0.050555892 = score(doc=188,freq=1.0), product of:
              0.14114936 = queryWeight, product of:
                1.081266 = boost
                7.6410246 = idf(docFreq=57, maxDocs=44421)
                0.017084204 = queryNorm
              0.358173 = fieldWeight in 188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6410246 = idf(docFreq=57, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
          0.10292245 = weight(abstract_txt:bearing in 188) [ClassicSimilarity], result of:
            0.10292245 = score(doc=188,freq=3.0), product of:
              0.15720417 = queryWeight, product of:
                1.1411037 = boost
                8.063882 = idf(docFreq=37, maxDocs=44421)
                0.017084204 = queryNorm
              0.6547056 = fieldWeight in 188, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.063882 = idf(docFreq=37, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
          0.022431938 = weight(abstract_txt:text in 188) [ClassicSimilarity], result of:
            0.022431938 = score(doc=188,freq=1.0), product of:
              0.11842662 = queryWeight, product of:
                1.7154514 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.017084204 = queryNorm
              0.18941635 = fieldWeight in 188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
          0.059944272 = weight(abstract_txt:terms in 188) [ClassicSimilarity], result of:
            0.059944272 = score(doc=188,freq=4.0), product of:
              0.15812342 = queryWeight, product of:
                2.2888703 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.017084204 = queryNorm
              0.379098 = fieldWeight in 188, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
          0.08654171 = weight(abstract_txt:term in 188) [ClassicSimilarity], result of:
            0.08654171 = score(doc=188,freq=3.0), product of:
              0.222311 = queryWeight, product of:
                2.713961 = boost
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.017084204 = queryNorm
              0.38928217 = fieldWeight in 188, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
          0.22056141 = weight(abstract_txt:clumping in 188) [ClassicSimilarity], result of:
            0.22056141 = score(doc=188,freq=1.0), product of:
              0.47481826 = queryWeight, product of:
                2.8046057 = boost
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.017084204 = queryNorm
              0.46451756 = fieldWeight in 188, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.046875 = fieldNorm(doc=188)
        0.32 = coord(8/25)
    
  2. Bookstein, A.; Raita, T.: Discovering term occurrence structure in text (2001) 0.19
    0.1926138 = sum of:
      0.1926138 = product of:
        1.2038362 = sum of:
          0.08585457 = weight(abstract_txt:tendency in 6751) [ClassicSimilarity], result of:
            0.08585457 = score(doc=6751,freq=1.0), product of:
              0.12656686 = queryWeight, product of:
                1.0238895 = boost
                7.2355595 = idf(docFreq=86, maxDocs=44421)
                0.017084204 = queryNorm
              0.6783337 = fieldWeight in 6751, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.2355595 = idf(docFreq=86, maxDocs=44421)
                0.09375 = fieldNorm(doc=6751)
          0.059944272 = weight(abstract_txt:terms in 6751) [ClassicSimilarity], result of:
            0.059944272 = score(doc=6751,freq=1.0), product of:
              0.15812342 = queryWeight, product of:
                2.2888703 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.017084204 = queryNorm
              0.379098 = fieldWeight in 6751, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.09375 = fieldNorm(doc=6751)
          0.62384194 = weight(abstract_txt:clumping in 6751) [ClassicSimilarity], result of:
            0.62384194 = score(doc=6751,freq=2.0), product of:
              0.47481826 = queryWeight, product of:
                2.8046057 = boost
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.017084204 = queryNorm
              1.3138541 = fieldWeight in 6751, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                9.909708 = idf(docFreq=5, maxDocs=44421)
                0.09375 = fieldNorm(doc=6751)
          0.43419546 = weight(abstract_txt:measures in 6751) [ClassicSimilarity], result of:
            0.43419546 = score(doc=6751,freq=4.0), product of:
              0.42687264 = queryWeight, product of:
                4.605936 = boost
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.017084204 = queryNorm
              1.0171546 = fieldWeight in 6751, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.09375 = fieldNorm(doc=6751)
        0.16 = coord(4/25)
    
  3. Ruge, G.: Experiments on linguistically-based term associations (1992) 0.17
    0.16518556 = sum of:
      0.16518556 = product of:
        0.6882732 = sum of:
          0.0773499 = weight(abstract_txt:statistical in 1809) [ClassicSimilarity], result of:
            0.0773499 = score(doc=1809,freq=1.0), product of:
              0.14875135 = queryWeight, product of:
                1.5697792 = boost
                5.5466094 = idf(docFreq=470, maxDocs=44421)
                0.017084204 = queryNorm
              0.5199946 = fieldWeight in 1809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5466094 = idf(docFreq=470, maxDocs=44421)
                0.09375 = fieldNorm(doc=1809)
          0.044863876 = weight(abstract_txt:text in 1809) [ClassicSimilarity], result of:
            0.044863876 = score(doc=1809,freq=1.0), product of:
              0.11842662 = queryWeight, product of:
                1.7154514 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.017084204 = queryNorm
              0.3788327 = fieldWeight in 1809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.09375 = fieldNorm(doc=1809)
          0.14769538 = weight(abstract_txt:semantically in 1809) [ClassicSimilarity], result of:
            0.14769538 = score(doc=1809,freq=1.0), product of:
              0.22894561 = queryWeight, product of:
                1.9474857 = boost
                6.881186 = idf(docFreq=123, maxDocs=44421)
                0.017084204 = queryNorm
              0.6451112 = fieldWeight in 1809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.881186 = idf(docFreq=123, maxDocs=44421)
                0.09375 = fieldNorm(doc=1809)
          0.059944272 = weight(abstract_txt:terms in 1809) [ClassicSimilarity], result of:
            0.059944272 = score(doc=1809,freq=1.0), product of:
              0.15812342 = queryWeight, product of:
                2.2888703 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.017084204 = queryNorm
              0.379098 = fieldWeight in 1809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.09375 = fieldNorm(doc=1809)
          0.14132202 = weight(abstract_txt:term in 1809) [ClassicSimilarity], result of:
            0.14132202 = score(doc=1809,freq=2.0), product of:
              0.222311 = queryWeight, product of:
                2.713961 = boost
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.017084204 = queryNorm
              0.6356951 = fieldWeight in 1809, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.09375 = fieldNorm(doc=1809)
          0.21709773 = weight(abstract_txt:measures in 1809) [ClassicSimilarity], result of:
            0.21709773 = score(doc=1809,freq=1.0), product of:
              0.42687264 = queryWeight, product of:
                4.605936 = boost
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.017084204 = queryNorm
              0.5085773 = fieldWeight in 1809, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.09375 = fieldNorm(doc=1809)
        0.24 = coord(6/25)
    
  4. Seo, H.-C.; Kim, S.-B.; Rim, H.-C.; Myaeng, S.-H.: Improving query translation in English-Korean Cross-language information retrieval (2005) 0.12
    0.117113724 = sum of:
      0.117113724 = product of:
        0.5855686 = sum of:
          0.08425982 = weight(abstract_txt:pair in 2023) [ClassicSimilarity], result of:
            0.08425982 = score(doc=2023,freq=1.0), product of:
              0.14114936 = queryWeight, product of:
                1.081266 = boost
                7.6410246 = idf(docFreq=57, maxDocs=44421)
                0.017084204 = queryNorm
              0.59695506 = fieldWeight in 2023, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.6410246 = idf(docFreq=57, maxDocs=44421)
                0.078125 = fieldNorm(doc=2023)
          0.06445825 = weight(abstract_txt:statistical in 2023) [ClassicSimilarity], result of:
            0.06445825 = score(doc=2023,freq=1.0), product of:
              0.14875135 = queryWeight, product of:
                1.5697792 = boost
                5.5466094 = idf(docFreq=470, maxDocs=44421)
                0.017084204 = queryNorm
              0.43332887 = fieldWeight in 2023, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5466094 = idf(docFreq=470, maxDocs=44421)
                0.078125 = fieldNorm(doc=2023)
          0.11169956 = weight(abstract_txt:terms in 2023) [ClassicSimilarity], result of:
            0.11169956 = score(doc=2023,freq=5.0), product of:
              0.15812342 = queryWeight, product of:
                2.2888703 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.017084204 = queryNorm
              0.7064074 = fieldWeight in 2023, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.078125 = fieldNorm(doc=2023)
          0.14423619 = weight(abstract_txt:term in 2023) [ClassicSimilarity], result of:
            0.14423619 = score(doc=2023,freq=3.0), product of:
              0.222311 = queryWeight, product of:
                2.713961 = boost
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.017084204 = queryNorm
              0.64880365 = fieldWeight in 2023, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.078125 = fieldNorm(doc=2023)
          0.18091476 = weight(abstract_txt:measures in 2023) [ClassicSimilarity], result of:
            0.18091476 = score(doc=2023,freq=1.0), product of:
              0.42687264 = queryWeight, product of:
                4.605936 = boost
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.017084204 = queryNorm
              0.4238144 = fieldWeight in 2023, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.078125 = fieldNorm(doc=2023)
        0.2 = coord(5/25)
    
  5. Efron, M.: Linear time series models for term weighting in information retrieval (2010) 0.10
    0.10314034 = sum of:
      0.10314034 = product of:
        0.64462715 = sum of:
          0.10265984 = weight(abstract_txt:yield in 675) [ClassicSimilarity], result of:
            0.10265984 = score(doc=675,freq=2.0), product of:
              0.12779747 = queryWeight, product of:
                1.0288552 = boost
                7.270651 = idf(docFreq=83, maxDocs=44421)
                0.017084204 = queryNorm
              0.80330104 = fieldWeight in 675, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.270651 = idf(docFreq=83, maxDocs=44421)
                0.078125 = fieldNorm(doc=675)
          0.09990712 = weight(abstract_txt:terms in 675) [ClassicSimilarity], result of:
            0.09990712 = score(doc=675,freq=4.0), product of:
              0.15812342 = queryWeight, product of:
                2.2888703 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.017084204 = queryNorm
              0.63183004 = fieldWeight in 675, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.078125 = fieldNorm(doc=675)
          0.18620811 = weight(abstract_txt:term in 675) [ClassicSimilarity], result of:
            0.18620811 = score(doc=675,freq=5.0), product of:
              0.222311 = queryWeight, product of:
                2.713961 = boost
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.017084204 = queryNorm
              0.8376019 = fieldWeight in 675, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.078125 = fieldNorm(doc=675)
          0.2558521 = weight(abstract_txt:measures in 675) [ClassicSimilarity], result of:
            0.2558521 = score(doc=675,freq=2.0), product of:
              0.42687264 = queryWeight, product of:
                4.605936 = boost
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.017084204 = queryNorm
              0.59936404 = fieldWeight in 675, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.078125 = fieldNorm(doc=675)
        0.16 = coord(4/25)
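
The content-based scores above sum one boosted weight per matching abstract term and then scale by the coord factor. The Python sketch below re-derives the second entry (doc 6751) as an illustration. It assumes the ClassicSimilarity formulas implied by the listed figures (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))). The helper names and the dictionary layout are hypothetical, and the boost, queryNorm, and fieldNorm values are copied from the listing rather than derived.

```python
import math

def classic_idf(doc_freq, max_docs):
    # Assumed idf form; it matches the idf values shown in the listing.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def explain_score(terms, max_docs, query_norm, field_norm, total_query_terms):
    # Sum queryWeight * fieldWeight over matching terms, then apply coord.
    total = 0.0
    for t in terms:
        idf = classic_idf(t["doc_freq"], max_docs)
        query_weight = t["boost"] * idf * query_norm
        field_weight = math.sqrt(t["freq"]) * idf * field_norm
        total += query_weight * field_weight
    coord = len(terms) / total_query_terms  # e.g. 4 of 25 query terms matched
    return total * coord

# Matching terms for doc 6751, copied from the explanation.
terms_6751 = [
    {"term": "tendency", "freq": 1.0, "doc_freq": 86,   "boost": 1.0238895},
    {"term": "terms",    "freq": 1.0, "doc_freq": 2116, "boost": 2.2888703},
    {"term": "clumping", "freq": 2.0, "doc_freq": 5,    "boost": 2.8046057},
    {"term": "measures", "freq": 4.0, "doc_freq": 531,  "boost": 4.605936},
]

score_6751 = explain_score(terms_6751, max_docs=44421,
                           query_norm=0.017084204, field_norm=0.09375,
                           total_query_terms=25)
print(score_6751)
```

The rare term "clumping" (docFreq=5) contributes the largest share of this score, which is why doc 6751 ranks close behind the first entry even though only 4 of the 25 query terms match.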