Document (#23770)

Author
Chung, Y.M.
Lee, J.Y.
Title
¬A corpus-based approach to comparative evaluation of statistical term association measures
Source
Journal of the American Society for Information Science and Technology. 52(2001) no.4, p.283-296
Year
2001
Abstract
Statistical association measures have been widely applied in information retrieval research, usually to cluster documents or terms on the basis of their relationships. Applications of the association measures for term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationships and behavior they demonstrate in various analyses of a test corpus. Analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson's correlation coefficient and MDS mapping, and an analysis of the impact of term frequency on the association values by means of z-scores. The major findings of the study are as follows: First, the most similar association measures are mutual information and Yule's coefficient of colligation Y, whereas the cosine and Jaccard coefficients, as well as the χ² statistic and likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the χ² statistic is the least affected by the frequency of terms. Third, although the cosine and Jaccard coefficients tend to emphasize high-frequency terms, mutual information and Yule's Y seem to overestimate rare terms.
Theme
Automatic indexing
Automatic classification
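
The six measures named in the abstract can all be computed from a 2x2 co-occurrence table for a pair of terms. The sketch below uses common textbook definitions. The abstract does not state the exact formulas used in the paper, so these definitions, the function name `association_measures`, and the cell labels `a`, `b`, `c`, `d` are assumptions made for illustration only.

```python
import math


def association_measures(a, b, c, d):
    """Association measures for one term pair (textbook forms, assumed).

    a: documents containing both terms
    b: documents containing only the first term
    c: documents containing only the second term
    d: documents containing neither term
    Assumes a > 0 and non-zero marginal totals; no smoothing is applied.
    """
    n = a + b + c + d
    fx, fy = a + b, a + c            # document frequencies of the two terms
    not_fx, not_fy = c + d, b + d    # complements of those frequencies

    cosine = a / math.sqrt(fx * fy)
    jaccard = a / (a + b + c)
    # Pointwise mutual information: log of observed over expected co-occurrence.
    mutual_information = math.log2(a * n / (fx * fy))
    # Yule's coefficient of colligation Y.
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    yule_y = (root_ad - root_bc) / (root_ad + root_bc)
    # Pearson chi-square statistic for a 2x2 table.
    chi_square = n * (a * d - b * c) ** 2 / (fx * not_fx * fy * not_fy)
    # Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)); empty cells contribute 0.
    observed = [a, b, c, d]
    expected = [fx * fy / n, fx * not_fy / n, not_fx * fy / n, not_fx * not_fy / n]
    likelihood_ratio = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )

    return {
        "cosine": cosine,
        "jaccard": jaccard,
        "mutual_information": mutual_information,
        "yule_y": yule_y,
        "chi_square": chi_square,
        "likelihood_ratio": likelihood_ratio,
    }


if __name__ == "__main__":
    # Hypothetical counts: 100 documents, each term in 20, both terms in 10.
    for name, value in association_measures(10, 10, 10, 70).items():
        print(f"{name}: {value:.4f}")
```

With the hypothetical counts above, the cosine coefficient is 0.5 and the Jaccard coefficient is 1/3. When the two terms are statistically independent (for example a=4, b=16, c=16, d=64), mutual information, Yule's Y, the χ² statistic, and the likelihood ratio are all zero, while cosine and Jaccard remain positive.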

Similar documents (author)

  1. Chung, T.M.: ¬A corpus comparison approach for terminology extraction (2003) 5.11
    5.1094418 = sum of:
      5.1094418 = weight(author_txt:chung in 5072) [ClassicSimilarity], result of:
        5.1094418 = fieldWeight in 5072, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.175107 = idf(docFreq=33, maxDocs=44421)
          0.625 = fieldNorm(doc=5072)
    
  2. Chung, H.H.: User friendly audiovisual material cataloging at Westchester County Public Library System (2001) 5.11
    5.1094418 = sum of:
      5.1094418 = weight(author_txt:chung in 415) [ClassicSimilarity], result of:
        5.1094418 = fieldWeight in 415, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.175107 = idf(docFreq=33, maxDocs=44421)
          0.625 = fieldNorm(doc=415)
    
  3. Chung, Y.-K.: Characteristics of references in international classification systems literature (1995) 4.09
    4.0875535 = sum of:
      4.0875535 = weight(author_txt:chung in 3007) [ClassicSimilarity], result of:
        4.0875535 = fieldWeight in 3007, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.175107 = idf(docFreq=33, maxDocs=44421)
          0.5 = fieldNorm(doc=3007)
    
  4. Chung, Y.-K.: Bradford distribution and core authors in classification systems literature (1994) 4.09
    4.0875535 = sum of:
      4.0875535 = weight(author_txt:chung in 5134) [ClassicSimilarity], result of:
        4.0875535 = fieldWeight in 5134, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.175107 = idf(docFreq=33, maxDocs=44421)
          0.5 = fieldNorm(doc=5134)
    
  5. Chung, Y.-K.: Core international journals of classification systems : an application of Bradford's law (1994) 4.09
    4.0875535 = sum of:
      4.0875535 = weight(author_txt:chung in 5138) [ClassicSimilarity], result of:
        4.0875535 = fieldWeight in 5138, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          8.175107 = idf(docFreq=33, maxDocs=44421)
          0.5 = fieldNorm(doc=5138)
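
The author scores above follow a simple pattern: each is the product of a term-frequency factor, an inverse document frequency, and a field norm. The sketch below reproduces that arithmetic. The formulas `tf = sqrt(freq)` and `idf = 1 + ln(maxDocs / (docFreq + 1))` are inferred from the numbers in the listing and agree with the classic Lucene TF-IDF definitions; the function names are hypothetical.

```python
import math


def classic_tf(freq):
    # Term-frequency factor: square root of the raw frequency.
    return math.sqrt(freq)


def classic_idf(doc_freq, max_docs):
    # Inverse document frequency, as inferred from the listing.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm, as in the explain trees above.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


if __name__ == "__main__":
    # author_txt:chung with docFreq=33, maxDocs=44421 (values from the listing).
    print(field_weight(1.0, 33, 44421, 0.625))  # about 5.1094 (entries 1 and 2)
    print(field_weight(1.0, 33, 44421, 0.5))    # about 4.0876 (entries 3 to 5)
```

The only factor that differs between the 5.11 and 4.09 groups is the field norm (0.625 versus 0.5), which is stored per document for the author field.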
    

Similar documents (content)

  1. Eck, N.J. van; Waltman, L.: How to normalize cooccurrence data? : an analysis of some well-known similarity measures (2009) 0.31
    0.30797508 = sum of:
      0.30797508 = product of:
        1.2832296 = sum of:
          0.026550738 = weight(abstract_txt:among in 3942) [ClassicSimilarity], result of:
            0.026550738 = score(doc=3942,freq=1.0), product of:
              0.07521529 = queryWeight, product of:
                1.20056 = boost
                4.518356 = idf(docFreq=1316, maxDocs=44421)
                0.013865701 = queryNorm
              0.35299656 = fieldWeight in 3942, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.518356 = idf(docFreq=1316, maxDocs=44421)
                0.078125 = fieldNorm(doc=3942)
          0.040510383 = weight(abstract_txt:behavior in 3942) [ClassicSimilarity], result of:
            0.040510383 = score(doc=3942,freq=1.0), product of:
              0.09968565 = queryWeight, product of:
                1.3821243 = boost
                5.2016807 = idf(docFreq=664, maxDocs=44421)
                0.013865701 = queryNorm
              0.4063813 = fieldWeight in 3942, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2016807 = idf(docFreq=664, maxDocs=44421)
                0.078125 = fieldNorm(doc=3942)
          0.19092083 = weight(abstract_txt:cosine in 3942) [ClassicSimilarity], result of:
            0.19092083 = score(doc=3942,freq=2.0), product of:
              0.22240639 = queryWeight, product of:
                2.0644503 = boost
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.013865701 = queryNorm
              0.8584323 = fieldWeight in 3942, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.078125 = fieldNorm(doc=3942)
          0.28976175 = weight(abstract_txt:jaccard in 3942) [ClassicSimilarity], result of:
            0.28976175 = score(doc=3942,freq=2.0), product of:
              0.2937238 = queryWeight, product of:
                2.372468 = boost
                8.928879 = idf(docFreq=15, maxDocs=44421)
                0.013865701 = queryNorm
              0.98651105 = fieldWeight in 3942, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.928879 = idf(docFreq=15, maxDocs=44421)
                0.078125 = fieldNorm(doc=3942)
          0.4255096 = weight(abstract_txt:measures in 3942) [ClassicSimilarity], result of:
            0.4255096 = score(doc=3942,freq=7.0), product of:
              0.37947628 = queryWeight, product of:
                5.0449533 = boost
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.013865701 = queryNorm
              1.1213075 = fieldWeight in 3942, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.078125 = fieldNorm(doc=3942)
          0.30997634 = weight(abstract_txt:association in 3942) [ClassicSimilarity], result of:
            0.30997634 = score(doc=3942,freq=3.0), product of:
              0.4074957 = queryWeight, product of:
                5.2278886 = boost
                5.6215343 = idf(docFreq=436, maxDocs=44421)
                0.013865701 = queryNorm
              0.76068616 = fieldWeight in 3942, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.6215343 = idf(docFreq=436, maxDocs=44421)
                0.078125 = fieldNorm(doc=3942)
        0.24 = coord(6/25)
    
  2. Alzahrani, S.; Palade, V.; Salim, N.; Abraham, A.: Using structural information and citation evidence to detect significant plagiarism cases in scientific publications (2012) 0.28
    0.2805804 = sum of:
      0.2805804 = product of:
        0.77939 = sum of:
          0.028431276 = weight(abstract_txt:similar in 982) [ClassicSimilarity], result of:
            0.028431276 = score(doc=982,freq=1.0), product of:
              0.099859014 = queryWeight, product of:
                1.3833257 = boost
                5.206202 = idf(docFreq=661, maxDocs=44421)
                0.013865701 = queryNorm
              0.28471416 = fieldWeight in 982, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.206202 = idf(docFreq=661, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
          0.031199515 = weight(abstract_txt:demonstrate in 982) [ClassicSimilarity], result of:
            0.031199515 = score(doc=982,freq=1.0), product of:
              0.10624005 = queryWeight, product of:
                1.4268389 = boost
                5.3699656 = idf(docFreq=561, maxDocs=44421)
                0.013865701 = queryNorm
              0.29367 = fieldWeight in 982, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3699656 = idf(docFreq=561, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
          0.048621804 = weight(abstract_txt:statistical in 982) [ClassicSimilarity], result of:
            0.048621804 = score(doc=982,freq=2.0), product of:
              0.11334449 = queryWeight, product of:
                1.4737744 = boost
                5.5466094 = idf(docFreq=470, maxDocs=44421)
                0.013865701 = queryNorm
              0.42897367 = fieldWeight in 982, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5466094 = idf(docFreq=470, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
          0.09450098 = weight(abstract_txt:cosine in 982) [ClassicSimilarity], result of:
            0.09450098 = score(doc=982,freq=1.0), product of:
              0.22240639 = queryWeight, product of:
                2.0644503 = boost
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.013865701 = queryNorm
              0.4249023 = fieldWeight in 982, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
          0.13364458 = weight(abstract_txt:coefficient in 982) [ClassicSimilarity], result of:
            0.13364458 = score(doc=982,freq=2.0), product of:
              0.22240639 = queryWeight, product of:
                2.0644503 = boost
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.013865701 = queryNorm
              0.6009026 = fieldWeight in 982, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
          0.14342476 = weight(abstract_txt:jaccard in 982) [ClassicSimilarity], result of:
            0.14342476 = score(doc=982,freq=1.0), product of:
              0.2937238 = queryWeight, product of:
                2.372468 = boost
                8.928879 = idf(docFreq=15, maxDocs=44421)
                0.013865701 = queryNorm
              0.48829806 = fieldWeight in 982, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.928879 = idf(docFreq=15, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
          0.08483444 = weight(abstract_txt:frequency in 982) [ClassicSimilarity], result of:
            0.08483444 = score(doc=982,freq=1.0), product of:
              0.26076412 = queryWeight, product of:
                3.1613293 = boost
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.013865701 = queryNorm
              0.3253302 = fieldWeight in 982, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
          0.0555215 = weight(abstract_txt:term in 982) [ClassicSimilarity], result of:
            0.0555215 = score(doc=982,freq=1.0), product of:
              0.2117437 = queryWeight, product of:
                3.1849751 = boost
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.013865701 = queryNorm
              0.26221088 = fieldWeight in 982, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
          0.1592111 = weight(abstract_txt:measures in 982) [ClassicSimilarity], result of:
            0.1592111 = score(doc=982,freq=2.0), product of:
              0.37947628 = queryWeight, product of:
                5.0449533 = boost
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.013865701 = queryNorm
              0.41955483 = fieldWeight in 982, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.0546875 = fieldNorm(doc=982)
        0.36 = coord(9/25)
    
  3. Schneider, J.W.; Borlund, P.: Matrix comparison, part 1 : motivation and important issues for measuring the resemblance between proximity measures or ordination results (2007) 0.24
    0.23576602 = sum of:
      0.23576602 = product of:
        0.73676884 = sum of:
          0.040103234 = weight(abstract_txt:behavior in 1584) [ClassicSimilarity], result of:
            0.040103234 = score(doc=1584,freq=2.0), product of:
              0.09968565 = queryWeight, product of:
                1.3821243 = boost
                5.2016807 = idf(docFreq=664, maxDocs=44421)
                0.013865701 = queryNorm
              0.40229696 = fieldWeight in 1584, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.2016807 = idf(docFreq=664, maxDocs=44421)
                0.0546875 = fieldNorm(doc=1584)
          0.049244415 = weight(abstract_txt:similar in 1584) [ClassicSimilarity], result of:
            0.049244415 = score(doc=1584,freq=3.0), product of:
              0.099859014 = queryWeight, product of:
                1.3833257 = boost
                5.206202 = idf(docFreq=661, maxDocs=44421)
                0.013865701 = queryNorm
              0.49313942 = fieldWeight in 1584, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.206202 = idf(docFreq=661, maxDocs=44421)
                0.0546875 = fieldNorm(doc=1584)
          0.031199515 = weight(abstract_txt:demonstrate in 1584) [ClassicSimilarity], result of:
            0.031199515 = score(doc=1584,freq=1.0), product of:
              0.10624005 = queryWeight, product of:
                1.4268389 = boost
                5.3699656 = idf(docFreq=561, maxDocs=44421)
                0.013865701 = queryNorm
              0.29367 = fieldWeight in 1584, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3699656 = idf(docFreq=561, maxDocs=44421)
                0.0546875 = fieldNorm(doc=1584)
          0.048621804 = weight(abstract_txt:statistical in 1584) [ClassicSimilarity], result of:
            0.048621804 = score(doc=1584,freq=2.0), product of:
              0.11334449 = queryWeight, product of:
                1.4737744 = boost
                5.5466094 = idf(docFreq=470, maxDocs=44421)
                0.013865701 = queryNorm
              0.42897367 = fieldWeight in 1584, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.5466094 = idf(docFreq=470, maxDocs=44421)
                0.0546875 = fieldNorm(doc=1584)
          0.08401149 = weight(abstract_txt:clustering in 1584) [ClassicSimilarity], result of:
            0.08401149 = score(doc=1584,freq=3.0), product of:
              0.14257446 = queryWeight, product of:
                1.6529193 = boost
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.013865701 = queryNorm
              0.5892464 = fieldWeight in 1584, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                6.2208285 = idf(docFreq=239, maxDocs=44421)
                0.0546875 = fieldNorm(doc=1584)
          0.051349618 = weight(abstract_txt:correlation in 1584) [ClassicSimilarity], result of:
            0.051349618 = score(doc=1584,freq=1.0), product of:
              0.14809754 = queryWeight, product of:
                1.6846308 = boost
                6.3401756 = idf(docFreq=212, maxDocs=44421)
                0.013865701 = queryNorm
              0.34672835 = fieldWeight in 1584, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.3401756 = idf(docFreq=212, maxDocs=44421)
                0.0546875 = fieldNorm(doc=1584)
          0.09450098 = weight(abstract_txt:cosine in 1584) [ClassicSimilarity], result of:
            0.09450098 = score(doc=1584,freq=1.0), product of:
              0.22240639 = queryWeight, product of:
                2.0644503 = boost
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.013865701 = queryNorm
              0.4249023 = fieldWeight in 1584, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.0546875 = fieldNorm(doc=1584)
          0.33773777 = weight(abstract_txt:measures in 1584) [ClassicSimilarity], result of:
            0.33773777 = score(doc=1584,freq=9.0), product of:
              0.37947628 = queryWeight, product of:
                5.0449533 = boost
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.013865701 = queryNorm
              0.89001024 = fieldWeight in 1584, product of:
                3.0 = tf(freq=9.0), with freq of:
                  9.0 = termFreq=9.0
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.0546875 = fieldNorm(doc=1584)
        0.32 = coord(8/25)
    
  4. Egghe, L.: On the relation between the association strength and other similarity measures (2010) 0.21
    0.20775314 = sum of:
      0.20775314 = product of:
        1.2984571 = sum of:
          0.26728916 = weight(abstract_txt:cosine in 585) [ClassicSimilarity], result of:
            0.26728916 = score(doc=585,freq=2.0), product of:
              0.22240639 = queryWeight, product of:
                2.0644503 = boost
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.013865701 = queryNorm
              1.2018052 = fieldWeight in 585, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.769642 = idf(docFreq=50, maxDocs=44421)
                0.109375 = fieldNorm(doc=585)
          0.28684953 = weight(abstract_txt:jaccard in 585) [ClassicSimilarity], result of:
            0.28684953 = score(doc=585,freq=1.0), product of:
              0.2937238 = queryWeight, product of:
                2.372468 = boost
                8.928879 = idf(docFreq=15, maxDocs=44421)
                0.013865701 = queryNorm
              0.9765961 = fieldWeight in 585, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.928879 = idf(docFreq=15, maxDocs=44421)
                0.109375 = fieldNorm(doc=585)
          0.38998598 = weight(abstract_txt:measures in 585) [ClassicSimilarity], result of:
            0.38998598 = score(doc=585,freq=3.0), product of:
              0.37947628 = queryWeight, product of:
                5.0449533 = boost
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.013865701 = queryNorm
              1.0276953 = fieldWeight in 585, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.109375 = fieldNorm(doc=585)
          0.35433248 = weight(abstract_txt:association in 585) [ClassicSimilarity], result of:
            0.35433248 = score(doc=585,freq=2.0), product of:
              0.4074957 = queryWeight, product of:
                5.2278886 = boost
                5.6215343 = idf(docFreq=436, maxDocs=44421)
                0.013865701 = queryNorm
              0.8695367 = fieldWeight in 585, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.6215343 = idf(docFreq=436, maxDocs=44421)
                0.109375 = fieldNorm(doc=585)
        0.16 = coord(4/25)
    
  5. Efron, M.: Linear time series models for term weighting in information retrieval (2010) 0.20
    0.20350295 = sum of:
      0.20350295 = product of:
        0.847929 = sum of:
          0.040510383 = weight(abstract_txt:behavior in 675) [ClassicSimilarity], result of:
            0.040510383 = score(doc=675,freq=1.0), product of:
              0.09968565 = queryWeight, product of:
                1.3821243 = boost
                5.2016807 = idf(docFreq=664, maxDocs=44421)
                0.013865701 = queryNorm
              0.4063813 = fieldWeight in 675, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2016807 = idf(docFreq=664, maxDocs=44421)
                0.078125 = fieldNorm(doc=675)
          0.06507498 = weight(abstract_txt:corpus in 675) [ClassicSimilarity], result of:
            0.06507498 = score(doc=675,freq=1.0), product of:
              0.1367302 = queryWeight, product of:
                1.6186875 = boost
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.013865701 = queryNorm
              0.47593716 = fieldWeight in 675, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.078125 = fieldNorm(doc=675)
          0.09515815 = weight(abstract_txt:terms in 675) [ClassicSimilarity], result of:
            0.09515815 = score(doc=675,freq=4.0), product of:
              0.1506072 = queryWeight, product of:
                2.6861093 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.013865701 = queryNorm
              0.63183004 = fieldWeight in 675, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.078125 = fieldNorm(doc=675)
          0.24238412 = weight(abstract_txt:frequency in 675) [ClassicSimilarity], result of:
            0.24238412 = score(doc=675,freq=4.0), product of:
              0.26076412 = queryWeight, product of:
                3.1613293 = boost
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.013865701 = queryNorm
              0.9295148 = fieldWeight in 675, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.948895 = idf(docFreq=314, maxDocs=44421)
                0.078125 = fieldNorm(doc=675)
          0.17735693 = weight(abstract_txt:term in 675) [ClassicSimilarity], result of:
            0.17735693 = score(doc=675,freq=5.0), product of:
              0.2117437 = queryWeight, product of:
                3.1849751 = boost
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.013865701 = queryNorm
              0.8376019 = fieldWeight in 675, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.078125 = fieldNorm(doc=675)
          0.22744444 = weight(abstract_txt:measures in 675) [ClassicSimilarity], result of:
            0.22744444 = score(doc=675,freq=2.0), product of:
              0.37947628 = queryWeight, product of:
                5.0449533 = boost
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.013865701 = queryNorm
              0.59936404 = fieldWeight in 675, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.424824 = idf(docFreq=531, maxDocs=44421)
                0.078125 = fieldNorm(doc=675)
        0.24 = coord(6/25)
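
The content scores above combine several matching abstract terms. For each term, a query weight (boost * idf * queryNorm) is multiplied by a field weight (tf * idf * fieldNorm); the per-term scores are summed and then scaled by the coordination factor, the share of the 25 query terms that matched. The sketch below recomputes the score of entry 4 (doc=585) from the values printed in its tree. The function names and the tuple layout are hypothetical, and the tf and idf formulas are inferred from the listing rather than taken from any configuration.

```python
import math

QUERY_NORM = 0.013865701   # queryNorm, as printed in the listing
MAX_DOCS = 44421           # maxDocs, as printed in the listing
TOTAL_QUERY_TERMS = 25     # denominator of coord(n/25)


def classic_idf(doc_freq, max_docs=MAX_DOCS):
    # Inferred from the listing: idf = 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(boost, doc_freq, freq, field_norm):
    idf = classic_idf(doc_freq)
    query_weight = boost * idf * QUERY_NORM
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight


def content_score(matches, field_norm):
    # matches: list of (boost, doc_freq, freq) for each matching query term.
    total = sum(term_score(b, df, f, field_norm) for b, df, f in matches)
    coord = len(matches) / TOTAL_QUERY_TERMS
    return total * coord


if __name__ == "__main__":
    # Entry 4 (doc=585): cosine, jaccard, measures, association.
    doc_585 = [
        (2.0644503, 50, 2.0),
        (2.372468, 15, 1.0),
        (5.0449533, 531, 3.0),
        (5.2278886, 436, 2.0),
    ]
    print(content_score(doc_585, field_norm=0.109375))  # about 0.2078
```

The coordination factor explains why entry 4 ranks below entries with lower raw sums: its four matching terms add up to about 1.30, but only 4 of 25 query terms matched, so the sum is scaled by 0.16.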