Document (#34097)

Author
Witschel, H.F.
Title
Global term weights in distributed environments
Source
Information processing and management. 44(2008) no.3, pp.1049-1061
Year
2008
Abstract
This paper examines the estimation of global term weights (such as IDF) in information retrieval scenarios where a global view of the collection is not available. In particular, two options are compared using standard IR test collections: sampling documents, or using a reference corpus that is independent of the target retrieval collection. In addition, the possibility of pruning term lists based on frequency is evaluated. The results show that very good retrieval performance can be reached when just the most frequent terms of a collection - an "extended stop word list" - are known and all terms that are not in that list are treated equally. However, the list cannot always be fully estimated from a general-purpose reference corpus; some "domain-specific stop words" need to be added. A good solution for achieving this is to mix estimates from small samples of the target retrieval collection with ones derived from a reference corpus.
Theme
Retrieval algorithms
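
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the linear interpolation, the mixing weight `lam`, the whitespace tokenizer, the `-log` weighting, and the choice of the shared default weight are all assumptions made for illustration. Document frequencies from a small sample of the target collection are mixed with those from a reference corpus, only the most frequent terms are kept as an "extended stop word list", and every other term receives one common weight.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # naive whitespace tokenizer (assumption)
    return df

def mixed_relative_df(sample_docs, reference_docs, lam=0.5):
    """Interpolate relative document frequencies from two sources.

    lam is a hypothetical mixing weight, not a value taken from the paper.
    """
    sample_df = document_frequencies(sample_docs)
    ref_df = document_frequencies(reference_docs)
    n_s, n_r = len(sample_docs), len(reference_docs)
    terms = set(sample_df) | set(ref_df)
    # Counter returns 0 for terms missing from one of the sources.
    return {t: lam * sample_df[t] / n_s + (1 - lam) * ref_df[t] / n_r
            for t in terms}

def pruned_idf(relative_df, list_size):
    """Keep the list_size most frequent terms; all others share one weight."""
    top = sorted(relative_df.items(), key=lambda kv: (-kv[1], kv[0]))[:list_size]
    idf = {t: -math.log(p) for t, p in top}
    # Assumption: unlisted terms get the largest weight found in the list.
    default = max(idf.values()) if idf else 1.0
    return idf, default

# Toy data: "patient" is frequent only in the target sample,
# so it behaves like a domain-specific stop word.
sample = ["the patient has the flu", "the patient is well",
          "the doctor saw the patient"]
reference = ["the cat sat", "the dog ran", "a bird flew"]

rel = mixed_relative_df(sample, reference, lam=0.5)
idf, default_weight = pruned_idf(rel, list_size=2)

def weight(term):
    """Global weight of a term under the pruned list."""
    return idf.get(term, default_weight)
```

In this toy setup the pruned list contains "the" and "patient"; a term such as "flu" falls outside the list and is weighted with `default_weight`, like every other unlisted term.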

Similar documents (author)

  1. Witschel, H.F.: Terminologie-Extraktion : Möglichkeiten der Kombination statistischer und musterbasierter Verfahren (2004) 6.19
    6.1935673 = sum of:
      6.1935673 = weight(author_txt:witschel in 1123) [ClassicSimilarity], result of:
        6.1935673 = fieldWeight in 1123, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.909708 = idf(docFreq=5, maxDocs=44421)
          0.625 = fieldNorm(doc=1123)
    
  2. Witschel, H.F.: Text, Wörter, Morpheme : Möglichkeiten einer automatischen Terminologie-Extraktion (2004) 6.19
    6.1935673 = sum of:
      6.1935673 = weight(author_txt:witschel in 1126) [ClassicSimilarity], result of:
        6.1935673 = fieldWeight in 1126, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.909708 = idf(docFreq=5, maxDocs=44421)
          0.625 = fieldNorm(doc=1126)
    
  3. Witschel, H.F.: Global and local resources for peer-to-peer text retrieval (2008) 6.19
    6.1935673 = sum of:
      6.1935673 = weight(author_txt:witschel in 1127) [ClassicSimilarity], result of:
        6.1935673 = fieldWeight in 1127, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.909708 = idf(docFreq=5, maxDocs=44421)
          0.625 = fieldNorm(doc=1127)
    
  4. Witschel, H.F.: Terminology extraction and automatic indexing : comparison and qualitative evaluation of methods (2005) 6.19
    6.1935673 = sum of:
      6.1935673 = weight(author_txt:witschel in 2842) [ClassicSimilarity], result of:
        6.1935673 = fieldWeight in 2842, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.909708 = idf(docFreq=5, maxDocs=44421)
          0.625 = fieldNorm(doc=2842)
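
The author-match scores above can be reproduced by hand. The sketch below assumes the classic Lucene TF-IDF formulas as they can be inferred from the numbers shown (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1))); the function names are illustrative and not part of any library.

```python
import math

def classic_idf(doc_freq, max_docs):
    """Inverse document frequency, inferred from the explain output above."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def classic_tf(freq):
    """Term frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)

def field_weight(freq, doc_freq, max_docs, field_norm):
    """fieldWeight = tf * idf * fieldNorm, as listed in each explanation."""
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm

# Values copied from the explanation of result 1 (doc 1123).
score = field_weight(freq=1.0, doc_freq=5, max_docs=44421, field_norm=0.625)
print(round(score, 4))  # approximately 6.1936, matching the listed 6.1935673
```

All four author matches share the same inputs (termFreq=1.0, docFreq=5, fieldNorm=0.625), which is why they receive the identical score of 6.19.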
    

Similar documents (content)

  1. Dang, E.K.F.; Luk, R.W.P.; Allan, J.: Beyond bag-of-words : bigram-enhanced context-dependent term weights (2014) 0.21
    0.21154526 = sum of:
      0.21154526 = product of:
        0.7555188 = sum of:
          0.0224545 = weight(abstract_txt:using in 2283) [ClassicSimilarity], result of:
            0.0224545 = score(doc=2283,freq=3.0), product of:
              0.068575904 = queryWeight, product of:
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.019837566 = queryNorm
              0.32744008 = fieldWeight in 2283, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.10919494 = weight(abstract_txt:estimation in 2283) [ClassicSimilarity], result of:
            0.10919494 = score(doc=2283,freq=2.0), product of:
              0.17883728 = queryWeight, product of:
                1.1419005 = boost
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.019837566 = queryNorm
              0.61058265 = fieldWeight in 2283, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.894805 = idf(docFreq=44, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.020750757 = weight(abstract_txt:terms in 2283) [ClassicSimilarity], result of:
            0.020750757 = score(doc=2283,freq=1.0), product of:
              0.093835175 = queryWeight, product of:
                1.1697608 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.019837566 = queryNorm
              0.2211405 = fieldWeight in 2283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.13028428 = weight(abstract_txt:pruning in 2283) [ClassicSimilarity], result of:
            0.13028428 = score(doc=2283,freq=1.0), product of:
              0.25347066 = queryWeight, product of:
                1.3594495 = boost
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.019837566 = queryNorm
              0.5140014 = fieldWeight in 2283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.05897053 = weight(abstract_txt:retrieval in 2283) [ClassicSimilarity], result of:
            0.05897053 = score(doc=2283,freq=5.0), product of:
              0.1387138 = queryWeight, product of:
                2.0113566 = boost
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.019837566 = queryNorm
              0.42512372 = fieldWeight in 2283, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.14676328 = weight(abstract_txt:term in 2283) [ClassicSimilarity], result of:
            0.14676328 = score(doc=2283,freq=8.0), product of:
              0.19788903 = queryWeight, product of:
                2.0805144 = boost
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.019837566 = queryNorm
              0.7416443 = fieldWeight in 2283, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
          0.2671005 = weight(abstract_txt:weights in 2283) [ClassicSimilarity], result of:
            0.2671005 = score(doc=2283,freq=5.0), product of:
              0.3013951 = queryWeight, product of:
                2.0964394 = boost
                7.2471204 = idf(docFreq=85, maxDocs=44421)
                0.019837566 = queryNorm
              0.88621384 = fieldWeight in 2283, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.2471204 = idf(docFreq=85, maxDocs=44421)
                0.0546875 = fieldNorm(doc=2283)
        0.28 = coord(7/25)
    
  2. Robertson, A.M.; Willett, P.: Use of genetic algorithms in information retrieval (1995) 0.20
    0.20215347 = sum of:
      0.20215347 = product of:
        0.7219767 = sum of:
          0.025928224 = weight(abstract_txt:using in 2486) [ClassicSimilarity], result of:
            0.025928224 = score(doc=2486,freq=1.0), product of:
              0.068575904 = queryWeight, product of:
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.019837566 = queryNorm
              0.37809524 = fieldWeight in 2486, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.109375 = fieldNorm(doc=2486)
          0.10853789 = weight(abstract_txt:achieving in 2486) [ClassicSimilarity], result of:
            0.10853789 = score(doc=2486,freq=1.0), product of:
              0.14137326 = queryWeight, product of:
                1.0152731 = boost
                7.019336 = idf(docFreq=107, maxDocs=44421)
                0.019837566 = queryNorm
              0.7677399 = fieldWeight in 2486, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.019336 = idf(docFreq=107, maxDocs=44421)
                0.109375 = fieldNorm(doc=2486)
          0.041501515 = weight(abstract_txt:terms in 2486) [ClassicSimilarity], result of:
            0.041501515 = score(doc=2486,freq=1.0), product of:
              0.093835175 = queryWeight, product of:
                1.1697608 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.019837566 = queryNorm
              0.442281 = fieldWeight in 2486, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.109375 = fieldNorm(doc=2486)
          0.107598945 = weight(abstract_txt:good in 2486) [ClassicSimilarity], result of:
            0.107598945 = score(doc=2486,freq=1.0), product of:
              0.1770904 = queryWeight, product of:
                1.6069847 = boost
                5.5551386 = idf(docFreq=466, maxDocs=44421)
                0.019837566 = queryNorm
              0.6075933 = fieldWeight in 2486, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5551386 = idf(docFreq=466, maxDocs=44421)
                0.109375 = fieldNorm(doc=2486)
          0.052744843 = weight(abstract_txt:retrieval in 2486) [ClassicSimilarity], result of:
            0.052744843 = score(doc=2486,freq=1.0), product of:
              0.1387138 = queryWeight, product of:
                2.0113566 = boost
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.019837566 = queryNorm
              0.3802422 = fieldWeight in 2486, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.109375 = fieldNorm(doc=2486)
          0.14676328 = weight(abstract_txt:term in 2486) [ClassicSimilarity], result of:
            0.14676328 = score(doc=2486,freq=2.0), product of:
              0.19788903 = queryWeight, product of:
                2.0805144 = boost
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.019837566 = queryNorm
              0.7416443 = fieldWeight in 2486, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.109375 = fieldNorm(doc=2486)
          0.23890196 = weight(abstract_txt:weights in 2486) [ClassicSimilarity], result of:
            0.23890196 = score(doc=2486,freq=1.0), product of:
              0.3013951 = queryWeight, product of:
                2.0964394 = boost
                7.2471204 = idf(docFreq=85, maxDocs=44421)
                0.019837566 = queryNorm
              0.7926538 = fieldWeight in 2486, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.2471204 = idf(docFreq=85, maxDocs=44421)
                0.109375 = fieldNorm(doc=2486)
        0.28 = coord(7/25)
    
  3. Trotman, A.: Choosing document structure weights (2005) 0.18
    0.17645293 = sum of:
      0.17645293 = product of:
        0.63018906 = sum of:
          0.02619146 = weight(abstract_txt:using in 2016) [ClassicSimilarity], result of:
            0.02619146 = score(doc=2016,freq=2.0), product of:
              0.068575904 = queryWeight, product of:
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.019837566 = queryNorm
              0.38193387 = fieldWeight in 2016, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.078125 = fieldNorm(doc=2016)
          0.029643942 = weight(abstract_txt:terms in 2016) [ClassicSimilarity], result of:
            0.029643942 = score(doc=2016,freq=1.0), product of:
              0.093835175 = queryWeight, product of:
                1.1697608 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.019837566 = queryNorm
              0.31591502 = fieldWeight in 2016, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.078125 = fieldNorm(doc=2016)
          0.07685638 = weight(abstract_txt:good in 2016) [ClassicSimilarity], result of:
            0.07685638 = score(doc=2016,freq=1.0), product of:
              0.1770904 = queryWeight, product of:
                1.6069847 = boost
                5.5551386 = idf(docFreq=466, maxDocs=44421)
                0.019837566 = queryNorm
              0.4339952 = fieldWeight in 2016, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5551386 = idf(docFreq=466, maxDocs=44421)
                0.078125 = fieldNorm(doc=2016)
          0.037674885 = weight(abstract_txt:retrieval in 2016) [ClassicSimilarity], result of:
            0.037674885 = score(doc=2016,freq=1.0), product of:
              0.1387138 = queryWeight, product of:
                2.0113566 = boost
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.019837566 = queryNorm
              0.27160156 = fieldWeight in 2016, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4765 = idf(docFreq=3732, maxDocs=44421)
                0.078125 = fieldNorm(doc=2016)
          0.074126646 = weight(abstract_txt:term in 2016) [ClassicSimilarity], result of:
            0.074126646 = score(doc=2016,freq=1.0), product of:
              0.19788903 = queryWeight, product of:
                2.0805144 = boost
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.019837566 = queryNorm
              0.37458694 = fieldWeight in 2016, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.078125 = fieldNorm(doc=2016)
          0.2955645 = weight(abstract_txt:weights in 2016) [ClassicSimilarity], result of:
            0.2955645 = score(doc=2016,freq=3.0), product of:
              0.3013951 = queryWeight, product of:
                2.0964394 = boost
                7.2471204 = idf(docFreq=85, maxDocs=44421)
                0.019837566 = queryNorm
              0.9806547 = fieldWeight in 2016, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.2471204 = idf(docFreq=85, maxDocs=44421)
                0.078125 = fieldNorm(doc=2016)
          0.09013125 = weight(abstract_txt:collection in 2016) [ClassicSimilarity], result of:
            0.09013125 = score(doc=2016,freq=1.0), product of:
              0.24812393 = queryWeight, product of:
                2.69007 = boost
                4.649612 = idf(docFreq=1154, maxDocs=44421)
                0.019837566 = queryNorm
              0.36325094 = fieldWeight in 2016, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.649612 = idf(docFreq=1154, maxDocs=44421)
                0.078125 = fieldNorm(doc=2016)
        0.28 = coord(7/25)
    
  4. Cohen, J.D.: Highlights: language- and domain-independent automatic indexing terms for abstracting (1995) 0.17
    0.17204583 = sum of:
      0.17204583 = product of:
        0.7168576 = sum of:
          0.09303248 = weight(abstract_txt:achieving in 1861) [ClassicSimilarity], result of:
            0.09303248 = score(doc=1861,freq=1.0), product of:
              0.14137326 = queryWeight, product of:
                1.0152731 = boost
                7.019336 = idf(docFreq=107, maxDocs=44421)
                0.019837566 = queryNorm
              0.65806276 = fieldWeight in 1861, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.019336 = idf(docFreq=107, maxDocs=44421)
                0.09375 = fieldNorm(doc=1861)
          0.061613776 = weight(abstract_txt:terms in 1861) [ClassicSimilarity], result of:
            0.061613776 = score(doc=1861,freq=3.0), product of:
              0.093835175 = queryWeight, product of:
                1.1697608 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.019837566 = queryNorm
              0.65661705 = fieldWeight in 1861, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.09375 = fieldNorm(doc=1861)
          0.023978733 = weight(abstract_txt:from in 1861) [ClassicSimilarity], result of:
            0.023978733 = score(doc=1861,freq=2.0), product of:
              0.06554287 = queryWeight, product of:
                1.1973541 = boost
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.019837566 = queryNorm
              0.36584806 = fieldWeight in 1861, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.09375 = fieldNorm(doc=1861)
          0.22943318 = weight(abstract_txt:stop in 1861) [ClassicSimilarity], result of:
            0.22943318 = score(doc=1861,freq=1.0), product of:
              0.325131 = queryWeight, product of:
                2.177426 = boost
                7.5270805 = idf(docFreq=64, maxDocs=44421)
                0.019837566 = queryNorm
              0.7056638 = fieldWeight in 1861, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5270805 = idf(docFreq=64, maxDocs=44421)
                0.09375 = fieldNorm(doc=1861)
          0.12634833 = weight(abstract_txt:list in 1861) [ClassicSimilarity], result of:
            0.12634833 = score(doc=1861,freq=1.0), product of:
              0.25005236 = queryWeight, product of:
                2.3387046 = boost
                5.389733 = idf(docFreq=550, maxDocs=44421)
                0.019837566 = queryNorm
              0.50528747 = fieldWeight in 1861, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.389733 = idf(docFreq=550, maxDocs=44421)
                0.09375 = fieldNorm(doc=1861)
          0.18245111 = weight(abstract_txt:corpus in 1861) [ClassicSimilarity], result of:
            0.18245111 = score(doc=1861,freq=1.0), product of:
              0.31945938 = queryWeight, product of:
                2.6434293 = boost
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.019837566 = queryNorm
              0.5711246 = fieldWeight in 1861, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0919957 = idf(docFreq=272, maxDocs=44421)
                0.09375 = fieldNorm(doc=1861)
        0.24 = coord(6/25)
    
  5. Tseng, Y.-H.: Automatic thesaurus generation for Chinese documents (2002) 0.17
    0.16565828 = sum of:
      0.16565828 = product of:
        0.5916367 = sum of:
          0.014816128 = weight(abstract_txt:using in 226) [ClassicSimilarity], result of:
            0.014816128 = score(doc=226,freq=1.0), product of:
              0.068575904 = queryWeight, product of:
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.019837566 = queryNorm
              0.21605442 = fieldWeight in 226, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4568708 = idf(docFreq=3806, maxDocs=44421)
                0.0625 = fieldNorm(doc=226)
          0.041075848 = weight(abstract_txt:terms in 226) [ClassicSimilarity], result of:
            0.041075848 = score(doc=226,freq=3.0), product of:
              0.093835175 = queryWeight, product of:
                1.1697608 = boost
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.019837566 = queryNorm
              0.43774468 = fieldWeight in 226, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.043712 = idf(docFreq=2116, maxDocs=44421)
                0.0625 = fieldNorm(doc=226)
          0.011303683 = weight(abstract_txt:from in 226) [ClassicSimilarity], result of:
            0.011303683 = score(doc=226,freq=1.0), product of:
              0.06554287 = queryWeight, product of:
                1.1973541 = boost
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.019837566 = queryNorm
              0.17246243 = fieldWeight in 226, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.759399 = idf(docFreq=7646, maxDocs=44421)
                0.0625 = fieldNorm(doc=226)
          0.05930132 = weight(abstract_txt:term in 226) [ClassicSimilarity], result of:
            0.05930132 = score(doc=226,freq=1.0), product of:
              0.19788903 = queryWeight, product of:
                2.0805144 = boost
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.019837566 = queryNorm
              0.29966956 = fieldWeight in 226, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.794713 = idf(docFreq=998, maxDocs=44421)
                0.0625 = fieldNorm(doc=226)
          0.19306193 = weight(abstract_txt:weights in 226) [ClassicSimilarity], result of:
            0.19306193 = score(doc=226,freq=2.0), product of:
              0.3013951 = queryWeight, product of:
                2.0964394 = boost
                7.2471204 = idf(docFreq=85, maxDocs=44421)
                0.019837566 = queryNorm
              0.640561 = fieldWeight in 226, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.2471204 = idf(docFreq=85, maxDocs=44421)
                0.0625 = fieldNorm(doc=226)
          0.15295546 = weight(abstract_txt:stop in 226) [ClassicSimilarity], result of:
            0.15295546 = score(doc=226,freq=1.0), product of:
              0.325131 = queryWeight, product of:
                2.177426 = boost
                7.5270805 = idf(docFreq=64, maxDocs=44421)
                0.019837566 = queryNorm
              0.47044253 = fieldWeight in 226, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5270805 = idf(docFreq=64, maxDocs=44421)
                0.0625 = fieldNorm(doc=226)
          0.119122334 = weight(abstract_txt:list in 226) [ClassicSimilarity], result of:
            0.119122334 = score(doc=226,freq=2.0), product of:
              0.25005236 = queryWeight, product of:
                2.3387046 = boost
                5.389733 = idf(docFreq=550, maxDocs=44421)
                0.019837566 = queryNorm
              0.47638956 = fieldWeight in 226, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.389733 = idf(docFreq=550, maxDocs=44421)
                0.0625 = fieldNorm(doc=226)
        0.28 = coord(7/25)
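
The content-match scores combine several term scores. As a hedged sketch, again assuming the formulas inferred from the explanations above, each term contributes queryWeight * fieldWeight, where queryWeight = boost * idf * queryNorm and fieldWeight = tf * idf * fieldNorm; the sum is then multiplied by the coordination factor (matching terms / query terms). The code recomputes the score of the first content result (doc 2283) from the values listed there; the helper names are illustrative.

```python
import math

QUERY_NORM = 0.019837566  # queryNorm as listed in the explanations
MAX_DOCS = 44421          # maxDocs as listed in the explanations

def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency, inferred from the explain output."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq, doc_freq, field_norm, boost=1.0):
    """Score of one matching term: queryWeight * fieldWeight."""
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    field_weight = math.sqrt(freq) * idf(doc_freq) * field_norm
    return query_weight * field_weight

# (termFreq, docFreq, boost) for the seven matching terms of doc 2283.
matches = [
    (3.0, 3806, 1.0),        # using (no boost listed)
    (2.0, 44, 1.1419005),    # estimation
    (1.0, 2116, 1.1697608),  # terms
    (1.0, 9, 1.3594495),     # pruning
    (5.0, 3732, 2.0113566),  # retrieval
    (8.0, 998, 2.0805144),   # term
    (5.0, 85, 2.0964394),    # weights
]
field_norm = 0.0546875
coord = 7 / 25  # 7 of the 25 query terms match

total = coord * sum(term_score(f, df, field_norm, b) for f, df, b in matches)
print(round(total, 4))  # approximately 0.2115, matching the listed 0.21154526
```

Because idf enters both queryWeight and fieldWeight, rare terms such as "pruning" (docFreq=9) and "weights" (docFreq=85) dominate the sum, while frequent terms such as "using" contribute little.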