Document (#34097)

Author
Witschel, H.F.
Title
Global term weights in distributed environments
Source
Information processing and management. 44(2008) no.3, pp.1049-1061
Year
2008
Abstract
This paper examines the estimation of global term weights (such as IDF) in information retrieval scenarios where a global view of the collection is not available. In particular, two options are compared using standard IR test collections: sampling documents, or using a reference corpus that is independent of the target retrieval collection. In addition, the possibility of pruning term lists based on frequency is evaluated. The results show that very good retrieval performance can be reached when only the most frequent terms of a collection (an "extended stop word list") are known and all terms not in that list are treated equally. However, the list cannot always be fully estimated from a general-purpose reference corpus; some "domain-specific stop words" need to be added. A good way to achieve this is to mix estimates from small samples of the target retrieval collection with estimates derived from a reference corpus.
Theme
Retrieval algorithms
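
The idea summarized in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (doc_freqs, mixed_idf_table, term_weight), the linear interpolation controlled by the mix parameter, the log(1/p) weight, and the choice of the largest listed weight as the shared default for unlisted terms.

```python
import math
from collections import Counter


def doc_freqs(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def mixed_idf_table(sample_docs, reference_docs, top_k, mix=0.5):
    """Estimate IDF-like weights for the top_k most frequent terms only.

    Relative document frequencies from a small sample of the target
    collection and from a reference corpus are linearly interpolated
    (mix is the weight of the sample). Hypothetical formula.
    """
    s_df, r_df = doc_freqs(sample_docs), doc_freqs(reference_docs)
    n_s, n_r = len(sample_docs), len(reference_docs)
    rel = {}
    for term in set(s_df) | set(r_df):
        p = mix * s_df[term] / n_s + (1 - mix) * r_df[term] / n_r
        if p > 0:
            rel[term] = p
    # Keep only the most frequent terms: the "extended stop word list".
    frequent = sorted(rel, key=lambda t: (-rel[t], t))[:top_k]
    return {t: math.log(1.0 / rel[t]) for t in frequent}


def term_weight(term, table):
    """Listed terms get their estimate; all other terms share one weight."""
    default = max(table.values(), default=1.0)  # assumption, see lead-in
    return table.get(term, default)


sample = ["the patient the", "the patient drug", "the trial"]
reference = ["the cat", "the dog", "a bird", "the fish"]
table = mixed_idf_table(sample, reference, top_k=2)
print(sorted(table))
print(term_weight("drug", table) == term_weight("zebra", table))
```

In this toy data, "patient" never occurs in the reference corpus and enters the list only through the sample, which is the role the abstract gives to "domain-specific stop words".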

Similar documents (author)

  1. Witschel, H.F.: Terminologie-Extraktion : Möglichkeiten der Kombination statistischer und musterbasierter Verfahren (2004) 6.19
    6.190705 = sum of:
      6.190705 = weight(author_txt:witschel in 123) [ClassicSimilarity], result of:
        6.190705 = fieldWeight in 123, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.905128 = idf(docFreq=5, maxDocs=44218)
          0.625 = fieldNorm(doc=123)
    
  2. Witschel, H.F.: Text, Wörter, Morpheme : Möglichkeiten einer automatischen Terminologie-Extraktion (2004) 6.19
    6.190705 = sum of:
      6.190705 = weight(author_txt:witschel in 126) [ClassicSimilarity], result of:
        6.190705 = fieldWeight in 126, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.905128 = idf(docFreq=5, maxDocs=44218)
          0.625 = fieldNorm(doc=126)
    
  3. Witschel, H.F.: Global and local resources for peer-to-peer text retrieval (2008) 6.19
    6.190705 = sum of:
      6.190705 = weight(author_txt:witschel in 127) [ClassicSimilarity], result of:
        6.190705 = fieldWeight in 127, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.905128 = idf(docFreq=5, maxDocs=44218)
          0.625 = fieldNorm(doc=127)
    
  4. Witschel, H.F.: Terminology extraction and automatic indexing : comparison and qualitative evaluation of methods (2005) 6.19
    6.190705 = sum of:
      6.190705 = weight(author_txt:witschel in 1842) [ClassicSimilarity], result of:
        6.190705 = fieldWeight in 1842, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.905128 = idf(docFreq=5, maxDocs=44218)
          0.625 = fieldNorm(doc=1842)
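
The four author-match scores above are identical because each is the product of the same three factors. The following Python sketch recomputes that product. The formulas idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq) are assumptions chosen because they reproduce the values printed in the trees; the function names are hypothetical, and fieldNorm is simply taken from the output rather than derived.

```python
import math


def classic_idf(doc_freq, max_docs):
    # Assumed formula; it reproduces idf(docFreq=5, maxDocs=44218) = 9.905128.
    return 1.0 + math.log(max_docs / (doc_freq + 1.0))


def classic_tf(freq):
    # Assumed formula; it reproduces tf(freq=3.0) = 1.7320508 elsewhere in the listing.
    return math.sqrt(freq)


def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm, as the explain tree states.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


score = field_weight(freq=1.0, doc_freq=5, max_docs=44218, field_norm=0.625)
print(round(score, 4))  # compare with the 6.190705 shown in the trees
```

Small deviations in the last digits are possible because the listing appears to use single-precision arithmetic while Python uses double precision.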
    

Similar documents (content)

  1. Dang, E.K.F.; Luk, R.W.P.; Allan, J.: Beyond bag-of-words : bigram-enhanced context-dependent term weights (2014) 0.21
    0.21134546 = sum of:
      0.21134546 = product of:
        0.7548052 = sum of:
          0.022562146 = weight(abstract_txt:using in 1283) [ClassicSimilarity], result of:
            0.022562146 = score(doc=1283,freq=3.0), product of:
              0.06878035 = queryWeight, product of:
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.019860812 = queryNorm
              0.32803187 = fieldWeight in 1283, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.10893585 = weight(abstract_txt:estimation in 1283) [ClassicSimilarity], result of:
            0.10893585 = score(doc=1283,freq=2.0), product of:
              0.1785165 = queryWeight, product of:
                1.1391791 = boost
                7.890225 = idf(docFreq=44, maxDocs=44218)
                0.019860812 = queryNorm
              0.6102284 = fieldWeight in 1283, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.890225 = idf(docFreq=44, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.020739973 = weight(abstract_txt:terms in 1283) [ClassicSimilarity], result of:
            0.020739973 = score(doc=1283,freq=1.0), product of:
              0.09378282 = queryWeight, product of:
                1.1676952 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.019860812 = queryNorm
              0.22114895 = fieldWeight in 1283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.13001136 = weight(abstract_txt:pruning in 1283) [ClassicSimilarity], result of:
            0.13001136 = score(doc=1283,freq=1.0), product of:
              0.25306302 = queryWeight, product of:
                1.3563356 = boost
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.019860812 = queryNorm
              0.5137509 = fieldWeight in 1283, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.394302 = idf(docFreq=9, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.058863953 = weight(abstract_txt:retrieval in 1283) [ClassicSimilarity], result of:
            0.058863953 = score(doc=1283,freq=5.0), product of:
              0.13851734 = queryWeight, product of:
                2.0069423 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.019860812 = queryNorm
              0.4249573 = fieldWeight in 1283, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.14726672 = weight(abstract_txt:term in 1283) [ClassicSimilarity], result of:
            0.14726672 = score(doc=1283,freq=8.0), product of:
              0.19829936 = queryWeight, product of:
                2.0795727 = boost
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.019860812 = queryNorm
              0.7426484 = fieldWeight in 1283, product of:
                2.828427 = tf(freq=8.0), with freq of:
                  8.0 = termFreq=8.0
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
          0.26642522 = weight(abstract_txt:weights in 1283) [ClassicSimilarity], result of:
            0.26642522 = score(doc=1283,freq=5.0), product of:
              0.30082324 = queryWeight, product of:
                2.0913346 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.019860812 = queryNorm
              0.88565373 = fieldWeight in 1283, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.0546875 = fieldNorm(doc=1283)
        0.28 = coord(7/25)
    
  2. Robertson, A.M.; Willett, P.: Use of genetic algorithms in information retrieval (1995) 0.20
    0.202737 = sum of:
      0.202737 = product of:
        0.7240607 = sum of:
          0.026052522 = weight(abstract_txt:using in 2418) [ClassicSimilarity], result of:
            0.026052522 = score(doc=2418,freq=1.0), product of:
              0.06878035 = queryWeight, product of:
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.019860812 = queryNorm
              0.37877858 = fieldWeight in 2418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.109375 = fieldNorm(doc=2418)
          0.1109248 = weight(abstract_txt:achieving in 2418) [ClassicSimilarity], result of:
            0.1109248 = score(doc=2418,freq=1.0), product of:
              0.14340807 = queryWeight, product of:
                1.0210327 = boost
                7.071914 = idf(docFreq=101, maxDocs=44218)
                0.019860812 = queryNorm
              0.7734906 = fieldWeight in 2418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.071914 = idf(docFreq=101, maxDocs=44218)
                0.109375 = fieldNorm(doc=2418)
          0.041479945 = weight(abstract_txt:terms in 2418) [ClassicSimilarity], result of:
            0.041479945 = score(doc=2418,freq=1.0), product of:
              0.09378282 = queryWeight, product of:
                1.1676952 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.019860812 = queryNorm
              0.4422979 = fieldWeight in 2418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.109375 = fieldNorm(doc=2418)
          0.10738922 = weight(abstract_txt:good in 2418) [ClassicSimilarity], result of:
            0.10738922 = score(doc=2418,freq=1.0), product of:
              0.17682281 = queryWeight, product of:
                1.6033819 = boost
                5.5527015 = idf(docFreq=465, maxDocs=44218)
                0.019860812 = queryNorm
              0.60732675 = fieldWeight in 2418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5527015 = idf(docFreq=465, maxDocs=44218)
                0.109375 = fieldNorm(doc=2418)
          0.05264952 = weight(abstract_txt:retrieval in 2418) [ClassicSimilarity], result of:
            0.05264952 = score(doc=2418,freq=1.0), product of:
              0.13851734 = queryWeight, product of:
                2.0069423 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.019860812 = queryNorm
              0.38009337 = fieldWeight in 2418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.109375 = fieldNorm(doc=2418)
          0.14726672 = weight(abstract_txt:term in 2418) [ClassicSimilarity], result of:
            0.14726672 = score(doc=2418,freq=2.0), product of:
              0.19829936 = queryWeight, product of:
                2.0795727 = boost
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.019860812 = queryNorm
              0.7426484 = fieldWeight in 2418, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.109375 = fieldNorm(doc=2418)
          0.23829798 = weight(abstract_txt:weights in 2418) [ClassicSimilarity], result of:
            0.23829798 = score(doc=2418,freq=1.0), product of:
              0.30082324 = queryWeight, product of:
                2.0913346 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.019860812 = queryNorm
              0.7921528 = fieldWeight in 2418, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.109375 = fieldNorm(doc=2418)
        0.28 = coord(7/25)
    
  3. Trotman, A.: Choosing document structure weights (2005) 0.18
    0.17625068 = sum of:
      0.17625068 = product of:
        0.6294667 = sum of:
          0.026317023 = weight(abstract_txt:using in 1016) [ClassicSimilarity], result of:
            0.026317023 = score(doc=1016,freq=2.0), product of:
              0.06878035 = queryWeight, product of:
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.019860812 = queryNorm
              0.38262415 = fieldWeight in 1016, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.078125 = fieldNorm(doc=1016)
          0.029628534 = weight(abstract_txt:terms in 1016) [ClassicSimilarity], result of:
            0.029628534 = score(doc=1016,freq=1.0), product of:
              0.09378282 = queryWeight, product of:
                1.1676952 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.019860812 = queryNorm
              0.3159271 = fieldWeight in 1016, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.078125 = fieldNorm(doc=1016)
          0.07670659 = weight(abstract_txt:good in 1016) [ClassicSimilarity], result of:
            0.07670659 = score(doc=1016,freq=1.0), product of:
              0.17682281 = queryWeight, product of:
                1.6033819 = boost
                5.5527015 = idf(docFreq=465, maxDocs=44218)
                0.019860812 = queryNorm
              0.4338048 = fieldWeight in 1016, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.5527015 = idf(docFreq=465, maxDocs=44218)
                0.078125 = fieldNorm(doc=1016)
          0.037606798 = weight(abstract_txt:retrieval in 1016) [ClassicSimilarity], result of:
            0.037606798 = score(doc=1016,freq=1.0), product of:
              0.13851734 = queryWeight, product of:
                2.0069423 = boost
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.019860812 = queryNorm
              0.27149525 = fieldWeight in 1016, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4751394 = idf(docFreq=3720, maxDocs=44218)
                0.078125 = fieldNorm(doc=1016)
          0.07438093 = weight(abstract_txt:term in 1016) [ClassicSimilarity], result of:
            0.07438093 = score(doc=1016,freq=1.0), product of:
              0.19829936 = queryWeight, product of:
                2.0795727 = boost
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.019860812 = queryNorm
              0.37509412 = fieldWeight in 1016, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.078125 = fieldNorm(doc=1016)
          0.2948173 = weight(abstract_txt:weights in 1016) [ClassicSimilarity], result of:
            0.2948173 = score(doc=1016,freq=3.0), product of:
              0.30082324 = queryWeight, product of:
                2.0913346 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.019860812 = queryNorm
              0.98003495 = fieldWeight in 1016, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.078125 = fieldNorm(doc=1016)
          0.09000952 = weight(abstract_txt:collection in 1016) [ClassicSimilarity], result of:
            0.09000952 = score(doc=1016,freq=1.0), product of:
              0.24784803 = queryWeight, product of:
                2.684575 = boost
                4.648501 = idf(docFreq=1150, maxDocs=44218)
                0.019860812 = queryNorm
              0.36316413 = fieldWeight in 1016, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.648501 = idf(docFreq=1150, maxDocs=44218)
                0.078125 = fieldNorm(doc=1016)
        0.28 = coord(7/25)
    
  4. Cohen, J.D.: Highlights: language- and domain-independent automatic indexing terms for abstracting (1995) 0.17
    0.17243358 = sum of:
      0.17243358 = product of:
        0.7184733 = sum of:
          0.0950784 = weight(abstract_txt:achieving in 1793) [ClassicSimilarity], result of:
            0.0950784 = score(doc=1793,freq=1.0), product of:
              0.14340807 = queryWeight, product of:
                1.0210327 = boost
                7.071914 = idf(docFreq=101, maxDocs=44218)
                0.019860812 = queryNorm
              0.66299194 = fieldWeight in 1793, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.071914 = idf(docFreq=101, maxDocs=44218)
                0.09375 = fieldNorm(doc=1793)
          0.061581746 = weight(abstract_txt:terms in 1793) [ClassicSimilarity], result of:
            0.061581746 = score(doc=1793,freq=3.0), product of:
              0.09378282 = queryWeight, product of:
                1.1676952 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.019860812 = queryNorm
              0.6566421 = fieldWeight in 1793, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.09375 = fieldNorm(doc=1793)
          0.024080526 = weight(abstract_txt:from in 1793) [ClassicSimilarity], result of:
            0.024080526 = score(doc=1793,freq=2.0), product of:
              0.06571433 = queryWeight, product of:
                1.197136 = boost
                2.7638826 = idf(docFreq=7577, maxDocs=44218)
                0.019860812 = queryNorm
              0.36644253 = fieldWeight in 1793, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.7638826 = idf(docFreq=7577, maxDocs=44218)
                0.09375 = fieldNorm(doc=1793)
          0.22886929 = weight(abstract_txt:stop in 1793) [ClassicSimilarity], result of:
            0.22886929 = score(doc=1793,freq=1.0), product of:
              0.32452938 = queryWeight, product of:
                2.1721752 = boost
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.019860812 = queryNorm
              0.7052344 = fieldWeight in 1793, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.09375 = fieldNorm(doc=1793)
          0.12594649 = weight(abstract_txt:list in 1793) [ClassicSimilarity], result of:
            0.12594649 = score(doc=1793,freq=1.0), product of:
              0.24946915 = queryWeight, product of:
                2.3325012 = boost
                5.3851523 = idf(docFreq=550, maxDocs=44218)
                0.019860812 = queryNorm
              0.504858 = fieldWeight in 1793, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.3851523 = idf(docFreq=550, maxDocs=44218)
                0.09375 = fieldNorm(doc=1793)
          0.18291691 = weight(abstract_txt:corpus in 1793) [ClassicSimilarity], result of:
            0.18291691 = score(doc=1793,freq=1.0), product of:
              0.31993517 = queryWeight, product of:
                2.6414626 = boost
                6.0984654 = idf(docFreq=269, maxDocs=44218)
                0.019860812 = queryNorm
              0.57173115 = fieldWeight in 1793, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.0984654 = idf(docFreq=269, maxDocs=44218)
                0.09375 = fieldNorm(doc=1793)
        0.24 = coord(6/25)
    
  5. Tseng, Y.-H.: Automatic thesaurus generation for Chinese documents (2002) 0.17
    0.16539457 = sum of:
      0.16539457 = product of:
        0.5906949 = sum of:
          0.014887156 = weight(abstract_txt:using in 5226) [ClassicSimilarity], result of:
            0.014887156 = score(doc=5226,freq=1.0), product of:
              0.06878035 = queryWeight, product of:
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.019860812 = queryNorm
              0.21644491 = fieldWeight in 5226, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.4631186 = idf(docFreq=3765, maxDocs=44218)
                0.0625 = fieldNorm(doc=5226)
          0.0410545 = weight(abstract_txt:terms in 5226) [ClassicSimilarity], result of:
            0.0410545 = score(doc=5226,freq=3.0), product of:
              0.09378282 = queryWeight, product of:
                1.1676952 = boost
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.019860812 = queryNorm
              0.4377614 = fieldWeight in 5226, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.0438666 = idf(docFreq=2106, maxDocs=44218)
                0.0625 = fieldNorm(doc=5226)
          0.011351668 = weight(abstract_txt:from in 5226) [ClassicSimilarity], result of:
            0.011351668 = score(doc=5226,freq=1.0), product of:
              0.06571433 = queryWeight, product of:
                1.197136 = boost
                2.7638826 = idf(docFreq=7577, maxDocs=44218)
                0.019860812 = queryNorm
              0.17274266 = fieldWeight in 5226, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.7638826 = idf(docFreq=7577, maxDocs=44218)
                0.0625 = fieldNorm(doc=5226)
          0.05950474 = weight(abstract_txt:term in 5226) [ClassicSimilarity], result of:
            0.05950474 = score(doc=5226,freq=1.0), product of:
              0.19829936 = queryWeight, product of:
                2.0795727 = boost
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.019860812 = queryNorm
              0.3000753 = fieldWeight in 5226, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.8012047 = idf(docFreq=987, maxDocs=44218)
                0.0625 = fieldNorm(doc=5226)
          0.19257385 = weight(abstract_txt:weights in 5226) [ClassicSimilarity], result of:
            0.19257385 = score(doc=5226,freq=2.0), product of:
              0.30082324 = queryWeight, product of:
                2.0913346 = boost
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.019860812 = queryNorm
              0.64015615 = fieldWeight in 5226, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.24254 = idf(docFreq=85, maxDocs=44218)
                0.0625 = fieldNorm(doc=5226)
          0.15257952 = weight(abstract_txt:stop in 5226) [ClassicSimilarity], result of:
            0.15257952 = score(doc=5226,freq=1.0), product of:
              0.32452938 = queryWeight, product of:
                2.1721752 = boost
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.019860812 = queryNorm
              0.47015625 = fieldWeight in 5226, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5225 = idf(docFreq=64, maxDocs=44218)
                0.0625 = fieldNorm(doc=5226)
          0.1187435 = weight(abstract_txt:list in 5226) [ClassicSimilarity], result of:
            0.1187435 = score(doc=5226,freq=2.0), product of:
              0.24946915 = queryWeight, product of:
                2.3325012 = boost
                5.3851523 = idf(docFreq=550, maxDocs=44218)
                0.019860812 = queryNorm
              0.47598472 = fieldWeight in 5226, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.3851523 = idf(docFreq=550, maxDocs=44218)
                0.0625 = fieldNorm(doc=5226)
        0.28 = coord(7/25)
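
The content-based scores above combine per-term scores with a coordination factor. As a hedged sketch, the Python below recomputes the total for the first entry (doc=1283) from the freq, docFreq, boost, queryNorm, fieldNorm and coord values printed in its tree. The idf and tf formulas are assumptions that happen to reproduce the printed values, and all function and variable names are hypothetical.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.019860812
FIELD_NORM = 0.0546875  # fieldNorm(doc=1283), copied from the tree

# (term, freq in doc 1283, docFreq, boost), copied from the tree;
# "using" has no boost line, so a boost of 1.0 is assumed.
TERMS = [
    ("using", 3.0, 3765, 1.0),
    ("estimation", 2.0, 44, 1.1391791),
    ("terms", 1.0, 2106, 1.1676952),
    ("pruning", 1.0, 9, 1.3563356),
    ("retrieval", 5.0, 3720, 2.0069423),
    ("term", 8.0, 987, 2.0795727),
    ("weights", 5.0, 85, 2.0913346),
]


def idf(doc_freq):
    # Assumed formula: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(MAX_DOCS / (doc_freq + 1.0))


def term_score(freq, doc_freq, boost):
    query_weight = boost * idf(doc_freq) * QUERY_NORM
    field_weight = math.sqrt(freq) * idf(doc_freq) * FIELD_NORM
    return query_weight * field_weight


def document_score(terms, matched, total_query_terms):
    # coord rewards documents that match more of the query terms.
    coord = matched / total_query_terms
    return coord * sum(term_score(f, df, b) for _, f, df, b in terms)


total = document_score(TERMS, matched=7, total_query_terms=25)
print(round(total, 5))  # compare with the 0.21134546 shown for entry 1
```

Because idf enters both queryWeight and fieldWeight, each term contributes in proportion to idf squared, which is why rare terms such as "weights" and "pruning" dominate the sum even at low frequencies.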