Document (#26679)

Author
Heinz, S.
Zobel, J.
Title
Efficient single-pass index construction for text databases
Source
Journal of the American Society for Information Science and Technology. 54(2003) no.8, pp.713-729
Year
2003
Abstract
Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this article, we review the principal approaches to inversion, analyze their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single-pass inversion method that, in contrast to previous approaches, does not require the complete vocabulary of the indexed collection in main memory, can operate within limited resources, and does not sacrifice speed with high temporary storage requirements. We show that the performance of the single-pass approach can be improved by constructing inverted files in segments, reducing the cost of disk accesses during inversion of large volumes of data.
Theme
Retrieval algorithms
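
The abstract describes single-pass inversion only in outline. The following is a minimal sketch of one common reading of that idea, not the authors' implementation: postings accumulate in memory until a budget is reached, each partial index is flushed to disk as a term-sorted run, and the runs are merged at the end. The function names, the `max_postings` budget, and the tab-separated run format are assumptions made for illustration.

```python
import heapq
import os
import tempfile
from collections import defaultdict


def flush_run(index, run_dir, run_id):
    """Write one in-memory partial index to disk, sorted by term."""
    path = os.path.join(run_dir, f"run{run_id}.txt")
    with open(path, "w", encoding="utf-8") as f:
        for term in sorted(index):
            postings = ",".join(map(str, index[term]))
            f.write(f"{term}\t{postings}\n")
    return path


def read_run(path):
    """Stream (term, postings) pairs back from a run file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, postings = line.rstrip("\n").split("\t")
            yield term, [int(d) for d in postings.split(",")]


def build_index(docs, max_postings=4):
    """Invert docs in one pass, flushing a run whenever the budget is hit."""
    with tempfile.TemporaryDirectory() as run_dir:
        runs, index, held = [], defaultdict(list), 0
        for doc_id, text in enumerate(docs):
            for term in sorted(set(text.lower().split())):
                index[term].append(doc_id)
                held += 1
            if held >= max_postings:  # hypothetical memory budget
                runs.append(flush_run(index, run_dir, len(runs)))
                index, held = defaultdict(list), 0
        if index:
            runs.append(flush_run(index, run_dir, len(runs)))
        # Merge the term-sorted runs. Ties on a term fall back to comparing
        # the postings lists, which keeps document ids in ascending order.
        merged = {}
        for term, postings in heapq.merge(*(read_run(p) for p in runs)):
            merged.setdefault(term, []).extend(postings)
        return merged
```

Because each run carries only the terms seen since the last flush, no step needs the complete vocabulary in memory before the merge. This sketch collects the merged result in a dictionary for readability; a disk-based variant would write the merged lists out instead.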

Similar documents (author)

  1. Heinz, S.: Realisierung und Evaluierung eines virtuellen Bibliotheksregals für die Informationswissenschaft an der Universitätsbibliothek Hildesheim (2003) 2.08
    2.0768793 = sum of:
      2.0768793 = product of:
        4.1537585 = sum of:
          4.1537585 = weight(author_txt:heinz in 982) [ClassicSimilarity], result of:
            4.1537585 = score(doc=982,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.075233065 = queryNorm
              5.874302 = fieldWeight in 982, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.625 = fieldNorm(doc=982)
        0.5 = coord(1/2)
    
  2. Heinz, M.: Bemerkungen zur Entwicklung der Internationalität der Forschung : Bibliometrische Untersuchungen am SCI (2006) 2.08
    2.0768793 = sum of:
      2.0768793 = product of:
        4.1537585 = sum of:
          4.1537585 = weight(author_txt:heinz in 110) [ClassicSimilarity], result of:
            4.1537585 = score(doc=110,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.075233065 = queryNorm
              5.874302 = fieldWeight in 110, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.625 = fieldNorm(doc=110)
        0.5 = coord(1/2)
    
  3. Heinz, A.: ¬Die Lösung des Leib-Seele-Problems bei John R. Searle (2002) 2.08
    2.0768793 = sum of:
      2.0768793 = product of:
        4.1537585 = sum of:
          4.1537585 = weight(author_txt:heinz in 299) [ClassicSimilarity], result of:
            4.1537585 = score(doc=299,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.075233065 = queryNorm
              5.874302 = fieldWeight in 299, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.625 = fieldNorm(doc=299)
        0.5 = coord(1/2)
    
  4. Großmann, R.; Heinz, M.: RAK-WB als Hypertext (1994) 1.66
    1.6615034 = sum of:
      1.6615034 = product of:
        3.3230069 = sum of:
          3.3230069 = weight(author_txt:heinz in 389) [ClassicSimilarity], result of:
            3.3230069 = score(doc=389,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.075233065 = queryNorm
              4.6994414 = fieldWeight in 389, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.5 = fieldNorm(doc=389)
        0.5 = coord(1/2)
    
  5. Heinz, M.; Voigt, H.: Inhaltliche und formale Unzulänglichkeiten bei CD-ROMs : eine Gewichtung (1993) 1.66
    1.6615034 = sum of:
      1.6615034 = product of:
        3.3230069 = sum of:
          3.3230069 = weight(author_txt:heinz in 421) [ClassicSimilarity], result of:
            3.3230069 = score(doc=421,freq=1.0), product of:
              0.70710677 = queryWeight, product of:
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.075233065 = queryNorm
              4.6994414 = fieldWeight in 421, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                9.398883 = idf(docFreq=9, maxDocs=44421)
                0.5 = fieldNorm(doc=421)
        0.5 = coord(1/2)
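
The author scores above can be reproduced by hand. The sketch below assumes the standard ClassicSimilarity definitions, idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq), and takes queryNorm, fieldNorm and coord directly from the listing for entry 1 (doc 982). The function names are illustrative, not part of any library.

```python
import math


def classic_idf(doc_freq, max_docs):
    """Inverse document frequency as used by ClassicSimilarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def single_term_score(freq, doc_freq, max_docs, query_norm, field_norm, coord):
    """Score of a one-term match: coord * queryWeight * fieldWeight."""
    idf = classic_idf(doc_freq, max_docs)
    query_weight = idf * query_norm                   # 0.70710677 in the listing
    field_weight = math.sqrt(freq) * idf * field_norm  # 5.874302 in the listing
    return coord * query_weight * field_weight


# Values copied from the explanation for doc 982 (author_txt:heinz).
score = single_term_score(
    freq=1.0,
    doc_freq=9,
    max_docs=44421,
    query_norm=0.075233065,
    field_norm=0.625,
    coord=0.5,
)
print(round(score, 4))
```

Entries 4 and 5 use the same idf and differ only in fieldNorm (0.5 rather than 0.625), which is why their scores are lower.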
    

Similar documents (content)

  1. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 0.14
    0.13666494 = sum of:
      0.13666494 = product of:
        0.48808908 = sum of:
          0.04291799 = weight(abstract_txt:memory in 2716) [ClassicSimilarity], result of:
            0.04291799 = score(doc=2716,freq=1.0), product of:
              0.083291404 = queryWeight, product of:
                1.0251666 = boost
                6.595522 = idf(docFreq=164, maxDocs=44421)
                0.012318464 = queryNorm
              0.5152751 = fieldWeight in 2716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.595522 = idf(docFreq=164, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
          0.011049503 = weight(abstract_txt:data in 2716) [ClassicSimilarity], result of:
            0.011049503 = score(doc=2716,freq=1.0), product of:
              0.04246969 = queryWeight, product of:
                1.0352588 = boost
                3.3302255 = idf(docFreq=4320, maxDocs=44421)
                0.012318464 = queryNorm
              0.26017386 = fieldWeight in 2716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3302255 = idf(docFreq=4320, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
          0.01974015 = weight(abstract_txt:text in 2716) [ClassicSimilarity], result of:
            0.01974015 = score(doc=2716,freq=1.0), product of:
              0.06252939 = queryWeight, product of:
                1.2561789 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.012318464 = queryNorm
              0.3156939 = fieldWeight in 2716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
          0.12911971 = weight(abstract_txt:temporary in 2716) [ClassicSimilarity], result of:
            0.12911971 = score(doc=2716,freq=2.0), product of:
              0.13777137 = queryWeight, product of:
                1.318481 = boost
                8.482592 = idf(docFreq=24, maxDocs=44421)
                0.012318464 = queryNorm
              0.9372028 = fieldWeight in 2716, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.482592 = idf(docFreq=24, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
          0.045427747 = weight(abstract_txt:large in 2716) [ClassicSimilarity], result of:
            0.045427747 = score(doc=2716,freq=3.0), product of:
              0.07557143 = queryWeight, product of:
                1.3809826 = boost
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.012318464 = queryNorm
              0.6011233 = fieldWeight in 2716, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
          0.04846945 = weight(abstract_txt:construction in 2716) [ClassicSimilarity], result of:
            0.04846945 = score(doc=2716,freq=1.0), product of:
              0.113805346 = queryWeight, product of:
                1.6946917 = boost
                5.4514923 = idf(docFreq=517, maxDocs=44421)
                0.012318464 = queryNorm
              0.42589784 = fieldWeight in 2716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.4514923 = idf(docFreq=517, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
          0.19136454 = weight(abstract_txt:inverted in 2716) [ClassicSimilarity], result of:
            0.19136454 = score(doc=2716,freq=2.0), product of:
              0.22563939 = queryWeight, product of:
                2.3862548 = boost
                7.676116 = idf(docFreq=55, maxDocs=44421)
                0.012318464 = queryNorm
              0.848099 = fieldWeight in 2716, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.676116 = idf(docFreq=55, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
        0.28 = coord(7/25)
    
  2. Mukhopadhyay, S.; Peng, S.; Raje, R.; Mostafa, J.; Palakal, M.: Distributed multi-agent information filtering : a comparative study (2005) 0.11
    0.10795648 = sum of:
      0.10795648 = product of:
        0.38555884 = sum of:
          0.042800147 = weight(abstract_txt:speed in 4559) [ClassicSimilarity], result of:
            0.042800147 = score(doc=4559,freq=1.0), product of:
              0.08313887 = queryWeight, product of:
                1.0242275 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.012318464 = queryNorm
              0.5148031 = fieldWeight in 4559, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.078125 = fieldNorm(doc=4559)
          0.011049503 = weight(abstract_txt:data in 4559) [ClassicSimilarity], result of:
            0.011049503 = score(doc=4559,freq=1.0), product of:
              0.04246969 = queryWeight, product of:
                1.0352588 = boost
                3.3302255 = idf(docFreq=4320, maxDocs=44421)
                0.012318464 = queryNorm
              0.26017386 = fieldWeight in 4559, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3302255 = idf(docFreq=4320, maxDocs=44421)
                0.078125 = fieldNorm(doc=4559)
          0.069123946 = weight(abstract_txt:drawbacks in 4559) [ClassicSimilarity], result of:
            0.069123946 = score(doc=4559,freq=1.0), product of:
              0.11444398 = queryWeight, product of:
                1.2016855 = boost
                7.731176 = idf(docFreq=52, maxDocs=44421)
                0.012318464 = queryNorm
              0.6039981 = fieldWeight in 4559, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.731176 = idf(docFreq=52, maxDocs=44421)
                0.078125 = fieldNorm(doc=4559)
          0.052455448 = weight(abstract_txt:large in 4559) [ClassicSimilarity], result of:
            0.052455448 = score(doc=4559,freq=4.0), product of:
              0.07557143 = queryWeight, product of:
                1.3809826 = boost
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.012318464 = queryNorm
              0.6941174 = fieldWeight in 4559, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.078125 = fieldNorm(doc=4559)
          0.05792843 = weight(abstract_txt:efficient in 4559) [ClassicSimilarity], result of:
            0.05792843 = score(doc=4559,freq=1.0), product of:
              0.12816766 = queryWeight, product of:
                1.798451 = boost
                5.7852654 = idf(docFreq=370, maxDocs=44421)
                0.012318464 = queryNorm
              0.45197386 = fieldWeight in 4559, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7852654 = idf(docFreq=370, maxDocs=44421)
                0.078125 = fieldNorm(doc=4559)
          0.061698735 = weight(abstract_txt:approaches in 4559) [ClassicSimilarity], result of:
            0.061698735 = score(doc=4559,freq=2.0), product of:
              0.12144749 = queryWeight, product of:
                2.144121 = boost
                4.5981455 = idf(docFreq=1215, maxDocs=44421)
                0.012318464 = queryNorm
              0.5080281 = fieldWeight in 4559, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.5981455 = idf(docFreq=1215, maxDocs=44421)
                0.078125 = fieldNorm(doc=4559)
          0.090502635 = weight(abstract_txt:single in 4559) [ClassicSimilarity], result of:
            0.090502635 = score(doc=4559,freq=2.0), product of:
              0.15678765 = queryWeight, product of:
                2.4361887 = boost
                5.2244954 = idf(docFreq=649, maxDocs=44421)
                0.012318464 = queryNorm
              0.57723063 = fieldWeight in 4559, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.2244954 = idf(docFreq=649, maxDocs=44421)
                0.078125 = fieldNorm(doc=4559)
        0.28 = coord(7/25)
    
  3. MacFarlane, A.; McCann, J.A.; Robertson, S.E.: Parallel methods for the generation of partitioned inverted files (2005) 0.08
    0.08355105 = sum of:
      0.08355105 = product of:
        0.3481294 = sum of:
          0.04842284 = weight(abstract_txt:speed in 776) [ClassicSimilarity], result of:
            0.04842284 = score(doc=776,freq=2.0), product of:
              0.08313887 = queryWeight, product of:
                1.0242275 = boost
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.012318464 = queryNorm
              0.5824332 = fieldWeight in 776, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.58948 = idf(docFreq=165, maxDocs=44421)
                0.0625 = fieldNorm(doc=776)
          0.008839603 = weight(abstract_txt:data in 776) [ClassicSimilarity], result of:
            0.008839603 = score(doc=776,freq=1.0), product of:
              0.04246969 = queryWeight, product of:
                1.0352588 = boost
                3.3302255 = idf(docFreq=4320, maxDocs=44421)
                0.012318464 = queryNorm
              0.20813909 = fieldWeight in 776, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.3302255 = idf(docFreq=4320, maxDocs=44421)
                0.0625 = fieldNorm(doc=776)
          0.027352752 = weight(abstract_txt:text in 776) [ClassicSimilarity], result of:
            0.027352752 = score(doc=776,freq=3.0), product of:
              0.06252939 = queryWeight, product of:
                1.2561789 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.012318464 = queryNorm
              0.4374383 = fieldWeight in 776, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.0625 = fieldNorm(doc=776)
          0.029673282 = weight(abstract_txt:large in 776) [ClassicSimilarity], result of:
            0.029673282 = score(doc=776,freq=2.0), product of:
              0.07557143 = queryWeight, product of:
                1.3809826 = boost
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.012318464 = queryNorm
              0.3926521 = fieldWeight in 776, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.0625 = fieldNorm(doc=776)
          0.046342745 = weight(abstract_txt:efficient in 776) [ClassicSimilarity], result of:
            0.046342745 = score(doc=776,freq=1.0), product of:
              0.12816766 = queryWeight, product of:
                1.798451 = boost
                5.7852654 = idf(docFreq=370, maxDocs=44421)
                0.012318464 = queryNorm
              0.3615791 = fieldWeight in 776, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7852654 = idf(docFreq=370, maxDocs=44421)
                0.0625 = fieldNorm(doc=776)
          0.18749818 = weight(abstract_txt:inverted in 776) [ClassicSimilarity], result of:
            0.18749818 = score(doc=776,freq=3.0), product of:
              0.22563939 = queryWeight, product of:
                2.3862548 = boost
                7.676116 = idf(docFreq=55, maxDocs=44421)
                0.012318464 = queryNorm
              0.8309639 = fieldWeight in 776, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.676116 = idf(docFreq=55, maxDocs=44421)
                0.0625 = fieldNorm(doc=776)
        0.24 = coord(6/25)
    
  4. Uratani, N.; Takeda, M.: ¬A fast string-searching algorithm for multiple patterns (1993) 0.08
    0.078040294 = sum of:
      0.078040294 = product of:
        0.48775184 = sum of:
          0.033500142 = weight(abstract_txt:text in 6274) [ClassicSimilarity], result of:
            0.033500142 = score(doc=6274,freq=2.0), product of:
              0.06252939 = queryWeight, product of:
                1.2561789 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.012318464 = queryNorm
              0.5357503 = fieldWeight in 6274, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.09375 = fieldNorm(doc=6274)
          0.06951412 = weight(abstract_txt:efficient in 6274) [ClassicSimilarity], result of:
            0.06951412 = score(doc=6274,freq=1.0), product of:
              0.12816766 = queryWeight, product of:
                1.798451 = boost
                5.7852654 = idf(docFreq=370, maxDocs=44421)
                0.012318464 = queryNorm
              0.54236865 = fieldWeight in 6274, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7852654 = idf(docFreq=370, maxDocs=44421)
                0.09375 = fieldNorm(doc=6274)
          0.076794036 = weight(abstract_txt:single in 6274) [ClassicSimilarity], result of:
            0.076794036 = score(doc=6274,freq=1.0), product of:
              0.15678765 = queryWeight, product of:
                2.4361887 = boost
                5.2244954 = idf(docFreq=649, maxDocs=44421)
                0.012318464 = queryNorm
              0.48979646 = fieldWeight in 6274, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.2244954 = idf(docFreq=649, maxDocs=44421)
                0.09375 = fieldNorm(doc=6274)
          0.30794355 = weight(abstract_txt:pass in 6274) [ClassicSimilarity], result of:
            0.30794355 = score(doc=6274,freq=1.0), product of:
              0.39573786 = queryWeight, product of:
                3.8704262 = boost
                8.30027 = idf(docFreq=29, maxDocs=44421)
                0.012318464 = queryNorm
              0.7781503 = fieldWeight in 6274, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.30027 = idf(docFreq=29, maxDocs=44421)
                0.09375 = fieldNorm(doc=6274)
        0.16 = coord(4/25)
    
  5. Chang, M.; Poon, C.K.: Efficient phrase querying with common phrase index (2008) 0.07
    0.06771081 = sum of:
      0.06771081 = product of:
        0.33855402 = sum of:
          0.01974015 = weight(abstract_txt:text in 3061) [ClassicSimilarity], result of:
            0.01974015 = score(doc=3061,freq=1.0), product of:
              0.06252939 = queryWeight, product of:
                1.2561789 = boost
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.012318464 = queryNorm
              0.3156939 = fieldWeight in 3061, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.040882 = idf(docFreq=2122, maxDocs=44421)
                0.078125 = fieldNorm(doc=3061)
          0.045427747 = weight(abstract_txt:large in 3061) [ClassicSimilarity], result of:
            0.045427747 = score(doc=3061,freq=3.0), product of:
              0.07557143 = queryWeight, product of:
                1.3809826 = boost
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.012318464 = queryNorm
              0.6011233 = fieldWeight in 3061, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.078125 = fieldNorm(doc=3061)
          0.08014253 = weight(abstract_txt:cost in 3061) [ClassicSimilarity], result of:
            0.08014253 = score(doc=3061,freq=2.0), product of:
              0.12630367 = queryWeight, product of:
                1.7853253 = boost
                5.743043 = idf(docFreq=386, maxDocs=44421)
                0.012318464 = queryNorm
              0.63452256 = fieldWeight in 3061, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.743043 = idf(docFreq=386, maxDocs=44421)
                0.078125 = fieldNorm(doc=3061)
          0.05792843 = weight(abstract_txt:efficient in 3061) [ClassicSimilarity], result of:
            0.05792843 = score(doc=3061,freq=1.0), product of:
              0.12816766 = queryWeight, product of:
                1.798451 = boost
                5.7852654 = idf(docFreq=370, maxDocs=44421)
                0.012318464 = queryNorm
              0.45197386 = fieldWeight in 3061, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7852654 = idf(docFreq=370, maxDocs=44421)
                0.078125 = fieldNorm(doc=3061)
          0.13531516 = weight(abstract_txt:inverted in 3061) [ClassicSimilarity], result of:
            0.13531516 = score(doc=3061,freq=1.0), product of:
              0.22563939 = queryWeight, product of:
                2.3862548 = boost
                7.676116 = idf(docFreq=55, maxDocs=44421)
                0.012318464 = queryNorm
              0.5996966 = fieldWeight in 3061, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.676116 = idf(docFreq=55, maxDocs=44421)
                0.078125 = fieldNorm(doc=3061)
        0.2 = coord(5/25)
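
For the content-based matches, the final score is the sum of the per-term weights, scaled by the fraction of query terms that matched. The sketch below recomputes entry 5 (doc 3061) under the same ClassicSimilarity assumptions: boosts, frequencies, document frequencies, queryNorm and fieldNorm are copied from the listing, and the helper name is hypothetical.

```python
import math

MAX_DOCS = 44421
QUERY_NORM = 0.012318464
FIELD_NORM = 0.078125  # fieldNorm(doc=3061)

# (term, boost, freq in abstract, docFreq), as shown for doc 3061.
MATCHED_TERMS = [
    ("text", 1.2561789, 1.0, 2122),
    ("large", 1.3809826, 3.0, 1420),
    ("cost", 1.7853253, 2.0, 386),
    ("efficient", 1.798451, 1.0, 370),
    ("inverted", 2.3862548, 1.0, 55),
]


def term_weight(boost, freq, doc_freq):
    """queryWeight * fieldWeight for one matched term."""
    idf = 1.0 + math.log(MAX_DOCS / (doc_freq + 1))
    query_weight = boost * idf * QUERY_NORM
    field_weight = math.sqrt(freq) * idf * FIELD_NORM
    return query_weight * field_weight


weights = {t: term_weight(b, f, df) for t, b, f, df in MATCHED_TERMS}
coord = len(MATCHED_TERMS) / 25  # 5 of the 25 query terms matched
total = coord * sum(weights.values())
print(round(total, 4))
```

The rare term "inverted" (docFreq=55) contributes the largest single weight even at freq=1, since idf enters the product twice, once through queryWeight and once through fieldWeight.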