Document (#33030)

Author
Carterette, B.
Can, F.
Title
Comparing inverted files and signature files for searching a large lexicon
Source
Information processing and management. 41(2005) no.3, pp. 613-634
Year
2005
Abstract
Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparison of the two for searching for partially-specified queries in a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to a smaller size than an inverted file if each n-gram sets only one bit in the term signature. With a signature width less than half the number of unique n-grams in the lexicon, the signature file method is about as fast as the inverted file method, and significantly smaller. Greater flexibility in memory usage and faster index generation time make signature files appropriate for searching large lexicons or other collections in an environment where memory is at a premium.
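
The indexing scheme summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `BitSlicedSignatureFile`, the CRC32-based bit assignment, the bigram size, the `*` wildcard syntax, and the tiny lexicon are all assumptions made for illustration. Each n-gram of a term sets exactly one bit in that term's signature, the signatures are stored as bit slices (one bit vector over all terms per signature position), and a partially specified query is answered by ANDing the slices of its n-grams and then verifying the surviving candidates, since hashing can produce false matches.

```python
import zlib
from fnmatch import fnmatchcase


def ngrams(text, n=2):
    """Return the set of character n-grams of text (empty if text is shorter than n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def bit_position(gram, width):
    """Map an n-gram to a single signature bit (assumed hash: CRC32 modulo width)."""
    return zlib.crc32(gram.encode("utf-8")) % width


class BitSlicedSignatureFile:
    """Hypothetical bit-sliced signature file over a lexicon held in memory."""

    def __init__(self, lexicon, width=64, n=2):
        self.terms = list(lexicon)
        self.width = width
        self.n = n
        # slices[p] is an integer used as a bit vector: bit i is set
        # when term i has bit p set in its signature.
        self.slices = [0] * width
        for term_id, term in enumerate(self.terms):
            for gram in ngrams(term, n):
                self.slices[bit_position(gram, width)] |= 1 << term_id

    def search(self, pattern):
        """Return the terms matching a pattern whose only wildcard is '*'."""
        candidates = (1 << len(self.terms)) - 1  # start with every term
        for fragment in pattern.split("*"):
            for gram in ngrams(fragment, self.n):
                candidates &= self.slices[bit_position(gram, self.width)]
        # Verification step: drop false matches caused by hash collisions
        # or by n-grams that occur in the wrong position.
        return [
            term
            for term_id, term in enumerate(self.terms)
            if (candidates >> term_id) & 1 and fnmatchcase(term, pattern)
        ]


lexicon = ["labor", "labour", "label", "harbor", "table"]
index = BitSlicedSignatureFile(lexicon, width=64, n=2)
print(index.search("lab*r"))
print(index.search("*bor"))
```

In this sketch a smaller `width` shrinks the index but lets more n-grams share a bit, so more candidates reach the verification step; that is the size versus speed trade-off the abstract refers to.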

Similar documents (content)

  1. Kelledy, F.; Smeaton, A.F.: Signature files and beyond (1996) 0.45
    0.45436 = sum of:
      0.45436 = product of:
        1.6227143 = sum of:
          0.012780042 = weight(abstract_txt:than in 42) [ClassicSimilarity], result of:
            0.012780042 = score(doc=42,freq=1.0), product of:
              0.042023 = queryWeight, product of:
                1.2242984 = boost
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.008817482 = queryNorm
              0.30412018 = fieldWeight in 42, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.078125 = fieldNorm(doc=42)
          0.025591949 = weight(abstract_txt:searching in 42) [ClassicSimilarity], result of:
            0.025591949 = score(doc=42,freq=1.0), product of:
              0.07642431 = queryWeight, product of:
                2.0221117 = boost
                4.2862926 = idf(docFreq=1660, maxDocs=44421)
                0.008817482 = queryNorm
              0.3348666 = fieldWeight in 42, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.2862926 = idf(docFreq=1660, maxDocs=44421)
                0.078125 = fieldNorm(doc=42)
          0.028490277 = weight(abstract_txt:large in 42) [ClassicSimilarity], result of:
            0.028490277 = score(doc=42,freq=1.0), product of:
              0.08209065 = queryWeight, product of:
                2.0957344 = boost
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.008817482 = queryNorm
              0.3470587 = fieldWeight in 42, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.078125 = fieldNorm(doc=42)
          0.106659845 = weight(abstract_txt:file in 42) [ClassicSimilarity], result of:
            0.106659845 = score(doc=42,freq=2.0), product of:
              0.1729018 = queryWeight, product of:
                3.5120323 = boost
                5.58337 = idf(docFreq=453, maxDocs=44421)
                0.008817482 = queryNorm
              0.6168811 = fieldWeight in 42, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.58337 = idf(docFreq=453, maxDocs=44421)
                0.078125 = fieldNorm(doc=42)
          0.20327769 = weight(abstract_txt:files in 42) [ClassicSimilarity], result of:
            0.20327769 = score(doc=42,freq=4.0), product of:
              0.2272403 = queryWeight, product of:
                4.5014915 = boost
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.008817482 = queryNorm
              0.8945495 = fieldWeight in 42, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.078125 = fieldNorm(doc=42)
          0.1959843 = weight(abstract_txt:inverted in 42) [ClassicSimilarity], result of:
            0.1959843 = score(doc=42,freq=1.0), product of:
              0.32680577 = queryWeight, product of:
                4.828404 = boost
                7.676116 = idf(docFreq=55, maxDocs=44421)
                0.008817482 = queryNorm
              0.5996966 = fieldWeight in 42, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.676116 = idf(docFreq=55, maxDocs=44421)
                0.078125 = fieldNorm(doc=42)
          1.0499301 = weight(abstract_txt:signature in 42) [ClassicSimilarity], result of:
            1.0499301 = score(doc=42,freq=5.0), product of:
              0.7051342 = queryWeight, product of:
                9.382394 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.008817482 = queryNorm
              1.4889791 = fieldWeight in 42, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.078125 = fieldNorm(doc=42)
        0.28 = coord(7/25)
    
  2. Lam, W.; Wong, K.-F.; Wong, C.-Y.: Chinese document indexing based on new partitioned signature file : model and evaluation (2001) 0.38
    0.37547818 = sum of:
      0.37547818 = product of:
        1.3409935 = sum of:
          0.03267258 = weight(abstract_txt:faster in 1303) [ClassicSimilarity], result of:
            0.03267258 = score(doc=1303,freq=1.0), product of:
              0.07236321 = queryWeight, product of:
                1.1360245 = boost
                7.2241306 = idf(docFreq=87, maxDocs=44421)
                0.008817482 = queryNorm
              0.45150816 = fieldWeight in 1303, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.2241306 = idf(docFreq=87, maxDocs=44421)
                0.0625 = fieldNorm(doc=1303)
          0.010224034 = weight(abstract_txt:than in 1303) [ClassicSimilarity], result of:
            0.010224034 = score(doc=1303,freq=1.0), product of:
              0.042023 = queryWeight, product of:
                1.2242984 = boost
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.008817482 = queryNorm
              0.24329615 = fieldWeight in 1303, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                3.8927383 = idf(docFreq=2461, maxDocs=44421)
                0.0625 = fieldNorm(doc=1303)
          0.03156302 = weight(abstract_txt:method in 1303) [ClassicSimilarity], result of:
            0.03156302 = score(doc=1303,freq=4.0), product of:
              0.056126926 = queryWeight, product of:
                1.4149119 = boost
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.008817482 = queryNorm
              0.5623508 = fieldWeight in 1303, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                4.4988065 = idf(docFreq=1342, maxDocs=44421)
                0.0625 = fieldNorm(doc=1303)
          0.02279222 = weight(abstract_txt:large in 1303) [ClassicSimilarity], result of:
            0.02279222 = score(doc=1303,freq=1.0), product of:
              0.08209065 = queryWeight, product of:
                2.0957344 = boost
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.008817482 = queryNorm
              0.27764696 = fieldWeight in 1303, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.0625 = fieldNorm(doc=1303)
          0.13491522 = weight(abstract_txt:file in 1303) [ClassicSimilarity], result of:
            0.13491522 = score(doc=1303,freq=5.0), product of:
              0.1729018 = queryWeight, product of:
                3.5120323 = boost
                5.58337 = idf(docFreq=453, maxDocs=44421)
                0.008817482 = queryNorm
              0.7802997 = fieldWeight in 1303, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                5.58337 = idf(docFreq=453, maxDocs=44421)
                0.0625 = fieldNorm(doc=1303)
          0.114991225 = weight(abstract_txt:files in 1303) [ClassicSimilarity], result of:
            0.114991225 = score(doc=1303,freq=2.0), product of:
              0.2272403 = queryWeight, product of:
                4.5014915 = boost
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.008817482 = queryNorm
              0.5060336 = fieldWeight in 1303, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.0625 = fieldNorm(doc=1303)
          0.9938352 = weight(abstract_txt:signature in 1303) [ClassicSimilarity], result of:
            0.9938352 = score(doc=1303,freq=7.0), product of:
              0.7051342 = queryWeight, product of:
                9.382394 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.008817482 = queryNorm
              1.409427 = fieldWeight in 1303, product of:
                2.6457512 = tf(freq=7.0), with freq of:
                  7.0 = termFreq=7.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.0625 = fieldNorm(doc=1303)
        0.28 = coord(7/25)
    
  3. Robertson, A.M.; Willett, P.: Applications of n-grams in textual information systems (1998) 0.28
    0.2773791 = sum of:
      0.2773791 = product of:
        1.3868954 = sum of:
          0.07592244 = weight(abstract_txt:gram in 5715) [ClassicSimilarity], result of:
            0.07592244 = score(doc=5715,freq=1.0), product of:
              0.08742124 = queryWeight, product of:
                1.24864 = boost
                7.9402676 = idf(docFreq=42, maxDocs=44421)
                0.008817482 = queryNorm
              0.86846673 = fieldWeight in 5715, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.9402676 = idf(docFreq=42, maxDocs=44421)
                0.109375 = fieldNorm(doc=5715)
          0.23694038 = weight(abstract_txt:grams in 5715) [ClassicSimilarity], result of:
            0.23694038 = score(doc=5715,freq=2.0), product of:
              0.18669365 = queryWeight, product of:
                2.5805278 = boost
                8.20496 = idf(docFreq=32, maxDocs=44421)
                0.008817482 = queryNorm
              1.26914 = fieldWeight in 5715, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                8.20496 = idf(docFreq=32, maxDocs=44421)
                0.109375 = fieldNorm(doc=5715)
          0.14229438 = weight(abstract_txt:files in 5715) [ClassicSimilarity], result of:
            0.14229438 = score(doc=5715,freq=1.0), product of:
              0.2272403 = queryWeight, product of:
                4.5014915 = boost
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.008817482 = queryNorm
              0.62618464 = fieldWeight in 5715, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.109375 = fieldNorm(doc=5715)
          0.274378 = weight(abstract_txt:inverted in 5715) [ClassicSimilarity], result of:
            0.274378 = score(doc=5715,freq=1.0), product of:
              0.32680577 = queryWeight, product of:
                4.828404 = boost
                7.676116 = idf(docFreq=55, maxDocs=44421)
                0.008817482 = queryNorm
              0.8395752 = fieldWeight in 5715, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.676116 = idf(docFreq=55, maxDocs=44421)
                0.109375 = fieldNorm(doc=5715)
          0.6573602 = weight(abstract_txt:signature in 5715) [ClassicSimilarity], result of:
            0.6573602 = score(doc=5715,freq=1.0), product of:
              0.7051342 = queryWeight, product of:
                9.382394 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.008817482 = queryNorm
              0.93224835 = fieldWeight in 5715, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.109375 = fieldNorm(doc=5715)
        0.2 = coord(5/25)
    
  4. Lee, D.L.: Massive parallelism on the hybrid text-retrieval machine (1995) 0.19
    0.19400126 = sum of:
      0.19400126 = product of:
        1.2125078 = sum of:
          0.03418833 = weight(abstract_txt:large in 4143) [ClassicSimilarity], result of:
            0.03418833 = score(doc=4143,freq=1.0), product of:
              0.08209065 = queryWeight, product of:
                2.0957344 = boost
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.008817482 = queryNorm
              0.41647044 = fieldWeight in 4143, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                4.4423513 = idf(docFreq=1420, maxDocs=44421)
                0.09375 = fieldNorm(doc=4143)
          0.11188882 = weight(abstract_txt:memory in 4143) [ClassicSimilarity], result of:
            0.11188882 = score(doc=4143,freq=1.0), product of:
              0.18095319 = queryWeight, product of:
                3.1115193 = boost
                6.595522 = idf(docFreq=164, maxDocs=44421)
                0.008817482 = queryNorm
              0.6183302 = fieldWeight in 4143, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.595522 = idf(docFreq=164, maxDocs=44421)
                0.09375 = fieldNorm(doc=4143)
          0.09050388 = weight(abstract_txt:file in 4143) [ClassicSimilarity], result of:
            0.09050388 = score(doc=4143,freq=1.0), product of:
              0.1729018 = queryWeight, product of:
                3.5120323 = boost
                5.58337 = idf(docFreq=453, maxDocs=44421)
                0.008817482 = queryNorm
              0.52344096 = fieldWeight in 4143, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.58337 = idf(docFreq=453, maxDocs=44421)
                0.09375 = fieldNorm(doc=4143)
          0.9759268 = weight(abstract_txt:signature in 4143) [ClassicSimilarity], result of:
            0.9759268 = score(doc=4143,freq=3.0), product of:
              0.7051342 = queryWeight, product of:
                9.382394 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.008817482 = queryNorm
              1.3840299 = fieldWeight in 4143, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.09375 = fieldNorm(doc=4143)
        0.16 = coord(4/25)
    
  5. Lee, D.L.; Ren, L.: Document ranking on weight-partitioned signature files (1996) 0.19
    0.18754686 = sum of:
      0.18754686 = product of:
        1.5628905 = sum of:
          0.18100776 = weight(abstract_txt:file in 3417) [ClassicSimilarity], result of:
            0.18100776 = score(doc=3417,freq=4.0), product of:
              0.1729018 = queryWeight, product of:
                3.5120323 = boost
                5.58337 = idf(docFreq=453, maxDocs=44421)
                0.008817482 = queryNorm
              1.0468819 = fieldWeight in 3417, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                5.58337 = idf(docFreq=453, maxDocs=44421)
                0.09375 = fieldNorm(doc=3417)
          0.121966615 = weight(abstract_txt:files in 3417) [ClassicSimilarity], result of:
            0.121966615 = score(doc=3417,freq=1.0), product of:
              0.2272403 = queryWeight, product of:
                4.5014915 = boost
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.008817482 = queryNorm
              0.5367297 = fieldWeight in 3417, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                5.7251167 = idf(docFreq=393, maxDocs=44421)
                0.09375 = fieldNorm(doc=3417)
          1.2599162 = weight(abstract_txt:signature in 3417) [ClassicSimilarity], result of:
            1.2599162 = score(doc=3417,freq=5.0), product of:
              0.7051342 = queryWeight, product of:
                9.382394 = boost
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.008817482 = queryNorm
              1.786775 = fieldWeight in 3417, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                8.523414 = idf(docFreq=23, maxDocs=44421)
                0.09375 = fieldNorm(doc=3417)
        0.12 = coord(3/25)
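
The explain trees above can be recomputed by hand. The following sketch reproduces the score of entry 5 from the values shown in its listing; the function names are illustrative assumptions, and the formulas (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1)), and a final coord factor) are inferred from being consistent with the printed figures rather than taken from the catalog's configuration.

```python
import math

MAX_DOCS = 44421          # maxDocs as printed in the listing
QUERY_NORM = 0.008817482  # queryNorm as printed in the listing


def idf(doc_freq, max_docs=MAX_DOCS):
    """Inverse document frequency, matching e.g. idf(docFreq=23) = 8.523414."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, boost, field_norm, query_norm=QUERY_NORM):
    """Score of one query term: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


def document_score(terms, field_norm, matched, total):
    """Sum the per-term scores and scale by coord(matched/total)."""
    raw = sum(term_score(freq, doc_freq, boost, field_norm)
              for freq, doc_freq, boost in terms)
    return raw * matched / total


# (termFreq, docFreq, boost) for "file", "files", "signature" in doc 3417.
entry_5_terms = [
    (4.0, 453, 3.5120323),
    (1.0, 393, 4.5014915),
    (5.0, 23, 9.382394),
]
score = document_score(entry_5_terms, field_norm=0.09375, matched=3, total=25)
print(round(score, 4))
```

The result agrees with the 0.18754686 shown for entry 5 up to rounding, since the listing uses single-precision values. The coord factor explains the ranking: entries 4 and 5 have high raw sums but match only 4 and 3 of the 25 query terms.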