Document (#32082)

Author
Wan, R.
Moffat, A.
Title
Block merging for off-line compression
Source
Journal of the American Society for Information Science and Technology. 58(2007) no.1, p.3-14
Year
2007
Abstract
To bound memory consumption, most compression systems provide a facility that controls the amount of data that may be processed at once, usually as a block size but sometimes as a direct megabyte limit. In this work we consider the Re-Pair mechanism of Larsson and Moffat (2000), which processes large messages as disjoint blocks to limit memory consumption. We show that the blocks emitted by Re-Pair can be postprocessed to yield further savings, and describe techniques that allow files of 500 MB or more to be compressed in a holistic manner using less than that much main memory. The block merging process we describe has the additional advantage of allowing new text to be appended to the end of the compressed file.

Similar documents (author)

  1. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 4.88
    4.8754888 = sum of:
      4.8754888 = weight(author_txt:moffat in 2648) [ClassicSimilarity], result of:
        4.8754888 = fieldWeight in 2648, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.5 = fieldNorm(doc=2648)
    
  2. Moffat, A.; Zobel, J.: Self-indexing inverted files for fast text retrieval (1996) 4.88
    4.8754888 = sum of:
      4.8754888 = weight(author_txt:moffat in 9) [ClassicSimilarity], result of:
        4.8754888 = fieldWeight in 9, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.5 = fieldNorm(doc=9)
    
  3. Moffat, A.; Isal, R.Y.K.: Word-based text compression using the Burrows-Wheeler transform (2005) 4.88
    4.8754888 = sum of:
      4.8754888 = weight(author_txt:moffat in 1044) [ClassicSimilarity], result of:
        4.8754888 = fieldWeight in 1044, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.5 = fieldNorm(doc=1044)
    
  4. Witten, I.H.; Moffat, A.; Bell, T.C.: Managing gigabytes : compressing and indexing documents and images (1994) 3.66
    3.6566167 = sum of:
      3.6566167 = weight(author_txt:moffat in 3083) [ClassicSimilarity], result of:
        3.6566167 = fieldWeight in 3083, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.375 = fieldNorm(doc=3083)
    
  5. Bell, T.C.; Moffat, A.; Nevill-Manning, C.G.; Witten, I.H.; Zobel, J.: Data compression in full-text retrieval systems (1993) 2.44
    2.4377444 = sum of:
      2.4377444 = weight(author_txt:moffat in 5643) [ClassicSimilarity], result of:
        2.4377444 = fieldWeight in 5643, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.7509775 = idf(docFreq=6, maxDocs=44218)
          0.25 = fieldNorm(doc=5643)
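
The author scores above can be checked by hand. The sketch below assumes the default Lucene ClassicSimilarity formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)), which reproduce the idf value shown in the listing. The function names are illustrative and are not part of any library.

```python
import math

def classic_tf(freq: float) -> float:
    # Assumed ClassicSimilarity term-frequency factor: square root of the raw count.
    return math.sqrt(freq)

def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Assumed ClassicSimilarity inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight = tf * idf * fieldNorm, as in the explain trees above.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm

if __name__ == "__main__":
    # fieldNorm values copied from the listing; shorter author fields get larger norms.
    for field_norm in (0.5, 0.375, 0.25):
        weight = field_weight(freq=1.0, doc_freq=6, max_docs=44218, field_norm=field_norm)
        print(f"fieldNorm={field_norm}: fieldWeight={weight:.7f}")
```

Because the author match has no query-side factors in these trees, the only thing separating the five results is fieldNorm, which shrinks as the number of authors on the record grows. The printed weights should agree with the listed scores to about four decimal places; the small remainder comes from Lucene's single-precision arithmetic.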
    

Similar documents (content)

  1. Moffat, A.; Isal, R.Y.K.: Word-based text compression using the Burrows-Wheeler transform (2005) 0.20
    0.19950853 = sum of:
      0.19950853 = product of:
        0.9975426 = sum of:
          0.060713645 = weight(abstract_txt:mechanism in 1044) [ClassicSimilarity], result of:
            0.060713645 = score(doc=1044,freq=2.0), product of:
              0.08699036 = queryWeight, product of:
                6.31699 = idf(docFreq=216, maxDocs=44218)
                0.013770856 = queryNorm
              0.69793534 = fieldWeight in 1044, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.31699 = idf(docFreq=216, maxDocs=44218)
                0.078125 = fieldNorm(doc=1044)
          0.011328367 = weight(abstract_txt:that in 1044) [ClassicSimilarity], result of:
            0.011328367 = score(doc=1044,freq=1.0), product of:
              0.06119629 = queryWeight, product of:
                1.875478 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.013770856 = queryNorm
              0.18511525 = fieldWeight in 1044, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.078125 = fieldNorm(doc=1044)
          0.25428435 = weight(abstract_txt:compression in 1044) [ClassicSimilarity], result of:
            0.25428435 = score(doc=1044,freq=3.0), product of:
              0.24877472 = queryWeight, product of:
                2.391567 = boost
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.013770856 = queryNorm
              1.022147 = fieldWeight in 1044, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.078125 = fieldNorm(doc=1044)
          0.15598384 = weight(abstract_txt:blocks in 1044) [ClassicSimilarity], result of:
            0.15598384 = score(doc=1044,freq=1.0), product of:
              0.25903192 = queryWeight, product of:
                2.4403722 = boost
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.013770856 = queryNorm
              0.60217994 = fieldWeight in 1044, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.078125 = fieldNorm(doc=1044)
          0.51523244 = weight(abstract_txt:block in 1044) [ClassicSimilarity], result of:
            0.51523244 = score(doc=1044,freq=4.0), product of:
              0.41429794 = queryWeight, product of:
                3.77991 = boost
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.013770856 = queryNorm
              1.2436278 = fieldWeight in 1044, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.078125 = fieldNorm(doc=1044)
        0.2 = coord(5/25)
    
  2. Moura, E.S. de; Fernandes, D.; Ribeiro-Neto, B.; Silva, A.S. da; Gonçalves, M.A.: Using structural information to improve search in Web collections (2010) 0.09
    0.090653785 = sum of:
      0.090653785 = product of:
        0.7554482 = sum of:
          0.019621305 = weight(abstract_txt:that in 4119) [ClassicSimilarity], result of:
            0.019621305 = score(doc=4119,freq=3.0), product of:
              0.06119629 = queryWeight, product of:
                1.875478 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.013770856 = queryNorm
              0.320629 = fieldWeight in 4119, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.22059444 = weight(abstract_txt:blocks in 4119) [ClassicSimilarity], result of:
            0.22059444 = score(doc=4119,freq=2.0), product of:
              0.25903192 = queryWeight, product of:
                2.4403722 = boost
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.013770856 = queryNorm
              0.851611 = fieldWeight in 4119, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
          0.51523244 = weight(abstract_txt:block in 4119) [ClassicSimilarity], result of:
            0.51523244 = score(doc=4119,freq=4.0), product of:
              0.41429794 = queryWeight, product of:
                3.77991 = boost
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.013770856 = queryNorm
              1.2436278 = fieldWeight in 4119, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.078125 = fieldNorm(doc=4119)
        0.12 = coord(3/25)
    
  3. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 0.09
    0.08832496 = sum of:
      0.08832496 = product of:
        0.552031 = sum of:
          0.016020728 = weight(abstract_txt:that in 2648) [ClassicSimilarity], result of:
            0.016020728 = score(doc=2648,freq=2.0), product of:
              0.06119629 = queryWeight, product of:
                1.875478 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.013770856 = queryNorm
              0.26179248 = fieldWeight in 2648, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
          0.14681114 = weight(abstract_txt:compression in 2648) [ClassicSimilarity], result of:
            0.14681114 = score(doc=2648,freq=1.0), product of:
              0.24877472 = queryWeight, product of:
                2.391567 = boost
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.013770856 = queryNorm
              0.5901369 = fieldWeight in 2648, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.5537524 = idf(docFreq=62, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
          0.24209902 = weight(abstract_txt:compressed in 2648) [ClassicSimilarity], result of:
            0.24209902 = score(doc=2648,freq=1.0), product of:
              0.34723935 = queryWeight, product of:
                2.8254907 = boost
                8.924298 = idf(docFreq=15, maxDocs=44218)
                0.013770856 = queryNorm
              0.6972108 = fieldWeight in 2648, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.924298 = idf(docFreq=15, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
          0.1471001 = weight(abstract_txt:memory in 2648) [ClassicSimilarity], result of:
            0.1471001 = score(doc=2648,freq=1.0), product of:
              0.2851495 = queryWeight, product of:
                3.1358938 = boost
                6.603137 = idf(docFreq=162, maxDocs=44218)
                0.013770856 = queryNorm
              0.5158701 = fieldWeight in 2648, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.603137 = idf(docFreq=162, maxDocs=44218)
                0.078125 = fieldNorm(doc=2648)
        0.16 = coord(4/25)
    
  4. Fersini, E.; Messina, E.; Archetti, F.: Enhancing web page classification through image-block importance analysis (2008) 0.09
    0.08788763 = sum of:
      0.08788763 = product of:
        0.73239696 = sum of:
          0.016020728 = weight(abstract_txt:that in 2102) [ClassicSimilarity], result of:
            0.016020728 = score(doc=2102,freq=2.0), product of:
              0.06119629 = queryWeight, product of:
                1.875478 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.013770856 = queryNorm
              0.26179248 = fieldWeight in 2102, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.078125 = fieldNorm(doc=2102)
          0.27017194 = weight(abstract_txt:blocks in 2102) [ClassicSimilarity], result of:
            0.27017194 = score(doc=2102,freq=3.0), product of:
              0.25903192 = queryWeight, product of:
                2.4403722 = boost
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.013770856 = queryNorm
              1.0430063 = fieldWeight in 2102, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.078125 = fieldNorm(doc=2102)
          0.44620433 = weight(abstract_txt:block in 2102) [ClassicSimilarity], result of:
            0.44620433 = score(doc=2102,freq=3.0), product of:
              0.41429794 = queryWeight, product of:
                3.77991 = boost
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.013770856 = queryNorm
              1.0770131 = fieldWeight in 2102, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.078125 = fieldNorm(doc=2102)
        0.12 = coord(3/25)
    
  5. Wan, X.; Yang, J.; Xiao, J.: Towards a unified approach to document similarity search using manifold-ranking of blocks (2008) 0.08
    0.07740702 = sum of:
      0.07740702 = product of:
        0.6450585 = sum of:
          0.0090626925 = weight(abstract_txt:that in 2081) [ClassicSimilarity], result of:
            0.0090626925 = score(doc=2081,freq=1.0), product of:
              0.06119629 = queryWeight, product of:
                1.875478 = boost
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.013770856 = queryNorm
              0.1480922 = fieldWeight in 2081, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3694751 = idf(docFreq=11241, maxDocs=44218)
                0.0625 = fieldNorm(doc=2081)
          0.27903235 = weight(abstract_txt:blocks in 2081) [ClassicSimilarity], result of:
            0.27903235 = score(doc=2081,freq=5.0), product of:
              0.25903192 = queryWeight, product of:
                2.4403722 = boost
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.013770856 = queryNorm
              1.0772122 = fieldWeight in 2081, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.7079034 = idf(docFreq=53, maxDocs=44218)
                0.0625 = fieldNorm(doc=2081)
          0.35696346 = weight(abstract_txt:block in 2081) [ClassicSimilarity], result of:
            0.35696346 = score(doc=2081,freq=3.0), product of:
              0.41429794 = queryWeight, product of:
                3.77991 = boost
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.013770856 = queryNorm
              0.86161053 = fieldWeight in 2081, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.9592175 = idf(docFreq=41, maxDocs=44218)
                0.0625 = fieldNorm(doc=2081)
        0.12 = coord(3/25)
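
The content scores combine more factors than the author scores. As a rough check, the sketch below recomputes the first content result (doc 1044) from the values in its explain tree, assuming each term score is queryWeight * fieldWeight, with queryWeight = boost * idf * queryNorm and fieldWeight = sqrt(freq) * idf * fieldNorm, and that the sum is then scaled by coord (matched terms over 25 query terms). All names are illustrative, and the idf formula is the assumed ClassicSimilarity default.

```python
import math

MAX_DOCS = 44218
QUERY_NORM = 0.013770856   # copied from the listing
FIELD_NORM = 0.078125      # fieldNorm(doc=1044), copied from the listing
QUERY_TERMS = 25           # denominator of coord(5/25)

def idf(doc_freq: int, max_docs: int = MAX_DOCS) -> float:
    # Assumed ClassicSimilarity idf.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(freq: float, doc_freq: int, boost: float = 1.0) -> float:
    # score = queryWeight * fieldWeight for a single matching term.
    term_idf = idf(doc_freq)
    query_weight = boost * term_idf * QUERY_NORM
    field_weight = math.sqrt(freq) * term_idf * FIELD_NORM
    return query_weight * field_weight

# (term, freq in doc 1044, docFreq, boost), all read off the explain tree.
# "mechanism" has no boost line in the listing, so 1.0 is assumed.
MATCHES = [
    ("mechanism", 2.0, 216, 1.0),
    ("that", 1.0, 11241, 1.875478),
    ("compression", 3.0, 62, 2.391567),
    ("blocks", 1.0, 53, 2.4403722),
    ("block", 4.0, 41, 3.77991),
]

def document_score(matches=MATCHES) -> float:
    # Sum the per-term scores, then scale by the fraction of query terms matched.
    total = sum(term_score(freq, doc_freq, boost) for _, freq, doc_freq, boost in matches)
    coord = len(matches) / QUERY_TERMS
    return total * coord

if __name__ == "__main__":
    for term, freq, doc_freq, boost in MATCHES:
        print(f"{term:12s} {term_score(freq, doc_freq, boost):.6f}")
    print(f"total        {document_score():.6f}")
```

The coord factor explains why doc 1044 outranks the others despite similar per-term sums: it matches five of the 25 query terms, whereas the remaining results match three or four.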