Document (#32082)

Author
Wan, R.
Moffat, A.
Title
Block merging for off-line compression
Source
Journal of the American Society for Information Science and Technology. 58(2007) no.1, pp. 3-14
Year
2007
Abstract
To bound memory consumption, most compression systems provide a facility that controls the amount of data that may be processed at once - usually as a block size, but sometimes as a direct megabyte limit. In this work we consider the Re-Pair mechanism of Larsson and Moffat (2000), which processes large messages as disjoint blocks to limit memory consumption. We show that the blocks emitted by Re-Pair can be postprocessed to yield further savings, and describe techniques that allow files of 500 MB or more to be compressed in a holistic manner using less than that much main memory. The block merging process we describe has the additional advantage of allowing new text to be appended to the end of the compressed file.

Similar documents (author)

  1. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 4.81
    4.811013 = sum of:
      4.811013 = weight(author_txt:moffat in 2716) [ClassicSimilarity], result of:
        4.811013 = fieldWeight in 2716, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.5 = fieldNorm(doc=2716)
    
  2. Moffat, A.; Zobel, J.: Self-indexing inverted files for fast text retrieval (1996) 4.81
    4.811013 = sum of:
      4.811013 = weight(author_txt:moffat in 1009) [ClassicSimilarity], result of:
        4.811013 = fieldWeight in 1009, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.5 = fieldNorm(doc=1009)
    
  3. Moffat, A.; Isal, R.Y.K.: Word-based text compression using the Burrows-Wheeler transform (2005) 4.81
    4.811013 = sum of:
      4.811013 = weight(author_txt:moffat in 2044) [ClassicSimilarity], result of:
        4.811013 = fieldWeight in 2044, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.5 = fieldNorm(doc=2044)
    
  4. Moffat, A.; Mackenzie, J.: How much freedom does an effectiveness metric really have? (2024) 4.81
    4.811013 = sum of:
      4.811013 = weight(author_txt:moffat in 2289) [ClassicSimilarity], result of:
        4.811013 = fieldWeight in 2289, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.5 = fieldNorm(doc=2289)
    
  5. Witten, I.H.; Moffat, A.; Bell, T.C.: Managing gigabytes : compressing and indexing documents and images (1994) 3.61
    3.60826 = sum of:
      3.60826 = weight(author_txt:moffat in 4083) [ClassicSimilarity], result of:
        3.60826 = fieldWeight in 4083, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          9.622026 = idf(docFreq=7, maxDocs=44421)
          0.375 = fieldNorm(doc=4083)
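
The author scores above follow one small piece of arithmetic: each fieldWeight is the product of tf, idf, and fieldNorm. The sketch below recomputes it, assuming the usual ClassicSimilarity definitions tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)). The function names are illustrative and not part of the record, and the fieldNorm values are copied from the listing rather than derived.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, assuming 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq: float) -> float:
    """Term-frequency factor, assuming the square root of the raw frequency."""
    return math.sqrt(freq)


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    """fieldWeight = tf * idf * fieldNorm, as laid out in the explain trees."""
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Inputs taken from the listing: author term "moffat", docFreq=7, maxDocs=44421.
w_doc_2716 = field_weight(1.0, 7, 44421, 0.5)    # listed fieldNorm 0.5
w_doc_4083 = field_weight(1.0, 7, 44421, 0.375)  # listed fieldNorm 0.375

print(round(w_doc_2716, 4), round(w_doc_4083, 4))
```

Under these assumptions the only input that differs between the first four entries and the fifth is fieldNorm, which would account for the lower score of the fifth entry.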
    

Similar documents (content)

  1. Moffat, A.; Isal, R.Y.K.: Word-based text compression using the Burrows-Wheeler transform (2005) 0.20
    0.19970566 = sum of:
      0.19970566 = product of:
        0.9985283 = sum of:
          0.06062164 = weight(abstract_txt:mechanism in 2044) [ClassicSimilarity], result of:
            0.06062164 = score(doc=2044,freq=2.0), product of:
              0.08692174 = queryWeight, product of:
                6.312396 = idf(docFreq=218, maxDocs=44421)
                0.013770007 = queryNorm
              0.69742787 = fieldWeight in 2044, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                6.312396 = idf(docFreq=218, maxDocs=44421)
                0.078125 = fieldNorm(doc=2044)
          0.011270875 = weight(abstract_txt:that in 2044) [ClassicSimilarity], result of:
            0.011270875 = score(doc=2044,freq=1.0), product of:
              0.061002605 = queryWeight, product of:
                1.8732468 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.013770007 = queryNorm
              0.18476056 = fieldWeight in 2044, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.078125 = fieldNorm(doc=2044)
          0.2549169 = weight(abstract_txt:compression in 2044) [ClassicSimilarity], result of:
            0.2549169 = score(doc=2044,freq=3.0), product of:
              0.24924241 = queryWeight, product of:
                2.3947587 = boost
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.013770007 = queryNorm
              1.022767 = fieldWeight in 2044, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.078125 = fieldNorm(doc=2044)
          0.15525271 = weight(abstract_txt:blocks in 2044) [ClassicSimilarity], result of:
            0.15525271 = score(doc=2044,freq=1.0), product of:
              0.25827917 = queryWeight, product of:
                2.4377856 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.013770007 = queryNorm
              0.60110426 = fieldWeight in 2044, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.078125 = fieldNorm(doc=2044)
          0.5164662 = weight(abstract_txt:block in 2044) [ClassicSimilarity], result of:
            0.5164662 = score(doc=2044,freq=4.0), product of:
              0.41505116 = queryWeight, product of:
                3.7848375 = boost
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.013770007 = queryNorm
              1.2443434 = fieldWeight in 2044, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.078125 = fieldNorm(doc=2044)
        0.2 = coord(5/25)
    
  2. Moura, E.S. de; Fernandes, D.; Ribeiro-Neto, B.; Silva, A.S. da; Gonçalves, M.A.: Using structural information to improve search in Web collections (2010) 0.09
    0.09066581 = sum of:
      0.09066581 = product of:
        0.7555484 = sum of:
          0.019521728 = weight(abstract_txt:that in 119) [ClassicSimilarity], result of:
            0.019521728 = score(doc=119,freq=3.0), product of:
              0.061002605 = queryWeight, product of:
                1.8732468 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.013770007 = queryNorm
              0.32001466 = fieldWeight in 119, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.078125 = fieldNorm(doc=119)
          0.21956047 = weight(abstract_txt:blocks in 119) [ClassicSimilarity], result of:
            0.21956047 = score(doc=119,freq=2.0), product of:
              0.25827917 = queryWeight, product of:
                2.4377856 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.013770007 = queryNorm
              0.8500897 = fieldWeight in 119, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.078125 = fieldNorm(doc=119)
          0.5164662 = weight(abstract_txt:block in 119) [ClassicSimilarity], result of:
            0.5164662 = score(doc=119,freq=4.0), product of:
              0.41505116 = queryWeight, product of:
                3.7848375 = boost
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.013770007 = queryNorm
              1.2443434 = fieldWeight in 119, product of:
                2.0 = tf(freq=4.0), with freq of:
                  4.0 = termFreq=4.0
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.078125 = fieldNorm(doc=119)
        0.12 = coord(3/25)
    
  3. Moffat, A.; Bell, T.A.H.: In situ generation of compressed inverted files (1995) 0.09
    0.08839018 = sum of:
      0.08839018 = product of:
        0.5524386 = sum of:
          0.015939426 = weight(abstract_txt:that in 2716) [ClassicSimilarity], result of:
            0.015939426 = score(doc=2716,freq=2.0), product of:
              0.061002605 = queryWeight, product of:
                1.8732468 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.013770007 = queryNorm
              0.2612909 = fieldWeight in 2716, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
          0.14717634 = weight(abstract_txt:compression in 2716) [ClassicSimilarity], result of:
            0.14717634 = score(doc=2716,freq=1.0), product of:
              0.24924241 = queryWeight, product of:
                2.3947587 = boost
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.013770007 = queryNorm
              0.59049475 = fieldWeight in 2716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                7.558333 = idf(docFreq=62, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
          0.24263348 = weight(abstract_txt:compressed in 2716) [ClassicSimilarity], result of:
            0.24263348 = score(doc=2716,freq=1.0), product of:
              0.34782737 = queryWeight, product of:
                2.8289983 = boost
                8.928879 = idf(docFreq=15, maxDocs=44421)
                0.013770007 = queryNorm
              0.69756866 = fieldWeight in 2716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                8.928879 = idf(docFreq=15, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
          0.14668939 = weight(abstract_txt:memory in 2716) [ClassicSimilarity], result of:
            0.14668939 = score(doc=2716,freq=1.0), product of:
              0.28468168 = queryWeight, product of:
                3.134557 = boost
                6.595522 = idf(docFreq=164, maxDocs=44421)
                0.013770007 = queryNorm
              0.5152751 = fieldWeight in 2716, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                6.595522 = idf(docFreq=164, maxDocs=44421)
                0.078125 = fieldNorm(doc=2716)
        0.16 = coord(4/25)
    
  4. Fersini, E.; Messina, E.; Archetti, F.: Enhancing web page classification through image-block importance analysis (2008) 0.09
    0.08785414 = sum of:
      0.08785414 = product of:
        0.73211783 = sum of:
          0.015939426 = weight(abstract_txt:that in 3102) [ClassicSimilarity], result of:
            0.015939426 = score(doc=3102,freq=2.0), product of:
              0.061002605 = queryWeight, product of:
                1.8732468 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.013770007 = queryNorm
              0.2612909 = fieldWeight in 3102, product of:
                1.4142135 = tf(freq=2.0), with freq of:
                  2.0 = termFreq=2.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.078125 = fieldNorm(doc=3102)
          0.26890558 = weight(abstract_txt:blocks in 3102) [ClassicSimilarity], result of:
            0.26890558 = score(doc=3102,freq=3.0), product of:
              0.25827917 = queryWeight, product of:
                2.4377856 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.013770007 = queryNorm
              1.0411431 = fieldWeight in 3102, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.078125 = fieldNorm(doc=3102)
          0.44727284 = weight(abstract_txt:block in 3102) [ClassicSimilarity], result of:
            0.44727284 = score(doc=3102,freq=3.0), product of:
              0.41505116 = queryWeight, product of:
                3.7848375 = boost
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.013770007 = queryNorm
              1.077633 = fieldWeight in 3102, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.078125 = fieldNorm(doc=3102)
        0.12 = coord(3/25)
    
  5. Wan, X.; Yang, J.; Xiao, J.: Towards a unified approach to document similarity search using manifold-ranking of blocks (2008) 0.08
    0.07734712 = sum of:
      0.07734712 = product of:
        0.6445594 = sum of:
          0.0090167 = weight(abstract_txt:that in 3081) [ClassicSimilarity], result of:
            0.0090167 = score(doc=3081,freq=1.0), product of:
              0.061002605 = queryWeight, product of:
                1.8732468 = boost
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.013770007 = queryNorm
              0.14780845 = fieldWeight in 3081, product of:
                1.0 = tf(freq=1.0), with freq of:
                  1.0 = termFreq=1.0
                2.3649352 = idf(docFreq=11344, maxDocs=44421)
                0.0625 = fieldNorm(doc=3081)
          0.27772447 = weight(abstract_txt:blocks in 3081) [ClassicSimilarity], result of:
            0.27772447 = score(doc=3081,freq=5.0), product of:
              0.25827917 = queryWeight, product of:
                2.4377856 = boost
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.013770007 = queryNorm
              1.0752879 = fieldWeight in 3081, product of:
                2.236068 = tf(freq=5.0), with freq of:
                  5.0 = termFreq=5.0
                7.694134 = idf(docFreq=54, maxDocs=44421)
                0.0625 = fieldNorm(doc=3081)
          0.35781825 = weight(abstract_txt:block in 3081) [ClassicSimilarity], result of:
            0.35781825 = score(doc=3081,freq=3.0), product of:
              0.41505116 = queryWeight, product of:
                3.7848375 = boost
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.013770007 = queryNorm
              0.8621064 = fieldWeight in 3081, product of:
                1.7320508 = tf(freq=3.0), with freq of:
                  3.0 = termFreq=3.0
                7.963798 = idf(docFreq=41, maxDocs=44421)
                0.0625 = fieldNorm(doc=3081)
        0.12 = coord(3/25)
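
The content scores combine more factors than the author scores: each matching term contributes queryWeight * fieldWeight, the contributions are summed, and the sum is scaled by coord (matched terms over the 25 query terms). The sketch below recomputes the first entry (doc 2044) from the listed inputs, assuming the same tf and idf definitions as ClassicSimilarity (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))). The helper name `term_score` is hypothetical, and the boost of 1.0 for "mechanism" is an assumption, since the listing shows no boost line for that term.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency, assuming 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq, doc_freq, max_docs, field_norm, query_norm, boost=1.0):
    """One term's contribution: queryWeight * fieldWeight."""
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


MAX_DOCS = 44421
QUERY_NORM = 0.013770007
FIELD_NORM = 0.078125  # fieldNorm(doc=2044) as listed

# (term, freq, docFreq, boost) copied from the explain tree for doc 2044.
matches = [
    ("mechanism", 2.0, 218, 1.0),  # boost 1.0 is assumed (none listed)
    ("that", 1.0, 11344, 1.8732468),
    ("compression", 3.0, 62, 2.3947587),
    ("blocks", 1.0, 54, 2.4377856),
    ("block", 4.0, 41, 3.7848375),
]

total = sum(
    term_score(freq, doc_freq, MAX_DOCS, FIELD_NORM, QUERY_NORM, boost)
    for _, freq, doc_freq, boost in matches
)
coord = len(matches) / 25  # 5 of 25 query terms matched
score = coord * total

print(round(score, 4))
```

If the assumptions hold, the result should agree with the listed 0.19970566 up to rounding in the printed inputs. The same helper can be applied to the other four entries by swapping in their term lists and fieldNorm values.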